Aijmer & Altenberg - Advances in Corpus Linguistics

Introduction
Karin Aijmer and Bengt Altenberg
University of Göteborg and University of Lund
Corpus linguistics has made spectacular advances since the early 1960s when
computer corpora were first made available for research. The use of corpora has
spread to practically every branch of linguistics and has become indispensable in
many practical applications of linguistic research, from lexicography and
terminology extraction to information retrieval and computer-assisted translation.
Corpora have become bigger and more diversified: apart from large general-
purpose corpora, a number of specialised corpora are now being used for research
in such areas as historical linguistics, sociolinguistics, dialectology, LSP,
interlanguage research, contrastive linguistics and translation studies. In addition,
CD-ROM newspaper collections and the Internet have become increasingly
important resources for language study. Hand in hand with these developments a
variety of research tools have been created for exploring, annotating and
processing language data in various ways.
However, the most important achievement of corpus linguistics is
undoubtedly that it has put the use of language at the centre of linguistics. In
theoretical as well as practical approaches to language, computer corpora have
placed linguistics on a firm empirical footing, emphasising the functional and
communicative basis of language.
This volume contains twenty-two papers presented at the 23rd
International Conference on English Language Research on Computerized
Corpora of Modern and Medieval English (ICAME) held at Göteborg, Sweden, in
May 2002. They cover a wide range of topics and, though few of them represent
the technical or computational side of the discipline, they illustrate clearly the
diversity of research that is characteristic of corpus linguistics today. The
contributions have been divided into six broad and inevitably overlapping
categories, under the following headings:
The role of corpora in linguistic research
Exploring lexis, grammar and semantics
Discourse and pragmatics
Language change and language development
Cross-linguistic studies
Software development
We have chosen to call the volume Advances in Corpus Linguistics. This may
seem a bold title, as it suggests a systematic account of recent developments in
the field. However, advances in linguistics seldom take the form of big leaps.
This is particularly true of corpus linguistics where each study can be seen as a
small step in the expansion of a vast and complex discipline, whether the focus is
theoretical, descriptive or methodological. Corpus linguistics is a constantly
changing field and ICAME conferences generally provide a good reflection of
this. The theme of the 2002 conference was The Theory and Use of Corpora. In
what ways can the present volume be said to represent advances in these
respects? Rather than presenting a summary of the individual contributions, we
will try to point out some issues and tendencies that we think are characteristic of
the volume as a whole.
The role of corpus linguistics and the relationship between data and theory
have been debated ever since the rise of corpus linguistics. The debate is also
clearly reflected in the present volume. That there is a need for such a debate may
suggest that corpus linguistics has not advanced in the past decades, but it can
also be regarded as a sign of the vitality of the field. A constant re-examination of
the goals of corpus linguistics and a critical discussion of theoretical and
methodological questions are necessary if corpus linguistics is to make significant
progress in the future.
The following issues are brought up for discussion in the first three
programmatic articles of the volume:
the problems of transcription and annotation
the role of intuition in corpus linguistics
corpus-based vs corpus-driven approaches
the relationship between data, description and theory
the conflict between lexical access and the need for research on grammar
and spoken language
These are old questions in corpus linguistics. Although they are largely
methodological in character, they all have theoretical relevance. They are also
closely related.
The transcription of speech and the grammatical annotation of corpora
both involve imposing an analysis of the corpus data. This also means that they
allow the researcher's intuition and, in the case of annotation, a preconceived
theoretical model to play a role at an early stage in the research process. But as
Michael Halliday points out in his contribution, transcription and annotation are
different in nature: while transcription of prosodic features provides an essential
part of the meaning of spoken discourse, grammatical annotation adds a
received linguistic description to the data, a description that may be incomplete,
obsolete or incorrect and therefore bound to distort the analysis before it has
started. Halliday recognises the problems involved in prosodic transcription but
also emphasises the desirability of marking such meaningful features as
intonation and rhythm. John Sinclair makes a distinction between mark-up and
annotation and argues that both should be kept separate from the raw text.
According to Sinclair, annotation should be avoided except in corpora used for
practical applications since it prevents the development of language theory and
description. In this respect, corpus linguistics has still to mature a little (p 55).
Intuition has been discredited in corpus linguistics. Does it have a place at
all and, if so, when is it allowed to play a role? Two contributors, John Sinclair
and Geoffrey Leech, touch on this issue. Both make a distinction between two
senses of intuition: (a) the knowledge of the language of the native speaker and
(b) the analytical expertise of the linguist. Intuition in the former sense is fallible
and unreliable and therefore to be distrusted, except possibly as a hunch to be
tested out in corpora. But in the latter sense intuition is indispensable. An
important task of the corpus linguist is to interpret the patterns of the data and
transform them into theoretical statements.
The distinction between corpus-driven and corpus-based approaches in
language research has been brought into focus and debated in recent years.
Briefly, the approaches can be said to differ in the role given to a theoretical
model in the course of a study. To many linguists the opposition is artificial or
irrelevant as long as a theoretical stance is introduced at some point in the
research process. Halliday accepts the distinction in principle, but cannot see a
clear boundary between theory and data; the borderline is fuzzy and corpus-
driven approaches are normally not entirely theory-free. He also rejects the idea
that corpus-driven linguistics is about parole (as has been maintained); all usage-
based linguistic research is concerned with both parole and langue, i.e. both
usage and system. Sinclair, on the other hand, strongly advocates the corpus-
driven approach on the grounds that corpus-based methods are at best concerned
with testing established theories, though generally no serious testing is done. In
contrast, the corpus-driven approach allows the data to control the analysis and
consequently to create or modify linguistic theories.
Geoffrey Leech, looking at language research in a similar perspective but
in slightly different terms, recognises three levels of investigation:
data (collection) - description - theory
Although corpus studies have a natural starting point in data, Leech objects to the
common assumption that corpus linguistics is concerned with mere data
collection and description. Explaining usage (or changes in usage, in Leech's
case) inevitably involves theoretical considerations. The explanation of usage
may be language-internal or language-external, i.e. motivated by social factors.
As Leech demonstrates, corpus linguistics is naturally suited to usage-based
conceptions of linguistics which (unlike the Chomskyan paradigm) assume that
there is a bridge between the study of naturally-occurring data and the cognitive
and social workings of language (p 78).
Another problem that is no doubt familiar to most corpus linguists but
seldom discussed is the fact that corpus research is by necessity biassed in the
direction of lexis. Corpora are organised lexically and accessed via the
orthographic word. As a result, phenomena at the lexical end of the
lexicogrammatical continuum are more accessible than those at the grammatical
end. As Halliday points out, this problem is especially acute in the study of the
spoken language where meaning is more highly grammaticalized and more
covert. What is needed, according to Halliday, are ways of designing a corpus for
the study of phenomena at the grammatical end of the continuum. This need is
especially great in the area of spoken language "where, prototypically, meaning is
made and the frontiers of meaning potential are extended" (p 11).
To judge from the present volume, Halliday's appeal for research on the
grammar of spoken discourse is warranted. The great majority of the studies
represented in the volume either focus on the lexical end of the continuum or
explore grammar or text via lexis. Moreover, few of them are specifically devoted
to spoken discourse as such. Two exceptions are Bernard De Clerck's
examination of the pragmatic function of let's in the spoken part of ICE-GB and
Clive Souter's study of children's vocabulary in the Polytechnic of Wales (PoW)
Corpus. However, the focus of De Clerck's investigation is on the functional
variation of let's utterances in different speech categories and it is rather an
example of another important use of corpora, viz. the exploration of language
variation. Similarly, Souter's aim is to demonstrate the usefulness of a small but
richly annotated corpus for studies of children's vocabulary development and, in
particular, how this is affected by such extra-linguistic factors as sex and age.
Several other contributors explore register or regional variation. Jonathan
Charteris-Black uses corpora to compare metaphors in British and American
political discourse and Peter Tan, Vincent Ooi and Andy Chiang investigate the
spoken character of personal advertisements placed on the Web by ESL
speakers in South East Asia, using spoken and written portions of the Singapore
component of ICE as a standard of comparison.
More directly concerned with the structure of discourse are the papers by
Michael Hoey and Hilde Hasselgård. Both argue that corpus-linguistic techniques
can be used to study patterns in text. However, their starting points are different.
Hoey claims that "every lexical item is primed for use in textual organisation" (p
174) and consequently examines textual patterns via lexis. Hasselgård, on the
other hand, starts with a grammatical construction. Her paper investigates the
discourse and information structural functions of it-cleft constructions with an
adverbial in focus position.
As mentioned, the corpus linguist now has access to a wide variety of
corpora, ranging from very large corpora (the Cobuild Bank of English, the
British National Corpus) and carefully designed and annotated million-word
corpora in the tradition of the Brown and LOB corpora (e.g. Frown, FLOB and
the regional variants of ICE) to various smaller corpora collected for specific
purposes (e.g. the Helsinki Corpus, the ICLE corpus, the PoW Corpus). Many of
these are tagged and parsed, permitting the user easy retrieval of specific
grammatical categories. In addition, there is a rapidly growing number of
multilingual corpora with English as one of the languages compared. The
usefulness of all these types of corpora is amply illustrated in the present volume.
Yet, for certain purposes (in particular the study of specific domains or
genres that are absent from, or insufficiently represented in, the general-purpose
corpora) the researcher has to collect his/her own corpus. Here material available
on the Web has proved to be a useful additional resource. No less than six of the
contributions to the present volume make use of such material (Charteris-Black,
Kübler, Renouf et al., Tan et al., Hoey, Tognini Bonelli and Manca). However,
using the Web as an unrestricted language resource presents several problems. As
Antoinette Renouf, Andrew Kehoe and David Mezquiriz point out in their
contribution, the nature of the Web as a random accumulation of heterogeneous
texts, many being less conventionally text-like, poses problems for the corpus
linguist who tries to access it through existing search engines (p. 403). Reporting
on a project designed to develop a user-friendly and more selective search tool for
the Web (WebCorp), they discuss some of the difficulties involved and how these
might be overcome or reduced. Their report is the only contribution representing
software development in the volume.
The volume also illustrates a variety of methodological approaches and, in
particular, that the choice of method is to a large extent determined by the
purpose of the study. One well-established method is the use of concordances
where syntagmatic lexicogrammatical patterns are revealed and make it possible
for the researcher to classify and describe the data in general theoretical terms.
This approach is of course especially useful in studies focusing on the lexical end
of the continuum and when the researcher knows which word or expression to
start from. Sometimes, however, there is no obvious lexical starting point. A case
in point is the study of metaphor. In this case the researcher first has to make an
educated guess about which lexical items are likely to serve as vehicles of
metaphors of a certain type (e.g. body parts, terms of war, etc), make a tentative
list of potentially rewarding items, and adjust the list after pilot searches in the
selected corpus material. An example of a study based on such intuitive
sampling is Charteris-Black's comparison of metaphors in British and American
political discourse mentioned above.
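
The concordance method described above can be illustrated with a minimal keyword-in-context (KWIC) sketch in Python. It is not taken from any contribution to the volume: the function names tokenize and kwic, the window width and the sample sentence are all assumptions chosen for illustration, and a real study would use a dedicated concordancer and a proper corpus.

```python
import re


def tokenize(text):
    # Lowercased word tokens; apostrophes are kept so that "let's" stays one token.
    return re.findall(r"[a-z']+", text.lower())


def kwic(text, keyword, width=3):
    """Return (left context, node word, right context) for each hit of keyword.

    Hypothetical helper for illustration only: contexts are measured in
    tokens, not characters, and matching is exact on the lowercased form.
    """
    tokens = tokenize(text)
    node = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text; "heart" stands in for a body-part item from a
# tentative list of possible metaphor vehicles.
sample = (
    "The heart of the debate is old. "
    "He lost heart when the heart of the party failed."
)

for left, node, right in kwic(sample, "heart"):
    # Right-align the left context so that the node words line up vertically.
    print(f"{left:>25} [{node}] {right}")
```

Aligning the node word in a column is what lets recurrent patterns to its left and right stand out. For a study with no obvious lexical starting point, the same function could be run over each item in a tentative word list and the list adjusted after inspecting the hits.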
Another example of the methodological problems facing the corpus
linguist is Thomas Kohnen's investigation of the history of English directive
speech acts. With speech acts there is no predictable link between form and
function and consequently no systematic and reliable way of retrieving relevant
forms. Kohnen gives a summary of some of the methodological problems
involved and advocates a procedure called structured eclecticism. The method
implies the deliberate selection of typical patterns, such as the use of the
imperative or a performative clause, which are then traced throughout the history
of English. Kohnen's diachronic study is also a good illustration of how several
corpora can be combined to throw light on linguistic change (in Kohnen's case
the Helsinki Corpus, the electronic version of the Middle English Dictionary and
the Brown and LOB corpora). Another illustration is Geoffrey Leech's
examination of recent changes in English grammar on the basis of data from six
corpora spanning the last four decades of the 20th century.
However, diachronic change can also be demonstrated on the basis of
synchronic variation in recent corpora. Liselotte Brems investigates signs of
delexicalization and synchronic grammaticalization revealed by patterns in the
use of measure nouns in the Cobuild Corpus. In a similar fashion, Göran Kjellmer
combines information from the OED with indications of synchronic variation in
recent corpora (the Cobuild Corpus and the BNC) to explain referential changes
of reflexive pronouns through the centuries.
Contrastive studies based on multilingual corpora require special
methodologies of their own. Here the languages compared serve as mirror images
of each other, highlighting cross-linguistic differences and similarities. For those
concerned with contrastive lexicology, such as Helge Dyvik and Åke Viberg,
translation corpora clearly reveal such phenomena as overlapping polysemy,
diverging meaning extensions and language-specific lexical relations (synonymy,
hyponymy, etc). The procedure used in these studies is truly corpus-driven,
although theoretical frameworks guide the analysis at different stages. For Anna-
Lena Fredriksson, who investigates the notion of clausal theme in an English-
Swedish perspective, parallel corpus data help to define a tertium comparationis
and to identify a cross-linguistic theoretical model.
In contrastive research based on corpora of comparable texts from
different languages (rather than translations) the method has to be different. Here
a comparison must be made between typical expressions of concepts and
functions used in comparable situations in the compared languages. This is well
illustrated in Elena Tognini Bonelli's and Elena Manca's comparison of
meanings encoded in English and Italian descriptions of farmhouse holidays on
the Web.
Natalie Kübler's contribution is also cross-linguistic in character but has a
more clearly defined applied purpose. It reports on an experiment in corpus-
driven learning in the area of cross-linguistic lexicography. Trawling the Web by
means of the WebCorp tool (described by Renouf et al) and comparing the results
with data from multilingual corpora, students are taught to evaluate different
methods and sources for the purpose of building customised dictionaries for
machine translation.
Interlanguage studies on the basis of learner corpora such as the International
Corpus of Learner English (ICLE) also require a special contrastive methodology.
Patterns of usage in the learners' production that deviate from those of native
English writers may be due to contrastive differences between the learners' L1 and
the target language. Conversely, contrastive differences can be used to formulate
hypotheses about interlanguage problems that can be checked against data in learner
corpora. As a result, research on learner corpora generally requires comparisons with
corpora representing both the learners' native language and the target language. This
is well illustrated in Roumania Blagoeva's study of the use of demonstrative
pronouns by advanced Bulgarian learners of English.
Corpus linguistics can be combined with different theoretical approaches.
Whether corpus-driven or corpus-based, most of the contributions make some
link with theory. The aim of Joybrato Mukherjee's paper on the verb give, for
example, is to bridge the gap between corpus-based research into actual language
use and cognitive grammar. Caroline David attempts to refine existing syntactico-
semantic classifications of putting verbs on the basis of corpus-data. Similarly,
Peter Willemse's study of pseudo-definite NPs in existential constructions is a
usage-informed attempt to create a more exhaustive and refined classification of
different types of pseudo-definiteness than has previously been achieved.
Although the present volume can only give a limited picture of the
advances of corpus linguistics in recent years, the contributions give clear
evidence of the variety and vitality of the field. Electronic corpora are now
exploited for a wide range of purposes. New types of corpora are being created
and new techniques developed to serve the demands of an expanding circle of
scholars who may have different interests and theoretical backgrounds but who
have a common desire to explore the nature of language by studying its use in
authentic texts. The theoretical, methodological and pedagogical issues addressed
in the present volume demonstrate clearly the steady advance of an expanding
discipline inspired by an empirical, usage-based approach to the study of
language.
The spoken language corpus: a foundation for grammatical
theory
M.A.K. Halliday
University of Sydney
1. Introductory
I felt rather daunted when Professor Karin Aijmer invited me to talk at this
Conference, because it is fifteen years since I retired from my academic
appointment and, although I continue to follow new developments with interest, I
would certainly not pretend to keep up to date, especially since I belong to that
previous era when one could hope to be a generalist in the field of language
study, something that is hardly any longer possible today. But I confess that I was
also rather delighted, because if there is one topic that is particularly close to my
heart it is that of the vast potential that resides in a corpus of spoken language.
This is probably the main source from which new insights can now be expected
to flow.
I have always had greater interest in the spoken language, because that in
my view is the mainspring of semogenesis: where, prototypically, meaning is
made and the frontiers of meaning potential are extended. But until the coming of
the tape recorder we had no means of capturing spoken language and pinning it
down. Since my own career as a language teacher began before tape recorders
were invented (or at least before the record companies could no longer stop them
being produced), I worked hard to train myself in storing and writing down
conversation as it occurred; but there are obviously severe limits on the size of
corpus you can compile like that. Of course, to accumulate enough spoken
language in a form in which it could be managed in very large quantities, we
needed a second great technical innovation, the computer; but in celebrating the
computerized corpus we should not forget that it was the tape recorder that broke
through the sound barrier (the barrier to arresting speech sound, that is) and made
the enterprise of spoken language research possible. It is ironical, I think, that
now that the technology of speech recording is so good that we can eavesdrop on
almost any occasion and kind of spoken discourse, we have ethics committees
and privacy protection agencies denying us access, or preventing us from making
use of what we record. (Hence my homage to Svartvik and Quirk, which I still
continue to plunder as a source of open-ended spontaneous dialogue.)
So my general question, in this paper, is this: what can we actually learn,
about spoken language and, more significantly, about language, by using a
computerized corpus on a scale such as can now be obtained? What I was
suggesting by my title, of course (and the original title had the phrase "at the
foundation of grammatics", which perhaps makes the point more forcefully), was
that we can learn a great deal: that a spoken language corpus does lie at the
foundation of grammatics, using "grammatics" to mean the theoretical study of
lexicogrammar, this being located, in turn, in the context of a general theory of
language. (I had found it necessary to introduce this term because of the
confusion that constantly arose between grammar as one component of a
language and grammar as the systematic description of that component.) In this
sense, the spoken language corpus is a primary resource for enabling us to
theorize about the lexicogrammatical stratum in language and thereby about
language as a whole.
I can see no place for an opposition between theory and data, in the sense
of a clear boundary between data-gathering and theory construction. I
remember wondering, when I was reading Isaac Newton's Optics, what would
have happened to physics if Newton, observing light passing through different
media and measuring the refraction, had said of himself "I'm just a data-gatherer;
I leave the theorizing to others". What was new, of course, was that earlier
physicists had not been able to observe and measure very much because the
technology wasn't available; so they were forced to theorize without having
adequate data. Galileo and Newton were able to observe experimentally; but this
did not lead them to set up an opposition between observation and theory,
between the different stages in a single enterprise of extending the boundaries of
knowledge. Now, until the arrival of the tape recorder and the computer, linguists
were in much the same state as pre-Renaissance physicists: they had to invent, to
construct their database without access to the phenomena on which they most
depended. Linguistics can now hope to advance beyond its pre-scientific age; but
it will be greatly hindered if we think of data and theory as realms apart, or divide
the world of scholarship into those who dig and those who spin.
It is not the case, of course, that linguists have had no data at all. They
have always had plenty of written language text, starting with texts of high
cultural value, the authors whose works survived from classical times. This
already provoked disputation, in Europe, between text-based scholars and
theoreticians; we find this satirized in the late medieval fable of the Battle of the
Seven Arts, fought out between the Auctores and the Artes. But the auctores
embodied the notion of the text as a model (author as authority); this was written
language as object with value, rather than just as specimen to be used as evidence.
And this in turn reflects the nature of written language: it is language produced
under attention, discourse that is self-conscious and self-monitored. This does
not, of course, invalidate it as data; it means merely that written texts tell us about
written language, and we have to be cautious in arguing from this to the
potentiality of language as a whole. After all, speech evolved first, in the species;
speech develops first, in the individual; and, at least until the electronic age,
people did far more talking than writing throughout their lives.
2. Spoken and written

Throughout most of the history of linguistics, therefore, there has been no choice.
To study text, as data, meant studying written text; and written text had to serve
as the window, not just into written language but into language. Now, thanks to
the new technology, things have changed; we might want to say: well, now, we
can study written texts, which will tell us about written language, and we can
study spoken texts, which will tell us about spoken language.
But where, then, do we find out about language? One view might be:
there's no such thing as language, only language as spoken and language as
written; so we describe the two separately, with a different grammar for each, and
the two descriptions together will tell us all we need to know. The issue of same
or different grammars has been much discussed, for example by David Brazil,
Geoffrey Leech and Michael Stubbs; there is obviously no one right answer: it
depends on the context and the purpose, on what you are writing the grammar for.
The notion "there is no such thing as language; there are only ...", whether "only
dialects", "only registers", "only individual speakers" or even "only speech
events", is a familiar one; it represents a backing away from theory, in the name of
a resistance to totalizing, but it is itself an ideological and indeed theoretical
stance (cf. Martin's 1993 observations on ethnomethodology). And while of all
such attempts to narrow down the ultimate domain of a linguistic theory the
separation into spoken language and written language is the most plausible, it still
leaves language out of account, and hence renders our conception of semantics
particularly impoverished: it is the understanding of the meaning-making power
of language that suffers most from such a move.
It was perhaps in the so-called modern era that the idea of spoken
language and written language as distinct semiotic systems made most sense,
because that was the age of print, when the two were relatively insulated one
from the other, although the spoken standard language of the nation state was
already a bit of a hybrid. Now, however, when text is written electronically, and
is presented in temporal sequence on the screen (and, on the other hand, more and
more of speech is prepared for being addressed to people unknown to the
speaker), the two are tending to get mixed up, and the spoken/written distinction
is increasingly blurred. But even without this mixing, there is reason for
postulating a language, such as English, as a more abstract entity encompassing
both spoken and written varieties. There is nothing strange about the existence of
such varieties; a language is an inherently variable system, and the spoken/written
variable is simply one among many, unique only in that it involves distinct
modalities. But it is just this difference of modality, between the visual-synoptic
of writing and aural-dynamic of speech, that gives the spoken corpus its special
value, not to mention, of course, its own very special problems!
I think it is not necessary, in the present context, to spend time and energy
disposing of a myth, one that has done so much to impede, and then to distract,
the study of spoken language: namely the myth that spoken language is lacking in
structure. The spoken language is every bit as highly organized as the written; it
couldn't function if it wasn't. But whereas in writing you can cross out all the
mistakes and discard the preliminary drafts, leaving only the finished product to
offer to the reader, in speaking you cannot do this; so those who first transcribed
spoken dialogue triumphantly pointed to all the hesitations, the false starts and the
backtrackings that they had included in their transcription (under the pretext of
faithfulness to the data), and cited these as evidence for the inferiority of the
spoken word, a view to which they were already ideologically committed. It
was, in fact, a severe distortion of the essential nature of speech; a much more
faithful transcription is a rendering in ordinary orthography, including ordinary
punctuation. The kind of false exoticism which is imposed on speech in the act of
reducing it to writing, under the illusion of being objective, still sometimes gets in
the way, foregrounding all the trivia and preventing the serious study of language
in its spoken form. (But not, I think, in the corridors of corpus linguistics!)
3. Spoken language and the corpus

Now what the spoken corpus does for the spoken language is, in the first instance,
the same as what it does for the written: it amasses large quantities of text and
processes it to make it accessible for study. Some kinds of spoken language can
be fairly easily obtained: radio and television interviews, for example, or
proceedings in courts of law, and these figured already in the earliest COBUILD
corpus of twenty million words (eighteen million written and two million
spoken). The London-Lund corpus (alone, I think, at that time) included a
considerable amount of spontaneous conversation, much of it being then
published in the Corpus of English Conversation I referred to earlier (see Svartvik
and Quirk 1980). Ronald Carter and Mike McCarthy, in their CANCODE corpus at
Nottingham, work with five million words of natural speech; on a comparable
scale is the UTS-Macquarie corpus in Sydney, which includes a component of
spoken language in the workplace that formed the basis of Suzanne Eggins and
Diana Slade's (1997) Analysing Casual Conversation. Already in the 1960s there
was a valuable corpus of children's speech, some of it in the form of interview
with an adult but some of children talking amongst themselves, at the Nuffield
Foreign Language Teaching Materials Project under the direction of Sam Spicer
in Leeds; and in the 1980s Robin Fawcett assembled a database of primary school
children's language in the early years of his Computational Linguistics Unit at the
(then) Polytechnic of Wales.
These are, I am well aware, just the exemplars that are known to me, in a
worldwide enterprise of spoken language corpus research, in English and no
doubt in many other languages besides. What all these projects have in common,
as far as I know, is that the spoken text, as well as being stored as speech, is also
always transcribed into written form. There are numerous different conventions
of transcribing spoken English; I remember a workshop on the grammar of casual
conversation, about twenty years ago, in which we looked into eight systems then
in current use (Hasan 1985), and there must be many more in circulation now.
What I have not seen, though such a thing may exist, is any systematic discussion
of what all these different systems imply about the nature of spoken language,
what sort of order (or lack of order) they impose on it or, in general terms, of
what it means to transcribe spoken discourse into writing. And this is in fact an
extraordinarily complex question.
In English we talk about "reducing" spoken language to writing, in a
metaphor which suggests that something is lost; and so of course it is. We know
that the melody and rhythm of speech, which are highly meaningful features of
the spoken language, are largely absent; and it is ironical that many of the
transcription systems (the majority, at the time when I looked into them)
abandoned the one feature of writing that gives some indication of those
prosodies, namely punctuation. Of course punctuation is not a direct marker of
prosody, because in the evolution of written language it has taken on a life of its
own, and now usually (again referring to English) embodies a compromise
between the prosodic and the compositional (constituent) dimensions of
grammatical structure; but it does give a significant amount of prosodic
information, as anyone is aware who reads aloud from a written text, and it is
perverse to refuse to use it under the pretext of "not imposing patterns on the data",
rather as if one insisted on using only black and white reproductions of
representational art, so as not to "impose" colours on the flowers, or on the clothing
of the ladies at court. The absence of punctuation merely exaggerates the "dog's
dinner" image that is being projected on to spoken language.
There are transcriptions which include prosodic information; and these are
of two kinds: those, like Svartvik and Quirk (deriving from the work of Quirk and
Crystal in the 1960s), which give a detailed account of the prosodic movement in
terms of pitch, loudness and tempo, and those (like my own) which mark just
those systemic features of intonation and rhythm which have been shown to be
functional in carrying meaning as realizations of selections in the grammar, in
the same way that, in a tone language, they would be realizations of selections in
vocabulary. I use this kind of transcription because I want to bring out how
systems which occur only in the spoken language not only are regularly and
predictably meaningful but also are integrated with other, recognized grammatical
systems (those marked by morphology or ordering or class selection) in a manner
no different from the way these latter are integrated with each other. (Texts 1-4
illustrate some different conventions of transcription: Text 1 from a tape
recording made and transcribed about 1960; Text 2 from Svartvik and Quirk
1980; Text 3 an orthographic (and somewhat reduced) version of Text 2; Text 4
from Grimshaw 1994.)
Thus there is a gap in the information about spoken discourse that is
embodied in our standard orthographies; and since one major function of a
spoken language corpus is to show these prosodically-realized systems at work, it
seems to me that any mode of transcription used with such a corpus should at
least incorporate prosodic features in some systematic way. They are not optional
extras; in some languages at least, but probably in all, intonation and rhythm are
meaningful in an entirely systematic fashion.
But while it is fairly obvious what an orthographic transcription leaves
out, it is perhaps less obvious what it puts in. Orthographies impose their own
kind of determinacy, of a kind that belongs to the written language: a constituent-
like organization which is not really a feature of speech. Words are given clear
boundaries, with beginnings and endings often somewhat arbitrarily assigned;
and punctuation, while in origin marking patterns of prosodic movement, has
been preempted to mark off larger grammatical units (there is considerable
variation in practice: some writers do still use it more as a prosodic device). It is
true that spoken language is also compositional: the written sentence, for
example, is derived from the clause complex of natural speech; but its
components are not so much constituents in a constituent hierarchy as movements
in a choreographic sequence. The written sentence knows where it's going when
it starts; the spoken clause complex does not. (Text 3 illustrates this second
point.)
But writing imposes determinacy also on the paradigmatic axis, by its
decisions about what are, or are not, tokens of the same type. Here the effect of
reducing speech to writing depends largely on the nature of the script. There is
already variation here on the syntagmatic axis, because different scripts impose
different forms of constituency: in Chinese, and also in Vietnamese, the unit
bounded by spaces is the morpheme; in European languages it is the word,
though with room for considerable variation regarding what a word is; in
Japanese it is a mixture of the morpheme and the syllable, though you can
generally tell which morpheme begins a new word. On the paradigmatic axis,
Chinese, as a morphemic script, is the most determinate: it leaves no room for
doubt about what are and what are not regarded as tokens of the same type. But
even English and French, though in principle having a phonological script, have
strong morphemic tendencies; they have numerous homonyms at the morpho-
syllabic interface, which the writing system typically keeps apart. Such writing
systems mask the indeterminacy in the spoken language, so that (for example)
pairs like mysticism / misty schism, or icicle / eye sickle, which in speech are
separated only by minor rhythmic differences, come to be quite unrelated in their
written forms; James Joyce made brilliant use of this as a semogenic resource
(but as a resource for the written language). But even in languages with a more
purely phonological script, such as Russian or Italian, the writing system enforces
regularities, policing the text to protect it from all the forms of meaningful
variation which contribute so much to the richness and potency of speech.
So transcribing spoken discourse, especially spontaneous conversation,
into written form in order to observe it, and to use the observations as a basis for
theorizing language, is a little bit problematic. Transcribing is translating, and
translating is transforming; I think to compile and interpret an extensive spoken
corpus inevitably raises questions about the real nature of this transformation.
4. Some features of the spoken language

I would like to refer briefly to a number of features which have been investigated
in corpus studies, with reference to what they suggest about the properties of
language as a whole. I will group these under seven headings; but they are not in
any systematic order, just the order in which I found it easiest to move along
from each one to the next.
4.1 Patterns in casual conversation

Eggins and Slade, in their book Analysing Casual Conversation (1997), studied
patterns at four strata: lexicogrammatical, semantic, discoursal and generic. The
first two showed up as highly patterned in the interpersonal domain (interpersonal
metafunction), particularly in mood and modality. At the level of genre they
recognized a cline from story-telling to chat, with opinion and gossip in between;
of the ten genres of conversation that they ranged along this cline, they were able
to assign generic structures to seven of them: these were narrative, anecdote,
exemplum, recount, observation/comment, opinion and gossip. Of the other three,
joke-telling they had not enough data to explore; the other two, "sending up" and
"chat", they said "cannot be characterized in generic terms". Their analysis, based
on a spoken corpus, suggests that casual conversation is far from lacking in
structural order.
4.2 Pattern forming and re-forming

Ronald Carter, in a recent paper "Language and creativity: the evidence from
spoken English" (2002), was highlighting, as the title makes clear, the creative
potential of the spoken language, especially casual speech. He referred to its
"pattern forming and re-forming", emphasizing particularly the "re-forming" that
takes place in the course of dialogue: one speaker sets up some kind of
lexicogrammatical pattern, perhaps involving a regular collocation, an idiom or
cliché, or some proverbial echo; the interlocutor builds on it but then deflects,
re-forms it into something new, with a different pattern of lexicogrammatical
wording. This will usually not all happen in one dyadic exchange; it may be
spread across long passages of dialogue, with several speakers involved; but it
can happen very quickly, as illustrated in one or two of Carter's examples from
the CANCODE corpus:
[Two students are talking about the landlord of a mutual friend]

A: Yes, he must have a bob or two.
B: Whatever he does he makes money out of it just like that.
A: Bob's your uncle.
B: He's quite a lot of money, erm, tied up in property and things. He's
got a finger in all kinds of pies and houses and stuff.

[Two colleagues, who are social workers, are discussing a third
colleague who has a tendency to become too involved in individual
cases]
A: I don't know but she seems to have picked up all kinds of lame
ducks and traumas along the way.
B: That -- that's her vocation.
A: Perhaps it is. She should have been a counsellor.
B: Yeah but the trouble with her is she puts all her socialist carts before
the horses.
4.3 Patterns in words and phrases

There might seem to be some contradiction between this and Michael Stubbs'
observation, in his Words and Phrases: corpus studies of lexical semantics
(2000), that "a high proportion of language use is routinized, conventional and
idiomatic", at least when this is applied to spoken language. Of course, one way
in which both could be true would be if speech was found to consist largely of
routinized stuff with occasional flashes of creativity in between; but I don't think
this is how the two features are to be reconciled. Rather, it seems to me that it is
often precisely in the use of "routinized, conventional and idiomatic" features that
speakers' creativity is displayed. (I shall come back to this point later.) But, as
Stubbs anticipated in his earlier work (1996), and has demonstrated in his more
recent study (of extended lexical units), it is only through amassing a corpus of
speech that we gain access to the essential regularities that must be present if they
can be played with in this fashion. There can be no meaning in departing from a
norm unless there is a norm already in place to be departed from.
4.4 Patterns in grammar

Michael Stubbs' book is subtitled "corpus studies of lexical semantics"; Susan
Hunston and Gill Francis' (1999) is Pattern Grammar: a corpus-driven approach
to the lexical grammar of English: one lexical semantics, the other lexical
grammar. I have written about Hunston and Francis' book elsewhere (2001);
what they are doing, in my view, is very successfully extending the grammar in
greater detail (greater delicacy) across the middle ground where lexis and
grammar meet. There is no conflict here with theoretical grammar, at least in my
own understanding of the nature of theory; indeed they make considerable use of
established grammatical categories. But this region of the grammar, with its
highly complex network of microcategories, could not be penetrated without
benefit of a corpus; and again, it has to include a spoken corpus, because it is in
speech that these patterns are most likely to be evolving and being ongoingly
renewed.
4.5 The grammar of appraisal

Eggins and Slade referred to, and also demonstrated in the course of their
analysis, the centrality, in many types of casual conversation, of the interpersonal
component in meaning. Our understanding of the interpersonal metafunction
derives particularly from the work of Jim Martin: his book English Text: system
and structure (1992), several articles (e.g. 1998), and a new book co-authored
with Peter White (forthcoming). Martin focussed especially on the area of
appraisal, comprising appreciation, affect, judgment and amplification: all
those systems whereby speakers organize their personal opinions, their likes and
dislikes, and their degree and kind of involvement in what they are saying. These
features have always been difficult to investigate: partly for ideological reasons
(they weren't recognized as a systematic component of meaning); but also because
they are realized by a bewildering mixture of lexicogrammatical resources:
morphology, prosody (intonation and rhythm), words of all classes, closed and
open, and the ordering of elements in a structure. Martin has shown how these
meanings are in fact grammaticalized, that is, they are systemic in their
operation; but to demonstrate this you need access to a large amount of data, and
this needs to be largely spoken discourse. Not that appraisal does not figure in
written language: it does, even if often more disguised (see Hunston 1993); but
it is in speech that its systemic potential is more richly exploited.
4.6 Non-standard patterns

There is a long tradition of stigmatizing grammatical patterns that do not conform
to the canons of written language. This arose, naturally enough, because
grammatics evolved mainly in the study of written language (non-written cultures
often developed theories of rhetoric, but never theories of grammar), and then
because grammarians, like lexicographers, were seen as guardians of a nation's
linguistic morals. I don't think I need take up time arguing this point here. But,
precisely because there are patterns which don't occur in writing, we need a
corpus of spoken language to reveal them. I don't mean the highly publicized
"grammatical errors" beloved of correspondents to the newspapers; these are
easily manufactured, without benefit of a corpus, and I suspect that that kind of
attention to linguistic table manners is a peculiarly English phenomenon
(perhaps shared by the French, I've heard it said). I mean the more interesting and
productive innovations which pass unnoticed in speech but have not (yet) found
their way into the written language, and are often hard to construct with
conscious thought; for example, from my own observations:
It's been going to've been being taken out for a long time. [of a package
left on the back seat of the car]
All the system was somewhat disorganized, because of not being sitting in
the front of the screen. [cf. because I wasn't sitting ...]
Drrr is the noise which when you say it to a horse the horse goes faster.
Excuse me, is that one of those rubby-outy things? [pointing to an object
on a high shelf in a shop]
And then at the end I had one left over, which you're bound to have at
least one that doesn't go.
That's because I prefer small boats, which other people don't necessarily
like them.
This court won't serve. [cf. it's impossible to serve from this court]
4.7 Grammatical intricacy

Many years ago I started measuring lexical density, which I defined as the
number of lexical items (content words) per ranking (non-embedded) clause. I
found a significant difference between speech and writing: in my written
language samples the mean value was around six lexical words per clause, while
in the samples of spoken language it was around two. There was of course a great
deal of variation among different registers, and Jean Ure (1971) showed that the
values for a range of text types were located along a continuum. She however
counted lexical words as a proportion of total running words, which gives a
somewhat different result, because spoken language is more clausal (more and
shorter clauses) whereas written language is more nominal (clauses longer and
fewer). Michael Stubbs, using a computerized corpus, followed Jean Ure's model,
reasonably enough since mine makes it necessary to identify clauses, and hence
requires a sophisticated parsing programme. But the clause-based comparison is
more meaningful in relation to the contrast between spoken and written discourse.
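The two ways of counting can be set side by side in a small sketch. It is an illustration only, not the procedure used in any of the studies mentioned: the clauses are segmented by hand (the step which would otherwise call for a parser), the list FUNCTION_WORDS is a tiny hypothetical stand-in for a proper inventory of grammatical items, and the sample clauses are the Sydney pair cited in section 5 below.

```python
import re

# Hypothetical, deliberately tiny inventory of grammatical (closed-class) items;
# a real study would need a principled and much longer list.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "so", "because", "as",
    "of", "at", "in", "on", "to", "for", "with", "by",
    "it", "is", "are", "was", "were", "be", "been", "this", "that",
}


def tokens(clause):
    """Lower-cased alphabetic tokens; numerals and punctuation are dropped."""
    return re.findall(r"[a-z']+", clause.lower())


def lexical_items(clause):
    """Tokens that are not in the (assumed) list of grammatical items."""
    return [t for t in tokens(clause) if t not in FUNCTION_WORDS]


def density_per_clause(clauses):
    """Clause-based measure: lexical items per ranking clause."""
    return sum(len(lexical_items(c)) for c in clauses) / len(clauses)


def density_per_word(clauses):
    """Word-based measure: lexical items as a proportion of running words."""
    lexical = sum(len(lexical_items(c)) for c in clauses)
    running = sum(len(tokens(c)) for c in clauses)
    return lexical / running


# Each list holds the hand-segmented ranking clauses of one version.
written = ["Sydney's latitudinal position of 33 south ensures warm summer temperatures"]
spoken = ["Sydney is at latitude 33 south", "so it is warm in summer"]

for label, clauses in (("written", written), ("spoken", spoken)):
    print(label, density_per_clause(clauses), round(density_per_word(clauses), 2))
```

On these two versions the clause-based figure falls from 8.0 to 2.5, while the word-based proportion falls from roughly 0.89 to 0.45; the absolute values depend entirely on the assumed word list and on the hand segmentation.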
What turned out to be no less interesting was what I called "grammatical
intricacy", quantified as the number of ranking clauses in the clause complex. A
clause complex is any sequence of structurally related ranking clauses; it is the
spoken analogue of (and of course the underlying origin of) what we recognize in
written language as a sentence. In spontaneous spoken language the clause
complex often became extraordinarily long and intricate (see Texts 3 and 5). If
we analyse one of these in terms of its hypotactic and paratactic nexuses, we get a
sense of its complexity. Now, it is very seldom that we find anything like these in
writing. In speech, they tend to appear in the longer monologic turns that occur
within a dialogue (that is, they are triggered dialogically, but constructed by a
single speaker, rather than across turns). Since dialogue also usually has a lot of
very short turns, of just one clause, which is often a minor clause which doesn't
enter into complex structures in any case, there is no sense in calculating a mean
value for this kind of intricacy. What one can say is, that the more intricate a
given clause complex is, the more likely it is that it happened in speech rather
than in writing. But the fuller picture will only emerge from more corpus studies
of naturally occurring spoken language (cf. Matthiessen 2002: 295 ff.).
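A sketch of how intricacy might be recorded once clause complexes have been analysed by hand. The data structure and the function names are assumptions made for illustration. The first two complexes are the spoken versions cited in section 5 below, with "so" treated as a paratactic link and the purpose clause as a hypotactic one; the third, a one-clause turn taken from the CANCODE extract in section 4.2, stands for the short turns of dialogue. In line with the remark above, the sketch reports a distribution rather than a mean.

```python
from collections import Counter

# Each clause complex is a list of (ranking clause, link to the preceding clause).
# The first clause of a complex has no link (None).
complexes = [
    [("Sydney is at latitude 33 south", None),
     ("so it is warm in summer", "paratactic")],
    [("Species evolve", None),
     ("in order to adapt to each other as well as possible", "hypotactic")],
    [("That's her vocation", None)],
]


def intricacy(clause_complex):
    """Grammatical intricacy: number of ranking clauses in the complex."""
    return len(clause_complex)


def nexus_types(clause_complex):
    """Count the paratactic and hypotactic links within one complex."""
    return Counter(link for _, link in clause_complex if link is not None)


# Distribution of intricacy values across the sample, not a mean.
distribution = Counter(intricacy(c) for c in complexes)
most_intricate = max(complexes, key=intricacy)

print(sorted(distribution.items()))
print(intricacy(most_intricate), dict(nexus_types(most_intricate)))
```

With a real corpus the interest would lie in the upper tail of the distribution, where the long monologic turns are found.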
5. Some problems with a spoken corpus

So let me turn now to some of the problems faced by corpus linguists when they
want to probe more deeply into the mysteries of spoken language. One
problematic area I've mentioned already: that of representing spoken language in
writing; I would like to add some more observations under this heading. As I
remarked, there are many different conventions used in transcribing, and all of
them distort in some way or other.
The lack of prosodic markers is an obvious and serious omission, but
one that can be rectified in one way or another. In another few decades it may be
possible to devise speech recognition systems that can actually assign prosodic
features, patterns of intonation and rhythm, at the phonological level (that is,
identifying them as meaningful options); meanwhile we might explore the value
of something which is technically possible already but less useful for
lexicogrammar and semantics, namely annotation of speech at the phonetic level
based on analysis of the fundamental parameters of frequency, amplitude and
duration.
But, as I suggested, a more serious problem is that of over-transcribing,
especially of a kind which brings with it a false flavour of the exotic: speech is
made to look quaint, with all its repetitions, false starts, clearings of the throat and
the like solemnly incorporated into the text. This practice, which is regrettably
widespread, not only imparts a spurious quaintness to the discourse (one can
perhaps teach oneself to disregard that) but, more worryingly, obscures, by
burying them in the clutter, the really meaningful sleights of tongue on which
spoken language often relies: swift changes of direction, structures which Eggins
and Slade call "abandoned clauses", phonological and morphological play and
other moments of semiotic inventiveness. Of course, the line between these and
simple mistakes is hard to draw; but that doesn't mean we needn't try. Try getting
yourself recorded surreptitiously, if you can, in some sustained but very casual
encounter, and see which of the funny bits you would cut out and which you
would leave in as a faithful record of your own discourse.
But even with the best will, and the best skill, in the world, a fundamental
problem remains. Spoken language isn't meant to be written down, and any
visual representation distorts it in some way or other. The problem is analogous,
in a way, to that of choreographers trying to develop notations for the dance: they
work as aids to memory, when you want to teach complex routines, or to preserve
a particular choreographer's version of a ballet for future generations of dancers.
But you wouldn't analyse a dance by working on its transcription into written
symbols. Naturally, many of the patterns of spoken language are recognizable in
orthographic form; but many others are not: types of continuity and
discontinuity, variations in tempo, paralinguistic features of tamber (voice
quality), degrees of (un)certainty and (dis)approval; and for these one needs to
work directly with the spoken text. And we are still some way off from being able
to deal with such things automatically.
The other major problem lies in the nature of language itself; it is a
problem for all corpus research, although more acute with the spoken language:
this is what we might call the "lexicogrammatical bind". Looking along the
lexicogrammatical continuum (and I shall assume this unified view, well set out
by Michael Stubbs (1996) among the principles of Sinclair's and my approach,
as opposed to the "bricks-&-mortar" view of a lexicon plus rules of syntax), if we
look along the continuum from grammar to lexis, it is the phenomena at the
lexical end that are the most accessible; so the corpus has evolved to be organized
lexically, accessed via the word, the written form of a lexicogrammatical item.
Hence corpuses have been used primarily as tools for lexicologists rather than for
grammarians.
In principle, as I think is generally accepted, the corpus is just as useful,
and just as essential, for the study of grammar as it is for the study of lexis. Only,
the grammar is very much harder to get at. In a language like English, where
words may operate all the way along the continuum, there are grammatical items
like "the" and "and" and "to" just as there are lexical items like "sun" and "moon" and "stars",
as well as those like "behind" and "already" and "therefore" which fall somewhere in
the middle; occurrences of any of these are easily retrieved, counted, and
contextualized. But whereas "sun" and "moon" and "stars" carry most of their meaning
on their sleeves, as it were, "the" and "and" and "to" tell us very little about what is
going on underneath; and what they do tell us, if we just observe them directly,
tends to be comparatively trivial. It is an exasperating feature of patterns at the
grammatical end of the continuum, that the easier they are to recognize the less
they matter.
And it is here that the spoken language presents special problems for a
word-based observation system: by comparison with written language, it tends to
be more highly grammaticalized. In the way it organizes its meaning potential the
spoken language, relative to the written, tends to favour grammatical systems. We
have seen this already in the contrast between lexical density and grammatical
intricacy as complementary ways of managing semantic complexity: the written
language tends to put more of its information in the lexis, and hence it is easier to
retrieve by means of lexical searching. Consider pairs of examples such as the
following (and cf. those cited as Text 6 below):
Sydney's latitudinal position of 33 south ensures warm summer temperatures.
Sydney is at latitude 33 south, so it is warm in summer.

The goal of evolution is to optimize the mutual adaption of species.
Species evolve in order to adapt to each other as well as possible.
If you are researching the forms of expression of the meaning "cause", you can
identify a set of verbs which commonly lexify this meaning in written English
(verbs like cause, lead to, bring about, ensure, effect, result in, provoke) and
retrieve occurrences of these together with the (typically nominalized) cause and
effect on either side; likewise the related nouns and adjectives in "be the cause of",
"be responsible for", "be due to" and so on. It takes much more corpus energy to
retrieve the (mainly spoken) instances where this relationship is realized as a
clause nexus, with "cause" realized as a paratactic or hypotactic conjunction like
"so", "because" or "as", for at least three reasons: (i) these items tend to be polysemous
(and to collocate freely with everything in the language); (ii) the cause and effect
are now clauses, and therefore much more diffuse; (iii) in the spoken language
not only semantic relations but participants also are more often grammaticalized,
in the form of cohesive reference items like "it", "them", "this", "that", and you may have
to search a long way to find their sources. Thus it will take rather longer to derive
a corpus grammar of causal relations from spoken discourse than from written;
and likewise with many other semantic categories. Note that this is not because
they are not present in speech; on the contrary, there is usually more explicit
rendering of semantic relationships in the spoken variants; you discover how
relatively ambiguous the written versions are when you come to transpose them
into spoken language. It is the form of their realization (more grammaticalized,
and so more covert) that causes most of the problems.
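The contrast between the two kinds of retrieval can be sketched with a word-based search over the four example sentences cited above. This is an illustration under stated assumptions, not a description of any existing concordancer: the two patterns are hypothetical, cover only a few of the verbs listed, and handle inflection in the crudest way.

```python
import re

SENTENCES = [
    "Sydney's latitudinal position of 33 south ensures warm summer temperatures.",
    "Sydney is at latitude 33 south, so it is warm in summer.",
    "The goal of evolution is to optimize the mutual adaption of species.",
    "Species evolve in order to adapt to each other as well as possible.",
]

# Illustrative pattern only: a few of the verbs listed above, with simple inflections.
LEXICAL_CAUSE = re.compile(
    r"\b(?:causes?|caused|leads? to|led to|brings? about|brought about|"
    r"ensures?|ensured|results? in|resulted in)\b",
    re.IGNORECASE,
)
# Conjunctions that may realize cause grammatically, matched as whole words.
CONJUNCTIONS = re.compile(r"\b(?:so|because|as)\b", re.IGNORECASE)


def hits(pattern, sentences):
    """Return (sentence index, matched form) for every match of the pattern."""
    return [
        (i, m.group(0).lower())
        for i, s in enumerate(sentences)
        for m in pattern.finditer(s)
    ]


lexical_hits = hits(LEXICAL_CAUSE, SENTENCES)
conjunction_hits = hits(CONJUNCTIONS, SENTENCES)
print(lexical_hits)
print(conjunction_hits)
```

The lexical pattern returns only the "ensures" sentence, with cause and effect on either side of the verb. The conjunction pattern returns one causal "so" and two instances of "as" (in "as well as possible") that have nothing to do with cause, a small instance of point (i); and in the "so" sentence the effect clause names its participant only as "it", as in point (iii).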
Another aspect of the same phenomenon, but one that is specific to
English, is the way that material processes tend to be delexicalized: this is the
effect whereby gash, slash, hew, chop, pare, slice, fell, sever, mow, cleave, shear and so
on all get replaced by "cut". This is related to the preference for phrasal verbs,
which has gained momentum over a similar period and is also a move towards the
grammaticalizing of the process element in the clause. Ogden and Richards, when
they devised their Basic English in the 1930s, were able to dispense with all but
eighteen verbs, by relying on the phrasal verb constructions (they would have
required me to say "were able to do away with all but eighteen verbs"); they
were able to support their case by rewording a variety of different texts, including
biblical texts, using just the high frequency verbs they had selected. These are, as
I said, particular features of English; but I suspect there is a general tendency for
the written varieties of a language to favour a more lexicalized construal of
meaning.
So I feel that, in corpus linguistics in general but more especially in
relation to a spoken language corpus, there is work to be done to discover ways of
designing a corpus for the use of grammarians; or rather, since none of us is
confined to a single role, for use in the study of phenomena towards the
grammatical end of the continuum. Hunston and Francis, in their work on
pattern grammar (1999), have shown beyond doubt that the corpus is an
essential resource for extending our knowledge of the grammar. But a corpus-
driven grammar needs a grammar-driven corpus; and that is something I think we
have not yet got.
6. Corpus-based and corpus-driven

Elena Tognini-Bonelli, in her book Corpus Linguistics at Work (2001), defines
corpus linguistics as a pre-application methodology, comprising an empirical
approach to the description of language use, within a contextual-functional theory
of meaning, and making use of new technologies. Within this framework, she
sees new facts leading to new methodologies leading to new theories. Given that
she has such a forward-looking vision, I find it strange that she finds it strange
that more data and better counting can trigger philosophical repositioning; after
all, that's what it did in physics, where more data and better measuring
transformed the whole conception of knowledge and understanding. How much
the more might we expect this to be the case in linguistics, since knowing and
understanding are themselves processes of meaning. The spoken corpus might
well lead to some repositioning on issues of this kind.
Like Hunston and Francis, Tognini-Bonelli stresses the difference between
corpus-based and corpus-driven descriptions; I accept this distinction in
principle, though with two reservations, or perhaps caveats. One, that the
distinction itself is fuzzy; there are various ways of using a corpus in grammatical
research that I would not be able to locate squarely on either side of the boundary:
where, for example, one starts out with a grammatical category as a heuristic
device but then uses the results of the corpus analysis to refine it further or
replace it by something else. (If I may refer here to my own work, I would locate
both my study of the grammar of pain (1998), and the quantitative study of
polarity and primary tense carried out by Zoe James and myself (1993),
somewhere along that rather fuzzy borderline.) And that leads to the second
caveat: a corpus-driven grammar is not one that is theory-free (cf. Matthiessen
and Nesbitt's "On the idea of theory-neutral descriptions" 1996). As I have
remarked elsewhere (2001), there is considerable recourse to grammatical theory
in Hunston and Francis' book. I am not suggesting that they deny this (they are
not at all anti-theoretical); but it is important, I think, to remove any such
implication from the notion of "corpus-driven", which is itself a notably
theoretical concept.
I don't think Tognini-Bonelli believes this either, though there is perhaps a
slight flavour in one of her formulations (p. 184): "If the paradigm is not
excluded from this [corpus-driven] view of language, it is seen as secondary with
respect to the syntagm. Corpus-driven linguistics is thus above all a linguistics of
parole." I wonder. Paradigm and syntagm are the two axes of description, for
both of which we have underlying theoretical categories: structure as theory of
the syntagm, system as theory of the paradigm. It is true that, in systemic theory,
we set up the most abstract theoretical representations on the paradigmatic axis;
there were specific reasons for doing this (critically, it is easier to map into the
semantics by that route, since your view of regularity is not limited by structural
constraints), but that is not to imply that structure is not a theoretical construct.
(Firth, who first developed system-structure theory, did not assign any theoretical
priority to the system; but he developed it in the context of phonology, where
considerations are rather different.) So I don't think corpus-driven linguistics is a
linguistics of parole; but in any case, isn't that notion rather self-contradictory?
Once you are doing linguistics, you have already moved above the instantial
realm.
I can see a possible interpretation for a "linguistics of parole": it would be a
theory about why some instances (some actes de parole) are more highly
valued than others: in other words, a stylistics. But the principle behind corpus
linguistics is that every instance carries equal weight. The instance is valued as a
window on to the system: the potential that is being manifested in the text. What
the corpus does is to enable us to see more closely, and more accurately, into that
underlying system (into the langue, if you like). The corpus-driven grammar is
a form of, and so also a major contributor to, grammatics.
7. Aspects of speech: a final note

I am assuming that the spoken language corpus includes a significant amount of
authentic data: unsolicited, spontaneous, natural speech, which is likely to
mean dialogue, though there may be lengthy passages of monologue embodied
within it. Not because there is anything intrinsically superior about such discourse
as text (if anything, it tends to carry a rather low value in the culture); but because
the essential nature of language, its semogenic or meaning-creating potential, is
most clearly revealed in the unselfconscious activity of speaking. This is where
systemic patterns are established and maintained; where new, instantial patterns
are all the time being created; and where the instantial can become systemic, not
(as is more typical of written language) by way of single instances that carry
exceptional value (what I have called the "Hamlet factor") but through the
quantitative effects of large numbers of unnoticed and unremembered sayings.
For this reason, I would put a high priority on quantitative research into
spoken language, establishing the large-scale frequency patterns that give a
language its characteristic profile, its "characterology", as the Prague linguists
used to call it. This is significant in that it provides the scaffolding whereby
children come to learn their mother tongue, and sets the parameters for systematic
variation in register: what speakers recognize as functional varieties of their
language are re-settings of the probabilities in lexicogrammatical choice. The
classic study here was Jan Svartvik's study of variation in the English voice
system (1966). It also brings out the important feature of partial association
between systems, as demonstrated in their quantitative study of the English clause
complex by Nesbitt and Plum (1988). My own hypothesis is that the very general
grammatical systems of a language tend towards one or the other of two
probability profiles: either roughly equal, or else skew to a value of about one
order of magnitude; and I have suggested why I think that this would make good
sense (1993). But it can only be put to the test by large-scale quantitative studies
of naturally occurring speech. Let me say clearly that I do not think this kind of
analysis replaces qualitative studies of patterns of wording in individual texts. But
it does add further insight into how those patterns work.
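The two probability profiles mentioned above can be made concrete with a small calculation. The sketch below is an illustration only, not a procedure taken from the paper: it assumes a binary system, takes 0.5 : 0.5 as the "roughly equal" profile and 0.9 : 0.1 as the profile skewed by about one order of magnitude, and uses Shannon entropy as one possible measure of how much information each profile carries. The function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy(probs):
    """1 minus the ratio of actual entropy to the maximum possible entropy."""
    h_max = math.log2(len(probs))
    return 1 - entropy(probs) / h_max

# Assumed illustrative values for a two-term grammatical system.
equiprobable = [0.5, 0.5]   # "roughly equal"
skew = [0.9, 0.1]           # skewed by about one order of magnitude

for name, profile in (("equiprobable", equiprobable), ("skew", skew)):
    print(f"{name}: H = {entropy(profile):.3f} bits, "
          f"redundancy = {redundancy(profile):.3f}")
```

Under these assumptions the equiprobable profile carries 1 bit per choice with no redundancy, while the skew profile carries roughly 0.47 bits, a redundancy of about 0.53. Probabilities estimated from counts in a spoken corpus could be passed to the same functions to see which profile a given system approaches.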
It is usually said that human language, as it evolved and as it is developed
by children, is essentially dialogic. I see no reason to question this; the fact that
other primates (like ourselves!) send out warnings or braggings or other
emotional signals, without expecting a response, is not an objection that need be
taken seriously. Dialogue, in turn, provides the setting for monologic acts; and
this is true not only instantially but also systemically: monologue occurs as
extended turns in the course of dialogic interaction, as a good-sized corpus of
casual conversation will show. Clearly monologue is also the default condition of
many systemic varieties: people give sermons, make speeches, write books,
broadcast talks and so on; but they do so, even if it is largely for their own
satisfaction, only because there are others who listen to them (or at least hear
them) and who read them.
26 M.A.K. Halliday
Any piece of spoken monologue can be thought of as an extended turn:
either given to the speaker by the (contextual) system, as it were, like a
conference paper, or else having to be established, and perhaps struggled for, as
happens in casual conversation. Speakers have many techniques for holding the
floor, prolonging their speaking turn. Some of these techniques are, in Eggins and
Slade's terms, generic: you switch into telling a joke, or embark on a personal
narrative. But one very effective strategy is grammatical: the clause complex. The
trick is to make the listeners aware another clause is coming. How you do this, of
course, varies according to the language; but the two main resources, in many
languages, are intonation and conjunction. These are, in effect, two mechanisms
for construing logical-semantic relationships in lexicogrammatical form, in
wording. The highly intricate clause complexes that I referred to earlier as a
phenomenon of informal speech embroil the listener in a shifting pattern of
phono-syntactic connections. This is not to suggest that their only function is to
hold the floor; but they help, because listeners do, in general, wait for the end of a
sequence: it takes positive energy to interrupt.
What the clause complex really does, or allows the speaker to do, is to
navigate through and around the multidimensional semantic space that defines the
meaning potential of a language, often with what seem bewildering changes of
direction, for example (Text 3) from the doctor's expectations to corridors lined
with washing to the danger of knocking out expectant mothers, all the while
keeping up an unbroken logical relationship with whatever has gone before. It is
grammatical logic, not formal logic; formal logic is the designed offspring of
grammatical logic, just as the written sentence is the designed offspring of the
clause complex of speech. This kind of spontaneous semantic choreography is
something we seldom find other than in unselfmonitored spoken discourse,
typically in those monological interludes in a dialogue; but it represents a
significant aspect of the power of language as such.
I have been trying to suggest, in this paper, why I think that the spoken
language corpus is a crucial resource for theoretical research: research not just
into the spoken language, but into language in general. Because the gap between
what we can recover by introspection and what people actually say is greatest of
all in sustained, unselfmonitored speaking, the spoken language corpus adds a
new dimension to our understanding of language as semiotic system-&-process.
That there is such a gap is not only because spontaneous speech is the mode of
discourse that is processed at furthest remove from conscious attention, but also
because it is the most complexly intertwined with the ongoing socio-semiotic
context. Tognini-Bonelli's observation that all corpus studies imply a contextual
theory of meaning is nowhere more cogent than in the contexts of informal
conversation. Hasan and Cloran's work on their corpus of naturally occurring
dialogue between mothers and their three-to-four-year-old children showed how
necessary it was not merely to note the situations in which meanings were
exchanged but to develop the theoretical model of the contextual stratum as a
component in the overall descriptive strategy (Hasan and Cloran 1990; Hasan
1991, 1992, 1999; Cloran 1994). People's meaning potential is activated (and
hence ongoingly modified and extended) when the semogenic energy of their
lexicogrammar is brought to bear on the material and semiotic environment,
construing it, and reconstruing it, into meaning. In this process, written language,
being the more designed, tends to be relatively more focussed in its demands on
the meaning-making powers of the lexicogrammar; whereas spoken language is
typically more diffuse, roaming widelier around the different regions of the
network. So spoken language is likely to reveal more evidence for the kind of
middle range grammar patterns and extended lexical units that corpus studies
are now bringing into relief; and this in turn should enrich the analysis of
discourse by overcoming the present disjunction between the lexical and the
grammatical approaches to the study of text.
Already in 1935 Firth had recognized the value of investigating
conversation, remarking "it is here we shall find the key to a better understanding
of what language really is and how it works" (1957: 32). He was particularly
interested in its interaction with the context of situation, the way each moment
both narrows down and opens up the options available at the next. My own
analysis of English conversation began in 1959, when I first recorded spoken
dialogue in order to study rhythm and intonation. But it was Sinclair, taking up
another of Firth's suggestions, the study of collocation (see Sinclair 1966), who
first set up a computerized corpus of speech. Much later, looking back from the
experience with COBUILD, Sinclair wrote (1991: 16): "a decision I took in
1961 to assemble a corpus of conversation is one of the luckiest I ever made". It
would be hard now to justify leaving out conversation from any corpus designed
for general lexicogrammatical description of a language. Christian Matthiessen,
using a corpus of both spoken and written varieties, has developed text-based
profiles: quantitative studies of different features in the grammar which show up
the shifts in probabilities that characterize variation in register. One part of his
strategy is to compile a sub-corpus of partially analysed texts, which serve as a
basis for comparison and also as a test site for the analysis, allowing it to be
modified in the light of ongoing observation and interpretation. I have always felt
that such grammatical probabilities, both global and local, are an essential aspect
of what language really is and how it works. For these, above all, we depend on
spoken language as the foundation.
References
Baker, M., G. Francis and E. Tognini-Bonelli (eds) (1993), Text and technology:
in honour of John Sinclair. Amsterdam: John Benjamins.
Brazil, D. (1995), A grammar of speech. Oxford: Oxford University Press.
Carter, R. (2002), Language and creativity: the evidence from spoken English.
[The Second Sinclair Open Lecture, Department of English, University of
Birmingham]
Carter, R., and M. McCarthy (1995), Grammar and the spoken language.
Applied Linguistics 16: 141-158.
Cloran, C. (1994), Rhetorical units and decontextualization: an enquiry into some
relations of meaning, context and grammar. Monographs in Systemic
Linguistics 6. Department of English, University of Nottingham.
Eggins, S., and D. Slade (1997), Analysing casual conversation. London: Cassell.
Fawcett, Robin, and Michael Perkins (1981), Project report: language
development in 6- to 12-year-old children. First Language 2: 75-79.
Firth, J.R. (1935), The technique of semantics. Transactions of the Philological
Society. Reprinted in J.R. Firth, Papers in linguistics 1934-1951. London:
Oxford University Press, 1957. 7-33.
Grimshaw, A. D. (ed.) (1994), What's going on here. Complementary studies of
professional talk. Norwood, N.J.: Ablex.
Halliday, M.A.K. (1993), Quantitative studies and probabilities in grammar. In
Michael Hoey (ed.), Data, description, discourse. Papers on the English
language in honour of John McH. Sinclair. London: Harper Collins. 1-25.
Halliday, M.A.K. (1998), On the grammar of pain. Functions of Language 5: 1-32.
Halliday, M.A.K. (2002), Judge takes no cap in mid-sentence: on the
complementarity of grammar and lexis. [The First Sinclair Open Lecture,
Department of English, University of Birmingham]
Halliday, M.A.K. and Z.L. James (1993), A quantitative study of polarity and
primary tense in the English finite clause. In John M. Sinclair, Michael
Hoey and Gwyneth Fox (eds), Techniques of description: spoken and
written discourse. London & New York: Routledge. 32-66.
Hasan, R. (ed.) (1985), Discourse on discourse. Applied Linguistics Association
of Australia: Occasional Papers 7.
Hasan, R. (1991), Questions as a mode of learning in everyday talk. In Thao Lê
and Mike McCausland (eds), Language education: interaction and
development. Launceston: University of Tasmania. 70-119.
Hasan, R. (1992), Rationality in everyday talk: from process to system. In Jan
Svartvik (ed.), Directions in corpus linguistics. Berlin: Mouton de
Gruyter. 257-307.
Hasan, R. (1999), Speaking with reference to context. In Mohsen Ghadessy
(ed.), Text and context in functional linguistics. Amsterdam &
Philadelphia: John Benjamins. 219-328.
Hasan, R., and C. Cloran (1990), A sociolinguistic interpretation of everyday
talk between mothers and children. In M.A.K. Halliday, John Gibbons
and Howard Nicholas (eds), Learning, keeping and using language.
Selected papers from the Eighth World Congress of Applied Linguistics.
Amsterdam & Philadelphia: John Benjamins. Vol. 1: 67-99.
Hunston, S. (1993), Evaluation and ideology in scientific English. In Mohsen
Ghadessy (ed.), Register analysis: theory and practice. London: Pinter.
57-73.
Hunston, S., and G. Francis (2000), Pattern grammar. A corpus-driven approach
to the lexical grammar of English. Amsterdam & Philadelphia: John
Benjamins.
Leech, G. (2000), Same grammar or different grammar? Contrasting approaches
to the grammar of spoken English discourse. In Srikant Sarangi and
Malcolm Coulthard (eds), Discourse and social life. Harlow: Longman.
48-65.
Martin, J.R. (1992), English text: system and structure. Amsterdam: John
Benjamins.
Martin, J.R. (1993), Life as a noun: arresting the universe in science and
humanities. In M.A.K. Halliday and J.R. Martin, Writing science: literacy
and discursive power. London & Washington, D.C.: Falmer. 221-267.
Martin, J.R. (1998), Beyond exchange: appraisal systems in English. In Susan
Hunston and Geoff Thompson (eds), Evaluation in text. Oxford: Oxford
University Press.
Matthiessen, C. M.I.M. (1999), The system of TRANSITIVITY: an exploratory
study of text-based profiles. Functions of Language 6: 1-51.
Matthiessen, C. M.I.M. (2002), Combining clauses into clause complexes: a
multi-faceted view. In Joan Bybee and Michael Noonan (eds), Complex
sentences in grammar and discourse. Essays in honour of Sandra A.
Thompson. Amsterdam & Philadelphia: John Benjamins. 235-319.
Matthiessen, C. M.I.M., and Christopher Nesbitt (1996), On the idea of theory-
neutral descriptions. In Ruqaiya Hasan, Carmel Cloran and David G. Butt
(eds), Functional descriptions: theory and practice. Amsterdam &
Philadelphia: John Benjamins. 39-85.
Quirk, R., and D. Crystal (1964), Systems of prosodic and paralinguistic features
in English. The Hague: Mouton.
Sinclair, J. (1966), Beginning the study of lexis. In C.E. Bazell et al. (eds), In
memory of J.R. Firth. London: Longmans. 410-430.
Sinclair, J. (1991), Corpus, concordance, collocation. Oxford: Oxford University
Press.
Stubbs, M. (1996), Text and corpus analysis: computer-assisted studies of
language and culture. Oxford: Blackwell.
Stubbs, M. (2000), Words and phrases: corpus studies of lexical semantics.
Oxford: Blackwell.
Svartvik, J. (1966), On voice in the English verb. The Hague: Mouton.
Svartvik, J., and R. Quirk (eds) (1980), A corpus of English conversation. Lund:
C.W.K. Gleerup.
Tognini-Bonelli, E. (2001), Corpus linguistics at work. Amsterdam &
Philadelphia: John Benjamins.
Ure, J. (1971), Lexical density and register differentiation. In G.E. Perren and
J.L.M. Trim (eds), Applications of linguistics. Selected papers of the
Second International Congress of Applied Linguistics. London:
Cambridge University Press. 443-452.
Appendix: Transcripts of recorded conversations
Text 1: Passage from tape recording transcribed about 1960
Key:
Indented lines represent the contributions of the interviewer, the asterisks
in the informant's speech indicating the points at which such
contributions began, or during which they lasted.
The hyphens (-, --, ---) indicate relative lengths of pauses.
Proper names are fictitious substitutes for those actually used.
The informant is a graduate, speaking RP with a normal delivery.
i is this true I heard on the radio last night that er pay has gone net pay
but er -- retirement age has gone up - *for you chaps*
*yes but er*
to seventy*
5 *yes I think that's scandalous*
*but is it right is it true*
*yes it is true yes it is true*
*well it's a good thing*
yes *but the thing is that er -* everybody wants more money --
10 *I mean you've got your future secure*
but er the thing is you know -- er I mean of course er the whole thing is
absolutely an absolute farce because -- really with this grammar school
business it's perfectly true that - that you're drawing all your your brains of
the country are going to come increasingly from those schools - therefore
15 you've got to have able men - and women to teach in them - but you want
fewer and better ** - that's the thing they want
*hm*
- fewer grammar schools and better ones --- *because at the
*Mrs Johnson was saying*
20 moment* it's no good having I mean we've got some very good men where I
am which is a bit of a glory hole -- but er there's some there's some good
men there there's one or two millionaires nearly there's Ramsden who
cornered the - English text book market -- *and er* - yes he's got a net
income of
25 *hm*
about two thousand five hundred a year and er there's some good chaps there
I mean you know first class men but it's no good having first class men -
dealing with the tripe that we get *--* you see that's the trouble that you're
wasting it's
30 *hm*
a waste of energy -- um an absolute waste of energy - your - your er method
of selection there is all wrong -- *um
*but do you think it's better to have -- er teachers who've had a lot of
experience - having an extra five years to help solve this - problem of
35 of fewer teachers -- er or would you say - well no cut them off at at
sixty-five and let's get younger*
*it's no good having I would if I were a head I'd and you know and I know
well I'd chuck everyone out who taught more than ten years on principle *--
40 *ha ha ha why*
*because after that time as a boy said they either become too strict or too
laxative* --
*ha ha ha ha ha ha - hm*
*yes - but ha ha ha no they get absolutely stuck you know after ten years * * -
45 - they just go absolutely dead - we all
*hm*
do - bound to you know you you churn out the same old stuff you see - but
um - the thing is I mean it's no good having frightfully - well anyway they
they if they paid fifteen hundred a year I mean - if you could expect to get
50 that within -- ten years er er for graduates er you you still wouldn't get the
first class honours - scientists - they'd still go into industry because it's a
present er a pleasanter sort of life * * you're living in an adult world and
you're
*yes*
living in a world which is in the main stream -- I mean school mastering is
55 bound to be a backwater you're bound to you want some sort of sacrifice
sacrificial type of people you know **
*yes*
no matter what you pay them you've got to pay them more but you've got to
give -- there's got to be some reason you know some - you're always giving
60 out and you get nothing back **
*hm*
and --- I mean they don't particularly want to learn even the bright ones
they'd much rather -- fire paper pellets out of the window or something or --
no they don't do that but they they -- you know you've got to drive them all
65 the time --- they've got to have some sort of exterior reason apart from your
own -- personal satisfaction in doing it you know
Text 2: Passage from Svartvik and Quirk (1980: 215-218)

Text 3: Orthographic (and somewhat reduced) version of Text 2
A: Yes; that's very good. I wouldn't be able to have that one for some reason
you see: this checker board effect - I recoil badly from this. I find I hadn't
looked at it, and I think it's probably because it probably reminds me you
know of nursing Walter through his throat, when you play checker boards or
something. I think it's - it reminds me of the ludo board that we had, and I
just recoiled straight away and thought [mm] not - not that one, and I didn't
look inside; but that's very fine, [mm mm] isn't it? very fine, yes.
B: It's very interesting to try and analyse why one like abstract paintings, 'cause
I like those checks; just the very fact that they're not all at right angles means
that my eyes don't go out of focus chasing the lines [yes] - they can actually
follow the lines without sort of getting out of focus.
A: Yes I've got it now: it's those exact two colours you see, together. He had -
he had a blue and orange crane, I remember it very well, and you know one
of those things that wind up, and that's it.
B: It does remind me of meccano boxes [yes well] the box that contains
meccano, actually.
A: Yes. Well, we had a bad do you know; we had - oh we had six or eight
weeks when he had a throat which was [mhm] - well at the beginning it was
lethal if anyone else caught it. [yeah] It was lethal to expectant mothers with
small children, and I had to do barrier nursing; it was pretty horrible, and the
whole corridor was full of pails of disinfectant you know [mm], and you went
in, and of course with barrier nursing I didn't go in in a mask - I couldn't
with a child that small, and I didn't care if I caught it, but I mean it was
ours emptied outside you see [mm] and you had to come out and you brought
all these things on to a prepared surgical board [mm mm] and you stripped
your gloves off before you touched anything [mm] and you disinfected - oh it
was really appalling [mm]. I don't think the doctor had expected that I would
do barrier nursing you see [mm] - I think she said something about she
wished that everybody would take the thing seriously you know, when they
were told, as I did, 'cause she came in and the whole corridor was lined [mm]
with various forms of washing and so on, but after all I mean you can't go
down and shop if you know that you're going to knock out an expectant
mother. It was some violent streptococcus that he'd got and he could have
gone to an isolation hospital but I think she just deemed that he was too small
[yes mm mm] for the experience, and then after we'd had him, you know,
had him for a few days at home this couldn't be done. [mhm] She made the
decision for me really, which at the time I thought was very impressive, but
she didn't know me very well: I think she thought I was a career woman who
would be only too glad and would say oh well he's got to go into a hospital,
you know, so she made the decision for me and then said it's too late now to
put him into an isolation hospital; I would have had to do that a few days
ago - which, I thought, I didn't want her to do!
B: Do nurses tend to be aggressive, or does one just think that nurses are
aggressive?
A: Well, that was my doctor [oh], and she didn't at that time understand me very
well. I think she does now.
Text 4: Passage from Grimshaw (ed.) 1994
. . . ) and I / think she's a/ware of this and I / think you / know she . . . // 4
) I / think one / thing that'll / happen I / think that . . . // 1 ) that / Mike may
en/courage her // 1 ) and I / think that'll be / all to the / good //
P. // 4 ) to / what ex/tent are / these / ) the / three / theories that she se/lected // 1
truly repre/sentative of / theories in this / area //
A. // 1 that's / it / ) // 1 that's / it //
P. // 1 ) they / are in/deed //
S. // 1 yeah //
P. // 1 oh // 2 they are / the / theories //
A. // 1 that's about / it //
P. // 1 they are / not / really repre/sentative / then //
S. // 1 well there are // 1 ) there are / vari/ations // 1 ) there are / vari/ations // 1
on / themes but . . . // 4 ) but / I don't / know of any / major con/tender ) there
/ may be // 1 ) well / I don't / know of / anything that / looks much / different
from the / things she's . . . ) she has / looked at in the spe/cific / time //
A. // 4 ) ex/cept for the / sense that
P. // 1 ) so / nobody / nobody would at/tack her on / that ground / then if she
//
A. // 1 oh no / I don't / think so // 4 ) I think the / only / thing that would be
sub/stantially / different would be a // 1 real / social / structuralist who would
/ say // 4 ) you / don't have to / worry about cog/nitions // 1 what you have to
/ do is / find the lo/cation of these / people in the / social / structure // 1- ) and
/ then you'll / find out how they're / going to be/have with/out having to / get
into their / heads at / all // 4 ) and / that // 1 hasn't been / tested // 1- ) ex/cept
in / very / gross / kinds of / ways with // 1 macro / data which has / generally /
not been / very satis/factory // 1 yeah / ) // 1 ) so I can / tell her that // 3 )
you / know I
S. // 1 ) she's / won //
Text 5. Choreographic notation for the clause complex of spoken language (cf.
forms of notation in Martin 1992). Clause complex from Text 3 above.
[Diagram: clauses and relation markers, listed in the order in which they appear]
The doctor probably expected
I would say
2 that he had to go into hospital
so instead of asking me
+ she made the decision for me
= which at the time seemed very impressive
=3 but she didn't know me very well
she said
2 it's too late now to put him into a hospital
=2 I should have had to do that a few days ago
+2 and I thought to myself
2 I didn't want to you to do that
Text 6: Spoken translations of some sentences of written English
Note: Written originals are those lettered (a) in Set 1 and those in the left hand
column of Set 2.
1. (1a) Strength was needed to meet driver safety requirements in the event
of missile impact.
(1b) The material needed to be strong enough for the driver to be safe if it
got impacted by a missile.
(2a) Fire intensity has a profound effect on smoke injection.
(2b) The more intense the fire, the more smoke it injects (into the
atmosphere).
(3a) The goal of evolution is to optimize the mutual adaption of species.
(3b) Species evolve in order to adapt to each other as well as possible.
(4a) Failure to reconfirm will result in the cancellation of your
reservations.
(4b) If you fail to reconfirm your reservations will be cancelled.
(5a) We did not translate respectable revenue growth into earnings
improvement.
(5b) Although our revenues grew respectably we were not able to
improve our earnings.
2. Sydney's latitudinal position of 33°  Sydney is at latitude 33° south, so it
   south ensures warm summer             is warm in summer.
   temperatures.

   Investment in a rail facility         If you invest in a facility for the
   implies a long term commitment.       railways you will be committing
                                         [funds] for a long term.

   [The atomic nucleus absorbs energy    [...] Each time it absorbs energy it
   in quanta, or discrete units.]        (moves to a state of higher energy = )
   Each absorption marks its             becomes more energetic.
   transition to a state of higher
   energy.

   [Evolutionary biologists have         [...] when [species] suddenly [start
   always assumed that] rapid changes    to] evolve more quickly this is
   in the rate of evolution are          because something has happened
   caused by external events [which is   outside [...] they want to explain that
   why ...] they have sought an          the dinosaurs died out because a
   explanation for the demise of the     meteorite impacted.
   dinosaurs in a meteorite impact.

   [It will be seen that] a successful   [...] it is possible both to replace
   blending of asset replacement with    assets and to remanufacture [current
   remanufacture is possible. Careful    equipment] successfully. We must
   studies are to be undertaken to       study [the matter] carefully to ensure
   ensure that viability exists.         that ([the plan] is viable = ) we will
                                         be able to do what we plan.

   The theoretical program of devising   As well as working theoretically by
   models of atomic nuclei has been      devising models of atomic nuclei we
   complemented by experimental          have also investigated [the topic] by
   investigations.                       experimenting.

   Increased responsiveness may be       [The child] is becoming more
   reflected in feeding behaviour.       responsive, so s/he may feed better.

   Equation (3) provided a satisfactory  When we used equation (3) we could
   explanation of the observed           explain satisfactorily (the different
   variation in seepage rates.           rates at which we have observed that
                                         seepage occurs = ) why, as we have
                                         observed, [water] seeps out more
                                         quickly or more slowly.

   The growth of attachment between      Because / if / when
   infant and mother signals the first   the mother and her infant grow
   step in the child's capacity to       (more) attached to one another //
   discriminate among people.            the infant grows / is growing (more)
                                         attached to its mother
                                         we know that / she knows that / [what
                                         is happening is that]
                                         the child has begun / is beginning / is
                                         going to begin to be able
                                         to tell one person from another /
                                         prefer one person over another.
Intuition and annotation - the discussion continues
John Sinclair
The Tuscan Word Centre
Abstract
Some corpus linguists prefer to research using plain text, while others first
prepare the texts by adding various analytic annotations. The former group
express reservations about the reliability of intuitive data, whereas the latter
group, if obliged to choose, will reject corpus evidence in favour of their intuitive
responses. This paper attempts to move from the broad differences expressed
above to a small number of specific points of contrast between the two
approaches.
1. Introduction
As the study of language in corpora continues to grow and diversify, differences
of methodology emerge, and there is room for misunderstanding. Aarts (1991,
2002a, 2002b) has monitored the development of the relationship between the
management of corpora and the theory of language, and Tognini Bonelli (2001)
has described contrasting conceptualisations of the relation between theory and
data.
The key concept here is "-driven", in the phrase "corpus-driven linguistics".
"-driven" has several characteristic usages, among which we may focus on two,
which might be paraphrased as "motivated" and "controlled". Its use in relation
to corpus linguistics can be traced back to Johns (e.g. 1990) and his "data-driven
learning". Here the matter of motivation is on top, as it was found that learners
have unbounded curiosity when they are allowed to interrogate corpora, and
apparently natural learning mechanisms to profit from the curiosity. Francis
(1993) shifted the focus to corpus-driven grammar, where "controlled" is
perhaps the more appropriate gloss. The grammar should follow the corpus,
accounting for as much as possible of the patterning, and being cautious in
ascribing to the language a pattern that is not attested in the corpus.
Tognini Bonelli (op. cit.) noted that in much corpus research the
theoretical and descriptive positions were carefully insulated from the findings of
the corpus investigations. Though researchers acknowledged that one legitimate
use of a corpus was to test hypotheses, there was no serious testing of the
governing theories. These, it was held, had been forged over many years, and
thoroughly tested against intuitive responses, and they were extremely abstract.
The myriad details of actual usage could provide some helpful reflections of the
theory, but there was no question of threatening the theory with evidence from
usage, however compelling that might be.
Tognini Bonelli called this position corpus-based linguistics, and
contrasted it with corpus-driven linguistics, which specifically places the theory
in a vulnerable position, to be justified or modified according to the results of
investigations: the classic posture of the empirical scientist.1 There may be
intermediate positions between these poles, but I cannot imagine any. Either
one's whole cathedral of linguistic structures is ready to receive the scaffolding,
or it is not.
Aarts (op. cit.) offers a penetrating discussion of this dichotomy, and
suggests that the two approaches contrast in their methodologies in two
important places: the role they see for a person's intuition, and the place, value
and legitimacy of annotating corpora.
Regarding intuition, he anticipates quite opposed positions: the corpus-based
linguist allows his intuition to overrule his corpus data and hence gives
primacy to the former (2002a: 8), and Aarts expects the corpus-driven linguist to
do the opposite.
There are two observations to be made here. One is that Aarts moves
smoothly from considering the use of intuitive data to the more general point
of the role of the intuition in the process of making linguistic descriptions. But it
is quite reasonable to differentiate between these two positions, to reject the
former and keep an open mind on the latter, as I do, and as I think most
corpus-driven linguists would also do. The other point is that I wonder if corpus-based
linguists have ever thought seriously about the priority they assign to intuitive
data. Would they really just set aside a mass of information about how people
use a language when their still, small voice tells them something different?
Leave aside the details, the one-offs, the peculiarities; corpus linguistics is
about generalities if anything. Would they really feel secure in preferring their
intuition against measurable, incontrovertible objective evidence? Their only
hope would be to find an explanation for the apparent conflict, and although that
is a laudable aim, it is rarely resorted to because of the low prestige of empirical
data in the last half-century's linguistics.
In no way do I intend this argument to devalue the importance of
intuition, as will become apparent in a little while. But I urge caution. When
cheap pitch meters became available in phonetics, it was possible to discover
exactly what the pitch contours of an utterance were. It was discovered that
people believed things that were at variance with the facts. They believed, for
example, that questions were spoken on a rising intonation, although in British
English they usually are not, and they would hear the pitch going up when in
fact it was going down. Intuition is not some kind of gut reaction to events; it is
educated in various ways, and sophisticated.
On the topic of annotation, Aarts considers that the contrast between the
two approaches to corpus linguistics is at its most marked in this area. He
deduces that corpus-driven linguists are bound to reject annotation, because it
could hamper their wish to be as close to the plain text as possible, whereas
corpus-based linguists, who do not share their concerns, rely on annotation as the
main means by which they express their analysis and make it available to others.
It is, in Aarts' uncompromising phrase, an 'indispensable tool' for them.
Certainly there are contrasting attitudes to annotation among corpus
linguists of different styles, but not perhaps as extreme as suggested by Aarts. I
would like to continue this valuable discussion with an examination of the roles
of intuition and annotation in corpus linguistics, because I think that some
misunderstandings have arisen. These are quite understandable in their context,
and I am certain that Aarts is striving to be completely fair in his representation
of all points of view, and particularly those that he does not share. I can, of
course, talk only from my own perspective as following the corpus-driven
approach to research.
2. Intuition
In considering the role of intuition, for example, two issues have in recent years
tended to undermine confidence in the reliability of this elusive faculty. Let us
examine each of them briefly.
1. I no longer have any confidence in the ability of a human being to invent
sentences which display the same patterns of meaning that are to be found in
naturally occurring sentences. This has not always been my position; thirty or
more years ago I published an English grammar which illustrated its points with
almost entirely made-up sentences. I would not do that today. What is more, I
believe that most linguists share my misgivings, and it is easy to find subjective
evidence in support of this position. On the other hand, objective evidence of
what is natural and unnatural is not yet available, and this points up the primitive
nature of even our best descriptions.2
Both our productive ability in making up sentences and our critical
faculty in evaluating those sentences for naturalness are within the skills domain
that is usually held to be informed by intuition; it is clear that they do not match
up, and that, even in the behaviour of the same person over time, a sentence
can be approved as natural and condemned as unnatural, both positions ascribed
to intuition.
However, invented sentences are not always condemned; the general
agreement that ordinary language users can detect phoney sentences does not
lead to everyone behaving consistently with respect to them. It is a tenable
position to accept that they are different from natural ones but to prefer to study
them because of the insights they are said to give to mental processes. Or in the
business of language teaching to accept that they have no role in mature
discourse but that they are valuable stepping stones towards this. Or to maintain
that the differences between actual and invented sentences are not structurally
important. Or to dismiss the whole point by saying that the circumstances of
actual usage are of no concern to the theoretician, and so the differences are of
no account.
My own position among these alternatives is perhaps over-cautious, but it
is shaped by many years of exposure to both academic and commercial attitudes
and arguments about the use of actual examples in presenting the language. I
simply do not trust my intuition in this matter; and now, when there is an
overabundance of used language available, it is normally as easy to find an
appropriate example from a corpus as to make one up.3 What's more, if I have a
problem with a big corpus in finding an example, this makes me pause for
thought: perhaps I have not specified the example adequately, or perhaps I am
on a wild goose chase.
No doubt in time we will contrive better descriptions, and check our
invented examples against such descriptions; but experience suggests that we
will have a long wait, because descriptions of this quality would enable us to
construct accurate examples by rule. Some fifty years ago in the science of
phonetics researchers found that there was a large gap between their ability to
reproduce by machine the actual speech sounds of an individual subject, which
was so good that the original speaker could be easily identified, and their ability
to synthesise speech by rule using the same machine, which was lamentably
poor. This kind of gap is also showing between our ability to recognise normal
English and our ability to construct it without the benefit of an interactive
context.
2. In the early days of corpus linguistics, when researchers were trying to
interpret the results of probes into the mass of data, it quickly became apparent
that the information that they expected differed substantially from the
information that they received. To recall just one of hundreds of examples, it was
found that the common verbs in English did not occur very frequently in the
meanings that were intuitively associated with them. Everyone knows that give
has to do with a free and generous passing of ownership, take concerns grasping
and holding something, keep essentially is to do with maintenance, and put to do
with placement. However, these meanings were found to be of only minor
significance in a large corpus beside the meanings of the same verbs in familiar
collocations.
There are some ready partial explanations for this set of observations.
First of all, three very common verbs, be, have and do, have a fully grammatical
role as auxiliary verbs; that is the reason for their great frequency, and the
occurrence of, say, have meaning 'possess' is far less common, although
recognised as the 'core' meaning of the word. Secondly, there is an exotic
feature of English called the phrasal verb, where a verb (usually a very
common one) combines with a preposition or an adverbial particle to form a
unit of meaning. Give up, meaning 'abandon', take over, meaning 'assume
control of', keep off, meaning 'stay away from', and put off, meaning
'postpone', are among the thousands of examples.
But even if we put to one side the auxiliary and phrasal verb uses of these
common verbs, we are by no means down to the core meaning. What we now
find is a host of frequent collocations which make up idiomatic structures,
idiomatic in the sense that their meaning does not simply combine the meanings
of the individual words. Examples are take place, take a photograph, take
control, take time.
While these latter phrases were well known to people dealing with
English, their prominence in texts was not, and they had clearly been under-
assessed in reference books; their frequency was overwhelming, and the
intuitively-favoured meaning was insignificant in comparison.4 This was the
beginning of distrust of intuition: why did one's intuition fail to come up with
the massively common uses of a word, but instead reported a rather rare one?
The term delexicalisation was used of the process whereby the original
meaning of the verbs which appeared in these patterns was watered down or lost
completely, overlaid with a new meaning that arose from regular collocation. At
that time no-one was questioning the ideas (a) that each word had one or more
meanings, and (b) that one of the meanings had special status as the original or
core meaning. But gradually confidence in these ideas was eroded as it was
realised that a model based on these ideas only fitted the facts marginally, and
left most of the meaningful patterning unresolved in layers of ambiguity.
Estimates of the proportion of text that consists of multi-word lexical units rose
to as high as 80% in some circumstances. The link between the word and the
meaning gradually crumbled.
Delexicalisation is thus an unfortunate term. The word only appears to
lose meaning when the model has no higher unit to show where the meaning of a
multi-word unit is actually created. With the higher unit (the lexical item)
established, we can return to the role of intuition, and take a different view of its
accuracy and relevance.
The lemma TAKE contributes to many lexical items in coselection with
e.g. prepositions and particles to make phrasal verbs, and nouns to maintain the
preference of English for simple verbs such as take a risk instead of risk. These
are not strictly meanings of TAKE, but uses of the word in combinations. If
these and other coselections are removed from the concordance of TAKE then it
might well be that the main remaining meaning of this lemma is as reported by
the intuition.5 The intuition was probably right after all, but if so this has been
obscured by an inadequate model of interpretation.
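The filtering step described in this paragraph can be sketched in code. What follows is a minimal illustration in Python, not a record of any procedure actually used: the word lists PARTICLES and SUPPORT_NOUNS, the lookahead of three tokens and the function name residual_uses are all hypothetical, and a serious study would identify the coselections from the concordance itself rather than from short hand-made lists.

```python
# Hypothetical, hand-made lists standing in for coselections found in a corpus.
TAKE_FORMS = {"take", "takes", "took", "taken", "taking"}
PARTICLES = {"up", "over", "off", "on", "in", "out"}
SUPPORT_NOUNS = {"place", "risk", "photograph", "control", "time"}


def residual_uses(tokens, lookahead=3):
    """Return positions of TAKE that are not followed, within `lookahead`
    tokens, by a listed particle or a listed noun."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in TAKE_FORMS:
            continue
        following = [w.lower() for w in tokens[i + 1:i + 1 + lookahead]]
        # Drop the line if any listed coselection appears in the lookahead span.
        if any(w in PARTICLES or w in SUPPORT_NOUNS for w in following):
            continue
        kept.append(i)
    return kept


sample = "they took a risk and she took the book".split()
positions = residual_uses(sample)
```

On the toy sentence only the second instance of took survives, since the first coselects with risk. The residue left by such a filter is what would then be compared with the meaning reported by the intuition.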
Problems of the intuition are not always resolved so easily, but the fairly
objective evidence from corpora allows us to study intuitive positions and
reactions with greater clarity, and there is less chance that the intuition will be
dismissed as irrelevant on future occasions.
From these brushes between the corpus and the intuition, it is easy to see
how word could get around that intuition was not to be trusted, and that it tended
to take up a position that could not be supported by corpus evidence. However,
the failing, as so often, is more likely to be in the model, or theory, of language
through which we perceive linguistic events and with which we interpret them.
Take the case of inventing sentences: what are you asking your intuition to do?
Any utterance which is part of a communicative event is heavily dependent on
the events preceding it, to the extent that many contextual settings are already
established before the utterance takes place, and the utterance is interpreted with
reference to those settings. If a user of English is asked to produce a sentence of
English in the absence of these settings, it is a most unnatural request, and it is
unlikely that the subject will be able to imagine a suitable communicative event,
master all the relevant settings, mentally construct enough of the preceding
utterances to provide an adequate cotext, and then think up a sensible
contribution, although he or she is not involved in the hypothetical event. No
wonder we usually make a hash of it!
Because our basic models of language structure concentrate on sub-
sentential matters, and do not assign a central importance to interaction, we
formulate requests that appear simple enough from the perspective of our model,
but involve processes which are almost impossible to control, as we see when we
look through a richer model.
In the second instance, where a unit of meaning can spread over several
words, the intuition was delivering a perfectly reasonable answer to the question
as asked, but our resident models misinterpreted it, and so we blamed the
intuition and lost confidence in it. This kind of confusion is likely to characterise
future encounters with the intuition as well, until our models are rich enough to
cope with the information they are receiving, both from corpus and from the
intuitive reactions of people with command of the language.
So we both trust our intuitions and keep a wary eye on the strong
possibility of misunderstanding what we are observing. To hint at another area
where this could arise, it has been noticed informally that people recall phrases
that are frequent exceptions rather than normal constructions. Grammar deals
with the regular, and so is resonant with frequency; however, many words appear
commonly in phrases which are uncharacteristic of the normal usage of the
words. The intuition will tend thus to retrieve the non-standard structure. For
example the corpus tells us that there are a number of adjectives that do not
usually appear in front of nouns, in what is called the attributive position. Instead
they normally occur after the verb be in the predicative position. However, in
certain collocations, fixed phrases and idioms this restriction is lifted; so users of
English have to remember both the rule and the exceptions. It seems that the
intuition, if queried about a particular adjective, tends to report on the
participation of the adjective in lexical patterns rather than grammatical ones: in
phrases with particular collocates that are uncharacteristic of the grammar, rather
than the regular structures. From a grammatical point of view this seems
perverse (it is the exceptions that come to the surface first), but from a lexical
point of view it is a sensible response, since if an established multi-word
expression has a structure that is exceptional compared with the normal usage of
one of the words in it, this point has to be remembered in connection with the
individual word.
This intuitive response first came to our notice in Cobuild with the
adjective glad, and I have commented before about it (Sinclair 1991).
Overwhelmingly glad is used predicatively, and in some complex constructions.
However, many English speakers, when asked about the usage of this word, cite
a phrase from the translation of the Bible published in 1611, 'glad tidings of
great joy', that is still alive and well in the speech community. Apart from this
relic, and a few minor phrases, glad will be found, on thousands and thousands
of occasions, in predicative position. Without one of the tiny number of
collocations like tidings, it will sound very odd indeed as an attributive adjective.
There are good reasons for this which I will not go into now, because my point is
that here is another place where our intuitions may appear to report falsely about
the facts of the language. Following a grammar-predominant model, such as we
have, glad will be classified as predicative without question on the basis of
corpus evidence; the intuition may hold on fiercely to the few phrases that
contravene this convention. A model which is more balanced between grammar
and lexis should mediate successfully between these apparently opposed
positions.
It seems that glad is not alone in presenting a different pattern to the
grammar and the lexis. The adjectives ill, safe and likely are all found
predominantly in the predicative position, but it is easy to think up phrases where
they are used attributively: an ill wind, ill effects; safe haven, safe sex; a likely
story, a likely lad. So in each of these cases we can interpret the role of intuition
as preserving memory of those phrasings which are characteristic of the lexical
patterning, especially when the more general and freer usage of the adjective is
in a contrasting grammatical structure. When what is exceptional in grammar is
typical in lexis, the phrasings are stored as individual items.
While there is at least an interpretive problem in the cases that we have
discussed, there is one process where the intuition can be safely trusted. In the
evaluation of corpus evidence the researcher has virtually no option but to yield
to the organising influence of his or her intuition. Complex patterns of
coselection are immediately interpreted semantically and classified broadly with
respect to each other. The same mental resource that, as we have seen, is unable to
manage coselections outside participation in a genuine communicative context is
apparently razor sharp and completely reliable in a receptive mode.
To illustrate this I present the results of asking The Bank of English what
the principal collocates were of the pattern on the ... of.6 Any single word form
might occur in the gap, but it should be noted that the collocates are not restricted
to occurrence in that position, but might be anywhere in a ten-word window
around the phrase.
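A query of this kind can be sketched as follows. This is a minimal illustration in Python, not the Bank of English software: the function names are hypothetical, the ten-word window is assumed here to mean five tokens on each side of the phrase, and the ranking measure is the commonly used t-score approximation t = (O - E) / sqrt(O), where O is the observed co-occurrence count and E the count expected by chance. How the scores reported here were actually computed is not stated, so that formula is an assumption.

```python
import math
from collections import Counter


def frame_collocates(tokens, window=5):
    """Find every 'on the X of' and count the words in the gap and in a
    window of `window` tokens to the left and right of the phrase."""
    hits = [
        i for i in range(len(tokens) - 3)
        if tokens[i] == "on" and tokens[i + 1] == "the" and tokens[i + 3] == "of"
    ]
    counts = Counter()
    for i in hits:
        counts.update(tokens[max(0, i - window):i])    # left context
        counts[tokens[i + 2]] += 1                      # word filling the gap
        counts.update(tokens[i + 4:i + 4 + window])    # right context
    return hits, counts


def t_score(observed, node_freq, word_freq, corpus_size, span):
    """Assumed approximation: (O - E) / sqrt(O), with
    E = node_freq * word_freq * span / corpus_size."""
    expected = node_freq * word_freq * span / corpus_size
    return (observed - expected) / math.sqrt(observed)


toy = "we met on the eve of the vote and again on the eve of the match".split()
hits, counts = frame_collocates(toy, window=2)
```

Because the counts are taken over the whole window and not only the gap, words that typically stand before or after the phrase are ranked alongside the words that fill it.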
The leading collocates, according to their t-scores, are listed in Table 1. I
believe that anyone with normal fluency in English would find it difficult to scan
this table without making tentative groupings of the words along various
dimensions.
Table 1: Collocates of the phrase on the ... of (Bank of English, May 2002)

1-12        13-24       25-36      37-48
basis       depends     isle       sale
edge        depending   number     day
eve         streets     effect     corner
back        future      surface    cover
part        issue       banks      focused
verge       grounds     floor      comment
based       strength    island     report
side        depend      focus      evidence
brink       impact      site       outcome
outskirts   heels       night      amount
subject     morning     stroke     restrictions
face        question    advice     emphasis
One grouping could go like this:
1. Timing expressions, such as eve, night, morning, day. Here we note that eve is
an unusual word, usually found in poetry and oratory. This is a clue to the
meaning of these expressions, which are used in the timing of important events.
On the stroke of is also a somewhat dramatic timing expression, which needs a
particular time after it, the kind of time that is likely to be signalled by a clock
striking or something similar. As well as the hours, especially midnight, and half-
time, full-time, those unfamiliar with the game of cricket might be surprised to
find on the stroke of lunch/tea in there as well.
2. Spatial indicators, such as back, side, surface, floor and corner. Site attracts
collocates to do with buildings. Outskirts, streets, banks are more specific spatial
references. Isle and island are parts of place names. Some uses of edge, verge,
brink are also spatial, but on the brink of and on the verge of are commonly used
as complex prepositions introducing mainly dreadful things.
3. The phoric nouns subject, issue, question, whose referents are to be found in
the surrounding cotext; in this phrasing probably just after the of.
4. The complex prepositions on the basis of, on the grounds of, on the strength
of, indicating the reason for a decision.
5. In some cases the lexical item extends beyond the designated phrase; for
example most of the occurrences of face in this phrasing are in the phrase on the
face of it, with a variant on the face of things. On the heels of is usually preceded
by hot or hard, or one of a few variants like close.
6. More generally, part fits into a phrasing X on the part of Y, where X is some
action, usually described in derogatory terms, and Y is the actor. Future attracts
talks and similar events on its left, and political problems on its right. Effect is
part of an item which can be represented as X has a Y effect on Z, where X is
some event, Y is an adjective like adverse, dramatic, and Z is something like a
political programme. Cover is usually preceded by the name of a celebrity, and
followed by the name of a journal. Sale predictably attracts the vocabulary of
financial dealings. On the evidence of has a remarkable tendency to come at the
beginning of its clause, introducing the reason for an action which is reported
later.
7. The remaining nouns that occur between on and of are frequent but
unremarkable collocationally, like number, amount, advice.
8. Depends, depending, depend are verb forms which are likely to come in front
of the expression, as also focus, focused, based, comment, report. Evidence is
typically preceded by one of these verbs. Impact, emphasis and restrictions are
much more likely to precede this phrase than to be the missing noun.
The account of the patterning associated with this phrasal framework is
presented here artificially, because normally any instance of it would come with
a word selected to fill the gap. A fluent user of English, encountering an actual
instance of the phrase, performs an instant interpretation which involves all the
relevant categorisation and a lot more detail besides. There is no escape from
intuition if you have command of the language you are investigating. Even if a
researcher wanted to view the data directly and without the accompaniment of
intuition, it would be almost impossible. It is instructive to examine a
concordance in a language unknown to you to get an idea of what it is like to see
pattern only. But techniques exist for keeping the intuition temporarily at bay,
and these are worth cultivating.
The format of a KWIC concordance is a great help in itself, because the
vertical patterns, which are not meaning-bearing, are prominent, and can
provide a neutral framework within which the researcher can see patterns
without immediately ascribing meaning to them and therefore establishing
meaning-bearing relationships among them. In reading through Table 1, and
imagining each word in the frame on the ... of, there are probably instances
where at first the meaningful order was not clear, especially if the whole of the
lexical item was not present. On the stroke of, for example, clearly needs to be
followed by a specific time to be intelligible. On the strength of is normally a
complex preposition as noted in point 4 above, where strength has little to do
with strong; however, if preceded by a form of DEPEND, the preposition
disappears and strength reverts to its independent meaning.
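For readers who have not worked with one, the KWIC (key word in context) layout can be sketched in a few lines. This is a minimal illustration only; the function name kwic and the fixed character width are assumptions, and real concordancers offer sorting and much else besides.

```python
def kwic(tokens, node, width=30):
    """Return one line per occurrence of `node`, with the node word in a
    fixed column so that repeated words to its left and right line up."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != node.lower():
            continue
        left = " ".join(tokens[:i])[-width:]        # last `width` characters before the node
        right = " ".join(tokens[i + 1:])[:width]    # first `width` characters after it
        lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines


text = "on the strength of it we depend on the strength of steel".split()
lines = kwic(text, "strength", width=12)
for line in lines:
    print(line)
```

Since the left context is right-justified, the node always begins in the same column, and it is this vertical alignment that lets the eye pick out recurrent forms before any meaning has been ascribed to them.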
The examination of Table 1, trying each word in it to see if it fits in the
gap and if so how the meaning is organised around it, is a kind of alienation
process that I have called degeneralisation (Sinclair et al. 1996: 177). Since the
essence of finding the meaning-creating mechanisms in corpora is the
comparison of the patterns (as physical objects and quasi-linguistic units) with
the meanings, it is valuable to be able at times to study the one without the other.
This takes a little skill and practice, but to my mind should be an essential part of
the training of a corpus linguist.
3. Annotation
Aarts (2002a: 10) went so far as to say that annotation is anathema to
corpus-driven linguists. This is a fairly serious misunderstanding, and to clarify
my own position it is necessary to define terms carefully.
Let us first distinguish between mark-up and annotation. They are not
always kept distinct in usage, and their domains may overlap, but they are worth
distinguishing. Both of them are processes which provide additional information
to what is called plain text. Plain text is a straightforward concept, but there
are some who claim not to understand it, so we will start there.
Imagine that you had a long thin reel of paper to write on rather than a
rectangular sheet, like a reel of sticky tape but made of paper. You have in front
of you a piece of writing that you want to record onto this reel of paper, just a
paragraph. How would you do it? I expect that you would ignore line ends,
remove hyphens that marked words split at line-ends, and otherwise produce a
continuous stream of letters, numbers and punctuation marks in the same
sequence as the original. That is plain text, and it consists of an alphanumeric
stream.
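The transfer just described can be sketched as a small function. This is a minimal illustration, assuming the page arrives as a string with newline characters at the line ends; the function name to_plain_text is hypothetical. Note one known weakness of so simple a rule: it also closes up a genuinely hyphenated compound that happens to be broken at a line end.

```python
import re


def to_plain_text(page: str) -> str:
    """Turn a page of wrapped lines into one continuous stream."""
    # Rejoin words that were split with a hyphen at the end of a line.
    joined = re.sub(r"-\n\s*", "", page)
    # Ignore the remaining line ends: any run of whitespace becomes one space.
    return re.sub(r"\s+", " ", joined).strip()


page = "a continu-\nous stream\nof letters"
stream = to_plain_text(page)
```

Everything else, the letters, numbers and punctuation marks, passes through in its original sequence.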
If you continue transferring written text in this way, however, you will
soon encounter problems: bold face, italic and underlinings, for example, and
headings, large fonts and other layout matters. Mark-up is the process of
recording these additional pieces of information by making notes interspersed in
the alphanumeric string. So just before a section of bold face will be a tag that
says 'from here on there is bold face', and just after the section there will be a tag
that says 'from here on we return to normal face'. The tags are coded in a mark-
up language, of which the most widely used has been SGML, now giving way to
XML. So for each note there will be two tags.
In marking text up, then, the aim is to preserve information that would
otherwise be lost in the transfer of text to electronic form. Annotation, which we
will come to later, uses the same conventions as mark-up but has no limits on the
kind of information that is provided. Specifically, it encodes information which
is not directly recoverable from the original text, but is added by a researcher.
Returning to mark-up, now imagine that instead of a written text to be
transferred to the reel of paper, you were faced with a recording of a
conversation. Here there are many more decisions to be taken, because the sound
wave has to be interpreted as an alphanumeric stream. Let us say that you do not
attempt a phonetic transcription, but you adopt the mode of transcription called
orthographic, using ordinary spelling wherever possible, but noting all the false
starts, laughs, coughs and stutters.
If you do this conscientiously you will end up with a legible text, but one
which has lost a lot of the original information in the sound wave. Intonation is
poorly represented in punctuation, stress is usually not marked in writing, and all
sorts of emotional and attitudinal meanings will not transfer. You may want to
mark up the transcription to record some of these important items, again using a
tag coding.
There is good motivation for preserving this information, and there are
various ways of preserving it. However, it is important to note that a simple
orthographic transcription has a definite status in and for itself, even though it
may be enhanced by good mark-up. It is legible, and a fluent speaker is usually
able to infer enough of the missing information to understand the transcript with
only occasional difficulty, much as he or she would adjust to a speaker of an
unfamiliar variety of English. You will have included word spaces in your
transcription without difficulty, though you did not hear most of them, and
perhaps speaker change, with some attempt to recognise the various speakers.
You will have made a stab at sentence and paragraph boundaries, and used full
stops and capital letters with confidence.
In the very first corpus of spoken language in electronic form (Sinclair et
al. 1970 and forthcoming) there was no difference made between capital letters
and small ones because in the early sixties computers could only cope with one
alphabet. There was no punctuation and no indication of speaker change because
transcribers were asked not to include these. Word spaces were present, and this
led to criticism from some purists, but I find no problem in using the
transcribers' ability to detect word spaces to improve legibility.7
Conventions such as SGML originated when computers were not nearly
as powerful and flexible as they are today. We have reached a stage where a
recorded conversation can be digitised and all the features of the sound wave
which are relevant to language can be retained in the computer and presented to a
researcher as required; so, for example, an orthographic transcription can be
aligned with the sound wave from which it was transcribed, and segments of the
recording can be played back to order, so there is no further need for mark-up.
Similarly, documents can be digitised, retaining all relevant aspects of their
format, layout and typography, and again this information, kept separate from the
alphanumeric stream, can be aligned as and when required.
So the mark-up languages represent a stage of development of computer
text processing which is now obsolete. The updating of existing corpora will be
slow because a lot of material has been tagged (and often re-tagged to keep up
with changes in best practice), and there are some contingent problems which
will be mentioned below.
There is an issue here of the integrity of texts. While it is conceded that no
electronic representation of a text is identical with the original, the object of
making an electronic copy is surely to preserve at least the alphanumeric stream
in its original sequence. Any disturbance to that will lead to difficulties later on,
particularly now that many corpora are much too large for human inspection.
The principal problem is that it is not possible to be sure that all the tags have
been removed, without the accidental removal of some genuine text. There are
two sources of error here: one is the accuracy with which the tags have been
inserted, and despite the availability in recent years of SGML parsing and
checking programs there are all sorts of opportunities for error. The other is that
strict adherence to the rules is laborious, and there are a number of short-cuts that
are commonplace, and not necessarily retrievable. The situation was summed up
by Vlado Keselj in a message to the Corpora List in April 2002: 'Actually,
writing a correct and general SGML detagger would be a *very* difficult task.'
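The difficulty is easy to reproduce. The sketch below is a deliberately naive detagger, written for illustration only; the function name and the sample string are hypothetical. It deletes anything between angle brackets, and so it also deletes genuine text whenever the text contains unescaped angle brackets of its own, which is just the kind of short-cut referred to above.

```python
import re


def naive_detag(text: str) -> str:
    """Delete everything that looks like a tag: '<', some characters, '>'."""
    return re.sub(r"<[^>]+>", "", text)


tagged = "<p>if a < b and c > d</p>"
stripped = naive_detag(tagged)
print(repr(stripped))
```

The two real tags are removed, but so is the stretch from the first genuine angle bracket to the second, and in a large corpus nothing signals that part of the text has gone.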
Thankfully, there is an easy way of avoiding this problem. The
alphanumeric stream, the plain text file, can be just one of several parallel data
streams, and mark-up tags can be another. When required, these two streams can
be merged, and a single string alternating text and tag can be made. This does not
affect the integrity of the plain text file, and the process can be repeated and
elaborated as required. This system has been in everyday use for some fifteen
years now, but it is still common to find tagged corpora that are not available in
plain text form, and can only be separated by a laborious process of doubtful
accuracy.
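The arrangement of parallel streams can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the tags are held as (character offset, tag) pairs, the tag names are hypothetical, and the function name merge is invented for the example.

```python
def merge(plain, tags):
    """Interleave a plain text stream with a separate stream of tags.

    `tags` is a list of (offset, tag) pairs; each tag is inserted in front
    of the character at `offset`. The plain text itself is never altered.
    """
    pieces = []
    position = 0
    # sorted() is stable, so tags sharing an offset keep their given order.
    for offset, tag in sorted(tags, key=lambda pair: pair[0]):
        pieces.append(plain[position:offset])
        pieces.append(tag)
        position = offset
    pieces.append(plain[position:])
    return "".join(pieces)


plain = "This is bold text"
tag_stream = [(8, "<b>"), (12, "</b>")]
merged = merge(plain, tag_stream)
```

The merged string can be thrown away and regenerated at will, with this or any other stream of tags, while the plain text file remains the single authoritative copy of the alphanumeric stream.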
We can summarise the arguments around mark-up as follows:
1. The information captured in mark-up is valuable and worth preserving.
2. Mark-up is not the only way of preserving this information.
3. Mark-up is now obsolete as a way of storing text.
4. Marked-up text can be prepared by merging plain text and tags.
5. Corpus material should always be kept in plain text format.
With this in mind, we can now turn to annotation. Annotation uses the same
conventions as mark-up, but is not restricted to features of the original text or
recording. The classic annotation is POS-tagging, which means inserting after
each word in a corpus a code denoting its part of speech, but there are now many
others, some quite unusual and informal, and many corpora are very heavily
annotated.
I would certainly not condemn all annotation, and I make judicious use of
it myself; but I have reservations about some practices, and about the wisdom of
relying on a platform of annotated text in our present state of knowledge. The
idea that annotation is anathema to people who share my views no doubt arises
because of these stated reservations.
In order to clarify my position, I would like now to make a distinction
between a corpus which is prepared for general use by a community of
researchers, students and workers in the language industries, and one which is
put together for a particular application. My comments and particularly my
reservations largely concern the former type of corpus, often known as a
generic corpus, where I take the simple view that all the information apart from
the plain text should be optional, because (a) some important groups of users
require only that, and (b) most researchers will only require a small subset of the
annotations that might be available. Researchers using statistical methods usually
need a large amount of plain text, as do those searching for lexical patterns.
Information from mark-up and annotation would only be of interest in problem
cases, and statistical studies rarely get down to that level of detail. Also, as
annotations become more varied and verbose, no-one will want to make use of
all of them, and if the corpus is only available in fully annotated form, they will
be carrying a lot of baggage around with them.
The other type of corpus, one that is designed and built for a pre-
determined application, will give top priority to the needs of the job, quite
rightly. The type and level of mark-up and annotation will depend on the kind of
queries that the investigation requires, most of which will be knowable in
advance. In such circumstances, which are common in commercial applications,
the best that one can do is appeal for the researchers to observe good practice so
that their corpus may be reusable for other purposes.8 The same situation is
found in the growing practice of putting together quick, highly specialised
corpora, perhaps from the internet, in order to carry out a limited set of tasks,
with no intention of retaining the corpus afterwards (the disposable corpus). In such
cases any short-cut is justified and it is irrelevant to suggest that researchers
conform to good practice (see Pearson and Bowker 2002).
So I am only concerned with generic corpus resources. Many prospective
users of such corpora expect to be offered POS tagging and sometimes full
parsing and semantic and pragmatic tagging as well, and there is no reason why
such annotations should not be available with generic plain text corpora, but they
should be optional, and they should conform to the conditions set out above, for
mark-up, and below. Many projects start out with a request to the corpus
linguistics community for a corpus already tagged in a particular way. In the best
scientific tradition, researchers use previous research as their platform, and probe
beyond their predecessors.
Here is where my reservations start. In the first place, all the annotation
systems that I know of that code linguistic information have an element of
human input, of which the smallest-scale intervention is the human correction
of the computer's mistakes. In many procedures the computer plays a fairly
minor role in the decision-making and is used just to manage the data; in others
there is a preliminary stage where the input text is manually edited and then
processed automatically.
I have argued for some years that annotation which is not fully automatic
has no place in the toolboxes of generic corpora. It is unavoidable in many
applications because of their need for practical outcomes, and because there are
no suitable tools which are fully automatic. While it is claimed that better and
better analyses are made by researchers working in partnership with the
computers, in Aarts' words, 'at some moment the descriptive model and the
annotation tool derived from it must be frozen if the desired result is to be
achieved' (Aarts 2002b). That is a fact of life in applications.
Unfortunately, too many researchers nowadays expect, and accept, off-
the-shelf tools that they do not examine too closely; the tools may be of some
antiquity, but they are not carefully evaluated. There is thus no incentive to work
towards a new generation of fully automatic tools which derive from a corpus-
sensitive analysis, and which may present a rather different picture of the
language from the present ones. The whole procedure of annotation is pretty
frozen at present, and has moved very little in the last decade, because the
theories are not accessible for modification by the data.
There are two compelling reasons why annotation of this kind should not
be offered as part of a generic package.
One is that the models of language used in today's taggers date from a
time before evidence from a corpus was available, and some of them derive from
models which ignored empirical evidence entirely. A corpus can certainly be
used to evaluate and correct the descriptions that come from these models, and
eventually the models themselves, and this does happen in a very small way
concerning some of the details of classification. But, as Tognini Bonelli points
out in the quote early in this paper, for many scholars there is no impetus to
expose the theory to such scrutiny. Overwhelmingly the consensus view of
researchers is that the models are basically correct, and while they can be tidied
up by corpus evidence there is no need to open up the whole complexity of
language theory and description for the sake of some minor blemishes. Better to
get on with the job.
In the view of corpus-driven linguists, the picture is quite different. Their
perception is that corpus study provides a constant, subtle undermining of the
received models of language. The evidence is piling up all the time, but it is
invisible to anyone who looks only through the categories of the received model.
Claims of a high-per-cent accuracy of tagging are misleading, because the
decisions about what is correct and what is wrong are not supported with
linguistic evidence. Also most wrong assignments are systematically wrong,
because the machines are consistent at least, and the researcher is left with two
misgivings: (a) perhaps the computer is offering valuable new information rather
than making mistakes, and (b) the places where the computer is unreliable are
probably just the places where the researcher would like to rely on it.
The other argument against conventional tagging causes some problems
when put, as I sometimes do, in the form 'Annotation loses information'. It
would seem at first sight that annotations add to the information in the corpus,
and indeed terms like 'enrichment' are sometimes rather rashly used to promote
annotated text. Let us start with a simple case and follow it through. Let us agree
that boy, bicycle and brat are all nouns. They each are given the tag N. Once
this is done, they are all identical from the point of view of the tags; their
individuality is lost.
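The simple case can be made concrete with a minimal sketch. The three-word lexicon and the function names below are assumptions made purely for illustration; they do not belong to any real tagger. The sketch shows only that the mapping from words to tags is many-to-one, so the tag stream on its own cannot be mapped back to the words.

```python
# Hypothetical toy lexicon: every entry is assumed for illustration only.
LEXICON = {"boy": "N", "bicycle": "N", "brat": "N"}


def tag_stream(words):
    """Replace each word by its tag, giving a tag-only view of the text."""
    return [LEXICON[w] for w in words]


def words_for_tag(tag):
    """List every word that the lexicon maps to the given tag."""
    return sorted(w for w, t in LEXICON.items() if t == tag)


tags = tag_stream(["boy", "bicycle", "brat"])
# Three distinct words now share one tag and cannot be told apart by tag alone.
candidates = words_for_tag("N")
```

Going from `tags` back to the original words is impossible without the word stream, since each `N` is compatible with every entry in `candidates`.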
The proponents of annotation argue at this point (a) that there is a gain in
generality in the recognition of what is shared among members of the class N,
and (b) that the individuality of the word is not lost, because the word itself is
still there in the linear stream. These points need to be explored carefully.
First, the gain in generalisation, which is certainly a valid point as long as
generalisation can be demonstrated, but here the informality of the received parts
of speech weakens the argument considerably. No formal definition exists of the
class N; computer grammarians rely on an uneasy mix of received
grammatical categories that cannot be represented in a computer, and
discriminatory routines whose only virtue is that they come fairly close in
practice to the received categories, thus reducing the amount of manual labour in
matching them precisely. The painstaking efforts and academic honesty of Biber
et al (1999) are worth noting here, because they doggedly follow the model of
Quirk et al (1985) and so they do not have a chance of aligning their received
categories with the evidence from their corpus. So they resort to talk of the
nouny-ness of nouns (p. 59) and are most unconvincing in their attempt (pp.
255-8) to hold onto species nouns like sort, loads as still in the same class as
nouns like boy and bicycle.
The second argument is that the annotated text gets the best of both
worlds, because the individual word is retained, and the researcher has the choice
of word or tag. But the fact that word and tag alternate in a single linear string
should not deceive us; the text and the tags form two mutually exclusive versions
of the corpus, as Aarts is careful to point out (2002a: 9). While it is possible to
search for a mixed string, for example boy with the tag N, that is essentially a
lexical query and is not likely to be characteristic of the searches.
The replacement of a word like boy by a tag like N loses information in
a more subtle way also, because, having been designated N, the word cannot
be reclassified or seen, even temporarily, as anything else. If the grammar has
failed to note that 'Oh, boy!' is an enthusiastic expression of approval, then that
boy will be replaced by N like any other. A particular view of language is
imposed on the corpus, down to the finest detail, and it is non-negotiable. There
are many areas even in POS tagging where experts differ for good reasons: just
what is negative and what is not and what, if anything, lies in between these
polar opposites, for example. Or just what is a modal expression. Each tagger
will put into practice a policy for these categories that is more likely to be the
result of expediency than the elaboration of a theory, and these decisions will
affect a decade or more of research, without the users even being aware of them.
Most researchers are content that someone has tagged the corpus, and they are
not inquisitive as to how this was done, or what the shortcomings are.
One major structural feature of English, often commented on, is the large
number of forms which function as either verbs or nouns, so that a conventional
tagger has a huge job to distinguish them. Promise and promises, for example,
are interchangeable between the two word classes. Also this is a productive area,
so that new crossovers occur daily,9 and even nouns formed by suffix from verbs
can become verbs again, e.g. gift, gifted. This prominent, almost defining feature
of English word classes is completely ignored in normal POS tagging, and all
sorts of tricks and dodges are used to obliterate it. If discussed it is called
'portmanteau tagging', which shows the all-pervading grip of the received
models. Why should a grammar of English not recognise a word class that covers
both verb and noun, as well as having one for just verbs and one for just nouns?
This second kind of information loss is the loss of the potential to be
classified in all sorts of different ways according to different criteria; such
flexibility is vital to any theory-based research. This is another way of seeing the
individuality of words, which is denied them as soon as they are given a tag.
From this discussion it is clear that non-automatic annotations are best
confined to applications, where they can expect to remain in use for some time.
Their inclusion among generic resources, however, is misplaced and hazardous,
and it holds back progress substantially. Instead of research projects pushing
ahead with the improvement of fully automatic annotation, a considerable
proportion of the available funding goes into this very flawed activity.
Any unavoidable human role in the process of analysing corpora holds
back progress along many dimensions, but none so obvious as in the size of
corpus to be managed. Generic corpora are now measured in the hundreds of
millions of words, and this figure will rise and rise because each rise in the order
of magnitude shows the need for the next one, and there is no reason why this
should stop at some arbitrary size. Any human input, no matter how tiny, that
grows with the size of the corpus adds so much to the cost and time, as well as
opening an opportunity for inaccuracy, that either the size of the corpus has to be
kept down or costs will soar.
To summarise this complex area, my reservations about annotation are
quite specific, and concern only their inclusion in the resources around generic
corpora. Because they impose one particular model of language on the corpus,
they restrict the kind of research that can be done; because the practice of
annotation normally requires human intervention, it is not a replicable process
and therefore fails the first test of scientific method. Because the models imposed
by current conventions of annotation are unlikely to be informed by corpus
evidence, I believe researchers who use them are likely to make unnecessary
problems for themselves.
None of these reservations are relevant when researchers are concerned
with an application and considering matters such as cost-effectiveness, and are
not interested in any factors outside the application. Annotation as an
exploitation of the mark-up facility is typical of the kind of tool that emerged in
the early days of computing: simple, extremely flexible and useful. The other
side of the coin is that it can be uncontrolled, invasive and overwhelming; I
believe that most of the research projects in corpus linguistics that are in progress
at the present time are not examining their languages at all, but are examining the
tags. The particular choices of word combinations that corpora uniquely offer us
are impossible to retrieve using tags.
As a matter of personal practice, I have very little need for non-automatic
annotation, and I use plain-text corpora whenever possible. This is because I am
primarily interested in the implications of corpus study for the development of
language theory and description. If I was obliged to use only annotated corpora
to work with (which is the settled policy of, for example, the Arts and
Humanities Research Board in the UK, which funds most of the relevant research),
then my work would be hampered if not rendered impossible.
This is where we come to the crunch about annotation, where I think I
part company not only with Jan Aarts but with quite a proportion of the ICAME
community. This is because I do not regard the description of languages as
application, and therefore I would advise against using annotations of the kind
we have available at present in the practice of language description.
I must define what I mean by application as carefully as possible, because
the word can be used to describe many relationships between theory and
practice; a description of a language is often seen as an application of a theory,
for example, but that is not the sense in which I want to use the term. For me an
application in linguistics is the use of language tools in order to achieve a result
that is relevant outside the world of linguistics. If you are building a machine that
will hold a telephone conversation, for example, that is an application, or a
translating machine or even writing a dictionary; the end users are not
necessarily nor even primarily linguists, and so these projects are applications of
linguistics.
But research that tries to produce a better description of English grammar,
for example, is not an application; it is only directly relevant to other
grammarians. My contention is that whereas there is justification in applications
(in my sense) for using any tools that may further the work, this is not so in
language description for its own sake. In the former case the judgement is by
results, and the end justifies the means, so if the translation machine works well
it matters little what is inside it.10 But how do we tell if a description is of good
quality?
Descriptions are evaluated from above and below, so to speak. Lying
between the theory and the data, a good description is one that shows few
discrepancies in either direction. Its categories will be consistent with the theory,
and they will account comprehensively for the patterns observed in the data. But
if the data is preprocessed by annotation which is not automatic and is avowedly
an elaboration of the theory, then there is clearly a vicious circle in operation.
The theory cannot come under attack because the only available view of the
corpus is one viewed via the theory.
Corpus linguistics has still to mature a little, to shake off the last traces of
the days when a corpus was a major problem for a fledgling computer, and
where 'Mr Fixit' attitudes were welcome because they led to quick, if perhaps
wobbly, results. The demands of today's researchers are ever more sophisticated,
and the software facilities they are offered are often built on shaky foundations.
The results of applications using annotated corpora are uniformly unimpressive
when they concern the appreciation of meaning in open text, and as a
consequence workers in Information Technology do not trust the structure of
language, and talk of it in degenerate terms reminiscent of Chomsky's dismissal
of performance (Chomsky 1965: 3 f.).
My unease about the over-use of annotation (an annotated corpus can
reach a condition where over 80% of the bulk consists of the annotations,
compared with less than 20% for the texts) always ends up with concern that the
models underlying the annotation are neither adequate for nor even relevant to
the description of the language in a corpus.11 Most researchers are not language
theorists, and they take on trust the software that is offered, and apply it
uncritically. Corpus-driven linguistics aims at developing the models so that they
become more reliable; it is reasonable to suppose that as the models improve, the
descriptive categories become more amenable to automation, and annotations
(always optionally) could become associated with generic corpora.
There is no space for me to illustrate these points with reference to actual
cases, but I can refer the interested reader to my review of Biber et al (1999)
published in IJCL 6/2 (Sinclair 2001). This grammar explicitly applies a pre-
corpus model of language to a small corpus and annotates the corpus as a first
step. Despite what must have been an enormous effort of silent editing, the
evidence that surfaces in the book consistently fails to validate the categories of
the imposed description.
4. Conclusion
This has been an exercise in clarification, because for many linguists working
with corpora it might seem bizarre that one group distinguish themselves by
denying any role for the intuition, and condemning the normal practice of
annotation. I cannot, of course, speak for all researchers who might see
themselves as sharing a corpus-driven perspective, but I hope that I reflect their
general position fairly. They have a great respect for intuition, and cannot work
without it. The 'cannot' applies in two meanings: they are constantly guided by
it, and they could not get rid of it if they wanted to. As part of their professional
stance they cultivate the skills of degeneralisation, allowing them to stand back a
little from participating in the language events they observe as researchers, and
to defer momentarily the intuitive response; this gives them a small amount of
independence from their intuitions.
They appreciate, moreover, that intuitive responses need careful
interpretation, and they respect the limits of intuitive competence; in particular
they do not expect that if they invent a sentence their intuitions will ensure that it
has all the features of a naturally-occurring one.
At present corpus-driven linguists are not likely to have much use for
annotation, because most of the available systems suffer from the twin
drawbacks that their underlying model of language is pre-corpus, and that they
fit the corpus so badly that human intervention is necessary. Annotation,
however, even of the limited kind we have, has its place in applications, where
quick results are needed and rough-and-ready ones will suffice.
Perhaps the main difference between the two methodological stances in
corpus linguistics is their attitude to the use of annotations, of the present-day
variety, in purely descriptive studies. To the corpus-based linguist they are
indispensable, whereas to the corpus-driven linguist they are obfuscating.
But provided that the various safeguards discussed above are respected
(including those raised in connection with mark-up) there is no objection to the
practice of annotation in itself; used without understanding of its limitations it is
a hazardous practice. Perhaps newcomers to the growing profession of corpus
linguist should be given a few warnings: that annotation is a coding convention
that has no controls beyond the grammar of the code, that the appearance of an
annotated corpus belies the fact that it is an alternation of two separate and
incompatible codes (in the sense that plain text is also a code), that the two
coding streams should always be maintained separately, and that non-automatic
annotation is essentially subjective.
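One way of keeping the two coding streams separate can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing corpus format: the layer name `pos_layer`, the tags, and the `merge` helper are all hypothetical. The plain text is stored untouched, the annotation is stored apart from it and keyed by token position, and an interleaved view is built only on demand.

```python
# Plain text stream: stored untouched and usable on its own.
tokens = ["Oh", ",", "boy", "!"]

# Annotation stream: stored separately, keyed by token position.
# The layer name and the tags are hypothetical, for illustration only.
pos_layer = {0: "UH", 2: "N"}


def merge(tokens, layer):
    """Build an interleaved word_TAG view on demand; untagged tokens stay bare."""
    return [f"{tok}_{layer[i]}" if i in layer else tok
            for i, tok in enumerate(tokens)]


plain_view = " ".join(tokens)
tagged_view = " ".join(merge(tokens, pos_layer))
```

Because the layer is optional, passing an empty layer returns the plain tokens unchanged, and a different analysis can be swapped in without touching the text.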
Notes
1. The pattern grammars mark a first step in following the corpus evidence
with little or no grammatical preconceptions, and Hunston and Francis (1999)
give a thorough explication of this approach.
2. See the discussion in Sinclair (1984).
3. The phrase 'used language' is from Brazil (1995); while a little whimsical to
be a regular term, it allows us to avoid the issue of authenticity that is such a
humbug in this kind of discussion.
4. It is always conceded that frequency is a crude measure of importance,
and more an indication of a criterion than a criterion in itself. But where two
uses of a word show massive discrepancies in frequency, and the less common
one is the one that first comes to mind, then there is some explaining to be
done.
5. There are 755784 instances of TAKE in the Bank of English, so it would be a
considerable though worthy labour to check this. I have looked at several
small samples, and I have not so far found any convincing examples of the
core meaning, but I would expect them to be few and far between.
6. The Bank of English stood at a little less than 500 million words when these
data were retrieved. Details of the corpus can be found at
http://www.cobuild.collins.co.uk. I am grateful to The University of
Birmingham, co-owners of the corpus, for access to it.
7. An example of this kind of text can be found in the file LEXIS at
http://ota.ahds.ac.uk/, being transcripts of recordings made at the University of
Edinburgh in the early 1960s.
8. See Wynne (ed) (forthcoming) for an example of such guidance.
9. Today's example: 'I badged my way into the lobby', said by a police
inspector arriving at a crime scene (Patterson 2002: 23).
10. Some might say that if the description is inaccurate then the machine will
never work properly, and that there is evidence in the performance of such
devices that support this position. But it is an empirical question.
11. Attitudes change quickly in this area of study, and I can only be sure that in
the few years up to the composition of this paper in 2002, SGML format was
regarded as the standard among the advisers to AHRB. The advisers have
changed, thankfully, and there may now be a greater understanding of the
numbing effect of having to view one's data through the imperfect vision of
another.
References
Aarts, J. (1991), Intuition-based and observation-based grammars, in K. Aijmer
and B. Altenberg (eds), Corpus linguistics. Studies in honour of Jan
Svartvik. London: Longman. 44-62.
Aarts, J. (2002a), Does corpus linguistics exist? Some old and new issues, in L.
Breivik and A. Hasselgren, From the COLT's mouth ... and others'.
Amsterdam: Rodopi. 1-17.
Aarts, J. (2002b, forthcoming), Review of E. Tognini Bonelli, Corpus linguistics
at work. International Journal of Corpus Linguistics 7 (1).
Biber, Douglas, S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999),
Longman grammar of spoken and written English. London: Longman.
Brazil, D. (1995), A grammar of speech. Oxford: OUP.
Chomsky, N. (1965), Aspects of the theory of syntax. Cambridge, Mass.: MIT
Press.
Francis, G. (1993), A Corpus-Driven Approach to Grammar, in M. Baker, G.
Francis and E. Tognini Bonelli, Text and technology. Amsterdam: John
Benjamins. 137-156.
Hunston, S. and G. Francis (1999), Pattern grammar. Amsterdam: John
Benjamins.
Johns, T. (1990), From printout to handout: Grammar and vocabulary teaching in
the context of data-driven learning. CALL Austria 10: 14-34.
Patterson, J. (2002), 1st to Die. London: Headline.
Pearson, J. and L. Bowker (2002), Working with specialised language: a
practical guide to using corpora. London: Routledge.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Sinclair, J. (1984), Naturalness in language, in J. Aarts and W. Meijs (eds),
Corpus linguistics: Recent developments in the use of computer corpora in
English language research. Amsterdam: Rodopi. 203-210.
Sinclair, J. (1991), Shared knowledge, in J. E. Alatis (ed), Linguistics and
language pedagogy. The state of the art. Georgetown University Round
Table on Languages and Linguistics: Georgetown University Press,
Washington D.C. 489-500.
Sinclair, J. (2001), Review of Biber et al., The Longman grammar of spoken and
written English, in IJCL 6 (2): 339-359.
Sinclair, J., S. Jones and R. Daley (1970), English lexical studies. Report to OSTI
on Project C/LP/08. Revised edition forthcoming 2003: English
Collocation Studies, ed. by R. Krishnamurthy, Introduction by W. Teubert.
Birmingham: Birmingham University Press.
Sinclair, J., J. Payne and C. Hernandez (eds) (1996), Corpus to corpus: A study
of translation equivalence. International Journal of Lexicography 9 (3)
(Special Issue): 172-196.
Tognini Bonelli, E. (2001), Corpus linguistics at work. Amsterdam and
Philadelphia: John Benjamins.
Wynne, M. (ed) (forthcoming), Developing linguistic corpora: a guide to good
practice (provisional title). Web and Print versions. Oxford Text Archive.
Recent grammatical change in English: data, description,
theory
Geoffrey Leech
Lancaster University
Abstract
This chapter begins by considering the contrast between the data-driven
paradigm characteristic of corpus linguistics and the theory-oriented paradigm
characteristic of some other schools of linguistics, particularly those espousing a
generative framework. To illustrate the corpus linguistics paradigm in detail, I
present a case study of grammatical differences observed in the LOB and FLOB
corpora and also other corpora of the early 1960s and the early 1990s. By
abductive or inductive inference from the observed data, (fallible) descriptive
generalizations can be made, and tentative conclusions of theoretical interest can
be drawn. In conclusion, I argue that corpus linguistics is not purely
observational or descriptive in its goals, but also has theoretical implications.
However, like a theory-driven inquiry in the classic formulation of Popper's
hypothetico-deductive method (1972: 297), a corpus linguistic investigation can
only lay claim to provisional truths, and therefore requires confirmation or
refutation by further research findings.
Table 1. Summary of the contents of this article
A. Metatheoretical preamble
B. Case study:
   Recent grammatical changes in (mainly) written (mainly) British
   English, viz. frequency changes between 1961 and 1991-2:
   (a) modal auxiliaries and semi-modals
   (b) other grammatical categories relating to colloquialization
C. Conclusions
1. Introduction
In the 1960s, one of the widely-accepted fundamentals of linguistics was to be
found in Chomsky's hierarchy of three levels of adequacy (1964: 62-3):
(1) Explanatory adequacy is achieved when the associated linguistic theory
    provides a general basis for selecting a grammar that achieves [descriptive]
    adequacy over others that do not.
    Descriptive adequacy is achieved when the grammar gives a correct account
    of the linguistic intuition of the native speaker, and presents the observed
    data (in particular) in terms of significant generalisations that express the
    underlying regularities of the language.
    Observational adequacy is achieved if the grammar presents the observed
    data correctly.
One of the implications of this formulation was a downgrading of the importance
of empirical observation: as Chomsky himself pointed out, observational adequacy
could be achieved by a mere listing of the data. Another implication, as I saw it,
was a confusion between two notions of intuition: Chomsky's concept of
descriptive adequacy confused the knowledge of the language of a native speaker
with the analytic knowledge or expertise of the linguistic scientist, able to make
significant generalizations about the language. In Leech (1968) I argued this case,
and suggested a different hierarchy of three levels, which would be a more
realistic account of the main strata of investigation in linguistics:
(2) Theory: formal [and functional] characterization or explanation of language
    as a phenomenon of the human mind and of society.
    Description: formal [and functional] characterization of a given language, in
    terms of the theory.
    Data collection: collection of observations which a description, and
    ultimately a theory, has to account for [e.g. corpora].
Since that time, the more empiricist and more rationalist trends in linguistics have
diverged so far as to be almost irreconcilable. However, I still find the
formulation in (2) useful, although I would now prefer to insert the words in
square brackets, '[and functional]', showing my preference for a combination of
formal and functional explanation which corpus linguistics is characteristically
attracted to. The other words in brackets, '[e.g. corpora]', are of course a
reminder that corpus linguistics finds its raison d'être at the observational or
data-collection stratum of these three, the one that Chomsky found to be of such
little importance. However, my overarching goal in the present chapter is to
explore the relation between these three interrelated levels, and to argue against
the common assumption that corpus linguistics is concerned with mere data
collection or mere description.
2. A case-study: recent changes in English grammar
Alongside this, I also have a more practical goal, which is to exhibit as a case
study a particular area of linguistic description: recent quantitative change in
English grammar, as observed through the comparison of the LOB and FLOB
corpora. Although the main study has been focused on the LOB and FLOB
corpora, and therefore on written British English, it has been supplemented where
practicable by work on other corpora permitting a similar comparison between
English in the early 1960s and in the early 1990s. I will use this case study as a
means of illustrating the relation between the three levels of theory, description
and data collection (or, to put them in the order which would more naturally
occur to a corpus linguist: data collection, description and theory).
2.1 Data collection: using the LOB, FLOB, and other corpora
To begin with the level of observation: we began with a study of the two
matching corpora LOB and FLOB, which had already been part-of-speech
tagged, through the combined processing of two taggers: CLAWS4 and Template
Tagger (see Smith 1997 on the tagging techniques).1 By using the powerful
annotation-aware search and retrieval tool Xkwic (Christ 1994), we found it
possible to extract occurrences of a whole range of grammatical categories that
have been suspected, with varying degrees of empirical backing, to have become
more frequent or less frequent in the recent past. The main areas of grammar we
focus on in this chapter are (a) the modal auxiliaries, together with the mixed
array of verbal constructions conveniently termed semi-modals, and (b) a range
of grammatical phenomena associated with a suspected trend of
colloquialization.2
Although we began with the LOB and FLOB corpora, we extended our
study to a selective use of some other comparable corpora spanning
approximately the same period of 30 years, as shown in Table 2.
The family of four matching corpora, Brown, LOB, Frown and FLOB
(henceforward termed the Brown family), is well placed to provide evidence of
frequency changes in British and American English over the period between 1961
and 1991-2. Unfortunately no comparable corpora for spoken English exist, but
we were reluctant to confine our attention to written (printed) language,
especially considering that much grammatical innovation is likely to originate in
the spoken language. With the permission and help of Bas Aarts and Gerry
Nelson at University College London, we were able to identify small comparable
spoken subsets from two other million-word corpora developed at UCL with data
from around the early 1960s and the early 1990s.3 These were the corpus of the
Survey of English Usage (SEU), of which a large spoken part was computerized
and distributed as the London-Lund Corpus, and the International Corpus of
English (the British variant known as ICE-GB). Because of difficulties of
matching samples, the spoken mini-corpora from SEU and ICE-GB were even
smaller, indeed much smaller, and were moreover less closely matched than the
64 Geoffrey Leech
Table 2. The corpora of English used in the study

Name of corpus    American or       Date of data   Spoken or
                  British English   collected      written
LOB Corpus        BrE               1961           Written
Brown Corpus      AmE               1961           Written
FLOB Corpus       BrE               1991           Written
Frown Corpus      AmE               1992           Written
SEU-mini-sp       BrE               1959-1965      Spoken
ICE-GB-mini-sp    BrE               1990-1992      Spoken

Corpus size and design:
LOB, Brown, FLOB, Frown: Each corpus contains approx. a million words, in 500
text samples from 15 different genres. The four corpora are built according to
the same design and sampling method.
SEU-mini-sp, ICE-GB-mini-sp: Each (sub)corpus contains approx. 80,000 words
from a comparable and balanced range of spoken genres.
Brown family of corpora. One difficulty was that, although the SEU corpus had
been collected over a period of about 30 years, comparability with LOB and
Brown dictated that we reject any material not contemporaneous with the
written corpora, a constraint we interpreted rather liberally to exclude any
material outside the time frame 1959-1965. Another problem was that the SEU
corpus was subdivided into texts of 5000 words each, whereas the ICE-GB texts
were of 2000 words each. Hence a one-by-one matching of texts between the two
spoken mini-corpora was not feasible, and partial and overlapping matchings had
to be allowed.
Because of these drawbacks, particularly the restriction of the mini-corpora
of speech to a mere 80,000 words each, our findings from the spoken corpora
could only be seen as highly tentative indicators of what was happening to spoken
English over this period. Nevertheless, we felt that such a study, however
inadequate and provisional, would be preferable to a survey of recent
grammatical change which took no account of the spoken language. In fact,
differences observed between the mini-corpora in the frequency of modals and
semi-modals were tantalizingly even greater than those observed between LOB
and FLOB. A summary of the contents of the two spoken mini-corpora is given in
Table 3.
The sophisticated ICECUP software available for searching the ICE-GB
could not be used with SEU-mini-sp, and so to ensure comparability we decided
to use the WordSmith retrieval package and XKwic for both mini-corpora.
Table 3. Mini-corpora for studying language change in recent British spoken English

Name of corpus:    Survey of English Usage          International Corpus of English
                   spoken Mini-corpus               (Great Britain) spoken Mini-corpus
Abbreviation:      SEU-mini-sp                      ICE-GB-mini-sp
Period of texts:   1959-1965                        1990-1992
Size:              80,000 words each
Texts from these   (in each corpus:) conversation, broadcast discussions, sports
categories:        commentaries, other commentaries, broadcast news, broadcast talks
This section of the chapter has been called "Data collection", and under this
heading we can bring together the basic evidence-providing tools of the corpus
linguist's stock in trade. Obviously, these include the corpora used for this
particular study, and the software used to extract the relevant grammatical
phenomena: in this case, the search and retrieval tools XKwic and WordSmith.
Basic retrieval products such as concordances and frequency lists, especially
when they incorporate the results of simple grammatical analysis such as POS
tagging, might be considered to take us beyond mere data collection, and to bring
us to the threshold of the descriptive level of analysis. However, the scale of
abstraction represented by the three levels of data collection, description, and
theory is best assumed to consist of many small steps, rather than three giant
strides. I return to the matter of data collection versus description in section 2.2 below.
Although so far my presentation of the three levels has worked from the
bottom up, this is of course by no means inevitable in the methodology of corpus
linguists. Some studies are problem-driven, where the need to investigate a
particular theoretical or descriptive hypothesis may determine the collection or
selection of a suitable corpus, and the selection of particular corpus data to be
studied. But in the present case, the bottom-up methodology prevailed. We did
not start with a particular theoretical claim (say about the process of historical
change) or a particular descriptive hypothesis (say about the English modals),
although our study led to these. It was the existence of the LOB and FLOB
corpora, and the particular equivalence relation between them (found also
between Brown and Frown) which enticed us to follow the example already set
by Hundt, Mair and others, and to use these corpora to investigate recent changes
in grammar.4
2.2 Description: the modals and semi-modals
The descriptive level of linguistic investigation attempts to determine what can be
truly said about some aspect or level of the language, in this case English
grammar. On the face of it, an example of linguistic description is provided by
Table 4, showing changes in the frequency of modal auxiliaries over the 30-year
period as reflected by the paired corpora.5 However, at this stage, statements are
being made about a particular set of corpora, rather than about the language that
they exemplify. We could call this level of statement data description: an
intermediate step between data collection and linguistic description.
Table 4. Frequencies of modals in the four written corpora (including negative forms)

British English                            American English
         LOB     FLOB    Log     Diff %             Brown   Frown   Log     Diff %
                         likhd                                      likhd
would    3028    2694    20.4    -11.0     would    3053    2868    5.6     -6.1
will     2798    2723    1.2     -2.7      will     2702    2402    17.3    -11.1
can      1997    2041    0.4     +2.2      can      2193    2160    0.2     -1.5
could    1740    1782    2.4     +2.4      could    1776    1655    4.1     -6.8
may      1333    1101    22.8    -17.4     may      1298    878     81.1    -32.4
should   1301    1147    10.1    -11.8     should   910     787     8.8     -13.5
must     1147    814     57.7    -29.0     must     1018    668     72.8    -34.4
might    777     660     9.9     -15.1     might    665     635     0.7     -4.5
shall    355     200     44.3    -43.7     shall    267     150     33.1    -43.8
ought    104     58      13.4    -44.2     ought    70      49      3.7     -30.0
need     78      44      9.8     -43.6     need     40      35      0.3     -12.5
Total    14667   13272   73.6    -9.5      Total    13962   12287   68.0    -12.2
In this chapter we will be almost entirely concerned with description in
terms of relative frequency, or relative likelihood, of occurrence.6 Table 4 records
the frequency of each modal auxiliary of the canonical set of modals in each of
the Brown family of corpora. In the absence of other explanations (such as the
corpora being importantly different in other ways than in the dates of their
composition) we can tentatively conclude that these differences reflect different
states of the language: that between 1961 and 1991, the modals declined very
significantly in frequency in written English in both American and British usage.
(The overall percentage losses are 9.5% in BrE and 12.2% in AmE). The fifth
and tenth columns in Table 4 (Diff %) tell us how much the frequencies of the
modals have declined, as a percentage of the 1961 figures. The fourth and ninth
columns (Log likhd) provide a second measure of the degree of decline, this time
using the log likelihood ratio (G2) as a measure of significance (Dunning 1993).
In these columns, any score of 3.8 or over is calculated to be significant at the
chi-square level of p < 0.05, and any score of 6.6 or over is significant at the
level of p < 0.01. The larger the log likelihood ratio, the greater the significance.
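The two measures reported in Table 4 can be illustrated with a short calculation. The sketch below is not the procedure used in the study. It assumes the two-corpus form of the log likelihood statistic, with expected frequencies proportional to corpus size, and it assumes that each corpus contains exactly one million words (the corpora are only approximately that size), so its results may differ slightly from the printed figures. The function names are hypothetical.

```python
import math

# Critical values of chi-square with 1 degree of freedom.
CRITICAL_P05 = 3.84
CRITICAL_P01 = 6.63


def percent_change(freq_earlier, freq_later):
    """Change in frequency as a percentage of the earlier figure."""
    return (freq_later - freq_earlier) / freq_earlier * 100.0


def log_likelihood(freq_a, freq_b, size_a=1_000_000, size_b=1_000_000):
    """Two-corpus log likelihood (G2).

    Assumption: expected frequencies are the pooled frequency shared out
    in proportion to corpus size; the default sizes are illustrative.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# must in LOB (1147) and FLOB (814), figures taken from Table 4
change = percent_change(1147, 814)
g2 = log_likelihood(1147, 814)
print(f"must: {change:+.1f}%, G2 = {g2:.1f}, p < 0.01: {g2 >= CRITICAL_P01}")
```

Under these assumptions the result for must comes out close to, though not identical with, the values printed in Table 4; the small gap is to be expected when the true corpus sizes are not exactly equal.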
The individual modals show a decline varying between can (which actually
increases its frequency in FLOB, and declines only 1.5% in Frown) and shall
(which declines over 40% in both FLOB and Frown). In Table 4, the modals are
listed in order of frequency in LOB, and exactly the same order of frequency,
with the exception of should and must, applies to Brown. It will also be seen that
a roughly similar pattern of falling frequency is observed in both BrE and AmE
corpora. Broadly, the most frequent modals decline least, and the least frequent
modals decline most in percentage terms, the rare modals shall, ought (to) and
need (+ bare infinitive) having become much rarer. Some middle-order modals
(especially must and may) also show very significant falls in frequency.
The most interesting observation from Table 4, however, is that the overall
frequency of modals is highest in LOB and lowest in Frown, with FLOB and
Brown in intermediate positions. Alongside the decline between 1961 and 1991-
2, there is an equally important difference between AmE and BrE, which invites
interpretation as a time lag. It is as if BrE is following rather reluctantly in the
wake of a change in AmE, with something like a generation gap. This is shown
graphically (though not strictly to scale) in Figure 1.
Figure 1: British English following an apparent American English trend
[Diagram, not strictly to scale: the four corpora placed on a scale running from
"more frequent" to "less frequent" modal use. AmE: Brown (1961) 13,962 and
Frown (1992) 12,287. BrE: LOB (1961) 14,667 and FLOB (1991) 13,272.]
It might be proposed that the apparent decline in modal usage is due to the
rise, in recent centuries, of the so-called semi-modals, such as be going to and
have to, which are presumed to be still increasingly used. Perhaps these are
gradually encroaching on the territory of the canonical modals. Such a hypothesis
can be tested, up to a point, by noting the differences of frequency of semi-
modals in the four corpora, as shown in Table 5. Although the class of semi-
modals is not a well-defined set, those in Table 5 may be taken as fairly
representative.
Ostensibly, there is no strong connection between the patterns shown by the
modals and the semi-modals.7 Altogether, the semi-modals are very much less
frequent (in written English) than the modals, and their changes in frequency
show a mixed picture. Some of them seem to have increased their usage
massively in the period 1961-1991/2, but others have declined. One of the
differences at first glance lending credence to the encroachment hypothesis is that
AmE shows a greater increase in the semi-modals (+18.6%) in comparison with
BrE (+10.0%): a mirror image of what is happening with the modals.
Unexpectedly, however, the overall frequency of semi-modals is found to be
greater in the BrE than in the AmE corpora in both periods!
Table 5. Frequencies of some semi-modals in the four written corpora

BrE                LOB    FLOB   Log     Diff      AmE                Brown  Frown  Log     Diff
                                 likhd   (%)                                        likhd   (%)
BE going to*       248    245    0.0     -1.2      BE going to*       219    332    23.5    +51.6
BE to              454    376    7.6     -17.2     BE to              349    209    35.3    -40.1
(had) better       50     37     2.0     -26.0     (had) better       41     34     0.7     -17.1
(HAVE) got to*     41     27     2.9     -34.1     (HAVE) got to*     45     52     0.5     +15.6
HAVE to            757    825    2.7     +9.0      HAVE to            627    643    0.1     +1.1
NEED to            54     198    83.0    +249.1    NEED to            69     154    33.3    +123.2
BE supposed to     22     47     9.2     +113.6    BE supposed to     48     51     0.1     +6.3
used to            86     97     0.6     +12.8     used to            51     74     4.3     +45.1
WANT to*           357    423    5.4     +18.5     WANT to*           323    552    60.9    +70.9
TOTAL              2069   2275   9.2     +10.0     TOTAL              1772   2101   28.4    +18.6

*Forms spelt gonna, gotta and wanna are counted under be going to, have got to, and want to
respectively
Table 6. Comparison of SEU-mini-sp and ICE-GB-mini-sp: modals in spoken BrE (provisional figures)

          SEU-mini-sp     ICE-GB-mini-sp   Log likhd   Difference (%)
would     415 (5188)      271 (3388)       30.5        -34.7
will      248 (3100)      307 (3838)       6.3         +23.8
can       252 (3150)      295 (3688)       3.4         +17.1
could     145 (1813)      83 (1038)        17.1        -42.8
may       86 (1075)       36 (450)         17.5        -54.1
should    100 (1250)      84 (1050)        1.6         -17.3
must      87 (1088)       35 (438)         24.3        -60.7
might     56 (700)        50 (625)         0.3         -10.7
shall     26 (325)        17 (213)         1.9         -34.6
ought     20 (250)        9 (113)          4.3         -55.0
need      0 (0)           0 (0)            0.0         0.0
Total     1435 (17938)    1187 (14838)     23.5        -17.3

Note: The figures in parentheses show frequency per million words, and are therefore comparable to
the figures for the written corpora given in Table 4.
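The per-million figures in parentheses follow from scaling each raw count by the size of the mini-corpus. The sketch below assumes that each mini-corpus contains exactly 80,000 words (the chapter gives this as an approximate size) and that halves are rounded up; the function name is hypothetical.

```python
import math

MINI_CORPUS_WORDS = 80_000  # assumed exact; the chapter says "approx. 80,000"


def per_million(raw_count, corpus_words=MINI_CORPUS_WORDS):
    """Scale a raw count to a frequency per million words, rounding halves up."""
    scaled = raw_count * 1_000_000 / corpus_words
    # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even
    return math.floor(scaled + 0.5)


# Raw counts (SEU-mini-sp, ICE-GB-mini-sp) taken from Table 6
for modal, seu, ice in [("would", 415, 271), ("could", 145, 83), ("must", 87, 35)]:
    print(modal, per_million(seu), per_million(ice))
```

Scaling in this way is what makes the spoken figures comparable with the written corpora of roughly a million words each.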
At this point, it is an attractive idea to look at the patterns of change
observable in the spoken mini-corpora, small as they are. Surely the innovatively
increasing use of semi-modals, and perhaps the corresponding fall in modals, are
likely to show up far more in the spoken language than in the written. The
differences in frequency (in spoken BrE only) between SEU-mini and ICE-GB-
mini are shown in Table 6.
In general, the patterns of frequency shown in Table 6 suggest that trends in
spoken English are similar to those in written English, but somewhat more
exaggerated. The modals are more frequent in the spoken than in the written
corpora, for both periods, but the decline in frequency is also greater: a loss of
17.3%. May and must are particularly heavy losers, whereas will and can, in
contrast, show a surprising increase from the earlier to the later corpus. This
picture may be contrasted with the apparent considerable increase in semi-modals
in spoken English between the early sixties and the early nineties, as observed in
the spoken corpora, and as shown in Table 7.
Table 7. Comparison of SEU-mini and ICE-GB-mini: some semi-modals in spoken BrE

                  SEU-mini-sp   ICE-GB-mini-sp   Log likhd   Difference (%)
(BE) going to     88            120              4.9         +36.4
BE to             5             10               1.7         +100.0
(HAVE) got to     35            26               1.3         -25.7
HAVE to           79            104              3.4         +31.6
NEED to           2             15               11.3        +650.0
BE supposed to    8             12               0.8         +50.0
Total             217           287              9.8         +32.3
These numbers, of course, are ridiculously small: only three can be
counted as significant in log likelihood terms. However, overall they suggest, as
many would suspect, that the general increase of semi-modals is even greater in
spoken than in written English.
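The count of significant rows can be checked directly against the log likelihood values printed in Table 7. The sketch below simply applies the 3.8 threshold for p < 0.05 mentioned earlier in the chapter; the variable names are hypothetical.

```python
CRITICAL_P05 = 3.8  # threshold for p < 0.05 as given in the chapter

# Log likelihood values as printed in Table 7
table7_log_likelihood = {
    "(BE) going to": 4.9,
    "BE to": 1.7,
    "(HAVE) got to": 1.3,
    "HAVE to": 3.4,
    "NEED to": 11.3,
    "BE supposed to": 0.8,
    "Total": 9.8,
}

# Keep only the rows whose score reaches the threshold
significant = [
    name for name, score in table7_log_likelihood.items() if score >= CRITICAL_P05
]
print(significant)
```

Only (BE) going to, NEED to and the overall total reach the threshold, which matches the count of three given in the text.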
2.3 Descriptive conclusions and further discussion on modals and semi-modals
The following overall findings can be presented by way of summary of the
preceding section on modals and semi-modals. On the basis of the evidence from
the corpora:
(i) In general terms, there is clearly an appreciable decline of frequency in the
use of modal auxiliaries between 1961 and 1991-2.
(ii) During this period, individual modals have been declining at different rates,
but there is a tendency for very common modals to hold their own (e.g. will,
can), and for infrequent modals (e.g. shall, ought to, need) to decline
sharply and to appear almost moribund. Some middle-ranking modals (e.g.
may and must) have also declined sharply.
(iii) Alongside the decline of modals, there is no clear overall picture regarding
semi-modals: although in general, semi-modal usage is increasing, some
semi-modals are declining, and semi-modals as a whole are much less
frequent than true modals.
If we ignore the introductory phrase above ("On the basis of the evidence from the
corpora"), these statements are descriptive: they claim to tell us something that is
true about the language, English. But rather than accept them uncritically, we
have to bear in mind some hazardous assumptions which can be made in moving
from data description to language description:
Hazardous Assumptions: from Data Description to Language Description
1. That the corpora are large enough and varied/balanced enough to allow us to
extrapolate from corpus findings to what is happening in (relevant varieties of) the
language in general.
2. That the corpora are sufficiently comparable in terms of samples of the varieties
represented, and in using the same sampling methods.
3. That statistically significant results can be attributed to real linguistic differences,
rather than to extraneous factors such as cultural shifts or faulty sampling.
4. That the grammatical categories are defined and used in a way that other
grammarians or linguists find reasonable.
5. That the extraction of data from the corpora has been acceptably (if not totally) free
from error.
The first of these assumptions, the well-known issue of representativeness, is
perhaps the biggest hazard. In the absence of any practical, general measure of
representativeness,9 the statements (i)-(iii) must be regarded as hypotheses, well
evidenced, it is true, but needing to be supported by further corpus studies as and
when opportunities arise. The second assumption underlies the whole enterprise
of comparing the Brown family of corpora. The third raises the thorny question of
how to relate statistical significance to certain causative factors. For example, we
might attempt to explain changes in the direction of spoken style as part of a
general socially-driven trend of colloquialization (see sections 2.4 and 3) when it
is possible that these changes can be more directly explained by an increase in the
amount of quoted speech included in the 1991-2 corpora (see below). The fourth
hazardous assumption reminds us that linguistic categories, even consensual
ones like modal auxiliary verb, are not God's truth but capable of being
challenged. The fifth has already been discussed in Note 4.
In my view, none of these hazards justifies a response of extreme scepticism
which says "if one cannot prove the truth of these descriptions, one should not
make them at all". Rather, they lead to the recognition that such results should be
regarded as provisional and that there is a need to seek further corroborating
evidence as well as means of increasing accuracy and reliability. This striving
for perfection can be a slow, gradual and time-consuming process, which might
include further manual checking or even collecting and analysing fresh corpora.
It is, though, reassuring to bear in mind that even if an objection is raised to
a hazardous assumption, this often fails to undermine the results in more than a
minor way. For example, the discovery of occasional errors or differences of
categorization in identifying modals is unlikely to cause more than a minor
change in the frequency counts in Table 6, and hence in the statistical significance
of the results. Thus if someone insists that ought to is not a modal but a semi-
modal, this would change the overall findings only marginally. Or to take another
example, on checking the examples of may in Frown, I found two examples of
non-modal may lurking in the database, thus reducing the count of the modal may
from 878 to 876: this makes almost no difference to the significance of the
decline, and in fact increases it.
Returning to the colloquialization trend mentioned above (and to be taken
up again in section 3 below), the claim that this phenomenon is an illusion
because of an increase in quoted speech in the later corpora can be checked by
actually measuring the quoted material in LOB and FLOB. This has been
done by Nick Smith, and shows that there is an increase of c.
9.5% in the incidence of quoted material in FLOB as compared with LOB.10
However, this could account only in part for most of the changes that might be
attributed to colloquialization (see Table 8 below), so there still remains
something linguistically interesting to be explained here.
To counterbalance the hazardous assumptions, observations such as the
following can have a compensating effect in increasing the plausibility if not
authority of the results, and suggesting that they are not just a matter of chance or
accident:
(a) Many results are highly significant as measured by log likelihood ratio.
(b) Trends are consistent across different items: e.g. the general frequency
decline of the modals is replicated in almost every single modal auxiliary.
(c) Trends are often consistent across different subcorpora: e.g. if we
subdivide each of the Brown family into genre categories Press (A-C),
General Prose (D-H), Learned (J), and Fiction (K-R), often similar trends
are observed in all these four subcorpora. An instance of this is the decline
of the passive from LOB to FLOB (see Table 8). The passive is less
frequent in FLOB as a whole by 12.4%, a trend repeated in a similar way
for each subcorpus (Press 12.5%; Gen Prose 12.4%; Learned 16.6%;
Fiction 3.6%).
I find it useful to use an analogy of scaffolding in the confirmation and extension
of descriptive findings. If we think of the corpus-based methodology as the
constructing of a building by erection of scaffolding, the superstructure of
description of a language can be supported in three ways:
(i) Data observation: from below, struts or buttresses can be used to
strengthen the grounding of data description (e.g. seeking confirmation from new
data).
(ii) Description: at the same descriptive level, findings can be extended and
deepened. For example, we can probe into the crude frequency changes of modals
in Table 4 by analysing subcorpora as already noted in (c) above, or by
undertaking a semantic analysis of examples. This we did for may, must and
should, and noted a trend in may and should towards monosemy, viz. the
dominant senses of may (epistemic) and should (deontic) increased their
dominance in spite of loss of frequency (see Leech, 2003). Must, on the other
hand, showed a decline of both its major senses, the epistemic and deontic
meanings. Such further descriptive investigations help to pinpoint what is
happening more precisely, in terms of how and where the modals are becoming
less used.
(iii) Theory: pointing up to the theoretical level, further descriptive
investigations, for example by taking contextual factors into account, can help to
identify appropriate theoretical explanations as to why the modals are declining.
This is where broad explanatory concepts such as colloquialization come into
play, and help to direct investigation into particular channels.
2.4 Continuing the case study: grammatical changes relating to colloquialization
Taking further the descriptive study of the LOB and FLOB corpora, we now turn
to a wider-ranging set of grammatical categories, mostly belonging either to the
verb phrase or to the noun phrase. What brings all these categories together is that
they can all be associated with a trend towards colloquialization, that is a
tendency for the written language gradually to acquire norms and characteristics
associated with the spoken conversational language. Quantitatively,
colloquialization can be shown in two ways: (a) by an increasing frequency of
phenomena associated with spoken language, and (b) by a decreasing frequency
of phenomena associated with the written language. Type (a) changes
predominate in Table 8 below, but Type (b) changes are also seen, in the
decreasing frequency of the passive, of the of-construction, and of the relative
pied-piping construction.
Table 8. Changes apparently indicative of colloquialization (tokens per million words)

                                              LOB     FLOB    Log lkhd   Difference (%)
Categories within the verb phrase
a. Present progressive (active)               980     1263    36.0       +28.9
b. Progressive passive                        198     260     8.4        +31.3
c. Verb contractions (e.g. it's)              3126    3867    79.1       +23.7
d. Negative contractions (-n't)               1940    2462    62.6       +26.9
e. Passive forms (all)                        13260   11614   109.8      -12.4
Miscellaneous colloquialization features outside the verb phrase
f. Questions (all)                            2572    2816    11.1       +9.5
g. Verbless questions                         310     424     17.7       +36.6
h. Tag questions                              63      65      0.1        +4.5
j. Genitives                                  4935    6122    128.5      +24.1
k. Of-phrases                                 33715   32139   37.9       -4.7
l. Of-phrases competing with the genitive     124     95      3.9        -23.6
   (2% sample only)
Relative clauses
m. Wh-relative pronouns                       6971    6376    26.7       -8.5
n. Zero relative with stranding (sample)      18      73      36.4       +310.0
p. Pied-piping relatives                      1394    1158    21.9       -16.9
Of the categories within the verb phrase, the first four (a.-d.) all show very
convincing increases between LOB and FLOB. Previous corpus studies (e.g.
Biber et al. 1999: 461-463) have shown the progressive to be more common in
conversation than in written genres, and this is a justification for treating
colloquialization as a possible explanation for a. and b. (However, the growing
use of the progressive aspect can also be linked with grammaticalization, going
back over 500 years.) The passive (e.), on the other hand, is strongly associated
with the written medium (see for example Biber et al. 1999: 476-477), and so its
decline in frequency can count as a negative manifestation of colloquialization.
The next set of categories in Table 8 (f.-h.) is more mixed. In fact f. and h.
(questions) should arguably be excluded from the list of colloquialization
phenomena, as the increase of quoted speech in FLOB compared with LOB (see
Note 10) provides a readier explanation for the increasing occurrence of questions
(+9.5%) and tag questions (+4.5%).
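The comparison with quoted speech can be made concrete. The sketch below rests on a crude assumption that is not made in the chapter: that a feature confined to quoted speech would rise at about the same rate as the quoted material itself (c. 9.5%), so that any rise above that rate cannot be wholly explained by quotation. The variable names are hypothetical, and the percentages are those printed in Table 8.

```python
QUOTED_SPEECH_INCREASE = 9.5  # % increase in quoted material, FLOB vs LOB

# Percentage increases printed in Table 8 (rising features only)
table8_increases = {
    "Present progressive (active)": 28.9,
    "Progressive passive": 31.3,
    "Verb contractions": 23.7,
    "Negative contractions": 26.9,
    "Questions (all)": 9.5,
    "Verbless questions": 36.6,
    "Tag questions": 4.5,
    "Genitives": 24.1,
    "Zero relative with stranding": 310.0,
}

# Assumed benchmark: a rise no larger than the rise in quoted speech
within_benchmark = [
    name for name, pct in table8_increases.items() if pct <= QUOTED_SPEECH_INCREASE
]
beyond_benchmark = [
    name for name, pct in table8_increases.items() if pct > QUOTED_SPEECH_INCREASE
]
print("Within the quoted-speech benchmark:", within_benchmark)
print("Beyond it:", beyond_benchmark)
```

On this rough benchmark only questions and tag questions fall within the rise in quoted material, which is in line with the suggestion that f. and h. be set aside.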
We have begun to investigate two further colloquialization themes in the
noun phrase (see j.-p. in Table 8): the 's-genitive vs. the of-phrase; and zero or
that-relative clauses vs. wh- relative clauses. Results so far point in the direction
of (a) a rise in the genitive with a corresponding decline in of-phrases; and (b) a
rise in zero relative clauses ending with a stranded preposition and a
corresponding decline in wh- relative clauses. The rise in stranding accords with
an unsurprising and significant fall in the use of pied-piping constructions in
which the wh-relative pronoun is preceded by a preposition (in which, of whom,
etc.).
Summary of descriptive conclusions relating to colloquialization

(a) The use of the present progressive construction has increased overall by c.
30% between LOB and FLOB. This seems part and parcel of the spread
of the progressive aspect usage over the past 500 years.
(b) In practice, this increase has been chiefly in the present progressive: the
past progressive has actually shown a slight decline.
(c) As part of a general colloquialization trend, the use of negative and verb
contractions has increased by approximately a quarter (25%). Part of this,
though, can be attributed to the increase in the proportion of quoted
speech in the written corpora.
(d) Conversely there has been an appreciable decline in the use of the passive,
a verbal category strongly associated with formal written language.
(e) The written corpora show an increase of 9.5% in the use of questions.
(f) This actually increases to approximately 36.6% if we confine our
attention to questions which lack a finite verb: this fragmentary
interrogative type is particularly strongly associated with conversational
English (see Biber et al. 1999: 211-212). Tag questions, on the other
hand, have not increased much. Perhaps this is because they are
essentially dialogic in a way that other questions are not. (In Biber et al.
ibid, tag questions are shown to be of particularly low frequency in the
written language.)
(g) In the noun phrase, historically, the competition between 's genitives and
of-constructions has been interpreted as a competition between more and
less oral styles of expression.11 Genitives have increased by about 25%
from LOB to FLOB, whereas of-phrases have declined by about 5%.
However, if we confine our attention to of-phrases which could be
replaced semantically by genitives, the decline of the of-construction
(based on a 2% sample) goes up to 24%. This intriguing provisional
result, which almost exactly balances the gain in the genitive, needs
further corroboration with a larger sample.
(h) There is a general tendency for wh-relative clauses to decline. This
applies not only to whom but also to who, whose, and which. The decline
is, not unexpectedly, magnified if we confine our attention to pied-piping
relatives (beginning with a preposition, e.g. of which, to whom).
(j) Conversely, there appears to have been an increase in the use of zero
relatives, i.e. relative clauses with a zero relativizer (the book I read)
especially when combined with a stranded final preposition (someone I
spoke to). This is a provisional finding based on a small sample, and again
needs further research.
As a conclusion to the descriptive sections of this chapter, I reiterate two caveats
already mentioned. First, the results presented are provisional (particularly those
based on a small sample, such as (j) above) since the research presented here is
still work in progress. (In fact I have gone so far as to suggest that it is in the
nature of corpus research to be provisional.) Second, the hazardous assumptions
listed in section 2.3 have to be kept in mind throughout, and opportunities found
to probe them further. I have yielded above to the temptation to talk in terms of
the language change between LOB and FLOB: a kind of dynamic metaphor used
to explain what are actually sets of synchronic observations about a 1961 corpus
and a 1991 corpus. But the claims that these observations represent changes in the
(use of the) language ultimately remain hypotheses, in need of further probing
and confirmation.
3. Back to theory: conclusions
There is a great deal more to be done in terms of short-term diachronic
investigation of the Brown family of corpora. Once the gross frequency changes
have been plotted, the next step is to investigate factors internal to the corpora
that might help to explain these changes (e.g. differential results in different
subsections of the corpus). Much more research also needs to be done (and some
is being done) on the changing frequency of semantic categories such as
epistemic modals and pragmatic uses of the progressive. We are also making
further comparisons between the British corpora and their American counterparts
Brown and Frown. And of course, there is room for much more work on spoken
language: the spoken mini-corpora used for this study are likely to reflect more
fascinating indications of language change, but are obviously of inadequate size.
Explaining the changes in a deeper sense means finding historical reasons:
investigating both language-internal and language-external (especially socially
motivated) explanations of why these changes of frequency are taking place. In
part the changes noted (e.g. in the increase of semi-modal use) may be related
to well-known grammaticalization processes:
Grammaticalization: the process whereby lexical items and constructions
come in certain linguistic contexts to serve grammatical functions, and,
once grammaticalized, continue to develop new grammatical functions.
(Hopper and Traugott 1993: xv)
This is a linguistically-oriented explanation, invoking a whole theory of language
change, applicable particularly to the growth of the semi-modals and the
progressive aspect. But frequency studies such as the present one are less
concerned with linguistic innovation than with diffusion and attenuation of
aspects of language use, and invite social explanations in terms of such trends as:
Colloquialization: a tendency for features of the conversational spoken
language to infiltrate and spread in the written language.
Democratization: speakers' and writers' tendency to avoid unequal and
face-threatening modes of interaction (this may account in part for the decline of
deontic must and the rise of deontic should, have to and need to). For this
kind of explanation in the realm of modality, see Myhill (1995).
Americanization: the influence of North American habits of expression and
behaviour on the UK (and other nations). This shows up apparently in the
loss of frequency of the modals, as depicted in Figure 1.12
However, these "-izations" manifest themselves patchily. For example, in contrast
to the Americanization effect noted with the decline of modals, the growth of the
present progressive shows very little difference between AmE (in the Brown and
Frown corpora) and BrE, as demonstrated in Table 9.
Table 9. Comparison of increase of present progressive in LOB-FLOB and Brown-Frown (active only)

           1961 corpora    1991-2 corpora   Log likelihood   Difference
British    980 (LOB)       1263 (FLOB)      36.0             +28.9%
American   996 (Brown)     1316 (Frown)     43.6             +31.8%
So Americanization can be only tentatively invoked here, although it might be
applied to other changes touched on earlier, such as the decline of the relative
pronoun which. Another example of patchiness is the virtual stasis of the get-
passive in LOB and FLOB (101 instances in LOB; 104 in FLOB): this obviously
colloquial construction does not seem to follow the pattern observed elsewhere.
One explanation for the selectivity of these "-ization" trends is that the trends
can be in conflict with one another. What happens, for example, to a formal
(uncolloquial) construction characteristic of AmE? Does it increase in BrE
because of American influence, or does it decline in BrE because of its negative
association with colloquialization? An apparent example of this kind of conflict is
the mandative subjunctive as in:
the Secretary of Labor requires that he be willing to risk his reputation
(Example from the Brown Corpus)
a construction which (in a study by Serpollet 2001) increases from 14 in LOB
to 33 in FLOB, while in AmE it is far more frequent, though declining: 91 in
Brown and 78 in Frown. What seems to happen here is that in BrE, the
Americanism of the construction outweighs its non-colloquialism. But different
kinds of explanations might be applicable to other cases.
As we move from the level of description to that of explanation, it is
appropriate to ask what kind or kinds of theory would be best able to explain the
descriptive findings of corpus linguistics. Terms like 'colloquialization' do
represent some rather general attempt to explain change, but they do not amount
to well-developed theories. As for grammaticalization, Croft (2000), like Krug
(2000), is one of those who see grammaticalization taking place within a
usage-based, communication-based, utterance-oriented theory of language change.
Croft emphasises the important diachronic collaboration between innovation or
actuation (the creation of novel forms of language) and propagation or diffusion
(the way the use of these forms expands into more general language use). The
converse mechanisms of change, contraction and loss, also need to be given
fuller consideration: we need a theory to explain the decline of the modals as well
as the growth of the semi-modals.
In diachronic corpus comparisons we can observe the results of propagation
and contraction. (It is unlikely that we will find true grammatical innovation or
that we would recognize it as such in a corpus even if we came across it.) This
means that we need explanations which take full account of socio-cultural factors
inducing language change. Croft argues (2000: 166) that the basic mechanism for
propagation is the speaker's self-identification with a social group, and he cites in
this connection a maxim put forward by Keller (1990/1994), 'Talk like others
talk'. Here, the social-psychological theory of accommodation as a linguistic
process comes into play.
This seems to place propagation of change firmly in the sphere of
sociolinguistics, but it might be pointed out that the Brown family of corpora are
not sociolinguistically sensitive in the normal sense: by definition, they contain
published, i.e. public, language. So where does this leave the explanation of
increase and decrease of frequency in the LOB and FLOB corpora? It is
reasonable to suggest that the spread or shrinkage of linguistic usage in recent
modern society has been influenced considerably by language use in the public
media. So it can be helpful to complement the sociolinguistic perspective by
perspectives oriented towards mass communication.
Table 10. Some principles of usage-based models of language (after Barlow and
Kemmer 2000)
1. The intimate relation between linguistic structures and instances of the
use of language.
2. The importance of frequency.
3. Comprehension and production are integral, rather than peripheral to the
language system.
4. Focus on the role of learning and experience in language acquisition.
5. Importance of usage data in theory construction and description.
6. The intimate relation between usage, synchronic variation, and diachronic
change.
7. The interconnectedness of the linguistic system with non-linguistic
cognitive systems.
8. The crucial role of context in the operation of the linguistic system.
For example, with reference to colloquialization, Fairclough (1992) discusses
the apparent democratization of discourse in present-day English-speaking
society. 'Conversational discourse', he goes on, 'has been, and is being,
projected from its primary domain into the public sphere' (p. 98). Social theories
focusing on public discourse, like Fairclough's, here provide a valuable
supplement to the more established frameworks of historical linguistics and
sociolinguistics. But there would be much benefit in investing in the support such
theories may gain from the empirical findings of corpus research.
To conclude, I return to the opening theme of metatheory. Although I have
not gone far towards suggesting theoretical solutions, I have worked my way
around to suggesting the kind of theoretical approach that is better suited to
corpus linguistics than is the Chomskyan paradigm. Corpus linguistics finds a
good ally in the usage-based frameworks championed by Barlow and Kemmer
(2000: viii-xxii), who, among other principles of this approach, list those in Table
10.
The usage-based conception of linguistics is not a monolithic theory, or a
single school of thought, but is more like a confederation of linguists with similar
goals, priorities and methods. Their tenets are the opposite of the generative
paradigm in nearly every respect. Corpus linguistics finds a natural place in this
body of linguists who believe that there is not a gulf, but on the contrary a natural
bridge, between the study of naturally-occurring data and the cognitive and social
workings of language.
Notes
1. In this chapter, 'we' refers to Nicholas Smith and myself. I am grateful to Nick
for much of the corpus processing, quantitative and analytic work that resulted
in the findings reported here, as well as for discussion of broader issues and
specific comments on this chapter. The project on Recent Grammatical
Change in British English was supported by a research grant from the Arts
and Humanities Research Board (UK) and a British Academy Larger
Research Grant. In this research, we have benefited from collaboration with
Christian Mair and Marianne Hundt at Freiburg University, to whom we owe
support and inspiration, as well as the more practical benefit of the post-
editing of most of the automatically-tagged FLOB corpus.
2. The 'colloquialization' tendency for written style to drift towards more 'oral'
styles over time, for some genres between the 17th and the 20th centuries, is
demonstrated statistically by Biber and Finegan (1989).
3. We are very grateful to Bas Aarts and Gerry Nelson for their help both in
allowing use of these corpora, and extracting the data for the mini-corpora.
4. There has been a growing range of publications on the comparison of the
LOB and FLOB corpora. Particularly relevant to the present study are Hundt
(1997) and Mair (1997).
5. The findings on the modals in this chapter are presented and discussed more
extensively in Leech (forthcoming 2003) and Smith (forthcoming 2003).
Some of the counts in the tables are slightly different from those in these cited
papers, owing to further research and further accuracy checks (see Note 6).
6. A caveat about frequency: most of the frequency figures in this study are very
close approximations rather than guaranteed 100% accurate. Both manual
procedures and automatic procedures can give rise to error, although the
incidence of error is likely to be totally insignificant. The one exception to this
is the margin of error arising from POS tagging (about 2% in the present
context). Although we were able to use the results of manual correction for
the LOB Corpus and most of the FLOB corpus, for the fictional genres (K-R)
of FLOB and for the Frown Corpus we had to rely on automatic tagging only.
A method of approximation was devised on the basis of comparing automatic
tagging and manual tagging outcomes in cases where they were both
available, and hence calculating an error coefficient for each tag. The
procedure is described in the Appendix to Mair et al. (2002).
7. However, the decline of must may have some connection with the increase in
use of have to and need to: see Smith (2003). In general, the varied
behaviour of the semi-modals in this corpus confirms the impression that they
comprise a miscellaneous category. In Quirk et al. (1985: 136-148), where it is
argued that they form a gradient between auxiliary and full verbs, four
intermediate categories are distinguished: marginal modals, modal idioms,
semi-auxiliaries, and catenative verbs.
8. However, the decline of must may have some connection with the increase in
use of have to and need to: see Smith (forthcoming 2003). In general, the
varied behaviour of the semi-modals in this corpus confirms the impression
that they comprise a miscellaneous category. In Quirk et al. (1985: 136-148),
where it is argued that they form a gradient between auxiliary and full verbs,
four intermediate categories are distinguished: marginal modals, modal
idioms, semi-auxiliaries, and catenative verbs.
9. On representativeness, Biber (1993) is the classic reference; but Biber's
position has also been criticised (e.g. by Váradi 2001). There is no test that
could be used to ensure that statements about the LOB and FLOB corpora are
representative of the varieties of English of which they are samples, except to
collect independent samples of data of the same text types: in effect, to
replicate the LOB and FLOB corpora but with different text samples.
10. Nick Smith has undertaken a count of quoted material in the LOB and FLOB
corpora, helped by a program written by Izumi Tanaka. He found that the
number of words within quotation marks in FLOB was c. 127,000, compared
with c. 116,000 words in LOB, an increase of c. 9.5%. This figure of +9.5%
is a reasonably close approximation, but needs to be followed up by further
checks and edits.
11. Actually genitives are not so frequent in conversation as in some varieties of
written English, especially news writing (see Biber et al. 1999: 302). This can
be largely explained by the fact that nouns are notably infrequent in the
spoken language: a construction which is rich in nouns (a description that
applies both to the genitive construction and the of-construction) is therefore
comparatively rare. However, if we consider the likelihood of choosing a
genitive as contrasted with a semantically equivalent of-phrase, the odds in
favour of the genitive are higher in spoken English than in a range of written
varieties (see Leech et al. 1997).
12. Colloquialization and Americanization are discussed, with reference to the
LOB and FLOB corpora, by Mair (1997, 1998). See also Hundt (1997).
References
Biber, D. (1993), Representativeness in corpus design, Literary and Linguistic
Computing 8: 243-257.
Biber, D. and E. Finegan (1989), Drift and the evolution of English style: a
history of three genres, Language 65.3: 487-517.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman
grammar of spoken and written English. London: Longman.
Barlow, M. and S. Kemmer (eds) (2000), Usage-based models of language.
Stanford: CSLI.
Christ, O. (1994), A modular and flexible architecture for an integrated corpus
query system, Proceedings of COMPLEX '94: 3rd Conference on
Computational Lexicography and Text Research (Budapest, July 7-10,
1994). Budapest, 23-32.
Croft, W. (2000), Explaining language change: an evolutionary approach.
London: Longman.
Chomsky, N. (1964), Current issues in linguistic theory, in: J.A. Fodor and
J.J. Katz, The structure of language. Englewood Cliffs, New Jersey, 50-118.
Dunning, T. (1993), Accurate methods for the statistics of surprise and
coincidence, Computational Linguistics 19.1: 61-74.
Fairclough, N. (1992), Discourse and social change. Cambridge: Polity Press.
Hopper, P. and E. Traugott (1993), Grammaticalization. Cambridge: Cambridge
University Press.
Hundt, M. (1997), Has BrE been catching up with AmE over the past 30 years?,
in: M. Ljung (ed.), Corpus-based studies in English: Papers from the 17th
International Conference on English Language Research on Computerized
Corpora (ICAME 17). Amsterdam, Rodopi, 135-151.
Keller, R. (1990/1994), On language change: the invisible hand in language.
London: Routledge. (Translation and expansion of Sprachwandel: von der
unsichtbaren Hand in der Sprache. Tübingen: Francke.)
Krug, M. (2000), Emerging English modals: A corpus-based study of
grammaticalization. Berlin & New York: Mouton de Gruyter.
Leech, G. (1968), Some assumptions in the metatheory of linguistics,
Linguistics 39: 87-102.
Leech, G. (2003), Modality on the move: the English modal auxiliaries
1961-1992, in: R. Facchinetti, M. Krug and F. R. Palmer (eds), Modality in
contemporary English. Berlin & New York: Mouton de Gruyter, 223-240.
Leech, G., B. Francis and X. Xu (1997), The odds in favour of the genitive: a
study of gradience in English, in: K. Yamanaka and T. Ohori, The locus
of meaning: Papers in honor of Yoshihiko Ikegami. Tokyo: Kuroshio, 187-
208.
Mair, C., M. Hundt, G. Leech and N. Smith (2002), Short term diachronic shifts
in part-of-speech frequencies: a comparison of the tagged LOB and FLOB
corpora, International Journal of Corpus Linguistics 7.2: 245-264.
Mair, C. (1997), Parallel corpora: a real-time approach to language change in
progress, in: M. Ljung (ed.), Corpus-based studies in English: Papers
from the 17th International Conference on English Language Research on
Computerized Corpora (ICAME 17). Amsterdam: Rodopi, 195-209.
Mair, C. (1998), Corpora and the study of the major varieties of English: issues
and results, in: H. Lindqvist et al. (eds), The major varieties of English.
Växjö: Växjö University Press, 139-157.
Myhill, J. (1995), Change and continuity in the functions of the American
English modals, Linguistics 33: 157-211.
Popper, K. (1972), Objective knowledge (revised edition). Oxford: Oxford
University Press.
Rayson, P., A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds) (2001),
Proceedings of the Corpus Linguistics 2001 Conference. Lancaster
University: UCREL Technical Papers 13.
Serpollet, N. (2001), The mandative subjunctive in British English seems to be
alive and kicking... Is this due to the influence of American English?, in:
Rayson et al. (2001), 531-542.
Smith, N. (1997), Improving a tagger, in: R. Garside, G. Leech and A. McEnery
(eds), Corpus annotation: Linguistic information from text corpora.
London: Longman, 137-150.
Smith, N. (2003), Changes in the modals and semi-modals of strong obligation
and epistemic necessity in recent British English, in: R. Facchinetti, M.
Krug and F. R. Palmer (eds), Modality in contemporary English. Berlin &
New York: Mouton de Gruyter, 241-266.
Váradi, T. (2001), The linguistic relevance of corpus linguistics, in: Rayson et
al. (2001), 587-593.
Corpus data in a usage-based cognitive grammar
Joybrato Mukherjee
University of Giessen
Abstract
The present paper is intended to bridge the long-established gap between
corpus-based research into actual language use on the one hand and cognitive models of
the abstract language system (in terms of speakers' competence) on the other.
For this purpose, a very useful, non-generative framework is provided by
Langacker's usage-based cognitive grammar. In general, the consideration of
corpus data in cognitive grammar leads to an innovative and realistic model of
speakers' linguistic knowledge, i.e. a model which is data-oriented and
frequency-based, functionalist and lexicogrammatical in nature. This theoretical
from-corpus-to-cognition approach will be illustrated by discussing corpus data
on the use of the ditransitive verb GIVE and by sketching out how the data may
be included in a truly usage-based model of the lexicogrammar of GIVE.
1. Introduction: cognitive grammar and corpus data
In principle, generative models of language cognition have always been based on
what Langacker (1987, 1999, 2000) has repeatedly called the 'rule/list fallacy',
that is a clear distinction between a set of syntactic rules on the one hand and a
list of lexical entries on the other. This is particularly true of the recent version of
generative grammar, the Minimalist Program, which is guided by strict economy
conditions (cf. Chomsky 1995). Langacker (2000), on the other hand, suggests a
fundamentally different approach to language cognition:
There is a viable alternative: to include in the grammar both the rules
and instantiating expressions. This option allows any valid generalizations
to be captured (by means of rules), and while the descriptions
it affords may not be maximally economical, they have to be preferred
on grounds of psychological accuracy to the extent that specific
expressions do in fact become established as well-rehearsed units.
Such units are cognitive entities in their own right whose existence is
not reducible to that of the general patterns they instantiate.
(Langacker 2000: 2)
Such 'well-rehearsed units', comprising routinised patterns of specific
instantiating expressions, cut across the lexicon-syntax boundary. What is more, they are
established due to the recurrent use of specific lexical items in a given
construction or, from a complementary perspective, the frequent use of specific
constructions with a given lexical item. In Figure 1, one of the examples given by
Langacker (1999) is shown. It visualises how combinations of specific
constructions, e.g. the basic ditransitive pattern [[V][NP][NP]], and specific ditransitive
verbs such as GIVE and SEND are entrenched as cognitive entities in their own
right. The left-hand circle refers to the constructional network of the
constructional schema [[V][NP][NP]], while the right-hand circle depicts the lexical
network of the verb SEND. At the intersection of the two circles, the resulting
pattern can be found, i.e. [[send][NP][NP]].
Figure 1. Lexical and constructional networks in cognitive grammar (Langacker
1999: 123)
In Figure 1, the conceptual similarities between Langacker's cognitive grammar
and corpus-linguistic approaches are obvious, even though the objects of inquiry,
namely language cognition and language use respectively, are no doubt different.
Specifically, the concept of lexical and constructional networks (representing
lexicogrammatical entities) could be easily mapped onto the notion of
lexicogrammatical pattern as it is described by Hunston and Francis (2000):
The patterns of a word can be defined as all the words and structures
which are regularly associated with the word and which contribute to
its meaning. [...] as a word can have several different patterns, so a
pattern can be seen to be associated with a variety of different words.
This is the opposite side of the coin.
(Hunston and Francis 2000: 37, 43)
In effect, such lexicogrammatical patterns are at the basis of cognitive grammar
(see Note 1).
Another cross-correspondence between cognitive grammar and corpus-based
pattern grammar is related to the fact that Langacker (1987) considers his
model to be 'usage-based', which is defined as follows:
Substantial importance is given to the actual use of the linguistic
system and a speaker's knowledge of this use; the grammar is held
responsible for a speaker's knowledge of the full range of linguistic
conventions, regardless of whether these conventions can be subsumed
under more general statements. [It is a] nonreductive approach
to linguistic structure that employs fully articulated schematic networks
and emphasizes the importance of low-level schemas.
(Langacker 1987: 494)
Special emphasis is placed here on the actual use of the linguistic system. In
general, this clearly mirrors the Hallidayan assumption that system and use are
inseparable because language use instantiates the system (cf. Halliday 1991: 31).
More specifically, a model of language cognition should be able to account for
actual usage, so that the model has to be based on actual use in the first place. It is
exactly here that corpus data may play a major role in refining cognitive grammar
and increasing its usage-basedness: corpora are samples of actual use of the
linguistic system; the schematic networks, low-level schemas and linguistic
conventions correspond largely to the lexicogrammatical patterns and routines
that can be identified by drawing on corpus data.
Table 1. Corpus-based insights into actual language use and their implications
for a usage-based cognitive grammar

Some typical features of language use       Implications for a usage-based
as attested in corpora                      cognitive grammar

linguistic forms differ with regard to      knowledge about these frequencies
frequency and distribution                  and distributions should be part of
                                            the model

language use is to a large extent based     the model should account not only
on recurrent patterns of different kinds    for linguistic creativity but also for
                                            linguistic routine

quantitative findings can often be          these principles/factors are part of
explained by considering functional and     speakers' linguistic knowledge and
context-dependent principles/factors        should be included in the model

lexical and grammatical choices are         lexicogrammatical patterns should
interdependent                              be at the basis of the model
Table 1 summarises four typical and general features of actual language
use as attested in corpora. In the right-hand column of Table 1, the implications of
these corpus-based findings for a truly usage-based cognitive grammar are
indicated. While, in a sense, lexicogrammatical patterns have always been at the
basis of cognitive grammar (cf. Figure 1), it seems to me that the first three
aspects in Table 1 have so far been neglected by proponents of a usage-based
cognitive grammar. In particular, existing models based on cognitive grammar
include neither actual frequencies of linguistic forms nor the principles and
factors that may lead language users to choose from a variety of options a specific
form in a given context. This kind of information, however, can be easily
obtained from corpus data. I would contend that the incorporation of this corpus-
based information in cognitive grammar would certainly increase the usage-based
quality of cognitive models. This theoretical approach will be exemplified in the
following section by delving more closely into the patterns of the ditransitive
verb GIVE in the British component of the International Corpus of English (ICE-
GB, cf. Nelson et al. 2002) and by deriving from the data a genuinely usage-
based cognitive model of the lexicogrammar of GIVE.
2. The relevance of corpus data to a usage-based cognitive grammar: the
case of GIVE
Table 2 provides an overview of the frequency of all GIVE-patterns in ICE-GB (see
Note 2). In the following, I will be concerned with the eight most frequent patterns only;
they are given in boldface in Table 2. These eight patterns alone account for more
than 91% of all occurrences of GIVE in ICE-GB. In a sense, then, it is these eight
patterns in particular that should be taken into consideration in a model of
routinised patterns in language use, because all the other patterns are only sporadically
used. Picking up on Aarts's (1991) distinction between 'performance' and
'language use', this section is thus intended to abstract away from the entirety of
performance data a model of language use that accounts for frequent
lexicogrammatical routines in using GIVE.
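The claim that eight patterns cover more than 91% of all occurrences can be checked directly against the counts in Table 2. The sketch below is only an illustration: the dictionary `TABLE2_COUNTS` and the helper names are hypothetical, the eight patterns are identified here simply by ranking the printed counts (an assumption standing in for the boldface marking), and the share is computed against the total printed in the table.

```python
# Counts per pattern type as printed in Table 2 (the "Miscellaneous" row,
# which is not a pattern of its own, is left out).
TABLE2_COUNTS = {
    "I": 404, "I a": 1, "I b": 23, "I c": 2, "I d": 1,
    "IP": 84, "IP b": 12,
    "II": 123, "II a": 4, "II b": 7, "II c": 2,
    "IIP": 23, "IIP b": 17,
    "III": 247, "III b": 16,
    "IIIP": 38, "IIIP b": 28,
    "IV": 10,
}

# Denominator: the total printed in Table 2 (not recomputed from the rows).
TABLE2_TOTAL = 1064


def top_patterns(counts, n):
    """Return the n most frequent (label, count) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


def coverage(selected, total):
    """Percentage of all occurrences accounted for by the selected patterns."""
    return 100 * sum(count for _, count in selected) / total


top_eight = top_patterns(TABLE2_COUNTS, 8)
share = coverage(top_eight, TABLE2_TOTAL)

for label, count in top_eight:
    print(f"{label:<7}{count:>5}{100 * count / TABLE2_TOTAL:>7.1f}%")
print(f"Top eight patterns cover {share:.1f}% of {TABLE2_TOTAL} occurrences")
```

Storing the counts alongside the pattern labels in this way is also one simple means of making frequency information part of a pattern inventory, in the spirit of the first implication listed in Table 1.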
Generally speaking, type I represents the basic ditransitive pattern with
both objects realised as noun phrases. I have little to say about this pattern since it
can be regarded as the default case both quantitatively and structurally. Thus, the
focus here should be on the reasons why language users opt for other patterns
than this default pattern in specific contexts, i.e. on significant principles of
pattern selection (cf. Mukherjee 2001).
For type I b, one specific factor can be easily identified. This type tends to
be used whenever the direct object has already been activated in the preceding
text because it is part of a previous pattern. As shown in (1), this explanation
accounts for some 83% of all cases of type I b. The examples in (2) to (4) nicely
illustrate the fact that, generally speaking, a preceding pattern in the text (e.g.
the... the..., grateful for sth., thank sb. for sth.) predetermines to a large extent the
following GIVE-pattern by providing the initial slot (and element) for the next
pattern (see Note 4). In the examples, the preceding pattern is given in italics, and the
overlapping GIVE-pattern is underlined.
Table 2. Frequency of GIVE-patterns in ICE-GB (see Note 3)

Type    Pattern                                                     Sum   Freq.
I       (S) GIVE [Oi:NP] [Od:NP]                                    404   38.0%
I a     (S) GIVE [Od:NP] [Oi:NP]                                    1     0.1%
I b     [Od:NP (antecedent)] (rel. pron.) [S] GIVE [Oi:NP]          23    2.2%
I c     [Oi:NP (antecedent)] (rel. pron.) [S] GIVE [Od:NP]          2     0.2%
I d     [Od:NP (fronted)] [S] GIVE [Oi:NP]                          1     0.1%
Miscellaneous                                                       10    0.9%
IP      [S < Oi active] BE given [Od:NP] (by-agent)                 84    7.9%
IP b    IP with [Od:NP (antecedent)]+ rel. clause/past participle   12    1.1%
II      (S) GIVE [Od:NP] [Oi:PP (to...)]                            123   11.6%
II a    (S) GIVE [Od:NP] [Oi:PP (for...)]                           4     0.4%
II b    [Od:NP (antecedent)] (rel. pron.) [S] GIVE [Oi:PP (to...)]  7     0.7%
II c    (S) GIVE [Oi:PP (to...)] [Od:NP]                            2     0.2%
IIP     [S < Od active] BE given [Oi:PP (to...)] (by-agent)         23    2.2%
IIP b   IIP with [S<Od (antecedent)]+ rel. clause/past participle   17    1.6%
III     (S) GIVE [Od:NP] Oi                                         247   23.2%
III b   [Od:NP (antecedent)] (rel. pron.) [S] GIVE                  16    1.5%
IIIP    [S < Od active] BE given Oi (by-agent)                      38    3.6%
IIIP b  IIIP with [S<Od (antecedent)]+ rel. clause/past participle  28    2.6%
IV      (S) GIVE Oi Od                                              10    0.9%
Total                                                               1064  100%
(1) I b  [Od:NP (antecedent)] (rel. pron.) [S] GIVE [Oi:NP]
    [Od:NP (antecedent)]: part of a previous pattern (19 of 23 cases = 82.6%)
(2) But it then means that the more things they put on the menu the tinier the
amount they give you <ICE-GB:S1A-018 #24:1:B>
(3) I would anticipate doing one or two units per year and would be grateful
for any financial assistance that the college could give me
<ICE-GB:W1B-022 #152:13>
(4) I must thank you, Simon and your parents officially for the slow cooker
and table cloth you gave us for our wedding <ICE-GB:W1B-004 #12:1>
For the passive type IP, there are many factors that seem to play a role in
the process of pattern selection. The cluster of relevant factors is summarised in
(5). It is not at all surprising that in more than 96% of all instances the by-agent is
left out. An important reason for choosing type IP thus lies in the optionality of
the agent. Additionally, two further factors seem to be responsible for the fact that
the recipient (corresponding to the indirect object in the default active type-I
pattern) is placed in the initial slot, thus serving as the grammatical subject. First,
this pattern tends to be chosen whenever the direct object is significantly heavier
than the initial element and is therefore placed in final position according to the
principle of end-weight (cf. Quirk et al. 1985: 1362). The correlation between
weight and pattern selection is illustrated in examples (6) and (7). This factor
alone accounts for 50% of all 84 cases. Second, in some 10% of all cases it is the
recipient that has already been activated before and is thus taken up as the first
element in the type-IP pattern. This is in line with the principle of end-focus (cf.
Quirk et al. 1985: 1357) according to which there is a general tendency to place
given information before new information. In examples (7) to (9), the previously
activated element which is part of (or provides the initial element for) the GIVE-
pattern at hand is italicised.
(5) IP  [S < Oi active] BE given [Od:NP] (by-agent)
    [S < Oi active]: activated before/taken up (8 of 84 cases = 9.5%)
    [Od:NP]: heavy (42 of 84 cases = 50.0%)
    (by-agent): left out (81 of 84 cases = 96.4%)
(6) [...] Margaret Thatcher cannot be given all the credit for our record levels
of radioactivity both at sea and on land <ICE-GB:W2B-014 #11>
(7) and rather nastily she had been tied to a chair until she was fourteen by her
blind mother and never actually given any form of uhm sound or language
communication <ICE-GB:S1B-003 #102>
(8) After all Saddam Hussein uh led his people they although they were not
given much choice in the matter in an eight year war [...]
<ICE-GB:S1B-035 #66>
(9) The Italian peoples were bound to fight in Rome's wars at their own
charge [...] Some peoples were actually given Roman citizenship [...]
<ICE-GB:W2A-001 #006/8>
In type II again, it is a cluster of factors that can be shown to play a role to
different extents in the process of pattern selection. As shown in Table 2, type II
differs from the basic type I in that the indirect object is realised as a
prepositional phrase (introduced by to) and placed after the direct object. Heaviness
of the final element is again a relevant factor since it is involved in 39 of 123
cases (= 31.7%). But there is another factor that seems to be even more important
for language users' choice of this pattern in given contexts, namely the lexical
item in direct-object position. The lexical items that are frequently used as direct
objects in type II can be grouped into three major types. In nearly 25% of all
cases, it is the pronoun it. The second group contains words which, broadly
speaking, are habitually associated with the preposition to according to the pattern
information in the corpus-based Macmillan English Dictionary (cf. Rundell
2002). This group thus includes nouns such as access, answer and reaction which
have a pattern themselves that could be described in COBUILD manner (cf.
Sinclair 1995) as 'N to n'. This group also includes nouns (such as name) which
are part of larger verb-dependent patterns containing the sequence 'N to' (e.g.
give one's name to sth. and put a name to). Whether it is due to small-scale
patterns of the noun itself or due to large-scale patterns of a verb including the
noun-to sequence, the overall effect is the same: the noun at hand and the
preposition to tend to co-occur fairly frequently in actual usage. The third group
includes words that are so closely associated with this pattern that the resulting
word-pattern combinations may be regarded as lexically stabilised idioms, e.g.
give birth to sb./sth. and give rise to sb./sth.: here the type-I pattern no longer
provides a genuine alternative. These three groups of lexical items in direct-object
position account for some 75% of all occurrences of this pattern. The two factors
that are responsible for the preference for the type-II pattern over others, namely
weight of the indirect object and lexis of the direct object, are summarised in
(10). The examples given in (11) to (13) are intended to illustrate the second
factor in particular. In all three examples, the lexical items in direct-object
position that seem to trigger off the selection of the type-II pattern are italicised.
Additionally, the relevant small-scale to-pattern of the noun in direct-object
position is in boxes in examples (12) and (13).
(10) II  (S) GIVE [Od:NP] [Oi:PP (to...)]
     [Oi:PP (to...)]: heavy (39 of 123 cases = 31.7%)
     [Od:NP]: frequent lexical items in Od-position (91 of 123 cases = 73.9%):
     1. it (30 of 123 cases = 24.4%)
     2. words that are associated with the preposition to in general, e.g. access,
     aid, answer, attention, comfort, consideration, credence, (one's) name,
     reaction, reply, substance (18 of 123 cases = 14.6%)
     3. words bound to type II in lexically stabilised idioms, e.g. give birth /
     rise / thought / way to sb./sth. (43 of 123 cases = 34.9%)
(11) so we can have an acid and alcohol and give it to the esterase which is a
useful product <ICE-GB:S2A-034 #39>
(12) A clutch of opinion polls gave comfort to both sides in the simmering
civil war yesterday <ICE-GB:W2C-006 #76>
(13) but when you follow that through you've got the means to give rise to a
change in the method of accounting that's adopted in the company
<ICE-GB:S2A-037 #122>
Type IIP is the passive form that can be derived from the type-II pattern.
Note that the systematic correspondence between the two patterns stems from the
fact that in both cases the indirect object is realised as a to-phrase. As shown in
(14), all the kinds of factors that are involved in the choice of the passive pattern
IP are also involved in type IIP: previous activation of the initial element (6 of 23
cases = 26.1%), heaviness of the post-verbal element (8 of 23 cases = 34.8%),
and the frequent omission of the by-agent (22 of 23 cases = 95.6%). In the light of
the 23 cases at hand, we may also assume that two further factors may at times tip
the balance in favour of type IIP: (i) the need to put the indirect object in focus
according to the principle of end-focus; (ii) the use of a lexical item (e.g. thought)
in the passive subject which may be habitually associated with the preposition to.
The cluster of all five factors and their explanatory power in quantitative terms
are summarised in (14). Example (15) illustrates the relevance of the principle of
end-focus (here in order to contrast the two italicised elements at the end of the
two dependent clauses). Example (16) refers to the influence of the lexical item in
subject position on the selection of the type-IIP pattern.
(14) IIP [S < Od active] BE given [Oi:PP (to...)] (by-agent)
[S < Od active]: activated before/taken up (6 of 23 cases = 26.1%)
[Oi:PP (to...)]: heavy (8 of 23 cases = 34.8%)
(by-agent): left out (22 of 23 cases = 95.6%)
[S < Od active]: words that are associated with the preposition to
[> Macmillan English Dictionary] (9 of 23 cases = 39.1%)
[Oi:PP (to...)]: deliberately placed in final focus position (7 of 23 cases = 30.4%)
(15) At the start of the conflict you said more time should have been given to
sanctions but now you're saying that more time should have been given to
pursue those diplomatic initiatives <ICE-GB:S2B-018 #92-94:2:D>
(16) It is not clear that enough thought has been given to the consequences of
these proposals for the movement of traffic outside the areas immediately
affected <ICE-GB:W1B-027 #41:4>
What all type-III patterns have in common is the fact that the indirect
object is omitted. Note that in many of these cases, the verb GIVE is not parsed
as ditransitive but as monotransitive in ICE-GB. For various reasons, however,
I regard all instances of GIVE as examples of ditransitivity. Without going
into details about this theoretical issue, it is necessary to point out that my
approach to ditransitivity is inherently lexico-semantic (rather than, say, merely
syntactic) in nature. In other words, the underlying assumption is that the verb
GIVE always triggers what Goldberg (1995) calls the ditransitive construction
at a cognitive level. However, as pointed out by Goldberg (1995) herself, not all
argument roles of the process of giving (i.e. the agent, the recipient and the
patient) need to be explicitised at the level of syntactic surface structure. Among
many others, Matthews (1981), Jackson (1990), Newman (1996) and Biber et al.
in the Longman Grammar of Spoken and Written English (1999) show that
specific elements may be left out because, for example, they can be recovered
from the context or can be inferred from world knowledge. In a sense, then,
GIVE should be regarded as a ditransitive verb in all its occurrences from a
cognitive-semantic point of view because it is bound to evoke an event type
which includes three argument roles, even though some implicit argument roles
may not be explicitised.
Type III is the second most frequent pattern of GIVE in ICE-GB. It does
not come as a surprise that corpus data reveal that this pattern is used whenever
the recipient is indeed recoverable from the context or when its specification is
irrelevant in a given context. In fact, this pertains to all 247 cases at hand.
Furthermore, the pattern tends to be chosen whenever specific lexical items are
used in direct-object position. That is to say, the omission of the indirect object
seems to be linked to lexical items which may imply no need for any specification
of the recipient because it is only the mere existence of a recipient that is relevant
but not the particular kind of recipient.5 In (17), those 21 words are listed that are
used at least three times as direct objects in the type-III pattern of GIVE. Note
that these 21 words alone account for roughly 50% of all cases of this pattern.6 As
in the type-II patterns, it thus seems as though specific lexical items may serve as
pointers to the type-III pattern. Some examples are given in (18) to (20).
(17) III (S) GIVE [Od:NP] Oi
Oi: contextually recoverable / specification irrelevant (all 247 cases = 100.0%)
frequent lexical items in Od-position (≥ 3 occurrences each):
account (9), birth (3), command (3), detail (10), effect (3), evidence (20),
example (9), hint (3), impression (10), indication (7), information (5),
instruction (5), it (4), lecture (8), message (3), (sb.'s) name (4), notice (3),
reason (3), signal (3), talk (3), way (6) (124 of 247 cases = 50.2%)
(18) So for instance we can give a very nice account of coarticulation [...]
<ICE-GB:S2A-030 #12>
(19) It helps to clarify the poet's ambiguous comments beforehand by giving an
actual example of what he means <ICE-GB:W1A-018 #33>
(20) And it's that sort of thing that gave the impression which I'm sure he was
trying to do <ICE-GB:S1B-038 #103>
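The coverage figure reported in (17) can be rechecked with a short tally. The following Python sketch simply copies the counts listed in (17); the variable names are illustrative assumptions and are not part of the original study.

```python
# Direct-object lexical items occurring at least three times in the
# type-III pattern of GIVE, with counts copied from (17).
direct_objects = {
    "account": 9, "birth": 3, "command": 3, "detail": 10, "effect": 3,
    "evidence": 20, "example": 9, "hint": 3, "impression": 10,
    "indication": 7, "information": 5, "instruction": 5, "it": 4,
    "lecture": 8, "message": 3, "name": 4, "notice": 3, "reason": 3,
    "signal": 3, "talk": 3, "way": 6,
}

TYPE_III_TOTAL = 247  # all type-III cases reported for ICE-GB

# Number of type-III cases accounted for by the listed items.
covered = sum(direct_objects.values())
# Share of the pattern covered by the listed items, in per cent.
share = 100 * covered / TYPE_III_TOTAL

print(f"{len(direct_objects)} items cover {covered} of {TYPE_III_TOTAL} cases "
      f"({share:.1f}%)")
```

The tally makes explicit that a small, semantically restricted set of nouns accounts for about half of all instances of the pattern.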
From type III, the passive form IIIP can be derived. Again, the optionality
of the by-agent is most important for the process of pattern selection because it is
omitted in 31 out of 38 cases (81.6%). Additionally, specific lexical items in the
subject position (i.e. the subjectivised direct objects of the type-III pattern) tend to
be closely associated with this pattern. That is to say, not only is the type-IIIP
pattern used whenever neither the agent nor the recipient needs to be explicitised
but also when particular words refer to the patient of the action. In (21) those
words are listed that occur at least twice in this pattern in ICE-GB, accounting for
some 45% of all instances. Some of them are exemplified in (22) to (24).7
(21) IIIP [S < Od active] BE given Oi (by-agent)
(by-agent): left out (31 of 38 cases = 81.6%)
recurrent lexical items in subject position (≥ 2 occurrences each):
approval (2), limit (2), information (2), detail (7), time (2), directions (2)
(17 of 38 cases = 44.7%)
(22) He's called Malachi in the opening verse but no biographical information
is given about him <ICE-GB:S2A-036 #78>
(23) uh directions are given from Ushant uh from the Scillies uh from the
South coast of Ireland down to Cape Ortegal or Finisterre
<ICE-GB:S2B-043 #20>
(24) More specific implementation details are given at the end of the report
<ICE-GB:W1A-005 #5:1>
The last pattern to be mentioned is type IIIP b. This type is similar to

pattern I b in that the patient (i.e. the subjectivised direct object) serves as an
antecedent to which a relative clause or a past participle construction refers back.
As shown in (25), there is again a clear tendency for language users to choose this
pattern with a fronted antecedent whenever this antecedent has already been part
of a preceding pattern in the text at hand. Examples (26) to (28) illustrate this
dependency on the previous pattern (given here in italics: know of sth., consider
sth., trace on to ... sth.) the last element of which provides the starting-point for
the subsequent GIVE-pattern. It should be noted in passing that the by-agent is
not as frequently omitted as in all other passive patterns mentioned so far. In fact,
in more than one third of all cases (10 of 28 cases = 35.7%), the agent is stated
explicitly. Thus, the optionality of the by-agent as such turns out to be less
forceful a factor for this particular passive form.
(25) IIIP b IIIP with [S<Od (antecedent)] + relative clause/past participle
[S<Od (antecedent)]: part of a previous pattern (16 of 28 cases = 57.1%)
with or without by-agent (10 vs. 18 cases = 35.7% vs. 64.3%)
(26) and he will also know of the increased uh support given uh in the uh
announcement last week by my right honourable friend the Social Security
Secretary <ICE-GB:S1B-056 #46:1:B>
(27) [...] it also is of relevance when considering the evidence given by Mr Holt
because there is a clear conflict [...] <ICE-GB:S2A-068 #40:1:A>
(28) But what I have simply done is to trace on to a map the directions that are
given which give you some indication [...] <ICE-GB:S2B-043 #19:1:A>
[Figure 2: network diagram linking GIVE to its patterns I, I b, IP, II, IIP, III, IIIP and IIIP b (and to other patterns), with each link annotated with the relevant principles of pattern selection]
Figure 2. A usage-based cognitive model of the lexicogrammar of GIVE

The actual use of the eight most frequent GIVE-patterns and the relevant
principles of pattern selection as described above provide an empirically sound
basis for a truly usage-based cognitive model of the lexicogrammar of GIVE.
Such a usage-based model on the basis of ICE-GB is visualised in Figure 2.
In two regards, the tentative model suggested in Figure 2 is more
elaborated and more usage-based, as it were, than traditional lexical networks in
cognitive grammar (as, for example, shown in Figure 1). Firstly, the thickness of
the lines between GIVE and its patterns depends on the frequency of GIVE in
each pattern. Figure 2 thus puts into operation what has been suggested, among
others, by Lamb (2002: 91), namely that different '[d]egrees of entrenchment
[can be] accounted for by variability in the strengths of connections'. Secondly,
at all lines connecting GIVE and its patterns there is information on why a
particular pattern is used in a given context. Such principles of pattern selection
can be identified only by looking at large amounts of natural data in context and
have so far not been taken into consideration in cognitive grammar. More
specifically, traditional network models in cognitive grammar have focused on
what is structurally possible. Corpus data, however, provide information on what
is likely to occur and why. As I have argued elsewhere (cf. Mukherjee 2002),
both aspects are part of speakers' linguistic knowledge and should therefore be
covered by a truly usage-based cognitive grammar.
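The first refinement, line thickness that reflects pattern frequency, can be illustrated with a minimal Python sketch. It uses only the five pattern frequencies that are stated explicitly in this section (types II, IIP, III, IIIP and IIIP b); the linear scaling and the width limits are assumptions made for illustration and are not taken from Figure 2 itself.

```python
# ICE-GB frequencies of five GIVE-patterns as stated in the running text.
pattern_counts = {"II": 123, "IIP": 23, "III": 247, "IIIP": 38, "IIIP b": 28}


def line_widths(counts, min_width=0.5, max_width=6.0):
    """Map each pattern frequency linearly onto a line width.

    The most frequent pattern receives max_width; the scaling rule and
    the width limits are hypothetical drawing parameters.
    """
    top = max(counts.values())
    return {
        pattern: min_width + (max_width - min_width) * n / top
        for pattern, n in counts.items()
    }


widths = line_widths(pattern_counts)
for pattern, width in sorted(widths.items(), key=lambda item: -item[1]):
    print(f"GIVE -- {pattern:<6} count={pattern_counts[pattern]:>3} "
          f"width={width:.2f}")
```

Any monotonic mapping would serve the same purpose; the point is only that stronger entrenchment is rendered as a stronger connection.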
3. Conclusions and prospects for future research
The present paper is informed by the belief that corpus linguistics and cognitive
linguistics are not at all mutually exclusive but can fruitfully complement each
other in developing a genuinely usage-based model of language cognition, i.e. of
speakers' knowledge of the underlying language system. A genuinely usage-based
model defies the rigid Chomskyan dichotomy between competence and
performance.8 In fact, such a model is intended to bridge the gap between
system and use and to mirror speakers' linguistic knowledge along the lines of
Hymes's (1972, 1992) concept of communicative competence, in which the
ability to use linguistic forms and structures idiomatically (e.g. in terms of
frequently co-occurring forms) and appropriately (e.g. in terms of pragmatic
principles) is integral to speakers' knowledge of the language. This view is
closely related to the Hallidayan idea that language use and language system are
intricately interwoven, which makes it possible and reasonable to derive from a
corpus-based analysis of actual language use a usage-based model of the
cognitive entrenchment of the language system. In effect, this approach
capitalises on Schmid's (2000: 39) From-Corpus-to-Cognition Principle:
'Frequency in text instantiates entrenchment in the cognitive system.' In
particular, I hope to have shown that lexical network models in cognitive
grammar can be refined in two regards by taking into account corpus data: not
only is it possible to introduce frequency-based information on different strengths
of linkage between lexical items and constructions but also to introduce in the
model context-dependent principles of pattern selection (such as
lexicogrammatical co-selections, pragmatic principles and activation statuses of
discourse entities). Thus, corpus-linguistic methodology obviously opens up new
and promising perspectives in cognitive linguistics.
By including quantitative trends and context-dependent principles of
pattern selection in usage-based models of language cognition, future research in
this field should try to quantify the influence that each of the relevant factors
exerts on the process of pattern selection and to empirically describe the
prototypicality of a specific pattern in a given context (cf. Gries's (2001) model of
a multifactorial analysis). In order to establish more reliable quantitative trends
(in terms of, say, lexical co-selections of a given pattern), it will certainly be
useful to analyse larger corpora such as the British National Corpus. From a more
theoretical perspective, future research into the refinement of the usage-based
model as sketched out in the present paper will have to address the question as to
whether the principles of pattern selection should be integrated with each
individual lexicogrammatical pattern of a given verb or, alternatively, whether
they should best be regarded as a separate subcomponent of a usage-based model.
As shown in Figure 1, constructional networks provide, in a sense, mirror images
of lexical networks, which begs the question as to whether it is necessary and
reasonable to posit separate constructional networks in a usage-based model.
While proponents of construction grammar (e.g. Goldberg 1995) place special
emphasis on the constructional nature of language cognition, other researchers
(e.g. Nemoto 1998) call into question the plausibility of the concept of abstract
and entirely delexicalised constructions.
Finally, brief mention should be made of the issue of genre distinctions. In
the present paper, the influence that specific genres may exert on the frequency of
individual GIVE-patterns has been left out of consideration. Future research into
corpus-based cognitive models should certainly delve more closely into the
correlations between specific genres and the frequency of linguistic forms. It
remains to be seen, though, whether genre-specific factors should best be
regarded as full-fledged principles of pattern selection at the centre of a usage-
based model or as additional factors on the periphery of such a model.9
Notes
1. Note that Langacker (1999: 122) himself states that lexicon and grammar
grade into one another so that any specific line of demarcation would be
arbitrary. This description is of course largely reminiscent of the Hallidayan
approach to lexicogrammar as a unified phenomenon, a single level of
wording, of which lexis is the most delicate resolution (Halliday 1991:
31-32).
2. It should be noted that the data in Table 2 are based on a manual analysis of
all occurrences of GIVE and not on the parsing information included in ICE-
GB. The reason why the data were analysed manually is the fact that many
instances of GIVE are not parsed as ditransitive in ICE-GB but, for example,
as monotransitive (especially in the case of type-III patterns) or as complex-
transitive (especially in the case of type-II patterns). In contrast, I regard all
instances of GIVE as examples of ditransitivity on cognitive and semantic
grounds (cf. Goldberg 1995 and Newman 1996). It is for this reason that
phrasal verbs such as GIVE AWAY, GIVE IN and GIVE UP have not been
taken into account, because their semantics tends to be quite different from
GIVE. Note also that not all instances of GIVE can be grouped into any of the
patterns listed in Table 2. However, such miscellaneous cases are rare and
thus of a marginal nature.
3. The pattern formulas are based on the following notational conventions: [...]
obligatory element; [...(...)] obligatory element with a specific form/function;
(...) optional element; Oi/Od clause element which is not part of the
lexicogrammatical pattern at the level of syntactic surface structure (although
the corresponding argument role is taken to be implicitly evoked by GIVE at a
cognitive level).
4. In fact, this is reminiscent of what Hunston and Francis (2000: 211) refer to as
'pattern flow': 'Pattern flow occurs whenever a word that occurs as part of the
pattern of another word has a pattern of its own.'
5. Since, from a lexico-semantic point of view, the existence of a recipient is
already inherent in the event type evoked by the ditransitive verb GIVE, there
is no need to explicitise the recipient as an indirect object at the level of
syntactic surface structure in these cases. For example, in phrases such as give
a lecture and give a talk some kind of recipient is always implied (e.g. an
unspecified audience). Accordingly, Newman (1996: 54), in his cognitive
study of GIVE, describes such implicit argument roles as unfilled elaboration
sites.
6. As a matter of fact, many of the lexical items could be complemented by other
items of the same semantic field that also occur in this GIVE-pattern in ICE-
GB, e.g. give a lecture/a talk (+ a paper, a speech, a statement...), give
instructions (+ advice, help, orientation...) and give a message (+ an answer,
an outline, a response, a warning...). The important point here is that the lexis
in direct-object position is semantically restricted.
7. Note that the analysis of type IIIP is based on 38 instances only. One could
easily hypothesise that the list of recurrent lexical items would have been
much more similar to the list given for the type-III pattern if some 250 cases
had been scrutinised. Here, larger corpora are needed.
8. It is for this reason that the term 'competence' is not used in the present paper.
A cognitive model that is based on corpus evidence, as suggested in the
present paper, has not much in common with a generative model of
competence. Thus, it is not very useful to take over and extend or redefine the
term 'competence', which would automatically lead to terminological
confusion (cf. Taylor 1988). Instead, I prefer to speak of a usage-based model
of speakers' linguistic knowledge.
9. Note that many issues that have only been mentioned in passing in this
section, including the implications of the concept of communicative
competence, the issue of constructional networks and the place of genre
distinctions in a usage-based model of speakers' linguistic knowledge, will be
discussed in much more detail in a book-length study that is underway (cf.
Mukherjee, forthcoming).
References
Aarts, J. (1991), Intuition-based and observation-based grammars, in: K. Aijmer
and B. Altenberg (eds.) English corpus linguistics: studies in honour of
Jan Svartvik. London: Longman. 44-62.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Pearson Education.
Chomsky, N. (1995), The minimalist program. Cambridge, MA: MIT Press.
Goldberg, A.E. (1995), Constructions: a construction grammar approach to
argument structure. Chicago, IL: The University of Chicago Press.
Gries, S.T. (2001), A multifactorial analysis of syntactic variation: particle
movement revisited, Journal of quantitative linguistics, 8: 33-50.
Halliday, M.A.K. (1991), Corpus studies and probabilistic grammar, in: K.
Aijmer and B. Altenberg (eds.) English corpus linguistics: studies in
honour of Jan Svartvik. London: Longman. 30-43.
Hunston, S. and G. Francis (2000), Pattern grammar: a corpus-driven approach
to the lexical grammar of English. Amsterdam: Benjamins.
Hymes, D.H. (1972), On communicative competence, in: J.B. Pride and J.
Holmes (eds.) Sociolinguistics: selected readings. Harmondsworth:
Penguin. 269-293.
Hymes, D.H. (1992), The concept of communicative competence revisited, in:
M. Pütz (ed.) Thirty years of linguistic evolution: studies in honour of
René Dirven on the occasion of his sixtieth birthday. Amsterdam:
Benjamins. 31-57.
Jackson, H. (1990), Grammar and meaning: a semantic approach to English
grammar. London: Longman.
Lamb, S. (2002), Types of evidence for a realistic approach to language, in: R.
Brend, W. Sullivan and A. Lommel (eds.) LACUS forum XXVIII: what
constitutes evidence in linguistics? Houston, TX: LACUS. 89-101.
Langacker, R.W. (1987), Foundations of cognitive grammar, vol. I: theoretical
prerequisites. Stanford, CA: Stanford University Press.
Langacker, R.W. (1999), Grammar and conceptualization. Berlin: Mouton de
Gruyter.
Langacker, R.W. (2000), A dynamic usage-based model, in: M. Barlow and S.
Kemmer (eds.) Usage-based models of language. Stanford, CA: CSLI
Publications. 1-63.
Matthews, P.H. (1981), Syntax. Cambridge: Cambridge University Press.
Mukherjee, J. (2001), Principles of pattern selection: a corpus-based case study,
Journal of English linguistics, 29: 295-314.
Mukherjee, J. (2002), The scope of corpus evidence, in: R. Brend, W. Sullivan
and A. Lommel (eds.) LACUS forum XXVIII: what constitutes evidence in
linguistics? Houston, TX: LACUS. 103-114.
Mukherjee, J. (forthcoming), English ditransitive verbs: aspects of theory,
description and a usage-based model. Amsterdam: Rodopi.
Nelson, G., S. Wallis and B. Aarts (2002), Exploring natural language: working
with the British component of the International Corpus of English.
Amsterdam: Benjamins.
Nemoto, N. (1998), On the polysemy of ditransitive save: the role of frame
semantics in construction grammar, English linguistics, 15: 219-242.
Newman, J. (1996), Give: a cognitive linguistic study. Berlin: Mouton de
Gruyter.
Rundell, M. (ed.) (2002), Macmillan English dictionary: school edition for
advanced learners. Hannover: Schroedel.
Schmid, H.-J. (2000), English abstract nouns as conceptual shells: from corpus
to cognition. Berlin: Mouton de Gruyter.
Sinclair, J. (ed.) (1995), Collins COBUILD English dictionary. London: Harper
Collins.
Taylor, D.S. (1988), The meaning and use of the term 'competence' in linguistics
and applied linguistics, Applied linguistics, 9: 148-168.
Putting 'putting verbs' to the test of corpora
Caroline David
Université de Poitiers
Abstract
Constraints on verb-preposition combinations and variations in the relation
between object and destination point raise questions such as the following:
How is it that two verbs semantically as close as put and place have
different syntactic constraints in sentences such as:
(a) X puts something into Y
(b) ?? X places something into Y
How is it that spray and load (both labelled as 'putting verbs'), which are
often compared and associated with the groups of Coil-verbs and Pour-verbs, in
fact can be closer to Fill-verbs only because they display a quite similar
behaviour in the nature of the link they maintain between their object and the
destination?
Based mainly on a quantitative study of the British National Corpus and
the LOB, FLOB, Brown and Frown corpora, the work presented in this paper is
an attempt to highlight how the prototypical verb put functions, in order to better
understand the syntactico-semantic mechanisms which underlie the other verbs of
this class. My ultimate purpose is to show that, beyond the classifications already
proposed by various linguists, a new typology of putting verbs can be outlined.
1. Introduction
Beneath homogeneous semantic and cognitive features, the class of Verbs of
Putting, as Levin (1993) and Dixon (1991) label it, displays significant variations
in their syntactic organisation. On the basis of corpus evidence, I will first tackle
the problem from a semantic point of view, showing that the synonymy of put,
set, lay and place is only superficial, and that the constraints they impose on their
prepositions depend on the semantic content of each (verb and preposition).
Indeed, the general semantic value of these three-place predicates (which
represent most of their uses) is the result of a semantic balance between the
preposition and the verb itself. Then I will examine the verb load whose syntactic
behaviour seems to make it closer either to FILL or COIL-verbs according to the
structure it is in. Finally, the analysis of fill will bring out (i) the complex and
intricate relations that tie up the three arguments of the verb (the subject, the
object, and the goal location), that is to say the syntactico-semantic mechanisms
which underlie it, and (ii) the intrinsic semantic properties of each argument that
are required by fill.
2. Put as a hyperOnym: a semantic approach

The postulate inherited from Saussure's (1916) structuralist approach according to
which there exists a structural organisation of the lexicon is now generally
accepted. The various semantic features shared or not by each lexeme of the same
semantic field inevitably give rise to a classification in strata, with hierarchical
relations between the different lexemes, and co-relations among lexemes of the
same stratum. Thus, in contrast to Dixon (1991: 99)1 who considers put, set, place
as belonging to the same group of verbs and Levin (1993: 111-122, see
Appendix) who groups put, set and place into the same class entitled PUT-verbs
(sub-class n°1), I will make a slight semantic distinction between the three of
them and between them and the verb lay.
2.1 Polysemy: phraseological abundance

Indeed, put seems more polysemous than the other three, which is notably shown
by the large number of idiomatic structures, phrasals and proverbs including put
(see Table 1). A dictionary like the Cobuild Dictionary of Idioms, to mention but
one, lists more than 94 idiomatic expressions such as put on airs, put your foot in
it, put someone on a pedestal, put the cat among the pigeons, put all your eggs in
one basket etc., whereas set only counts 21 (e.g. set the wheels in motion, set out
your stall, set tongues wagging), lay 12 (e.g. lay your cards on the table, lay an
egg) and place as a verb, none (only as a noun: put someone in their place, fall
into place, not a hair out of place etc.).
Table 1. Number of idioms in the Cobuild Dictionary of Idioms
VERBS PUT SET LAY PLACE
Number of idioms 94 21 12 0
2.2 Generalness of meaning

This abundance of phraseological expressions and the fact that put combines with
more prepositions and particles than set, lay and place (see Table 2, taken from
Pauwels 2000: 139) can be an indication of generalness of meaning. In other
words, the less meaningful the verb is, the more it can combine with different
prepositions, and the more important the role of the preposition is in the general
meaning of the prepositional verb phrase, and vice-versa. The semantic value of
such three-place predicates strongly depends on the semantic weight of the
prepositional phrase, on the one hand, and on the semantic content conveyed by
the verb itself, on the other.
Table 2. Verb-preposition combinations2

VERBS PUT (%) SET (%) PLACE (%) LAY (%)
on 560 (26.5) 37 (6.9) 85 (35.1) 60 (19.3)
in 428 (20.4) 47 (8.7) 59 (24.4) 7
down 184 (8.7) 14 45 (14.5)
to 116 (5.4) 32 (5.9) 5 (2) 6
out 77 (3.7) 72 (13.4) 42 (13.5)
up 72 (3.6) 78 (14.5) 5
back 61 4 1 2
(a)round 42 2
away 39 1
forward 38
off 26 20 14 (4.5)
at 24 9 15 (6.2) 8 (2.6)
together 22 1 1
over 22 3 2
under 21 3 6 (2.5)
aside 17 14 2
through 12
behind 11
before 7 2 5 3
against 6 9 1 7
by 6 1
across 5 1 2
ahead 4
past 4
about 4 11 3
above 3 2
between 3 1 3
after 2 1
outside 2
beyond 1
from 1
with 1 1 3
within 1
forth 1 3
beneath 1 2 1
beside 1 2 4
towards 1
along 1 1
next to 1 1
inside 2
below 1
among 1
near 2 1
opposite 1
apart 1
2.3 High frequency

In addition, if we look at the quantitative results of a search in the British
National Corpus (100 million words) shown in Table 3, we can see that, with
65,194 examples found, put (including all its forms) is one of the most frequent
verbs in the corpus, perhaps only a little less common than get and take. If we do
the same with the other three verbs, we obtain a total of 33,441 instances for set,
which is roughly half the frequency of put, 10,260 instances for place, which is
about one sixth of those of put, and finally, 6,993 instances for lay, which is about
ten times less frequent than put. These proportions are not quite the same as those
in the LOB, FLOB, Brown, Frown corpora (which I will refer to as the Brown
set) which contain 4.38 million words (see Table 4 below), but the high frequency
of put compared to the other three verbs is again emphasised.
Table 3. Number of occurrences in the BNC
VERBS   PUT      SET      PLACE    LAY
Total   65,194   33,441   10,260   6,993
Table 4. Number of occurrences in the Brown set of corpora3
VERBS   PUT      SET      PLACE    LAY
Total   2,212    1,432    683      790
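The proportions discussed in connection with Tables 3 and 4 can be recomputed directly from the raw totals. The Python sketch below copies the counts from the two tables and takes the corpus sizes from the running text (100 million words for the BNC, 4.38 million for the Brown set); the function names are illustrative assumptions, and the per-million normalisation is added here only to make the two corpora comparable.

```python
# Raw frequencies copied from Table 3 (BNC) and Table 4 (Brown set).
bnc = {"put": 65194, "set": 33441, "place": 10260, "lay": 6993}
brown_set = {"put": 2212, "set": 1432, "place": 683, "lay": 790}


def relative_to_put(counts):
    """Return each verb's frequency as a fraction of the frequency of put."""
    return {verb: n / counts["put"] for verb, n in counts.items()}


def per_million(counts, size_in_millions):
    """Normalise raw counts to occurrences per million words."""
    return {verb: n / size_in_millions for verb, n in counts.items()}


for name, counts, size in (("BNC", bnc, 100.0), ("Brown set", brown_set, 4.38)):
    ratios = relative_to_put(counts)
    rates = per_million(counts, size)
    for verb in counts:
        print(f"{name:<9} {verb:<5} ratio to put = {ratios[verb]:.2f}, "
              f"per million words = {rates[verb]:.0f}")
```

In both data sets put remains the most frequent of the four verbs, although the relative position of place and lay differs between the BNC and the Brown set.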
The high frequency characteristic of put and its generalness of meaning, that is to
say its polysemy compared with set, lay and place, both lead me to the same
conclusion: put, set, lay and place are not synonymous, and therefore cannot be
classified under the same label of PUT-verbs (sub-class n°1), because they do
not seem to represent the same process of putting things. Therefore, what has
been obvious for lay since Levin (1993) described it as a Verb of Putting in
Spatial Configuration, should be the same for set and place.
Defining lay as the way things are put, the way the object is displaced,
adds more information to the process of putting, which is not included in the
light meaning of put, if I can put it this way. Similarly, it seems that set and
place behave like lay in the sense that they also describe the way things are
moved, if we rely on the following definitions given by dictionaries such as the
Collins Cobuild English Dictionary, the Oxford English Dictionary and the
Longman Dictionary of Contemporary English (the highlighting in boldface is
mine):
Collins Cobuild English Dictionary
SET 1. If you set something somewhere, you put it there, especially in a careful
or deliberate way. He took the case out of her hand and set it on the
floor. When he set his glass down he spilled a little drink.
PLACE 1. If you place something somewhere, you put it in a particular position,
especially in a careful, firm or deliberate way. Chairs were hastily
placed in rows for the parents.
LAY 1. If you lay something somewhere, you put it there in a careful, gentle, or
neat way. Mothers routinely lay babies on their backs to sleep.
Oxford English Dictionary
SET 1. To put in a definite place (the manner of the action being implied either
in the verb itself or in the context); to put (more or less permanently) in a
definite place.
PLACE 1. To put or set in a particular place, position, or situation; to station; to
posit; fig. to set in some condition, or relation to other things. Often a mere
synonym of put, set.
2. To put or set (a number of things) in the proper relative place, i.e. in
order or position; to arrange, dispose, adjust.
LAY 1. To deposit; to place in a position of rest on the ground or any other
supporting surface; to deposit in some situation specified by means of an
adverb or phrase.
2. To dispose or arrange in proper relative position over a surface.
Longman Dictionary of Contemporary English
SET 1. To carefully put something down somewhere, especially something that is
difficult to carry.
PLACE 1. To put something somewhere, especially with care.
2. To put someone or something in a particular situation.
LAY 1. To put someone or something down carefully into a flat position.
For each verb, we find the notions of deposit, dispose, arrange or place in a
certain, specified or particular position, not to forget the notion of carefulness
which is quite often underlined. The OED even specifies for set that the manner
of the action is implied either in the verb itself or in the context.
Therefore, I will propose a new distribution of the different classes, where
on the one hand put, alone, is considered the prototypical verb of the general
process of putting with little additional information regarding the way things are
displaced and, on the other hand, set, lay, place etc., which are more specific in
meaning than put, are classified together as a kind of manner of putting.
According to Lyons' tests (1977: 292), which allow us to establish the position of
a lexeme in a hierarchical lexical field, setting, placing and laying things are all
kinds of putting things; that is to say, put is a hyperonym and the others are hyponyms.
This conception of the semantic organisation of the verbs might be represented as
in Figure 1.4
106 Caroline David
[Figure 1 diagram: PUT (hyperonym, prototype) linked to SET, PLACE and LAY (hyponyms)]
Figure 1. New structural organisation of the Putting Verbs
Let us now move on to some other verbs of the large class of Verbs of
Putting, viz. SPRAY/LOAD-verbs.
3. Load and locative alternation: a syntactic approach

The SPRAY/LOAD-class is considered homogeneous by a large number of
linguists (Anderson 1971, Jackendoff and Culicover 1971, Tremblay 1991,
Rivière 1997), in the sense that its members accept a locative alternation between two
different syntactic structures. Taking LOAD as an example, and starting from an instance in
the LOB corpus (1), we can compare the simplified version in (1a) with the with-variant
in (1b):
(1) Similarly there had been hay bales. Similarly, now there were for us
school trunks. Three times a year I loaded school trunks on to the car
and took them to the station, and three times a year loaded them on the car
and brought them home from the station. (LOB G19 163-164)
(1a) I loaded school trunks on to the car5
(1b) I loaded the car with school trunks
Table 6. Load and locative alternation
Corpora                Brown   Frown   LOB   FLOB   Total
No. of PP-phrases          6       6     9      5      24
No. of with-phrases       12       7     4     11      34
To judge from the frequencies in the Brown set of corpora (Table 6), load does
not really favour one structure over the other.6 This syntactic feature
(locative alternation), which can be viewed as the common feature of the whole
class, links its members closely to either COIL-verbs or FILL-verbs.
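As a rough check on the observation that load does not clearly favour either structure, the reported totals can be submitted to an exact two-sided binomial test. This is only a sketch: the test itself, the null proportion of 0.5 and the function name binomial_two_sided are assumptions added for illustration and are not part of the original analysis. The counts are the totals given in Table 6 and, for spray, in note 6.

```python
from math import comb

def binomial_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

# Totals as reported in the text: (PP-phrases, with-phrases).
counts = {"load": (24, 34), "spray": (13, 5)}

for verb, (pp, with_phrases) in counts.items():
    total = pp + with_phrases
    share_pp = 100 * pp / total
    p_value = binomial_two_sided(pp, total)
    print(f"{verb}: {share_pp:.1f}% PP, {100 - share_pp:.1f}% with, p = {p_value:.3f}")
```

With so few tokens, a test of this kind can only indicate whether the observed imbalance is compatible with chance; it says nothing about the four corpora taken individually.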
3.1 The PP-structures
The first structure (1a) resembles the syntactic construction of put and can be
glossed as I put school trunks on to the car by loading them. Load and put
are thus syntactically and semantically close to each other: in both cases we have a subject (I) who
triggers the action of moving an object (trunks) to a destination (the car). What
sets them apart is only the manner of putting the trunks on to the car, which ties
up with what I have just said about the manner of movement for set, lay and
place. Thus, in (1a) the object is transferred and located relative only to the
destination; this is pure displacement. What I should add is that the affectedness
of the object tends to take on a holistic interpretation, in other words the default
interpretation is that all the trunks are loaded, irrespective of whether the car is
full or not. As with put, in this structure, we can analyse the process of loading
in two steps (see Figure 2): first and foremost the relation [I - load - trunks] is
established (a strong link between the subject and the object), and secondly the
trunks are located relative to the car, and they are completely loaded. The order of
construction of the two relations in (1a) emphasises one thing: the importance of
the second argument, that is to say the direct object, which is also found in the
construction of put and of course in one of the possible structures of COIL and
POUR-verbs, as illustrated by pour and spill, respectively:7
First step: I loaded school trunks
    I (subject) - trunks (object)
Second step: the whole quantity of trunks is moved on to the car
    I (subject) - trunks (object) - the car (destination)
    Quantification8
Figure 2. Process of loading: pure displacement (1a)
(2) Spoon the mixture over the potatoes and then pour the cheese sauce over
the top. (FLOB E19 122)
(3) Enthusiastically, Thompson and Arbella Lacey spilled the contents of
four hessian sacks onto the kitchen floor. (FLOB M01 219)
3.2 The with-structures

However, the second structure (1b), I loaded the car with school trunks, is quite
remote from that of put, and at the same time far from that of COIL or POUR-
verbs, since we cannot have *I put the car with school trunks (by loading them),
or *John poured the bowl with water. The argument that is now in the position of
object is the car and it is no longer a goal location. It is an affected object and the
quantity is no longer the focus of interest as it is given a holistic interpretation:
the car is entirely loaded. What was a displacement of the object (trunks) in the
first structure is now a change of state of the object (the car). As a result, the
affected object (the car) could be described as being a car full of trunks, as
opposed to any other condition full of any other kind of objects, such as stools,
or desks, for instance. In a sense, this structure has to be linked to the FILL-verbs.
In this case, the focus is put on the direct object (the car) whose properties are
modified by trunks. Accordingly, the process of loading in this second structure
can be divided as shown in Figure 3.
First step: The car is loaded
    I (subject) - the car (object)
Second step: The car is qualified as being a car full of trunks
    I (subject) - the car (affected object) - school trunks
    Qualification9
Figure 3. Process of loading: affected object (1b)
It is interesting to notice that the notions of affectedness and
qualification of the object are reinforced by the overwhelming number of
passive examples among the load with-constructions. In fact, 28 of the 34 examples of
load with (82%) in the Brown set of corpora (Table 6) are passive constructions.
This distribution, which contrasts with the active examples of load usually found
in the literature, underlines once again the usefulness of corpora in the study of
language.
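The proportion quoted above can be recomputed from the two counts given in the text. The helper name share is an assumption introduced here; only the figures 28 and 34 come from the chapter.

```python
def share(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)

with_total = 34      # load with-constructions in the Brown set (Table 6)
with_passive = 28    # of which passive, as reported in the text
with_active = with_total - with_passive

print(f"passive: {with_passive}/{with_total} = {share(with_passive, with_total)}%")
print(f"active:  {with_active}/{with_total} = {share(with_active, with_total)}%")
```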
Before moving on to the FILL-verbs, let us examine a trickier and more
unusual example of load, which shows that the with-construction is decisive for
the notion of qualification but not for quantification:
(4) She knew, for Gideon dispatched formal reports whenever he could, that
the ship had made her way safely to Australia and disgorged her
passengers and cargo safely. She had loaded with Mundy wool from the
warehouses in Melbourne and was on her way back home. (FLOB P03
153)
This can be simplified as:
(4') The ship had loaded with wool
If we regard this example as indicating the result of a process of loading which
implies first that (X) had loaded the ship with wool and then that the ship was
loaded with wool, then the goal location, viz. the container (the ship), is
interpreted holistically. Even if the ship is in a syntactic subject position, it is
indeed also an affected object, in terms of semantic roles, interpreted as being a
ship full of wool. Once again, we have an operation of qualification where a
particular kind of load (wool) modifies the properties of the container (the ship).
Unlike Quirk et al. (1985: 744) who call this structure an intransitive
construction,10 which does not correspond to a transitive construction, my claim
is that this example looks like a kind of middle voice, where the object has been
moved into a syntactic subject position, leaving thus an empty object position and
an obligatory adverbial (or adjunct) required for that type of structure, since it
attributes the characteristics, or qualifications, to the syntactic subject. Note,
finally, that the perfective aspect (had loaded) which indicates a resulting state
can be useful to convey the idea of affectedness.
It may be useful to sum up the three (rather than two) different structures
with load.11 We have a PP-variant (1a), syntactically close to the COIL and POUR-
verbs, and two with-variants (1b) and (4') which can be linked to FILL-verbs:
(5) Norah will pack us up some sandwiches, and I will fill the flasks with tea.
(FLOB P25 146)
(6) Just then, to his delight, the gray fills with snow, as though someone
standing below the window had broken open seedpods and tossed up
fistfuls of white puffs. (Frown N22 17)
4. FILL-verbs
If we now turn to the structure of FILL-verbs, we find that we can have Norah
filled the flasks with tea, but not *Norah filled the tea into the flasks. The flask,
at once locative, container and the object affected by tea, has changed properties,
since it goes from the state of emptiness to the state of fullness. It changes from a
flask (a tea flask) to a flask of tea. The destination (the underlying element in this
pattern) is first any kind of container (any flask) and then it becomes an object
with a specified content (a flask which has the properties of being a flask full of
tea and not a flask of water, or of wine). Therefore, fill (like the FILL-verbs in
general) is more restrictive on the type of object it can take than load: we can
only take into account the [fill-container] relation and then qualify it. Hence, we
have first [fill-flask] and then [tea: process of filling], which is very similar to the
second case of load (1b). This relation [fill-container] is very strong since we can
have a structure with only two nominal arguments (two participants), which is
statistically quite frequent (28% of the transitive uses of fill are two-participant
structures):12
(7) He fetched the Nescafe and camp stove from his provisions, then went to
fill the bottle. (Frown N21 54)
In fact, load gives less information than fill about the type of locative it accepts and about
the way things are moved. The object constructed with fill must satisfy
all the properties of something which can be filled, whereas almost anything
can be loaded. As FILL-verbs format, give shape to, and put more restrictions
on the container than load, the only possible constructions are Norah filled the
flasks with tea, and the gray fills with snow (as in example 6).
In short, beyond the locative alternation which shows a certain syntactic
consistency across the members of the class, the properties of the object in one
type of structure and of the destination in the other reveal how the verb behaves
in a more significant way. The choice of one structure over another, e.g. the
choice of pour and coil over fill, serves to highlight the motion of the content,
rather than the change in fullness of the container. This idea is notably supported
by Gropen et al. (1991: 161) in their discussion of verbs like pour and fill:
If a verb specifies how something moves in a main event, it must
specify that it moves; hence we predict that for verbs that are choosy
about manners of motion (but not change of state), the moving entity
should be linked to the direct object role. In contrast, if a verb
specifies how something changes state in a main event, it must specify
that it changes state; this predicts that for verbs that are choosy about
the resultant state of changing entity (but not manner of motion), the
changing entity should be linked to the direct object role.
What is called a change of location can be related to what I have described as a
quantification, and what is said to be a change of state is comparable to a
qualification of the resultant state. Consequently, the locative alternation of a
verb such as load reveals three perspectives:
        school trunks on to the car   = COIL-verbs (Quantification)
LOAD    the car with school trunks    = FILL-verbs (Qualification)
        the ship with Mundy wool      = FILL-verbs (Qualification)
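The structures of load distinguished above can be told apart mechanically in the simplified examples. The following is a toy sketch only: the surface patterns, the labels and the function classify are assumptions made for illustration, and real corpus sentences would need parsing or manual checking rather than regular expressions. A passive pattern is included because most attested load with examples are passive.

```python
import re

# Toy surface patterns for the simplified examples. Order matters:
# the first pattern that matches decides the label.
LOAD = r"\bload(?:s|ed|ing)?"
PATTERNS = [
    ("passive with-variant",
     re.compile(r"\b(?:is|are|was|were|be|been|being)\s+loaded\s+with\b", re.IGNORECASE)),
    ("middle with-variant (4')",
     re.compile(LOAD + r"\s+with\b", re.IGNORECASE)),
    ("transitive with-variant (1b)",
     re.compile(LOAD + r"\s+(?:\w+\s+){1,4}?with\b", re.IGNORECASE)),
    ("PP-variant (1a)",
     re.compile(LOAD + r"\s+(?:\w+\s+){1,4}?(?:on\s+to|onto|into|in|on)\b", re.IGNORECASE)),
]

# Operation associated with each structure in the analysis above.
OPERATION = {
    "PP-variant (1a)": "quantification (displacement of the object)",
    "transitive with-variant (1b)": "qualification (affected container)",
    "middle with-variant (4')": "qualification (affected container)",
    "passive with-variant": "qualification (affected container)",
}

def classify(sentence: str) -> str:
    """Return the label of the first matching pattern, or 'unclassified'."""
    for label, pattern in PATTERNS:
        if pattern.search(sentence):
            return label
    return "unclassified"

for s in ["I loaded school trunks on to the car",
          "I loaded the car with school trunks",
          "The ship had loaded with wool"]:
    label = classify(s)
    print(f"{s} -> {label}: {OPERATION.get(label, 'n/a')}")
```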
5. Conclusion
All the verbs put, set, lay, place, coil, load, and fill need a PP which provides
different kinds of additional specification of the destination from a semantic point
of view; at the same time, a characterisation is given of the nature of the link
between the verb and its different arguments.
If we return to Levin's classification (see Appendix) and run through her
categories, we have first a light semantic verb (put) which imposes few
restrictions on the type of object, on the destination point and thus on the
preposition. This verb can combine with the largest number of prepositions: 39.
Then, verbs such as set, lay and place add more specification and information on
the process of putting but still without heavy constraints on the object and on the
destination as is shown by the various definitions of the dictionaries and the verb-
preposition combinations: 24 different prepositions for place and only 20 for set
and lay. Next, we come to the POUR-verbs and the COIL-verbs, which combine
with a restricted set of prepositions reflecting the semantics of the verb and the
focus on the quantity of the object displaced. At the boundary between the COIL-verbs
and the FILL-verbs, we find the SPRAY/LOAD-verbs which, depending on
the type of structure (a PP-construction or a with-construction), are closer to
one or the other of these classes. LOAD, according to our findings in the Brown set of
corpora, does not really favour either of these two constructions: 24 PP-phrases and
34 with-phrases. Moreover, it must be stressed that 82% of the examples of load
with in the Brown set are passive, a distribution that contrasts with the preference
for active examples in reference books. If we turn to SPRAY, the figures
are reversed, since we found 13 occurrences of PP-structures and 5 of with-structures,
a tendency which once again could not have been detected without a corpus
analysis. Finally, the FILL-verbs specify and severely limit the goal location (the
container) by qualifying it with its object. Let us recall that 28% of the transitive
uses of fill in the Brown set do not have any goal location in their structures.
Large electronic corpora of the kind used here provide a fruitful resource
for new approaches to the study of language, such as the use of data for a more
qualitative analysis and for testing hypotheses about syntactic structures. Corpora
describe and reflect a type of language reality which does not always correspond
to the picture presented in traditional descriptions.
Notes
1. Dixon (1991: 99) goes even further, since he gathers put, set, place, fill and
load in the same class, labelled rest verbs - put subtype, which refers to
causing something to be at rest at a Locus.
2. Note that deverbal uses have been included in this count.
3. Excluding deverbal cases.
4. In this study, I use some of the concepts in Culioli's Theory of Enunciative
Operations (Culioli 1990), in particular quantification (QNT) and
qualification (QLT) (see below). However, it should be noted (J. Chuquet,
private communication) that the concept of notional domain that one might be
tempted to associate with the put/set/lay/place system cannot be relevant here,
as it bears little or no relation to a prototype theory.
5. The use of the definite or indefinite article with the location and the object
transferred has been discussed by Laffut (1997, 1998).
6. Conversely, spray is more used in a PP-structure (13 examples) than in a with-
structure (5 examples) in the Brown set of corpora.
7. For more details on how COIL-verbs function, see Beatty (1979).
8. Quantification refers to the existence of an entity in Culioli's theory.
9. As a rough approximation, qualification refers to the properties of an entity
(for a full treatment, see Culioli 1999).
10. They give the following examples: Her books translate well. The sentence
reads clearly. My shirts have dried very quickly. The sheets washed easily. My
teapot pours without spilling.
11. By contrast, it seems that spray has only got two structures (1a) and (1b).
12. 163 examples out of 583.
References
Anderson, S.R. (1971), On the role of deep structure in semantic interpretation,
Foundations of Language 6: 387-396.
Beatty, J. (1979), An analysis of some verbs of motion in English, in: J. Fisiak
(ed) Studia Anglica Posnaniensia 1, Poznań: 127-142.
Collins Cobuild English Dictionary (1995), London: Harper Collins.
Cobuild Dictionary of Idioms. (1995), University of Birmingham: HarperCollins.
Culioli, A. (1990), The concept of notional domain, in: Pour une linguistique de
l'énonciation, Tome 1, Gap: Ophrys: 67-81.
Culioli, A. (1999), Pour une linguistique de l'énonciation. Domaine notionnel.
Tome 3, Gap: Ophrys.
Dixon, R.M.W. (1991), A new approach to English grammar. On semantic
principles. Oxford: Clarendon Press.
Gropen, J., S. Pinker, M. Hollander and R. Goldberg (1991), Affectedness and
direct objects: the role of lexical semantics in the acquisition of verb
argument structure, in: B. Levin and S. Pinker (eds) Lexical and
conceptual semantics. Oxford: Blackwell: 153-195.
Jackendoff, R. S. and P. Culicover (1971), A reconsideration of dative
movements, Foundations of Language 6: 397-412.
Laffut, A. (1997), The Spray/Load Alternation: some remarks on a textual and a
constructionist approach, Leuvense Bijdragen 86: 457-87.
Laffut, A. (1998), The locative alternation: a contrastive study of Dutch vs.
English, Languages in Contrast 3: 127-160.
Levin, B. (1993), English verb classes and alternations. A preliminary investigation.
Chicago: University of Chicago Press.
Longman Dictionary of Contemporary English (2000), Web Dictionary. Essex:
Pearson Education-Longman.
Lyons, J. (1977), Semantics. Cambridge: Cambridge University Press.
Oxford English Dictionary (1993). CD-ROM, Release version 1.0b: Oxford
University Press.
Pauwels, P. (2000), Put, set, lay and place. A cognitive linguistic approach to
verbal meaning. Lincom studies in theoretical linguistics: Lincom Europa.
Rivière, C. (1997), Une ou plusieurs notions? Les prédicats à trois places qui
admettent deux constructions, in: C. Rivière et M.L. Groussier (eds) La
Notion. Paris: Ophrys: 175-184.
Saussure, F. de (1916/1965), Cours de linguistique générale. 3rd ed. Paris: Payot.
Tremblay, M. (1991), Alternances d'arguments internes en français et en
anglais, Revue québécoise de linguistique 20, Montréal: Université du
Québec: 39-53.
Electronic Corpora:
British National Corpus Online (BNC) (http://info.ox.ac.uk/bnc/).
Appendix: Levin's classification of Verbs of Putting (1993: 111-122)
1. Put Verbs
arrange, immerse, install, lodge, mount, place, position, put, set, situate, sling,
stash, stow
2. Verbs of Putting in Spatial Configuration
dangle, hang, lay, lean, perch, rest, sit, stand, suspend
3. Funnel Verbs
bang, channel, dip, dump, funnel, hammer, ladle, pound, push, rake, ram, scoop,
scrape, shake, shovel, siphon, spoon, squeeze, squish, squash, sweep, tuck, wad,
wedge, wipe, wring
4. Verbs of Putting with a Specified Direction
drop, hoist, lift, lower, raise
5. Pour Verbs
dribble, drip, pour, slop, slosh, spew, spill, spurt
6. Coil Verbs
coil, curl, loop, roll, spin, twirl, twist, whirl, wind
7. Spray/Load Verbs
brush, cram, crowd, cultivate, dab, daub, drape, drizzle, dust, hang, heap, inject,
jam, load, mound, pack, pile, plant, plaster, ?prick, pump, rub, scatter, seed,
settle, sew, shower, slather, smear, smudge, spatter, splash, splatter, spray, spread,
sprinkle, spritz, squirt, stack, stick, stock, strew, string, stuff, swab, ?vest, ?wash,
wrap
8. Fill Verbs
adorn, anoint, bandage, bathe, bestrew, bind, blanket, block, blot, bombard,
carpet, choke, cloak, clog, clutter, coat, contaminate, cover, dam, dapple, deck,
decorate, deluge, dirty, douse, dot, drench, edge, embellish, emblazon, encircle,
encrust, endow, enrich, entangle, face, festoon, fill, fleck, flood, frame, garland,
garnish, imbue, impregnate, infect, inlay, interlace, interlard, interleave,
intersperse, interweave, inundate, lard, lash, line, litter, mask, mottle, ornament,
pad, pave, plate, plug, pollute, replenish, repopulate, riddle, ring, ripple, robe,
saturate, season, shroud, smother, soak, soil, speckle, splotch, spot, staff, stain,
stipple, stop up, stud, suffuse, surround, swaddle, swathe, taint, tile, trim, veil,
vein, wreathe
9. Butter Verbs
asphalt, bait, blanket, blindfold, board, bread, brick, bridle, bronze, butter,
buttonhole, cap, carpet, caulk, chrome, cloak, cork, crown, diaper, drug, feather,
fence, flour, forest, frame, fuel, gag, garland, glove, graffiti, gravel, grease,
groove, halter, harness, heel, ink, label, leash, leaven, lipstick, mantle, mulch,
muzzle, nickel, oil, ornament, panel, paper, parquet, patch, pepper, perfume,
pitch, plank, plaster, poison, polish, pomade, poster, postmark, powder, putty,
robe, roof, rosin, rouge, rut, saddle, salt, salve, sand, seed, sequin, shawl, shingle,
shoe, shutter, silver, slate, slipcover, sod, sole, spice, stain, starch, stopper, stress,
string, stucco, sugar, sulphur, tag, tar, tarmac, tassel, thatch, ticket, tile, turf, veil,
veneer, wallpaper, water, wax, whitewash, wreathe, yoke, zipcode
10. Pocket Verbs
archive, bag, bank, beach, bed, bench, berth, billet, bin, bottle, box, cage, can,
case, cellar, cloister, coop, corral, crate, dock, drydock, file, fork, garage, ground,
hangar, house, jail, jar, jug, kennel, land, lodge, pasture, pen, pillory, pocket, pot,
sheathe, shelter, shelve, shoulder, skewer, snare, spindle, spit, spool, stable,
string, tin, trap, tree, warehouse
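Several verbs appear in more than one of the classes listed above, which is easy to overlook in a flat list. The sketch below stores an excerpt of the appendix as a dictionary and looks up every class a verb belongs to. Only a handful of members per class are included, and the names LEVIN_PUTTING and classes_of are assumptions introduced for illustration.

```python
from typing import List

# Excerpt only: a few members per class, copied from the appendix above.
LEVIN_PUTTING = {
    "Put Verbs": {"arrange", "install", "place", "position", "put", "set"},
    "Verbs of Putting in Spatial Configuration": {"hang", "lay", "lean", "stand"},
    "Pour Verbs": {"pour", "spill", "drip"},
    "Coil Verbs": {"coil", "loop", "wind"},
    "Spray/Load Verbs": {"hang", "load", "pack", "spray", "string"},
    "Fill Verbs": {"fill", "cover", "flood", "garland"},
    "Butter Verbs": {"butter", "garland", "string"},
    "Pocket Verbs": {"bottle", "pocket", "string"},
}

def classes_of(verb: str) -> List[str]:
    """All classes in the excerpt that list the verb, in appendix order."""
    return [name for name, members in LEVIN_PUTTING.items() if verb in members]

for verb in ["put", "set", "lay", "load", "fill", "hang", "string"]:
    print(verb, "->", classes_of(verb))
```

In this excerpt, lay falls under the spatial configuration class while set and place fall under the Put Verbs, which is the grouping that the chapter proposes to revise.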
Esphoric reference and pseudo-definiteness
Peter Willemse
University of Leuven
Abstract
This paper investigates the referential status and structure of pseudo-definite
NPs occurring in postverbal position in the unmarked type of existential
sentences. These NPs are pseudo-definite in the sense that they are formally
definite but in fact realize presenting rather than presuming reference (Martin
1992); they introduce new entities into the discourse. In many cases, the formal
definiteness of the NP can be explained in terms of an esphoric, i.e. a forward
phoric relationship between elements within the same NP (Du Bois 1980, Martin
1992). The aim of this paper is to propose a classification of the different types of
pseudo-definites in terms of the distinct nominal constructions they realize. This
classification aims at being more exhaustive than the ones already available in
the literature (e.g. Lumsden 1988, Ward and Birner 1995). To this end, a corpus
of unmarked, or cardinal, existentials containing a formally definite postverbal
NP has been analyzed. The classification obtained on the basis of the analysis of
this corpus was checked against an additional corpus extraction containing
attributive clauses in which a formally definite NP occurs as Attribute.
1. Introduction
Esphoric reference is a type of cataphoric reference, i.e. a type of reference that
manifests itself as a forward phoric relationship. It is a phoric relationship
because esphora is associated with presuming rather than presenting reference: it
occurs in NPs whose grammar signals that the identity of the discourse
participant they realize is in some way recoverable. It is a forward relationship
since the identity of the referent of an esphoric NP is retrievable through
information located further on in the discourse. More specifically, Martin (1992:
123) defines esphora as forward reference within the same nominal group.
Whereas Halliday and Hasan (1976) do not distinguish esphora within the broad
category of cataphoric reference, Martin (1992) sets esphora, a phoric
relation within one and the same NP, apart from cataphora, which he defines as a
forward phoric relationship between different NPs. Martin further remarks, with
reference to Du Bois (1980: 224-225), that in general, what follows in the NP
justifies the use of a definite determiner; more specifically, it is usually the
Postmodifier element that gives the information necessary to identify the
participant in question. Halliday and Hasan (1976: 72) make a similar remark in
saying that the definite article the often refers cataphorically to a modifying
element within the same nominal group as itself.
It seems that the category of esphora, as defined by Martin (1992), covers several
phenomena that are still quite different in nature. It is a broad category within
which further subcategories can be distinguished. Most importantly, real
definite esphoric NPs should be distinguished from what Davidse (1999: 231) has
referred to as pseudo-definite NPs. An example of a truly definite esphoric type
of NP is an NP containing a restrictive relative clause functioning as a
postmodifier, for instance the man who spoke first at the meeting. Davidse (2000:
1112) gives a convincing argument for this, building on Langacker's (1991)
interpretation of the restrictive relative clause (RRC) as part of the type
specification evoked by the nominal. The RRC makes the type specification more
specific and thereby restricts it. Davidse points out that in NPs which contain a
definite determiner, a reference mass, viz. the set of all instances corresponding
to that particular type in the discourse context, is defined by the type
specification. For an NP containing a definite article and a singular count noun to
be unambiguous, it is necessary that only one contextually relevant instance
corresponds to the specified type (viz. the one referred to). When an NP contains
an RRC, the latter may narrow down the reference mass (by making the type
specification more specific) to only one contextually relevant instance, thus
making the NP truly definite and justifying the use of a definite determiner. In the
example quoted above, for instance, the reference mass defined by the type
specification man presumably contains several instances; the added RRC who
spoke first at the meeting narrows down this reference mass to just one instance
and in this way makes the NP definite.
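The narrowing effect of the RRC can be pictured as filtering a set of instances. The sketch below is purely illustrative: the discourse context with three men, the field names and the two functions are hypothetical and are not drawn from Davidse or Langacker.

```python
from typing import Callable, Dict, List

Instance = Dict[str, object]

# Hypothetical discourse context: three instances of the type "man".
context: List[Instance] = [
    {"type": "man", "name": "A", "spoke_first": True},
    {"type": "man", "name": "B", "spoke_first": False},
    {"type": "man", "name": "C", "spoke_first": False},
]

def reference_mass(type_name: str,
                   restriction: Callable[[Instance], bool] = lambda _: True) -> List[Instance]:
    """All contextually relevant instances matching the type specification,
    optionally narrowed by a restrictive relative clause."""
    return [x for x in context if x["type"] == type_name and restriction(x)]

def licenses_definite_singular(mass: List[Instance]) -> bool:
    """A singular definite NP is unambiguous only if exactly one instance remains."""
    return len(mass) == 1

# "the man": several instances, so the bare definite NP would be ambiguous.
print(licenses_definite_singular(reference_mass("man")))
# "the man who spoke first at the meeting": the RRC leaves a single instance.
print(licenses_definite_singular(reference_mass("man", lambda x: bool(x["spoke_first"]))))
```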
The focus of the present paper is on the other type of esphoric NPs, viz.
the pseudo-definite ones. Consider the following example:
(1) Tomorrow afternoon, there will be the usual Christmas concert.
In this example, the postverbal NP (the usual Christmas concert) is formally definite in that it contains a
definite determiner (viz. the definite article). However, with regard to its
referential status it is not definite; it introduces a new instance of the type
Christmas concert into the discourse rather than referring to an instance known
to the hearer or a contextually unique instance. The definite article seems to
signal something other than the referential status in this example. Similarly, non-referential
NPs in certain contexts where one would normally expect a formally
indefinite NP sometimes take a definite form. Consider the following example of
an attributive clause in which the Attribute is realized by a formally definite NP:1
(2) He is the son of a slave.
If one imagines that this sentence occurs in a context from which it is clear that
the slave in question has more than one son, the definite form of the attribute NP
is unexpected. The definite article seems to be motivated here by some
relationship within the NP in which it occurs.
In order to learn more about these pseudo-definite NPs, I will zoom in on
two grammatical contexts which normally exclude definite NPs, yet in which
formally definite NPs do occur occasionally. I will primarily focus on the
unmarked, or cardinal, type of existential constructions. In addition, I will look
at some types of pseudo-definite NPs occurring in attributive clauses, as
illustrated by example (2).
The unmarked type of existential is particularly interesting in view of the
topic of the present paper because a so-called definiteness restriction applies to
its postverbal, or Existent (Halliday 1994) NP. Traditionally, existential
constructions were analyzed as basically attributing a certain location to the
central entities they refer to (see Davidse 1999). However, it has been remarked
(Milsark 1976, 1977, Lumsden 1988, Davidse 1999) that two rather different
types of existentials should be distinguished. On the one hand, there is the
unmarked, or cardinal, type of existential. On the other hand, there is the
enumerative existential, which is the marked type of existential construction.
The semantic differences between the two types of construction are reflected in
the systematic distributional differences which they display. Consider the
following examples:
(3) There were two usherettes in the foyer. [cardinal existential]

(4) Even before his triumph at Seoul, Lewis had toyed with the idea of
returning to England. There was the British title to go for, and also the
Commonwealth, the European. [enumerative existential]
In (3), the existential states how much instantiation of the type usherettes there is
in the contextually specified situation. A cardinal existential indicates the
cardinality of the instantiation of the type expressed by the type specification in
the Existent NP (Davidse 1999: 238). Typically, cardinal existentials express
cardinal quantification (hence the name), i.e. they measure the intrinsic
magnitude of the designated mass in terms of a quantitative scale (see Langacker
1991: 84). Importantly, moreover, the existent (postverbal) NP is typically
indefinite: the designated instance(s) is/are being introduced into the discourse
and are not presumed known to the hearer. An enumerative existential, on the
other hand, enumerates in ordinal fashion, with implied reference to a
contextually specified type, instances sharing a superordinate type which
corresponds to that contextual type (Davidse 1999: 240-241). Let us look at
example (4) to make this clearer. In this example, three Existent NPs (the
British title, the Commonwealth, the European) are ordinally enumerated. They
all share a superordinate type, viz. something like major athletics competition.
This general type is further specified and situated contextually as roughly major
competitions for which Carl Lewis considered returning to England. The
instances of this type, designated by the three Existent NPs, are then held up
for consideration one by one. It is also possible that only one instance
corresponding to a contextually defined type is mentioned in the enumerative
existential. An important distributional difference with the cardinal existential is
that the enumerated instances in the existent NP are typically realized as
presumed known to the hearer, i.e. with definite grounding. Types of NP that
often occur as the existent NP in enumerative existentials are NPs with definite
grounding such as proper names, pronouns, NPs containing a definite article, a
demonstrative or a possessive and definite genitive NPs. However, it is also
possible that the enumerated instances are coded as not presumed known
through an indefinite NP.
From the above description it may be clear why the unmarked, or cardinal,
type of existential has been chosen as the environment to study types of pseudo-
definite NPs. The semantics of this type of construction entail that the Existent
NP is indefinite. Consequently, when a formally definite NP occurs in postverbal
position, its referential status can only be pseudo-definite, and not truly definite.
Enumerative existentials were not included in the analysis, since no definiteness
restriction applies to the postverbal NP in this type of construction.
The other type of construction which has been looked at for this paper is
the attributive clause. In contrast with the postverbal NP in existential sentences,
the NP realizing the attribute in an attributive clause is non-referential. Typically,
the Attribute NP has indefinite reference (see Halliday 1994: 120). Consequently,
attributive clauses are another suitable environment to study formally definite
NPs with pseudo-definite status.
2. The corpus study
The corpus analyzed for this paper is an extraction from COBUILD's The Bank
of English, a corpus of 450 million words containing both spoken and written
material.2 A concordance was extracted on any third-person finite form of to be
(i.e. is, are, was, were) preceded by there and followed by a complex noun phrase
consisting of a definite noun phrase and another noun phrase, connected by the
preposition of. This query was considered to be the most practical way of tracing
many, if not most, of the cases of postverbal NPs in cardinal existentials having
pseudo-definite status and involving some form of esphoric retrieval. Indeed,
pseudo-definite reference requires a definite determiner with its first noun for its
apparent definiteness and an indefinite or zero determiner with its second
noun to realize its true indefinite status. These are necessary but not sufficient
conditions for pseudo-definiteness, as many NP complexes of this form are truly
definite. Naturally, this extraction covers only those pseudo-definite NPs
occurring in unmarked existentials. At least some additional subtypes which were
not attested in the existential corpus will be included in the overview for the sake
of completeness. Some of these additional subtypes were found in the additional
corpus containing 50 attributive clauses (which was consulted only for types that
did not occur in the existential corpus). For the existential corpus, in total 200
examples were extracted. This relatively low number of examples is due to the
marked nature of esphora as such, and to its highly marked occurrence in the
intrinsically indefinite postverbal complement-slot of the unmarked existential
and as the Attribute in the attributive clause.
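The shape of the concordance query described above can be approximated over plain text. The following is only an illustrative sketch in Python, not the query syntax actually used for the Bank of English: the names CANDIDATE and find_candidates, the list of definite determiners excluded from the second NP, and the limit of three words for the first NP are all assumptions made for the example. Because it operates on untagged text, it overgenerates, and its hits are at best candidates that still require manual classification, in line with the observation that the formal conditions are necessary but not sufficient.

```python
import re

# Hypothetical approximation of the query: 'there' + finite form of 'to be'
# + definite NP + 'of' + NP with an indefinite or zero determiner.
BE_FORMS = r"(?:is|are|was|were)"
DEFINITE = r"(?:the|this|that|these|those|my|your|his|her|its|our|their)"

CANDIDATE = re.compile(
    rf"\bthere\s+{BE_FORMS}\s+"        # existential 'there' + 'to be'
    r"(?P<np1>the(?:\s+\w+){1,3}?)"    # first NP: 'the' + up to three words
    r"\s+of\s+"                        # linking preposition 'of'
    rf"(?P<np2>(?!{DEFINITE}\b)"       # second NP must not start with a definite determiner
    r"(?:an?\s+)?\w+)",                # optional 'a'/'an', then the first word of NP2
    re.IGNORECASE,
)


def find_candidates(text):
    """Return (first NP, start of second NP) pairs for each candidate hit."""
    return [(m.group("np1"), m.group("np2")) for m in CANDIDATE.finditer(text)]


if __name__ == "__main__":
    sample = "At that moment there was the sound of a door opening."
    print(find_candidates(sample))
```

Only the beginning of the second NP is captured, since its right-hand boundary cannot be determined reliably without part-of-speech tagging.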
The corpus data were classified according to the construction types which
are realized by the pseudo-definite NPs. Three primary categories were
distinguished; within some of these categories, I also propose more delicate
subcategories. For each category, the main question that will be dealt with is the
motivation of the use of the definite article despite the indefinite referential status
of the NP.
2.1 Type/Subtype-constructions
The first category contains the complex Existent NPs in which the first NP
indicates a subtype of the second NP.
(5) Iona McLeish's vast concrete set, a wire mesh-gated compound within the
ravaged Troy, is a happening in itself. <p> And all around, and above,
there is the sort of action reminiscent of war movies: the clatter of
helicopter rotors, whistling jet airstrikes and, when the city burns, a fire so
realistic you could toast bread on it.
(6) They could still be our friends. But now they never will be. Now there is
the sort of hatred I've described, the sort of cruelty, savagery, barbarity.
In both examples, the grammatical structure of the postverbal NP may provide an
explanation for the use of the definite determiner the in a context which does not
normally allow for definite NPs. In both cases, the noun referring to the type
(sort) functions as the Head of the NP. It is followed by a Postmodifier (of action,
of hatred). This complex Head is in its turn modified by a Postmodifier
(reminiscent of war, I've described). It seems that the definite determiner, then,
has its normal function in the NP as such: it signals that the identity of the
referent of the head noun it modifies is somehow recoverable. In other words, the
hearer is supposed to be able to identify the specific sort of thing that is being
talked about. More specifically, the information needed to identify the referent of
N1 is present within the same NP in the form of the postmodifier: the definite
determiner is, therefore, esphorically motivated. A possible paraphrase of
example (5) may illustrate this: there is an instance of action of the sort
reminiscent of war. This paraphrase also makes clear why this type of NP
containing the definite article the is acceptable in a cardinal existential: although
the hearer is supposed to identify the type of thing talked about (through the
information given by the postmodifier), the actual instance of this type of thing
referred to is a new instance, which is being introduced into the discourse. The
reference to a new instance is a pragmatic inference; the NP as a whole refers
primarily to an identifiable type. However, the pragmatic inference to instances is
one that must be made due to the specific nature of the existential context.3
Therefore, the reference of the complex NP in this type of examples can be said
122 Peter Willemse
to be pseudo-definite rather than truly definite (or truly indefinite, for that
matter): although on one level of interpretation, the identity of the referent is
recoverable (viz. the type of thing), on another level, an instance is still being
introduced into the discourse as a new entity, which is, consequently, not
presumed known to the hearer. We can briefly refer here to another type of
pseudo-definite NP which involves a similar reference mechanism, viz. NPs
containing specific types of postdeterminers. Postdeterminers are elements which
occupy the slot following the determiner slot in the NP and which fulfill a
function ancillary to the functions of (definite/indefinite) identification and/or
(absolute/relative) quantification realized primarily by the determiner (see
Davidse 2001 for a more comprehensive discussion of postdeterminers). Consider
the following examples:
(7) The Woody Allen-Mia Farrow breakup, and Woody's declaration of love
for one of Mia's adopted daughters, seems to have everyone's attention.
There are the usual sleazy reasons for that, of course--the visceral thrill of
seeing the extremely private couple's dirt in the street, etc. [San Francisco
Chronicle, 24/8/92; cited in Ward and Birner 1995:732]
(8) There was the usual crop of letters to the Member of Parliament
concerned, and about once every twelve months a really abusive one from
the tortured victim himself.
Ward and Birner (1995: 732) describe this kind of postverbal NPs as having dual
reference, both to a type and a token. They point out that, although the type has
hearer-old status, which justifies the use of the definite article, the token or
instance referred to is hearer-new, which accounts for the acceptability in
existential contexts. It is clear that the underlined NPs in examples (7) and (8)
introduce new entities into the discourse. At the same time, a definite determiner
is used to signal the identifiability, not of the instances, but of the type. In (7), for
instance, although the specific reasons for the public fascination with the break-
up are being introduced into the discourse and thus not presumed known to the
hearer, the speaker does assume that the hearer knows the type of reasons that are
usually the basis for such public attention. The postdeterminer fulfills a secondary
identifying function, in that it helps the hearer to make mental contact with the
right type: it provides a clue for the hearer to identify the correct type.
It may be clear that a very similar explanation holds for type-subtype
constructions, although in the latter type of construction the reference to the type
is given an explicit linguistic realization in the form of the head noun the sort of,
whereas the notion of type is not lexicalized in the case of a postdeterminer with
dual reference.
2.2 Possessive constructions
This second category contains different types of pseudo-definite NP which are
structurally and semantically very similar, in that they all express some kind of
possessive relation (in the broad sense) between NP1 and NP2. More specifically,
it concerns two types of constructions which express types of relationships which
can be regarded as prototypical for possessive constructions (Langacker 1991:
169), viz. part/whole and kinship relations.
The first type is formed by NP complexes in which NP1 specifies a part or
component of the referent of NP2. This category can be further subdivided
according to the degree of abstractness of the part-whole relation which holds
between the two NPs.
The part-whole relation can be situated on a concrete, spatio-temporal
level (meronymy), as in the following example, which was also the only attested
example of this type in my existential corpus:
(9) In a room outside the court he talked with the French prosecuting counsel,
who showed him some of the evidence he was going to submit. There was
the shrunken head of a Polish boy whose crime had been that he had fallen
in love with a German girl. The head, mounted on a plaque like some
trophy of the hunt, had been found in a German official's house, used as an
ornament.
The use of the definite article in an example like this can be explained
straightforwardly in terms of an esphoric bridging relationship. Bridging is
defined by Martin (1992: 124) as a type of indirect reference in which the identity
of a part is recovered through an experiential connection which exists between
that part and another part of the same whole or between that part and the whole it
belongs to, or vice-versa. In the type of construction under discussion here the
bridging relationship is an esphoric one because it holds between the first and the
second NP within the same complex NP. NP2 introduces the entity Polish boy
into the discourse; it realizes presenting reference and consequently uses an
indefinite form (the indefinite article a). The first NP has presuming reference
and consequently uses a definite form (the definite article the); the identity of its
referent is recoverable by virtue of an experiential connection with the entity
introduced by the second NP: a head is part of (the body of) a boy.
Examples like these shed an interesting light on Martin's (1992) taxonomy
of retrieval types. Bridging and esphora are more intertwined than it would seem
at first sight. In fact, bridging is often the motivating factor for the use of a
definite form in a pseudo-definite, esphoric NP. In such cases, the esphoric nature
of the NP lies in the fact that the information necessary to identify its referent,
and thus to justify the use of a definite determiner, is to be found further on in the
same NP; the actual information is then retrievable through bridging from the
second NP.
In the majority of the cases attested in the corpus, the part-whole relation
was of a more abstract nature. A first group contains examples such as the
following:
(10) there was a limp wad of lettuce whose leaves glistened with a fine film of
oil; there was a clean piece of wood jutting out with a shining nail bent at
the end of it; there were several eggshells showing bits of yellow yolk;
there was the stump of a cigar bearing the marks of a man's teeth; and
there was a clump of fluffy dust freshly gathered from some floor
(11) It comes from a pleasant er beach in Cornwall I won't want to say exactly
where it is in case it affects the tourist potential of the beach but er on the
slope where that photograph was taken from there is the remains of an
<ZF1> old tin mine <ZF0> old tin mine <ZGY> and it so happens that
that particular tin mine had quite a lot of uranium in the ore <ZF1> as a
<ZF0> as a sort of by-product.
In these part-whole constructions, a certain spatio-temporal element is still
present. They are different from the type represented by example (9), however, in
that the part-whole relation which they symbolize implies a process. In all of the
examples of this type found in my corpus, the head noun of the first NP is
situated in the lexical field of remains. It is clear that remains are not an intrinsic
part of a whole; rather, they are the result of a process of change, often more
specifically a process of destruction or demolition. It is this process which is
implicitly evoked in this type of examples. In (10), for instance, a stump is what
is left of a cigar after it has been smoked; thus the process of smoking is implied.
In (11), the remains are what is left of the tin mine which has presumably been
abandoned and fallen into disrepair; thus the process of change which the mine
has undergone is activated. In order to explain the use of the definite determiner
in examples of this kind, it can be remarked that the process which is implicitly
activated is one that is strongly associated with, and possibly even collocationally
linked to (e.g. in the case of cigar and smoking), the referent introduced by the
formally indefinite second NP. Consequently, it can be assumed that the process
itself is evoked fairly automatically and easily in the hearer's mind. The first NP,
then, refers to the result or end product of this process and can do so with a
definite form since the process which has been evoked in the hearer's mind in its
turn implies a specific result. Instead of a relation between a material part and a
material whole, the more abstract part-whole relationship which is present in this
type of examples can be said to be one between a process and one of its phases.
A process is equally implied in the following examples, which are still
further removed from the concrete, spatio-temporal level:
(12) He added: 'We still don't know how many are required. I just wish there
were 20 games left. There's the making of a good team here. There is real
ability. In the first half we definitely suffered from tension after their goal.
But in the second half we looked good.'
(13) In this sample, he uses only a few, such as the s for plural, although there
is the beginning of a negative construction in 'no book' and the beginning
of a question form in 'where going?' although he does not yet have the
auxiliary verb added to the question.
In (12), a process such as creating a team is implied; in (13), the process which
is implicitly activated is that of the child acquiring (linguistic constructions).
The use of the definite determiner in the first NP in these examples can be
explained in the same way as for examples (10) and (11): the first NP refers to a
phase of the process associated with, and thus implied by, the referent of the
second NP.
In a number of similar examples, finally, the process is explicitly realized
in the form of a gerundive or a nominalization:
(14) There is the birth of healing and that may be a silly thing to say but I think
if I may be allowed to develop the theme I think that there is a p <ZF1> a
<ZF0> a feeling of healing and time passing in nineteen-ninety-six that
isn't again just locked into
(15) And I think if Mary Matalin wants to go out and raise those questions, she
could be doing the president an enormous disservice because there--there's
the beginning of discussion of--of his side of this too.
Another similar type of pseudo-definite NP complex is the one where a kinship or
family relation is the link between the referent of the first and the second NP.
Examples of this were only found in the corpus containing attributive clauses:
(16) Next month the Network is bringing Dr Kenneth Kaunda, the former
president of Zambia, to Scotland (he is the son of a Church of Scotland
minister) as part of the crusade to fight the 'cancer of debt' in the world's
poorest countries.
(17) The 7ft 1in centre, who has a home in West Hampstead, is the son of a
Nigerian diplomat but has lived in England since he was two.
In these cases, the definite article is again motivated by forward bridging and
again, a conceptual-associative link rather than a hyponymy or meronymy
relationship is the basis for the bridging: kinship nouns evoke in their conceptual
structure other people who fulfill certain roles in relation to the person they refer
to. For instance, a son is always someone's son; the concept son evokes in its
structure the concept of parents, or at least of one specific parent (mother or
father). The referent of the second NP corresponds to this concept.
2.3 General-Specific constructions
This third category contains a number of types of pseudo-definite NPs in which
the first NP is in some way more general than the second NP. A further
distinction has to be made between appositive and non-appositive cases of
general-specific pseudo-definite NPs. When the construction is appositive in the
sense of Van Langendonck (1999), the first NP categorizes the second NP. When
the construction is non-appositive, the relation between NP1 and NP2 is one of
symbolizing.
2.3.1 Appositive
In by far the majority of my corpus examples, the relationship between NP1 and
NP2 is one of apposition, and more specifically of restrictive or close apposition.
Van Langendonck (1999: 113) points out that in close appositional structures,
there are two appositives which together form an intonational unit and cannot
always be interchanged. Several subtypes can be distinguished, according to the
sort of noun that functions as the head noun of NP1.
nouns of modality
(18) By several accounts, there is the possibility of an Iraqi attack, either by
missile or by bomb on the air base--the allied air base where the US forces
are in Saudi Arabia at Dhahran.
(19) He reckons the beans cause wind, and he feels there's a danger the eggs
can be underdone and there's the chance of sickness.
phenomenal nouns
(20) At that moment there was the sound of a door opening.
(21) The crowd surged as the musicians pounded and whined. There was the
scent of sweat, and the stench of arrack on hot breath.
nouns denoting subject matter
(22) Fitness became a problem and, of course, there was the matter of personal
discipline - more off the park than on it.
(23) I think we certainly on the <ZZ1> place name <ZZ0> side felt that we had
to do something to stop this it you know wonderful police with <ZGY> I
<ZF1> don't <ZF0> don't disagree with that <M01> Mm. <M02> but
there is the question of sector policing er which would replace <ZZ1>
place name <ZZ0> when it goes <M01> Mm. <
other
(24) And for the two overall leaders after the semi-final stage of the
competition there is the bonus of a day out at the FA Cup final at
Wembley on May 11.
(25) Noosa is a great course it's fast and the climate is good. <p> And there's
the added motivation of a $25,000 car, so I'm giving it my best shot."
It is interesting to consider the semantic relationship between NP1 and NP2 in
this type of construction. Van Langendonck's (1999) remarks on close
appositions involving proper names (PNs) shed an interesting light on this. Van
Langendonck uses close apposition as a formal criterion for proper-namehood,
since he regards PNs as a semantic-syntactic class rather than as a word-class and
opposes them to proprial lemmas, lexemes which prototypically function as PNs
but can also be used in other ways. The element with the potentially most
specific reference (Van Langendonck 1999: 116) in a close apposition in which
two minimal nominal units are juxtaposed, sometimes with the help of the
apposition marker of (116) is the proper name. Interestingly, Van Langendonck
argues that the common noun element in a close apposition containing a PN
indicates the basic level category to which the proper name in question belongs
(Van Langendonck 1999: 113, 120f). For instance, in a close apposition like the
city of Antwerp, Antwerp is the unit with the most specific reference and therefore
functions as a PN. The other unit, the city, functions as a common noun and
indicates the basic level category which Antwerp belongs to. However, the
category indicated by the common noun in the appositional construction is not
necessarily the basic level category; especially in the case of personal name
appositions, it can also be a more specific category (e.g. Prime Minister Blair, the
Parisian Chirac, etc.).
Although Van Langendonck mentions minimal definite determination
(Van Langendonck 1999: 116) of the two units as a prerequisite for the close
apposition test for proper-namehood to work, the constructions under discussion
here seem to display similar features while the second unit has indefinite
determination. In all of the different subcategories of appositional pseudo-definite
complex NPs, the first NP categorizes the referent of the second NP in some way.
In example (20), for instance, the phenomenon referred to by NP2 is categorized
as a sound in NP1. In example (24), the day out referred to by NP2 is classified
as a bonus. These two examples also make clear that although NP2 normally
restricts the number of possibilities regarding possible categorizations, the
speaker has a certain amount of freedom and thus creative or rhetorical
categorization is possible to a greater or lesser extent. The categorization of a day
out as a bonus in (24) is an example of a fairly creative categorization on the
part of the speaker, whereas the categorization of the phenomenon referred to as a
sound in (20) seems to be more determined by the nature of the referent that is
being categorized. In the cases where the head noun of the first NP is a noun of
modality, the categorization also very often has a creative or rhetorical character.
The fact that the categorizing NP has definite determination not only when the
categorized NP is definite (for instance in the case of PNs) but even when the
second NP is indefinite (in the constructions under discussion here) allows us to
conclude that in English, category indications appear to be typically definite,
irrespective of the definiteness status of the NP which is being categorized.
Another important question regarding this type of pseudo-definite
construction pertains to the motivation of the use of the definite determiner in the
first (categorizing) appositive even though the second (categorized) appositive is
indefinite. There are still two different possibilities here. In a first type of
construction, N1 functions as the Head of the construction while the second NP is
a Postmodifier indicating a subtype of the type designated by N1:
(26) Then the low whine of the vacuum cleaner came to his ears, and when it
stopped there was the musical flow of water in the bathroom.
(27) There was the smell of pot all over the apartment. [quoted in
Woisetschlaeger 1983: 142]
In these cases, the pseudo-definite status of the whole NP complex can be
explained in terms of dual reference. On the one hand, there is (definite)
reference to a generic concept. Woisetschlaeger (1983: 142) observes that in
examples such as these, some generic concept having narrow enough
specifications to qualify for prior identification accounts for the definite form of
the NP. Using different terminology, what is identifiable in these examples is the
type of thing talked about. On the other hand, there is (implied) reference to a
new instance of this known type: definiteness, and the attendant existential
presupposition, attaches to the concept referred to by the generic, while the
existence claim introduced by existential there attaches to some instantiation of
the generic concept (Woisetschlaeger 1983: 143). It should be noted that dual
reference is no longer present and that only a reading in terms of instantial
reference is possible in case the definite article in the first NP is replaced by an
indefinite article:
(27') There was a smell of pot all over the apartment.
The second type of appositive construction has a different internal structure. In
this type, N2 rather than N1 functions as the Head of the construction and the first
NP is a Premodifier to this Head. In many of these cases, NP2 contains a
participle:
(28) Suddenly, from nowhere, there was the sound of a very fast Forbes saying
that all Americans need only pay a single income tax rate of less than 20
per cent. The middle classes should be given a big break.
In these cases, the definite determiner has to be explained in terms of forward
bridging rather than dual reference. This type of pseudo-definite is thus, again,
esphorically motivated; the first NP points forward to the second NP, in the
sense that the information needed to identify the referent of the first NP is given
in the second NP, which introduces an instance into the discourse. Note that an
alternative in which NP1 has an indefinite article is not possible here:
(28') *There was a sound of a very fast Forbes saying that all Americans need
only pay a single income tax rate of less than 20 percent.
It seems, then, that pseudo-definite appositive constructions can be put on a cline
regarding their internal organization and their referential status. At the far ends of
the cline, there are the two different readings, one in terms of dual reference
(where N1 is the Head and NP2 functions as a Postmodifier) and the other in
terms of instantial reference with forward bridging from the Premodifier NP1 to
the Head NP2. Some examples clearly allow for only one of these two possible
readings, while others are more or less ambiguous. It seems that the more specific
and instance-like NP2 is, the more fixed the definiteness of NP1 (recall the
impossibility of a paraphrase with the indefinite article) and the more natural a
reading in terms of forward bridging becomes.
Finally, a special subclass of appositive pseudo-definite NPs is formed by
the cases in which the first appositive indicates a measure or degree:
(29) The title tells it all, and there's the flavour of Whiskey Galore and The
Titfield Thunderbolt about the movie, which offers chuckles and beautiful
Welsh locations as a group of villagers insist on their hill" officially
recorded as a mountain the very first mountain in Wales.
(30) Every time there was a lull, every time there was the hint of an opportunity
for any Tory to giggle at her personally, in came the trolleys again rank
upon rank of them, as patients queued while wicked Conservatives `tore
the National Health Service limb from limb
As is clear from examples such as (29) and (30), lexical extension mechanisms
such as metaphor play a role in these cases: a phenomenal noun occurs as the
head noun of NP1, but a literal categorization of the referent of the second NP
in terms of this type of phenomenon is not intended.
2.3.2 Non-Appositive
The last type of pseudo-definite NPs occurring in my corpus includes the ones in
which a symbolic relation, and more specifically a relation of representing or
depicting, exists between the referents of NP1 and NP2.
(31) Turn your back on the rock and follow the coastal path the other side of
the church to the covered fontaine de St They. There is the statue of a saint
in one niche and, until a few years ago, the other contained a stone,
apparently also revered, showing that the old practices of Morgan's people
have not wholly faded away.
(32) There was the wedding picture of a young black couple among his papers.
[quoted in Woisetschlaeger 1983, example 15f]
This type of pseudo-definite NP allows for alternation with a formally indefinite
NP, containing an indefinite article:
(31') There is a statue of a saint in one niche.
(32') There was a wedding picture of a young black couple among his papers.
Moreover, in Dutch, a language which is typologically closely related to English,
similar constructions allow only for the indefinite article; a pseudo-definite
variant is not possible in examples like:
(33) Er staat een beeld van een heilige in de ene nis.
'There stands a statue of a saint in the one niche.'
The motivation for the use of the definite article in the first NP is, again, a
forward bridging relation between the two NPs in the NP complex. As has
already been remarked earlier, a relationship of collocation or association is a
possible basis for bridging. In these cases, the concepts evoked by NP1 and NP2
are strongly associated with each other; we can therefore assume that one concept
evokes the other fairly automatically. Note that in this type of construction, the
second, and not the first noun, is the node of the collocation. However, these are
still cases of forward bridging, because the definite article is motivated by, and
thus points forward to, the second NP.
3. Conclusion
The category of esphoric reference as it has been defined and discussed in the
literature (Martin 1992) covers a number of constructions which are still quite
different in nature. First of all, truly definite NPs may involve esphora, in that
the information needed to identify the referent of the NP is present in the NP
itself, for instance in the form of a restrictive relative clause.
Besides these real definites, there are a number of pseudo-definite
types of esphoric NPs, which, although they show formal signs of definiteness,
realize indefinite reference to instances that are being introduced into the
discourse. This paper has zoomed in on these constructions and has studied them
in an environment which specifically excludes truly definite NPs, viz. the
postverbal position in the unmarked type of existential sentences. On the basis of
a corpus analysis, a classification of the different types of pseudo-definite NPs
was made. The construction types realized by the pseudo-definite NPs and the
semantic relation between NP1 and NP2 formed the basis of the classification. An
important question that was asked for each type concerned the motivation of the
use of the definite article in the first NP of the pseudo-definite NP complexes:
why does the first NP contain a definite determiner even though the unit it is part
of really realizes indefinite reference? It turns out that there are two basic
explanations for this.
The first possible motivation is what I have called, using a term from
Ward and Birner (1995), dual reference. The definite article can in that case be
explained by definite reference to a generic concept, to a known type. The NP
complex is, however, only pseudo-definite and can hence take the postverbal
position in an unmarked existential, because at the same time there is reference to
new instances, which are being introduced into the discourse. The reference to
new instances is a pragmatic inference which must be made due to the particular
nature of the grammatical environment of the unmarked existential, which does
not allow true definites.
The second explanation is a relation of what I have termed forward
bridging within the NP. The first NP in the NP complex takes a definite article
because its referent is identifiable by virtue of a bridging relationship to the
information supplied in the second NP. The definite article is thus esphorically
motivated, with bridging as the ultimate foundation for the esphoric relationship.
The basis for the bridging relation may in its turn be hyponymy or meronymy or,
alternatively, a collocational or associative link.
Notes
1. The term attributive is used by Halliday (1967, 1985, 1994); an equivalent
term, used by Declerck (1988), among others, is predicative clause.
2. All the examples quoted in this paper, except when otherwise indicated, are
extracted from the COBUILD corpus via remote log-in and are reproduced
here with the kind permission of HarperCollins publishers.
3. Unlike in other types of constructions, where the pragmatic inference to (new)
instances is optional and therefore creates vagueness. For instance, in They
decided to bomb various sorts of targets, there is certainly and primarily
reference to different (sub)types of targets (e.g. military target, civilian target,
etc.). On a second level of interpretation, then, specific instances of targets
may be evoked as well (e.g. a military base in Kabul, a residential area in
Bagdad, etc.). However, the sentence is perfectly acceptable without this
inference; in existential contexts, on the other hand, ungrammaticality arises
when the interpretation in terms of instances is not activated.
References
Davidse, K. (1999), The semantics of cardinal versus enumerative existential
constructions, Cognitive Linguistics 10(3), 203-250.
Davidse, K. (2000), A constructional approach to clefts, Linguistics 38-6, 1101-
1131.
Davidse, K. (2001), Postdeterminers: their secondary identifying and quantifying
functions. Preprint no. 177, Linguistics Dept., K.U. Leuven.
Declerck, R. (1988), Studies on Copular Sentences: Clefts and Pseudo-Clefts.
Leuven: Leuven University Press and Foris Publications.
Du Bois, J. (1980), Beyond Definiteness: the trace of identity in discourse, in
W. Chafe (ed.), The Pear Stories: cognitive, cultural and linguistic aspects
of narrative production, 203-274. Norwood: Ablex.
Halliday, M.A.K. (1967), Notes on Transitivity and Theme in English, Journal
of Linguistics 3 (1), 37-81.
Halliday, M.A.K. (1994), Introduction to functional grammar. 2nd Ed. London:
Arnold.
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London: Longman.
Langacker, R.W. (1991), Foundations of cognitive grammar. Vol. II: Descriptive
Application. Stanford: Stanford University Press.
Lumsden, M. (1988), Existential Sentences. Their Structure and Meaning.
London: Croom Helm.
Martin, J. (1992), English text: System and Structure. Amsterdam: Benjamins.
Milsark, G. (1976), Existential Sentences in English. Bloomington: Indiana
University Linguistics
Milsark, G. (1977), Toward an explanation of certain peculiarities of the
existential construction in English, Linguistic Analysis 3: 1-29.
Van Langendonck, W. (1999), Neurolinguistic and syntactic evidence for basic
level meaning in proper names, Functions of Language 6.1: 95-138.
Ward, G. and B. Birner (1995), Definiteness and the English existential,
Language 71, 4: 722-742.
Woisetschlaeger, E. (1983), On the question of definiteness in 'an old man's
book', Linguistic Inquiry 14, 1: 137-154.
Why an angel rides in the whirlwind and directs the storm: A
corpus-based comparative study of metaphor in British and
American political discourse.
Jonathan Charteris-Black
University of Surrey
Abstract
This paper compares choice of metaphor in two political corpora: the Inaugural
speeches of American Presidents and party political manifestos of two British
political parties during 1974-1997. Initially metaphors are classified according
to their source domain; they are then analysed from a cognitive semantic
approach. The major findings are that metaphors from the domains of conflict,
journeys and building are common to both corpora. However, the British corpus
includes metaphors that draw on the source domain of plants whereas the
American corpus contains metaphors that draw on source domains such as fire
and light and the physical environment that do not occur in the British corpus.
These variations suggest differences in metaphors between British and American
political discourse and provide insight into cultural differences.
The cognitive analysis reveals the importance of the conceptual metaphors
POLITICS IS CONFLICT, PURPOSEFUL SOCIAL ACTIVITY IS TRAVELLING ALONG A PATH
TOWARD A DESTINATION and A WORTHWHILE ACTIVITY IS A BUILDING in both
corpora. However, SOCIAL PURIFICATION IS HEAT and A SOCIAL CONDITION IS A
WEATHER CONDITION occur only in the American corpus. There is some evidence
that British political discourse has borrowed metaphors based on the concept
POLITICS IS RELIGION from American political discourse.
1. Introduction
In this paper I am interested in exploring variation in metaphor choice within the
domain of politics by comparing the metaphors found in a corpus of American
presidential speeches with those found in a corpus of British party political
manifestos. Speeches and political manifestos are both types of political discourse
with the shared function of persuasion; therefore we may anticipate some overlap
in metaphor use. However, it may also be expected that differences in the
historical traditions and cultures of Britain and the United States may lead to
differences in the types of metaphors that are selected to attain similar rhetorical
objectives within political discourse.
2. Political speeches, manifestos and metaphor
Metaphor plays an important rhetorical role in persuasive language because it has
the potential to exploit the associative power of language in order to provoke an
emotional response on the part of the hearer. In the domain of politics metaphors
have a crucial role since they combine a discourse function of communicating
policy with an expressive function of persuasion both as regards evaluating
policy and evaluating the reliability and integrity of politicians. Metaphors are
valuable because they facilitate the exploration of possible political objectives
while not committing the speaker to any of these and because they encourage
affective involvement of the type sought by successful political leaders. Using
Halliday's terms they therefore combine ideational and interpersonal functions of
language (Halliday 1994).
One of the main difficulties in comparing British and American political
discourse is that there are not identical text types in the conventions of the two
political systems. In the British system the policy statements that lay out the
intentions of a political party are communicated in written manifestos published
prior to an election. In the American system one means of communicating policy
statements is the Inaugural address delivered in the January following the
election of a new President; these addresses receive wide media coverage because
they set out policy for the duration of the administration. The rationale for
comparing these two text types is that they broadly share a common communicative
purpose of persuading the voting public of the value of the government's intended
policy.
Manifestos are documents stating the intentions and policies of political
parties that have the communicative function of persuading the electorate to vote
for a political party. They are usually generated through collaborative processes
of drafting and redrafting and entail multiple authorship. They provide phrases
that may subsequently be used as slogans in speeches. Political speeches differ
from many other types of spoken discourse because there is usually some use of
pre-prepared written script; they are characterised by a higher degree of planning
than is normal in spontaneous speech and scripts are often prepared by a team of
ghost writers. Therefore, manifestos and political speeches are similar in that
they both involve collaborative planning. They also share the communicative
purpose of persuasion both as regards the ideology of a political party and as
regards the integrity and values of politicians.
My view of metaphor originates in Richards' (1936) 'tensile' view of
metaphor, in which metaphor refers to the semantic tension arising from a shift
in the use of a word from one domain or context to another. Metaphors are not
inherent in word forms but arise from the relationship between words and their
contexts. Excellent summaries of metaphor are available in Black (1962), Gibbs
(1994), Goatly (1997), Ortony (1979) and Lakoff and Johnson (1980, 1999).
Some of the background issues relating to defining metaphor are discussed in
Charteris-Black (2000), Charteris-Black and Ennis (2001), Charteris-Black and
Musolff (2003) and Charteris-Black (2004).
3. Research method
The aims may be summarised by the following research question:
What are the similarities and differences between the metaphors employed in
American inaugural speeches and British election manifestos?
Two corpora, one American and one British, were used to assist in answering
this question. The first comprised the 51 Inaugural addresses of American
Presidents spanning approximately 200 years from George Washington to Bill
Clinton and was 98,237 words in length. The second comprised the party political
manifestos of the Labour and Conservative Parties
in the period 1945-1997 inclusive and was 132,775 words in length.1 For
convenience I will refer to the corpus of Inaugural speeches as the American
corpus and the corpus of political manifestos as the British corpus.
While diachronic variation was not the main focus of this study, it is
possible that some of the differences observed between the two corpora may be
partially attributed to the difference in the time period they cover. However, since
Inaugural speeches often refer intertextually to the speeches of earlier presidents
it was considered acceptable to treat them as a coherent and homogeneous body
of texts. In the same way, the structure of the British party manifestos has
remained relatively unchanged during the period covered and so this was also
considered a coherent body of texts. In both cases the function of language
combines communication of ideas with persuasion and this similarity of
communicative purpose makes them comparable genres. However, diachronic
factors could be taken into account in interpreting the findings and may indeed
form the central focus of future research in this area.
The methodology combines qualitative with quantitative approaches.
Initially, qualitative analysis of a sample of each corpus revealed a set of words
that have the potential to be used as metaphors. Identification of metaphor was
based on the definitions discussed above. The procedure was to analyse
potentially metaphorical linguistic forms in the two corpora to establish whether
on each occasion of use they should be classified as metaphor.
For example, 'windfall' typically refers to apples blown down by the
wind; however, it is also found in New Labour discourse in expressions such as
'windfall tax'; this innovative use in a political context is the basis for
identifying all uses in this corpus as metaphor. Admittedly, for some speakers
'windfall' may more commonly be used in the context of taxation than that of
fruit or the weather; however, there are invariably subjective issues of
variation between language users that influence metaphor identification. A
further example from the American corpus is that words such as 'path' and 'step'
may be used as metaphors that draw on the domain of journeys (see example 9);
however, the President sometimes refers to the white steps on which he is
standing while speaking. Evidently, such a use of 'step' refers literally to the
steps of the White House. It is necessary to examine each of the contexts of
words and phrases that have the potential to be used metaphorically to establish
whether there is the presence or absence of the semantic tension that is the
basis for their classification as actual metaphors.
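The context-by-context check described above can be supported by a simple
keyword-in-context (KWIC) listing. The following is a minimal Python sketch and
not the procedure actually used in the study, which does not name its
concordancing tool; the function name kwic, the candidate list and the two
sample sentences are invented for illustration and are not drawn from either
corpus.

```python
import re


def kwic(text, candidates, width=40):
    """Return (word, left context, match, right context) for every hit.

    `candidates` is a list of potentially metaphorical word forms
    (an assumption for illustration; the study's own list is not given).
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in candidates) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((m.group(0).lower(), left, m.group(0), right))
    return hits


# Invented sample sentences, not taken from either corpus.
sample = (
    "We have taken the first step on the path to peace. "
    "I stand on the white steps of this building."
)

for word, left, match, right in kwic(sample, ["step", "steps", "path"]):
    print(f"{left:>40} [{match}] {right}")
```

Each hit still has to be read and classified by the analyst; the listing only
gathers the contexts in which semantic tension may or may not be present.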
Metaphors were classified according to the lexical fields of their linguistic
forms; these are generally referred to as source domains. It was then possible to
undertake further qualitative analysis to propose conceptual bases for metaphor
clusters using a cognitive semantic framework (cf. Lakoff and Johnson 1980,
1999). So, for example, a conceptual metaphor LIFE IS A JOURNEY could be said to
motivate uses of 'step', 'path' etc. when occurring in a context that does not refer
to physical movements in space.
I established the resonance of source domains for metaphors using a
simple statistical measure: first the types and tokens of the metaphors are
calculated. Types are separate unique linguistic forms while tokens are the
number of times each form occurs irrespective of whether it has already occurred;
tokens include repetitions of identical linguistic forms whereas types do not.
Then the total number of types for each lexical field, or source domain, is
multiplied by the total number of tokens in that source domain; this provides a
measure of its resonance. This can then be converted to a percentage by dividing
the resonance of each source domain by the total of the resonances for all the
source domains. This calculation overcomes the problem of the difference in the
size of the two corpora; it also facilitates comparison of source domains in a way
that takes into account their productivity in terms of the types and tokens of the
metaphors that they produce. It permits identification of similarities and
differences between the productivity of metaphor source domains within a single
corpus and between different corpora.
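The resonance measure just described can be illustrated with a short Python
sketch. The function name resonance_table and the dictionary layout are
assumptions made for illustration; the type and token counts are those reported
for the American corpus in Table 1.

```python
def resonance_table(counts):
    """Map each source domain to (resonance, percentage share).

    `counts` maps a source domain to a (types, tokens) pair.
    Resonance is types multiplied by tokens; the share is that
    resonance divided by the summed resonance of all domains.
    """
    resonance = {d: types * tokens for d, (types, tokens) in counts.items()}
    total = sum(resonance.values())
    return {d: (r, 100 * r / total) for d, r in resonance.items()}


# Type and token counts for the American corpus as given in Table 1.
american = {
    "Conflict": (18, 116),
    "Journeys": (12, 76),
    "Buildings": (12, 66),
    "Fire & light": (15, 51),
    "Physical environment": (16, 35),
    "Religion": (6, 72),
    "Body part": (4, 76),
}

for domain, (res, share) in resonance_table(american).items():
    print(f"{domain:<22} {res:>6,} {share:5.1f}%")
```

The shares are left unrounded here, so they need not coincide exactly with the
whole-number percentages in a published table.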
4. Findings
I will first present an overview of the findings then address each part of the
research question. The findings as regards the resonance of metaphor source
domains are shown in Table 1.
From the bottom row we can see that more than twice as many types of
metaphor were identified in the American corpus as compared with the British
one. However, the larger British corpus contained many more tokens of metaphor
indicating that in the British corpus there is a tendency for metaphors to repeat the
same linguistic forms. There are areas of similarity in metaphor domains between
the two corpora since conflict, journeys and buildings are the three most resonant
lexical fields in each corpus. These domains account for 66% of total resonance
in the American corpus and 89% of total resonance in the British corpus;
however, there are also variations in metaphor use.
Table 1. Comparison of metaphor types in British and American political corpora
classified by source domain
                      American corpus                       British corpus
Source domain         Types  Tokens  Resonance  % of total  Types  Tokens  Resonance  % of total
Conflict                 18     116      2,088          36      9     494      4,446          54
Journeys                 12      76        912          16      5     187        935          11
Buildings                12      66        792          14      7     287      2,009          24
Fire & light             15      51        765          13      -       -          -           -
Physical environment     16      35        560           9      -       -          -           -
Plants                    -       -          -           -      5     150        750           9
Religion                  6      72        432           7      4      46        184           2
Body part                 4      76        304           5      -       -          -           -
Total                    83     492      5,853         100     30   1,164      8,324         100
Resonance = types x tokens.
First, we may notice that while conflict is the most common lexical field
for metaphor in both corpora it is more resonant in the British than in the
American Corpus. Perhaps this may be explained by the combative discourse
function of the pre-election party political manifesto as compared with the
postelection inaugural speeches where there is less need to combat a defeated
opposition party.
Another interesting distinction is that while journey metaphors are more
common in the American Corpus, building metaphors are more common in the
British Corpus where they account for nearly a quarter of all metaphors. This is
an interesting difference in resonance that can be explained with reference to
different cultural experiences: new experiences arising from journeys are salient
for Americans while the sense of security and solidity arising from buildings is
salient for the British.
A few lexical fields occurred in only one of the corpora; for example,
metaphors based on fire and light were only found in the American corpus. Fire
and light metaphors often convey idealism in the American corpus, while in the
British corpus religious metaphors such as 'vision' are used for this purpose.
Another major difference between the two political corpora is that the
American corpus tends to employ physical environment metaphors in situations
where plant metaphors are employed in the British corpus (cf. example 10
below); in each case these two lexical fields constitute 9% of the total resonance.
This may have a cultural explanation in that gardening is a major pastime in
British society; the garden is a domain of private but external space. Conversely,
American cultural and historical experience draws on undomesticated space and
this is reflected in the use of words such as 'valley', 'horizon', 'jungle',
'mountain' or 'desert' in a political context. For the majority of English people
nature is conceived as something to be controlled by physical intervention,
whereas for Americans nature is conceived as larger and more elemental and to be
controlled by travel. Such metaphor preferences demonstrate the influence of cultural
practice and the physical environment on metaphor use.
In answering the first part of the research question that refers to
similarities I will consider the three lexical fields for metaphor that were common
to both corpora.
5. Metaphors common to both American and British political discourse
Conflict metaphors
Metaphors from the lexical field of conflict were originally identified in relation
to spoken language in terms of debate and represented as ARGUMENT IS WAR (cf.
Lakoff and Johnson 1980). Conflict is the most common lexical field in both
corpora and provides evidence of a conceptual metaphor POLITICS IS CONFLICT. I
suggest that metaphors of conflict are chosen to emphasise the personal sacrifice
and physical struggle that is necessary to achieve social goals while subliminally
creating the opportunity for positive evaluation of actual military conflict. Table 2
shows some examples from the American corpus.
Nearly all the conflict metaphors have a very similar rhetorical pattern: in
pragmatic terms the choice of a conflict metaphor determines the nature of the
speaker's evaluation. The conflict is either for abstract social goals that are
positively evaluated such as rights, freedom, faith etc. or against social
phenomena that are negatively evaluated such as poverty, disease, injustice etc.;
these social ills are conceptualised as enemies. In addition, the stages through
which social progress is to be made are conceptualised in terms of the stages of a
military action: the trumpet that calls to action, attack, retreat, truce and eventual
victory or surrender. POLITICS IS CONFLICT implies an isomorphic relationship
between the domains of politics and war.
Similarly, in the British corpus both parties defend abstract social goals
that are positively evaluated by their own party but, perhaps because of the
combative function of the election manifesto, imply that such goals are under
threat from political opponents:
(1) We will defend the fundamental right of parents to spend their money on
their children's education should they wish to do so. (Conservative)
(2) While continuing to defend and respect the absolute right of individual
conscience (Labour)
Table 2. POLITICS IS CONFLICT: the American corpus
Conflict  Tokens  Examples
lexicon
enemies   10      we will fight our wars against poverty, ignorance, and
                  injustice - for those are the enemies against which our
                  forces can be honorably marshaled. (Jimmy Carter)
destroy   8       wise and correct course to follow in taxation and all other
                  economic legislation is not to destroy those who have
                  already secured success but to create conditions
                  (Calvin Coolidge)
victory   6       Every victory for human freedom will be a victory for world
                  peace. (Ronald Reagan)
struggle  6       as a call to battle, though embattled we are - but a call to
                  bear the burden of a long twilight struggle, year in and
                  year out, "rejoicing in hope, patient in tribulation"
                  (John F. Kennedy)
fight     3       We will be ever vigilant and never vulnerable, and we will
                  fight our wars against poverty, ignorance, and injustice
                  (Jimmy Carter)
trumpet   3       We have heard the trumpets. We have changed the guard.
                  (Bill Clinton)
battle    3       We know the race is not to the swift nor the battle to the
                  strong. Do you not think an angel rides in the whirlwind and
                  directs the storm? (George W. Bush)
Parties may also defend social institutions or groups in society that are positively
evaluated:
(3) Labour created the National Health Service and is determined to defend it.
(Labour)
(4) We will continue to defend farmers and consumers. (Conservative)
However, while defence metaphors are used in similar ways by both the major
British parties, attack metaphors are used rather differently. The following
examples show that the Labour party fights against or attacks a general range of
social ills while the Conservative party defends social virtues:
(5) Economic success is not an end in itself. For the Labour party, prosperity
and fairness march hand in hand on the road to a better Britain. During the
next Parliament, we intend to continue our fight against all form of social
injustice. (Labour)
(6) We will fight against crime and violence which affects all Western
societies ... At the same time, we shall attack the social deprivation which
allows crime to flourish. (Labour)
(7) We have to compete to win. That means a constant fight to keep tight
control over public spending and enable Britain to remain the lowest taxed
major economy in Europe. It means a continuing fight to keep burdens off
business. (Conservative)
(8) We will continue to fight for free and fair trade in international
negotiations. (Conservative)
When 'fight' is used in a Conservative manifesto it signifies defending something
that is represented as under attack from Labour. Conversely, 'fight' is used in
Labour manifestos to represent its policies as an attack on negatively evaluated
social ills, the cause of which is not usually identified.
One explanation of the extensive use of metaphors related to the
conceptual metaphor POLITICS IS CONFLICT in both corpora is that decisions about
whether or not to engage in political conflict are perhaps the most important
decisions that politicians have to make. This is particularly the case in the USA
where the President has the sole right of veto over whether or not nuclear missiles
should be launched. This is reflected in the use of the word 'defend' in the
presidential oath: clearly the American constitution emerged out of conflict both
against external political enemies (Britain and France) and internal enemies (the
native inhabitants, later the Civil War). Interestingly, terrorism can be defined
either as an external or as an internal threat. With the continuing importance of
American military involvement as a basis for international power, and the role of
Britain in legitimising post-colonial dominance, it is of little surprise that conflict
remains a highly potent source domain in both American and British political
discourse. When metaphors based on POLITICS IS CONFLICT are used by
democratic politicians, they treat the struggle against social ills in the language
usually reserved for military conflict. In subliminal terms this creates the potential
for military combat to be represented as socially beneficial (cf. Lakoff 1991,
Jansen and Sabo 1994).
Journey metaphors
Journey metaphors have quite a long history in cognitive linguistic research.
Originally, Lakoff and Johnson (1980: 44) proposed LOVE IS A JOURNEY; Lakoff
and Turner (1989) then proposed LIFE IS A JOURNEY. A more generic
representation is PURPOSES ARE DESTINATIONS (Lakoff and Johnson 1999: 52-53).
In these representations a journey is taken as a prototype purposeful activity
involving movement in physical space from a starting point to an end point or
destination. Since politicians are concerned with goal-oriented social activity, I
propose a similar representation: PURPOSEFUL SOCIAL ACTIVITY IS TRAVELLING
ALONG A PATH TOWARD A DESTINATION. Normally, journey metaphors evaluate
policies positively because the ends are socially valued ones.
A single conceptual metaphor can account for the semantic coherence of a
whole speech; this is evident from an analysis of Lyndon Johnson's inaugural
address (January 20th 1965). The theme of this speech is change; the sentences in
example 9 are chosen from a total of thirty-four sentences and their positions are
shown in parentheses:
(9) Even now, a rocket moves toward Mars. (4)
They came here - the exile and the stranger, brave but frightened - to
find a place where a man could be his own man. (7)
First, justice was the promise that all who made the journey would share in
the fruits of the land. (8)
Think of our world as it looks from the rocket that is heading toward Mars.
It is like a child's globe, hanging in space, the continents stuck to its side
like colored maps. We are all fellow passengers on a dot of earth. (16)
For this is what America is all about. It is the uncrossed desert and the
unclimbed ridge. It is the star that is not reached and the harvest sleeping
in the unplowed ground. Is our world gone? We say "Farewell." Is a new
world coming? (31)
To these trusted public servants and to my family and those close friends
of mine who have followed me down a long, winding road, (32)
The journey metaphors attempt to evoke the original historical experience of the
Pilgrim Fathers (7 and 8); the opening up of the American west (31) and the
space programme (4 and 16). These are integrated with the more general use of
journey metaphors to describe human relationships as implied by Lakoff and
Johnson's LIFE IS A JOURNEY metaphor, as in (32). We can see from the
distribution of these metaphors that they form a path through the text inviting the
listener to participate in a journey. In metaphorical terms the President is
represented as a guide; since only the guide knows the destination, the speech
provides a type of map towards it. Since journeys may be to unknown
destinations, the choice of this conceptual basis has the rhetorical goal of
persuading the American people to accept innovation and social change.
However, travel can be slow and arduous because of impediments to
movement and hence there will be barriers to overcome and burdens to bear.
Lakoff and Johnson (1999: 188) represent metaphoric use of words such as
'burden' as DIFFICULTIES ARE IMPEDIMENTS TO MOVEMENT. In a political context
these metaphors express the need for patience since it takes time and effort to
reach a destination. This is rhetorically effective because it implies that the
electorate should not expect instant results and that, at times, they may need to
suffer to achieve goals; it also implies that hardships are to be tolerated because
these goals are worthwhile. The extracts shown in the following examples all
share the notion of a burden whose weight should be endured (or ignored
altogether) because of the value placed on the destination:
(10) indeed all free men, remember that in the final choice a soldier's pack
is not so heavy a burden as a prisoner's chains. (Dwight Eisenhower)
(11) Let us accept that high responsibility not as a burden, but gladly - gladly
because the chance to build such a peace is the noblest (Richard Nixon)
Building metaphors
Metaphors from the source domain of building are typically evaluative, carry a
strong positive connotation and are employed to express aspiration towards
desired social goals such as peace, democracy and progress towards a better
future. They emphasise social cohesion, social purpose and control of one's
environment. These metaphors can be divided into two types: those that refer to
the parts of a building ('foundations', 'threshold', 'doors', etc.) and those
that refer to types of building such as 'house' or 'bridge'.
The most frequent part of a building that is used metaphorically is the
foundations. In such metaphors an abstract phenomenon is positively evaluated
which permits us to infer a conceptual representation A WORTHWHILE ACTIVITY IS
A BUILDING. Laying foundations is a conventional metaphor for a solid and
valuable policy although it may not in fact be taken through to completion. We
know that any building which is to be durable must first have foundations and
that these may take a long time to construct; however, we also know that the
laying of foundations does not necessarily imply the completion of a building. If
the money to buy materials or to pay builders runs out then the building will not
be built. So in reality it is very difficult to predict the extent to which laying
foundations will guarantee the successful completion of a construction.
Building metaphors make an interesting comparison with journey
metaphors. Building and travelling are conceptually related, as they are both
activities in which progress takes place in stages towards a predetermined goal.
Topographically, both involve increase in the surface that is covered; in the case
of journeys this is linear movement along a horizontal path whereas for buildings
there is three-dimensional increase along a vertical path. Both activities highlight
the need for patience since they require time and effort. Difficulties entail a need
to make sacrifices and not to expect instant outcomes. Since we think of
achieving goals as inherently good, in pragmatic terms, both journey and building
metaphors imply a positive evaluation of political policy. They require a plan or
map and an architect or guide, and it may be that this conceptual proximity
accounts for their resonance in the corpus.
I will now address the second part of the research question by considering
metaphors that only occurred in one of the corpora.
6. American corpus: light and fire metaphors
The analysis suggests that light & fire metaphors are particular to American
political discourse. The lexical field of light has traditionally been linked with the
target domain of understanding and metaphors that draw on it are motivated by a
conceptual metaphor KNOWING IS SEEING (cf. Lakoff and Johnson 1999: 53-54).
However, for this political data I suggest a conceptual metaphor HOPE IS LIGHT
that invariably implies a positive evaluation. It is likely that spiritual notions will
be evoked because of the importance of hope in religious discourse. As we can
see from Table 3, light is contrasted with darkness that is associated with
ignorance, failure to understand and evil.
Light is always positive because of its polarity with darkness. In other
circumstances fire metaphors can also be used for positive evaluation. This is
because George Washington first used the fire metaphor in an inaugural address
and the metaphorical link between fire and liberty has become a source of
intertextual reference in presidential addresses as we can see from the following
examples:
(12) since the preservation of the sacred fire of liberty and the destiny of the
republican model of government (George Washington)
(13) The preservation of the sacred fire of liberty and the destiny of the
republican model of government are justly (Franklin D. Roosevelt)
(14) He would extinguish the fire of liberty, which warms and animates the
hearts of happy millions (James Polk)
Fire is represented as the guarantor of liberty. This may be because it implies that
some form of burning or destruction will be necessary: this is in keeping with
America's revolutionary wars, struggle for independence and Civil War. In these
metaphors conflict can be represented as a means to peace. Consider the
following examples:
Table 3. Light and fire metaphors (American Corpus)
Light & Fire    Tokens  Examples
lexicon (n=15)
light           15      I have spoken of a thousand points of light, of all the
                        community organizations that are spread like stars
                        throughout the Nation, doing good. (George Bush)
dark            8       Finally, to those nations who would make themselves our
                        adversary, we offer not a pledge but a request: that
                        both sides begin anew the quest for peace, before the
                        dark powers of destruction unleashed by science engulf
                        all humanity in planned or accidental self-destruction
                        (John F. Kennedy)
fire(s)         7       and since the preservation of the sacred fire of liberty
                        and the destiny of the republican model of government
                        are justly considered, perhaps, as deeply, as finally,
                        staked on the experiment entrusted to the hands of the
                        American people. (George Washington)
bright          4       These principles form the bright constellation which has
                        gone before us and guided our steps through
                        (Thomas Jefferson)
dawn            3       so that together, we can see the dawn of a new age of
                        progress for America, and together, as we celebrate our
                        200th anniversary as a nation, (Richard Nixon)
beacon          2       We will again be the exemplar of freedom and a beacon of
                        hope for those who do not now have freedom.
                        (Ronald Reagan)
(15) And it is imperative that we should stand together. We are being forged
into a new unity amidst the fires that now blaze throughout the world. In
their ardent heat we shall, in God's Providence, let us hope, be purged of
faction and division (Woodrow Wilson)
(16) Mill fires were lighted at the funeral pile of slavery. (Benjamin Harrison)
In such cases fire originates in Washington's fires of liberty and provides
evidence of the conceptual metaphor SOCIAL PURIFICATION IS HEAT. Therefore,
different aspects of the source domain are highlighted in particular choices of
metaphor. For example, it seems that when words such as 'kindled' or 'flames'
are used metaphorically to convey notions of anger, it is the speed and rate of
burning that are important rather than heat. In this corpus heat is a positive rather
than a negative attribute of fire because it is associated in scientific senses with
the notions of purification (as when impure metals are converted to pure ones by
the application of heat). Similarly, fire metaphors are also positive when they
highlight the quality of fire to produce light as in metaphorical uses of 'beacon'.
In this respect it depends on which aspect of the source domain is highlighted
whether a President conveys a positive or negative evaluation. Such malleability
makes fire a useful and potent cognitive domain as it can combine different
aspects of our knowledge of an element to convey an evaluation that is
appropriate to a specific discourse context. Similarly, light and darkness provide
prototype poles for creating contrasts between spiritual or moral notions of
goodness and evil. Light and fire metaphors therefore share both a cognitive and
pragmatic role in American political discourse.
7. British corpus: plant metaphors
Metaphors from the domain of plants are an important group comprising 9% of
all metaphors in the British Corpus. Many of these were accounted for by a
conventional metaphor for 'growth' in the context of describing economic
expansion. We also find a similar use of flourish to imply a strong positive
evaluation:
(17) As we want small businesses to flourish, we will go even further.
(Conservative)
(18) To build a responsible society which protects the weak but also allows the
family and the individual to flourish. (Conservative)
In these cases flourish identifies those social entities that are highly valued. In
some cases these are the same for both parties, for example 'families', but in
others they are specific to parties, for example 'business' is claimed to 'flourish'
under the Conservatives and 'democracy' is claimed to 'flourish' under Labour.
There is also evidence of effective use of plant metaphors for the purpose
of political persuasion. Let us consider the use of the term windfall. As in the
Labour Party 1997 manifesto, it is always used in a nominal compound form
windfall levy. The use of this metaphor is important in that it conceals agency:
it is not clear that this is in fact a tax imposed by the government of the day. The
Bank of English corpus shows that the other familiar collocations of this word are
windfall tax, cash windfall, and windfall profits. Here public revenue is
conceptualised as being obtained without any effort because it is through the
natural process of the wind blowing. There is no victim and no effort involved in
obtaining a social benefit. This is an example of a creative use of metaphor that
deliberately construes an event as effortless because there is no animate agent and
positive because it is seen as a gift of nature. In the 'windfall' metaphor the agency
of government is concealed.
Many plant metaphors imply a strong positive evaluation because of the
connotation formed by the association of fertility with life, as in the following
examples:
(19) We will nurture investment in industry, skills, infrastructure and new
technology (Labour)
(20) More realistic attitudes to profit and investment take root. (Conservative)
Here the expansion of investment is represented as a natural process in which
there is an analogy between the roots that are the pre-requisite of a healthy plant
and the investment that is conceptualised as the pre-requisite of a healthy
economy. This is based on the fact that both are invisible causes of visible effects;
they create a semantic association between consumer wealth and fertility.
Metaphors such as nurture and took root are extensions of the highly
conventionalised use of growth to refer to economic expansion. As with the
building and journey metaphors, there is, then, an isomorphic correspondence
between the sequence of events that led to a successful outcome in the natural
world and in the world of business.
8. American corpus: physical environment metaphors
I decided to combine two sub-domains that are both related to the physical
environment; these are weather metaphors and metaphors for natural
geographical features. Such metaphors may appeal particularly to that significant
minority of the North American population that inhabits rural and semi-rural
areas such as the vast Midwest.
Weather metaphors are a conventional source domain for conveying
abstract notions of change and associated ideas; they have been related in the
cognitive linguistic literature to a conceptual key CIRCUMSTANCES ARE WEATHER
(e.g. Grady et al. 1997: 109). For example, our knowledge that wind brings about
a change in the weather provides a useful metaphorical representation of cause
and effect.
(21) Thus across all the globe there harshly blow the winds of change. (Dwight
Eisenhower)
(22) in the shadows of the Cold War assumes new responsibilities in a world
warmed by the sunshine of freedom but threatened still by ancient hatreds
and new plagues. (Bill Clinton)
It is significant that metaphors associated with changing conditions are much
more common than those associated with stable ones. The more intense the
weather condition, the more intense the change implied. Weather metaphors
evoke either a positive or a negative evaluation. I propose that in the domain of
politics, therefore, a specific conceptual metaphor is A SOCIAL CONDITION IS A
WEATHER CONDITION. This is related to the more generic conceptual metaphor
CIRCUMSTANCES ARE WEATHER.
Geographical metaphors highlight a particular aspect of a physical
geographical feature of the landscape; typically, this is either vertical (e.g. valley,
mountain) or horizontal (e.g. desert, horizon).
(23) Together let us explore the stars, conquer the deserts, eradicate disease, tap
the ocean depths, (Dwight Eisenhower)
(24) Vitality has been preserved. Courage and confidence have been restored.
Mental and moral horizons have been extended. (Theodore Roosevelt)
Physical environment metaphors have the pragmatic effect of evaluating social
conditions as if they were physical ones and are specific realisations of the
generic level conceptual representation STATES ARE LOCATIONS (cf. Lakoff and
Johnson 1999: 180).
9. Metaphor borrowing: religious metaphors
The lexical field of religion comprised 7% of the resonance in the American
corpus and 2% in the British corpus; most of the uses in the latter occurred in the
more recent section of the corpus which implies a degree of borrowing. It would
be interesting to compare this with an earlier corpus of British political speeches,
although it was not possible to locate one on this occasion. It should come as no
surprise that religious metaphors are commonly used in American political
speeches: religion has played an important part in the evolution of the USA and
Christian evangelism has been an important source of inter-racial and inter-ethnic
harmony. Religion serves as a source domain for invoking spiritual aspirations
into the political domain and links the President with a commitment to Christian
religious belief. This suggests further evidence for a conceptual metaphor
POLITICS IS RELIGION. Example (25) contains extracts from Bill Clinton's first
inaugural speech:
(25) A spring reborn in the world's oldest democracy, that brings forth the
vision and courage to reinvent America (3)
Though we march to the music of our time, our mission is timeless. (5)
We must bring to our task today the vision and will of those who came
before us. (16)
Our democracy must be not only the envy of the world but the engine of
our own renewal. (19)
The brave Americans serving our nation today in the Persian Gulf, in
Somalia, and wherever else they stand are testament to our resolve. (35)
An idea ennobled by the faith that our nation can summon from its myriad
diversity the deepest measure of unity. (40)
And so, my fellow Americans, at the edge of the 21st century, let us begin
with energy and hope, with faith and discipline, and let us work until our
work is done. The scripture says, "And let us not be weary in well-doing,
for in due season, we shall reap, if we faint not." (41)
From this joyful mountaintop of celebration, we hear a call to service in
the valley. We have heard the trumpets. We have changed the guard. And
now, each in our way, and with God's help, we must answer the call. (42)
Thank you and God bless you all. (End)
Clearly, the references to vision, faith, mission etc. form a cohesive chain
that prepares the way for the strongly religious theme of this coda. This is a
further example of how metaphor can be used systematically to create coherence
in a political text.
I propose that the New Labour Party in Britain has borrowed from
American political discourse to introduce the lexical field of religion into British
political metaphors. It is no secret that Tony Blair had a close social relationship
with Bill Clinton as well as sharing a similar political allegiance to social
democracy. Example (26) shows some typical uses of vision metaphors in the
1997 New Labour manifesto:
(26) But a Government can only ask these efforts from the men and women of
this country if they can confidently see a vision of a fair and just society.
(New Labour)
The vision is one of national renewal, a country with drive, purpose and
energy. A Britain equipped (New Labour)
Our vision for Britain is founded on these values. Guided by them, we will
make our country more (New Labour)
An independent and creative voluntary sector, committed to voluntary
activity as an expression of citizenship, is central to our vision of a
stakeholder society (New Labour)
The vision metaphor is based on the conceptual metaphor SEEING IS
UNDERSTANDING (Lakoff and Johnson 1980: 48); it implies that there is an
altruistic objective that is understood by the party and towards which its policies
are directed. It is one that is analogous to spiritual progress because it claims that
the objective is to make the world a better place to live in. These metaphors
provide evidence that the conceptual metaphor POLITICS IS RELIGION has entered
British political discourse from American political discourse.
10. Conclusion
In this cognitive semantic and corpus-based comparison of metaphors in British
election manifestos and American Inaugural speeches I have found both
similarities and differences. The three most common lexical fields for metaphor
are shared by both varieties: conflict, journeys and buildings. I have argued that
conflict metaphors are the most common in both varieties because of the salience
of conflict in relation to politics and because they emphasise notions of struggle
and personal sacrifice to attain social objectives. I have also suggested that use of
such metaphors may create the potential for passive acceptance of actual military
conflict because it is subliminally associated with objectives that are evaluated as
being socially beneficial as in the current war on terrorism.
I have also identified some lexical fields that only occur in one of the
varieties: plants in the British corpus and fire & light and the physical
environment in the American corpus. I have suggested some culturally and
historically related explanations such as the British passion for gardening leading
to the positive associations of words such as growth and nurture, and the
American experience of struggling for independence leading to a positive
evaluation of fire metaphors.
I have also suggested that the recent introduction of metaphors from the
lexical field for religion into British political discourse by New Labour is
borrowed from American political discourse where they have more established
origins.
Further research is necessary to establish whether a wider range of text
types taken from political discourse confirms or conflicts with these findings. It
would also be interesting to establish whether differences in the use of metaphor
occur between varieties of general English or whether they are restricted to
particular domains of language use, such as politics. It would also be relevant to
study diachronic shifts in the use of metaphor within the domain of politics.
Finally, it would be interesting to find out whether the types of evaluation
that I have suggested motivate the use of metaphor in political discourse achieve
their intended effect by collecting empirical data on reader/hearer response to
metaphor in political contexts.
Notes
1. The British party manifestos are available on the web at
http://www.psr.keele.ac.uk/platform.htm and the American Inaugural
addresses at http://www.bartleby.com/124/.
References
Black, M. (1962), Models and metaphors. Ithaca, N.Y.: Cornell University Press.
Charteris-Black, J. (2000), 'Metaphor and vocabulary teaching in ESP
economics', English for Specific Purposes 19: 149-165.
Charteris-Black, J. and T. Ennis (2001), 'A comparative study of metaphor in
English and Spanish financial reporting', English for Specific Purposes
20: 249-266.
Charteris-Black, J. and A. Musolff (2003), 'Battered hero or innocent victim? A
comparative study of metaphors for euro trading in British and German
financial reporting', English for Specific Purposes 22: 153-176.
Charteris-Black, J. (2004), Corpus Approaches to Critical Metaphor Analysis.
Basingstoke: Palgrave-MacMillan.
Gibbs, R.W. (1994), The Poetics of the mind: figurative thought, language and
understanding. Cambridge: Cambridge University Press.
Goatly, A. (1997), The Language of metaphors. London & New York: Routledge.
Grady, J.E., T. Oakley and S. Coulson (1997), 'Blending and metaphor', in: R.W.
Gibbs and G.J. Steen (eds), Metaphor in cognitive linguistics. Amsterdam
& Philadelphia: Benjamins, 101-124.
Halliday, M.A.K. (1985), An introduction to functional grammar. 2nd ed.
London: Edward Arnold.
Jansen, S.C. and D. Sabo (1994), 'The sport/war metaphor: hegemonic
masculinity, the Persian Gulf war, and the New World order', Sociology of
Sport Journal 11: 1-17.
Lakoff, G. (1991), 'The Metaphor System used to justify war in the Gulf',
Journal of Urban and Cultural Studies 2(1): 59-72.
Lakoff, G. and M. Johnson (1980), Metaphors we live by. Chicago: University of
Chicago Press.
Lakoff, G. and M. Johnson (1999), Philosophy in the flesh: the embodied mind
and its challenge to Western thought. New York: Basic Books.
Lakoff, G. and M. Turner (1989), More than cool reason: a field guide to poetic
metaphor. Chicago: University of Chicago Press.
Ortony, A. (1979), Metaphor and thought. Cambridge: Cambridge University
Press.
Richards, I.A. (1936), The philosophy of rhetoric. New York and London: Oxford
University Press.
Signalling spokenness in personal advertisements on the Web:
The case of ESL countries in South East Asia
Peter K. W. Tan, Vincent B. Y. Ooi and Andy K. L. Chiang
National University of Singapore
Abstract
The continuing impact of the World Wide Web (or the Web) on everyday life
focuses our attention on the ways in which the notions of speech community,
culture and language are patterned in this mega corpus of all time. This paper
investigates how people in South East Asia, in particular Brunei, the
Philippines, Malaysia and Singapore, use English in personal advertisements on
the Web. The study is part of a Web corpus project investigating related questions
in computer-mediated communication (see Herring 1996). The corpus is
currently being built and is derived entirely from the Web.
In ESL (English as a Second Language) nations, or outer circle (Kachru
1992) countries, English is often relegated to the position of a neutral and
transactional (as opposed to interactional) language where affect (emotion)
is played down and less developed in the private and personal (as opposed to
public) domains. We might assume English used for informal purposes to be less
developed. Yet, Web gurus recommend the use of spoken, as opposed to written,
norms when writing for the Web. This paper then focuses on how this tension is
resolved.
Using a combination of a pen-and-paper and corpus-based approach (see
Ooi 2001), we specifically focus on the use of appraisal, attested by Eggins and
Slade (1997) to characterise spoken language. Specifically, we examine a range
of amplification items. We compare the frequencies of the items found in our
personal advertisement sub-corpus and selected written and spoken portions of
the Singapore component of the International Corpus of English (ICE-SIN) and
attempt to account for the patterns discovered.
The results suggest that although South East Asian netspeak is aligned to
spoken language, this alignment is partial.
1. Preamble
This paper arose out of a project at the National University of Singapore whose
main aim was to investigate E-English (Netspeak, English in cyberspace or
computer-mediated communication).1 We have, at the moment, collected the bulk
of the data, which run into 3.6 million words. The question that we asked was
whether we could discern a sense of a speech community through examining the
kind of English used (see Herring 1996).
1.1 The corpus
With the help of a research assistant, we targeted data associated with four South
East Asian nations: Singapore, Malaysia (West and East), the Philippines and
Brunei (see Figure 1).
One of the things that distinguish these nations from others in the region,
like Indonesia, Thailand or the Indo-Chinese nations, is that these nations have
undergone the colonial experience under English-speaking colonial powers,
Britain and the US. These nations have therefore had a longer history of having
employed the English language and, it might be surmised, a higher likelihood of
having indigenised forms of English.
At the moment, the corpus consists of four sections: (a) news, (b)
electronic discussion groups, (c) personal advertisements and (d) electronic chat.
Figure 1. Map of South East Asia
1.2 Netspeak and the New Englishes
There has been much discussion about Netspeak and a very popular assumption is
that Netspeak is much closer to spoken language than written language. Many
might very easily be led to believe this particularly as Web gurus and style guides
(e.g. Hale and Scanlon 1999) push for more spoken styles. There has been a lot of
sociological, but hardly any linguistic, investigation into the nature of Netspeak.
Crystal (2001) gives a very useful coverage of the issue and his conclusion is that
it is plain that Netspeak has far more properties linking it to writing
than to speech [...] Netspeak is better seen as written language
which has been pulled some way in the direction of speech than as
spoken language which has been written down. (Crystal 2001: 47)
At the end of his book, however, he claims that Netspeak is something
completely new [...] From now on we must add a further dimension to
comparative enquiry: spoken language vs. written language vs. sign language vs.
computer-mediated language (Crystal 2001: 238).
Baron, discussing the email component of Netspeak in a chapter entitled
'Why the Jury's Still Out on Email', concludes that Email is clearly a language
form in flux (Baron 2000: 252) and describes it, like pidgins and creoles, as a
(bilingual) mixed contact system, which therefore accounts for its seemingly
schizophrenic character (part speech, part writing) (p. 258).
What is obvious from our vantage point is that Netspeak is evolving. It
will continue to adapt the linguistic resources already available, but whether the
development will be more in the direction of the spoken or written norms of the
language remains to be seen. We accept Biber's position that it is not always
helpful to see the spoken-written dimension in absolute terms (1988: 25);
however, in a later study comparing spoken and written registers, he was able to
identify a fundamental distinction between written and spoken registers (Biber
2001: 238) based on the way complexity is exploited. The spoken-written
distinction is therefore not just virtual but real (see Collot and Belmore 1996 and
Yates 1996).
In our corpus, though, a further complication arises. For various historical
as well as linguistic reasons, Singapore, Malaysia, Brunei and the Philippines
have traditionally been labelled ESL (English as a Second Language) nations.
Alternatively, employing Kachru's (1992) labels, these nations represent 'outer
circle' countries (as opposed to the 'inner circle' countries of the UK, US,
Canada, Australia, etc. and the 'expanding circle' countries of China, Japan, etc.).
Firstly, in the linguistic ecology of these nations, English is sometimes
relegated to the position of a neutral and transactional (as opposed to
interactional) language (Brown and Yule 1983) where affect (emotion) is
played down. Writing about Kenya and Nairobi (also outer circle countries),
Hudson-Ettle and Schmied also comment on the use of English (or the lack of it)
there:
The man or woman on the street speaks Kiswahili and despite the fact
that English is the medium of secondary and tertiary education and
other public domains, the language preferred for conversations is
Swahili. (Hudson-Ettle and Schmied 1999: 4)
We can also therefore see this in terms of languages associated with the public
and private spheres. This can be regarded as being analogous to the position of
Latin in medieval Europe. Like Latin, English is the language employed largely
only in writing and in situations where one is on one's best behaviour.
Yet there are parts of the Web that deal with situations where affect is
important and which veer more towards the private sphere such as personal
advertisements. We might expect a variety more associated with spoken English
to be employed here.
Secondly, where more informal or colloquial versions of English exist,
these tend to be more divergent from the informal varieties of the inner circle.
This stands to reason: standardisation arose out of the need to minimise variation
(see, for example, Bex and Watts 1999; Milroy and Milroy 1999; Crowley 1989),
and therefore standard, written Englishes tend to be more similar to each other
than informal, spoken Englishes.
1.3 Mode fluidity
Social commentators have already noted that genres and text types are not rigid
and unchanging. For example, Fairclough (1994) and others have commented on
the conversationalisation of public discourse like advertising, so that print
advertisements can take on features associated with spoken conversation.
Conversationalisation is, for him, in part to do with shifting boundaries between
written and spoken discourse practices (Fairclough 1994: 260). This observation
is of course not entirely new. Indeed, years earlier, Leech had already commented
on the use of the public-colloquial style for advertising (Leech 1966: 75). And
we are also aware that language associated with computers is fluid and tolerances
are being tested (Bruthiaux 2001).
The question is whether this could be said of personal advertisements on
the Web as well.
2. The focus
In this paper therefore, we will focus on the personal advertisements sub-corpus.
The genre of personal advertisements has received some previous attention (Ooi
2001; Kadir 2000). These studies, though, do not focus on the main question that
we will try to answer in this paper: to what extent are the resources of spoken
English capitalised on (presumably to convey the notion of affect mentioned
above) in South East Asia?
3. Methodology
3.1 Evaluation and appraisal
Clearly, it is not possible here to try to examine many dimensions of spoken
English in the sub-corpus. For the purpose of this paper, a very small section of
the lexico-semantic system of English has been extracted. The notion of
evaluation (e.g. Hunston and Thompson 2001) or appraisal has been receiving
much attention of late. Eggins and Slade (1997) see appraisal (together with
humour and involvement) as elements that characterise casual conversation.
Figure 2 summarises the appraisal system available (Eggins and Slade 1997:
137). (It should be added that the phenomenon of evaluation in narrative has
already been commented on by Labov and Waletzky (1967) much earlier.)
Appraisal
    Appreciation (of text/process)
        reaction
        composition
        valuation
    Affect (emotion)
        (un)happiness
        (in)security
        (dis)satisfaction
    Judgement (behaviour)
        social sanction
        social esteem
    Amplification
        enrich
        augment
        mitigate
Figure 2. The Appraisal System
3.2 Augmenters and mitigators
We select from Eggins and Slade's (1997) appraisal system the sub-system of
amplification. Within that, we select a range of what we will call augmenters and
mitigators. Elsewhere in the literature other labels are used. Carter and McCarthy
(1995) talk about intensifiers and hedges in relation to the Cambridge and
Nottingham Corpus of Discourse in English (CANCODE). Biber (1986a, 1988),
and Conrad and Biber (2001) subdivide each of our categories into two.
Unmarked augmenters are called amplifiers (completely, greatly), whereas
informal ones are labelled emphatics (for sure, a lot). Similarly, unmarked
mitigators are termed downtoners (almost, merely), whereas informal ones are
christened hedges (more or less, sort of). Emphatics and hedges seem to occur
together and are dominant in informal conversation (Biber 1986b). Even with all
four categories taken together, the total mean frequencies for written genres are
distinct from those for conversational genres: 7.7 for academic prose and 10.9 for
romantic fiction, as opposed to 21.8 for face-to-face conversation and 21.0 for
telephone conversation (Biber 1988: 255, 260, 264, 265). The use of augmenters
can also be a feature of Opinionated (as opposed to Objective) style (Biber
1986: 18), one dimension distinguishing spoken and written texts.
The augmenters that we have chosen to examine are: very, a lot, really,
too, ever, incredibly and lah. These items were selected partly on the basis of
Eggins and Slade's examples, and partly with the aim of combining intuitively
common items like very and less common items like incredibly. The inclusion of
lah needs further explanation. We follow Gupta (1992) in her analysis of lah as a
pragmatic particle with an assertive function.2 (Pragmatic particles in her analysis
can serve one of three functions: contradictory, assertive or tentative.) Lah as a
pragmatic particle is available in the informal varieties of English in Singapore,
Malaysia and Brunei and is potentially employable by the majority of the
advertisers.
The mitigators that we will examine are: only, just, a bit and somewhat.
To ensure that the selected items represented a combination of more
frequent and less frequent items, we subjected the sub-corpus to a CLAWS
(Constituent Likelihood Automatic Word-Tagging System) tagging and examined
the rank of the adverbs. Multi-word items could not be examined (which meant
the exclusion of a lot and a bit); and of course lah, being a regionalism, could not
be included; incredibly was also not included. The result showed items from a
range of rankings, including a number of high ranking ones (see Table 1). We
were satisfied with our selection of high- and low-ranking augmenters and
mitigators.
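The ranking step can be illustrated with a minimal Python sketch. It assumes that the tagged text is available as whitespace-separated word_TAG tokens and that adverb tags begin with R (as in the CLAWS7 tagset, e.g. RR and RG); the function name, the input format and the sample sentence are illustrative assumptions, not the procedure actually applied to the sub-corpus.

```python
from collections import Counter

def rank_adverbs(tagged_text):
    """Return {adverb: rank}, rank 1 being the most frequent adverb.

    Assumes 'word_TAG' tokens and that adverb tags start with 'R'
    (an assumption based on the CLAWS7 tagset).
    """
    counts = Counter()
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        if sep and tag.startswith("R"):
            counts[word.lower()] += 1
    # most_common() sorts by descending frequency
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Hypothetical tagged sample, invented for illustration only
SAMPLE = ("I_PPIS1 just_RR want_VV0 a_AT1 very_RG kind_JJ man_NN1 "
          "just_RR someone_PN1 really_RR nice_JJ just_RR very_RG honest_JJ")

ranks = rank_adverbs(SAMPLE)
```

As in the procedure described above, multi-word items such as a lot and a bit would not be captured by a token-by-token count of this kind.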
Table 1. Ranking of adverbs using CLAWS tagging
Item Rank
just 1
very 3
really 9
only 16
too 21
ever 26
somewhat 151
3.3 Sub-corpora from ICE-SIN as points of reference
As a point of reference we made use of sections of the Singapore Component of
the International Corpus of English (ICE-SIN) to contrast the spoken and written
tendencies. ICE-SIN was chosen because there was no readily available corpus
that included Malaysian, Bruneian and Filipino English. We felt that corpora that
contained mainly British or American English (which are more readily available)
would not be appropriate for comparison with our sub-corpus. Given that, for
historical reasons, Singaporean English shares many features with Malaysian
English and Bruneian English, we felt that it would not be inappropriate to use
ICE-SIN as a reference point for South East Asian English.
For our purpose, we extracted the private dialogues portion of ICE-SIN
(a section totalling 217,121 words) to represent spoken language, and the
informational printed portion (a section totalling 277,154 words) to represent
written language (see Appendix 1, where these varieties have been highlighted in
bold face). The reason for selecting these particular sub-corpora had partly to do
with some compatibility in terms of size (each contains 100 texts), but more
importantly we wanted to focus on stereotypical spoken and written texts because
this contrast is what is in people's minds when they are encouraged to write for the
Web as if they were speaking.
4. The personal advertisements
The personal advertisements were taken from three websites:
(a) Lavalife (http://www.all-dating-online.com/lavalife.html),
(b) One and Only (http://www.oneandonly.com/) and
(c) Excite Personals (http://exsiteads.freeservers.com/)
Country-specific advertisements were chosen, and we also chose advertisements
from all four sex and orientation options:
(a) MSW (men seeking women),
(b) WSM (women seeking men),
(c) MSM (men seeking men) and
(d) WSW (women seeking women).
The composition and structure of the personal ad sub-corpus is summarised in
Table 2. Table 3 gives the sizes of the files in the sub-corpus laid out in the same
way as in Table 2. The highest proportion of advertisements was from the One &
Only site.
The size of the sub-corpus is 110,236 words, with Malaysian, Philippine
and Singaporean adverts contributing roughly equal proportions of text and the
Bruneian adverts contributing a substantially smaller proportion. This is
appropriate because Brunei has a much smaller Web presence than the other
countries, which is not surprising if we consider Brunei's population of 350,000 in
contrast to Singapore's 4 million, Malaysia's 22 million and the Philippines' 81
million. Singapore's relatively smaller population is compensated for by its
higher computer and Internet penetration.
Table 2. Composition of sub-corpus
Region                    Lavalife            One & Only          Excite Classifieds
Brunei                    -                   MSW, WSM, MSM       -
Malaysia (East and West)  MSW, WSM, MSM       -                   MSW, WSM, MSM, WSW
East Malaysia             -                   MSW, WSM, MSM, WSW  -
West Malaysia             -                   MSW, WSM, MSM, WSW  -
Philippines               MSW, WSM, MSM, WSW  MSW, WSM, MSM, WSW  MSW, WSM, MSM, WSW
Singapore                 MSW, WSM, MSM, WSW  MSW, WSM, MSM, WSW  MSW, WSM, MSM, WSW
The sub-corpus can also be divided according to the sex and orientation of the
advertisers (Table 4). These proportions also represent the kinds of adverts
available on the Web: heterosexual women and homosexual men seem to be more
highly represented in these personal advertisements. Overall, as well, seen against
the sexual orientation of the population as a whole (popular estimates do not give
a figure of more than 10% homosexuals), we can see that homosexual adverts
(representing over 43% of adverts) have a strong presence.
Table 3. Size of the sub-corpus
Region                    Sex/Orientation  Lavalife  One & Only  Excite Classifieds  Country totals  Country %
Brunei                    MSW                     -       3,956                   -
                          WSM                     -       1,778                   -
                          MSM                     -         308                   -           6,042       5.5%
Malaysia (East and West)  MSW                   549           -               1,764
                          WSM                   503           -               2,591
                          MSM                   588           -               4,362
                          WSW                     -           -                 493
East Malaysia             MSW                     -         352                   -
                          WSM                     -         438                   -
                          MSM                     -       1,220                   -
                          WSW                     -         438                   -
West Malaysia             MSW                     -       3,813                   -
                          WSM                     -       5,281                   -
                          MSM                     -       8,609                   -
                          WSW                     -       4,180                   -         35,181*     31.9%*
Philippines               MSW                   436       3,160               1,058
                          WSM                 2,198       5,185               9,574
                          MSM                   570       5,898               3,388
                          WSW                    86       2,717                 218          34,488      31.3%
Singapore                 MSW                 1,400       3,769               1,971
                          WSM                 2,856       7,246               2,629
                          MSM                 2,323       4,797               3,546
                          WSW                   109       3,745                 134          34,525      31.3%
Website totals                               11,618      66,890              31,728         110,236
Website %                                     10.5%       60.7%               28.8%          100.0%
*East and West Malaysia together
Table 4. Sex and Orientation Figures
Sex/Orientation Words Percentages
MSW 22,228 20%
WSM 40,279 37%
MSM 35,609 32%
WSW 12,120 11%
Total 110,236 100%
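The percentages in Table 4, and the combined share of the MSM and WSW adverts referred to earlier, can be re-derived from the word counts. The following is a minimal Python sketch for checking the arithmetic; the variable and function names are our own illustrative choices.

```python
# Word counts per sex/orientation group, copied from Table 4
WORDS = {"MSW": 22228, "WSM": 40279, "MSM": 35609, "WSW": 12120}

def shares(words):
    """Return each group's share of the total word count, in per cent."""
    total = sum(words.values())
    return {group: 100 * n / total for group, n in words.items()}

table4_shares = shares(WORDS)

# Shares rounded to whole percentages, as reported in Table 4
rounded = {group: round(pct) for group, pct in table4_shares.items()}

# Combined share of the MSM and WSW adverts
same_sex_share = table4_shares["MSM"] + table4_shares["WSW"]
```

On these figures the four groups sum to the sub-corpus total given in Tables 3 and 4, and the combined MSM and WSW share falls just above 43%.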
5. Results and discussion
Tables 5 and 7 (and the accompanying Figures 3 and 4) show the number of
occurrences and the normalised frequencies of the augmenters and mitigators
selected. We use the abbreviations PA, SP and WR for the personal
advertisements sub-corpus, the selected spoken sub-corpus from ICE-SIN and the
selected written sub-corpus from ICE-SIN respectively. Normalised figures
represent the number of tokens per 10,000 words.
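The normalisation can be sketched as follows. This is a minimal Python illustration rather than the script actually used: it takes the PA size from the totals in Table 3 (110,236 words), the SP and WR sizes from the ICE-SIN sections described in section 3.3, and the token counts for very from Table 5; the names SIZES, VERY and normalise are our own.

```python
# Sub-corpus sizes in words: PA from Table 3, SP and WR from section 3.3
SIZES = {"PA": 110236, "SP": 217121, "WR": 277154}

def normalise(tokens, corpus_size, per=10_000):
    """Number of tokens per `per` words, rounded to one decimal place."""
    return round(tokens / corpus_size * per, 1)

# Raw token counts for 'very', as given in Table 5
VERY = {"PA": 327, "SP": 1087, "WR": 270}

very_normalised = {name: normalise(VERY[name], SIZES[name]) for name in SIZES}
```

Normalising in this way makes the three sub-corpora comparable despite their different sizes.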
Table 5. Augmenters
                      PA                    SP                    WR
              tokens  normalised    tokens  normalised    tokens  normalised
incredibly         1         0.1         0         0.0         2         0.1
lah                2         0.2      1677        77.2         0         0.0
ever              37         3.4        34         1.6        22         0.8
a lot             56         5.1       331        15.2        18         0.6
really           167        15.1       495        22.8        32         1.2
too              176        16.0       334        15.4       113         4.1
very             327        29.7      1087        50.1       270         9.7
Total            766        69.6      3958       182.3       457        16.5
[Chart: normalised frequencies (scale 0-100) of incredibly, lah, ever, a lot, really, too and very in PA, SP and WR]
Figure 3. Normalised frequencies of augmenters
All the items were also checked for statistical significance using chi-square (χ²,
p < 0.05); each item in each sub-corpus was checked against the same item for
each of the other sub-corpora. The results for the augmenters are found in Table
6.
Table 6. Significant differences in the distribution of augmenters

          incredibly   lah          ever   a lot   really   too   very
PA vs SP  too small*   yes          yes    yes     yes      no    yes
SP vs WR  too small*   too small*   yes    yes     yes      yes   yes
PA vs WR  too small*   yes          yes    yes     yes      yes   yes
*too small indicates that the frequencies were too low for the computation of
statistical significance
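A pairwise comparison of this kind can be sketched as a Pearson chi-square test on a 2×2 table (occurrences of the item versus all other words, in two sub-corpora). The exact procedure used for Table 6 is not described, so the sketch below is an assumption, as is the size of the SP sub-corpus, which is back-estimated from the token counts and normalised frequencies in Table 5 rather than taken from the text.

```python
CRITICAL_05_DF1 = 3.841  # chi-square critical value for p = 0.05 with 1 degree of freedom


def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square (no continuity correction) for one item in two corpora."""
    a, b = hits_a, size_a - hits_a  # corpus A: item vs. all other words
    c, d = hits_b, size_b - hits_b  # corpus B: item vs. all other words
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


PA_WORDS = 110_236  # from Table 4
SP_WORDS = 217_000  # ASSUMPTION: back-estimated from Table 5, not stated in the text

# Token counts for two augmenters in PA and SP, copied from Table 5.
for item, pa_hits, sp_hits in [("too", 176, 334), ("very", 327, 1087)]:
    chi2 = chi_square_2x2(pa_hits, PA_WORDS, sp_hits, SP_WORDS)
    verdict = "significant" if chi2 > CRITICAL_05_DF1 else "not significant"
    print(f"{item}: chi-square = {chi2:.2f} ({verdict})")
```

Under these assumptions the sketch agrees with the PA vs SP row of Table 6 for the two items shown: the difference for too falls below the critical value, while the difference for very lies well above it.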
We should probably disregard items where all three sub-corpora register very low
normalised frequencies (say, below 5 per 10,000 words): this would push out
incredibly and ever. All the other augmenters have a significantly higher
frequency in SP than in WR. The most dramatic case is lah, where SP had a
normalised frequency of over 77 and WR had a frequency of 0. Thus far, then, we
can say that the frequencies of the augmenters provide a fairly reliable index to
the spokenness or writtenness of a text and therefore confirm the tendencies seen
in Biber (1988).3
The differences between the PA frequencies and the SP and WR frequencies are
also statistically significant, with one exception: for too, the difference
between the PA and SP scores is not statistically significant. In most cases,
PA frequencies fall between those of SP and WR. The exceptions are too (and
ever, which we discarded from consideration earlier), where the PA frequency
exceeds that of SP. By and large the figures confirm the expectation that PA
tends towards SP norms without quite reaching them. However,
the situation is not always that clear-cut. We could arguably say that the PA
frequency for a lot tends towards WR norms. However, the outstanding item is
lah, which only occurred twice, once in the Malaysian portion of the sub-corpus
and once in the Bruneian portion. There is certainly strong resistance to the use of
lah in personal advertisements in the region.
Table 7. Mitigators
PA SP WR
tokens normalised tokens normalised tokens normalised
somewhat 5 0.5 6 0.3 16 0.6
a bit 45 4.1 132 6.1 10 0.4
only 151 13.7 369 17.0 478 17.2
just 398 36.1 1,094 50.4 167 6.0
Total 599 54.3 1,601 73.7 671 24.2
Figure 4. Normalised frequencies of mitigators (series: PA, SP, WR; items: somewhat, a bit, only, just; vertical axis: 0 to 60)
Table 8. Significant differences in the distribution of mitigators

          somewhat   a bit   only   just
PA vs SP  no         yes     yes    yes
SP vs WR  no         yes     no     yes
PA vs WR  no         yes     yes    yes
Tables 7 and 8, together with Figure 4, show the figures for mitigators. What is
interesting is that mitigators do not seem to display as robust a difference between
the sub-corpora as the augmenters. Figure 4 shows three curves more or less
keeping step with each other until we reach just. The distributions of somewhat
and only in the sub-corpora are not always significantly different. For those that
are (a bit and just), the pattern seen in the augmenters is replicated: the PA
frequencies lie between those of SP and WR, and closer to SP than to WR.
This might suggest that some mitigators are more important than others in
showing up the differences between the three modes. Interestingly, the
importance of a mitigator is not dependent on whether it is a high- or low-
frequency item: only, which is more frequent than a bit and less frequent than
just, is not important as a distinguisher between the three modes.
How then can we account for the less dramatic difference in the case of the
mitigators? We could perhaps hypothesise that it might be more important to
mitigate (as opposed to augment) in written texts. This also makes sense in the
light of the distinction between the Opinionated and Objective style (Biber
1986a): augmenters contribute to the Opinionated style but mitigators do not. The
difference here is also understandable if we consider that in a stereotypical
written genre, academic writing, it is important not to over-generalise and to
delimit one's conclusions.
6. Conclusion
So, to what extent are the resources of spoken discourse relied on in PA? On the
basis of the augmenters and mitigators selected, we could say that personal
advertisers tend to make use of features of spokenness. The analysis shows,
corroborating Crystal, that the language of PA represents "written language which
has been pulled some way in the direction of speech" (Crystal 2001: 47). Given
the focus on the interactional function of language, it is not surprising that
advertisers try to take on board these features. This is the case despite the fact that
the sub-corpus is from outer circle countries where English tends to be used for
more transactional functions.
However, the situation is not entirely cut and dried. There is very strong
resistance to the employment of the pragmatic particle lah in personal
advertisements. It is not entirely clear to us why this should be the case, although
it is not impossible that the notion of a borderless cyberspace might discourage
advertisers from employing items like lah that point towards the local or suggest
an insular or parochial outlook. A non-local spoken model might also be
preferred if advertisers are open to responses from non-local sojourners in the
region. It would therefore be premature to say at this stage that Netspeak in South
East Asia is closely associated with the norms of spoken language although it
seems to be an important contributor to the norms associated with personal
advertisements.
We obviously need to examine other parts of the corpus, e.g. the chat data,
where localisation does not seem to be such a taboo.
Notes
1. We are grateful for the support of the National University of Singapore,
research project ref. no. R-103-000-019-112, for this paper. We are also
grateful to the Department of English Language and Literature, National
University of Singapore, for the use of the ICE-SIN corpus.
2. Besemeres and Wierzbicka (2003) propose an alternative analysis of lah
based on the claim of common ground (like you know). This analysis does not
contradict our understanding of lah as playing an emphasising function.
3. For a rougher guide, we might also add that the word frequency list (see
Appendix 2) which gives I as the most frequent item seems to suggest
spokenness as well (personal pronouns typically rank very high in spoken
texts).
References
Baron, N. (2000), Alphabet to email: How written English evolved and where it's
heading. London: Routledge.
Besemeres, M. and A. Wierzbicka (forthcoming), Pragmatics and cognition: the
meaning of the particle lah in Singapore English, Journal of
Pragmatics and cognition 11(1): 1-36.
Bex, T. and R. J. Watts (eds) (1999), Standard English: the widening debate.
London: Routledge.
Biber, D. (1986a), On the investigation of spoken/written differences, Studia
Linguistica 40(1): 1-21.
Biber, D. (1986b), Spoken and written textual dimensions in English: resolving
the contradictory findings, Language 62(2): 384-414.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge
University Press.
Biber, D. (2001), On the complexity of discourse complexity: a multi-
dimensional analysis, in: S. Conrad and D. Biber (eds), Variation in
English: multi-dimensional studies. London: Longman. 215-240.
Brown, G. and G. Yule (1983), Discourse analysis. Cambridge: Cambridge
University Press.
Bruthiaux, P. (2001), Missing in action: verbal metaphor for information
technology, English Today 67 (Vol. 17 No. 3): 24-30.
Carter, R.A. and M.J. McCarthy (1995), Grammar and the spoken language,
Applied Linguistics 16(2): 141-158.
Collot, M. and N. Belmore (1996), Electronic language: a new variety of
English, in: S. Herring (ed.), 13-28.
Conrad, S. and D. Biber (2001), Multi-dimensional methodology and the
dimensions of register variation in English, in: S. Conrad and D. Biber
(eds), Variation in English: multi-dimensional studies. London: Longman.
13-42.
Crowley, T. (1989), Standard English and the politics of language. London:
Palgrave Macmillan.
Crystal, D. (2001), Language and the Internet. Cambridge: Cambridge University
Press.
Eggins, S. and D. Slade (1997), Analysing casual conversation. London: Cassell.
Fairclough, N. (1994), Conversationalisation of public discourse and the
authority of the consumer, in: R. Keat, N. Whiteley and N. Abercrombie
(eds), The authority of the consumer. London: Routledge. 253-268.
Gupta, A.F. (1992), The pragmatic particles of Singapore colloquial English,
Journal of Pragmatics 18: 31-57.
Hale, C. and J. Scanlon (1999), Wired style: principles of English usage in the
digital age. New York: Broadway Books.
Herring, S. (ed.) (1996), Computer-mediated communication: linguistic, social
and cross-cultural perspectives. Amsterdam: John Benjamins.
Hudson-Ettle, D. and J. Schmied (1999), Manual to accompany the East African
component of the International Corpus of English: Background
information, coding conventions and list of source texts. Chemnitz:
Department of English, Chemnitz University of Technology.
Hunston, S. and G. Thompson (2001), Evaluation in text: authorial stance and
the construction of discourse. Oxford: Oxford University Press.
Kachru, B.B. (ed.) (1992), The other tongue: English across cultures, 2nd edn.
Urbana: University of Illinois Press.
Kadir, M.A. (2000), Love @ cyberspace: A corpus-based study of personal ads
on the Web. Unpublished MA dissertation, National University of
Singapore.
Labov, W. and J. Waletzky (1967), Narrative analysis: Oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts.
Seattle: University of Washington Press. 12-44.
Leech, G.N. (1996), English in advertising: a linguistic study of advertising in
Great Britain. London: Longman.
Milroy, J. and L. Milroy (1999), Authority in language: investigating standard
English, 3rd edn. London: Routledge.
Ooi, V.B.Y. (2001), Investigating and teaching genres on the World Wide Web,
in: M. Ghadessy, A. Henry and R. L. Roseberry (eds), Small corpus
studies and ELT: theory and practice (Studies in Corpus Linguistics, Vol.
5). Amsterdam: John Benjamins. 175-204.
Yates, S.J. (1996), Oral and written linguistic aspects of computer conferencing,
in: S. Herring (ed.), 29-46.
Appendix 1. The structure of an ICE corpus
Spoken Texts (300)
  Dialogues (180)
    Private (100)
      face-to-face conversations (90)
      telephone conversations (10)
    Public (80)
      classroom lessons (20)
      broadcast discussions (20)
      broadcast interviews (10)
      parliamentary debates (10)
      legal cross-examinations (10)
      business transactions (10)
  Monologues (100)
    Unscripted (70)
      spontaneous commentaries (20)
      unscripted speeches: lectures (30)
      demonstrations (10)
      legal presentations (10)
    Scripted (30)
      broadcast talks (20)
      non-broadcast speeches (10)
  Mixed (20)
    broadcast news (20)
Written Texts (200)
  Non-Printed (50)
    Non-professional writing (20)
      untimed student essays (10)
      student examination scripts (10)
    Correspondence (30)
      social letters (15)
      business letters (15)
  Printed (150)
    Informational writing (100)
      Academic writing (40)
        humanities (10)
        social sciences (10)
        natural sciences (10)
        technology (10)
      Popular writing (40)
        humanities (10)
        social sciences (10)
        natural sciences (10)
        technology (10)
      Reportage (20)
        press news reports (20)
    Instructional writing (20)
      administrative/regulatory (10)
      skills/hobbies (10)
    Persuasive writing (10)
      press editorials (10)
    Creative writing (20)
      novels/stories (20)
Appendix 2. Word frequency list
The top 10 words in the personal advertisements
Item Frequency
I 3,192
and 2,272
to 1,937
a 1,706
the 1,142
am 820
you 811
in 796
me 770
my 769
Textual colligation: a special kind of lexical priming
Michael Hoey
University of Liverpool
Abstract
Corpus linguistics has not attended much to text-linguistic issues. This paper
argues that lexical choice has a major effect on features such as cohesion, Theme
choice and paragraph division and that corpus investigation can shed light on the
nature of the lexical choices made. It is argued that some lexis has a bias towards
(or against) certain textual functions and that this is an inherent property of such
lexis. It is also argued that lexical choices interlock, creating what I term
colligational prosody.
1. Introduction
Corpus linguistics has for perfectly understandable reasons focused most of its
attention upon lexical and grammatical matters. Although there are a few text-
linguistic and spoken discourse studies that make use of corpus linguistic
techniques (e.g. Hoey 1997; Partington 2003; Partington and Morley 2002), they
are few and far between and are rooted in no explicitly articulated theory. This
paper attempts to redress this lack by articulating a theoretical relationship
between lexis and text-linguistics.
The paper is divided into three uneven parts. In the first, I focus on current
perceptions about the organisation and nature of written discourse, identifying the
features that might be open to investigation in corpora. In the second, I contrast
two theoretically opposite positions, one of which I take to be essentially
incompatible with corpus investigation and the other of which is amenable to
such investigation. Unsurprisingly I shall favour the latter! In the third and much
the longest part, I want to present sample results from a corpus-linguistic
investigation of textual questions. These are hesitantly presented and are intended
only as hints as to how corpus linguistics might proceed. Despite the hesitancy
with which my findings are presented and the weak evidential base on which they
rest, they point to a theoretical suggestion that, if accepted, would place lexis and
text linguistics on a very different footing vis-à-vis each other.
2. The nature of text
The following propositions are accepted by a number of text linguists. Probably
none has the support of all, but at least none is idiosyncratically my own.
Text is interactively produced and processed (see amongst many others
Bakhtin 1973 [1929]; Goodman 1967, 1973; Winter 1971, 1977; Smith
1978; Widdowson 1979; Hoey 1979, 1983, 2001; Goffman 1981;
Nystrand 1986, 1989).1 In other words it presupposes a writer-reader
interaction in which text is the site or the residue/outcome of the
interaction, depending on whether one takes the reader's or writer's
perspective.
Text is linearly developed. By this I mean that each sentence builds upon
what has gone before, from the speaker's or writer's point of view; from the
listener's or reader's point of view, each sentence that is reached prospects
the sentence or sentences to follow. There are two aspects to this feature of
text. In the first place, the speaker/writer is seeking to meet the
listener/reader's expectations and the listener/reader has expectations on
the basis of what the speaker/writer has already said. This point, though
widely agreed, is differently articulated, depending on whose position one
considers. Amongst linguists in whose work one finds some aspect of this
position are Sinclair (1993), Halliday and Hasan (1985), Winter (1971,
1979, 1982), Crombie (1985), Graustein and Thiele (1979, 1987),
Beekman (1970), Beekman and Callow (1974), Bolivar (2001) and Tadros
(1985, 1993). In the second place, this covers the much-described
phenomenon of Theme-Rheme, both as described in the Prague School
(Firbas 1966, 1986; Daneš 1974) and within the Systemic-Functional
tradition (most notably, Halliday 1994).
Text is cohesive. Whether this is a by-product of the need to be coherent
(as Morgan and Sellner 1980 have argued) or a prerequisite of coherence
(as was originally argued in Halliday and Hasan 1976) seems irrelevant.
Almost certainly, the relationship works both ways. On occasion, writers
(and more rarely speakers) consciously produce cohesive devices in order
to clarify or emphasise, i.e. to create coherence; one has only to look at the
way that Dickens, for example, exploits repetition for rhetorical effect to
see that cohesion can be a conscious tool. On other occasions, a writer's or
speaker's coherence is reflected automatically in the language they use,
i.e. in cohesion. Either way, that it is a feature of text cannot be denied and
one, furthermore, that continues to be the subject of study.
Text is chunked. The nature of the chunking is to some extent disputed.
Some posit a strict hierarchical organisation to text (e.g. Graustein and
Thiele 1979, 1987; Mann and Thompson 1986, 1988; van Dijk and
Kintsch 1978). Some posit a patterning or structuring of some kind or
other without assuming that the chunking thereby created accounts for all
texts or all of any particular text (e.g. Labov 1972; Labov and Waletzky
1967; Longacre 1968, 1979, 1983; Hoey 1979, 1983, 2001; Swales 1981,
1990; Halliday and Hasan 1985; Martin 1992). There is a great deal of
commonality amongst these positions but not much actual agreement.
Still, some kind of chunking is acknowledged to exist and is reflected in
our cultural habit of writing books with chapters, sections (in the case of
academic texts) and paragraphs.
Text is shaped in the service of particular communities of users (e.g.
Halliday and Hasan 1986; Ventola 1987; Swales 1990; Martin 1992)
and/or to the advantage of those with vested power in the communities
(e.g. Fairclough 1989, Chouliaraki and Fairclough 2001; Hodge and Kress
1993). The work of the genre analysts in particular warns us that we must
be wary in text-linguistics of over-generalising; some claims can only be
made of a restricted body of data.
To these points of broad agreement in the text-linguistic community, I would
want to add that text is web-like (e.g. Hoey 1991), but since that is not a widely
held view I will not be developing it further here.
3. Two theoretical positions
There seem to be two possible ways of modelling the relationship between lexis
and the features of text I have just outlined. The first is that the relationships
found in text (whether they are interactive, linear, cohesive, hierarchical or
structural) are independent of the lexis of the language. According to this view,
each sentence is constructed according to the grammar, collocations and
colligations of the language in response to textual needs but without constraints
broader than those particular needs. In other words, each text imposes its own
demands and has its own unique sentence requirements. If this view is correct,
corpus linguistics cannot offer anything useful for text-linguistics except in so far
as it might offer tools for exploring individual spoken or written texts.
The other way of modelling the relationship between lexis and text is to
see textual relationships (interactive, linear, cohesive, hierarchical and structural)
as dependent upon and created by the lexis of the language in a manner not
exhausted by the demands of the individual text. According to this view, each
sentence of every text is constructed along lines that have been laid down by all
the texts that the speaker/writer has encountered in the course of his or her life,
such that the production of a text is in fact in part a reproduction of previous
texts along strictly controlled lines. If this view is correct, corpus linguistics is the
key to the future of text-linguistics.
The first view is presumably the default position. The second is however, I
hope you will agree, a much more interesting position. "You'll reply that reality
hasn't the least obligation to be interesting. And I'll answer you that reality may
avoid that obligation but that hypotheses may not."2
What I want to claim in this paper is that every lexical item is primed for
use in textual organisation. The notion of priming is taken from psychology and
in this context means that our encounters with a word accustom us to expect it to
be used in certain kinds of ways to such an extent that these potential uses
become part of our knowledge of the word and to some extent constrain the way
we are likely to use the word ourselves.
More specifically I want to make the following claims:
1. Every lexical item (or combination of lexical items) may have a positive
or negative preference for participating in cohesive chains.
2. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of Theme in a Theme-Rheme
relation.
3. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of a specific type of semantic
relation, e.g. contrast, time sequence, exemplification.
4. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring at the beginning or end of an
independently recognised chunk of text, e.g. the paragraph.
5. If a lexical item (or combination of lexical items) has any of the above
preferences, it may only or especially be operative in texts of a particular
type or genre or designed for a particular community of users, e.g.
academic papers.
The positive and negative preferences of a lexical item with regard to the textual
features just described are what I would term its textual colligations. The claims
of course allow for the possibility of a lexical item having not only a positive or
negative preference but also a neutral preference for each of these features. If,
however, the great majority of lexical items in a language were to prove to be
neutral with regard to one of the features, then the specific claim would be
disconfirmed with regard to the feature in question, and if that were to prove true
of all the features, then the more general claim would fall also.
If on the other hand the claims were to prove correct, we could envisage a
description of the language from a text-linguistic point of view including a giant
matrix of all the words of the language looking something like that in Table 1. I
will flesh out this matrix with some real examples near the end of this paper. For
the moment, though, we need to examine the evidence for the claims made above.
Table 1: A fragment of a hypothetical text-colligational matrix

 | Lexical item 1 | Lexical item 2 | Lexical item 3 | Lexical item 4
Preference for participating in cohesive chains | negative | positive | neutral | positive
Preference for occurring as part of Theme | positive | positive | positive | neutral
Preference for occurring as part of a specified semantic relation | neutral | negative with exemplification | positive with affirmation/denial | neutral
Preference for occurring at the beginning or end of a recognisable chunk | negative | neutral | positive for paragraph initial in certain phrases | positive for paragraph initial when Theme
Constraints on operation of preferences | Popular fiction | Biography | None observed | News
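One way to picture such a matrix is as a simple lookup structure with one record per lexical item. The sketch below is purely illustrative: the class, field and function names are hypothetical, and the values are the placeholder entries from Table 1, not findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextualColligation:
    """One column of the hypothetical matrix in Table 1."""
    cohesive_chains: str
    theme: str
    semantic_relation: str
    chunk_boundary: str
    constraints: str


# Placeholder values copied from Table 1 (capitalisation normalised).
matrix = {
    "Lexical item 1": TextualColligation(
        "negative", "positive", "neutral", "negative", "Popular fiction"),
    "Lexical item 2": TextualColligation(
        "positive", "positive", "negative with exemplification", "neutral", "Biography"),
    "Lexical item 3": TextualColligation(
        "neutral", "positive", "positive with affirmation/denial",
        "positive for paragraph initial in certain phrases", "None observed"),
    "Lexical item 4": TextualColligation(
        "positive", "neutral", "neutral",
        "positive for paragraph initial when Theme", "News"),
}


def items_with(feature: str, polarity: str) -> list[str]:
    """List the items whose entry for `feature` starts with `polarity`."""
    return [name for name, entry in matrix.items()
            if getattr(entry, feature).startswith(polarity)]


print(items_with("theme", "positive"))
print(items_with("cohesive_chains", "positive"))
```

Keeping the qualified values ("positive with affirmation/denial") as free text rather than a three-way flag reflects the fact that the cells in Table 1 carry conditions as well as a polarity.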
3.1 Claim 1: Every lexical item (or combination of lexical items) may have
a positive or negative preference for participating in cohesive chains
The first claim was that every lexical item may have a positive or negative
preference for participating in cohesive chains, where a cohesive chain is a set of
at least three lexical (and grammatical) items that either co-refer to a single entity
(identity chains) or cross-connect because of their similarity of meaning
(similarity chains) (Hasan 1984; Hasan in Halliday and Hasan 1985; Parsons
1995). I shall give least attention to this claim, partly for reasons of space and
partly because I have presented detailed evidence elsewhere (Hoey, forthcoming).
Here I will simply note that all the following lexical items occur in my corpus as
members of cohesive chains occurring in a number of different texts:3
army, baby, Blair, gay, lake, music, pit, planet, political, spleen
Preliminary investigations suggest that all the following lexical items show no
tendency in my corpus to occur in cohesive chains, despite their all being fairly
frequent words in my corpus. It is of course much harder to establish that
something does not occur and it is a painfully slow process to move from each
concordance line into the original text to check for possible cohesion, so the
following list must be regarded as provisional:
afterwards, ago, best, comparison, crossroads, particularly, problem,
surprising
Orthography is significant in this matter. The lexical item crossroads did not
form cohesive chains; the lexical item Crossroads, on the other hand, which
refers to a defunct British soap, chains freely.
The following words are neutral in my corpus with regard to cohesion;
they do not occur in chains often and when they do, the chains tend to be short,
e.g. The first reason, The second reason:
option, reason, sixty
These lists may not seem particularly surprising and it is tempting to account for
their inclusion in other ways. My point here is that a lack of cohesive potential is
as much a quality of the word surprising as the fact that it is evaluative (as in the
previous sentence) and therefore unlikely to be a topic. Notice, too, that lexical
items that do not have a preference for appearing in cohesive chains are every bit
as common in the English language as those that do appear in cohesive chains
(and in many cases more so), despite the fact that one might have predicted that
infrequency would make it less likely that a word would participate in chains.
It is possible to express claims about the cohesive potential more subtly
than I have done here. The chains of particular words may favour repetitions, co-
hyponyms or pro-forms, for example; see Hoey (forthcoming b) for examples and
details. A crude check on the chaining potential of repetition-favouring words can
be obtained by examining the distribution plot that WordSmith (Scott 1999)
calculates for a word.
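The idea behind such a crude check can be sketched without WordSmith. The function below is a hypothetical stand-in, not WordSmith's own algorithm: it divides a tokenised text into equal segments and counts the occurrences of a word in each, so that a word repeated throughout a text (a candidate for chaining) can be told apart from one that appears only once or in a single burst. The sample sentence is invented for illustration.

```python
def dispersion(tokens: list[str], word: str, segments: int = 10) -> list[int]:
    """Count occurrences of `word` in each of `segments` equal slices of the text."""
    counts = [0] * segments
    n = len(tokens)
    target = word.lower()
    for position, token in enumerate(tokens):
        if token.lower() == target:
            # Map the token position onto a segment index in 0..segments-1.
            counts[position * segments // n] += 1
    return counts


# Invented example text, already tokenised on whitespace.
tokens = "the army moved . the army stopped . later the army left".split()
print(dispersion(tokens, "army", segments=3))
print(dispersion(tokens, "later", segments=3))
```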
3.2 Claim 2: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of Theme in a
Theme-Rheme relation
The second claim, that every lexical item may have a positive or negative
preference for occurring as part of Theme in a Theme-Rheme relation, like all the
claims, requires more detailed support than I am able to give it here. The
following statistics suggest however that the claim is not meritless. 250 instances
of years were examined, and it was found that 37% occurred as part of Theme.
This is slightly higher than would be expected on the basis of random distribution
across Theme and Rheme, though it is hardly a striking result; what is more
interesting is that the great majority of the instances of Thematic years occur as
part of a fronted Adjunct rather than as part of Subject.4 In other words, when
years is thematised, it is usually marked.
Another example is the distribution across Theme and Rheme of instances
of consequence. 1615 of these were analysed (excluding instances of the rarer
"importance" sense), and it was found that the word consequence occurs in Theme
43% of the time, a considerably higher percentage than would occur on a random
distribution. Again the Adjunct use seems significant. Almost half of the
occurrences of thematised consequence occurred as part of an Adjunct. The word
sixty occurs in Theme 75% of the time, on the basis of a sample of 294 instances.
Again, orthography is relevant: 60 shows no such tendency.
The claim I have just made can be made more subtly and more complexly.
Some words, I would argue, have a tendency to appear in marked Theme (e.g.
years and consequence, as we have just seen); others have a propensity for
appearing as unmarked Theme, i.e. as Subject. Furthermore, it is possible to
combine this feature with the previous one. All the cases I have given of
Thematic preference have either negative or neutral cohesive preference. This
means that sixty, consequence and years are not going to participate in Thematic
progression (Daneš 1974). There may prove to be a correlation between Marked
Theme and a negative preference for cohesive chains; this would need
investigating.
If a lexical item has a positive preference for both Theme and cohesive
chains, it will inevitably have a positive preference for Thematic Progression;
again, there may be a correlation between a preference for unmarked Theme and
a preference for appearing in cohesive chains, though my claim is not of course
dependent upon such a correlation. It follows that as before one might be subtler
and expect some lexical items to be primed for participation in Simple Thematic
Progression or Linear Thematic Progression, etc.
3.3 Claim 3: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of a specific
type of semantic relation
The third claim was that every lexical item (or combination of lexical items) may
have a positive or negative preference for occurring as part of a specific type of
semantic relation, e.g. contrast, time sequence, exemplification. Such relations
may be the relations between clauses or parts of clauses or between larger chunks
of text; they may also reflect relations between speaker and listener, for example
indicating the relation between a speaker's or writer's utterance and a listener's or
reader's utterance. I give here just two examples of what this claim is intended to
cover: ago and reason.
An example of a lexical item associated with a semantic relation is ago.
More specifically ago has an association with contrast when it is part of Theme.
Of 65 Thematised instances of ago examined, 23 (35%) were followed by a
contrast and 5 (8%) preceded by a contrast. If 10 instances of not long ago and as
long ago as are removed from the calculation, the percentage associated with
contrast rises to 51%. These are small figures and not too much can be claimed of
them. But informal examination of larger quantities of data suggests that they are
not misleading. When ago is not part of Theme, there is still an association with
contrast but the manifestations are somewhat different (see Hoey forthcoming a).
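The percentages for Thematised ago can be re-derived from the counts given above. The sketch assumes, since the text does not say so explicitly, that none of the 10 instances of not long ago and as long ago as was itself counted as a contrast case; the variable names are illustrative only.

```python
def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


thematised = 65            # Thematised instances of "ago" examined
followed_by_contrast = 23
preceded_by_contrast = 5
fixed_phrases = 10         # "not long ago" and "as long ago as"

contrast = followed_by_contrast + preceded_by_contrast

print(pct(followed_by_contrast, thematised))   # share followed by a contrast
print(pct(preceded_by_contrast, thematised))   # share preceded by a contrast

# ASSUMPTION: the 10 fixed-phrase instances include no contrast cases,
# so only the denominator shrinks when they are removed.
adjusted = pct(contrast, thematised - fixed_phrases)
print(adjusted)
```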
The word reason may seem a rather obvious choice of word to illustrate
the association of a lexical item with a particular semantic relation, and of course
reason is intimately associated with the reason-result relation and other similar
relations, but it is not this association that I wish to draw attention to here.
Consider Table 2, which shows the distribution of the different structures of
postmodification of reason. Five postmodifying options are considered: reason
postmodified by a Ø-clause (as in the reason he can continue to do that), reason
postmodified by a that-clause (as in the reason that this lobbying has had little
effect), reason postmodified by a prepositional phrase headed by for (as in part of
the reason for this), reason postmodified by a why-clause (as in another reason
why pop shows are getting better), and reason postmodified by to + V (as in any
reason to celebrate).
Table 2: Distribution of postmodified reason structures and their association with
affirmation/denial

 | Subject reason affirmed | Subject reason denied | Complement reason affirmed | Complement reason denied | Object reason affirmed | Object reason denied
reason + Ø clause | 698 | 17 (38) | 210 | 42 | 14 | 4
reason + that clause | 77 | - | 40 | 9 | - | 3
reason + for X | 1091 | 36 (49) | 610 | 392 | 305 | 161
reason + why clause | 7 | 10 (17) | 594 | 629 | 61 | 223
reason + to V | 22 | 3 | 286 | 536 | 732 | 426
The table shows how these options distribute across three of the functions
available in the clause (Subject, Complement and Object) and indicates whether
they are associated with an affirmation or a denial. Affirmation occurs when a
reason is asserted. Denial occurs when a reason is declared to be of no
importance, invalid or not known. Examples of reasons affirmed are:
The council's neglect was the reason the flats were falling apart.
The reason that this lobbying has had little impact is that the industry has
failed to construct a convincing case.
The negatives in the latter sentence are of course not denials of the reason but
denials about lobbying and the making of convincing cases.
Examples of reasons denied are:
I see no reason to change it.
There is no good reason why the publishers can't provide a normal
discount to booksellers.
The ratio of positive to negative clauses in general English is 9:1 (Halliday and
James 1993). Where therefore the ratio of affirmation/denial for a particular
syntactic choice is significantly skewed and where the frequency is high as a
proportion of the total number of cases considered, it has been highlighted in the
table. Where it is unclear whether it is the reason or some other aspect of the
clause that is being denied or any other problem of allocation to the categories
arises, the higher figure that results from inclusion of such cases is included in
brackets; it will be noticed that all such cases occur in the Subject options.
Of 7238 instances examined (excluding the doubtful cases), 4747 are
affirming the reason, 2491 denying it, a ratio of close to 2:1. Set against the 9:1
ratio for general English, this points strongly to reason being associated with
denial; that means that when reason is used, it
has a good chance of being part of a pre-emptive move by the writer/speaker to
say that s/he does not want to (or cannot) answer the reader/listener's expected
question 'Why?' or that any counter-arguments that might be offered to his/her
position, whether by the reader/listener or by a third party, cannot be supported
with evidence. Either way, the word provides evidence of association with
affirmation-denial (Winter 1979, Williames 1985), a pivotal feature of
writer/reader and speaker/listener relationships.
Looked at more closely, the table allows us to state such an association
more precisely. In the first place, notice that the Subject function is strongly
associated with affirmation (1895:66, a whopping 29:1 ratio of affirmation to
denial), whereas Complement is associated with denial (1740:1608, close to a 50-
50 ratio). (Object has a less marked association with denial.) So if you want to
affirm your reason, put it in the Subject. If you plan to reject it or say that it is
irrelevant or unknown, use the Complement (or Object). This is a useful example
of a complex textual colligation where it is the operation of reason in a particular
grammatical function that has a particular textual implication.
Looked at another way, the different postmodifying structures with which
reason appears also distribute themselves differently between affirmation and
denial. So reason + Ø-clause is associated with affirmation (in a ratio to denial of
15/1); the only structure to come near this weighting towards affirmation is the
relatively infrequent reason + that-clause, with a ratio of 10/1 in favour of
affirmation. On the other hand, reason + why-clause is associated with denial,
there being an absolute majority of cases of denial in all three grammatical
functions. This is another instance of a complex textual colligation, where the
colligation of reason with one or other kind of clause as postmodifier is the
condition that has to be met for a textual colligation to be observable.
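
The ratios quoted in the last few paragraphs can be recomputed from the counts in Table 2. The following is a minimal sketch only: the bracketed doubtful cases are excluded, a "-" cell is read as 0, and the dictionary layout, the function names and the label "bare clause" (for the first row, where reason is followed directly by a clause) are illustrative assumptions, not part of the original analysis.

```python
# Counts transcribed from Table 2. Each value is (affirmed, denied).
# Bracketed doubtful cases are excluded and "-" cells are read as 0.
TABLE_2 = {
    "bare clause": {"Subject": (698, 17), "Complement": (210, 42), "Object": (14, 4)},
    "that-clause": {"Subject": (77, 0), "Complement": (40, 9), "Object": (0, 3)},
    "for X": {"Subject": (1091, 36), "Complement": (610, 392), "Object": (305, 161)},
    "why-clause": {"Subject": (7, 10), "Complement": (594, 629), "Object": (61, 223)},
    "to V": {"Subject": (22, 3), "Complement": (286, 536), "Object": (732, 426)},
}


def totals_by_function(table):
    """Sum (affirmed, denied) over all structures for each clause function."""
    sums = {}
    for cells in table.values():
        for function, (affirmed, denied) in cells.items():
            a, d = sums.get(function, (0, 0))
            sums[function] = (a + affirmed, d + denied)
    return sums


def totals_by_structure(table):
    """Sum (affirmed, denied) over all functions for each postmodifying structure."""
    return {
        structure: (
            sum(a for a, _ in cells.values()),
            sum(d for _, d in cells.values()),
        )
        for structure, cells in table.items()
    }


def ratio(affirmed, denied):
    """Affirmation-to-denial ratio; 28.7 means roughly 29:1."""
    return affirmed / denied


by_function = totals_by_function(TABLE_2)
by_structure = totals_by_structure(TABLE_2)
overall = (
    sum(a for a, _ in by_function.values()),
    sum(d for _, d in by_function.values()),
)

for name, (a, d) in {**by_function, **by_structure, "overall": overall}.items():
    print(f"{name:12s} {a:5d}:{d:<5d} ratio {ratio(a, d):5.1f}")
```

Read against the 9:1 baseline for general English, a ratio well above 9 marks an association with affirmation and a ratio well below 9 an association with denial.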
Two instances on their own prove nothing, but it is hoped that the two
examples I have given at least elucidate what is meant by the third claim and
suggest it may be worth further investigation.
180 Michael Hoey
3.4 Claim 4: Lexical items may have a positive or negative preference for
occurring at the beginning or end of an independently recognised chunk of text
The fourth claim was that lexical items may have a positive or negative
preference for occurring at the beginning or end of an independently recognised
chunk of text, e.g. the paragraph. The problem with verifying this claim is of
course that it is not easy to find independently recognised chunks of text that
have validity.
In writing, of course, there is rough chunking associated with
paragraphing, but it is not difficult to demonstrate that paragraphs have no
internal structure (except in so far as several generations of Freshman English
students in the United States have been taught to write paragraphs with a
particular arbitrarily imposed structure and this structure is becoming a self-
fulfilling prophecy). I have argued elsewhere (Hoey 1985) that paragraphs are a
device used by writers to signal to readers how parts of the text relate to each
other. They are not therefore an ideal starting-point for demonstrating the validity
of my fourth claim. Nevertheless, in the absence of other chunking devices, it is
possible to use paragraph boundaries, section divisions and of course the
beginnings of texts to test the claim.
The hypothesis here is that certain Thematised words or phrases have a
preference for occurring at the beginning or end of a paragraph, or a preference
for avoiding such positions. So a particular sentence-initial word might have a
preference for being paragraph-initial. There is no assumption here that the
sentence-initial word has to have a colligational preference for being sentence-
initial in order to be eligible for consideration with regard to the paragraph-initial
claim. It is perfectly feasible that a particular word or phrase might have no
special preference for being sentence-initial or indeed even have preference for
being in a non-sentence-initial position and still have, when in sentence-initial
position, a preference for being paragraph initial.
In Hoey (1997) I report an experiment carried out to discover whether
there is any relationship between paragraphing and the predilection of certain
lexical items for appearing in paragraph-initial position. The experiment took the
form of asking 67 students to paragraph a short passage from a history textbook
that had been previously deparagraphed.5 The passage in question was the
following:
1 Grant was, judged by modern standards, the greatest general
2 of the Civil War. He was head and shoulders above any general on either
3 side as an over-all strategist, as a master of what in later wars
4 would be called global strategy. His Operation Crusher plan, the
5 product of a mind which had received little formal instruction in the
6 higher area of war, would have done credit to the most finished
7 student of a series of modern staff and command schools. He was a
8 brilliant theatre strategist, as evidenced by the Vicksburg campaign,
9 which was a classic field and siege operation. He was a better
10 than average tactician, although, like even the best generals of
11 both sides, he did not appreciate the destruction that the increasing
12 firepower of modern armies could visit on troops advancing across
13 open spaces. Lee is usually ranked as the greatest
14 Civil War general, but this evaluation has been made without
15 placing Lee and Grant in the perspective of military
16 developments since the war. Lee was interested hardly at all
17 in global strategy, and what few suggestions he did make to
18 his government about operations in other theatres than his own
19 indicate that he had little aptitude for grand planning.
20 As a theatre strategist, Lee often demonstrated more brilliance
21 and apparent originality than Grant, but his most audacious plans were
22 as much the product of the Confederacy's inferior military
23 position as of his own fine mind. In war, the weaker side
24 has to improvise brilliantly. It must strike quickly, daringly,
25 and include a dangerous element of risk in its plans. Had Lee
26 been a Northern general with Northern resources behind him he would
27 have improvised less and seemed less bold. Had Grant been
28 a Southern general, he would have fought as Lee did.
29 Fundamentally Grant was superior to Lee because in a modern
30 total war he had a modern mind, and Lee did not. Lee
31 looked to the past in war as the Confederacy did in spirit.
32 The staffs of the two men illustrate their outlooks. It would
33 not be accurate to say that Lee's general staff were
34 glorified clerks, but the statement would not be too wide
35 off the mark
The students were not told how many breaks to make; this was left to their
discretion. The number of breaks varied from one to eight, with slightly under
half of the students making three breaks. The choices of paragraph break made by
all informants are given in Table 3, choices being represented in terms of the lines
in which the sentences begin.
Some sentences were clearly seen as strong candidates for beginning a
paragraph (e.g. the sentence beginning on line 13) while others were not chosen
by any informant (e.g. the sentence beginning on line 24). Equally clearly there
was no unanimity as to where to break, with no paragraph boundary finding
universal approval. This undermines any claim that might be made for paragraphs
having a structural status, unless of course my students are thought to have been
deficient in this respect (and since they were not for the most part deficient at the
level of the clause or the group, that would itself be of interest). Instead it points
to there being some non-structural explanation. One such explanation lies in the
textual relations in the passage, and this explanation I have explored fully
elsewhere (Hoey 1985, 1997). However, a more interesting (and not
incompatible) explanation is, as mentioned above, that the students were being
cued by the lexical items that begin the sentences; in other words, it is possible
that the students were choosing to paragraph in one place rather than another
because of the way the sentences began.
Table 3: The distribution of paragraph break choices across the range of possible
break points

Line on which         Number of informants beginning   % of informants
sentence starts       a paragraph at this point        making the choice
2 (He..)              0                                -
4 (His..)             11                               17%
7 (He..)              22                               33%
9 (He..)              0                                -
13 (Lee..)            62                               94%
16 (Lee..)            7                                11%
20 (As..)             32                               49%
23 (In..)             32                               49%
24 (It..)             0                                -
25 (Had..)            2                                3%
27 (Had..)            1                                2%
29 (Fundamentally..)  42                               64%
30 (Lee..)            0                                -
32a (The..)           13                               20%
32b (It..)            5                                8%
With this hypothesis in mind, I set about examining the lexical items that
began each of the sentences that were candidates for beginning a paragraph.
When I undertook that work, my corpus was much smaller than it now is, and
the numbers looked at were not large (typically between 40 and 100 instances,
and in a couple of cases fewer than this). What I was concerned to do, however,
was focus my hypothesis, not prove by weight of numbers the paragraph priming
of certain words or phrases.
The results of my analysis were provisionally supportive of the hypothesis.
To begin with, my analysis suggested that exactly 50% of single surnames (like
Grant and Lee) in sentence-initial position are also paragraph initial; there are of
course four places where a surname is the first word of a sentence (lines 1, 13, 16,
30) (and a further four where it appears within Theme: lines 20, 25, 27, 29).6
There did not however appear to be any tendency for single surnames to be
sentence-initial; rather the opposite. So we have the hypothesis of a negative
priming for surnames in Theme but a positive priming for Thematised surnames
in paragraph-initial position.
The exact opposite appeared for he, which begins three sentences in the
passage (lines 2, 7, 9). The evidence pointed strongly towards he having a strong
priming for Theme. Of 100 instances consulted 30 were sentence-initial, and this
of course discounts instances of he occurring within Theme but not in 1st position
in the sentence. On the other hand there was no tendency for sentence-initial he to
be also paragraph-initial. In fact he occurred in paragraph-initial position in my
corpus two and a half times less often than would have been expected on the basis
of random distribution. So we have the hypothesis of a positive priming for he in
Theme but a negative priming for beginning a paragraph.
The data for his (line 4) illustrated a third possibility. In contrast with he,
his showed a negative priming for being sentence-initial. Those few instances that
were sentence-initial showed an equal negative priming for being paragraph
initial. So we have the hypothesis of a negative priming for both sentence-initial
and paragraph-initial position.
Phrases beginning as a are not all of the same kind. I divided them into
three moderately distinct categories: phrases with a non-human nominal group,
e.g. as a legacy of, as a consequence; phrases with a human referent for the
noun acting as head of the group but without implication of function or role, e.g.
as a boy, as a Frenchman; and finally phrases with a human referent for the head
noun that described a function or role, e.g. as a biologist, as a musician. One of
the candidate sentences for beginning a paragraph begins with the third class of
as a(n) X (i.e. As a theatre strategist, line 20). Phrases of this third category were
found, admittedly on the basis of few data, to be positively primed for paragraph-
initial position; no calculation was made of their tendency to be sentence-initial.
It proved very difficult to explore the textual priming of in war (line 23).
In the small corpus I was working with at that time, the phrase in war itself only
occurred once, in second position in a paragraph supporting a generalisation, and
once as in a war in paragraph-initial position. A more recent trawl of the
100-million-word supplemented Guardian corpus still only threw up 16 examples of
the phrase. Of these 5 were paragraph-initial, 9 were non-initial and 2 occurred in
quoted speech too short to be subject to paragraphing. The average length of the
paragraphs was five sentences. Trivially sparse though these data are, they
suggest that in war may have a positive bias towards being paragraph-initial, a
conclusion I reached when I initially analysed my data on the basis of the
distribution of in plus abstract noun. Again, then, if the hypothesis were to be
supported by better data, we would be looking at a phrase with a strong aversion
to being part of Theme and a strong preference for paragraph-initial position in
the rare circumstances of its being thematised.
Looking next at it functioning as pronoun (line 24), we again have
evidence of a negative colligation. I examined 149 instances of it functioning as
anaphoric pro-form in sentence-initial position. Of these a mere 8% (12) were
also in paragraph-initial position, compared with the 25% that might have been
anticipated if its positioning were the result of random distribution. It is therefore
three times less likely to appear in paragraph-initial position than would be
accountable for in terms of chance.
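
The comparison with random distribution used here and in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually followed: the chance rate is taken to be one over the average number of sentences per paragraph (so the 25% quoted for it is read as implying four-sentence paragraphs), and the function names are hypothetical.

```python
def chance_rate(avg_sentences_per_paragraph):
    """Share of sentences expected to open a paragraph if position were random."""
    return 1 / avg_sentences_per_paragraph


def priming_ratio(paragraph_initial, sentence_initial, avg_sentences_per_paragraph):
    """Observed paragraph-initial rate divided by the chance rate.

    Values above 1 point towards positive priming for paragraph-initial
    position, values below 1 towards negative priming.
    """
    observed = paragraph_initial / sentence_initial
    return observed / chance_rate(avg_sentences_per_paragraph)


# Anaphoric "it": 12 of 149 sentence-initial instances also open a paragraph.
# Assumption: the 25% chance rate corresponds to four sentences per paragraph.
it_ratio = priming_ratio(12, 149, 4)

# "In war": 5 paragraph-initial against 9 non-initial instances; the 2 cases in
# quoted speech are left out. Paragraphs averaged five sentences.
in_war_ratio = priming_ratio(5, 5 + 9, 5)

print(f"it:     {it_ratio:.2f} times the chance rate")
print(f"in war: {in_war_ratio:.2f} times the chance rate")
```

On these figures it falls to roughly a third of the chance rate, which is the sense in which it is three times less likely to be paragraph-initial, while in war comes out above the chance rate on very few data.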
There were only 29 instances of Had X been (lines 25 and 27) in the
corpus I was using at that time. One in six of these began a paragraph. Again,
though based on few data, this points to a negative colligation with paragraph-
initial position.
With fundamentally (line 29), we again have a lexical item that is
negatively primed for Theme. With my original data I had to use a number of
suspect strategies in order to have enough data to analyse; re-examining the
lexical item with my current corpus of 100 million words, and with a consequent
786 instances of fundamentally, there were still only 20 instances in sentence-
initial position for investigation. Clearly the word does not like to begin
sentences. On the basis of my original data, I came to the conclusion that
fundamentally had a positive colligational preference for paragraph-initial
position, with 50% more likelihood of beginning a paragraph than was explicable
in terms of random distribution. With the still thin but better data of the later
corpus, I come to the same conclusion. Of the 20 sentence-initial cases, six begin
paragraphs and 13 do not; the final instance begins a one-sentence paragraph, and
this is discounted in the analysis. The average length of the paragraphs is 5
sentences (though this is distorted upwards by one particularly long paragraph).
Again, fundamentally turns out to begin paragraphs 50% more often than one
might expect.7
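
Because the average paragraph length is said to be distorted upwards by one long paragraph, it may help to see how far the estimate depends on that average. The sketch below is illustrative only: the alternative averages of 4 and 6 sentences are assumptions introduced for comparison, and the function name is hypothetical.

```python
def ratio_to_chance(paragraph_initial, non_initial, avg_sentences_per_paragraph):
    """Observed paragraph-initial share relative to the share expected by chance."""
    observed = paragraph_initial / (paragraph_initial + non_initial)
    expected = 1 / avg_sentences_per_paragraph
    return observed / expected


# Sentence-initial "fundamentally": 6 begin paragraphs, 13 do not; the instance
# that begins a one-sentence paragraph is discounted, as in the text.
for avg_len in (4, 5, 6):  # 5 is the reported average; 4 and 6 are hypothetical
    value = ratio_to_chance(6, 13, avg_len)
    print(f"average of {avg_len} sentences: {value:.2f} times the chance rate")
```

With the reported average of five sentences the ratio is a little over one and a half, in line with the figure of about 50% more often than expected; a shorter assumed average brings the ratio closer to 1.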
On the basis of this analysis, it was possible to correlate the positive and
negative priming for paragraph-initial position with the decisions that the students
had made.8 The correlation can be seen in Table 4. I have starred those results
which seem anomalous. It will be seen that for the most part there is a good
match between actual student choice of paragraph boundary and predicted
boundary breaks on the basis of corpus evidence. Where there is a discrepancy,
there are good reasons for it. In terms of the structure of the passage, the sentence
starting at line 4 represents a deviation from the smooth parallelism of the
comparison. Those paragraphing at line 4, despite the negative colligation of his
for paragraph initiation, were doing so to mark this deviation. The rather larger
number who broke at line 7 were breaking, again in defiance of the negative
colligation of he, in order to mark a return to the parallelism. Those breaking, on
the other hand, at line 16 were doing so with no text-structural grounds for their
decision; the only thing going for such a break is the positive colligation of names
with paragraph initiation. The failure to break at line 30 is less interesting; the
fact that the passage is coming to an end at this juncture would have been a
deterrent to some informants, irrespective of the merits of a potential break at this
point.
Table 4: Informant choices compared with textual colligation

Sentence-initial word      Line   Paragraph-initial   % of informants making a paragraph
or phrase                  no     colligation         break at this point (67 informants)
Grant                      1      Positive            100% (by default)
He                         2      Negative            0%
His                        4      Negative            17%
He                         7      Negative            33% *
He                         9      Negative            0%
Lee                        13     Positive            94%
Lee                        16     Positive            11%
As a NG (human function)   20     Positive            49%
In NG (generalised noun)   23     Positive            49%
It (pronoun)               24     Negative            0%
Had NG Vn                  25     Negative            3%
Had NG Vn                  27     Negative            2%
Fundamentally              29     Positive            64%
Lee                        30     Positive            0% *
illustrate                 32a    Neutral             20%
It (anticipatory)          32b    Positive            8% *
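
As a rough illustration of the match between predicted colligation and informant behaviour, the break percentages can be averaged separately for the positively and negatively primed sentence openers. This summary is not part of the original analysis: the predictions are taken from Table 4 and the percentages from Table 3, Grant (line 1) is left out because its 100% is by default, the neutral item at line 32a is left out because it makes no prediction, and the variable names are illustrative.

```python
from statistics import mean

# (sentence opener, line, predicted colligation, % of informants breaking there)
ROWS = [
    ("He", "2", "negative", 0),
    ("His", "4", "negative", 17),
    ("He", "7", "negative", 33),
    ("He", "9", "negative", 0),
    ("Lee", "13", "positive", 94),
    ("Lee", "16", "positive", 11),
    ("As a NG (human function)", "20", "positive", 49),
    ("In NG (generalised noun)", "23", "positive", 49),
    ("It (pronoun)", "24", "negative", 0),
    ("Had NG Vn", "25", "negative", 3),
    ("Had NG Vn", "27", "negative", 2),
    ("Fundamentally", "29", "positive", 64),
    ("Lee", "30", "positive", 0),
    ("It (anticipatory)", "32b", "positive", 8),
]


def mean_break_rate(rows, colligation):
    """Average % of informants breaking at openers with the given prediction."""
    return mean(pct for _, _, predicted, pct in rows if predicted == colligation)


mean_positive = mean_break_rate(ROWS, "positive")
mean_negative = mean_break_rate(ROWS, "negative")

print(f"positively primed openers: {mean_positive:.1f}% on average")
print(f"negatively primed openers: {mean_negative:.1f}% on average")
```

An unweighted average of this kind hides the individual discrepancies discussed above (lines 4, 7, 16 and 30), so it should be read only as a coarse summary of the direction of the match.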
To test whether these claims were correct and to discover whether the
colligations were having any effect on the judgements of students, I then doctored
the original text slightly as follows; changes are indicated in bold:
1 Grant was, judged by modern standards, the greatest general
2 of the Civil War. He was head and shoulders above any general on either
3 side as an over-all strategist, as a master of what in later wars
4 would be called global strategy. Grant's Operation Crusher plan, the
5 product of a mind which had received little formal instruction in the
6 higher area of war, would have done credit to the most finished
7 student of a series of modern staff and command schools. He was a
8 brilliant theatre strategist, as evidenced by the Vicksburg campaign,
9 which was a classic field and siege operation. He was a better
10 than average tactician, although, like even the best generals of
11 both sides, he did not appreciate the destruction that the increasing
12 firepower of modern armies could visit on troops advancing across
13 open spaces. Lee is usually ranked as the greatest
14 Civil War general, but this evaluation has been made without
15 placing Lee and Grant in the perspective of military
16 developments since the war. He was interested hardly at all
17 in global strategy, and what few suggestions he did make to
18 his government about operations in other theatres than his own
19 indicate that he had little aptitude for grand planning.
20 He often demonstrated more brilliance and apparent originality
21 as a theatre strategist than Grant, but his most audacious plans were
22 as much the product of the Confederacy's inferior military
23 position as of his own fine mind. The weaker side
24 has to improvise brilliantly in war. It must strike quickly, daringly,
25 and include a dangerous element of risk in its plans. Had Lee
26 been a Northern general with Northern resources behind him he would
27 have improvised less and seemed less bold. Had Grant been
28 a Southern general, he would have fought as Lee did.
29 Fundamentally Grant was superior to Lee because in a modern
30 total war he had a modern mind, and Lee did not. Lee
31 looked to the past in war as the Confederacy did in spirit.
32 The staffs of the two men illustrate their outlooks. It would
33 not be accurate to say that Lee's general staff were
34 glorified clerks, but the statement would not be too wide
35 off the mark
The changes were designed to test whether students had been influenced by the
textual colligations. I hypothesised that the change to line 4 would reinforce the
structural pressure to break at this point and that there would be a consequent
increase in the popularity of this sentence as a paragraph boundary. I likewise
hypothesised that removal of the positive colligation at the beginning of line 16
would render it unattractive as a candidate boundary. I further hypothesised that
the removal of positive colligations from the Themes of the sentences beginning
in lines 20 and 23 would reduce their attractiveness as potential paragraph breaks.
Having made these changes, I gave the task to a set of 32 informants drawn from
the same undergraduate degree programme; needless to say, the second set of
informants did not include any from the first set. Their paragraphing decisions are
recorded in Table 5. The first column represents the potential paragraph breaks
(by line number); I have indicated where the potential break in question has had
its wording altered. The second column indicates the number of informants
choosing to break at this point and the third represents this as a percentage of the
cohort; the final column gives the results from the original experiment for
purposes of comparison.
Table 5 contains broad support for my position. The alteration at line 4
from his with its negative priming for paragraph-initial position to Grant with its
positive priming for such positioning brings with it a surge in the percentage of
informants choosing to paragraph at this point (and a corresponding reduction in
the percentage paragraphing at line 7). The alteration at line 16 in the reverse
direction, i.e. from Proper noun to pronoun, is associated with a reduction in the
number of informants paragraphing there. The change at line 20 removing the
positively primed fronted adjunct as a X and replacing the proper noun with a
pronoun sees a halving of the proportion of people choosing to paragraph at this
juncture despite there being good text-linguistic grounds for making such a break.
All these changes in the popularity of the relevant line breaks are in line with
Claim 4.
Table 5: A comparison of the two cohorts of informants in respect of their
paragraphing decisions on the original and altered de-paragraphed text

Line            Number of informants     % of informants choosing   % of original informants
                choosing this point as   this point as a            choosing this point as a
                a paragraph break        paragraph break            paragraph break
2               0                        -                          -
4 (altered)     12                       38%                        17%
7               6                        19%                        33%
9               1                        3%                         -
13              31                       97%                        94%
16 (altered)    1                        3%                         11%
20 (altered)    7                        22%                        49%
23 (altered)    19                       59%                        49%
24              0                        -                          -
25              5                        16%                        3%
27              1                        3%                         2%
29              19                       59%                        64%
30              2                        6%                         -
32a             5                        16%                        20%
32b             1                        3%                         8%
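
The shifts between the two cohorts can be laid out as percentage-point changes. The sketch below simply subtracts the original percentage from the percentage for the altered text at each potential break in Table 5; a "-" cell is read as 0, and the names used are illustrative rather than part of the original study.

```python
# % of informants breaking at each line: (original text, altered text), from Table 5.
# "-" cells are read as 0.
BREAKS = {
    "2": (0, 0), "4": (17, 38), "7": (33, 19), "9": (0, 3), "13": (94, 97),
    "16": (11, 3), "20": (49, 22), "23": (49, 59), "24": (0, 0), "25": (3, 16),
    "27": (2, 3), "29": (64, 59), "30": (0, 6), "32a": (20, 16), "32b": (8, 3),
}

# Lines whose sentence openers were reworded in the second version of the passage.
ALTERED = {"4", "16", "20", "23"}


def change(line):
    """Percentage-point change at a break point (altered minus original)."""
    original, altered = BREAKS[line]
    return altered - original


for line in BREAKS:
    note = "(wording altered)" if line in ALTERED else ""
    print(f"line {line:>3}: {change(line):+3d} points {note}")
```

Since the two cohorts differ in size (67 and 32 informants), small differences at unaltered lines are to be expected and should not be over-interpreted.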
One result is not as expected. The removal of the fronted adjunct In X sees
an increase rather than a decrease in the percentage of people choosing to break at
line 23. The increase at first sight seems to provide counter-evidence for the claim
advanced in this paper. On closer examination, however, the increase is
supportive of the claim, not challenging to it. The reason is that the structure the
+ adjective + noun turns out to be an even stronger paragraph-initiator than in X.
It is a relatively rare structure. In a concordance of the word the, comprising
1548 lines created from seven BNC files, there were only 116 instances with this
structure (7.5%). Of
these 116 instances, 51 (44%) were either paragraph initial or text initial,
approximately twice as many as would have been expected on the basis of
random distribution. So the move of In war to non-initial position only had the
effect of bringing to the front of the sentence a structure even more associated
with paragraph initiation than the structure it replaced.
The association of certain words or phrases with paragraph initiation is not the
only kind of chunking that can be attested. Some words or phrases have a strong
tendency to be associated with text initiation. In my corpus, for example, sixty
and today both have strong tendencies to appear in text-initial sentences; sixty
also has a tendency to appear at the beginning of its sentence.
Hoey (2000) describes a small-scale experiment to investigate this phenomenon,
making use of a jumbled version of a short text by Ingmar Bergman.
Both the text-initiation experiment and the paragraph-initiation experiment
described here (and in Hoey 1997) only hint at the way forward (apart from any
worry one might have about the experimental designs used) because a corpus of
100 million words is quite small when one is counting paragraphs and tiny when
one is counting texts. 100 million words constitutes very approximately only one
million paragraphs and (equally approximately) a mere 200,000 short texts, and a
corpus of one million words would for many purposes be regarded as a modest
corpus while a corpus of 200,000 words would normally be regarded as barely
adequate for all but the most common words in the language. Nevertheless what
evidence there is provides support for the claim that lexical items are primed
(positively or negatively) to appear in paragraph initial or even text initial
position.
3.5 Claim 5: Textual colligations may only or especially be operative in
texts of a particular type or genre or designed for a particular
community of users, e.g. academic papers
The final claim will be given little attention, but it is an important one. This is
that all of the above claims should be regarded as domain-specific. In other
words, a word is not primed for textual use in all contexts but only under certain
conditions. Thus, for example, ago may, as I have argued here, be associated with
contrast in newspaper text and it may have a similar association in academic
articles; a corpus of fictional narrative on the other hand would be unlikely to
throw up this textual colligation. It might well be that both news articles and
fictional narratives favour the word appearing in text-initial position (certainly I
have evidence for this in news texts) but it is extremely doubtful whether there is
any such association in advertisements. And so on. Textual colligation claims
must be tied to particular genres, text-types, domains, communities of users
(defined temporally as well as in terms of employment and place) and the like. It
is probably this property of domain specificity that has led to textual colligation
being overlooked in corpus studies hitherto, in that large general corpora are ideal
for picking up patterns across a wide range of domains but have to be used
carefully to pick up features true of only certain kinds of text.
4. Colligational prosody
Once we recognise that our generalisations must be bounded, it is possible to
produce a fragment of a matrix of the kind hypothetically posited (as Table 1) at
the outset of this paper. Thus the phrase sixty years ago today can be represented
as shown in Table 6.
Table 6: A fragment of a text-colligational matrix

                           sixty               years               ago                 today
Preference for             negative            positive            negative            negative
participating in
cohesive chains

Preference for occurring   positive            weakly positive     positive            neutral
as part of Theme

Preference for occurring   contrast            contrast or         contrast            contrast
as part of a specified                         change
semantic relation

Preference for occurring   positive for        positive for        positive for        positive for
at the beginning or end    paragraph initial   paragraph initial   paragraph initial   paragraph initial
of a recognisable chunk    and text initial    (when Theme)        in certain phrases  (when Theme)
                           (when Theme)                            (when Theme)

Constraints on             Feature articles    Feature articles    Feature articles    Feature articles
operation of preferences
It will be observed that the individual words that make up the phrase sixty years
ago today share a number of properties. Thus sixty, ago and today share the
property of having a negative preference for cohesive chains, and sixty, years
and ago share a preference for being Thematised. We can label this colligational
prosody and its presence is both explained by and helps to explain the power of
collocation.
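
One way to make the notion concrete is to treat the matrix as a small data structure and pick out the preferences shared by most of the words in the phrase. The sketch below is an illustration only: it covers just the first two rows of Table 6, collapses "weakly positive" into "positive", and uses a threshold of three words out of four, all of which are assumptions made here rather than part of the proposal.

```python
# Polarity of each word's preference for two of the properties in Table 6.
# Simplification: "weakly positive" is recorded as "positive".
MATRIX = {
    "cohesive chains": {
        "sixty": "negative", "years": "positive", "ago": "negative", "today": "negative",
    },
    "Theme": {
        "sixty": "positive", "years": "positive", "ago": "positive", "today": "neutral",
    },
}


def prosodies(matrix, min_words=3):
    """For each property, return the preference shared by at least `min_words` words.

    The result maps property -> (shared preference, words that share it).
    Properties with no sufficiently shared preference are omitted.
    """
    shared = {}
    for prop, prefs in matrix.items():
        groups = {}
        for word, pref in prefs.items():
            groups.setdefault(pref, []).append(word)
        for pref, words in groups.items():
            if len(words) >= min_words:
                shared[prop] = (pref, words)
    return shared


for prop, (pref, words) in prosodies(MATRIX).items():
    print(f"{prop}: {pref} for {', '.join(words)}")
```

Run on these two rows, the sketch picks out the negative preference for cohesive chains shared by sixty, ago and today, and the preference for Theme shared by sixty, years and ago.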
Table 7 represents the colligational prosodies of our chosen phrase.
Table 7: Colligational prosody in the phrase sixty years ago today

                             sixty               years               ago                 today
Preference for               negative                                negative            negative
participating in
cohesive chains

Preference for occurring     positive            weakly positive     positive
as part of Theme

Preference for occurring     contrast            contrast or         contrast            contrast
as part of a specified                           change
semantic relation

Preference for occurring     positive for        positive for        positive for        positive for
at the beginning or end      paragraph initial   paragraph initial   paragraph initial   paragraph initial
of a recognisable chunk      and text initial    (when Theme)        and text initial    and text initial
                             (when Theme)                            in certain phrases  (when Theme)
                                                                     (when Theme)

Constraints on operation     Feature articles    Feature articles    Feature articles    Feature articles
of preferences
5. Conclusions
I have been arguing in this paper for a new perspective on text-linguistics, one
that is rooted in the lexical item, not, as previously, a perspective that sees lexis as
a network in the text contributing to its cohesion or as contributing to the
signalling of the text organisation (though this perspective is not rendered
obsolete), but a perspective that makes no distinction between the description of
the text and the description of its component lexis. I have tried to show that the
properties of text can all be tackled through the concept of textual colligation.
More specifically I have argued that lexis is primed for textual use, such
that the choice of a lexical item is simultaneously the choice of its primings. Any
lexical item is primed positively or negatively with respect to cohesion, semantic
relations in the text, Theme and textual divisions. This is not to say that the
choice of a lexical item compels certain textual developments but it certainly
makes those developments more likely. The case of paragraphing in particular
suggests that some thorny text-linguistic problems might be amenable to solution,
or at least clarification, if a lexical perspective is adopted. Just as importantly, the
work I have reported here, if supported in subsequent investigations, suggests that
a corpus-centred account of the lexical item that stops at the phrase may be
unnecessarily limited. A full account of the word may be some way off yet.
Notes
1. Dubois (2002) provides powerful evidence of this; one of his informants
comments on an Epistle of St Paul as he reads it without apparent awareness
of an audience.
2. The sharp-eyed and knowledgeable will recognise that the last two sentences
of this paragraph are a direct quotation from Donald A. Yates's translation of
Death and the Compass by J.-L. Borges. They therefore perfectly illustrate
the point about writing being reproduction. The speaker of these sentences in
the story is however led to his doom by false hypotheses so we cannot assume
that the sentiments are safe or Borges's own.
3. Here and henceforward the evidence is drawn from concordances created out
of a corpus of 100 million words, made up predominantly of Guardian
newspaper data (approximately 96 million) with a topping up from the BNC
and a Liverpool-constructed database of spoken English of approximately _
million words.
4. Given that there are approximately twice as many tokens in Rheme as in
Theme, a lexical item could be said to occur in accordance with our
expectations if it occurs in Theme a third of the time. Random distribution
would suffice as an explanation of the occurrence of a lexical item in Theme
or Rheme if the occurrence of the lexical item in Theme and Rheme was
unaffected by the nature of the lexical item itself.
5. The passage, from Lincoln and His Generals by T. Harry Williams, was
originally selected and deparagraphed by Richard Young and Alton Becker
and their work was reported in a mimeographed paper. A more general
account of their research was reported in Koen et al. (1969). Although the
research reported here used many more informants than did theirs and my
findings are different from (though not unrelated to) their findings, I wish to
pay tribute to their pioneering work without which I would certainly never
have considered exploring these matters. The line numbering and line breaks
are as in the original Young and Becker experiment. A discussion of their
work and this experiment can be found in Hoey (1985).
6. The full details of my original analysis can be found in Hoey (1997).
7. In addition I analysed paragraph-initial cases of more fundamentally, most
fundamentally, and but fundamentally. There were 31 of these and 19 of them
also began paragraphs.
8. I also analysed the paragraph-initial properties of illustrate and anticipatory it,
but since these are not psychologically likely break points because of the
closeness to the end of the passage, I have not dwelt on them here. For what it
is worth, on the basis of few data, illustrate seemed to have a weak preference
for paragraph-initial position and an equally weak preference for paragraph-
final position; it would probably be best characterised as neutral with regard
to paragraph positioning. Anticipatory it showed quite a strong tendency to be
paragraph-initial, as strong as fundamentally.
References
Bakhtin, M. (1973), Problems of Dostoevsky's poetics [1929], translated by R. W.
Rotsel. Ann Arbor, Michigan: Ardis.
Beekman, J. (1970), Propositions and their relations within a discourse, Notes
on Translation. 37: 6-23.
Beekman, J. and J. Callow (1974), Translating the word of God. Michigan:
Zondervan Press.
Bolivar, A. de (2001), The negotiation of evaluation in written texts, in: M.
Scott and G. Thompson (eds), Patterns of text, Amsterdam: John
Benjamins. 129-158.
Chouliaraki, L. and N. Fairclough (2001), Discourse in late modernity: rethinking
critical discourse analysis. Edinburgh: Edinburgh University Press.
Crombie, W. (1985), Process and relation in discourse and language learning.
Oxford: Oxford University Press.
Daneš, F. (1974), Functional sentence perspective and the organization of the
text, in: F. Daneš (ed.), Papers on functional sentence perspective,
Prague: Academia. 105-28.
Dijk, T. van and W. Kintsch (1978), Cognitive psychology and discourse:
recalling and summarizing stories, in: W.U. Dressler (ed.), Current trends
in textlinguistics, Berlin: Walter de Gruyter. 61-80.
Dubois, J. (2002), What is (natural) discourse? Implications for spoken corpus
research. Paper presented at ICAME 2002 (the 23rd International
Conference on English Language Research on Computerized Corpora of
Modern and Medieval English), Göteborg, 22-26 May 2002.
Fairclough, N. (1989), Language and power. London: Longman.
Firbas, J. (1966), Non-thematic subjects in contemporary English, Travaux
Linguistiques de Prague 2: 239-256.
Firbas, J. (1986), Given and new information and some aspects of the structures,
semantics and pragmatics of written texts, in: C.R. Cooper and S.
Greenbaum (eds), Studying writing: linguistic approaches, Written
Communication Annual, Vol. 1, London/Beverly Hills, Cal.: Sage. 40-71.
Goffman, E. (1981), Forms of talk. Philadelphia: University of Pennsylvania
Press.
Goodman, K. (1967), Reading: a psycholinguistic guessing game, Journal of
the Reading Specialist 6: 126-135.
Goodman, K. (1973), On the psycholinguistic method of teaching reading, in: F.
Smith (ed.), Psycholinguistics and reading, New York: Holt, Rinehart and
Winston. 177-182.
Graustein, G. and W. Thiele (1979), An approach to the analysis of English
texts, Linguistische Studien A55: 3-15.
Graustein, G. and W. Thiele (1987), Properties of English texts. Leipzig: VEB
Verlag Enzyklopädie Leipzig.
Halliday, M.A.K. (1994), An introduction to functional grammar (2nd ed.).
London: Edward Arnold.
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London: Longman.
Halliday, M.A.K. and R. Hasan (1985), Language, context and text: Aspects of
language in a social-semiotic perspective. Geelong: Deakin University
Press (republished in 1989 by Oxford University Press).
Halliday, M.A.K. and Z.L. James (1993), A quantitative study of polarity and
primary tense in the English finite clause, in: J. Sinclair, M. Hoey and G.
Fox (eds), Techniques of description: spoken and written discourse. A
festschrift for Malcolm Coulthard, London: Routledge. 32-66.
Hasan, R. (1984), Coherence and cohesive harmony, in: J. Flood (ed.),
Understanding reading comprehension, Delaware: International Reading
Association. 181-219.
Hodge, R. and G. Kress (1993), Language as ideology (2nd ed.). London:
Routledge.
Hoey, M. (1979), Signalling in discourse. Discourse Analysis Monographs No 6,
Birmingham: ELR, University of Birmingham.
Hoey, M. (1983), On the surface of discourse. London: George Allen and Unwin
(reprinted in Reprints in Systemic Linguistics series, University of
Nottingham, 1991).
Hoey, M. (1985), The paragraph boundary as a marker of relations between the
parts of a discourse, M.A.L.S. Journal 10: 96-107.
Hoey, M. (1991), Patterns of lexis in text. Oxford: Oxford University Press.
Hoey, M. (1997), The interaction of textual and lexical factors in the
identification of paragraph boundaries, in: M. Reinhardt and W. Thiele
(eds), Grammar and text in synchrony and diachrony in honour of
Gottfried Graustein, Frankfurt am Main: Vervuert Verlag & Madrid:
Iberoamericana. 141-67.
Hoey, M. (2000), The hidden lexical clues of textual organisation, in: L.
Burnard and T. McEnery (eds), Rethinking language pedagogy from a
corpus perspective, Frankfurt: Peter Lang, 31-42.
Hoey, M. (2001), Textual interaction. London: Routledge.
Hoey, M. (forthcoming a), The textual priming of lexis, to appear in G. Aston
(ed.), Proceedings of TALC, Bertinoro, 2002.
Hoey, M. (forthcoming b), Lexical priming and properties of text, to appear in A.
Partington, J. Morley and L. Haarman (eds), Corpora and Discourse. Bern:
Peter Lang.
Koen, F., R. Young and A. Becker (1969), The psychological reality of the
paragraph, Journal of Verbal Learning & Verbal Behavior, 8.1: 49-53.
Labov, W. (1972), Language of the inner city: Studies in the Black English
vernacular. Philadelphia, Pa: University of Pennsylvania Press.
Labov, W. and J. Waletzky (1967), Narrative analysis: oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts,
Seattle: University of Washington Press. 25-42.
Longacre, R.E. (1968), Discourse, paragraph and sentence structure in selected
Philippine languages. S.I.L. Publications in Linguistics & Related Fields,
No 21, Vols. 1 & 2. Dallas, Texas: Summer Institute of Linguistics
Publications.
Longacre, R.E. (1979), The paragraph as a grammatical unit, in: T. Givón (ed.),
Discourse and syntax, New York: Academic Press. 115-34.
Longacre, R.E. (1983), The grammar of discourse. New York: Plenum Press.
Mann, W.C. and S.A. Thompson (1986), Relational processes in discourse,
Discourse Processes 9.1: 57-90.
Mann, W.C. and S.A. Thompson (1987), Rhetorical Structure Theory: a theory of
text organization. Marina del Rey, Ca: Information Sciences
Institute/University of Southern California.
Martin, J.R. (1992), English text: system and structure. Amsterdam: John
Benjamins.
Morgan, J.L. and M.B. Sellner (1980), Discourse and linguistic theory, in: R.J.
Spiro (ed.), Theoretical issues in reading comprehension: Perspectives
from cognitive psychology, linguistics, artificial intelligence and
education, Hillsdale, N.J.: Lawrence Erlbaum Associates. 165-200.
Nystrand, M. (1986), The structure of written communication: Studies in
reciprocity between writers and readers. Orlando: Academic Press.
Nystrand, M. (1989), A social interactive model of writing, Written
Communication 6.1: 66-85.
Parsons, G. (1995), Measuring cohesion in English texts: the relationship
between cohesion and coherence. Ph.D. Thesis, University of
Nottingham.
Partington, A. (2003), The linguistics of political argument: The spin-doctor and
the wolf-pack at the White House. London: Routledge.
Partington, A. and J. Morley (2002), From frequency to ideology: comparing
word and cluster frequencies in political debate. Paper given at the 5th
TALC (Teaching and Language Corpora) conference, Bertinoro, 26-31
July.
Scott, M. (1999), WordSmith Tools, Version 3, Oxford: Oxford University Press.
Sinclair, J.McH. (1993), Written discourse analysis, in: J. Sinclair, M. Hoey and
G. Fox (eds), Techniques of description: spoken & written discourse. A
festschrift for Malcolm Coulthard, London: Routledge. 6-31.
Smith, F. (1978), Understanding reading (2nd ed.). New York: Holt, Rinehart &
Winston.
Swales, J. (1981), Aspects of article introductions. Aston ESP Monographs No 1,
Birmingham: Aston University.
Swales, J. (1990), Genre analysis: English in academic and research settings.
Cambridge: Cambridge University Press.
Tadros, A. (1985), Prediction in text. Discourse Analysis Monographs,
Birmingham: ELR, University of Birmingham.
Tadros, A. (1993), The pragmatics of text averral and attribution in academic
text, in: M. Hoey (ed.), Data, description, discourse, London:
HarperCollins. 98-114.
Ventola, E. (1987), The structure of social interaction. London: Frances Pinter.
Widdowson, H. (1979), The process and purpose of reading, in: Explorations in
applied linguistics, Oxford: Oxford University Press. 173-84.
Williames, J. (1985), The interactive nature of the newspaper letter, M.A.L.S.
Journal, New Series 10: 108-140.
Winter, E. (1971), Connection in science material: a proposition about the
semantics of clause relations, C.I.L.T Papers and Reports No 7 (London:
Centre for Information on Language Teaching and Research for British
Association for Applied Linguistics). 41-52.
Winter, E. (1977), A clause-relational approach to English texts: a study of some
predictive lexical items in written discourse, Instructional Science 6.1: 1-91.
Winter, E. (1979), Replacement as a fundamental function of the sentence in
context, Forum Linguisticum 4.2: 95-133.
Winter, E. (1982), Towards a contextual grammar of English. London: George
Allen & Unwin.
Adverbials in IT-cleft constructions
Hilde Hasselgård
University of Oslo
Abstract
On the basis of material from the International Corpus of English this paper
presents a study of IT-clefts with an adverbial in cleft position. Most of these IT-
clefts are of the informational-presupposition type (Prince 1978), i.e. the cleft
clause conveys new information. Various textual functions of the IT-clefts are
explored, using the classification of Johansson (2002). Unexpectedly, the function
of giving contrastive focus to the cleft constituent does not seem to be
predominant in this material. An alternative hypothesis is explored: the IT-cleft
construction is seen primarily as a thematizing device, whereby the cleft
constituent receives thematic focus (which may imply contrast) and the theme-
rheme division is made particularly explicit.
1. Introduction
The IT-cleft construction has been studied both as a focusing device (e.g. Prince
1978, Gundel 2002) and as a thematizing device (e.g. Gómez-González 2000).
The construction is of interest in studies of information structure because it
allows a speaker/writer to spread the information of a single proposition over two
clauses and, consequently, two information units. It is normally assumed that the
cleft construction is a means of steering the focus towards the clefted constituent
(e.g. Gundel 2002: 118). The IT-cleft can have various types of phrases and
clauses as its focus, as shown below.1
IT-cleft = IT + BE + clefted constituent [NP, PP, AdvP, (non-)finite clause] + cleft clause
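The schema can be made concrete with a small search sketch. The Python fragment below is an illustration only and not the retrieval procedure used for this study: the ICE-GB material is syntactically parsed, whereas this sketch works on raw strings. The names `CLEFT_CANDIDATE` and `split_cleft` are hypothetical, and the pattern assumes a cleft clause introduced by that and a small set of forms of BE. It overgenerates (for instance, it also matches extraposition such as it is clear that ...) and it misses clefts without that, so any hits would still need manual inspection.

```python
import re

# Hypothetical, deliberately crude pattern for the schema
# IT + BE + clefted constituent + cleft clause (introduced by "that").
CLEFT_CANDIDATE = re.compile(
    r"\bit\s+(?:is|was|will\s+be)\s+(?P<focus>.+?)\s+that\s+(?P<clause>.+)",
    re.IGNORECASE,
)


def split_cleft(sentence):
    """Return (clefted constituent, cleft clause) for a candidate, else None."""
    match = CLEFT_CANDIDATE.search(sentence)
    if match is None:
        return None
    return match.group("focus"), match.group("clause")


# Sentences adapted from the examples quoted in this paper.
samples = [
    "It was then that I fell in love with music",
    "It is to Africa that the television cameras go",
    "Can we call a special meeting or something?",
]

for sentence in samples:
    print(sentence, "->", split_cleft(sentence))
```

The lazy quantifier on the focus group makes the pattern stop at the first that, which is the simplest choice but will split wrongly when the clefted constituent itself contains that.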
Adjunct adverbials, in contrast to conjuncts and disjuncts, can be the focus of an
IT-cleft construction (cf. Quirk et al. 1985: 504), as illustrated in (1):
(1) Can we call a special meeting or something?
Maybe just that it's this week that uhm there aren't enough people around
<S1B-078 #172-173:1:C>2
The aim here is to examine such adverbials in IT-cleft constructions in order to
discover the information structural role of the focused adverbial as well as the
function of the whole IT-cleft construction in context. The study is primarily
based on the British component of the International Corpus of English (ICE-GB).
2. Types of adjunct in cleft position
Before going further, I shall briefly outline the kinds of adverbials that occur in
cleft focus position in the ICE-GB. Table 1 shows the occurrence of different
semantic types of adjunct in cleft position. The distribution of semantic types of
adjuncts in cleft position is not surprising; time and place adjuncts are the most
common types of adjuncts in most registers irrespective of position (cf. Biber et
al. 1999: 783 f.).
Table 1. Adjuncts in cleft position in the ICE-GB

Semantic type     N      %
Time             23   45.1
Place            15   29.4
Manner3           7   13.7
Cause/reason      5    9.8
Condition         1    2.0
Total            51    100
The adverbials in cleft position have different realizations, of which the
most common is the prepositional phrase (Table 2). Table 2 includes a column
containing Johansson's (2002: 90) figures for the realization of adverbials in cleft
position in the English part of the English-Swedish Parallel Corpus (ESPC). As
shown in the table, the proportions of different realization types are relatively
similar across the two corpora. The frequency of the different realization types
corresponds quite well to the overall realization of adverbials regardless of
position (Hasselgård, in prep.; Biber et al. 1999: 769). One can thus assume that
adjuncts in cleft position are similar to those in other positions as regards their
semantic types as well as their realization.
Table 2. Realization of adjuncts in cleft position in ICE-GB and the English-Swedish
Parallel Corpus (ESPC figures from M. Johansson 2002: 90)

                        ICE-GB          ESPC
Realization             N      %       N      %
Prepositional phrase   34   66.7      58   73.4
Adverb phrase           8   15.7      12   15.2
Noun phrase             2    3.9       1    1.3
Clause                  7   13.7       8   10.1
Total                  51    100      79    100
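The percentages in Table 2 follow directly from the raw counts, and the comparison between the two corpora can be checked in a few lines of Python. The sketch below is a convenience for the reader rather than part of the original analysis; the names `ICE_GB`, `ESPC` and `shares` are introduced here for illustration, and the counts are copied from the table.

```python
# Raw counts copied from Table 2 (realization of adjuncts in cleft position).
ICE_GB = {"Prepositional phrase": 34, "Adverb phrase": 8, "Noun phrase": 2, "Clause": 7}
ESPC = {"Prepositional phrase": 58, "Adverb phrase": 12, "Noun phrase": 1, "Clause": 8}


def shares(counts):
    """Return each category's percentage of the column total, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


ice_shares = shares(ICE_GB)
espc_shares = shares(ESPC)

# Percentage-point difference per realization type (ICE-GB minus ESPC).
differences = {
    label: round(ice_shares[label] - espc_shares[label], 1) for label in ICE_GB
}

for label in ICE_GB:
    print(f"{label:22} {ice_shares[label]:5.1f} {espc_shares[label]:5.1f} "
          f"{differences[label]:+5.1f}")
```

On these counts the largest gap between the corpora is for prepositional phrases, at just under seven percentage points, which is consistent with the description of the proportions as relatively similar.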
3. The information structure of IT-clefts
The most common assumption about the information structure of IT-clefts is that
the clefted constituent represents new, often contrastive, information (e.g. Biber
et al. 1999: 959). The subordinate clause typically conveys presupposed
information (e.g. Prince 1978: 896). Prince calls these "stressed focus" IT-clefts,
thereby indicating the discourse function of such clefts, namely to give special
focus to the clefted constituent. Gundel (2002: 118) refers to this information
structure in clefts as prototypical. Similarly, Collins (1991: 84) follows Halliday
in claiming that this information structure constitutes the unmarked type of
IT-cleft: "the theme/new combination is unmarked: the construction creates,
through predication, a local structure, the superordinate clause, in which
information focus is in its unmarked place, at the end".4 This is illustrated in
Table 3, from Halliday (1994).
Table 3. Marked and unmarked information focus combined with unpredicated
and predicated Theme (Halliday 1994: 301)

                      Unmarked                      Marked
Non-nominalized       you were to blame             you were to blame
                      Theme   Rheme                 Theme   Rheme
                      Given   New (focus)           New     Given
Nominalized           it's you who were to blame    it's you who were to blame
(predicated Theme)    Theme   Rheme                 Theme   Rheme
                      New     Given                 Given   New (focus)
The term "unmarked" here does not, however, reflect quantitative data. In
Collins's comprehensive study of cleft constructions, only 36% of the IT-clefts
have a new clefted constituent and a given cleft clause (1991: 111), although the
clefted constituent is new in a clear majority of cases.
Another type of information structure in IT-clefts is described by Prince
(1978) (and others after her: Collins 1991, Delin and Oberlander 1995, Johansson
2002), namely the informative presupposition cleft, in which the cleft clause
conveys new information. The clefted constituent may contain either given or
new information in an informative presupposition cleft. In (2), both the clefted
constituent and the cleft clause are new, since the sentence occurs text-initially.
(2) It was just about 50 years ago that Henry Ford gave us the weekend.
(Prince 1978: 898)
According to Prince (1978: 898) the information in the cleft clause is encoded as
a (non-negotiable) fact. Although it is new, it is presupposed rather than asserted,
i.e. it is marked as known to some people although not yet known to the intended
hearer (ibid: 899). Prince says that "The whole point of these sentences is to
inform the hearer of that very information" (ibid: 898). Delin (1992: 296),
however, claims that the information within an it-cleft presupposition appears to
remind rather than inform, even though it may not in actual fact be known to the
hearer (ibid: 297).
4. Information structure in IT-clefts with focused adverbial
Descriptions of IT-clefts in grammars and elsewhere are mostly concerned with
focused nominal elements. This may be partly because of the possibility of
comparing IT-clefts and wh-clefts,5 and partly because nominal elements are the
most frequent type of clefted constituent. In ICE-GB they are almost three times
as frequent as focused adverbials, which agrees quite well with Johansson's
findings (2002: 90).
It is possible that there is a correlation between the type of clefted
constituent and the type of cleft construction. Both Collins (1991: 112) and Prince
(1978: 899) note that the informative presupposition type of IT-cleft is quite
common when the clefted constituent is an adverbial. Conversely, it is possible
that the stressed focus cleft is less apt to accommodate adverbials as the clefted
constituent. In this connection we may note that the most frequently quoted
example of an informative-presupposition cleft has an adverbial in cleft position,
namely (2).
In the ICE-GB material it was indeed quite common to find new
information in both the clefted constituent and the cleft clause, as in (2). It was
also quite common for the clefted constituent to represent given information and
for the information in the cleft clause to be new, as in (3). This is the reverse of
the canonical information structure in IT-clefts. Although this pattern is
discussed as a variant of informative presupposition cleft by e.g. Prince (1978:
899) and Gundel (2002: 118 f), its frequency was unexpected.6
(3) However there are worrying signs for the Republicans in the contests for
state governors. Because of the shift in population to the warmer parts of
the country states like Florida Texas and California are to be given extra
seats in Congress. The governors of those states will have a big say in
redrawing the boundaries. And it's here that the Democrats have made
significant headway. They have won the elections for governor in both
Florida and Texas from the Republicans although Mr Bush's party appears
to have held on to the biggest prize of all California. <S2B-006 #15-19>
The occurrence of clefted constituents conveying given information is also briefly
noted by Biber et al. (1999: 962): "The focused element in an IT-cleft is not
infrequently a pronoun or some other form which expresses given information.
[...] The early position of the focused element makes it suitable both for
expressing a connection with the preceding text and for expressing contrast." The
examples given are of the type "it was me", "it was then", "it is these". In the
material examined for the present study, just over half of the clefted adverbials
were anaphoric, cf. examples (1), (3), and (7). The great majority of the IT-clefts,
close to 90%, were of the informative presupposition type.
Based on the distribution of given and new information over the clefted
constituent and the cleft clause, Johansson (2002: 185 ff.) arrives at four patterns
of IT-clefts:
Type A: Clefted constituent is given/inferable; cleft clause is given/inferable
Type B: Clefted constituent is given/inferable; cleft clause is new
Type C: Clefted constituent is new; cleft clause is new
Type D: Clefted constituent is new; cleft clause is given/inferable
Assessing the information status of adverbials is difficult because they often
contain some given and some new information, e.g. the nominal complement of a
prepositional phrase may be given while the relation expressed by the preposition
may be new. Such phrases have been classified as inferable, and grouped together
with given information, following the practice of Johansson (2002). Needless to
say, it was necessary to study the examples in their wider context in order to
determine the status of the information conveyed by the two parts of the cleft
construction. Information patterns A-D were all found in the ICE-GB material.
Their distribution is shown in Table 4.
Table 4. Information structure of IT-clefts in the material

Pattern                Adverbial   Cleft clause    N   Total    %
Type A (all given)     given       given           1
                       inferable   given           1       5   10
                       inferable   inferable       3
Type B (given + new)   given       new            18      26   51
                       inferable   new             8
Type C (all new)       new         new            18      18   35
Type D (new + given)   new         given           0       2    4
                       new         inferable       2
Total                                                     51  100
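Johansson's four patterns amount to a simple decision rule over the information status of the two parts of the cleft, and the totals in Table 4 can be recomputed from the row counts. The following Python sketch is an illustration under that reading, not part of the original analysis: the function `pattern` and the list `ROWS` are names introduced here, the counts are copied from Table 4, and, as in the text, inferable information is grouped with given information.

```python
from collections import Counter

# Statuses treated as "given" for classification (inferable grouped with given).
OLD = {"given", "inferable"}


def pattern(constituent, clause):
    """Map the status of clefted constituent and cleft clause to Types A-D."""
    if constituent in OLD:
        return "A" if clause in OLD else "B"
    return "C" if clause == "new" else "D"


# (status of adverbial, status of cleft clause, N), copied from Table 4.
ROWS = [
    ("given", "given", 1),
    ("inferable", "given", 1),
    ("inferable", "inferable", 3),
    ("given", "new", 18),
    ("inferable", "new", 8),
    ("new", "new", 18),
    ("new", "given", 0),
    ("new", "inferable", 2),
]

totals = Counter()
for constituent, clause, n in ROWS:
    totals[pattern(constituent, clause)] += n

grand_total = sum(totals.values())

# Share of clefts whose cleft clause is new (Types B and C together).
new_clause_share = round(100 * (totals["B"] + totals["C"]) / grand_total)

for label in "ABCD":
    print(f"Type {label}: {totals[label]:2d}  {round(100 * totals[label] / grand_total)}%")
print(f"New cleft clause: {new_clause_share}%")
```

Types B and C together give the proportion of clefts with a new cleft clause, i.e. the informative presupposition type.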
There are quite a few differences between the results given in Table 4 and the
corresponding results of Johansson (2002: 188), which embrace all types of
clefted constituent. First and foremost, Johansson finds that Types A and B are
equally frequent (38% each), while Type C is least frequent in his material (10%).
Type D accounts for 14% in Johansson's material of English original texts.
Although the number of IT-clefts with adverbials is rather too low to give
conclusive results, the comparison of Johansson's material (based on 240
examples) and mine suggests that the information structure of clefts with
adverbials differs markedly from that of clefts with noun phrases. The most
important difference is that the IT-clefts with adverbials occur most commonly
by far with cleft clauses conveying new information (86%), while the cleft
clauses of IT-clefts in general seem to be divided about equally between given and
new information (Johansson 2002: 188). It may, however, be noted that Collins
(1991: 111) reports that 63% of his IT-clefts have new/contrastive information in
the cleft clause. It is possible that the differences may be due to the fact that
Collins's material as well as my own from ICE-GB contains both spoken and
written material while Johansson's contains only written material.
5. Discourse functions of cleft constructions
Studying the IT-clefts with adverbials in context, I found that they seem to have a
range of textual functions in the organization of the information flow. Collins
(1991: 106) makes a similar observation: The use of adverbials in cleft focus
position seems to have important textual functions, e.g. by acting as a bridge from
one topic to another or launching a (new) discourse topic. Returning to (3), for
example, we can note that the cleft sentence marks a transition between two
sections of the text. The discourse topic before the cleft sentence is demographic
and political features of Florida, Texas and California. After the cleft sentence
the text has moved on to the success of the Democratic Party, a topic that was
introduced in the cleft clause.
Johansson (2002: 193) proposes four main discourse functions of IT-
clefts (irrespective of the type of clefted constituent). I decided to use these
categories in order to facilitate comparison with his results, and was able to
identify all of them in the material for this study. The categories are the
following:
Contrast (the clefted constituent marks a contrast to something previously
mentioned/assumed.)
Topic Launching (the clefted constituent becomes the topic of the
subsequent discourse.)
Topic Linking (the two parts of the cleft construction, clefted constituent
and cleft clause, link together two discourse topics.)
Summative (the IT-cleft concludes or rounds off a text or a section of a
text.)
Interestingly, the notion of contrast does not seem to be a particularly prominent
feature of the clefted adverbials in the ICE-GB material. On the other hand, the
notion of focus is present in all the examples; after all, an IT-cleft usually
represents marked syntax as compared to its non-cleft counterpart. There is thus
some extra attention associated with the clefted constituent, even though this need
not be contrastive, and according to Delin (1990: 5 f) it need not be associated
with prosodic focus either.
In the following I shall give examples of the discourse functions found in
the material, using Johansson's classification. As will be shown, there are also
cases where it may be argued that the cleft construction represents a merger of
two categories, and additional categories will be suggested.
5.1 Contrast
In the canonical cleft (Gundel 2002: 118) the clefted constituent conveys new
information which is explicitly contrasted with something mentioned in the
preceding context. The cleft clause represents information that is known to the
speaker. In (4) the subject of reading books at an early age has already been
talked about, while the clefted constituent introduces a reader of a different age
from those previously discussed.
(4) I struggled terribly with them in my early teens and had no success at all.
It wasn't till I was perhaps twenty-five or thirty that I read them and
enjoyed them <S1A-013 #237-238:1:E>
It is also possible to express contrast in examples where the cleft clause conveys
new information. In such cases there is usually also another discourse function
associated with the IT-cleft. For instance, the clefted constituent may launch a
new discourse topic at the same time as marking a contrast (see next section), or
there may be a transition between two discourse topics (section 5.3).
5.2 Topic launching
An IT-cleft can introduce a discourse topic in the clefted constituent. This
constituent may be brand new or inferable, but in any case it is made prominent
by means of the clefted constituent and developed as a topic in the subsequent
discourse. In (5) the clefted constituent introduces those men and women serving
our country in the Middle East, a group which is a discourse topic in the section
that follows. It also represents a shift in the speech, introducing a human angle.
Interestingly, the you in the next sentence refers to the same group, in contrast to
the you in the sentence preceding the cleft, which is much wider in its scope.
(5) We must try to work out security arrangements for the future so that these
terrible events are never repeated <,> and we shall I promise you <,> bring
our own forces back home just as soon as it is safe to do so <,> It is to
those men and women serving our country in the Middle East <,> that my
thoughts go out most tonight # and to all of their families here at home <,>
To you I know this is not a distant war. It is a close and ever present
anxiety <,> I was privileged to meet many of our servicemen and women
in the Gulf last week <,> <S2B-030 #63-68:1:A>
Topic launching + Contrast

Example (6) represents a combination of two discourse functions in the IT-cleft.
The clefted constituent to Africa represents a contrast to the previous setting, the
Soviet Union, but at the same time introduces Africa as a new discourse topic in a
text about the world's population. Clearly, the cleft clause does not convey given or
even presupposed information. Since all the information in the sentence is new,
the IT-cleft construction provides a means of avoiding placing new information
sentence-initially, or "to keep focal material out of surface subject position"
(Gundel 2002: 126).
(6) Shortages of food have been a repeated feature of recent Soviet experience
<,> with heavy dependence on grain imported from the United States as
the Soviets' own production has failed <,> But the spotlight has been on
empty shops in the towns rather than empty larders in the countryside <,>
It is to Africa that the television cameras go to show what happens when
local natural resources are so inadequate for the population living off them
that drought or continuous small-arms war causes famine for the people
counted in millions and death for many of them <,> Apart from such
disasters in succession in the same or in different places infant mortality is
the main counter to the birth rate's effect in Africa <,> and in parts of the
continent the heterosexual incidence of Aids may prove to have halted or
even reversed the growth in the population of potential parents and
condemned large numbers of children to early death <,> <S2B-048
#80:1:A>
5.3 Topic linking: Transition
It is the two-part structure of the IT-cleft that allows it to link together two
discourse topics: the current discourse topic is referred to in the clefted
constituent, while a new discourse topic is introduced in the cleft clause. I would
actually prefer to call this function "Transition" since it not only links together
two topics but also provides a bridge between two sections of a text. In other
words, the cleft sentence becomes a vehicle for topic shifting. Because the new
topic is presented as known or presupposed according to Prince, it is an
unobtrusive way of introducing new information which can then be the starting
point for the next section of the text. In (3) the cleft clause "the Democrats have
made significant headway" marked the beginning of a new section of the text. In
example (7) the idea of topic shifting is even clearer, because C's attempt to shift
the topic is refused by A, who wants to spend more time on the previous topic.
Thus, in (7), the transition does not fulfil its function, while in (3) and (8) the
topic is successfully shifted.
(7) C: But really what's happened with my sort of history is when I met uh did a
little recording with Chandos Records uhm and the Ulster orchestra who
was conducting there came up with enough money to do their first record
and they got Chandos interested. It was then that uh I fell in love with
music like Hamilton Harty and a bit of Stanford <,> and the Arn the
Arnold Bax Saga became something quite uh excellent.
A: Well that's a day we certainly want to come back to a bit later. But if we
could just for a moment concentrate on the latter years of the nineteenth
century. <S1B-032 #22:1:C>
Contrast + transition
Example (8) has a combination of marking a contrast with the clefted constituent
and creating a transition by means of the cleft clause. The contrast is between the
Villa Somalia mentioned earlier and the office building. The introduction of the
letters in the cleft clause starts off a new section of the discourse.
(8) {BEGINNING OF TEXT} The Villa Somalia which was Siad Barre's
official residence in Mogadishu still lies abandoned <,> guarded by a
handful of young men from the United Somali Congress the rebel force
which took control of Mogadishu at the end of January <,> But it was in
one of the office buildings that I discovered the letters <,> thousands of
them <,> addressed to His Excellency President Mohammed Siad Barre
but all unopened <,> I picked up one from Britain <,> It had been posted
in September nineteen eighty-eight and was signed by a retired
schoolteacher from Guildford in Surrey <,> writing on behalf of Amnesty
International to plead for the release of a blind Somali preacher who'd
been imprisoned for his religious beliefs <,,> <S2B-023 #61:3:A>
5.4 Summative
Summative IT-clefts tend to occur towards the end of a text or a section of a text,
and represent a kind of conclusion or rounding off. Example (9) occurs at the
very end of a speech and contains two cleft constructions. They share a clefted
constituent which is inferable. It is not contrastive, but may have the uniqueness
feature noted by Delin and Oberlander (1995: 469). The cleft clause in the
second cleft is new. Although it is backgrounded by means of subordination
(Delin and Oberlander 1995: 473), it represents a kind of punchline and softens
the war-talk in a clever way.
(9) The purpose of war is to enforce international law. It is to uphold the
rights of nations to be independent and of people to live without fear. It is
in that spirit <,> that the men and women of our forces and our allies are
going to win the war <,> And it is in that spirit that we must build the
peace that follows. <S2B-030 #103-105>
Contrast + summative
In example (10) we see both contrast and the summative function, towards the
end of an obituary. The clefted constituent resolves an either-or relationship
(Perzanowski and Gurney 1997: 218), i.e. a writer for children as opposed to
adults, while the summative function is evident from the cleft clause.
(10) Dahl's books often portrayed children battling against evil adults. [...] As
an adult author Dahl's fame was to come much later when his Tales of the
Unexpected were transferred to television. Yet it will be as a children's
writer he'll be remembered. His lasting legacy includes another two books
still to be published. Roald Dahl who's died at the age of seventy-four <,>
{END OF TEXT} <S2B-011 #17:1:B>
5.5 Thematization
In certain cases the main function of the cleft seems to be to make extra clear
what is to be understood as the theme and the rheme of a sentence. A good
example is (11), which represents a complete text. Thus there can be no contrast
involved, nor any topic-linking, topic-launching, or summary. Rather, in this
case, the writer wants to give thematic prominence to the regret he/she feels. It
may be noted that a non-cleft version (11a) cannot easily have the same
constituent in thematic position.
(11) It is with much regret that I find it necessary to send you a copy of the
enclosed letter which is self explanatory. <W1B-026 #121:15> {= entire
text}
(11a) ? With much regret I find it necessary to send you a copy of the enclosed
letter which is self explanatory.
Thematization is an added discourse function as compared to Johansson's,
although he mentions it as a variety of topic linking (2002: 199). One reason why
it seems appropriate to propose it as a separate category is the fact that some
examples simply do not fall neatly into any of the other categories, such as (11)
above. On the other hand, thematization seems to be an accompanying factor in
most of the examples where the cleft can be assigned to one of the functional
categories described above.
Furthermore, SFL tradition views both IT-clefts and WH-clefts as
thematization devices (predicated theme and thematic equative, respectively; see
also Table 3). Here one must bear in mind the basic function of Theme and
Rheme which is a partition of the message into two parts, each of which carries a
type of prominence. Thematic prominence has to do with the functions of Theme
as the point of departure of the clause as message, or the ground from which the
message is taking off (Halliday 1994: 38). Rhematic prominence on the other
hand has to do with the fact that the (end of the) Rheme tends to be the locus of
new information (cf. Fries 1994: 233 f). Gómez-González (2000: 303 ff.)
describes the IT-cleft as a special theme construction, i.e. one that marks off the
theme of a sentence and gives it extra focus. Similarly, Collins (1991: 171) notes
that the theme in clefts carries a textual form of prominence. If the IT-cleft is
seen as a construction for thematization, it should follow that the theme in this
construction, like other themes, can be given, new, contrastive or non-contrastive.
Perzanowski and Gurney (1997: 214) note that certain types of it-clefts
[...] frequently occur in negative contexts. This is also a finding of the present
study, where three examples had 'not until' as the clefted constituent, such as
(12). In such cases, a non-cleft version is not without problems, given that the
speaker wants a particular theme-rheme structure. That is, the corresponding non-
cleft version will require subject-verb inversion (12a). The IT-cleft may thus be a
way of using a marked construction in order to avoid one that is even more
marked.
(12) However it wasn't until his fourth album that the instrument's capabilities
were more fully explored <,> <S2B-023 #22:1:A>
(12a) Not until his fourth album were the instrument's capabilities more fully
explored.
Gundel (2002) and Johansson (2002) both document that IT-clefts are more
common in Norwegian/Swedish than in English. Looking for more examples
similar to (12) above, I have, however, found several examples of English IT-
clefts corresponding to other thematization structures in Norwegian, where
fronting of adverbials is more common and less marked than in English
(Hasselgård 1997: 14). In example (13), from the English-Norwegian Parallel
Corpus (ENPC), the Norwegian original does not have a cleft, but a fronted
adverbial. The translator, struggling to keep the thematic structure intact, opts for
a cleft, possibly again to avoid an even more marked structure.
(13) Først på den tredje dagen hadde Aua våknet. (ENPC: MN1)
Lit: first [=only] on the third day had Aua awakened.
It was not until the third day that Aua awakened. (MN1T)
However, the thematizing function of IT-clefts does not only occur where a non-
cleft alternative would be awkward. In (14), for example, a non-clefted alternative
with a fronted adverbial is quite acceptable (14a), although there may be slightly
less focus on the adverbial. The clefted constituent does not mark any contrast,
nor does it close or launch a topic. According to Collins (1991: 175) the IT-cleft
enables an unambiguous mapping of theme on to new information in the
unmarked instance, as themes in IT-clefts are likely to convey new information
(Collins 1990: 111). In (12)-(14) the clefted constituent is indeed new.
(14) It was in nineteen hundred and six that the Queen's great-grandfather King
Edward the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers <ICE-GB:S2A-011 #91:1:A>
(14a) In nineteen hundred and six the Queen's great-grandfather King Edward
the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers
Delin and Oberlander (1995: passim) suggest yet another discourse
function of IT-clefts, in that the content of the cleft clause is marked as prior in
time to the main story line. An example of this may be (15). However, the content
of the preceding sentence is also prior in time to the main story line, so I am not
convinced this property is contributed by the IT-cleft construction. Instead I have
classified the example as Thematizing. As in (11)-(13) a non-cleft version could
not easily have the same theme-rheme structure. In a sense it is also topic-
launching, in that it occurs early in a section concerned with stages in this
person's military career. However, this can also be seen as a function of Theme.
(15) And the Field Officer Brigade waiting rides up to Her Majesty the Queen.
He was not granted security for officer training when he joined the
regiment in nineteen sixty-eight because of his Polish ancestry. It was as a
Guardsman that he came to the Second Battalion which now he
commands and eventually became a lance sergeant instructor at the Guards
Depot. When he was finally accepted for Sandhurst he went on to win the
Sword of Honour and has since served as an officer with every company
of each battalion of the regiment. <S2A-011 #122-125:1:A>
It is tempting to propose that the basic function of IT-clefts is thematization, and
that other functions are subsidiary to this. In other words, the marking of contrast
with a preceding topic, the launching of a new topic and the preparation for a new
topic may all be seen simply as functions of Theme.
5.6 Discourse functions and information structure
Johansson (2002: 193) suggests that the discourse functions of clefts are
associated with the different patterns of information structure outlined in section
4 above. According to his findings, type A correlates with the discourse functions
contrast and summative, type B with topic linking, type C with topic launching,
and type D with contrast. Table 5 presents a summary of the occurrence of the
various discourse functions of IT-clefts with adverbials in the present material.
This has been correlated with the type of information structure identified in each
cleft sentence.
Because few of the IT-clefts with adverbials were of the stressed focus
type, there are few examples of contrast. There are thus not enough examples of
this function to make a valid comparison with Johansson's results, although it is
interesting that several of the contrast examples belong to type C (all new). It may
be noted that when a clefted constituent conveying new information expresses
contrast, the implication is contrary to expectation rather than contrary to what
has been claimed. Type B (given + new) seems to be a good indicator of Topic
linking, or transition from one topic to another, as in Johansson's material. Type
C (all new) is a relatively good indicator of Thematization, although this type also
has a range of other functions. Type D is too scarcely represented to provide any
basis for even tentative conclusions.
Table 5. Discourse functions identified, ranked according to frequency of
occurrence and correlated with information structure

Function                     Type A       Type B         Type C     Type D         Total
                             (all given)  (given + new)  (all new)  (new + given)
Transition                   1            18             4          -              23
Thematization (note 7)       -            5              7          -              12
Contrast (note 8)            2            -              3          1              6
Summative                    2            2              0          1              5
Topic launching              -            1              1          -              2
Contrast + transition        0            -              2          -              2
Contrast + topic launching   0            -              1          -              1
Total                        5            26             18         2              51
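The correlations read off Table 5 can be checked with a few lines of Python. The following is only an illustrative sketch: the cell placement follows the row and column totals printed in the table, empty cells are assumed to be zero, and the names TABLE5 and share_of_type are hypothetical helpers, not part of the study.

```python
# Counts from Table 5; list positions correspond to types A, B, C, D.
# Assumption: cells left empty in the table are treated as 0.
TYPES = ["A", "B", "C", "D"]
TABLE5 = {
    "Transition": [1, 18, 4, 0],
    "Thematization": [0, 5, 7, 0],
    "Contrast": [2, 0, 3, 1],
    "Summative": [2, 2, 0, 1],
    "Topic launching": [0, 1, 1, 0],
    "Contrast + transition": [0, 0, 2, 0],
    "Contrast + topic launching": [0, 0, 1, 0],
}


def column_total(type_label: str) -> int:
    """Number of clefts with the given information-structure type."""
    i = TYPES.index(type_label)
    return sum(row[i] for row in TABLE5.values())


def share_of_type(function: str, type_label: str) -> float:
    """Proportion of all clefts of one type that serve the given function."""
    i = TYPES.index(type_label)
    return TABLE5[function][i] / column_total(type_label)


column_totals = {t: column_total(t) for t in TYPES}
grand_total = sum(column_totals.values())
b_transition = share_of_type("Transition", "B")
c_thematization = share_of_type("Thematization", "C")

print(column_totals, grand_total)
print(f"Type B clefts marking transition: {b_transition:.0%}")
print(f"Type C clefts marking thematization: {c_thematization:.0%}")
```

On these counts roughly two thirds of the type B clefts mark transition, whereas thematization accounts for well under half of the type C clefts, which fits the distinction drawn above between a good and a relatively good indicator.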
6. IT-clefts in different registers
Since the ICE-GB includes a range of different spoken and written registers, it
was possible to check whether the registers differed as to the use of adverbials in
IT-clefts. Collins (1991: 181) reports a slightly higher frequency of IT-clefts in
writing (the LOB Corpus) than in speech (the London-Lund Corpus). In the ICE-
GB the difference between speech and writing was the opposite as regards the
frequency of IT-clefts with adverbials: approximately 0.6 vs. 0.4 occurrences per
10,000 words, respectively. However, when the spoken category was divided into
scripted vs. unscripted, a further difference emerged, as shown in Table 6. The
category of scripted speech, making up only 6% of the corpus, accounts for 24%
of the clefts with adverbials. The unscripted spoken categories and the written
categories are then left with the same frequency of clefted adverbials.
Table 6. Frequency of IT-clefts with adverbials in different genres in the ICE-GB

Genre/medium          No of words   No of clefted   No of clefted adverbials
                      in ICE-GB     adverbials      per 10,000 words
Spoken (unscripted)   572,464       24              0.4
Scripted speech       65,098        12              1.8
Writing               423,702       15              0.4
Total                 1,061,264     51              0.5
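The normalised frequencies in Table 6 follow directly from the raw counts. As a minimal sketch (the counts are copied from the table; the names GENRES and per_10k are illustrative and do not come from the study), the calculation can be reproduced as follows:

```python
# Word counts and numbers of clefted adverbials per genre, copied from Table 6.
GENRES = {
    "Spoken (unscripted)": (572_464, 24),
    "Scripted speech": (65_098, 12),
    "Writing": (423_702, 15),
}


def per_10k(hits: int, words: int) -> float:
    """Normalised frequency: occurrences per 10,000 words."""
    return hits / words * 10_000


total_words = sum(words for words, _ in GENRES.values())
total_hits = sum(hits for _, hits in GENRES.values())

# Rate per genre, rounded to one decimal as in the table.
rates = {g: round(per_10k(h, w), 1) for g, (w, h) in GENRES.items()}
overall_rate = round(per_10k(total_hits, total_words), 1)

# All speech (scripted + unscripted) taken together, for the speech/writing contrast.
speech_words = GENRES["Spoken (unscripted)"][0] + GENRES["Scripted speech"][0]
speech_hits = GENRES["Spoken (unscripted)"][1] + GENRES["Scripted speech"][1]
speech_rate = round(per_10k(speech_hits, speech_words), 1)

# Share of the corpus and of the clefts accounted for by scripted speech (in %).
scripted_word_share = round(GENRES["Scripted speech"][0] / total_words * 100)
scripted_hit_share = round(GENRES["Scripted speech"][1] / total_hits * 100)

print(rates, overall_rate, speech_rate)
print(scripted_word_share, scripted_hit_share)
```

Normalising per 10,000 words is what makes the three categories comparable, since the scripted component is far smaller than the other two.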
One may speculate that the discourse functions of clefts are well suited to the
(rhetorical) purposes of the scripted speech categories. These categories typically
belong to expository genres, particularly lectures and broadcast narration, but
there are also some official speeches. Possible reasons why the informative-
presupposition clefts with adverbials are handy may have to do with the
possibility of assigning unambiguous thematic prominence to the clefted
constituent and the possibility of presenting new information in the cleft clause
without asserting it. Further, as Delin (1992: 300) claims, the information in the
(presupposed) cleft clause is presented as a non-negotiable fact, which clearly has
its rhetorical advantages. Further exploration of such rhetorical properties of
clefts is, however, beyond the scope of this paper.
7. Concluding remarks
The starting point for the present study was a hypothesis that IT-clefts with
adverbials behave differently from other IT-clefts in discourse. The background
for this hypothesis was that clefts with adverbials in focus position seemed to
have an unexpected information structure, particularly that there seemed to be
many examples of the given + new pattern. One of the conclusions of the present
study must be that these perceived differences have to do with quantity rather
than with quality, as the information structure and the discourse functions found
with clefted adverbials have also been identified and described with other types of
clefted constituent (e.g. by Collins 1991, Johansson 2002). Presumably, many of
the differences arise from the fact that IT-clefts with adverbials tend to be
informative-presupposition clefts, while other IT-clefts are more likely to be
stressed-focus clefts.
The typical information structure of the IT-clefts with adverbials involves a
clefted constituent carrying given information and a cleft clause carrying new
information or, alternatively, one in which both parts of the cleft construction are
new. In both Collins (1991: 11) and Johansson (2002: 188) these two types are
less frequent.
It is clear that clefts with adverbials have a range of textual meanings, or
discourse functions. It is equally clear that one must study the clefts in context in
order to get at these functions. The material offered examples of IT-clefts serving
contrastive, topic-launching, transitional, and summative functions. It was
suggested in section 5.5 that these discourse functions can all be regarded as
somehow ancillary to the function of theme (or to the theme-rheme nexus in the
case of transition).
The use of an IT-cleft enhances the textual prominence of the theme. As a
consequence, the construction is well suited for marking off the theme as new or
contrastive. However, IT-clefts are also used when the clefted constituent is
neither new nor contrastive, in which case the construction may simply serve to
make the theme-rheme division of the message extra clear. This may be the case
in clefts marking transition as well as in clefts where none of the other discourse
functions outlined in section 5 can be identified.
The function of transition seems particularly prominent with clefted
adverbials. Clefts with this function typically have a given, often anaphoric,
adverbial in cleft focus position, while the cleft clause introduces a topic for the
subsequent discourse. The speaker/writer thus achieves a smooth transition
between two topics, juxtaposing them by means of a relational clause, and
launching the new topic unobtrusively in a subordinate clause.
It was also shown that there are cases of IT-clefts being used to place an
adverbial in thematic position that would otherwise be difficult to place clause-
initially. This was seen with negative adverbials (e.g. not until), which would
have required subject-operator inversion in a corresponding non-cleft sentence,
and with adverbials such as with much regret (example 11), which probably could
not have occurred in initial position in a non-cleft sentence. Furthermore, the IT-
cleft can give extra thematic focus to clefted constituents that would have come
across as relatively unmarked themes in non-cleft sentences, particularly time
adverbials (e.g. example 14).
In the present study I have made frequent comparison with other studies of
clefts, particularly Collins (1991) and Johansson (2002). There are some
weaknesses involved in these comparisons. First of all, the three studies are based
on rather different corpora. More importantly, assigning information values and
discourse functions is no exact science, and the subjective element involved in
this work may account for some of the differences between previous studies and
my own. Ideally, the present study should have been extended to include IT-clefts
with nominal constituents in the ICE-GB corpus. This might have made the
comparisons with other IT-clefts more reliable. However, this task must be left to
a later study.
Another possible extension of the study would be to explore further the
different uses of IT-clefts in various genres. The investigation reported in section
6 showed a clear difference in frequency of the construction across genres,
cutting across the spoken/written dimension. A further study might look more
closely into such genre differences as well as more specific rhetorical uses of the
IT-cleft.
Notes
1. I follow the terminology of e.g. Gundel (2002) and Delin (1992), describing
the IT-cleft construction in terms of a clefted constituent and a cleft clause. No
discussion of the syntactic status of the relative-like subordinate clause will
be undertaken here.
2. In the corpus examples, the IT-cleft construction has been underlined, with the
clefted constituent in italics.
3. The category of manner adjuncts has been defined quite widely and includes
adjuncts of means and comparison.
4. The systemic-functional term corresponding to IT-cleft is predicated Theme
(cf. Halliday 1994: 58).
5. It is usually assumed that wh-clefts do not allow focus on adverbials, though
reversed wh-clefts seem to behave differently (cf. Johansson 2002: 96-97, who
gives examples of (reversed) wh-clefts with where and why, e.g. Here is where
I look like Marilyn Monroe).
6. Delin (1992: 294) suggests that the reason that cleft presuppositions are so
frequently assumed to specify information that is mutually known perhaps lies
in the fact that much of the discussion of it-clefts has centred around
decontextualized examples.
7. As it is argued elsewhere in this paper that Thematization may be the primary
function of IT-clefts, it should be noted that the examples classified as
Thematization in Table 5 are those where none of the other discourse
functions could be clearly identified.
8. It is assumed that the function of contrast is not prominent in the examples not
classified as such in Table 5.
References

Collins, P.C. (1991), Cleft and pseudo-cleft constructions in English. London and
New York: Routledge.
Delin, J. (1990), Focus in cleft constructions, Research Series Blue Book Note
No 5, Centre for Cognitive Science, University of Edinburgh.
<http://www.gem.stir.ac.uk/judys_publications/cleftfocus.pdf>
Delin, J. (1992), Properties of it-cleft presupposition, Journal of Semantics, 9:
179-196.
Delin, J. and J. Oberlander. (1995), Syntactic constraints on discourse structure:
the case of it-clefts, Linguistics, 33: 465-500.
Fries, P.H. (1994), On Theme, Rheme and discourse goals, in: M. Coulthard
(ed.), Advances in written text analysis. London and New York:
Routledge, 229-249.
Gómez-González, M.Á. (2000), The theme-topic interface. Evidence from
English. Amsterdam and Philadelphia: John Benjamins.
Gundel, J.K. (2002), Information structure and the use of cleft sentences in
English and Norwegian, in: H. Hasselgård et al. (eds), Information
structure in a cross-linguistic perspective. Amsterdam: Rodopi, 113-128.
Halliday, M.A.K. (1994), An introduction to functional grammar. London:
Edward Arnold.
Hasselgård, H. (1997), Sentence openings in English and Norwegian, in: M.
Ljung (ed.), Corpus-based studies in English. Papers from the seventeenth
international conference on English language research on computerized
corpora. Amsterdam: Rodopi, 1-14.
Hasselgård, H. (in preparation), Manner, place, time. A corpus-based study of
adverbials in present-day English.
Johansson, M. (2002), Clefts in English and Swedish: A contrastive study of IT-
clefts and WH-clefts in original texts and translations. PhD Thesis, Lund
University.
Perzanowski, D. and J. Gurney (1997), The functionality of it-clefts in selected
discourses: The message in the medium. Word, 48: 207-236.
Prince, E. (1978), A comparison of wh-clefts and it-clefts in discourse.
Language, 54: 883-906.
Corpus material
The International Corpus of English, British component (ICE-GB); see
<http://www.ucl.ac.uk/english-usage/ice-gb/>
The English-Norwegian Parallel Corpus (ENPC); see
<http://www.hf.uio.no/iba/prosjekt/>
On the pragmatic functions of let's utterances
Bernard De Clerck
University of Ghent
The difference between a boss and a leader: a boss says, 'Go!' a
leader says, 'Let's go!'
E. M. Kelly, Growing Disciples, 1995
Abstract
This paper presents the results of research into the pragmatic functions of let's
utterances in the spoken component of the ICE-GB.1 The first part of the paper
gives an overview of the grammatical features and the pragmatic uses of let's
utterances as described in the literature. The second part presents a detailed
analysis of the attested let's utterances in the corpus. Apart from testing the force
and accuracy of the existing descriptions, the paper also examines the
frequencies of occurrence of these functions and possible relationships with the
different text categories they occur in. The goal is to provide an answer to such
questions as who uses let's utterances where, why, and how.
1. Introduction
Constructions with let's are intriguing. When one considers the possible
meanings of the pair
(a) Let us have a drink
(b) Let's have a drink
one can see that (b) is not just an informal variant of (a) with the abbreviated
objective pronoun us. On a semantic and a pragmatic level the picture is clearly
more complex than that. The meaning of example (a) is ambiguous and can be
interpreted in two ways. On the one hand, it can be interpreted as a non-inclusive
request for permission (i.e. the hearer does not belong to the group referred to by
us), which can be paraphrased as 'Allow us to have a drink'. On the other hand, it
can be interpreted as a hearer-inclusive proposal or suggestion for joint action,
involving both the speaker and the hearer. Example (b), however, is restricted in
semantic scope and has lost its non-inclusive interpretation. It no longer has the
meaning 'allow us to have a drink'. In contrast with (a), its illocutionary function
is restricted to a hearer-inclusive proposal for joint action and as such 'Shall we
have a drink?' comes closer as a paraphrase than 'Allow us to have a drink'.2 It
appears then that let's constructions have gone through (and might still
be going through) a process of semantic bleaching (Huddleston and Pullum
2002: 924) which also affects their pragmatic illocutionary functions.
This paper focuses on these pragmatic functions and investigates the
influence of semantic bleaching on the different pragmatic uses of let's
constructions in present-day British English. However, before moving on to a
more fine-grained pragmatic and corpus-based analysis, I shall briefly examine
the influence of the semantic bleaching process on the grammatical properties of
let's constructions.
2. Grammatical properties of let's constructions and the status of let and
let's
There has been a great deal of debate on the syntactic properties of let and let's in
the existing literature, including the way they should be labelled or categorised.
What is interesting about this discussion is that it shows the shortcomings of
traditional grammatical distinctions whenever they are confronted with the hybrid
syntactic nature of a certain language item. The discussions also exemplify the
interconnectedness of the pragmatics, semantics and grammar of a language and
the descriptive and analytical problems that arise when the consequences of
certain changes in a construction affect these three levels at different rates.
Indeed, when reviewing the relevant literature one can see that the syntactic
properties of let and let's in the constructions at hand are actually described as a
mixture of auxiliary and non-auxiliary-like syntactic properties, whose syntactic
behaviour is often explained in terms of idiosyncratic construction-specific
characteristics.
One way of accounting for these properties is found in Seppänen (1977),
who identifies let as a hybrid modal auxiliary with a mixture of features
characteristic of central and marginal modals. Seppänen points to the fact that,
like a regular modal, let occurs only in combination with a main verb, forming
with it a complex predicate where the semantic contribution of let is the notion of
volition (Seppänen 1977: 517).3 Furthermore, like the modals, let is always
followed by a bare infinitive form of the main verb. According to Seppänen, other
shared characteristics include the absence of non-finite forms, the lack of
inflection in the third person singular and the past tense, and its use for negation and
emphatic stress. However, unlike the modals, negation with do is possible (Don't
let's do it vs. Let's not do it), as is sometimes the case with the semi-modals
ought to and used to. Yet, in order to explain differences in use between let in
let's constructions and these semi-modals, Seppänen has to resort to idiosyncratic
properties. This is especially the case with regard to the use of let as the operator
of the sentence in combination with do (Do let's try it again vs. Didn't they ought
to like it?). Furthermore, in his analysis, he regards the NP following let as its
subject, which forces him to conclude that another unique and idiosyncratic
property of let in these constructions is that they require the subject to be in the
accusative (objective) form when it is a pronoun, hence, us, me, him, her, them.
The problem with Seppänen's argumentation is that he is forced to attribute these
properties to the idiosyncratic nature of let as a modal verb.4 If let is analysed as
the imperative form of a full verb, the latter properties can be explained more
prosaically.
Treating let as the imperative of a full verb, however, is far from problem-
free either. In one approach, put forward by Costa (1972), no distinction is made
between the full lexical let and let as it occurs in let's constructions. In the latter
case, Costa still regards let as a straightforward imperative of a full lexical verb,
i.e. 'allow'. The effect of let as an imperative is described as exhorting the
second person to allow the desired event (...) to take place (1972: 142).5 This
view on the meaning of let, however, cannot account for the ambiguous meaning
of Let us have a drink, described above, and seems to ignore the fact that in the
contracted variant Let's have a drink the interpretation 'request for permission'
(similar to 'allow us') has all but disappeared.6 There are other distinctive features
of let in let's constructions which remain unexplained in Costa's approach. One
of the most obvious characteristics of let in these constructions is of course the
fact that the accusative form of we, i.e. the pronoun us, can be, and most of the
time is, contracted to 's. In all other types of imperatives (including those with a
full lexical let) us cannot be contracted. Other features that distinguish the let's
construction from the imperative with the full lexical verb include (a) the
occurrence of shall we instead of will you in tag questions, (b) the non-
omissibility of let's in ellipsis, (c) the difference in semantic scope in negative
utterances and (d) the fact that let's cannot occur with a subject. A full discussion
of these features is to be found in Huddleston and Pullum (2002: 934-935). I will
restrict myself to giving a few examples that illustrate these contrastive
distinctions.
(a) Let's have another drink, shall we? Let her have another drink, will you?
(b) Let's go with her. *Yes, do. *No, don't.
    Yes, let's. No, let's not. (Huddleston and Pullum 2002: 934)
(c) 1a. Don't let's go with her. 1b. Don't let her go with you.
    2a. Let's not go with her. 2b. Let her not go with you.
    (Huddleston and Pullum 2002: 935)
(d) *You let's go. You let her go.
In (c) there is a clear difference in meaning between the ordinary imperatives (1b)
and (2b): in (1b) let is inside the scope of the negation, so it is paraphrasable as
'Don't allow X to do Y'. In (2b) let is outside the scope of the negation: in this
case the utterance can be paraphrased as 'Allow X (not) to do Y'. All these
distinctive properties clearly show that the interpretation of let as an imperative of
the full lexical verb let ('allow'), as proposed by Costa, does not account for these
differences.
More recent views on let and let's, set out by Quirk et al. (1985) and Biber
et al. (1999), label let's as a pragmatic particle. Quirk et al. (1985: 148), for
example, say that it is a pragmatic particle with a quasi-modal status, an
unanalysed particle pronounced /lets/. Along the same lines Biber et al. (1999:
1117) say that in present-day English it is for practical purposes an invariant
pragmatic particle introducing independent clauses in which the speaker makes a
proposal for action by the speaker and the hearer. According to Quirk et al., the
particle status of let's is also supported by the existence, in familiar AmE, of the
pleonastic variants let's us and let's don't and of the construction let's you and me, in
which the addition of the second person pronoun indicates that 's is no longer
associated with us.
Huddleston and Pullum (2002) make similar observations and point out
existing differences between what they generally call dialects A and B of the
English language. Dialect B is characterised as more lenient towards
constructions such as Let's you and I, which are similar to the pleonastic
variants given by Quirk et al. with regard to AmE. In this dialect these uses
would appear to be widely enough used to qualify as acceptable informal style in
standard English (Huddleston and Pullum 2002: 935). According to them, these
constructions indicate that syntactically the specialisation of let has been taken a
significant step further:
the 's in these constructions is not replaceable by us (...), and also
because of the prosody, it is not plausible to treat the NP you and I
as being in apposition to 's. It seems clear rather, that let and 's
have fused syntactically as well as phonologically, and are no
longer analysable as verb+object: they form a single word which
functions as a marker of the first person inclusive imperative
construction (Huddleston and Pullum 2002: 935).7
However, when talking about the less lenient dialect A (i.e. a dialect which does
not have the pleonastic variants illustrated above), Huddleston and Pullum say
that there is no compelling reason to suggest that there has been a reanalysis of
the syntactic structure (and hence no reason to regard let's as a pragmatic
particle). In their opinion, the data are compatible with an analysis where let is
still a catenative verb, used with an NP object (us or 's) and (except in ellipsis) a
bare infinitival clause as second complement. In their view, then, the analysis of
let's as a pragmatic particle proposed by Quirk et al. (1985) and Biber et al.
(1999) would only apply to AmE and not (yet) to BrE. Davies (1986) holds a
similar view in regarding let in let's constructions grammatically as the
imperative of the full verb let, with additional and exceptional features, the
possibility of contracting let us to let's being one of them. Syntactically, it still
has the status of an imperative of a main verb, but '[t]o provide a plausible
account of both the form and the interpretation of the let-construction, then, it
seems necessary to acknowledge a certain lack of correspondence between the
two' (Davies 1986: 250).
None of the pleonastic variants could be attested in my analysis of the
ICE-GB (which only comprises the British variant of the English language).
Therefore, the observations made by Davies (1986) and Huddleston and Pullum
(2002) and the distinctions that are made between AmE and BrE are also
supported in this paper.8 Indeed, let in let's constructions is undergoing
a semantic bleaching process, which takes it further away from its original
meaning of 'allow'. At a syntactic level, there is reason to believe that this is
happening at different rates in AmE and BrE. In AmE the loss of semantic
meaning of let has led to a reanalysis of its syntax, thereby allowing the existence
of constructions such as let's us and let's you and I, while in BrE let in let's
constructions still shares a number of syntactic properties with the lexical let in its
fossilised syntax but is semantically moving away from it. It appears then that the
syntactic properties have not kept up with semantic changes in BrE, while in
AmE let's is more and more used as a pragmatic particle introducing or
announcing a joint activity of speaker and addressee. In the next section we will
focus more closely on the meaning of let's constructions and their different
pragmatic uses in British English. It will appear that this construction is used as a
means to different ends, in which the illocutionary point of joint action
instigator often serves as a starting point to reach other illocutionary forces and
perlocutionary effects.
3. Pragmatic functions of let's utterances
This section comprises a summary and discussion of the different illocutionary
forces and pragmatic functions of let's constructions, as described in the
literature. I will start with the most stereotypical function and move on to less
frequent and more ambiguous functions which have been noticed fairly recently.
This order of describing the different pragmatic functions coincides with a
pragmatic movement from an ideational level of joint action instigation, to a more
textual or interactional level and eventually to the use of let's constructions at an
interpersonal level.
3.1 Let's utterances as proposals for joint action
As indicated above, let's utterances normally have the directive illocutionary
force of a proposal for joint action: the speaker commits herself to an action and
seeks the addressee's agreement. For this reason, a verbal response is normally
expected, indicating agreement or refusal (Huddleston and Pullum 2002: 936).
Compliance with the directive (i.e. the perlocutionary effect) generally involves
joint action by the speaker and the hearer, possibly involving others as well.
Recent studies (Hamblin 1985: 60, Davies 1986: 229, Halliday 1994: 87, Swan
1996: 316, Biber et al. 1999: 1117, Huddleston and Pullum 2002: 936) bring this
aspect of joint action instigation to the fore: unlike second person singular
imperative structures, the addressees or intended agents of let's utterances include
both the speaker and the hearer. Because of this joint nature of the proposed
action, they could be regarded as essentially collaborative or even convivial
(Leech 1983: 104) in kind, as their illocutionary goal is indifferent to or
coincides with the social goal (ibid.). Huddleston and Pullum (2002), however,
remark that the speaker's attitude towards compliance can range from strongly
wanting it (Come on, let's get going; the bus leaves in five minutes) to merely
accepting it (Okay, let's invite Kim as well, if that's what you want) (Huddleston
and Pullum 2002: 936), which implies that differences in conversational or
institutional power might overrule the convivial aspect of jointness which seems
to be inherent in the meaning of let's. From the analysis of the data in Section 4,
it will also appear that the actual pragmatic effect of the utterance will indeed
greatly depend on the interlocutors themselves and their interpretation or
evaluation of each other's power at that specific moment in the conversation.
3.2 Speaker and hearer-oriented uses of let's utterances
The literature also mentions uses of let's which move away from this idea of joint
agency. Quirk et al. (1985: 830), for example, mention that in very colloquial
English, let's is sometimes used for a first person singular imperative as well:
Let's give you a hand.9 Biber et al. (1999: 1117) also refer to this use, saying
there is also a tendency veering towards a first person singular (exclusive)
meaning, which they equate with Let me.10 Similarly, Huddleston and
Pullum (2002: 936) refer to cases where the action is in fact just carried out by
just one (typically the speaker). In an example such as Let's open the window, it
is possible that the actual aim is that of securing your agreement to my opening
it, rather than a proposal to open the window together (ibid.). This shift in
meaning is even more obvious in cases where the hearer cannot (appropriately)
perform the action presented in the verb, but where the agreement and
co-operation of the hearer are needed in order to carry it out successfully. This use of
let's would then explain why it is often found in a medical context, when a doctor
or specialist is talking to a patient:
Let's have a look at your tongue (Biber et al. 1999: 1117)
In these cases, the speaker is trying to find agreement with the hearer for an
action that will be carried out by the speaker only. Although the hearer has a role
to play in the process of having her throat examined, the kind of action that is
expected is not the same as the one presented in the verb following let's. The
actual perlocutionary effect is to get the patient to open her mouth and not to have
a look at her own throat together with the doctor. Rather than a genuine proposal
for joint action, it is more a (self-)exhortative announcement of the next step in
the examining process with the implication that the hearer will have to open her
mouth at one stage.
There is another tendency in the use of let's utterances that moves in the
opposite direction and merges with a second person singular imperative meaning,
i.e. the hearer is the intended agent of the action presented in the let's utterance.
Biber et al. (1999) refer to this use as a second quasi-imperative meaning, as
this use proposes an action which is clearly to be carried out by the hearer. This
crypto-directive style (Biber et al. 1999: 1117), which aims to camouflage an
authoritative speech act as a collaborative one, is used especially by adults when
addressing children:
You all have something to do for Ms. <name>? Let's do it please.
(Biber et al. 1999: 1117)
De Rycker (1990: 311) also notes this use and says that "(...) it has been
frequently observed that they [= let's utterances] also function as thinly disguised
addressee-oriented directive acts". In these cases the let's utterance can be
paraphrased roughly as a second person singular imperative. Used as a
crypto-directive, Let's open our books on page 5 then actually means Open your books
on page 5. The tact involved in this use of let's utterances can be perceived, not
as a face-maintaining strategy, but rather as a signal of insincerity and
condescension (De Rycker 1990: 313). As let's utterances that serve as indirect
directives are largely restricted to these rank-sensitive contexts, where direct
types [i.e. 2nd pers. sg. imperatives] of directive realisation patterns are involved,
they may well be interpreted as an unnecessary display of superiority on the part
of the speaker (ibid.).
In a similar way, one can also find the use of the first person plural
pronoun we to refer to the hearer only. This use often occurs in the interaction
between medical staff and patients, where this aspect of pretended togetherness is
found in questions such as How are we feeling today?, Did we sleep well last
night?, Did we take our pill?, where we is used to refer to the patient only. This
convivial use is often felt to be patronising as it resembles the use of we that is
sometimes found in the discourse of parents or teachers when addressing children
(e.g. Did we do our homework yesterday?). It will become clear from the analysis
below that let's utterances can also be used (deliberately) in a patronising way,
especially when they are seemingly used to smooth out rank differences where
there are none.
3.3 Let's utterances as conversational imperatives
In the literature, the uses of let's utterances as action instigators (whether joint or
not) are mostly illustrated by examples that involve a non-linguistic action (cf. the
typical example Let's go for a drink). Let's is presented as the means par excellence
to make a proposal for the speaker, the hearer and possibly others to do
something together outside the linguistic boundaries of the ongoing conversation.
De Rycker (1990: 229), however, remarks that let's utterances can also have
specific functions within the linguistic context of the ongoing interaction itself.
He says that, as conversational imperatives (...), they frequently serve no other
function than managing part of the topical and structural development of the
interaction itself and perform actions relevant to the talk exchange. Their
illocutionary functions have to do with conversational activity (Levinson 1983:
228): speaking, to stop with speaking, paying attention, listening, and
interactional operations such as turn-taking, holding or yielding the floor,
interrupting, butting in. In other words, the functions that let's utterances
perform can be considered not only in the light of their purpose as illocutionary
acts but also as conversational moves, i.e. acts that regulate the ongoing talk
exchange itself by initiating, responding, interrupting or redirecting (Stubbs
1984: 149). This use is illustrated in the following corpus example, where the
broadcaster uses the let's utterance to steer the conversation in a new direction:
(1) B: in the Academy they taught them to use more pencil but in the College
more rubber <,>
A: Well lets talk about Arnold Bax because the names already come up in
our conversation uhm and uh hes obviously a very important figure and
both of you have recorded quite a good deal of music by Bax
A: Where did he look for his sources of inspiration?
(ICE-GB/S1B-032#90-92; broadcast discussions)
Other let's utterances that work at the level of ongoing interaction are
those which function as pragmatic formatives of the commentary type (Fraser
1987: 187) or as prospective or retrospective metapragmatic comments
(Thomas 1985: 770). They signal how the primary illocutionary act (performed
by the utterance of which they are part) fits into the ongoing conversational
structure (Fraser 1987: 187). Some examples are Let's say or Let's face it. Their
use as true suggestions for a joint activity of saying or considering something is
secondary to their use as metalinguistic utterances that provide clues to the
interpretation of the utterance as a whole.
In the next section, we will have a closer look at the frequency of the uses
mentioned above in the spoken component of the ICE-GB and investigate
whether their description in the literature can account for all attested uses.
Attention will be paid to the frequency of let's utterances in different text
categories, their different pragmatic functions and the relationships that can be
established between these functions and the speakers who use them. A distinction
will be made between conversational and non-conversational uses and between
truly joint and speaker/hearer-oriented let's utterances, and related to these,
whether they are to be seen as negotiable proposals or as conversational moves.
4. Let's utterances in the spoken ICE-GB
4.1 Frequency of let's utterances in the spoken component of the ICE-GB
In this study, I examined the use of let's utterances in the spoken component of
the ICE-GB corpus. The spoken component consists of 300 texts, hierarchically
organised in different text categories, which are represented in Table 1.
A total of 164 instances of let's utterances could be attested, unevenly spread
across the different text categories.11 In the dialogues the frequency of let's is
1 per 3,000 words and in the monologues 1 per 5,000 words.12 Moving further down
to the text categories, one can trace more specific differences in frequency.
Table 1. Text categories in the spoken ICE-GB
(Figures in parentheses indicate the number of 2,000-word texts in each category)

dialogue (180)    private (100)     face-to-face conversations (90)
                                    phone calls (10)
                  public (80)       classroom lessons (20)
                                    broadcast discussions (20)
                                    broadcast interviews (10)
                                    parliamentary debates (10)
                                    legal cross-examinations (10)
                                    business transactions (10)
monologue (100)   unscripted (70)   spontaneous commentaries (20)
                                    unscripted speeches (30)
                                    demonstrations (10)
                                    legal presentations (10)
                  scripted (30)     broadcast talks (20)
                                    non-broadcast speeches (10)
mixed (20)                          broadcast news (20)
[Bar chart. Vertical axis: tokens per 10,000 words, 0 to 7. Horizontal axis: the
fourteen spoken text categories, from parliamentary debates to business transactions.]
Figure 1. Distribution of let's in the spoken categories of the ICE-GB (tokens per
10,000 words)
Figure 1 shows the distribution of let's utterances in the dialogue and
monologue text categories. The differences in frequency of let's utterances across
the text categories are quite large. Let's is most common in the dialogue text
categories of business transactions, classroom lessons and broadcast discussions.
Within the monologue text categories it is especially frequent in the
demonstrations (in fact more frequent than in the face-to-face conversations) and
in the unscripted speeches.
In an attempt to explain these differences in frequency, I looked at the
different functions of let's utterances in the corpus and tried to establish
relationships between the functions and the text categories in which they
occurred.13
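The overall rates quoted in this section can be re-derived from the corpus design. The sketch below is an illustration, not part of the original analysis: it assumes that every text contains exactly 2,000 words (the nominal size given in Table 1) and takes the token counts from the row totals of Table 2. The function names are hypothetical.

```python
# Nominal text size in the spoken ICE-GB (Table 1); real texts may deviate slightly.
WORDS_PER_TEXT = 2_000


def per_10k(tokens: int, n_texts: int) -> float:
    """Tokens per 10,000 words for a subcorpus of n_texts texts."""
    return tokens * 10_000 / (n_texts * WORDS_PER_TEXT)


def words_per_token(tokens: int, n_texts: int) -> float:
    """Average number of words per token (the '1 per N words' figure)."""
    return n_texts * WORDS_PER_TEXT / tokens


# Row totals of Table 2, summed over the dialogue and the monologue categories.
dialogue_tokens = 63 + 1 + 19 + 19 + 4 + 2 + 13  # 121
monologue_tokens = 8 + 22 + 8 + 5                 # 43

print(round(words_per_token(dialogue_tokens, 180)))   # dialogue: 180 texts
print(round(words_per_token(monologue_tokens, 100)))  # monologue: 100 texts
print(per_10k(13, 10))  # business transactions: 13 tokens in 10 texts
```

Under these assumptions the dialogue rate comes out close to one token per 3,000 words and the monologue rate at roughly one per 4,650 words, which is in line with the rounded figures given in the text.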
4.2 Pragmatic functions of let's utterances in the ICE-GB
4.2.1 Joint, speaker or hearer-oriented agency
I first investigated the use of let's utterances in the corpus by using the distinction
that is made in the literature between joint, speaker and hearer-oriented agency.
The pie chart in Figure 2 shows the distribution of the intended agents of let's
utterances in the corpus.
[Pie chart: joint agency 54%; intended agent = speaker 38%; intended agent = hearer 8%]
Figure 2. Intended agents of let's in the spoken ICE-GB
As we can see, about half of the let's utterances are proposals for joint action in
which the hearer and the speaker are the intended agents of the proposed action.
The following corpus example illustrates this use:
(2) A: Let s have a good uh
A: So let s play Trivial Pursuit as well after or something
B: Mm
A: Shall we
(ICE-GB/S1A-048#123-126; face-to-face conversations)
Both speaker and hearer will be involved and actively participate in the proposed
action. In 38% of the let's utterances the intended agent of the action is the
speaker. In most of these cases, the hearer neither gives nor has the opportunity to
give a verbal response and simply undergoes the action performed by the speaker. In
(3) we have an example of this speaker-oriented use in which a woman addresses
herself while trying to solve a slide feed problem. The use of let's is similar to
that of let me in this case:
(3) A: Think I have a slide feed problem <,,>
A: Here lets try the next one <,>
(ICE-GB/S2A-029#97-98; unscripted speeches)
Hearer-oriented utterances only comprise 8% of the cases. Example (4) illustrates
this use:
(4) A: So you go up on O and come down on Ooo and see if we can get to it that
way <,,>
B: That was a bit That was certainly easier <,>
A: Well lets do it again only this time your little <unclear-words>
meantime
B: Yeah
(ICE-GB/S1A-044#114-117; face-to-face conversations)
The music teacher is using a let's utterance to give instructions, but does not carry
out the proposed action herself.
[Stacked bar chart. Vertical axis: 0% to 100%. Bars: DIALOGUE, MONOLOGUE. Legend:
intended agent = speaker; intended agent = hearer; intended agent = we]
Figure 3. Intended agents in dialogue and monologue text categories
Figure 3 shows the distribution of these three types in the monologue and
dialogue text categories. A closer examination of these different sub-categories
provides the picture shown in Table 2.
Table 2. Intended agents of let's in the subcorpora of the ICE-GB

Text categories               Intended      Intended         Intended
                              agent = we    agent = hearer   agent = speaker
face-to-face conversations    44            7                12
telephone calls               1             0                0
classroom lessons             8             1                10
broadcast discussions         9             2                8
broadcast interviews          2             0                2
legal cross-examinations      2             0                0
business transactions         7             0                6
spontaneous commentaries      5             0                3
unscripted speeches           7             0                15
demonstrations                1             2                5
broadcast talks               1             1                3
The table shows a higher concentration of speaker-oriented let's utterances in
unscripted speeches, demonstrations, classroom lessons and broadcast talks. Let's
utterances with joint agency are most frequent in face-to-face conversations,
spontaneous commentaries, business transactions and broadcast discussions
(although in these categories there is a fair number of speaker-oriented let's
utterances as well). In order to find an explanation for these different frequencies
in the various text categories, I analysed their pragmatic function and the
background information on the speakers. The following sections deal with these
parameters and the influence they have on the use of let's utterances.
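The overall shares of the three agent types can be recomputed from the counts in Table 2. The following is a minimal sketch for readers who want to check the arithmetic; the variable and function names are hypothetical, and the only data used are the cell values of Table 2.

```python
# Counts from Table 2: (agent = we, agent = hearer, agent = speaker).
TABLE_2 = {
    "face-to-face conversations": (44, 7, 12),
    "telephone calls": (1, 0, 0),
    "classroom lessons": (8, 1, 10),
    "broadcast discussions": (9, 2, 8),
    "broadcast interviews": (2, 0, 2),
    "legal cross-examinations": (2, 0, 0),
    "business transactions": (7, 0, 6),
    "spontaneous commentaries": (5, 0, 3),
    "unscripted speeches": (7, 0, 15),
    "demonstrations": (1, 2, 5),
    "broadcast talks": (1, 1, 3),
}


def agent_shares(rows):
    """Return the column totals and their percentage shares of all tokens."""
    totals = [sum(column) for column in zip(*rows)]
    grand_total = sum(totals)
    shares = [100 * total / grand_total for total in totals]
    return totals, shares


totals, shares = agent_shares(TABLE_2.values())
for label, total, share in zip(("we", "hearer", "speaker"), totals, shares):
    print(f"{label:8s}{total:4d}{share:6.1f}%")
```

The column totals add up to the 164 attested tokens, and the resulting shares fall within about one percentage point of those shown in Figure 2.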
4.2.2 Conversational use of let's
Apart from distinguishing between the different kinds of agent, the analysis also
focused on the pragmatic function as conversational and non-conversational
imperatives. The first important observation is the high frequency of
conversational let's utterances in the ICE-GB. No less than 67% of joint and
speaker-oriented let's utterances consisted of process types that were aimed at
influencing the conversational flow of the interaction. An example of this
conversational use is shown in (5), where the let's utterance is used by a
university lecturer to structure the topical organisation of his talk:
(5) A: So we 've got the nerve is having a kind of trophic action on muscle but
ma muscle actually also a a a acts in a trophic way towards the nerve
A: But let s just stick with the nerve affecting the muscle for the moment.
(ICE-GB/S1B-009#159-160; classroom lessons)
The correspondence between speaker-oriented let's utterances and their
conversational uses was most obvious in the monologue text categories of
unscripted speeches, demonstrations and broadcast talks. About 70% of the let's
utterances in these categories are speaker-oriented, and 70% of these are used as
conversational imperatives. Examples (6) and (7) exemplify their conversational use as
expository directives (Huddleston and Pullum 2002: 931) in these text
categories:
(6) A: So I said there was going to be a lot about Ravenna in this lecture
A: And uh lets just start by establishing the idea of the basilica plan uhm
because of course many of the famous
A: churches of Ravenna are built in this style.
(ICE-GB/S2A-060#44-46; demonstrations)
(7) A: So its to our advantage that the Greenhouse Effect exists
A: So lets just backtrack for a second
A: What Ive talked about is uh the following
(ICE-GB/S2A-043#115-117; unscripted speeches)
Rather than being genuine proposals for a joint action to which a verbal response
of agreement or disagreement is expected, they are more like announcements of a
topical shift that round off the present topic and introduce the next step in the talk.
Apart from structuring the speaker's own talk, they are also aimed at engaging the
active participation of the addressee in the speaker's exposition, but do not really
expect a verbal response from the audience. As conversational imperatives they
are often used in combination with other utterance launchers such as well, right,
OK, so or hesitation markers (uhm, hmm). This was the case in about 70% of the
conversational imperatives. They typically mark the end of one particular topic,
or introduce a kind of conclusive utterance that rounds off the topic and paves the
way for a new one (cf. (6) and (7) above).
A similar use of speaker-oriented let's utterances can be attested in the
dialogue categories. The first important observation that can be gleaned from the
analysis is that only 5% of the speakers using speaker-oriented let's utterances
had less institutional or conversational power (they were interviewees, who used
let's in let's see as a hesitation device). The rest of the speakers using this type
had the same or more institutional or conversational power. Institutionally and/or
conversationally more powerful speakers include teachers, doctors, student
counsellors, careers counsellors and professors, broadcasters, interviewers and
journalists. 80% of the utterances introduced by speaker-oriented let's consisted
of process types that were directly aimed at influencing the conversational flow of
the interaction, both with regard to topical management and turn-taking.
Especially in broadcast discussions, broadcast interviews and classroom lessons,
they were used by the more powerful speaker who creates the illusion of a joint
and convivial interaction, but who is in fact skilfully using let's as a way to steer
the discourse and to decide which actions are to be taken (whether jointly or not)
and when. These let's utterances are actually performative in nature, as utterance
launchers or idiomatic overtures (Biber et al. 1999: 1073); the speaker is
announcing the next step to be taken in the interaction rather than presenting a
genuine proposal for joint activity. Apart from propelling the conversation in a
new direction, or orienting the listener to the following utterance, especially in
relation to what has preceded, their role also consists in providing the speaker
with a planning respite, during which the rest of the utterance can be prepared for
execution (ibid.). In this way, the broadcaster's use of let's see as a hesitation
marker in example (8) does not just represent a mental process of cognition,
semantically speaking, but rather a conversational process of the organisational
type: a linguistic means of bridging a possible gap during one's contribution and
hence a means of turn-holding and/or preventing another participant from
claiming the turn.
(8) B: But in fact you 've both recorded the Enigma Variations and both with
the with the London Philharmonic Orchestra
B: But uh let s see
B: Watch Which one are we going to have
B: That 's the problem
A: Have Jack 's it 's
(ICE-GB/S1B-032#61-65; broadcast discussions)
Table 3 shows the distribution of conversational and non-conversational uses of
let's in the dialogue text categories.14

Table 3. Distribution of conversational and non-conversational uses of let's in
dialogue text categories

Text categories               Conversational let's   Non-conversational let's
business transactions         7                      6
broadcast interviews          3                      1
broadcast discussions         12                     7
classroom lessons             16                     3
legal cross-examinations      2                      0
telephone calls               0                      1
face-to-face conversations    16                     47
parliamentary debates         0                      0
The conversational use of let's prevails in classroom lessons, business
transactions, broadcast discussions, broadcast interviews and legal
cross-examinations. Let's utterances used as conversational imperatives then seem to be
part of the repertoire of chairmen, interviewers and teachers, generally speaking
of interactionally more powerful speakers, who present the conversation as a joint
enterprise, but actually try to control it by restricting the hearer's influence to a
minimum. Rather than giving the floor to the hearer and providing her with the
opportunity for verbal agreement or disagreement, the speaker keeps the floor and
starts carrying out the action immediately after announcing it.
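The statement about where the conversational use prevails can be checked directly against Table 3. The sketch below is only an illustration of that check: the names are hypothetical, "prevails" is read here as "more conversational than non-conversational tokens", and categories without any tokens are given no share at all.

```python
# Counts from Table 3: (conversational let's, non-conversational let's).
TABLE_3 = {
    "business transactions": (7, 6),
    "broadcast interviews": (3, 1),
    "broadcast discussions": (12, 7),
    "classroom lessons": (16, 3),
    "legal cross-examinations": (2, 0),
    "telephone calls": (0, 1),
    "face-to-face conversations": (16, 47),
    "parliamentary debates": (0, 0),
}


def conversational_share(conv, non_conv):
    """Proportion of conversational uses, or None when there are no tokens."""
    total = conv + non_conv
    return None if total == 0 else conv / total


# Categories in which conversational uses outnumber non-conversational ones.
prevailing = [name for name, (conv, non_conv) in TABLE_3.items() if conv > non_conv]

for name, (conv, non_conv) in TABLE_3.items():
    share = conversational_share(conv, non_conv)
    label = "n/a" if share is None else f"{share:.0%}"
    print(f"{name:28s}{label:>5s}")
```

On this reading, the five categories named in the text are exactly those in which the conversational use is in the majority, while face-to-face conversations show the opposite pattern.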
Apart from this use at a structural or topical level, conversational let's
utterances also function as pragmatic formatives in 10% of the cases. This
means that they function as metalinguistic utterances, which signal how the
primary illocutionary act (performed by the utterance of which they are part) fits
into the ongoing conversational structure (Fraser 1987: 187). Their use is briefly
illustrated in the following examples:
(9) A: And the bar for a start
A: Right
A: which is unlikely, lets face it.
B: So I said to him
(10) A: But I still believe that <,> what is
A: I mean let s say Charles Dickens is communicating through you
A: It 's still <,> got to be through your through the medium of you and
therefore the writer
(ICE-GB/S1B-026#235-237; broadcast discussions)
(11) A: Well it 's not that wonderful a film really <,>
A: let s be honest
A: I 'm sure we 'll find something
B: No
When used as a pragmatic formative, let's face it can be seen in the first place as
an indication or a signal that the speaker feels very strongly about the primary
communicative act. According to De Rycker (1990: 403), it may also have a
defensive side to it: it expresses the speaker's awareness that s/he is saying
something that is either controversial or obvious, but which in each case may well
lead to a potentially unfavourable reaction by the listener. Let's is used as a
connective and in this way it still retains some of its prototypical pragmatic
meaning as an act of suggesting a desirable joint activity. Similarly, it can be
argued that let's say is an indication that part of what follows counts as a
hypothetical example, a rough guess or anything else about which the speaker is
not entirely sure (De Rycker 1990: 402). Of course it can also be used as a
convenient hesitation marker, thereby allowing the speaker to take more time to
formulate his or her thoughts. In this way let's say can be both a conversational
imperative and a pragmatic formative. Let's be honest also gives indications as to
how the utterance of which it is part should be interpreted: it seems to indicate
that although the information in the proposition might be controversial, it is still
something that the speaker supports. By using the connective let's and appealing
to the hearer's sense of honesty, the speaker tries to repair common ground or at
least indicates that s/he is aware of a possible discrepancy between the
propositional attitudes of the partners in the conversation. On the whole, it
appears that the function of let's utterances is primarily interactional, when used
as conversational imperatives. Especially in the case of stereotyped or formulaic
uses taken from a stock of ready-made utterances, their illocutionary force as a
proposal for joint action is fairly weak and secondary to their use as idiomatic
overtures, hesitation markers and pragmatic formatives.
Let's utterances, however, were not used as conversational imperatives
only. In the next section I will briefly comment on the non-conversational uses
and pay particular attention to a number of special cases which are not really
action-oriented but which seem to present evaluative statements or emotions.
4.2.3 Non-conversational uses of let's utterances
Non-conversational let's utterances account for 33% of all attested utterances.
Most of them were typical proposals for joint agency and occurred in face-to-face
conversations between intimates. They exemplify the convivial nature of the
proposed action often described in the literature (cf. example (2) above).
Non-conversational speaker-oriented utterances were used as self-monitoring devices
during the process of carrying out an action (cf. example (3) above). Used as
self-addressed imperatives, rather than managing the ongoing interaction, they are
aimed at managing the speaker's own actions. The attested cases of
hearer-oriented let's utterances (8%) were non-conversational in kind (cf. example (4)
above). They were especially frequent in demonstrations where the audience were
asked to perform a number of activities on a computer. In the text category of
face-to-face conversations, they all occurred in one and the same interaction,
where an institutionally more powerful informant gives instructions to a
colleague who is also trying to solve a computer problem. As mentioned above, it
is possible that the indirectness of the utterance can be perceived, not as a
face-maintaining strategy, but rather as a signal of insincerity and condescension.
However, there was no evidence in the linguistic output of the hearers in the
corpus that shows that they regarded the use of let's in these cases as insincerely
over-polite. In the attested cases this was probably due to the fact that the
instructions were given for the benefit of the hearer.
Interestingly, not all hearer-oriented let's utterances occurred in a
rank-sensitive context, where let's was used by the more powerful speaker. There is one
instance in the corpus where let's is used between intimates, which exemplifies
yet another pragmatic use. In the following example the let's utterance is not
primarily used as an action instigator but as an evaluation of the hearer's
behaviour:
(12) A: God you really know how to put someone down don't you
B: Oh lets not get touchy touchy <,,>
A: Very difficult Moses when you 're around <,>
In this example one can see that a negative evaluation of the hearer's behaviour
has been poured into a let's utterance. Bearing in mind the fact that
hearer-oriented let's utterances are primarily used in rank-sensitive contexts (cf. above),
one could see this as a conscious use of patronising language by the speaker in
order to be deliberately offensive, ironic or sarcastic in criticising the hearer's
behaviour. From this position of assumed authority the speaker reprimands the
hearer and monitors her behaviour in a way similar to parents reprimanding
children.
It appears that let's utterances can be used not only as typical action
instigators, but also as reactive acts of (dis)agreeing performed in response to
prior claims, affirmations, statements and similar assertive illocutions. Many
formulaic and colloquial imperatives, such as let's be real, come off it, don't be
stupid, apart from being directive utterances, are also and maybe in the first
place retrospective evaluative comments about the addressee's behaviour or
statements. In a sense, they are assertions, expressions of the speaker's disbelief or
disapproval, accompanied by a directive dimension aimed at redirecting the
addressee's behaviour. Next to expressions of evaluation of the hearer's
behaviour (such as disapproval in example (12)), let's was also used to express a
more positive attitude. In example (13), the let's utterance is not really being used
as an action instigator, but rather as a genuine expression of positive attitude
towards the hearer:
(13) A: Thanks a lot
B: All right
B: Let s hope it works
A: Yep
Rather than acting as a request or proposal for joint action (Shall we hope it
works?), it is an expression of (shared) concern, an expression of sympathy and
empathy.
In this way, then, it seems that let's utterances can be used not only as
action instigators or as conversational imperatives, but also as expressions of
emotion or approval. In these cases, their directiveness is less prominent.
5. Conclusion
Apart from their typical function as a (genuine) proposal for joint action, let's
utterances have various other pragmatic uses. From the analysis it has become
clear that, in the ICE-GB, their most frequent function is that of a conversational
imperative, aimed at regulating the conversational flow of the interaction or the
structure of the speaker's own talk. This use is especially frequent when the let's
utterances are speaker-oriented or in cases of joint agency when they are used by
the interactionally more powerful speaker. In such cases, their illocutionary force
of proposal for joint action is secondary to their use as conversational managers.
Other non-typical functions were attested in non-conversational uses of
let's utterances which were not really oriented towards joint action, but which
aimed at presenting evaluative statements or feelings on the part of the speaker at
an interpersonal level. Rather than being proposals for joint action, they can be
seen as retrospective evaluations.
Further research will require a more detailed pragmatic analysis of let's
utterances and a comparison with other let constructions (e.g. let me) and other
imperative structures, such as look, listen, tell me and the way these relate to let's
utterances as conversational imperatives.
Notes
1. The research reported on in this paper was made possible by the Research
Fund of Flanders.
2. There are other more specialised uses of let's utterances, which will be
focused on in Sections 3 and 4. None of these uses, however, is completely
compatible with the strict notion of request, as in allow.
3. Seppänen rightly remarks that in this way, let is, both syntactically and
semantically, close to the modal may, as used in wishes: May I (we) never see
that day! (Seppänen 1977: 517).
4. Another argument Seppnen uses to support his subject interpretation of the
following NP, is the existence of utterances such as
(a) Let you and I cry quits

(b) Let all these matters pass and we three sing a song
which have the nominative form and can hence not be explained in the
analysis of let as the imperative of a main verb. As an example of a parallel
development he refers to Dutch, where similarly both the accusative and the
nominative are used with the verb laten: Laat me/ik voorzichtig zijn, Laat
hem/hij maar komen, Laat ons gaan/even Laten we gaan. The parallelism
with Dutch, however, is far from complete. In fact, the only instances where
these rare examples of the nominative form after let are found in English, are
restricted to cases where the NP consists of two co-ordinated NPs and where
the NP is separated from the verb. According to Davies (1986), the
nominative form in (a) could be felt to result from the same sort of
hypercorrection that is responsible for the now frequent use of forms like
between you and I, or That's for you and I to decide, while (b) could be
considered to result from hypercorrection in reaction to the use of forms like us
three even as subjects (Davies 1986: 237). Apart from that, no examples of
*Let we see or *Let I see can be attested in the English language, which, all in
all, makes the argument of parallelism with the uses of laten in Dutch rather
doubtful.
5. A similar view is taken by Ukaji (1978). With regard to let's constructions,
the meaning is described as one in which the speaker prays the hearer to
allow a group of persons among whom the speaker and the hearer are
included to carry out a particular action (Ukaji 1978: 120).
6. Davies (1979) errs in the other direction by assigning special status to all
examples involving let, even those examples which are obviously instances of
the imperative form of the lexical verb.
7. To support their argument Huddleston and Pullum refer to the occurrence of a
negative construction used by some speakers of dialect B that provides
evidence for the reanalysis: let's don't bother. Huddleston and Pullum (2002:
935) remark, though, that this is much less common than the construction
with an NP after let's, and cannot be regarded as acceptable in standard
English. Still, according to Huddleston and Pullum (ibid.), its syntactic
interest is that it shows conclusively that let is no longer construed as a verb: a
subjectless don't could not appear in the complement of a catenative verb.
This in fact corroborates what Quirk et al. say about let's don't in AmE.
They do not say, however, that it is not acceptable as standard English.
8. It is possible, of course, that some of these constructions do occur in certain
dialects of English. However, from the fact that they do not occur in the ICE-
GB, we might tentatively conclude that they are not used widely enough to
qualify as acceptable informal style in standard British English.
9. There are other uses of us with a first person reference in colloquial English.
Instances such as Give us a hand, Tell us a story, Give us a kiss all feature
uses of us with a first person singular reference. In spoken language us is
often abbreviated to /s/, which makes its resemblance to this particular use of
let's even more striking.
10. Although Biber et al. (1999) say that the meaning of let's is equivalent to let
me in these cases, I tend to believe that this equivalence is not complete. It
seems to me that there is a slight difference in illocutionary force and in the
freedom given to the addressee to reject the proposal. Let me, just like let us,
can still be used or interpreted in two ways: as a true request for permission
from a person or as a (self-addressed) exhortative. As mentioned, the
permissive interpretation of let's has faded and an example like Let's have a
look at your throat can no longer be interpreted as a true request for
permission the way Let me have a look at your throat can. Both utterances
aim at getting the hearer's agreement, but do so in a slightly different way. By
playing with the requestive interpretation of the ambiguous let me, one can
give the impression of actually looking for or asking for agreement, whereas
when one uses the convivial let's (ranks being equal), agreement is taken
for granted and can be taken as a starting-point to proceed with the proposed
action. Consequently, one could say more accurately that the meaning
of first person let's is equivalent to the exhortative meaning of let me only.
11. All in all, the corpus contains 178 let's utterances, 14 of which were found in
the written part of the corpus.
12. The mixed category of broadcast news did not feature any let's utterances.
13. Clearly, the fact that some text categories allow certain specific pragmatic
uses more easily than others is just one possible way of explaining these
differences in frequency. Of course one should also bear in mind stylistic
reasons, to name but one aspect in explaining differences between the text
categories. In parliamentary debates or legal presentations, for example,
which contained no instances of let's, speakers might have chosen a less
informal construction in order to comply with the formal nature of interaction
that is taking place.
14. In the text categories where only a few let's utterances were attested (e.g.
telephone calls, broadcast talks, legal cross-examinations), it should be clear
that the distribution of conversational and non-conversational uses is not to be
taken as representative of the general use of let's in these categories. Further
research, including the pragmatic analysis of a larger amount of data, will be
needed to corroborate the attested uses.
References
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), The
Longman grammar of spoken and written English. London: Longman.
Costa, R.M. (1972), Let's solve let's!. Papers in Linguistics 5: 141-144.
Davies, E.C. (1979), On the semantics of syntax: mood and condition in English.
London: Croom Helm.
Davies, E.E. (1986), The English imperative. London: Croom Helm.
De Rycker, T. (1990), Imperative subtypes in conversational British English: an
empirical investigation. Unpublished PhD dissertation. Department of
Linguistics, University of Antwerp.
Fraser, B. (1987), Pragmatic formatives, in: J. Verschueren and M. Bertuccelli-
Papi (eds), The pragmatic perspective: Selected papers from the 1985
International Pragmatics Conference. Amsterdam: Benjamins, 179-194.
Edward Arnold.
Hamblin, C. (1987), Imperatives. Oxford: Basil Blackwell.
Huddleston, R. and G.K. Pullum (2002), The Cambridge grammar of the English
language. Cambridge: Cambridge University Press.
Leech, G. (1983), Principles of pragmatics. London: Longman.
Levinson, S. (1983), Pragmatics. Cambridge: Cambridge University Press.
Seppänen, A. (1977), The position of let in the English auxiliary system.
English Studies 58: 515-529.
Stubbs, M. (1984), Discourse analysis: the sociolinguistic analysis of natural
language. Oxford: Blackwell.
Swan, M. (1996), Practical English usage. Oxford: Oxford University Press.
Thomas, J. (1985), The language of power: towards a dynamic pragmatics.
Journal of Pragmatics 9: 765-783.
Ukaji, M. (1978), Imperative sentences in early modern English. Tokyo:
Kaitakusha.
Methodological problems in corpus-based historical
pragmatics. The case of English directives
Thomas Kohnen
University of Cologne
Abstract
This paper gives a summary of some methodological problems a corpus-based
diachronic analysis of speech acts has to face, with particular emphasis on the
case of English directives. Among the issues discussed are the difficulties with a
complete inventory of the different manifestations of directives in the history of
English, the scarcity of historical data, problems of interpretation and the
relationship between the number of the particular manifestations of directives
found in a corpus and the 'underlying' total number of directives. In a second part
the paper presents some illustrative results of corpus-based investigations tracing
the history of English directives.
1. Introduction
The study of the diachronic development of speech acts with the help of
electronic corpora raises serious questions which challenge both the reliability of
existing data collections and the results of the investigations which are based on
them. Any attempt to write a corpus-based illocutionary history is faced with
basic problems involving the methodology of historical pragmatics and the design
and use of historical corpora. This paper aims to give a summary of some
important problems which I encountered in my corpus-based research, illustrating
them with several studies on the history of English directive speech acts. It falls
into two parts. The first part addresses some basic methodological issues; the
second part is devoted to some illustrative results of studies exploring aspects of
the history of English directives.
2. Methodological issues
One of the rather basic methodological problems a corpus-based diachronic study
of directives is faced with is finding all the different manifestations or patterns
associated with directive speech acts in the history of English. The research
carried out in the fields of speech-act theory and pragmatics has made it
sufficiently clear that with speech acts there is no predictable link between form
and function. We may know that a directive speech act is an attempt by a speaker
or writer to get the addressee to carry out an act (Searle 1969: 66; 1976: 11) and
we may assume that this illocutionary function remains stable throughout the
history of English. But we do not know in advance what linguistic form a speaker
or writer may employ for his directive. We can only rely on the fact that people
tend to use more or less fixed phrases or patterns in order to perform certain
speech acts. Since corpus searches must be based on forms rather than functions,
the study of a history of directives has to start with a selection of forms we would
consider typical manifestations of directives in the different periods of the
English language.
What are the most important manifestations of directives in the history of
English? The most straightforward examples which come to mind are explicit
performatives (I order you to carry this message to the king), imperative
sentences, constructions with let (let's do it) and constructions involving the
subjunctive, especially those with inverted word order (go we).1 Clearly,
however, there are many other manifestations of directive speech acts. The
number of possible candidates becomes even greater if we include those
realisations which are sometimes called indirect, because they involve sentence
types different from the imperative format.2 Some typical examples are
declarative sentences with the second person pronoun plus a modal involving
obligation (you must leave, you ought to do this etc.), declarative sentences with a
first person pronoun plus a verb involving volition (I want you to do this, I would
like you to do this etc.) and different kinds of interrogative manifestations (Can
you open the door? Will you do the washing up? Why don't you come in? cf.
Quirk et al. 1985: 1477-78).
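Because corpus searches start from forms, the patterns listed above can be turned into a small search inventory. The following is a minimal sketch only: the labels, the regular expressions and the function name find_candidates are assumptions made for illustration, and they cover Modern English spellings rather than any real corpus annotation scheme. Every hit is merely a candidate that still needs to be interpreted in context.

```python
import re

# Illustrative inventory only: labels and expressions are assumptions made
# for this sketch and cover a few Modern English surface forms named in
# the text. A real study would need period-specific spelling variants.
PATTERNS = {
    "let": re.compile(r"\blet(?:'s| us)\b", re.IGNORECASE),
    "obligation_modal": re.compile(r"\byou (?:must|ought to)\b", re.IGNORECASE),
    "speaker_volition": re.compile(
        r"\bI (?:want|would like) you to\b", re.IGNORECASE
    ),
    "interrogative": re.compile(
        r"\b(?:can|will|would) you\b|\bwhy don't you\b", re.IGNORECASE
    ),
}


def find_candidates(text):
    """Return (label, matched string) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


# Two modern examples from the text and two historical lines quoted in it.
for line in [
    "Can you open the door?",
    "I would like you to do this",
    "would you speake with me?",
    "Koude ye me wisse, I wolde wel quite youre hire.",
]:
    print(line, "->", find_candidates(line))
```

The two historical lines show the limits of a purely form-based search. The Shakespeare line is retrieved as an interrogative candidate although it is not a request, while the Chaucer line is not retrieved at all because of its spelling.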
Quite clearly, this enumeration could be continued. It is difficult to give a
comprehensive list of all the typical manifestations of directive speech acts for all
periods of the English language. What does this mean for a corpus-based analysis
of speech acts? Basically, there seem to be two kinds of procedure. First, since
we are faced with an open, heterogeneous and highly variable set of forms we can
restrict our analysis to an eclectic illustration of the speech act under
consideration. That is, we look around in the periods of English and see what
typical realisations we find, for example, showing some imperatives and inverted
constructions in Middle English texts and some interrogative constructions in
Early Modern texts, perhaps adding some intuitive judgments about changes
which we assume to be typical. Secondly, we can base our analysis on a
deliberate selection of typical patterns which we trace by way of a representative
analysis throughout the history of English. For example, we could examine the
development of imperatives, constructions with let or interrogative directives
throughout the history of English. I call the first kind of procedure illustrative
eclecticism, the second structured eclecticism. Given the fact that the research is
doomed to be eclectic, I think corpus linguists should opt for structured
eclecticism.
Another methodological problem concerns difficulties of interpretation. Speech
act assignment is in many cases a matter of interpretation which requires careful
consideration of contextual knowledge. For example, we are liable to find
imperative constructions which do not serve as directives, but as imprecations or
wishes (Quirk et al. 1985: 831-832, Biber et al. 1999: 220). This problem, of
course, becomes more serious with indirect manifestations because their
indirectness is due to their openness to different speech-act assignments. These
are difficulties with functional interpretation, which apply to any linguistic data,
historical or contemporary. But in the history of English directives we also
encounter difficulties which relate to semantic or syntactic changes. This often
results in what I would call pragmatic false friends, constructions which, against
a contemporary background, suggest a wrong pragmatic interpretation. I will
present four examples.
The first example is taken from Chaucer's Canterbury Tales. Here a young
knight urgently needs to know what women desire most.
(1) My leeve mooder, quod this knyght, certain
I nam but deed but if that I kan seyn
What thyng it is that wommen moost desire.
Koude ye me wisse, I wolde wel quite youre hire.
(c 1395, Geoffrey Chaucer, The Wife of Bath's Tale, 1005-1008)
Quite clearly, the meaning of the last line is not Could you please instruct me? I
would certainly reward your efforts, that is, an indirect request, but rather If you
could tell me (knew how to instruct me), I would reward your efforts. The
difficulty of interpretation is here due to the fact that the inverted clause pattern
(koude ye) could be interrogative or conditional in Middle English and that the
verb cunne was still used as a full verb in this period.
The second example is taken from an official letter by Henry V:
(2) we wol and charge you. þat ye se and ordeyne þat hasty restitucion
of þe forsaide goodes be maad and þat ye do compelle our saide
sougettes to make restitucion abouesaid
(Helsinki Corpus, Letters, 1418/1419, Henry 5, 99)
The difficulty of interpretation is here due to the fact that willan had an additional
speech-act meaning in Old English and Middle English. Thus the expression we
wol must be taken as a performative phrase, where wol has the speech-act
meaning of order, command. The speech-act meaning seems likely since wol is
in a co-ordinate construction with another performative directive (charge). Thus
we are not dealing with some kind of indirect directive along the lines of I would
like you to but with a performative expression.
(3) Ford: Blesse you sir.
Fal.: And you sir: would you speake with me?
(Helsinki Corpus, 1623 [1597], William Shakespeare, The Merry
Wives of Windsor, 46.C1)
In the third example, which is from Shakespeare, Falstaff has already been
informed that Ford wants to talk to him. Thus the utterance Would you speak with
me? is not a request (Would you talk to me?) but rather a real question which
serves to identify the man who wanted to talk to Falstaff (Did you want to talk to
me?). In Modern English this interpretation would not be possible because would
cannot be taken as referring to the past.
(4) But now my good masters since we must be gone
And leaue you behinde vs, here all alone:
Since at our last ending thus mery we bee,
For Gammer Gurtons nedle sake, let vs haue a plaudytie.
(Helsinki Corpus, 1575 [1552-63], William Stevenson, Gammer
Gvrtons Nedle, 70)
The fourth example contains a construction with let us. It is found at the ending
of Stevenson's play Gammer Gvrtons Nedle. Here the phrase let vs haue a
plaudytie clearly is an invitation addressed to the audience to give applause,
which should be paraphrased with allow us / cause us to have some applause.
Thus, although this construction is an imperative, it cannot be considered a
hortative or periphrastic imperative construction (cf. Rissanen 1999: 279). Rather
it is a construction with the full verb let. This is quite disturbing against a
contemporary background since we would like to rely on the assumption that
constructions with let us are always periphrastic constructions.
The discussion of the excerpts has shown that a corpus-based analysis
which selects items on the basis of form must be extremely careful with the
interpretation of the examples. Each individual item which we assume to be a
manifestation of a directive speech act requires careful consideration if we want
to avoid pragmatic false friends.
The third methodological issue involves the relationship between the
examples of particular manifestations of directives found in a corpus and the
underlying total number of directives. If we compare the different frequencies
of selected manifestations in the periods of English, do the increasing or
decreasing numbers only reflect an increase or decrease in the respective
manifestations or do they suggest a general change in the use of directives as
well? For example, if we find more directive performatives in Late Middle
English letters than in Modern English ones, is this because people choose to use
different means for expressing their requests in letters today or is it because they
use fewer requests? In other words, would a decreasing frequency of
performative directives point to an increase in alternative manifestations (e.g.
imperatives, constructions with let, interrogative directives) or to a general,
underlying decrease of directives? I think this problem can only be tackled if we
base our analysis on comparable text types or genres and if we assume a more or
less stable functional profile for these text types or genres. For example, we could
assume that religious instruction requires directive speech acts in the Middle
Ages as well as today. Or we might assume that text types involving spoken
interaction are liable to contain a stable amount of directives because, when
people talk to each other (especially in an everyday setting), they are likely to
perform requests.
The fourth methodological problem to be mentioned here is how to deal
with the lack of sufficient data. In the early periods of English the number of
relevant examples (especially of indirect manifestations) found in a classic
corpus (like the Helsinki Corpus) tends to be fairly low. An analysis based on
such numbers cannot be held to be valid. Among the options of dealing with this
problem is extending the database by using large dictionary corpora (e.g. MED
or OED) or finding functional patterns in the restricted number of items at hand.
Each approach has advantages as well as disadvantages.
When studying constructions with let me (e.g. let me ask you to do this) I
found that the Helsinki Corpus had only 14 relevant items in the Middle English
section.3 In order to obtain more examples I searched all the quotations in the
electronic version of the Middle English Dictionary. Here my search produced
231 relevant items. On the basis of these data one can show, for example, that in
Middle English there are no combinations with let me plus an illocutionary verb
(let me ask you, let me entreat you). I presume that such a result is much more
reliable if based on the MED data rather than on the 14 items found in the
Helsinki Corpus. On the other hand, with electronic corpora like the MED we
do not know the proportion of the individual text types and we cannot get
regularised frequencies.
When studying interrogative manifestations of directives which mention
the addressee (Can you pass the salt? Would you do the washing up?) I found
only 36 examples in the Helsinki Corpus. This number may seem very small but
in the Early Modern English section these data showed an exceptional degree of
distributional consistency. First, all instances (apart from one item) belong to
spoken interaction, and the large majority of the items (85%) belong to two text
types: plays and trials. If we focus on these two text types (containing 79,000
words out of the total number of 551,000 words), we can see a consistent increase
of the items with a reasonably high frequency reaching 4.15 per 10,000 words at
the end of the 17th century (see Table 1; for a detailed discussion of the data see
Kohnen 2002).
Table 1: Addressee-based directives in plays and trials in the Helsinki Corpus
(tokens per 10,000 words)

1420-1500   1500-1570   1570-1640   1640-1710
0           2.26        3.45        4.15
So the scarcity of data may be balanced if we focus our attention on individual
text types and their functional profiles. On the other hand, the concentration on
individual text types or genres may raise doubts about the representativeness of
the analysis.
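The normalised frequencies used here (tokens per 10,000 words) and the concentration on two text types can be made explicit with a short calculation. This is a minimal sketch: the function name normalized_frequency is an assumption, the two word counts are the ones given in the text, and the raw count in the final line is invented purely for illustration, since the underlying counts per period are not reported here.

```python
def normalized_frequency(tokens, words, per=10_000):
    """Return the number of occurrences per `per` running words."""
    if words <= 0:
        raise ValueError("words must be positive")
    return tokens * per / words


# Word counts taken from the text: plays and trials vs. the total.
PLAYS_AND_TRIALS_WORDS = 79_000
TOTAL_WORDS = 551_000

# Share of the running words that the two text types account for.
share = PLAYS_AND_TRIALS_WORDS / TOTAL_WORDS
print(f"plays and trials: {share:.1%} of the words")

# Hypothetical count, invented for illustration only (not from the paper):
# 10 items in a 20,000-word sample.
print(normalized_frequency(tokens=10, words=20_000))
```

Seen this way, the large majority of the items come from a subcorpus that makes up only a small part of the running words, which is why frequencies normalised on the subcorpus are more informative than raw counts.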
To sum up this methodological section, I would like to advocate the
procedure which was called structured eclecticism. It makes up for the
heterogeneity of the data by systematic selection, comprehensive diachronic
statistical analysis and careful consideration of each item. In addition, it has been
shown that a diachronic analysis of speech acts should be embedded in a
reasonably stable functional profile of text types. This as well as the notorious
lack of data call for more extensive text-type specific corpora for historical
pragmatic studies.
3. Aspects of a history of English directives
What is the outline of a history of directives in English? It seems best to start with
a general consideration of the speech-act class of directives. Since a directive
aims at an act to be performed by the addressee, it can be seen as a threat against
the addressee's freedom of action and freedom from imposition, that is as a threat
against what is usually called the addressee's negative face (Brown and Levinson
1987). It seems that in the history of English considerations of face have assumed
increasing importance, changing the manifestations of directives more towards
polite and indirect realisations. This tendency can be illustrated by a decrease of
direct realisations of directives, for example performatives and imperatives, and
by an increase of prototypical indirect manifestations, for example interrogative
directives and constructions with let.
With regard to performatives, I found that the frequency of directive
performatives in the Old English section of the Helsinki Corpus is seven times as
high as that found in the LOB Corpus (4 vs. 0.55 per 10,000 words). In addition,
the frequency of performative verbs referring to acts of ordering and
commanding is far higher in the Old English section than in the LOB Corpus (1.5
vs. 0.07). And whereas the LOB Corpus shows a clear predominance of
suggest/advice verbs, the Old English part of the Helsinki Corpus has none (for
a detailed discussion see Kohnen 2000). So it seems that performatives, which at
least today are a rather direct and mostly face-threatening manifestation of
directives, are significantly less common in contemporary (written) English than
they were during the Anglo-Saxon period. If they are used today, they tend to be
employed in rather mild requests, like suggestions or advice. On the assumption
that people perform as many directives in written English today as they did in
Anglo-Saxon times we may infer that the performative option of directives has
fallen out of favour and that other possibly less face-threatening means are
employed instead.
With regard to the imperative manifestation of directives it can be shown
that the number of imperatives decreases from Middle English to Modern
English. I looked at imperatives in the religious treatises in the Penn-Helsinki
Parsed Corpus of Middle English (PPCME2, Kroch and Taylor 2000) and in the
Brown Corpus and LOB Corpus. Religious treatises, both in their Late Medieval
and their modern form, can be assumed to have a basic instructional function,
which makes it likely for imperatives to be used there. Since I wanted to focus on
instructional imperatives which directly serve the purpose of religious instruction
associated with the text type, I excluded those sentences which appear in direct
speech, that is imperatives contained in narrative sections, quotations from the
Bible, etc. (see Figure 1).
[Figure 1: bar chart. Values shown: PPCME2 1200-1375: 4.5; PPCME2 1390-1450: 1.9; LOB: 0.6; Brown: 0.8]
Figure 1: Imperatives (excluding direct speech) in the PPCME2, the LOB Corpus
and the Brown Corpus (tokens per 1000 words).
Figure 1 shows that the frequency of imperatives in the period 1200-1375 is fairly
high (4.5). It goes down significantly in Late Middle English (1.9). But the
frequency in the LOB Corpus (0.6) and in the Brown Corpus (0.83) is less than
half of the Late Middle English figure.4 Since we may assume that all texts share
the same basic function of religious instruction, the decrease in imperatives can
be explained either by the hypothesis that religious instruction uses fewer
directives today (and, for example, more representative speech acts instead) or by
the hypothesis that imperatives no longer enjoy wide currency as a means of
directives in religious instruction but are replaced by other manifestations, for
example indirect directives. It is difficult to determine with certainty which
hypothesis is correct, but there is some evidence which renders the second option
more probable than the first one. Frequent constructions with let us and with
modals point to the fact that directives are still being used in religious instruction.
And, of course, a number of directives employing the imperative can still be
found in the contemporary data. There is, however, a remarkable difference
between most of these imperatives and the typical imperatives in the Middle
English religious instructions. Whereas the Middle English texts involve
straightforward acts which are or should be part of the addressee's everyday life
(like doing good deeds or praying, see examples 5 and 6), the imperatives
found in the modern texts typically denote mental acts which help the addressee
to decode the text or to grasp the point of the text (see examples 7 and 8).
(5) ... þerfore do þou þi-silf alle þe gode deedis wiþ-oute deuocioun, þe
whiche þou didist bifore with deuocioun.
(a 1396, Walter Hilton, Hilton's Eight Chapters on Perfection,
PPCME2 - 4.23)
(6) ... and preie ȝe hertly for hem, that God of his greet mercy ȝeue to
hem very knowing of scripturis, and meekenesse, and charite.
(a 1397, John Purvey, Purvey's General Prologue to the Bible,
PPCME2 I, 49.2033)
(7) What about religion and politics? They are not in two watertight
compartments. Think of the number of laws that have just as much to
do with a man's soul as with his body.
(LOB, D16, 11-13)
(8) If the people had kept the Lord before them and observed His words
through the former prophets, things would have been far otherwise.
And what was His word now through Zechariah, but just what it had
been through them. Take Isaiah's first chapter as an example. He
accused the people of moral corruption, whilst maintaining
ceremonial exactitude.
(LOB, D11, 39-44)
It is far less face-threatening to guide people through a text using imperatives
than to employ imperatives in requests which affect their everyday lives. So it
seems here that considerations of politeness may explain some of the changes
noted.5
Table 2: Indirect manifestations of directives in the Helsinki Corpus (tokens per
10,000 words)

                   1420-1500   1500-1570   1570-1640   1640-1710
Speaker volition   0.37        0.79        1.16        1.11
Interrogative      0.19        0.42        0.47        0.82
Since the data indicates that the frequency of more direct and possibly face-
threatening manifestations of directives decreases in the history of English, it
makes sense to ask whether these forms are replaced by other, possibly more
polite directives. To find a possible answer, it is instructive to look at the
evolution of so-called indirect manifestations of directives. Table 2 shows the
development of directives involving speaker volition (the type I would like you to
do this) and of interrogative realisations referring to the addressee (Can you pass
the salt?) in the Helsinki Corpus (for a detailed analysis, see Kohnen 2002).
Although the frequency of the items found is rather low (but see the
discussion in Section 2), the general picture of the development of the
constructions is quite clear. Both constructions increase noticeably during the
Early Modern English period, with some precursors occurring in Late Middle
English. Interrogative manifestations develop more slowly in the data and one
could claim that they are later in appearing. At least their increase during the
Early Modern English period is slow and the frequency is clearly lower than that
of the other construction. It is only towards the end of the Early Modern period
that the frequencies of the two constructions converge. The result that is most
relevant here is that the indirect manifestations do not become common until the
Early Modern period. It is quite striking that this is just the time (the end of the
Middle English period) when the imperative forms seem to recede.
Another illustration of indirect manifestations of directives is provided by
constructions with let me plus a (mostly directive) illocutionary verb:
(9) Sam. I pray you let mee intreat you: foure or five houres is not so
much.
Dan. Well, I will goe with you.
(Helsinki Corpus, 1593, George Gifford, A Dialogue Concerning
Witches B2R)
(10) Venat. Come my friend, Piscator, let me invite you along with us;
I'll bear you charges this night, and you shall bear mine to morrow;
for my intention is to accompany you a day or two in Fishing.
(Helsinki Corpus, 1676, Izaak Walton, The Compleat Angler 212)
With this construction the speaker, instead of issuing a straightforward directive
performative, asks permission for doing so. However, since this permission is
never actually granted, the construction can be seen as a more or less
conventionalised polite or indirect formula for performing a directive. Table 3
shows the development of this construction in the Helsinki Corpus. Although the
number of items is again rather small, the general picture is quite clear.6 The
construction appears during the 16th century and its frequency increases
significantly during the Early Modern period. Here again it is the Early Modern
period which witnesses the rise of an indirect directive.
Table 3: Let me constructions including illocutionary verbs in the Helsinki
Corpus (tokens per 100,000 words)

1420-1500   1500-1570   1570-1640   1640-1710
0           0.5         5.3         8.2
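The frequencies in this paper are normalised on different bases: per 1000 words in Figure 1, per 10,000 words in Tables 1 and 2, and per 100,000 words in Table 3. Before figures are compared across tables, they need to be brought to a common base. The following minimal sketch shows the conversion; the helper name rebase is an assumption introduced here, and the input values are those printed in Table 3 and Figure 1.

```python
def rebase(rate, from_base, to_base):
    """Convert a rate normalised per `from_base` words to per `to_base` words."""
    return rate * to_base / from_base


# Table 3 as printed: let me constructions per 100,000 words.
TABLE_3_PER_100K = {
    "1420-1500": 0.0,
    "1500-1570": 0.5,
    "1570-1640": 5.3,
    "1640-1710": 8.2,
}

# The same figures expressed per 10,000 words, the base of Tables 1 and 2.
table_3_per_10k = {
    period: round(rebase(rate, 100_000, 10_000), 3)
    for period, rate in TABLE_3_PER_100K.items()
}
print(table_3_per_10k)

# Figure 1 is given per 1000 words; the earliest period rebased to 10,000.
print(round(rebase(4.5, 1_000, 10_000), 1))
```

On a common base it becomes visible that the let me construction is much rarer than the imperatives shown in Figure 1, so the tables document trends within each construction rather than directly comparable magnitudes.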
4. Conclusion
Although the research presented here is the result of what I called structured
eclecticism and although corpus-based historical pragmatics faces several
methodological problems, the general picture of the history of English directives
is quite consistent. During the history of English, directives become less explicit,
less direct and less face-threatening. By contrast, the number of indirect
manifestations increases. The important period for the evolution of indirect
manifestations seems to be the Early Modern period, whereas the frequency of
direct manifestations seems to decrease after the Middle English period. The
underlying motivation seems to be the growing importance of considerations of
politeness.
Notes
1. Cf. Fischer (1992: 248) and Rissanen (1999: 228-229, 279-280); for a survey
of directives, see Quirk et al. (1985: 827-833).
2. See, for example, Searle (1976) and Levinson (1983: 263-276).
3. For information on the Helsinki Corpus, see Kytö (1996).
4. Text 9 from the Brown Corpus (Organizing the local church, 2056 words)
was excluded from the analysis, since, as instruction for instructors, it is
focused on organisational matters and does not contain religious instruction in
its proper sense.
5. Biber et al. (1999: 222) say that in contemporary academic prose imperatives
are used as a means of guiding the reader in interpreting the text.
References
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman

Brown, P. and S.C. Levinson (1987), Politeness. Some universals in language
usage. Cambridge: Cambridge University Press.
Fischer, O. (1992), Syntax, in: N. Blake (ed.), The Cambridge history of the
English language, Vol. II 1066-1476. Cambridge: Cambridge University
Press. 207-408.
Kohnen, T. (2000), Explicit performatives in Old English: A corpus-based study
of directives. Journal of Historical Pragmatics 1 (2): 301-321.
Kohnen, T. (2002), Towards a history of English directives, in: A. Fischer, G.
Tottie and P. Schneider (eds), Text types and corpora. Studies in honour of
Udo Fries. Tübingen: Gunter Narr. 165-175.
Kroch, A.S. and A. Taylor (2000), The Penn-Helsinki Parsed Corpus of Middle
English PPCME2.
Kytö, M. (1996), Manual to the diachronic part of the Helsinki Corpus of English
Texts. 3rd ed. Helsinki: University of Helsinki.
Levinson, S.C. (1983), Pragmatics. Cambridge: Cambridge University Press.
MED: Middle English Dictionary, electronic version, in: F. McSparran et al.
(eds) (1999), The Middle English compendium. Ann Arbor, Mi: University
of Michigan Press.
OED: The Oxford English dictionary second edition on compact disc (1994),
Version 1.13. Oxford: Oxford University Press.
Rissanen, M. (1999), Syntax, in: R. Lass (ed.), The Cambridge history of the
English language, vol. III: 1476-1776. Cambridge: Cambridge University
Press. 187-331.
Searle, J.R. (1969), Speech acts. Cambridge: Cambridge University Press.
Searle, J.R. (1976), A classification of illocutionary acts. Language in Society 5:
1-24.
Measure noun constructions: degrees of delexicalization and
grammaticalization
Lieselotte Brems
University of Leuven
Abstract
In a narrow sense, the term Measure Noun (MN) refers to such nouns as acre
and kilo, which typically measure off a well-established and specific portion of
the mass or entity specified in a following of-phrase, e.g. a kilo of apples. When
used like this, the MN is generally considered to constitute the lexical head of the
bi-nominal noun phrase. However, the notion of MN can be extended to include
such expressions as a bunch of and heaps of, which, strictly speaking, do not
designate a measure, but display a more nebulous potential for quantification.
The structural status of MNs in this broader sense, then, is far from
straightforward and most grammatical reference works of English are either
hesitant or silent with regard to the issue. Two main analytical options seem to
suggest themselves. Either the MN is interpreted as constituting the head of the
NP, with the of-phrase as a qualifier of this head, or the MN is analysed as a
modifier, more specifically a quantifier, of the head, which in this case is the
noun in the of-phrase.
Starting from the structural analyses of MN constructions offered by such
linguists as Halliday and Langacker, my paper goes on to discuss a corpus study
aimed at charting and elucidating the structural ambivalence observed in MN
constructions. The framework eventually opted for is that of
grammaticalization. The focus of the corpus study is synchronic
grammaticalization (Lehmann 1985). More specifically, it investigates the
degree of grammaticalization of the various MNs looked at, viz. bunch(es) of,
heap(s) of, pile(s) of and load(s) of.
1. Introduction: What's in a name?1
Measure Nouns (henceforth MNs) or nouns of measurement in the strict sense
are nouns such as acre, litre, pound, ounce, etc. that measure off a well-defined
standard-like portion of the mass or entities specified in the of-phrase following
the MN, as in an acre of wasteland.2
However, this paper extends the use of the term MN to include nouns
which, strictly speaking, do not designate a measure, but display a more
nebulous potential for quantification. More specifically, the MN expressions in
this broader sense focused on in this study are bunch(es) of, heap(s) of, pile(s) of
and load(s) of. The type of construction in which they are used is a bi-nominal
noun phrase of the kind illustrated in the following set of examples. All examples
in this paper are from the Cobuild corpus.3
(1) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway.
(2) A jilted girlfriend got revenge on the boyfriend who dumped her by
dumping a foot-high pile of manure in his bed.
(3) We still have to move loads of furniture and other stuff.
(4) The surrogate mum to princes William and Harry shared heaps of fun
with them at a fair yesterday while father Charles was otherwise engaged.
(5) I would take up a pile of commonplace books like Lord David Cecil's
Library Looking Glass, John Julius Norwich's Christmas Crackers, Rupert
Hart-Davis's A Beggar in Purple, etc.
(6) Then I noticed, under a pile of other books on my nightstand, the worn
journal my father had given me those weeks ago.
The central question with regard to MN constructions is the status of the MNs
bunch, pile, loads and heaps within their respective NPs. Does the MN constitute
the head noun, or does it function as a quantifier of the head noun in the of-
phrase? Naturally, assessing the status of the MN within the bi-nominal NP has
repercussions at clause level, most notably on the question of subject-finite
concord whenever the MN-nominal occurs in subject position. The central
question of the study is of a comparative nature and focuses on possible
differences in the extent to which the various MNs have already come to function
as a quantifier (Section 3).
A quick glance at the above set of examples illustrates the specific rub of
MN constructions. Sentences (1), (2) and (6) are rather unproblematic: in all
three the MN is the head noun, displaying the literal and collocationally
restricted meaning of bunch and pile. This analysis is reinforced by the verb
agreement between hangs and bunch in (1), the preposition under in (6) and the
premodifier foot-high in (2), which stresses the fact that it is a literal pile
taking up a certain space.
In (4), on the other hand, the MN functions as a quantifier of the noun in the of-
phrase (N2).
The lexical constellation specifics of heaps have bleached into a mere
quantifier, which allows the MN to be used with an abstract noun like fun that
surely cannot be made into actual heaps. Examples (3) and (5) are more
problematic: do the furniture and stuff constitute actual loads or does the sentence
simply mean a large quantity of furniture and stuff without it necessarily being
arranged in a literal load? Do the commonplace books in (5) together make up an
actual pile or is it merely implied that the number of books could constitute a
pile?
Examples (3) and (5) are intermediate between examples such as (1) and
(2), with the original and fully lexical meaning of the MN, on the one hand, and
the grammatical quantifier meaning of (4) on the other. I will suggest that the
developments observed in MN constructions are best looked at as a case of
ongoing delexicalization and grammaticalization in MNs (Section 2.3). The by
now more or less full-blown quantifiers a lot of and lots of can be considered
historical precursors in these developments.
Synchronic variation in verb agreement patterns is an important argument
for claiming that the structural status of the MN is changing from head to
quantifier. Wherever subject-finite concord is observable, analysis of data reveals
consistent patterns. When the MN is head of the bi-nominal group, it controls verb
agreement (examples 7 to 9); when it functions as a quantifier, the finite agrees
with N2 (examples 10 to 12).
(7) I can show you a van-load of weapons that was confiscated at the gate.
(8) Three plane-loads of food have been ferried into the town in the past three
weeks.
(9) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway.
(10) A bunch of drunken, braindead louts seem determined to disgrace our
team.
(11) But then, when I needed one, there were a load of excuses as to why I
couldn't borrow one.
(12) They are threatening to kill off a bunch of select committees that have
been around for a long time.
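The concord test illustrated by examples (7) to (12) can be sketched as a small decision rule. This is a minimal illustration rather than the procedure used in the study: the function name status_from_concord and the "sg"/"pl" coding are assumptions, uncount N2s such as food are coded as singular, and the test is treated as uninformative whenever MN and N2 share the same number.

```python
from typing import Literal

Number = Literal["sg", "pl"]


def status_from_concord(mn: Number, n2: Number, verb: Number) -> str:
    """Infer the status of the MN in 'MN of N2' from the number of the finite verb."""
    if mn == n2:
        # Same number on both nouns: agreement cannot tell head from quantifier.
        return "indeterminate"
    if verb == mn:
        # The verb agrees with the MN, so the MN is treated as head.
        return "head"
    # Otherwise the verb agrees with N2, so the MN is treated as quantifier.
    return "quantifier"


# (7) a van-load of weapons that was confiscated
print(status_from_concord("sg", "pl", "sg"))
# (8) Three plane-loads of food have been ferried
print(status_from_concord("pl", "sg", "pl"))
# (10) A bunch of drunken, braindead louts seem determined
print(status_from_concord("sg", "pl", "pl"))
```

On this coding, (7) and (8) come out as head uses and (10) as a quantifier use, in line with the grouping of the examples above.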
Pedagogical grammars and mainstream accounts like Kruisinga (1925), Jespersen
(1970-74), Quirk et al. (1985) and Biber et al. (1999) often bring up MN
expressions with regard to the perceived fluctuation in verb agreement. They
rarely incorporate the idea of a diachronic shift in structural status of MNs in
explaining the variation in concord. Most grammars invariably assign head status
to MNs in MN constructions.
When the verb agrees in number with the MN, strict grammatical concord
is said to be satisfied. Sentences in which the finite does not agree in number with
the MN are explained in terms of conflicting concord principles. In those cases
notional concord is said to overrule strict grammatical concord, i.e. in sentences
such as (10) to (12), the MN is still considered to be the head of the nominal
group, but it is the idea of number of the MN that determines the number of the
verb. Alternatively, the principle of proximity or attraction is invoked, which
states that whatever element of structure most closely precedes the finite controls
the number of the verb. Both notional concord and proximity concord are invoked
in an ad hoc fashion, to explain away incongruent verb agreement.
The possibility of N2 actually constituting the head in at least some cases
is not systematically considered, just like the status of the MN and N2 as such is
not systematically questioned (Jespersen 1970-74: 179; Kruisinga 1925: 306;
Quirk et al. 1985: 264; Biber et al. 1999: 184-185). I would like to argue that in
all MN constructions grammatical concord holds between the head and the verb,
depending on whether the MN or N2 is the head.
As hinted at in discussing examples (1) to (6), I will also argue that the
most adequate way of seeking order in the perceived chaos is by bringing in the
perspective of grammaticalization (cf. Lehmann 1985 and Hopper and Traugott
1993). I will argue more extensively for this approach in Section 2.3.
The following sections will first survey two relevant descriptive accounts
of MN constructions. Section 2.1 discusses Halliday's analysis of what he calls
measure nominals (Halliday 1985: 173) and Section 2.2 sums up some of
Langacker's (1991) pertinent insights into MNs. Section 2.3 presents the
framework eventually adopted in this paper, viz. grammaticalization. Section 3
then reports on the most important findings of a corpus study of MNs which I
carried out based on data from the Cobuild Corpus, The Bank of English. The
focus of this study is a comparison of the extent to which heap(s) of, pile(s) of,
load(s) of and bunch(es) of have already become grammaticalized.
2. Theoretical-descriptive starting point
2.1 Halliday's account of Measure Noun constructions
Halliday (1985: 173) deals with MN constructions in a section following his
exposé in favour of a twofold analysis of the nominal group on the ideational
level, i.e. the level of lexicogrammatical organization concerned with the
representation of experience.
Within the ideational level Halliday distinguishes between two layers of
analysis, one in terms of constituency and the other in terms of dependency. The
constituency layer offers an analysis of the nominal group as a multivariate
structure, i.e. as constituting a constellation of distinct functional slots which in
some way characterize the Thing of the nominal group, which itself designates a
class of entities and establishes the semantic core of the nominal group, as shown
in Table 1.
Table 1. Experiential structure of the nominal group

Deictic  Numerative  Epithet1  Epithet2  Classifier  Thing   Qualifier
those    two         splendid  old       electric    trains  with pantographs
The dependency layer, on the other hand, analyses the nominal group as a
univariate structure, viz. in terms of the recursive head-modifier relationship
displayed by the nominal group, as shown in Table 2.
Table 2. Logical structure of the nominal group (head-modifier)

Premodifier      Head    Postmodifier
those electric   trains  with pantographs
In the default case the Thing of the experiential layer and the logical Head
coincide. However, there are a few types of nominal group where Head and
Thing do not coincide and those involving a measure of something (...) [i.e.]
measure nominals (Halliday 1985: 173) are an example of such a discrepancy
between Head and Thing. Halliday goes on to analyse so-called measure
expressions (id.: 169) in the following way:
In the logical structure, the measure word (pack, slice, yard) is Head,
with the of phrase as Postmodifier. The Thing, however, is not the
measure word but the thing being measured: here cards, bread, cloth.
The measure expression functions as a complex Numerative.
(Halliday 1985: 173)
This dual analysis can be visualised by the box diagram for a pack of cards
shown in Table 3.
Table 3. Twofold analysis of the nominal group; discrepancy Head/Thing
(Halliday 1985: 173)

                         A          pack    of     cards
Experiential structure   Numerative (A pack of)    Thing
Logical structure        Modifier   Head    Postmodifier (of cards)
Halliday comments further that
[i]t is not that one [analysis] is right and the other wrong; but that in
order to get an adequate account of the nominal group, [...] we need
to interpret it from both these points of view at once. [Italics LB]
(Halliday 1985: 172-173).
This comment seems to imply a certain flexibility in the interpretation of MN
constructions. However, it does not allow head status to shift from the measure
noun to the noun designating the matter being measured. This makes it hard to
see how diachronic variation in the status of the MN, which is reflected so clearly
in (synchronically) distinct subject-finite concord, can be captured.
What is unhelpful about the proposed dual analysis is that Numerative and
Head status of the MN are divided over two simultaneous levels of analysis, thus
suggesting that in each use the MN is always both. Against this, the description of
MN constructions proposed in this article will involve two synchronically distinct
analyses, with the second one being treated as a (diachronic) re-analysis of the
first:
A lot    of land          Lots of      paper
A heap   of paper         Heaps of     people
Head     Postmodifier     Quantifier   Head
Langacker's discussion of MNs, reviewed in the following section, does bring in
the diachronic perspective and is compatible with the framework that I will adopt
in the present corpus study, viz. grammaticalization studies.
2.2 Langacker: the diachronic angle
Langacker (1991) turns to the issue of MNs in his general discussion of the
function of quantification in the NP. What is interesting about his observations is
that they immediately address the question of MNs from a diachronic angle, i.e.
MNs as an emergent means of quantification.
Langacker's observations pertain to bi-nominal MN phrases such as a
bunch of carrots, a bucket of water and a lot of sharks. He observes that the
nouns which appear as heads constitute a diverse and open-ended class. [italics
LB] (Langacker 1991: 88). MNs are by default attributed head status, despite the
ambivalent semantics of appear. He continues by remarking that
[s]ome of these nouns still have an interpretation in which they
designate a physical, spatially-continuous entity that either serves as a
container for some portion of the mass (bucket, cup, barrel, crate, jar,
tub, vat, keg, box) or else is constituted of some such portion (bunch,
pile, heap, loaf, sprig, head, stack, flock, herd). [italics LB]
(Langacker 1991: 88)
In addition, most of these nouns have developed a more figurative sense. Such
metonymic extensions are possible because the above MNs all incorporate a
conception of their typical size, which is part of their encyclopedic
characterization. In the extended senses the physical entity designated by the
MNs has become secondary to the size specification provided by the noun: "For
instance, a bathtub may contain a bucket of water without there being any bucket
in it - it is only implied that the water would fill a bucket were it placed in one"
(Langacker 1991: 88). Or in other words (Langacker 1991: 88-89), "The notion
of a discrete physical object has faded, leaving behind the conception of a
schematically characterized mass (the mass that, in the original sense, either fills
or constitutes the object)." When a noun is interpreted in this way, it can,
according to Langacker, be regarded as a quantifier.
He thus notices a diachronic process of bleaching of lexical meaning in
certain MNs which may eventually lead to a reassessment of the structural status
of the MN, viz. from head to quantifier. As we will see in the following section,
such observations can be easily rephrased by using grammaticalization
terminology. Langacker then concludes his discussion of MNs by stating:
A further step in this evolutionary sequence would be for the second
noun to be reanalyzed as the head, leaving the remainder as a complex
quantifier: [[a lot of]QNT [sharks]N]NML. I leave open the question of
whether this reanalysis has actually occurred. (Id.: 89)
The aim of the corpus study reported on in this paper is to provide some answers
to both proposed re-analyses, viz. has N1 shifted from head to quantifier and N2
from postmodifier to head?
In conclusion to Langacker's account, we can say that it is interesting that
he notes a diachronic shift with regard to the structural status of MN from head to
quantifier, instead of the mere synchronic ambivalence proposed in mainstream
grammars or the simultaneous layers in Halliday's analysis. Langacker also
suggests that this grammatical re-analysis is paralleled by lexical extension and
desemanticization of the MN. He therefore does accommodate the dynamic
aspect of MN constructions by working with two distinct diachronic stages in the
structural development of MN constructions.
2.3 Grammaticalization: diachronic and synchronic
The framework which seems most suitable for tackling the specific developments
encountered in MN constructions is that of grammaticalization theory, which not
only does justice to these developments but also explains them.
Grammaticalization itself has been defined in several ways (e.g.
Haspelmath 1989, Fischer 1999 and Bybee 2000), but its essence is captured by
Lehmann's (1985) definition, which is appropriately general and consists of a
number of interesting parameters. Lehmann also distinguishes between
diachronic and synchronic grammaticalization, a distinction that will prove useful
for this corpus study. Lehmann defines both types of grammaticalization as
follows:
Under the diachronic aspect, grammaticalization is a process which
turns lexemes into grammatical formatives and makes grammatical
formatives still more grammatical (cf. Kurylowicz, 1965: 52). From a
synchronic point of view, grammaticalization provides a principle
according to which subcategories of a given grammatical category
may be ordered. (Lehmann 1985: 303)
The diachronic interpretation of grammaticalization nicely captures the
developments and fluctuation encountered in MN expressions. Looked at in
synchronic slices, they often appear to hover indecisively between the class of
quantifiers and head nouns as a consequence of a gradual move from lexical head
to quantifier.
The synchronic interpretation of grammaticalization, then, can account for
the fact that not all MNs have come to function as a quantifier to the same extent,
i.e. there are individual differences with respect to the degree of their respective
grammaticalization. It also invites us to draw up a scale of grammaticalization
along which the various MNs looked at are positioned.
Lehmann proposes six parameters which define grammaticalization:
attrition, paradigmaticization, obligatorification, condensation, coalescence and
fixation. The first three pertain to paradigmatic aspects, the last three to
syntagmatic aspects of the grammaticalizing item or string of items. Of these
parameters, two clearly apply to MN constructions, viz. coalescence and semantic
attrition. It is these two I will focus on in this article.
Coalescence is a syntactic criterion and concerns an increase in
bondedness or syntactic cohesion of the elements that are in the process of
grammaticalizing, i.e. what were formerly individually autonomous signs become
more dependent on each other to the extent that they are increasingly interpreted
as together constituting one chunk, which as a whole expresses a (grammatical)
meaning (cf. Bybee 2000: 27).
The other relevant parameter, semantic attrition, is often referred to as
delexicalization or loss of lexical content, and is commonly mentioned in
grammaticalization studies as a symptom of grammaticalization processes.4
However, as Kurtböke (2001) correctly points out, one should be careful not to
use delexicalization as a mere synonym for grammaticalization.
Both concepts will be operationalized for the present corpus study in the
following way. Delexicalization will be identified in terms of a gradual
broadening of collocational scatter or a loosening of the collocational
requirements imposed by the MN via such semantico-pragmatic processes as
metaphorization, metonymization, analogy, etc. Grammaticalization, on the other
hand, will be restricted to the actual grammatical re-analysis of a MN as a
quantifier. In this particular study, delexicalization processes typically precede
the re-interpretation of the MN as a quantifier. Delexicalization semantically
paves the way, so to speak, for grammaticalization (Kurtböke 2001). To this
extent, these grammaticalization processes can be said to be largely
semantically-driven (cf. Traugott 1988). In addition, phonetic and pragmatic
factors come into
play. Nevertheless, both concepts tend to remain intertwined because, in many
cases, lexical and grammatical status are difficult to tease apart.
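One way to picture the notion of collocational broadening is as the share of N2 collocates that fall outside the class of concrete nouns. The sketch below is purely illustrative and is not the measure used in the study: the semantic labels are a hand annotation assumed for the example, the few collocates are taken from examples quoted in this paper, and the names SAMPLE and scatter are hypothetical.

```python
from collections import Counter

# (MN expression, N2 collocate, semantic class).
# The class labels are an illustrative hand annotation, not corpus annotation.
SAMPLE = [
    ("heaps of", "clinker", "concrete"),
    ("heaps of", "them", "human"),
    ("heaps of", "atmosphere", "abstract"),
    ("heaps of", "fun", "abstract"),
    ("piles of", "hay", "concrete"),
    ("piles of", "crutches", "concrete"),
    ("piles of", "runs", "abstract"),
    ("piles of", "property details", "concrete"),
]


def scatter(rows):
    """Return, per MN expression, the proportion of non-concrete N2 collocates."""
    totals, extended = Counter(), Counter()
    for mn, _n2, sem_class in rows:
        totals[mn] += 1
        if sem_class != "concrete":
            extended[mn] += 1
    return {mn: extended[mn] / totals[mn] for mn in totals}


print(scatter(SAMPLE))
```

With so few items the proportions carry no statistical weight; the point is only that a higher share of human and abstract collocates corresponds to looser collocational requirements.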
3. Corpus study of MN expressions
For this corpus study heap(s) of, pile(s) of, bunch(es) of and load(s) of were
extracted from the Cobuild Corpus, The Bank of English. In each case the plural
and singular variants of the MNs were regarded as distinct expressions. The
corpus data were analyzed as either head or quantifier, or vague. The last
category subsumes those MN uses which activate both head and quantifier
features. Typically, it contains expressive stretches of discourse in which both the
lexical meaning and the quantificational potential of the MN are exploited.
On the basis of these quantified data, differences in the degree of
(synchronic) grammaticalization between the various MNs can be studied, which
will be the main focus in the discussion of the corpus results.
The relative frequencies of quantifier, head or vague uses of the various
MNs examined are represented in the following tables, which also contain
examples of the respective categories. Adjectives or nouns premodifying MN or
N2 are underlined, as well as verbs or other elements of structure that serve as
important clues for either head or quantifier status.
Bunch of      Tokens      %
Head              27   11.4
Quantifier       209   88.6
Total            236    100

Examples
Head:
(13) All stopped a moment when Linda, in clothes of mourning, bearing a little
bunch of roses, comes through the draped doorway into the kitchen.
Quantifier:
(14) The ideologies might be different, but you're all a bunch of lying,
treacherous bastards when it comes down to it.
(15) Traditional advertising pictures are a bunch of lies.
(16) Russia and America were just a bunch of enthusiastic and very fit guys who
ran around for 80 minutes without much method.
Bunches of         Tokens      %
Head                   47   97.9
Quantifier/vague        1    2.1
Total                  48    100

Examples
Head:
(17) Just after 5pm two bunches of flowers were delivered with a card saying
simply: to the neighbour next door.
Quantifier/vague:
(18) I had to listen to those two bunches of spoiled, pampered brats whinge
their brains out about which is the hardest done by.
Heap of       Tokens      %
Head              58   55.2
Quantifier        41   39.1
Vague              6    5.7
Total            105    100

Examples
Head:
(19) My first impression was not that it was an earthquake, said Heinz
Hermanns, standing by a heap of bricks that had fallen from his 100-year-old
house.
Quantifier:
(20) They went through my bags, searched me and asked me a heap of questions.
Vague:
(21) That deadly, winking snuggling chromium-plated, scent-impregnated,
luminous, quivering, giggling, fruit-flavoured, mincing, ice-covered heap of
motherlove [said about Liberace]
(22) The British have forged a fine tradition of gardening and cannot afford to
sit on their well-clipped laurels. Striding past the compost heap of nostalgia,
comes Christopher Lloyd.
(23) He test-fired a dozen of Hellfire missiles at a fleet of old Saudi school
buses, reducing the vehicles to a heap of springs and blackened chassis.
Heaps of      Tokens      %
Head              29   32.2
Quantifier        59   65.6
Vague              2    2.2
Total             90    100

Examples
Head:
(24) Pulham, scion of the Portland cement family, experimented and perfected in
the 1840s the art of using liquid cement poured over heaps of clinker to make
rock formations
Quantifier:
(25) What's interesting is how many sexual researchers and observers were
driven by self-interest? Heaps of them at least.
(26) The graphics are very polished, with pitch detail, markings and the like
to add heaps of atmosphere.
Vague:
(27) Many other viruses are highly malignant reducing the priceless words of
your PhD thesis to amorphous heaps of molten letters.
Pile of       Tokens      %
Head             176   88.4
Quantifier         6    3.1
Vague             17    8.5
Total            199    100

Examples
Head:
(28) A jilted girlfriend got revenge on the boyfriend who dumped her by dumping
a foot-high pile of manure in his bed.
Quantifier:
(29) I can just see a whole pile of the boys walking out after the final and
saying bye bye.
(30) It [i.e. a performance] was the biggest pile of want ever.
Vague:
(31) 6 CAMMY MURRAY: Recovered well after a nervous start and put in a pile of
strong defensive work. ((paper)work-metonymy; here more or less Quantifier)
(32) If you go and have a look next door there's a great pile of work that
builds up. (paperwork metonymy; leans more towards the Head category here)
Piles of      Tokens      %
Head             159     93
Quantifier         8    4.7
Vague              4    2.3
Total            171    100

Examples
Head:
(33) There was no memory of summer but the little sad piles of hay that rotted
in the fields.
Quantifier:
(34) Mike Atherton has been warned he must score piles of runs for Lancashire
to keep his England test place.
Vague:
(35) Leshan emphasizes a remark by G.B. Shaw that Lourdes is the most
blasphemous place on the face of earth: mountains of wheelchairs and piles of
crutches exist, but not a single wooden leg, glass eye, toupe[e]!.
(36) The real fun begins when you start receiving piles of property details.
(paperwork-metonymy)
The percentages indicated in the tables immediately point out that there are
differences between the various MNs in terms of their degree of
grammaticalization or quantifier potential. In order to represent visually the
degrees to which the MNs have grammaticalized, they are set out on a scale of
synchronic grammaticalization in Figure 1 (see Section 2.3). The percentages for
lot of and lots of were also obtained through analysis of Cobuild data. They are
included in the cline as precursors in the present MN-developments.
Load of       Tokens      %
Head              33   19.3
Quantifier       132   77.2
Vague              6    3.5
Total            171    100

Examples
Head:
(37) TODAY flew in a helicopter-load of supplies to the service station.
Quantifier:
(38) I just think that's a load of nonsense.
(39) When our image first went out of control, we played to a load of
skinheads.
(40) What are you going to write about when you get success and a load of
money?
Vague:
(41) I always go to Sainsbury's and I always go to the Cake Place and buy a
load of cakes. (trolley-load/lots of)
Loads of      Tokens      %
Head              15    7.2
Quantifier       193   92.8
Total            208    100

Examples
Head:
(42) Six plane-loads of food are also being flown today to the city of Baidoa.
Quantifier:
(43) Around Christmas time I was in the British home Stores and I tried on
loads and loads of dresses.
(44) I've applied for loads of jobs including one in a flowershop, but they
wanted a woman.
(45) You've got loads of people that you can have conversation with, but not
many people that you can have communication with.
(46) It has tons of variety, smart graphics and loads of action.
0% ------------------------------ 50% ------------------------------ 100%

bunches of        2.1%
pile of           3.0%
piles of          4.7%
heap of          41.0%
heaps of         65.6%
load of          77.2%
bunch of         88.6%
loads of         92.8%
lot of, lots of  99.8% (upper end of the scale)

Figure 1. Scale of synchronic grammaticalization
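The ordering on the scale can be recomputed from the token counts given in the tables. The following sketch is an illustration, not the author's own procedure: the names COUNTS, quantifier_share and cline are hypothetical, and the single "Quantifier/vague" token for bunches of is counted as a quantifier use, which is an assumption of the sketch. Lot of and lots of are left out because no token counts are given for them.

```python
# Token counts (head, quantifier, vague) copied from the tables above.
# Assumption: the one "Quantifier/vague" token for "bunches of" is counted
# as a quantifier use.
COUNTS = {
    "bunch of": (27, 209, 0),
    "bunches of": (47, 1, 0),
    "heap of": (58, 41, 6),
    "heaps of": (29, 59, 2),
    "pile of": (176, 6, 17),
    "piles of": (159, 8, 4),
    "load of": (33, 132, 6),
    "loads of": (15, 193, 0),
}


def quantifier_share(head: int, quantifier: int, vague: int) -> float:
    """Percentage of all tokens that were analysed as quantifier uses."""
    total = head + quantifier + vague
    return 100 * quantifier / total


def cline(counts):
    """Order the MN expressions from least to most grammaticalized."""
    shares = {mn: quantifier_share(*c) for mn, c in counts.items()}
    return sorted(shares.items(), key=lambda item: item[1])


for mn, share in cline(COUNTS):
    print(f"{mn:<12}{share:5.1f}%")
```

Shares recomputed in this way may differ slightly from the rounded percentages printed in the tables and in Figure 1.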

Loads of, bunch of, load of and heaps of have all grammaticalized strongly, while
piles of, pile of and bunches of have hardly grammaticalized at all.
Two main factors seem to play a part in these observed differences in
degree of grammaticalization, viz. dissimilarity in the degree of delexicalization
and collocational broadening, and differences in expressive value of the MNs,
which also involves a phonetic factor.
Considering the fact that the grammaticalization processes of MNs are
largely semantically driven (Section 2.3), it seems only natural that differences in
grammaticalization level can be explained by differences in the preliminary
delexicalization processes. Differences in quantifier potential between the
semantically related heap and pile, for example, have to be put down to
differences in delexicalization potential between the two MNs. These differences
are dependent on certain lexico-semantic properties inherent in the concepts of
pile and heap, which are resistant and conducive to semantic generalization
respectively.
The blocking factor in pile is the feature of verticality and constructional
solidity it calls up. These semantic features are, so to speak, too specific to bleach
into a mere quantity meaning. The concept of heap on the other hand is in itself
more vague and simply profiles an undifferentiated mass, from which it is much
easier to detach a mere quantifier meaning. The lack of delexicalization potential
in pile is matched by a very restricted collocational extension, mainly limited to
prototypically stackable concrete nouns like rubble, paper, bricks etc. Heap(s) of,
on the other hand, has loosened its collocational requirements systematically. In
addition to the prototypically stackable nouns it combines with when used as
head, it has extended to concrete nouns irrespective of their semantics, to human
nouns (e.g. (25)) and abstract nouns (e.g. (26)). Heaps of has hence developed a
systematic quantifier use which is more or less devoid of its original lexical
semantics.
The non-head/quantifier uses of pile of and piles of distinguished in the
tables above are all restricted to very specific contexts, as in (34), highly
expressive stretches of discourse, as in (30) and (35), or dependent on metonymy,
as in (31) and (36) for example. Compare the following two MN nominals, which
alternatively have heaps and piles combined with the same set of nouns; the
collocational restrictions on the quantifier use of piles of are immediately
obvious:
piles of stones/paper/people
heaps of stones/paper/people
Piles of evokes a vertical, layered constellation with all three nouns, which
renders the combination piles of people highly marked. By contrast, with heaps a
mere quantifier reading is at least as unmarked as a literal interpretation for all
three nouns; in the case of people the quantifier reading is the most natural one.
Bunch of is another MN of which the high level of grammaticalization can
be explained by a process of extensive collocational broadening. As opposed to
pile or heap, the delexicalization process of bunch has a readily identifiable first
stage in which it designates a very specific cluster-like constellation with an
accordingly restricted set of collocates in the N2 slot, e.g. grapes, flowers,
carrots. Gradually, the specific cluster meaning starts to bleach and the
collocational scatter broadens to include concrete nouns beyond that limited set,
as well as abstract nouns and human nouns, which are in fact the predominant N2
type when bunch functions as a quantifier. The following table represents the
various extensions.
+Inanimate plural count noun
(47) There's now a whole bunch of studies from different cities that show the
same thing.
(48) Traditional advertising pictures are a bunch of lies.
(49) Ned wanted to give me a bunch of suits.
+Uncount noun
(50) Trouble was, the funds were able to neatly hide all but the most
conspicuous of their charges in a bunch of legalese.
(51) We started in May and did a bunch of practising.
(52) I spent a bunch of time, when I was visiting the county, talking to his
neighbours.
+Human/animate plural count noun
(53) Who said Americans were a jingoistic bunch of rednecks who know or care
nothing about what happens beyond their shores?
(54) Deng was pictured taking a dip with a bunch of his beaming buddies at a
summer resort in the north of the country.
(55) Russia and America were just a bunch of enthusiastic and very fit guys
who ran around for 80 minutes without much method.
(56) We guarantee the noble young lord will complain about having to spend
time with such a boring bunch of geriatrics.
Bunch of, just like loads of and heaps of, also illustrates the expressivity and
affective values grammaticalizing MNs often develop, at times leading to new
patterns of collocational consolidation. Especially when used with human nouns,
bunch of goes beyond merely quantifying N2 and also qualifies it, usually
negatively. This qualitative function can be reinforced by the addition of
qualitative adjectives premodifying the MN, as in (53) and (56).
The negative qualification expressed by bunch is best described as
negative semantic prosody, as defined by Sinclair (1992) and Bublitz (1996): a
negative, or occasionally positive, semantic aura spreading from node to
collocate. Bunch radiates a specific halo, it prospects ahead and sets the
scene (Sinclair 1992: 8) for a particular type of subsequent item (Bublitz 1996:
11). This strong predictive power with regard to N2 can create new collocational
requirements and idiom-like patterns of collocational consolidation, as in (48),
(50), (53) and (56), both with human nouns and abstract nouns. It is only with
regard to nouns such as guys, lads, kids, etc. that bunch of radiates a positive
semantic prosody. In such expressions as a bunch of guys/lads/etc. there is the
additional suggestion of bondedness, of a close-knit group of amicable people.
This can be seen as a metaphorical revival of the original cluster semantics.
The specific qualitative meaning of bunch of brings us to the second
important factor motivating the grammaticalization of MNs, viz. the expressive
value they can acquire. As a means of quantification heaps of and loads of are
very hyperbolic in nature, which can be stressed by repeating them as in (43).
Differences in expressive value might also explain why the plural versions of
heap, pile and load have grammaticalized more strongly than the singular
variants. The plurality in terms of grammatical number adds to the hyperbolic
meaning it expresses as a quantifier. In addition, the intrinsic mass meaning of
plural nouns (Langacker 1991) likewise enlarges the magnitude already expressed
by the MN. Phonetically, these plurals contain a vowel that can easily be
lengthened, producing a similar effect of exaggerating the quantity of N2. In this
respect, the extensive grammaticalization of loads of might be enhanced by the
graphemic and phonetic resemblance to lots of, with the added bonus of a
strongly prolongable diphthong in front of a voiced consonant.
In the case of bunch, on the other hand, it is precisely the other way round:
the plural form displays a near-exclusive head use, while it is the singular form
which has a prevailing quantifier use. The resistance of bunches to
grammaticalize into a quantifier might be due to prosodic features which do not
lend themselves well to expressive use, such as the extra syllable the plural
morpheme gives rise to (p. c. Halliday). Grammaticalizing MNs, with their
typical blend of lexical and grammatical potential, thus satisfy the language
user's needs for a quantifier as well as the desire to be expressive.
4. Conclusion
The assessment of the structural status of MNs in MN constructions is complex
because of the subtle and often intricate interdependence of the MNs' lexical and
grammatical status. The observed structural fluctuation involves many more
dimensions than suggested by traditional descriptions. Lehmann's parameters for
grammaticalization proved essential to impose some order on what appears as
intractable material.
Grammaticalization, both the diachronic and synchronic interpretation,
allows one to reveal the patterns in empirical data. Grammaticalization of MNs
seems to involve two main motivating factors, which typically interlock. Firstly,
there is delexicalization and collocational broadening of the MN. In addition
there is a pragmatic factor, viz. the hyperbolic expressiveness MNs cater for as
quantifiers. Differences in the degree of grammaticalization are matched by
differences in the extent to which these two factors have come into play in the
various MNs.
Not only is there a synchronic dissimilarity in the extent to which the
various MNs have grammaticalized; each of the MNs individually displays a
layering (cf. Hopper and Traugott 1993) of lexical head uses and grammatical
quantifier uses, as well as a considerable number of transitional uses. Some
contextualized examples proved to be irreducible blends (Bolinger 1961) of
quantifier and head status. Our main descriptive research question has thus been
confirmed: bunch(es) of, heap(s) of, pile(s) of and load(s) of have developed a
quantifier use comparable to that of regular quantifiers. However, they still retain
the possibility of appearing as the lexical head noun of a nominal group.
The MNs looked at thus do constitute an emergent means of quantification
(cf. Langacker 1991). The observed structural fluctuation and layering
phenomena suggest that they are still very much quantifiers on the move. A
certain amount of lexicality is bound to cling to all MN quantifiers to some
extent. For pile in particular such lexical persistence (Hopper and Traugott 1993)
is at present very strong, whereas heaps of has already developed a systematic
quantifier use which is more or less oblivious to its original lexico-semantics.
Still, even when MNs have become highly grammaticalized, their lexical
semantics can still be exploited, alluded to or revived in various ways, e.g. They
employ lorry-loads of insincere flattery. Again the strong interpersonal
motivation behind MNs as a means of quantification comes to the fore, as well as
the importance of casual, informal registers.
Notes
1. I would like to thank all people at the 2nd Workshop of the Systemic
Functional Research Community (FWO - Fund for Scientific Research
Flanders grant no. WO.018.00N) in Leuven, 21-24 November 2001, as well
as those at ICAME 2002 for their much appreciated comments on earlier
versions of this paper.
2. These are just two of the many names they are commonly labelled with.
Others are quantifying nouns in Biber et al. (1999) and NP-like quantifiers
in Akmajian and Lehrer (1976).
3. All examples are extracted from the Cobuild Corpus, The Bank of English,
and reproduced here with the kind permission of HarperCollins.
4. This concept is alternatively referred to as semantic attrition,
desemanticization and demotivation (Lehmann 1985:307).
References
Akmajian, A. and A. Lehrer (1976), NP-like quantifiers and the problem of
determining the head of an NP, Linguistic Analysis 2: 395-413.
Bolinger, D.L. (1961), Syntactic blends and other matters, Language 37: 366-
381.
Bublitz, W. (1996), Semantic prosody and cohesive company: somewhat
predictable, Leuvense Bijdragen 85: 1-32.
Bybee, J. (2000), Cognitive processes in grammaticalization, to appear in M.
Tomasello (ed.), The new psychology of language, volume 2. New Jersey:
Lawrence Erlbaum.
Fischer, O. (1999), Grammaticalization: Unidirectional, nonreversible? The case
of to before the infinitive in English, Views 7: 5-24.
Arnold.
Haspelmath, M. (1989), From purposive to infinitive - A universal path of
grammaticalization, Folia Linguistica Historica 10: 287-310.
Hopper, P. and E. C. Traugott (1993), Grammaticalization. Cambridge:
Cambridge University Press.
Jespersen, O. (1970-74), A modern English grammar on historical principles. 7
vols. London: Allen and Unwin; Copenhagen: Ejnar Munksgaard.
Kruisinga, E. (1925), A handbook of present-day English, 7th ed. Utrecht:
Kemink en Zoon.
Kurylowicz, J. (1965), The evolution of grammatical categories, Diogenes 51:
55-71.
Kurtböke, P. (2001), YAP- and OL- as delexical nominalising devices in
diaspora Turkish. Paper delivered at the University of Leuven, March 15,
2001.
Langacker, R.W. (1991), Foundations of cognitive grammar. Volume 2:
Descriptive application. Stanford: Stanford University Press.
Lehmann, C. (1985), Grammaticalization: synchronic variation and diachronic
change, Lingua e Stile 20: 303-318.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London & New York: Longman.
Sinclair, J. (1992), Corpus, concordance, collocation. Oxford: Oxford University
Press.
Traugott, E. C. (1988), Pragmatic strengthening and grammaticalization,
Proceedings of the 14th Annual Meeting of the Berkeley Linguistics
Society. 406-416.
Yourself: a general-purpose emphatic-reflexive?
Göran Kjellmer
Göteborg University
Abstract
In general, the reference of English personal pronouns has been relatively stable
over the centuries: I (and its forerunners) can normally be taken to refer to the
first person singular, and so on. If this is the general picture, it is necessary to
add some qualifications, most of them of a minor kind. For instance, I is
sometimes used to refer to the second person singular (I shouldn't disturb him
at this time of night), we is sometimes used with reference to the first person
singular, the authorial and the royal we (We are not amused), to the
second person singular (How are we today?) and with general reference (We
should not underestimate the defence of honour), and they can also be used
with general reference (They say that ill weeds grow apace). However, you
and its reflexive-emphatic correspondence yourself stand out in this respect and
differ from their pronominal cousins, both in that their referential changes have
been more generally remarkable over the centuries, and in that such changes are
still in progress. This paper will attempt to chart some of those changes in
modern English with the help of large modern corpora.
1. Introduction
As speakers of English we tend to look upon the pronominal system as something
more or less established and invariable, if indeed we bother to think about it at
all. But there are occasions when this happy mood is broken. When an English
professor addressing his seminar audience says
(1) You can see for yourself that ...
and when an English boy accused of repeating himself replies
(2) I ain't repeating yourself
one may well receive a jolt. Our traditional grammars would have us expect the
professor to say see for yourselves and the boy to say repeating myself. A
natural question then is if such linguistic events are due to accidental slips, and
hence of no great interest, or if something is happening to the use of yourself. I
would suggest that among the bewildering mass of uses that yourself can be put
to an interesting pattern can be seen to emerge.
2. Development of you
In order to look into the matter of yourself we shall use material from the
CobuildDirect and the BNC corpora. It is indeed one of the great advantages of
corpora that they can provide us with material that has not yet become
established as mainstream varieties. But before discussing yourself, let us first
consider very briefly the development of the closely related pronoun you. You,
starting out as an object form of the Old English second person plural personal
pronoun , came in late Middle English and Early Modern English times to be
used as a plural subject form and, about the same time, as a singular pronoun,
used both as a subject and an object form (OED You I-II). In Early Modern
English a secondary use developed, 'Denoting any hearer or reader; hence as an
indef. pers. pron.: One, any one' (OED You III:6). From Early Modern English
times onwards you can thus be used in the same way as today, in the plural and
singular and with general reference. Modern examples, from the Cobuild Corpus,
are:
2nd pl. (OED you I):

(3) I can see great things for you, kids, I think your troubles are over.
(Cobuild: usbooks/09. Text: B9000001423)
(4) The great thing about meeting someone through an agency is that you find
a lot about them the first time you meet.
(Cobuild: ukmags/03. Text: N0000000375)
2nd sg. (OED you II):

(5) 'Honk all you like, baby. I hope it makes you happy, you little redneck.'
Generic (OED you III:6):

(6) Where they lived in it was in near Christchurch as you sort of you had to
go you know maybe five miles before you saw anyone and in winter they
were just cut off anyway. So they never saw a soul.
(Cobuild: ukspok/04. Text: S9000000254)
Out of the generic use there has developed a more specific use where you mostly,
but not always, means I or we (and which is not in the OED; cf. Quirk et al.
1985:354):
Generic > specific:

(7) When I made the booking I explained that the trip was for shopping, but
the tickets arrived with a booklet listing that particular weekend as a
public holiday in France. Now Going Places wants £90 to change the date.
<p> Going Places Direct showed no compassion when you explained your
problem and insisted that you pay a £90 re-booking fee (you = I)
(Cobuild: times/10. Text: N2000951104)
(8) 'There's another one in the back as well' Mr Giggins added: 'For all the
world it looked as though there were people asleep in the car although
when you looked again you realised they had been shot' (you = I/we)
(Cobuild: times/10. Text: N2000951208)
(9) but I shouldn't think it's probably all that much different <F01> Mm.
<F02> except we used to finish off putting chairs on the tables hands
together and eyes <tc text=laughs> closed you know before you went
home every night. (you = we)
(10) Balancing the lust for a story against the demands of self-preservation,
conquering your own fear and crawling that extra exclusive maggot-
infested mile before remembering you were a mother with responsibilities
back home. Home. It was time to call her husband. Her nervousness, for
which she had no explanation - or, at least, none she could remember -
came flooding back. (you = she)
(Cobuild: ukbooks/08. Text: B0000001117)
(10) is probably an example of free indirect speech.
3. History of yourself
As for yourself, its early history is partly dependent on that of you, as could be
expected. The Middle English plural ȝe ȝou selve(n) became ȝour(e) self(e) in the
early part of the fourteenth century, and like you the latter form came to be used
with singular reference in late Middle English and Early Modern English (OED
yourself II, originally as an honorific plural). And then towards the end of the
fifteenth century the present s-plural ourselves, yourselves came into existence
and eventually became the standard forms (Wright and Wright 1924: 323; see
also Visser 1962-73 I 455). 'The forms with -selves are [...] the normal plural
usage by the middle of the sixteenth century' (Barber 1997: 159). So the form
yourselves gradually becomes the standard one for use in the plural. If yourself,
on the other hand, was thus originally a plural form, as in
(11) All the wise how it was ye wetyn your selfe.

(c1400: OED Yourself I1: obsolete)
its standard modern use is as a singular reflexive form (OED Yourself II: 6), as in
(12) Now you never thought of yourself as a fan. You were a journalist
covering sports.
(Cobuild: npr/07. Text: S2000901019)
or as a singular emphatic form (OED Yourself II: 3), as in
(13) Vu: You used to molest other kids yourself? <p> Mary: Mm-hmm.
(Cobuild: npr/07. Text: S2000911102)1
This, then, is the traditional view of modern you and yourself/yourselves, as
presented in the standard grammars: you is the second person singular and plural
personal pronoun, yourself is the second person singular and yourselves the
second person plural reflexive pronoun (Quirk et al. 1985: 346, Biber et al. 1999:
328). But in order to understand the occurrence of examples like (1) and (2), I
suggest we follow an admittedly hypothetical line of development of modern
yourself. Such a development would imply an ongoing extension of its semantic
range, and consequently an increasing lack of precision.
4. Development of modern yourself
Let us start with the standard use of yourself, where it refers to a singular
addressee:
(14) it's exciting for a young man like yourself ...

(Cobuild: npr/07. Text: S2000911214)
As we saw, you can refer to one or several addressees, and frequently it is
difficult or impossible for the listener or reader to decide which is meant.2 The
same thing then applies to yourself. The number indeterminacy of you spills over
on to yourself by analogy, so that the latter can be used in situations where the
speaker may have a plural addressee in mind. In cases like the following, there
could be one addressee or several:
(15) Treat yourself to a Maltese odyssey

(Cobuild: today/11. Text: N6000940101)
(16) Before buying a single share of stock, force yourself to answer one
question: are you reasonably sure that you can keep your money invested
for 7 to 10 years?
(17) If you have just spent £329,000 on a red Ferrari F50 then why not treat
yourself to the perfect number plate?
(Cobuild: times/10. Text: N2000960217)
How then are we to know whether, and how often, yourself in fact refers to a
number of addressees? It is difficult to answer that question as, just as in the case of
you, the speaker or writer may not always have made a distinction between
singular and plural but may be addressing himself indifferently to an audience of
one or several. The context is often of little or no help. However, by an indirect
route we might get an idea of the size of the phenomenon. The reflexives myself,
himself, herself, itself have plural correspondences, ourselves and themselves. If
we assume that the relation between reflexive singulars and plurals is very
approximately constant throughout the system, we can investigate the matter in a
corpus like Cobuild and draw our conclusions. The figures are shown in Table 1.
Table 1. Reflexive singulars and plurals in the CobuildDirect corpus

Formally singular        Formally plural        % formally plural
myself       7311        ourselves     2798     27.7%
himself     14815        themselves   10636     27.4%
herself      5525
itself       7894
yourself     6758        yourselves     289      4.1%
The discrepancy between 27-28% and 4% suggests that a great number of the
yourself instances have plural reference.
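The percentages in Table 1 follow from simple arithmetic, which can be sketched as below. The sketch assumes, as the layout of the table suggests, that themselves is set against himself, herself and itself taken together. The grouping labels, the function name and the closing estimate are illustrative assumptions and not part of the original study. The estimate merely spells out the constancy assumption: if second person reflexives had the plural share found in the other persons, how many yourself tokens would have to count as plural?

```python
# Token counts copied from Table 1 (CobuildDirect corpus).
# Grouping by person is an assumption made for this illustration.
COUNTS = {
    "first person": {"singular": {"myself": 7311},
                     "plural": {"ourselves": 2798}},
    "third person": {"singular": {"himself": 14815, "herself": 5525, "itself": 7894},
                     "plural": {"themselves": 10636}},
    "second person": {"singular": {"yourself": 6758},
                      "plural": {"yourselves": 289}},
}


def plural_share(group):
    """Return the formally plural forms as a percentage of all reflexives in the group."""
    singular = sum(group["singular"].values())
    plural = sum(group["plural"].values())
    return 100 * plural / (singular + plural)


shares = {person: plural_share(group) for person, group in COUNTS.items()}

# Hypothetical baseline: the mean share of the first and third person.
baseline = (shares["first person"] + shares["third person"]) / 2

# Under the constancy assumption, the expected number of plural second person
# reflexives, minus the attested yourselves tokens, gives a rough count of
# yourself tokens with plural reference.
second_total = 6758 + 289
estimated_plural_yourself = second_total * baseline / 100 - 289

for person, share in shares.items():
    print(f"{person}: {share:.1f}% formally plural")
print(f"estimated plural-reference yourself tokens: {estimated_plural_yourself:.0f}")
```

On these figures something like a quarter of the yourself tokens would have plural reference, but the estimate is only as reliable as the constancy assumption itself.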
When yourself can be interpreted as referring to plural addressees, as in
(15) - (17), one further step in its development follows naturally, viz. that when
yourself unambiguously refers to plurals, and plurals only. This step constitutes a
break with traditional descriptions of the word; it is not described in our standard
grammars. Sentence (1) is one example, and some further examples follow.
(18) 'Ladies and gentlemen,' Francie announced suddenly appearing brightly.
'Our resident antiques expert will be having his break now, for twenty
minutes only. Until resumption, please avail yourself of the fairground's
refreshments at reasonable prices ...' The queue groaned.
(19) Well can you sort that out amongst yourself and then after you've done
that then present it to the February sales meeting
(BNC: JN6 142)
(20) If come Valentine's Day you girls found yourself still manless after
deploying every known method to hook that rare breed of muscle, there
was only one place to be.
(21) Coffees are ordered. Do you all consider yourself to be Botards?
(22) I have some good news for those of you who didn't manage to pull
yourself together enough to get tickets to Creamfields
(Cobuild: sunnow/17. Text: N9119980502)
(23) Prologue Oedipus:

My children, generations of the living
In the line of Kadmos, nursed at his ancient hearth:
Why have you strewn yourself before these altars
In supplication, with your boughs and garlands?
(24) Make sure you're in different groups. Okay. ---
One, two, three, so we separate yourself into different groups.
(BNC: KPV 514)
One can see the process in operation whereby yourself is supplanting yourselves
in examples like the following, where the speaker is hesitating between the two
forms and deciding on yourself:
(25) So what subjects did you take then at er <ZF1> S<ZF0> School
Certificate? <ZF1> What what <ZF0> what were your pushing yourselves
to yourself towards?
As suggested above, analogy with you is probably at work here. There is also a
slim chance that a few instances of plural yourself, labelled by the OED as
obsolete,3 are a deliberate continuation of the Middle English plural and hence
imitative of Middle English usage. This may be the case in an example like (23),
where the tone is solemn and somewhat archaic.
Examples like (18) - (24) above, where yourself is used with direct
reference to several addressees, are frequent enough in the corpora. (It is hardly
possible to give statistics, because yourself is a very frequent word,4 and evidence
of the number of addressees, if it occurs at all, may occur anywhere in the
context, often at some distance from yourself.) On the other hand, a further step
in the development of the word, where it is still plural but no longer limited to the
second person, is not recorded as frequently. This step could be represented by
cases like
(26) When I went to that stress management course we were told to use
physical resources like deep breathing and actually making yourself sit
down and making yourself go floppy. and let every muscle let it relax.
(BNC: KBF 8025)
(27) Fiona Me and, did you see me and Sarah [at the show] ...
Jessica No. No, cos we were sitting down down by yourself
(BNC: KBL 2998)
(28) We have to think yourself !

(BNC: KE0 859 )
This usage is clearly colloquial and scarcely acceptable in the standard language.
The shifts in the usage of yourself that we have seen so far represent a
widening of its sphere of application, from reference to second person singular to
reference to second person singular and plural, and from there, in addition, to
reference to other plurals. It has, in other words, become more general in its
application. By a slightly different route it concurrently acquires a generic sense,
as we shall now see.
When yourself, in the wake of you, was used to refer to singular and plural
addressees indifferently, the semantic distinction between what might be called
specific addressing, where you means e.g. you, Benjamin (You should avail
yourself of this opportunity) and general addressing, where you means one
(When you are young, without a job, ... it is your passions that often define
you) became blurred, particularly in general contexts. Ever since late Middle
English times English has lacked a distinctive generic pronoun, corresponding to
French on and German man,5 but you (and one) have come to fill that place.
Consequently yourself, too, could be used in a generic sense, as in the following
examples:
(29) Knowing how to present yourself # can really make or break you,
Charmaine said.
(Cobuild: oznews/01. Text: N5000950205)
(30) The role demands a lot of things. It demands subjecting yourself to

complete vulnerability.
(Cobuild: today/11. Text: N6000950602)
(31) Janet Parsons knows what it is to find yourself a victim of crime. Her
husband, Leslie was killed at the wheel of his lorry by two joyriders
racing each other.
(BNC: K1K 3765)
(32) The general sense of not being quite yourself

(BNC: BLW 1117)
This very clear step towards generality is also shown by the fact that yourself in
this sense can refer back to generic one:
(33) There's a danger that in a science course one concentrates purely on how
and why nature works, or in an engineering course one concerns yourself
only with how to apply and harness phenomena, not to understand
sufficiently the nature of the phenomena and what are the inherent
limitations.
(BNC: KRW 36)
(34) one is to do it yourself

One step in the development of yourself remains to be discussed. As we
saw in (7)-(10), you is sometimes used in a generic sense although, paradoxically,
it has specific reference. This can at least initially be due to modesty on the part
of the speaker and/or to a wish not to take personal responsibility for the matter
presented, as you mostly stands for I or we. In the same way, yourself can then be
used in a seemingly general way but with clear reference to one or more persons,
mostly I or we:
(35) I'd have loosened my tie, but they had taken it away along with my wallet,
gun, belt and shoelaces. I wondered how easy it would be to hang yourself
with your shoelaces.
(BNC: GVL 1718)
The general phrasing refers to the speaker's specific problem, but both the
general and the specific meaning of yourself are part of the full meaning of the
sentence. The relevant part means both 'to hang oneself with one's shoelaces'
and 'to hang myself with my shoelaces'. This type of usage can be seen as a
transition to the final stage, that where the reference of yourself is exclusively
specific (and not always I or we, as in (39)). Some examples are:
(36) Peter Look, you've been repeating yourself again.
Kevin Yeah, so are you.
---
Peter I di-- , I ain't repeating yourself.
Kevin Did, you did. You did!
Peter I ain't repeating yourself.
(BNC: KSP 256)
(37) I know I, er in the past when I've felt myself going off to sleep in those
situations, I've been pinching myself and, and really making yourself do
something rather than just sitting there doing nothing, - - - we've read and
heard about people that have gone to sleep on motorways haven't they?
(BNC: KBX 687)
(38) Ten-year-old Trevor Kachel, of Belgrave Road, said: 'I like boxing
because it means I can defend yourself if you ever needed to.'
(BNC: K52 6141)
(39) Pete's gone down to the shop and got yourself a bottle whisky.
(BNC: KCT 7304)
As the contexts make clear, these sentences do not mean '... repeating you', '...
making you', etc., and they could not mean '... repeating oneself', '... making
oneself', etc. Yourself is clearly specific here.6
The different types of usage that have been presented above could of
course be described as related in several different ways, none of which is
necessarily the correct one. If they are set out as suggested here, the stages in
the development of yourself can be seen as implicational in Figure 1:
This means, for instance, that those who use yourself to refer to the second
person plural (d) will also use it to refer to the second person singular and plural
indifferently (c), but not necessarily to other plurals (e).
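The implicational reading of Figure 1 can be made concrete with a small sketch. It treats each type of usage as implying all the types it developed from, and leaves out the obsolete Middle English plural at the top of the figure. The stage names, the dictionary and the function are hypothetical labels introduced here for illustration, and the assumption that the final stage presupposes both branches of the figure is ours, not a claim made in the text.

```python
# Each usage type is mapped to the type(s) it is assumed to develop from,
# following the layout of Figure 1 (stage names are illustrative labels).
DEVELOPS_FROM = {
    "2nd singular": [],
    "2nd singular/plural": ["2nd singular"],
    "2nd plural": ["2nd singular/plural"],
    "other plurals": ["2nd plural"],
    "generic": ["2nd singular/plural"],
    "explicit generic one": ["generic"],
    # Assumption: the final stage presupposes both branches.
    "any subject": ["other plurals", "explicit generic one"],
}


def implied_uses(stage):
    """Return every earlier usage type implied by the given one."""
    implied = set()
    pending = [stage]
    while pending:
        current = pending.pop()
        for earlier in DEVELOPS_FROM[current]:
            if earlier not in implied:
                implied.add(earlier)
                pending.append(earlier)
    return implied


# A speaker with second person plural yourself is predicted to have the
# indifferent singular/plural use as well, but not necessarily other plurals.
print(sorted(implied_uses("2nd plural")))
```

On this reading, a speaker who says We have to think yourself is predicted also to accept yourself with second person plural reference, while the reverse does not hold.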
5. Conclusions
As we have seen, yourself has changed a good deal through the ages, with
striking results in some variety or varieties of the language. We need not assume,
however, that the development of yourself in the standard language will
inevitably follow suit. This is one line of development among several, in its later
phases very much a minority option. Nevertheless, it is an interesting option in
that it represents the phenomenon of 'pattern neatening', to borrow a phrase
from Jean Aitchison (1991). From being distributionally and semantically quite
different from its corresponding personal pronoun you, deviating in number as
well as type of reference, yourself has become a close reflexive-pronoun copy of
it by getting rid of constraining features in its later stages of development. In
those stages it would appear justifiable to regard yourself as a general-purpose
emphatic-reflexive pronoun.
Reference to 2nd plur (Ye weten your selfe)
Reference to 2nd sing (A young man like yourself)
Ref. to 2nd sing/plur (Treat yourself to a Volvo)
Ref. to 2nd plur (Separate yourself into groups) | Generic (The sense of not being quite yourself)
Ref. to other plurals (We have to think yourself) | Explicit ref. to gen. one (One concerns yourself with ...)
Ref. to any subject (I can defend yourself)
Figure 1. Types of usage with yourself
Notes
1. There is occasional ambiguity between the reflexive and the emphatic use, as
in
You gave yourself to the poor,
meaning either You dedicated yourself to the poor or You yourself gave
to the poor.
2. ... it is not always clear in present-day English whether the second person
pronoun refers to one or more people (Biber et al. 1999: 330).
3. Yourself I. In plural sense: now replaced by yourselves.
4. There are 6758 occurrences of yourself in Cobuild and 10587 in the BNC.
5. Old English man with that meaning developed into Middle English me and
became obsolete in late Middle English times.
6. A case like
'I shouldn't worry yourself, Dolly,' said Carrie, with apparent innocence
(BNC HHC 240)
is probably different, in that I shouldn't do that is often used to mean You
shouldn't do that; I shouldn't worry yourself then means You shouldn't
worry yourself.
References
Aitchison, J. (1991), Language change: progress or decay. 2nd ed. Cambridge
University Press.
Aston, G., and L. Burnard (1998), The BNC handbook. Edinburgh: Edinburgh
University Press.
Barber, C. (1997), Early Modern English. 2nd ed. Edinburgh: Edinburgh
University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Longman.
BNC = British National Corpus, see Aston and Burnard (1998).
CobuildDirect Corpus, cf. Sinclair (1987).
OED = Simpson, J. A., and E. S. C. Weiner (eds) (1989), The Oxford English
dictionary, 2nd ed. Oxford: Clarendon.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London & New York: Longman.
Sinclair, J. M. (ed.) (1987), Looking up. An account of the COBUILD project in
lexical computing. London and Glasgow: Collins.
Visser, F. Th. (1962-73). An historical syntax of the English language I-III.
Leiden: Brill.
Wright, J., and E. M. Wright (1924), An elementary historical new English
grammar. London, etc.: Oxford University Press.
Aspects of spoken vocabulary development in the Polytechnic of
Wales Corpus of Children's English
Clive Souter
University of Leeds
Abstract
The Polytechnic of Wales Corpus was collected in the late 1970s for the study of
syntactic and semantic development of native English-speaking children aged
between six and twelve. This paper demonstrates that interesting lexical
information can be gleaned from this corpus for EFL instructors and curriculum
designers, even though the size of the corpus (61,000 words) makes it too small
for dictionary development. The Corpus was organised to permit researchers to
observe changes across age groups, and differences between the sexes and
between children of different socio-economic backgrounds. Five investigations
illustrate:
• rate of vocabulary growth with age in this Corpus;
• the extent to which vocabulary is sex-specific;
• differences between sexes in the use of affirmatives and negatives, and in
  the use of male and female personal pronouns;
• the extent to which vocabulary size is related to socio-economic class;
• persistence of errors in applying regular verb endings to irregular verbs.
The Corpus does show active vocabulary size increasing with age, at a rate of
only around 50 words per year (in the limited activities used to elicit speech from
the children). Surprisingly, around half of the words used by each of the sexes are
limited to that sex. Boys make more use of positive expressions, whereas girls
make greater use of negatives. Both sexes use he far more than she. There is no
clear evidence that social class differences influence vocabulary size. Errors
caused by applying regular verb endings to irregular verbs seem to diminish in
children between ages six and eight, and have disappeared by age ten.
Although it is clear that data sparsity influences these results, they are still
useful (and thought-provoking) to curriculum developers and coursebook
designers in EFL, as well as researchers in sociolinguistics of child language.
1. Introduction
In this paper, I present some investigations into the development of children's
English spoken vocabulary between the ages of 6 and 12. I focus particularly on
the differences in vocabulary between the ages 6, 8, 10 and 12, between the two
sexes, and between socio-economic classes, since the corpus material has been
organised to permit this.
The motivation for such a study came from my belief that, until recently,
the Polytechnic of Wales (POW) Corpus had never been used for vocabulary
study. (It was originally collected for the study of children's syntactic and
semantic development.) This omission can perhaps be explained by the small size
of the corpus: only 61,000 words. Lexicographers building dictionaries of adult
vocabulary have had access to far larger English corpora, such as LOB and
Brown, and more recently the British National Corpus and the COBUILD/Bank
of English. For dictionary-building purposes, clearly the POW corpus is nothing
like large enough, and may have been overlooked for this reason alone. However,
it does have great value for researchers into child language development, TEFL
syllabus designers and course-book authors.
The POW Corpus is unique in containing children's spoken language,
organised clearly by age, sex and class, and in being richly syntactically
annotated. I hope to show that there are some interesting features to be uncovered
even in such a small corpus, by modern standards. Such features should hopefully
catch the attention of the designers of school syllabi for English language
learning. In many EU countries, there is pressure on the education system to
introduce foreign language learning earlier in the curriculum, at primary rather
than secondary school age. This is not without difficulty: there are few primary
school teachers trained to teach foreign languages. Space needs to be found in the
curriculum and working week of primary schools. An appropriate syllabus needs
to be designed to engage younger learners. Finally, the impact on the secondary
curriculum needs to be addressed, particularly if some children have been
introduced to a foreign language already, but others haven't. For this reason, a
team at the Freie Universität Berlin in Germany led by Dieter Mindt has also
recently been using the POW Corpus to assess which vocabulary and
grammatical items should be introduced to younger German learners of English,
and in what order. A paper describing their work was also presented by Norbert
Schlüter at the ICAME conference in May 2002 in Göteborg, Sweden.
2. Special value of spoken corpora for learners and teachers
Developers of language teaching materials and courses are increasingly making
use of corpus evidence. Such corpora may typically consist of native speaker
material, which is of course seen as the learner's target, but may still contain
errors. Additionally, corpus collections have been made of non-native learners'
language, such as for the ICLE project (Granger 1993, 1998) and ISLE project
(Menzel et al. 2000, Atwell et al. 2003), in which learner errors may be found.
From the aspect of young learners of English, native speaker spoken corpora such
as the POW corpus are particularly useful in that they can provide
• pronunciation examples
• intonation and prosody examples
• awareness of accents
• indications of lexical range including expressions and colloquialisms
• grammar of speech (false starts, ellipsis, repetitions, unfinished elements,
  interruptions)
• discourse and dialogue patterns
• production, lexical and grammatical errors/rarities in speech
• relationships between and frequency of these
This paper will deal primarily with lexical variations between types of speaker,
and illustrate some of the lexical errors produced by younger native speakers of
English.
3. The Polytechnic of Wales Corpus of Children's Spoken English
The POW Corpus was collected by Robin Fawcett and Mick Perkins in
1978-9, for the purpose of studying development of syntax and semantics in
children aged between 6 and 12. The corpus was carefully balanced for age, sex
and socio-economic class. In total, there were 96 child informants, subdivided by
age (within 3 months of 6, 8, 10 and 12 years old), sex (B, G) and class (A, B, C,
or D). Such a division resulted in 32 homogeneous groups of 3 children. Each
group was recorded in a play session (PS) performing a lego building task, and
each child was interviewed (I) separately by the same adult to discuss favourite
games, TV programmes etc.
The recordings were then transcribed orthographically, annotated
prosodically, and published in four volumes (Fawcett and Perkins 1980). A
machine-readable version of the corpus was produced in 1980 with full syntactic
analysis for each utterance, using Fawcett's Systemic Functional Grammar
(Fawcett 1981). This version omitted the prosodic annotation, and separated the
speech of each individual child into one text file. For example, the file 6ABICJ
contains the speech of a six-year-old, social class A boy in the interview situation,
whose initials are CJ. The corresponding utterances during the play session for
this individual are in the file 6ABPSCJ (but not those of his playmates). This is
beneficial for our present purpose, but does make analysis of dialogue difficult.
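The file-naming scheme described above can be decoded mechanically. The following Python sketch assumes that every file name follows the order age, class, sex, situation (I or PS) and initials, as in the two examples given; the function name parse_pow_filename and the field labels are illustrative and are not part of the corpus distribution.

```python
import re

# Assumed pattern: age (6, 8, 10, 12), class (A-D), sex (B/G),
# situation (I = interview, PS = play session), then the child's initials.
POW_NAME = re.compile(r"^(6|8|10|12)([A-D])([BG])(PS|I)([A-Z]+)$")

def parse_pow_filename(name):
    """Split a POW file name such as '6ABICJ' into its components."""
    match = POW_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a recognised POW file name: {name!r}")
    age, social_class, sex, situation, initials = match.groups()
    return {
        "age": int(age),
        "class": social_class,
        "sex": "boy" if sex == "B" else "girl",
        "situation": "interview" if situation == "I" else "play session",
        "initials": initials,
    }

print(parse_pow_filename("6ABICJ"))
print(parse_pow_filename("6ABPSCJ"))
```

Grouping files by any of these fields (for example, all interview files for age 6) then reduces to filtering on the returned dictionary.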
The original machine-readable version contains around 65,000 words, but
the corpus is now more commonly distributed as the Edited Polytechnic of Wales
Corpus (EPOW: O'Donoghue 1991). EPOW contains only 60,784 word-forms
(3,730 word-types), because the texts have been edited to remove typographical
errors which had led, for example, to part-of-speech categories wrongly being
counted as words. This total corresponds to around 11,000 utterances.
The corpus was initially collected and used for the study of the linguistic
development of older children (Perkins 1983). It was later used for the machine
learning of probabilistic models of lexis and grammar for computer parsing
programs (O'Donoghue 1993, Weerasinghe 1994, Souter 1989, 1996).
4. Investigations
Three investigations are presented here into vocabulary range by age, across the
sexes, and by socio-economic class. We then investigate errors in use of irregular
verbs, and the extent to which speakers develop their use of syntactically
ambiguous words.
a) Vocabulary size and rate of growth
We can use the corpus to investigate how children's vocabulary expands with
age. Taking the part-of-speech tagged version of the EPOW corpus as our data
source, we can extract the number of unique word + word-tag pairs for each age
group. This is achieved using standard unix operating system commands on the
text files of the corpus, once they have been verticalised with only one word +
word-tag per line. For instance, the unix command
cat 6* | sort +0 -1 | uniq | wc

produces the output
1821 3642 79093
(lines strings characters)
and shows that there are 1,821 unique word + word-tag pairs used by the entire
group of six-year-olds. Extracting the same for the older children gives us an
indicative growth rate over each two year span of around 6% (Table 1). Note that
we are not talking about growth rates and vocabulary sizes for individuals here,
but of the combined vocabulary of 24 children in each age group. It does however
give us some indication of the typical upper bound for word + word-tag pairs
used by children. The number of unique word-forms is somewhat lower: the
number of unique words in the corpus is 3,730, compared with 4,618 unique word
+ word-tag pairs.
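The same count can be sketched in Python for readers without a Unix shell. This is an illustrative equivalent rather than the procedure used in the study: it assumes the verticalised layout described above (one word + word-tag pair per line), and the sample lines are invented.

```python
def count_unique_pairs(lines):
    """Count distinct non-empty lines, i.e. distinct word + word-tag pairs."""
    pairs = {line.strip() for line in lines if line.strip()}
    return len(pairs)

# Invented sample in the assumed one-pair-per-line layout (not corpus data).
sample = ["I HP", "GOT M", "A DQ", "HOUSE H", "I HP", "ON P", "ON AX"]
print(count_unique_pairs(sample))  # prints 6
```

Note that ON P and ON AX count as two types here, which is why the number of word + word-tag pairs exceeds the number of unique word-forms.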
Table 1. Tagged EPOW Corpus: types by age

Age          6      8      10     12     All
Types        1821   1938   2006   2162   4618
Growth (%)          6.4    3.5    7.8
Tokens       14120  14718  15368  16528  60784
From intuition, we may expect that vocabulary size should grow with age for
older children. We might also expect that the corpus had been carefully controlled
so that there were equal numbers of word-forms in each age cohort, but this was
not the case. As can be seen from the third row of Table 1, there are more tokens
in each cohort as the ages increase.
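The growth percentages in Table 1 follow directly from the type counts. A minimal Python sketch of the arithmetic, using only the Types row of the table (the function name is illustrative):

```python
types_by_age = {6: 1821, 8: 1938, 10: 2006, 12: 2162}  # Types row of Table 1

def growth_rates(counts):
    """Percentage growth between consecutive age groups, to one decimal place."""
    ages = sorted(counts)
    return {
        later: round(100 * (counts[later] - counts[earlier]) / counts[earlier], 1)
        for earlier, later in zip(ages, ages[1:])
    }

rates = growth_rates(types_by_age)
print(rates)                                       # {8: 6.4, 10: 3.5, 12: 7.8}
print(round(sum(rates.values()) / len(rates), 1))  # mean per two-year span: 5.9
```

The mean of the three spans, 5.9%, is the basis for the "around 6%" figure quoted in the text.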
[Line plot: Types (vertical axis, 0 to 2,500) against Tokens (horizontal axis, up to 16,000), with one curve each for Age 6, Age 8, Age 10 and Age 12]
Figure 1. Unique word-wordtag pairs by age
In order to discover if there is a genuine growth in vocabulary with age, we can
plot a learning curve for each age group, which shows how many unique word +
word-tag pairs are found as we read through the corpus data (Figure 1). This has
the effect of normalising for uneven sample sizes.
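A learning curve of this kind can be computed with a single pass over the data. The sketch below is an assumption about how such a curve might be produced, not the author's own procedure; the sampling interval (step) and the toy input are invented for illustration.

```python
def learning_curve(pairs, step=2000):
    """Return (tokens read, types seen) points, sampled every `step` tokens."""
    seen = set()
    points = []
    for position, pair in enumerate(pairs, start=1):
        seen.add(pair)
        # Record a point at each sampling interval and at the end of the data.
        if position % step == 0 or position == len(pairs):
            points.append((position, len(seen)))
    return points

# Invented toy sequence of (word, tag) pairs, not corpus data.
toy = [("I", "HP"), ("GOT", "M"), ("A", "DQ"), ("I", "HP"), ("A", "DQ"), ("CAR", "H")]
print(learning_curve(toy, step=2))  # [(2, 2), (4, 3), (6, 4)]
```

Because every age group is sampled at the same token positions, curves for cohorts of different sizes can be compared point by point up to the length of the shortest cohort.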
Until the data supply for six-year-olds runs out at just over 14,000 word-
forms, we can see that the twelve-year-olds consistently have a greater
vocabulary range than any younger group. The ten-year-olds only show a
markedly higher range once we have seen at least half of the data. The eight- and
six-year-olds appear not to differ greatly in vocabulary range. Rather surprisingly,
for much of the learning curve shown in Figure 1, the six-year-olds exceed the
eight-year-olds slightly in vocabulary range.
These figures for vocabulary range obviously need to be carefully
interpreted. They reflect the limited contexts in which the data were collected
(lego-building and conversations with an adult about games, films and TV), but
they are better than nothing as pointers towards active vocabulary.
For greater detail, Appendix 1 shows the 100 most frequent word + word-
tag pairs for each age group. These data reveal the pronoun I to be the most
common word across all age groups in the corpus, and a fairly consistent ranking
of other personal pronouns across the age ranges. Interestingly, he is around twice
as frequent as she across all age groups. Of the words used to express affirmation
and negation, we see a fairly consistent ranking for the word no. The use of yes is
quite consistent among six- to ten-year-olds (100 or more occurrences in each
group), but drops sharply among twelve-year-olds (36 occurrences).
The use of yeah instead of yes is a growing trend across all the age
groups, and increases quite sharply among twelve-year-olds, as use of yes
decreases.
b) Vocabulary differences by sex and age
Using similar unix commands, we can easily separate the data by sex and age.
Table 2 shows the range of word + word-tag pairs used by boys and girls.
Although the overall total for the corpus for each sex is almost the same, this
parity is only maintained in the subcorpus for eight- and ten-year-olds. Six-year-
old boys appear to have a significantly smaller vocabulary than six-year-old girls,
whereas the reverse is the case for twelve-year-olds, at least to judge from the
POW corpus.
Table 2. Tagged EPOW Corpus: word-wordtag types by sex and age

Age     6      8      10     12     Total
Boys    1099   1252   1319   1454   3054
Girls   1265   1250   1319   1342   3044
Total   1821   1938   2006   2162   4618
What is interesting to observe here, and which is made more obvious in Table 3,
is the number of word types being used only by boys, or only by girls.
Table 3. Raw EPOW Corpus: word types

Girls   Boys   Age 6   Age 8   Age 10   Age 12   All
2487    2491   1508    1614    1670     1760     3730
There are 3,730 unique words (word types) being used in the corpus as a whole.
Table 3 columns 1 and 2 show how many of these are used by the girls and by the
boys, respectively. Columns 3-6 show how many types are used by the six-
year-olds (of either sex), eight-year-olds, ten-year-olds, and twelve-year-olds,
respectively. Columns 3-6 are indicative of fairly steady vocabulary growth in
children aged between six and twelve.
Boys use 2,491 words and girls 2,487, which are remarkably similar totals.
However, only around 1,240 of the words in the corpus are being used by both
sexes, and the other half is specific to the speaker's sex. We might perhaps expect
that the overlap between sexes would increase if we had a larger corpus, or if the
speakers were adult, but perhaps this distribution is demonstrating a genuine
socio-linguistic phenomenon as well.
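The size of the overlap follows from the three totals in Table 3 by inclusion-exclusion. A short Python sketch of the arithmetic (variable names are illustrative):

```python
girls, boys, all_types = 2487, 2491, 3730  # word types from Table 3

shared = girls + boys - all_types   # types used by both sexes
boys_only = all_types - girls       # types never used by a girl
girls_only = all_types - boys       # types never used by a boy

print(shared, boys_only, girls_only)       # 1248 1243 1239
print(round(100 * boys_only / boys, 1))    # 49.9 (% of boys' types that are boys-only)
print(round(100 * girls_only / girls, 1))  # 49.8 (% of girls' types that are girls-only)
```

The computed overlap of 1,248 types is in line with the approximate figure given in the text, and in each case just under half of a sex's word types are exclusive to it.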
We can explore the words used only by boys or only by girls by deleting
those used by both from an alphabetically sorted lexicon extracted from the
corpus. Appendix 2 contains such words (beginning with A) extracted from the
corpus.
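The deletion step described above amounts to a set difference. The following Python sketch illustrates the idea on a handful of words taken from Appendices 2 and 3; the function name and the tiny lexicons are illustrative only, and the real lexicons would be extracted from the boys' and girls' files.

```python
def sex_specific(boys_words, girls_words):
    """Return alphabetically sorted boys-only and girls-only word lists."""
    boys, girls = set(boys_words), set(girls_words)
    return sorted(boys - girls), sorted(girls - boys)

# Tiny illustrative lexicons (a few items from the appendices, not the full data).
boys_lexicon = ["AIRPORT", "ASTRONAUT", "HOUSE", "DOOR", "AMUSEMENT"]
girls_lexicon = ["ALICE-IN-WONDERLAND", "HOUSE", "DOOR", "AMUSEMENTS"]

boys_only, girls_only = sex_specific(boys_lexicon, girls_lexicon)
print(boys_only)   # ['AIRPORT', 'AMUSEMENT', 'ASTRONAUT']
print(girls_only)  # ['ALICE-IN-WONDERLAND', 'AMUSEMENTS']
```

Because the comparison is on exact word-forms, AMUSEMENT and AMUSEMENTS land in different lists, which is the data-sparsity effect discussed in this section.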
An obvious area of difference is in the use of proper nouns. Male names
are prominent in the boys-only list, and female names in the girls-only list. The
corpus also displays stereotypical examples for favourite toys, careers, games etc
for each sex. Beyond this, we have to speculate as to whether the appearance of a
word in one column or the other is due to data sparsity, or whether it really is
indicative of a difference between the sexes.
There is evidence for both, I would argue. Data sparsity is evidenced by
the occurrence of amusement twice in boys' speech (but not in girls'), and
amusements once in girls' speech (but not in boys'). Boys talk about aeroplane,
aircraft, air-force and airport, whereas only air stewardess and air hostess
feature on the girls side. Boys talk about antennas and airholes, action men and
astronauts, whereas girls talk about animal magic, all creatures great and small,
and Alice in Wonderland.
Clearly, in a list such as Appendix 2, many of the items occur only once in
the corpus. If we instead consider the most frequent words used by boys and girls,
can we see any differences? Appendix 3 contains the 100 most frequent word +
word-tag pairs in the boys' and girls' sub-corpora. If we consider the most
common words which express affirmation or negation, we can see a clear
difference between the sexes. In the POW Corpus, words like yes and no are
labelled with the part of speech F (formula). Given that the corpus contains equal
quantities of text spoken by each sex, boys tend overall to use more positives than
girls do, whereas girls use more negative words, as illustrated in Table 4. There
are, of course, other ways of expressing affirmation and negation, but these are
the ones found most frequently in the corpus. (The use of no as a quantifier has
been omitted from the table.) Either this reflects a general trend between the sexes
in children's spoken language, or it is an artifact of the tasks performed in corpus
collection. Perhaps Lego building elicits more positive responses from boys, and
more negative responses from girls. Perhaps being interviewed by a friendly male
adult has an impact.
Table 4. Occurrence of some affirmatives and negatives by sex

Item (part of speech)   Boys   Girls
YEAH (F)                561    336
YES (F)                 136    214
YEH (F)                 52     41
TOTAL                   749    591

NO (F)                  274    311
NOT (N)                 130    174
DON'T (ON)              188    223
CAN'T (OMN)             59     102
HAVEN'T (OXN)           75     79
TOTAL                   726    889
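As a cross-check on Table 4, the column totals can be recomputed from the individual rows. The Python sketch below uses only the counts printed in the table; the ratio of affirmatives to negatives is an added illustration and is not a measure reported in the study.

```python
# Counts copied from Table 4 as (boys, girls).
affirmatives = {"YEAH": (561, 336), "YES": (136, 214), "YEH": (52, 41)}
negatives = {"NO": (274, 311), "NOT": (130, 174), "DON'T": (188, 223),
             "CAN'T": (59, 102), "HAVEN'T": (75, 79)}

def totals(table):
    """Sum the boys' and girls' columns of a {item: (boys, girls)} table."""
    boys = sum(b for b, _ in table.values())
    girls = sum(g for _, g in table.values())
    return boys, girls

print(totals(affirmatives))  # (749, 591)
print(totals(negatives))     # (726, 889)

# Affirmatives per negative, for each sex.
for label, column in (("boys", 0), ("girls", 1)):
    ratio = totals(affirmatives)[column] / totals(negatives)[column]
    print(label, round(ratio, 2))  # boys 1.03, girls 0.66
```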
In line with the data for all the children, regardless of sex, the personal pronoun
he occurs far more frequently than she. One might expect this in the boys'
language (239 instances of he against only 56 instances of she), but even the girls
use he (178 occurrences) more frequently than she (123 occurrences).
c) Differences in social background
The corpus also allows us to look for possible differences by socio-economic
class, which is expressed from A (highest) to D (lowest) in the corpus filenames,
and was judged by parental occupation information collected when the corpus
was compiled. Table 5 displays the word + word-tag types by class and age.
Table 5. Tagged EPOW Corpus: types by social class and age

Age       6     8     10    12
Class A   846   806   983   979
Class B   852   699   923   938
Class C   761   813   789   786
Class D   546   871   702   890
Few clear patterns are evident. Vocabulary range is not always highest for the
class A children, although it is for the ten- and twelve-year-olds. For eight-year-
olds, it is the class D children who have the widest vocabulary. Given the
judgmental approach to allocation of socio-economic class labels, it is perhaps
not worth exploring this area any further.
d) Genuine learners' errors (not typographical or transcription errors)
Running a spelling checker on the Edited POW Corpus, and ignoring the many
proper nouns, we can find some examples of native learner errors, such as regular
past tense forms for irregular verbs. Table 6 shows alphabetically which errors of
this kind are found in the corpus, and the source file in each case. One six-year-
old girl is the source of many of these. There are only 11 such errors among the
six-year-olds. Eight-year-olds have produced only four, and thereafter it appears
that these children have learned to use the irregular forms correctly.
Table 6. Past form errors of irregular verbs in POW

Word      Source
amn't     6cg (6cgihb)
blowed    8cb
bringed   6cg x 2
comed     6cg
digged    6cg
drawed    8db
keeped    6cg
rided     6cg
runned    6cg x 2
shooted   6ag
throwed   6bg
weared    8db x 2
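The errors in Table 6 were found with a spelling checker. An alternative that targets over-regularised past forms directly is sketched below in Python. It is not the method used in the study: the verb list is a small hand-picked assumption covering only the verbs in Table 6, and the spelling rules (adding -ed, adding -d after final e, doubling the final consonant) are deliberately crude.

```python
# Small hand-picked list of irregular verbs with their past forms; an assumption
# for illustration, far from complete.
IRREGULAR_PAST = {"blow": "blew", "bring": "brought", "come": "came", "dig": "dug",
                  "draw": "drew", "keep": "kept", "ride": "rode", "run": "ran",
                  "shoot": "shot", "throw": "threw", "wear": "wore"}

def regularised_forms(base):
    """Plausible '-ed' spellings a child might produce for an irregular verb."""
    forms = {base + "ed"}
    if base.endswith("e"):
        forms.add(base + "d")              # ride -> rided
    else:
        forms.add(base + base[-1] + "ed")  # run -> runned, dig -> digged
    return forms

def find_overregularised(words):
    """Return (word, expected past form) pairs for over-regularised verbs."""
    lookup = {form: past for base, past in IRREGULAR_PAST.items()
              for form in regularised_forms(base)}
    return [(w, lookup[w.lower()]) for w in words if w.lower() in lookup]

print(find_overregularised(["I", "RUNNED", "to", "the", "house", "and", "comed", "back"]))
# [('RUNNED', 'ran'), ('comed', 'came')]
```

A list-based check of this kind would miss forms such as amn't, which a general spelling checker does catch.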
e) Lexical ambiguity
One of the reasons for using the tagged POW corpus in these investigations was
to discover whether there was an increase in the range of syntactic uses of a word
with age, between the ages 6-12. Do children of these ages know how to use the
word cut as a noun, verb, and adjective? Table 7 shows the number of lexically
ambiguous word types used by each age group, as a percentage of the total
number of word types for that group. This proportion remains remarkably
static across the four age groups. Perhaps children have already learned all such
syntactic differences before the age of six, but I would think that unlikely. More
probably, the corpus elicitation tasks were too constrained to demonstrate this
feature adequately.
Table 7. Tagged EPOW Corpus: ambiguous types

Age                6         8         10       12
Word types         1508      1614      1670     1760
Ambiguous types    204       214       211      238
(% by age group)   (13.52)   (13.25)   (12.6)   (13.52)
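A word type counts as ambiguous here if it occurs with more than one word-tag. The Python sketch below shows one way such a count might be obtained from (word, tag) pairs; it is an illustration under that assumed definition, and the toy input simply reuses a few tags that appear in Appendix 1.

```python
from collections import defaultdict

def ambiguity_rate(pairs):
    """Return (word types, ambiguous types, percentage) for (word, tag) pairs."""
    tags_by_word = defaultdict(set)
    for word, tag in pairs:
        tags_by_word[word].add(tag)
    total = len(tags_by_word)
    if total == 0:
        return 0, 0, 0.0
    ambiguous = sum(1 for tags in tags_by_word.values() if len(tags) > 1)
    return total, ambiguous, round(100 * ambiguous / total, 2)

# Toy input using tags that occur in Appendix 1 (ON as P and AX, LIKE as P and M).
toy = [("ON", "P"), ("ON", "AX"), ("LIKE", "P"), ("LIKE", "M"),
       ("HOUSE", "H"), ("I", "HP"), ("ON", "P")]
print(ambiguity_rate(toy))  # (4, 2, 50.0)
```

Applied to the counts in Table 7, 204 / 1508 rounds to 13.53% rather than the 13.52% printed, which suggests that the table's percentages were truncated rather than rounded.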
5. Conclusions
The five investigations have hopefully illustrated some of the possibilities for
discovery of distinguishing features of children's vocabulary development.
Whilst in some areas it is clear that the data are too sparse (to inform the
compilation of a children's dictionary, for example), there are others which are
more promising and perhaps disturbing, from the point of view of syllabus and
course material designers. The POW corpus evidence suggests that many of the
words we use between the ages of 6-12 are not regularly used by the opposite sex
in similar contexts. This feature is worth a good deal more investigation. Growth
in vocabulary with age has also been demonstrated, although perhaps not at a rate
of increase we might expect. It would be interesting to compare the vocabulary of
children aged 6-12 with that of adults in the better known corpora, but the limited
tasks for speech collection used in the POW Corpus would confound a
straightforward comparison.
For syllabus and coursebook designers, there are also some warnings to be
made with respect to the Welsh dialect features of the POW Corpus. Although the
collectors sought to minimise Welsh language influence in the data, there are
some dialectal features which show through quite strongly. Two of these are the
disproportionately high occurrence of tag questions (including the use of isn't it
without person agreement with the main clause verb), and the use of Welsh
dialect locative adverbs by-here and by-there, instead of here and there, which
becomes more prevalent in the older age groups.
Further warnings should be made regarding the domain-based lexis. The
most frequent common nouns in POW are house, door, man, window and car,
because of the Lego-building task which the children were set.
From the point of view of syntactic structures, the POW corpus illustrates
just how ill-behaved speech can be, especially when uttered by children.
Around 30% of the constituents in the parsed corpus are lacking a grammatical
head, mainly because of ellipsis or interruption, so there is a wide range of
grammatical structures not typically found in written corpora.
The POW Corpus is a small corpus for lexical work, but it still reveals
some interesting comparative and quantitative linguistic features of children of
different ages and across the sexes. It is almost unique as a lexico-grammatical
resource for children's spoken language. I have not tried to show all such
features, by any means, but I hope to have demonstrated that it is worth
exploring, particularly if you have an interest in learning and teaching language.
References
Atwell, E., P. Howarth and C. Souter (2003), The ISLE Corpus: Italian and
German spoken learners' English, ICAME Journal 27: 5-18.
Fawcett, R.P. (1981), Some proposals for systemic syntax. Journal of the
Midlands Association for Linguistic Studies (MALS), 1.2, 2.1, 2.2 (1974-
76). Re-issued with light amendments, 1981, Department of Behavioural
and Communication Studies, Polytechnic of Wales.
Fawcett, R.P. and M. Perkins (1980), Child language transcripts 6-12 (with a
preface, in 4 volumes). Department of Behavioural and Communication
Studies, Polytechnic of Wales.
Granger, S. (1993), The International Corpus of Learner English, in: J. Aarts, P.
de Haan and N. Oostdijk (eds), English language corpora: design,
analysis and exploitation. Amsterdam: Rodopi, 57-69.
Granger, S. (ed.) (1998), Learner English on computer. London and New York:
Addison Wesley Longman.
Menzel, W., E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton and C.
Souter (2000), The ISLE Corpus of non-native spoken English, in: M.
Gavrilidou, G. Carrayannis, S. Markantionadou, S. Piperidis and G.
Stainhaouer (eds), Proceedings of LREC2000: Language Resources and
Evaluation Conference, vol. 2, 957-964. European Language Resources
Association.
O'Donoghue, T.F. (1991), Taking a parsed corpus to the cleaners: the EPOW
corpus, ICAME Journal 15: 55-62.
O'Donoghue, T.F. (1993), Reversing the process of generation in Systemic
Grammar. Ph.D. thesis. School of Computer Studies, Leeds University.
Perkins, M.R. (1983), Modal expressions in English. London: Frances Pinter.
Souter, C. (1989), The COMMUNAL Project: Extracting a grammar from the
Polytechnic of Wales corpus, ICAME Journal 13: 20-27.
Souter, C. (1996), A corpus-trained parser for systemic-functional syntax. Ph.D.
Thesis. School of Computing, University of Leeds.
Weerasinghe, A.R. (1994), Probabilistic parsing in Systemic Functional
Grammar. Ph.D. thesis. School of Computing Mathematics, University of
Wales College of Cardiff.
Appendix 1: 100 most frequent word-wordtag pairs by age in POW
Age 6 Age 8 Age 10 Age 12
Frq Type Tag Frq Type Tag Frq Type Tag Frq Type Tag
762 I HP 641 I HP 644 I HP 632 I HP
507 THE DD 597 THE DD 556 THE DD 590 THE DD
489 A DQ 451 A DQ 431 A DQ 530 A DQ
389 AND & 411 AND & 426 IT HP 403 IT HP
336 YOU HP 368 IT HP 391 AND & 359 AND &
328 IT HP 348 WE HP 381 YOU HP 342 'S OM
254 'S OM 281 'S OM 296 'S OM 337 WE HP
196 GOT M 262 YOU HP 264 WE HP 333 THAT DD
191 THAT DD 262 THAT DD 234 THAT DD 327 YEAH F
168 WE HP 192 YEAH F 230 YEAH F 319 YOU HP
155 THEY HP 170 NO F 155 THEY HP 191 GOT M
151 IN P 163 GOT M 149 TO I 171 PUT M
148 YEAH F 143 THEY HP 147 NO F 166 IN P
134 MY DD 123 PUT M 141 GOT M 158 NO F
132 TO I 113 TO I 131 THERE STH 157 THEY HP
129 HE HP 113 IN P 124 IN P 145 DON'T ON
110 NO F 110 YES F 122 THIS DD 141 ONE HP
107 CAN OM 108 THIS DD 119 OF VO 129 TO I
100 YES F 104 ON P 109 PUT M 112 OF VO
98 ON P 103 THERE AX 106 THERE AX 111 THERE AX
92 LOOK M 103 MY DD 104 YES F 108 HAVE M
90 'M OX 101 HE HP 104 HE HP 107 ON AX
84 TWO DQ 100 CAN OM 98 DO M 106 THIS DD
84 OF VO 98 'LL OM 96 ONE HP 104 NOT N
83 ONE HP 97 LOOK M 96 LOOK M 102 KNOW M
82 DON'T ON 91 DON'T ON 93 DON'T ON 100 THERE STH
80 MAKE M 91 DO M 87 ALL DQ 93 BE M
79 PUT M 90 BE M 86 'LL OM 87 CAN OM
79 ON AX 89 'VE OX 84 BE M 85 ON P
78 THERE AX 88 HOUSE H 83 THEM HP 85 'LL OM
77 GO M 87 MAKE M 81 HAVE M 83 HE HP
76 SHE HP 85 THERE STH 79 ON P 81 NOW AX
76 HAVE M 85 OF VO 79 CAN OM 81 GO M
75 KNOW M 81 ONE HP 76 IF B 80 THEM HP
71 WITH P 79 ALL DQ 75 MY DD 78 WHAT HWH
69 GET M 78 GO M 72 KNOW M 76 LIKE P
68 IS OM 76 HAVE M 71 WAS OM 75 HOUSE H
68 'S OX 73 ON AX 71 UP AX 73 GET M
67 THERE STH 71 KNOW M 70 NOT N 71 ALL DQ
67 IF B 70 WHAT HWH 70 'VE OX 66 WITH P
66 NOT N 68 THEM HP 70 'S OX 66 'VE OX
64 THEM HP 68 IF B 67 NOW AX 64 LOOK M
62 SOME DQ 67 'S OX 67 MAKE M 63 OUT AX
61 THIS DD 64 NOT N 64 HOUSE H 63 LIKE M
60 HERE AX 57 IS OM 59 LITTLE AX 62 THESE DD
59 MAN H 56 LIKE P 58 WHAT HWH 62 HERE AX
58 DO M 55 WITH P 58 GO M 61 IS OM
57 TO P 55 NOW AX 58 GET M 61 BY-THERE AX
56 UP AX 55 'M OX 57 ON AX 59 LITTLE AX
56 DOOR H 51 WAS OM 57 IS OM 58 ROOF H
55 WHAT HWH 51 PLAY M 57 GOOD AX 58 IF B
55 BUT & 50 AND-THEN & 56 HERE AX 58 DO M
54 HOUSE H 46 HAVE-TO XM 55 WITH P 57 WAS OM
53 'VE OX 46 AN' & 54 LIKE P 56 UP AX
51 ME HP 45 THINK M 54 LIKE M 55 ONES HP
51 BE M 45 SHE HP 53 THESE DD 55 FOR P
50 ALL DQ 45 GET M 53 BUT & 54 JUST AI
49 AND-THEN & 45 FOR P 53 AND-THEN & 51 GOING-TO X
48 WAS OM 45 COULD OM 51 DOOR H 51 BUILD M
48 DO O 44 WHERE AXWH 51 BY-THERE AX 50 SOME DQ
47 JUST AI 43 BUT & 48 WHEN B 50 HAVEN'T OXN
46 ONE DQ 42 OUT AX 48 THINK M 49 TO P
46 FOR P 42 DOOR H 47 WINDOWS H 48 CAN'T OMN
46 COME M 41 UP AX 47 JUST AI 48 'S OX
45 WANT M 40 CAN'T OMN 46 FOR P 47 BUT &
45 'LL OM 39 TO P 45 ONE DQ 45 MAKE M
44 LITTLE AX 38 LIKE M 43 IN AX 44 HAVE-TO XM
44 GOOD AX 37 WHEN B 43 'RE OM 44 GOT-TO XM
43 HAVEN'T OXN 37 NEED M 42 WINDOW H 43 WANT M
42 CAR H 37 DO O 42 NEED M 43 RED AX
41 NEED M 36 LITTLE AX 41 TO P 42 ONE DQ
40 CAN'T OMN 36 GOT-TO XM 41 ROOF H 42 OFF AX
37 LIKE M 36 BUS-STOP H 40 SOME DQ 41 MY DD
37 GOING-TO X 35 TWO DQ 40 DO O 40 PLAY M
36 THINGS H 35 SOME DQ 39 GOING-TO X 40 NEED M
36 PLAY M 35 LEGO HN 38 HAVE-TO XM 39 SHE HP
36 MINE HP 35 IN AX 37 YEH F 39 OR &
35 WAS OX 35 BIG AX 37 BECAUSE B 39 IN AX
35 NOW AX 34 ROOF H 36 SO & 39 'RE OX
35 AT P 34 JUST AI 34 WENT M 39 'D OM
34 WHEELS H 34 GOING-TO X 34 PLAY M 37 THINGS H
33 MAKING M 33 RIGHT AF 34 OUT AX 37 DO O
33 HAVE OX 33 HERE AX 34 BUILD M 37 AN' &
33 HAD M 33 HAVEN'T OXN 34 BRICKS H 36 YES F
33 BUS H 33 BY-THERE AX 33 ONES HP 36 TWO DQ
32 HIM HP 32 THESE DD 33 CAN'T OMN 36 ME HP
31 WHEN B 32 GARAGE H 32 YOU-KNOW AF 36 GOOD AX
30 OUT AX 31 ONE DQ 32 ME HP 35 WINDOW H
30 COS B 31 GOOD AX 32 'D OM 35 THEN AX
30 BACK AX 31 AT PM 31 CAR H 34 WHITE AX
28 WHAT F 30 THEN AX 31 ARE OM 34 WENT M
28 WENT M 30 DOWN AX 31 'RE OX 34 SEE M
28 ARE OM 29 WENT M 30 WANT M 34 COULD OM
27 WINDOWS H 29 OFF AX 30 TWO DQ 34 BIG AX
27 IN AX 29 ME HP 30 THING H 33 DOOR H
26 WINDOW H 28 GOING M 30 REALLY AL 31 WHERE AXWH
26 GOTTA XM 28 DOING M 30 'M OX 31 RIGHT FR
26 DOWN AX 27 THING H 29 WOULD OM 31 LOOK AF
26 BECAUSE B 26 COS B 29 VERY T 30 COME M
26 ANOTHER DQ 25 SAID M 28 HAVEN'T OXN 29 SO &
Appendix 2: Sex-specific words in POW
Boys only talk Girls only talk
Freq Word Type Freq Word Type
4 A-LEVEL 1 A'
2 A-LITTLE 1 A-HUNDRED-AND-ONE-DALMATIANS
1 ABANDONED 1 A-LADDERS
2 ABOVE 1 A...
4 ACTION-MAN 2 ABROAD
1 ADDING 1 ACCOUNTANT
1 ADRIAN 1 ACHING
2 ADVENTURE-BOOKS 1 ACROBATICS
1 ADVERT 1 ACTUALLY
1 ADVERTS 1 ADDED
3 AEROPLANES 1 ADJUST
1 AFRICANS 1 AFFORD
1 AGREE 1 AFTERWARDS
1 AIR-FORCES 2 AGES
1 AIRCRAFT 1 AHEAD
1 AIRHOLE 1 AHEAD-OF
1 AIRPORT 3 AIR-HOSTESS
1 AL 1 AIR-STEWARDESS
1 ALARM 6 ALEX
1 ALF 1 ALICE-IN-WONDERLAND
1 ALFRED-HITCHCOCK 3 ALIVE
1 ALL-OF-A-SUDDEN 1 ALL-ABOARD
1 ALL-THE-WAY 1 ALL-CREATURES-GREAT-AND-SMALL
1 ALL-TOGETHER 1 ALL-RIGHT-THEN
2 ALMOST 2 ALLEY
1 ALRIGHT-ALRIGHT 1 ALONE
2 AMUSEMENT 1 ALRIGHT-THEN
2 AMUSING 1 ALTOGETHER
2 ANDERSON 1 AM...
1 ANDRE 1 AMN'T
3 ANGRY 1 AMOUNT
1 ANIMAL-SNAP 1 AMUSEMENTS
1 ANTENNA 1 AN-ALL
1 ANY-MORE 1 AND'
1 ANY-WHERE 1 AND-FILEY
4 ANYMORE 3 ANDREA
1 ANYONE 2 ANGELS
2 ANYWHERE 1 ANGLES
2 APART-FROM 1 ANIMAL-MAGIC
1 APPLE 1 ANY-HOW
1 ARBEE 1 ANY-RATE
1 ARCADE 1 ANY-WAY
1 AREA 1 ANYHOW
2 ARGENTINA 2 ARCHES
1 ARGUED 1 ARGUE
1 ARROW 1 AROUNDS
2 ARROWS 1 ARRESTED
2 ART 1 AS-FAR-AS
2 ARTIST 1 AS-IF
2 AS-WELL-AS 2 AS-LONG-AS
1 ASTRONAUT 1 AS-SOON-AS
2 ASTRONOMY 1 ASKED
1 AT-FIRST 1 ASLEEP
1 AT-LAST 2 ASSEMBLY
1 ATH-LYMPICS 1 ATTACHED
1 ATTACK 1 ATTENTION
2 ATTACKING 1 AVE
1 AWKWARD 1 AW-MAMMY
Appendix 3: 100 most frequent word-wordtag pairs by sex in POW
Boys Girls
Frq Type Tag Frq Type Tag
1190 I HP 1489 I HP
1186 THE DD 1064 THE DD
942 A DQ 959 A DQ
800 IT HP 801 AND &
749 AND & 727 YOU HP
571 YOU HP 725 IT HP
571 'S OM 602 'S OM
565 WE HP 552 WE HP
561 YEAH F 477 THAT DD
543 THAT DD 361 THEY HP
354 GOT M 337 GOT M
288 IN P 336 YEAH F
274 NO F 311 NO F
249 THEY HP 282 TO I
241 TO I 266 IN P
240 PUT M 242 PUT M
239 HE HP 232 THERE AX
212 OF VO 227 THERE STH
209 THIS DD 223 DON'T ON
193 ON AX 214 YES F
190 ONE HP 211 ONE HP
188 DON'T ON 202 MY DD
179 CAN OM 200 LOOK M
173 ON P 197 HAVE M
167 'LL OM 194 CAN OM
166 THERE AX 193 ON P
156 THERE STH 188 THIS DD
151 MY DD 188 OF VO
149 LOOK M 180 KNOW M
149 DO M 178 HE HP
148 BE M 174 NOT N
146 MAKE M 170 BE M
144 HAVE M 163 GO M
143 HOUSE H 159 THEM HP
140 KNOW M 156 DO M
140 'VE OX 150 ALL DQ
138 IF B 147 'LL OM
137 WHAT HWH 138 WITH P
137 ALL DQ 138 HOUSE H
136 YES F 138 'VE OX
136 THEM HP 133 MAKE M
136 GET M 131 IF B
131 GO M 129 IS OM
130 NOT N 127 NOW AX
127 'S OX 127 LIKE M
118 UP AX 126 'S OX
114 IS OM 124 WHAT HWH
111 NOW AX 123 WAS OM
109 WITH P 123 SHE HP
104 WAS OM 123 ON AX
102 NEED M 120 LITTLE AX
96 TWO DQ 118 LIKE P
96 FOR P 117 HERE AX
94 HERE AX 114 BUT &
92 LIKE P 109 GET M
92 JUST AI 106 UP AX
91 TO P 106 SOME DQ
91 'M OX 105 'M OX
90 BY-THERE AX 102 CAN'T OMN
88 HAVE-TO XM 101 DO O
87 ONE DQ 98 DOOR H
86 GOOD AX 96 FOR P
86 AND-THEN & 95 TO P
84 DOOR H 92 ME HP
84 BUT & 91 THESE DD
81 THESE DD 91 AND-THEN &
81 SOME DQ 90 JUST AI
80 OUT AX 89 TWO DQ
80 GOING-TO X 89 OUT AX
78 LITTLE AX 88 IN AX
78 CAR H 84 PLAY M
77 PLAY M 82 ONES HP
75 HAVEN'T OXN 82 GOOD AX
74 ROOF H 81 GOING-TO X
67 WANT M 80 BY-THERE AX
66 WHERE AXWH 79 THINK M
66 OFF AX 79 HAVEN'T OXN
66 COULD OM 78 ROOF H
65 LIKE M 77 WHEN B
65 BIG AX 77 ONE DQ
61 GOT-TO XM 77 'RE OM
61 GARAGE H 75 WINDOWS H
61 DO O 74 WENT M
60 WHEN B 67 COME M
60 BUILD M 66 COS B
59 MAN H 65 ARE OM
59 CAN'T OMN 64 WANT M
58 COME M 64 THINGS H
57 THING H 64 MAN H
56 THINGS H 64 HAVE-TO XM
56 SHE HP 64 'D OM
56 ME HP 63 WINDOW H
56 IN AX 62 AN' &
55 WINDOW H 61 BECAUSE B
54 AT P 59 OR &
53 HIM HP 58 WHERE AXWH
52 YEH F 58 NEED M
52 THEN AX 57 PEOPLE H
51 WENT M 57 GOT-TO XM
51 RIGHT AF 57 BUILD M
Demonstrative reference as a cohesive device in advanced
learner writing: a corpus-based study
Roumiana Blagoeva
Sofia University St. Kliment Ohridski
Abstract
This paper discusses the under/overuse of different types of demonstrative
reference and their role for the achievement of cohesion in argumentative essays
written by advanced Bulgarian learners of English. The use of pro-forms and
their place within the total framework of text-forming relations are examined in
both native and non-native writing. A comparative approach to the study of
learner language is adopted for the investigation of differences between learner
and native English writing. These differences shed light on L1-induced and
universal features of learner discourse.
The analysis is based on data drawn from the Bulgarian component of the
International Corpus of Learner English (BUCICLE), the LOCNESS corpus of
native learner writing, a sub-corpus of the BNC, and a corpus of Bulgarian non-
learner writing. The frequency of occurrence, the distribution of demonstratives,
and their function as reference items in the four corpora are compared and
examples of their use are discussed.
Explanations of the phenomena observed are sought in several directions:
L1 interference, strategies of teaching/learning, avoidance of certain discourse
patterns, and the nature of the text type.
The differences between learner and native speaker English in the
frequency and distribution of demonstratives might not directly obstruct
communication, but they indicate that there is still much to be done in the
development of language skills even at an advanced level of foreign language
acquisition. The adoption of a corpus-based approach to the study of learner
language can reveal problematic areas in the foreign language and can enable
language researchers and language teaching professionals to diagnose the true
needs of learners and make appropriate choices of teaching materials and
methods.
1. Introduction
Interlanguage studies in Bulgaria developed in the early 1980s as a result of the
growing awareness that it was hardly possible to achieve effectiveness in foreign
language acquisition (FLA) and improvement of foreign language teaching (FLT)
without knowledge of the learners' needs and the peculiarities of their foreign
language production. Course designers, textbook authors and teachers
concentrated their efforts, on the one hand, on cross-language comparisons which
helped to generate predictions about the areas of learning difficulty in the target
language, and, on the other hand, on analysing learners' errors and the factors that
cause them.
Such studies placed too much emphasis on errors detectable on the
phrase and sentence levels, and they paid little attention to the inability of learners
to create a unified whole of the sentences that they produced. This led to the
assumption that as long as students stick to the rules of grammar and the
appropriate use of words they would be able to communicate successfully in the
foreign language. Yet, it was perceived by both teachers and learners that even at
a high level of FLA where very few errors occur there is still much difference
between learner and native-speaker production.
In recent years the collection of electronic learner-language corpora has
led to a shift of priorities in the study of learner production mainly in two
directions. First, by providing larger stretches of discourse a corpus enables
language teaching professionals and language researchers to study not only
isolated sentences and their structure but also the ways these sentences are
organised and utilised by text producers in realistic conditions for the purposes of
communication. Second, electronic learner corpora and corpus linguistics have
provided the necessary material and tools to turn the focus of attention from
erroneous structures to language patterns that might consist of acceptable units of
language but used in unnatural combinations. With the help of corpus data it is
now possible to reveal and analyse quantitative as well as qualitative differences
between learner and native speaker production. These differences seem to be a
major cause of the artificiality of learners' interlanguage and they indicate the real
areas of difficulties in the acquisition of a foreign language.
2. Aims of the study
This paper is part of a wider study of grammatical cohesive devices in
argumentative essays written by advanced Bulgarian learners of English which
aims at establishing how Bulgarian learners of English use the resources available
in the foreign language to achieve effective communication. It deals with the
under/overuse of the demonstratives this, that and their plural variants these,
those, both in their functions as determiner (modifier) and pronoun (head), and
their use as cohesive ties in written advanced learner discourse.
3. The corpora
A learner corpus is very different from a native corpus because of the nature of
the material collected. A native corpus contains data from a natural language and
can be used on its own for the investigation of characteristic features of this
language. A learner corpus presents evidence of an interlanguage; and an
interlanguage, regardless of its stages of development, can only be an
approximation to the natural language that is the target aimed at in the process of
FLT. Therefore, any learner corpus would be of little value on its own, but it can
be a useful tool for investigating a particular interlanguage when compared to a
relevant native corpus. The choice of the native-speaker corpus is dependent on
the aims of FLT. If the final goal of FLT/FLA is to achieve an ability to use the
target language as it is used by native speakers for the fulfilment of certain real-
life tasks, then a study of interlanguage will, firstly, need a suitable sample of the
foreign language to compare with the learners' production. Secondly, a learner
language is always characterised by some degree of L1 interference and, thirdly,
it could be influenced by the nature of the text type that learners have to produce.
Therefore, their language should be evaluated against a target norm representing a
similar text type. For all these reasons, comparisons with relevant data that take
into consideration these aspects of learner production are indispensable for a
comprehensive description and investigation of any feature a learner corpus might
display.
In view of the peculiarities of learner corpora mentioned above, the
present analysis is based on comparisons of data drawn from four electronic
corpora of about 200,000 words each.
Corpus 1 is a learner corpus of argumentative essays written by Bulgarian
university students of English language and literature, compiled within the
framework of the International Corpus of Learner English (ICLE) project, namely
the Bulgarian sub-Corpus of the International Corpus of Learner English
(BUCICLE). The ICLE project was launched at the University of Louvain in
1990. From the very beginning strict design criteria were adopted and variables
such as age, sex, native language background, level of foreign language
education, and the type and length of texts to be included were carefully
controlled. Each of the research teams from the participating countries was to
assemble a computerized collection of 200,000 words of learner English. At
present the ICLE corpus contains approximately 2 million words of
argumentative writing from university students of English from 11 different
language backgrounds and is an important resource for analysing features of
written interlanguage grammar, lexis and discourse (for further details, see
Granger, Dagneaux and Meunier 2002).
Corpus 2 is the British component of the Louvain Corpus of Native
English Essays (LOCNESS) containing argumentative essays by native-speaker
university students.
Corpus 3 is a sub-corpus of the BNC consisting of non-fiction texts from
the domains of Applied Science, Social Science and World Affairs, as this is the
target norm Bulgarian students are expected to master.
Corpus 4 is a collection of texts written in Bulgarian and taken from
domains comparable to those of the BNC sub-corpus.
4. Theoretical framework
Before discussing the results it is necessary to mention some similarities and
differences between the demonstratives and their role as cohesive devices in
English and Bulgarian. As far as textual relations are concerned demonstratives in
English and Bulgarian behave in a similar way. First, in both languages
demonstratives can function as determiners in noun phrases, or as pronouns, i.e.
as whole noun phrases. Second, in both languages their basic deictic function is to
indicate definiteness and proximity: 'near' and 'remote' (or 'not near') from the
point of view of the speaker. Third, in both languages they indicate that
information about their meaning, their referent, is to be retrieved from elsewhere:
either from the communicative situation thus relating exophorically to entities in
the world outside the text, or from the text itself where they refer endophorically
to preceding or following items expressing anaphoric or cataphoric reference
respectively. They refer to the location of some thing (person or object) in space
or time that is participating in the process. Finally, in both languages they have
distinct singular and plural forms (for Bulgarian, see Maslov 1982: 309-310;
Krastev 1992: 77-78; Pashov 1994: 95; Andreichin et al. 1998: 239; for English,
see Quirk and Greenbaum 1973: 107; Halliday 1985: 160, 292; Leech and
Svartvik 1994: 267; Lyons 1977: 647).
Two major dissimilarities, however, exist between demonstratives in
English and Bulgarian. The first one arises from the different expressions of
gender and the inflectional character of Bulgarian. This accounts for the larger
number of Bulgarian forms corresponding to the singular forms this and that.
Another important difference comes from the distinction between registers made
in Bulgarian, which leads to the existence of stylistically marked forms of the
demonstratives. These differences and similarities are summarised in Table 1.
Table 1. The English demonstratives and their Bulgarian equivalents

Participants      ENGLISH   BULGARIAN
                            Gender   Formal/Neutral   Stylistically marked
                                                      (colloquial/poetic)
Near     Sing.    this      masc.    tozi/toja        toz
                            fem.     tazi/taja        taz
                            neuter   tova             tuj
         Pl.      these              tezi/tija        tez
Remote   Sing.    that      masc.    onzi/onja
                            fem.     onazi/onaja      onaz
                            neuter   onova            onuj
         Pl.      those              onezi/onija      onez
One important feature of the demonstratives in English compared with the
demonstratives in Bulgarian that makes them both similar and different should be
noted here, namely that with extended reference and with reference to a fact
only singular forms can be used.
In English the use of demonstratives to refer to extended text, including
text as fact [...] applies only to the singular forms this and that used without
a following noun (Halliday and Hasan 1976: 66). Whereas extended reference
differs from usual instances of reference only in extent - the referent is more than
just a person or object, it is a process or sequence of processes (grammatically, a
clause or string of clauses, not just a single nominal) - text reference differs in
kind: the referent is not being taken at its face-value but is being transmuted into
a fact or report (Halliday and Hasan 1976: 52).
In Bulgarian, as Krastev (1992: 78) notes, the singular form tova ('near'),
but not onova ('remote'), has a special place in the system and is one of the most
frequent and most economical words in the language. Only the demonstrative
tova can replace any word, combination of words, phrases and even whole
stretches of text. Thus in Bulgarian only one form of the singular demonstratives
performs the functions of extended reference and reference to fact, which in
English are shared between the two singular forms.
5. Comparisons and observations
Using WordSmith Tools (Scott 1997), frequency lists and concordances were
produced for all the investigated items in each of the four corpora. The raw data
were then examined to exclude all examples that were irrelevant to the present
study, namely cases where that was used as a conjunction or relative pronoun,
and whenever it was used as an adverb in front of an adjective to express the
degree of a quality. The total number of tokens that were extracted from the
corpora after these first searches is shown in Table 2.
Table 2. Frequency of occurrence of the demonstratives in the four corpora

                    Corpus 1   Corpus 2   Corpus 3   Corpus 4
Near     singular       1167       1552        656       1600
         plural          325        297        146        182
Remote   singular        412        160        263         76
         plural          209        161        128         28
Total                   2113       2170       1193       1886
Most often a first step in a quantitative study of any language feature is to look at
the number of occurrences of the items examined, which can give a preliminary
idea of the spread of the feature through entire collections of texts. So when
examining the cohesive function of demonstratives it seems reasonable to start
with a comparison of the total number of tokens found in the corpora. A first
glance at the figures in Table 2 shows a striking similarity between the
frequencies of this/these and that/those in Corpus 1 and Corpus 2. Moreover, the
frequencies are nearly twice as high as those in Corpus 3 (the BNC) and slightly
higher than those in Corpus 4 (the Bulgarian language corpus).
However, these data could be misleading and could bring us to the rash
conclusion that there is no over- or underuse of demonstratives by the Bulgarian
learners of English. Instead, it may be that the use of demonstratives is
determined by the different text types represented in the learner and non-learner
corpora, as their number is greater in the argumentative essays than in the BNC
sub-corpus and the Bulgarian language corpus, both of which consist of other
types of non-fiction texts. However, if we make a distinction between near and
remote types of demonstratives and look at each of these types separately, the
picture changes, as shown in Tables 3 and 4.
Table 3. Near types of demonstratives

Near Corpus 1 Corpus 2 Corpus 3 Corpus 4
Sing. + pl. 1492 1849 802 1782
Table 4. Remote types of demonstratives

Remote Corpus 1 Corpus 2 Corpus 3 Corpus 4
Sing. + pl. 621 321 391 104
The distinction between proximity and non-proximity is expressed differently in
the learner and non-learner material. Demonstratives referring to near persons and
objects are slightly underused by Bulgarian learners when compared to British
students and this is compensated for by a clear overuse of demonstratives
referring to remote persons and objects. This tendency for Bulgarian learners to
use that/those occurs in spite of the very low frequency of occurrence of their
Bulgarian equivalents.
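The sums in Tables 3 and 4 follow directly from the counts in Table 2. As a minimal sketch (Python is used purely for illustration; the dictionary layout and the function names are assumptions, not part of the study), the near and remote totals and the share of remote forms in each corpus can be re-derived as follows:

```python
# Counts copied from Table 2, in the order:
# (near singular, near plural, remote singular, remote plural)
TABLE_2 = {
    "Corpus 1": (1167, 325, 412, 209),  # BUCICLE (Bulgarian learners)
    "Corpus 2": (1552, 297, 160, 161),  # LOCNESS (native-speaker students)
    "Corpus 3": (656, 146, 263, 128),   # BNC sub-corpus
    "Corpus 4": (1600, 182, 76, 28),    # Bulgarian language corpus
}


def near_remote_totals(counts):
    """Return (near total, remote total) for one corpus."""
    near_sg, near_pl, remote_sg, remote_pl = counts
    return near_sg + near_pl, remote_sg + remote_pl


def remote_share(counts):
    """Proportion of all demonstratives in a corpus that are remote forms."""
    near, remote = near_remote_totals(counts)
    return remote / (near + remote)


if __name__ == "__main__":
    for corpus, counts in TABLE_2.items():
        near, remote = near_remote_totals(counts)
        print(f"{corpus}: near={near} remote={remote} "
              f"remote share={remote_share(counts):.1%}")
```

Because the four corpora are of roughly equal size (about 200,000 words each), the raw totals can be compared directly; with corpora of different sizes the counts would first have to be normalised.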
So far mere statistical comparisons of the data suggest that native language
interference as a factor determining learner production plays an insignificant role
in the use of English demonstratives by the Bulgarian learners. However, looking
carefully at the examples extracted from the corpora, we can observe that the
Bulgarian learner writing shows a much wider variety of patterns than the
LOCNESS and the BNC material. The question is how this difference could be
explained.
Two very typical patterns that have some relevance to cohesion in that
they determine the use of demonstratives in endophoric (textual) reference were
observed in the BUCICLE. The first involves a demonstrative functioning as
determiner, as in:
(1) I know a little boy, for example, whose father is a scientist. This nine-year
old boy reads only Science Fiction and I can never persuade him to read a
fairy tale or fable or a folk tale. He is not interested even in books about
famous adventurers, about sailors and pirates, books which I read with
great interest and pleasure when I was his age. That boy reads only about
robots, machines, spacecraft, numbers. I agree that Science Fiction
somehow stirs children's imagination but it creates a world controlled by
machines, rather than one controlled by human beings. Probably the
science fiction stories will be the fairy tales of the new era. (BUCICLE)
The other typical group of examples observed involves the use of demonstratives
to refer to extended text, including text as fact. In English this function applies
only to the singular forms this and that used without a following noun (see
Halliday and Hasan 1976: 66) as in:
(2) Sinclair's, at all events, is the work of a Modernist, and is unlikely to be
that of an occultist. This makes it, in a sense, compatible with Hawksmoor.
But Hawksmoor is a different beast. (BNC)
(3) It fulfilled none of my expectations and seemed to be merely trying to
make me laugh at the fact that it had left me standing there grasping at
nothing. And that was all there was to it. By contrast, here is a comment
by an anthropologist who went to see the work of Mark Rothko. (BNC)
In English the choice of this or that to refer to something that has been said before
is clearly related to that of near (the speaker) versus not near; what I have
just mentioned is, textually speaking, near me whereas what you have just
mentioned is not (Halliday and Hasan 1976: 60). At the same time the notion
of proximity has various interpretations; and in such cases there is no very clearly
felt distinction between this and that (Halliday and Hasan 1976: 61).
In Bulgarian the demonstrative tova (singular, neuter, near), which
according to most traditional Bulgarian grammars (Krastev 1992; Pashov 1994;
Andreichin et al. 1998) expresses the idea of 'near' in time and space, has a very
wide spectrum of uses and has a special place in the system of Bulgarian
demonstratives. As mentioned above in Section 4, apart from its use as pronoun
or determiner to refer to any singular neuter object or person, it is the only
demonstrative that can convey extended reference relations in a text. Here the
distinction near/remote is lost and the reference of tova is derived from the
immediate context in or outside the textual world irrespective of the idea of
proximity. Thus in this particular function its use coincides with both this and
that in English and we may expect a great overuse of this by Bulgarian learners.
The functions of onova (singular, neuter, remote) are always either Head
or Modifier so it can never be used in extended reference and reference to fact;
and as the data demonstrate (Table 6) it is rare in Bulgarian. Yet, this infrequent
use of onova does not cause an underuse of its English equivalent that by the
Bulgarian learners. On the contrary, Table 2 shows a clear overuse of that in
Corpus 1 in comparison with Corpora 2 and 3. It is true that the total number of
singular forms is nearly the same in the learner material, the native-speaker
student writing and the Bulgarian language corpus, as shown in Table 5 and this
at first glance may blur some differences.
Table 5. Frequency of singular forms

Singular Corpus 1 Corpus 2 Corpus 3 Corpus 4
Remote + Near 1579 1712 919 1676
However, the number of singular demonstratives used by the Bulgarian learners is
unevenly distributed between this and that, with a predominance of near over
remote, with the result that the total frequency of this and that in Corpus 1 (1579)
approaches that of tova in Corpus 4 (Table 6).
Table 6. Frequency of singular forms in BUCICLE and the Bulgarian language
corpus
Singular Corpus 1 Corpus 4
Near 1167 1600
Remote 412 76
Total 1579 1676
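How far such frequency differences exceed chance variation can be gauged with a log-likelihood comparison, a statistic widely used in corpus linguistics. The paper itself reports no significance tests, so the following is only a sketch under stated assumptions: each corpus is taken to contain exactly 200,000 words (the paper says "about 200,000 words each"), the simplified two-term form of the statistic is used, and the function name is hypothetical.

```python
import math

# Assumption: the paper gives the corpus sizes only as "about 200,000 words each".
CORPUS_SIZE = 200_000


def log_likelihood(freq_a, freq_b, size_a=CORPUS_SIZE, size_b=CORPUS_SIZE):
    """Two-term log-likelihood for one item's frequency in two corpora.

    Expected frequencies are what each corpus would contain if the item
    were spread across both corpora in proportion to corpus size.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Singular remote forms (that) in Table 2: Corpus 1 vs Corpus 2
    print(round(log_likelihood(412, 160), 2))
    # Singular totals in Table 6: Corpus 1 vs Corpus 4
    print(round(log_likelihood(1579, 1676), 2))
```

Values above 3.84 correspond to p < 0.05 at one degree of freedom. Any conclusion drawn this way depends on the assumed corpus sizes.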
One possible reason could be the fact that most teaching materials used in
Bulgaria overlook the distinction between the English counterparts of tova and
onova and learners are left with the impression that it is unimportant and that both
this and that, having a very wide range of referents, could be used
indiscriminately to point to any word, phrase or longer stretch of text.
The lower frequency of singular forms in Corpus 3 than in the other
corpora could be attributed to the differences between the text types involved.
One could argue that since the distinction near/remote in the use of the
singular forms is not as clear-cut in English as in Bulgarian, the
interchangeability of this and that is permissible and might not lead to serious
communication breakdowns. Still, it is my view that it could interfere with a
receiver's comprehension of a text and could contribute to the production of
unclear textual references by learners of English. In the following example the
choice of this or that would only slightly change the point of view of the writer:
(4) [...] no-one is to be thought superior to another despite the differences of
race, social status, nationality and so on and every person is to be treated
objectively by the law and social institutions. And though that is being
continuously officially stated and re-stated often the talk about equality
remains just an euphemism to hide the cruel reality. It is obvious that some
people are more equal than others. [BUCICLE]
That is probably preferred because the fact it refers to in the preceding sentence is
not explicitly linked to the personal feelings of the writer; it is perceived rather as
being officially stated by a third party. In such cases this could easily substitute
for that and make the whole statement more involved.
But sometimes this tendency goes too far and in their desire to vary their
style and avoid repetition learners use this and that as absolute synonyms.
Consider the following examples from BUCICLE:
(5) [...] my opinion is that dreaming and imagination are still part of our
society. Even if it weren't so, I do not see what the problem is. The world
is changing, developing all the time and if it does not need these, it gets rid
of them as something useless, that is just the way it goes. And if someone
cannot live without dreams they either adapt to the new conditions or keep
dreams in their souls which is a question of personal choice.
In (5) it is unclear why the referents of these (dreaming and imagination) are
perceived as being closer to the writer of the passage than the fact that is referred
to by means of that. The idea of proximity is even more confused in (6) where
one and the same fact is referred to by both this and that in the same sentence:
(6) But is it really so, or it is just another old-dated "fairy tale" we are taught
to believe in and which is so trivial that we have learned it by heart. We
fight for freedom, we strive for equality, we talk about democracy and
having equal rights, but that is just an illusion, with which our minds are
washed away and we are all blind, because we believe in this. Human
beings are not equal. Inequality is determined by history. History is the
reflection of our lives.
6. Conclusions
The observations of the data presented in this paper demonstrate: (1) an overuse
of demonstratives in argumentative writing by both Bulgarian learners of English
and native-speaker students; (2) a tendency for Bulgarian learners to use
that/those in spite of the very low frequency of occurrence of their Bulgarian
equivalents; (3) a similar frequency of this/these in Bulgarian learner writing and
English native-speaker student writing; (4) a similar frequency of this/these and
their Bulgarian equivalents.
These findings shed light on some aspects of Bulgarian learner discourse
that are still unexplored and need further investigation. At this stage of the study
some of the similarities between the production of Bulgarian learners and native
speaker students might point to an influence on learner production by the nature
of the text type. A task-based learner corpus requiring students to produce one
particular text type might not reveal features of other text types. Yet, an academic
essay gives students freedom to write what they want, and more importantly what
they can, on a variety of topics, and in this sense a corpus of this kind can tell the
researcher a lot about learners' abilities to produce coherent texts in any real-life
context. It can allow us to draw meaningful conclusions about how aware, or
rather unaware, learners are of certain discourse features.
One indisputable reason for the deviations in the use of demonstratives by
Bulgarian learners from the native speaker target norm is native language
interference. The differences that exist between the systems of demonstratives in
English and Bulgarian affect learner production even at an advanced stage
of foreign language acquisition.
It is also my contention that there exists a strategy of communication
common to many advanced second language learners, namely that at a certain
stage of FLA they feel confident enough to communicate in the foreign language
and stop learning in the sense that they tend to stick to language patterns that
have become fossilised at an earlier stage of learning and continue to learn at a
slower pace, mostly by adding vocabulary. The main concern of such learners is
the 'real' errors they make at the level of vocabulary and syntax and it never
occurs to them that there could be other aspects of the foreign language that are to
be mastered. If at a certain stage of FLA learners are made aware that there is a
tendency for them to resort to a restricted range of language patterns, they would
probably be encouraged to learn alternative ways of expression and a more target-
like way of producing coherent texts.
Naturally, further corpus-based research in this area is likely to enhance
our understanding and intuitive evaluation of learner production and point to
effective ways of bringing their interlanguage closer to the kind of language used
by native speakers of English. This can be done through the development of
teaching materials and methods that focus attention not only on grammar rules
but also on discourse features.
References
Andreichin, L. et al. (1998), Gramatika na săvremennija bălgarski knijoven ezik.
Morfologija. Čast părva [Grammar of the contemporary Bulgarian
language. Morphology. Part one]. Abagar Publishing.
BNC World Edition, December 2000, SARA Version 0.98. Published by the
Humanities Computing Unit of Oxford University on behalf of the BNC
Consortium.
Granger, S., E. Dagneaux and F. Meunier (eds) (2002), International Corpus of
Learner English. Version 1.1. Handbook & CD-ROM. Louvain-la-Neuve:
Presses Universitaires de Louvain.
Halliday, M.A.K. (1985), An introduction to functional grammar. London and
New York: Edward Arnold.
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London and New
York: Longman.
Krastev, B. (1992), Gramatika za vsichki [Grammar for all]. Sofia: Nauka i
izkustvo.
Leech, G. and J. Svartvik (1994), A communicative grammar of English. London
and New York: Longman.
Lyons, J. (1977), Semantics, Vol. 2. Cambridge: Cambridge University Press.
Maslov, J.S. (1982), Gramatika na bălgarskija ezik [Grammar of the Bulgarian
language]. Sofia: Nauka i izkustvo.
Pashov, P. (1994), Praktičeska bălgarska gramatika [Practical Bulgarian
grammar]. Sofia: Prosveta.
Quirk, R. and S. Greenbaum (1973), A university grammar of English. Longman.
Scott, M. (1997), WordSmith Tools. Version 2. Oxford: Oxford University Press.
Translations as semantic mirrors: from parallel corpus to
wordnet1
Helge Dyvik
University of Bergen
Abstract
The paper reports on the project 'From Parallel Corpus to Wordnet' at the
University of Bergen (2001-2004), which explores a method for deriving wordnet
relations such as synonymy and hyponymy from data extracted from parallel
corpora. Assumptions behind the method are that semantically closely related
words ought to have strongly overlapping sets of translations, and words with
wide meanings ought to have a larger number of translations than words with
narrow meanings. Furthermore, if a word a is a hyponym of a word b (such as
tasty of good, for example), then the possible translations of a ought to be a
subset of the possible translations of b.
Based on assumptions like these a set of definitions is formulated,
defining semantic concepts such as synonymy, hyponymy, ambiguity and
semantic field in translational terms. The definitions are implemented in a
computer program which takes words with their sets of translations from the
corpus as input and performs the following calculations: (1) On the basis of the
input different senses of each word are identified. (2) The senses are grouped in
semantic fields based on overlapping sets of translations, such overlap being
assumed to indicate semantic relatedness. (3) On the basis of the structure of a
semantic field a set of features is assigned to each individual sense in it, coding
its relations to other senses in the field. (4) Based on intersections and inclusions
among these feature sets a semilattice is calculated with the senses as nodes.
According to our hypothesis, hyponymy/hyperonymy, near-synonymy and other
semantic relations among the senses now appear through dominance and other
relations among the nodes in the semilattice. Thus, the semilattice is supposed to
contain some of the semantic information we want to represent in wordnets. (5)
In accordance with this assumption, thesaurus-like entries for words are
generated from the information in the semilattice.
In the project these assumptions are tested against data from the English-Norwegian
Parallel Corpus (ENPC; Johansson 1997).
1. Introduction
1.1 Translations as semantic data
Parallel corpora, in which original texts are aligned with their translations into
another language, are a rich source of semantic information. Translations come
about when translators evaluate the degree of interpretational equivalence
between linguistic expressions in specific contexts. In many ways such
evaluations, made without any theoretical concerns in mind, seem more reliable
as sources of semantic information than the careful paraphrases of the semanticist
or the meaning descriptions of the lexicographer. Assuming that this is the case,
can we then retrieve some of the semantic properties of expressions by going
backwards from the network of translational relations in situated texts? Can we
reconstruct semantic properties from the translational properties manifested in a
parallel corpus?
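The two assumptions stated in the abstract (semantically related words share translations, and the translations of a hyponym form a subset of those of its hyperonym) can be illustrated with a small sketch. The translation sets below are invented toy data, not taken from the ENPC, and the function names and the Jaccard-style overlap measure are illustrative assumptions rather than the project's actual implementation.

```python
# Toy English -> Norwegian translation sets (hypothetical, not ENPC data).
TRANSLATIONS = {
    "good": {"god", "bra", "fin", "flink"},
    "tasty": {"god"},
    "fine": {"fin", "bra"},
    "red": {"rød"},
}


def overlap(word_a, word_b, translations=TRANSLATIONS):
    """Shared translations divided by all translations of either word."""
    a, b = translations[word_a], translations[word_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_hyponym_candidate(narrow, wide, translations=TRANSLATIONS):
    """True if the translations of `narrow` are a proper subset of those of `wide`."""
    return translations[narrow] < translations[wide]


if __name__ == "__main__":
    print(overlap("fine", "good"))
    print(is_hyponym_candidate("tasty", "good"))
    print(is_hyponym_candidate("good", "tasty"))
```

Using a proper subset keeps two words with identical translation sets from each counting as a hyponym of the other; whether the project draws the line in the same place is not stated in this section.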
The idea that semantic information can be gleaned from multilingual data
has been explored by others. Resnik and Yarowsky (1997), discussing word sense
disambiguation, suggest that in distinguishing between senses it may be fruitful to
restrict attention to such distinctions as are lexicalised differently in other
languages. Nancy Ide has explored the connections between semantics and
translation in several papers; in Ide et al. (2002) the authors study versions of the
same novel in seven languages and attempt to identify subsenses of words by
considering how the translations of a given word cluster in the six other texts.
1.2 Wordnets and thesauri
The output of the method presented here is a structure containing some of the
information which we find in wordnets. A wordnet is a semantically structured
lexical database. The Princeton WordNet (Fellbaum 1998), which has been built
manually, distinguishes between the senses of words and groups senses across
words into synsets according to near-synonymy. Pointers between such synsets
express semantic relations like hypero- and hyponymy, antonymy, and holo- and
meronymy. Wordnets for various European languages were developed within the
project Eurowordnet (http://www.illc.uva.nl/EuroWordNet/).
Wordnets are important resources for many applications within language
technology. They can be used in meaning-based information retrieval (searching
for concepts rather than specific word forms), in logical inference (if a document
mentions dogs, a wordnet allows the inference that it is about animals), in word
sense disambiguation (providing the search space of alternative meanings), etc.
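The inference example (a document mentioning dogs is about animals) amounts to following hyperonym pointers upwards. A minimal sketch, assuming a toy table of word-to-hyperonym links (hypothetical data; a real wordnet links synsets rather than bare word forms, and a synset may have more than one hyperonym):

```python
# Toy hyperonym links (hypothetical; each word points to a single hyperonym).
HYPERONYM = {
    "poodle": "dog",
    "dog": "animal",
    "cat": "animal",
    "animal": "organism",
}


def hyperonyms(word, links=HYPERONYM):
    """Return the chain of hyperonyms above `word`, nearest first."""
    chain = []
    seen = {word}
    while word in links:
        word = links[word]
        if word in seen:  # guard against accidental cycles in the table
            break
        seen.add(word)
        chain.append(word)
    return chain


def is_about(document_words, concept, links=HYPERONYM):
    """True if any word is the concept itself or has it as a hyperonym."""
    return any(
        word == concept or concept in hyperonyms(word, links)
        for word in document_words
    )


if __name__ == "__main__":
    print(hyperonyms("poodle"))
    print(is_about(["the", "dog", "barked"], "animal"))
```

The same upward walk supports concept-based retrieval: a query for a concept can match documents that contain only its hyponyms.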
A related kind of semantic resource is the thesaurus. As an example we
may consider the entry for the adjective conspicuous in the Merriam-Webster
Collegiate Thesaurus (http://www.m-w.com/home.htm), where two senses are
distinguished, each with its own sets of synonyms, antonyms etc.:
Entry Word: conspicuous

Function: adjective
Text: 1
Synonyms CLEAR 5, apparent, distinct, evident, manifest,
obvious, open-and-shut, openhanded, patent, plain
2
Synonyms NOTICEABLE, arresting, arrestive, marked,
outstanding, pointed, prominent, remarkable, salient, striking
Related Word celebrated, eminent, illustrious; showy
Contrasted Words common, everyday, ordinary; covert, secret;
concealed, hidden
Antonyms inconspicuous
We may compare this with the thesaurus-like entry for conspicuous below, which
has been generated automatically from parallel corpus data by the method to be
described in this paper:
conspicuous
  Sense 1
    (Norwegian: avstikkende.)
  Sense 2
    Hyperonyms: great, hard, large.
    Subsense (i) (Norwegian: synlig, tydelig.)
      Near-synonyms:
        clear, conclusive, definite, distinct, distinctive, obvious,
        plain, substantial, unmistakable, vivid.
      Hyponyms: apparent, evident, pervasive, visible.
    Subsense (ii) (Norwegian: fremtredende, kraftig,
      sterk, stor.)
      Near-synonyms: outstanding, primary.
    Subsense (iii) (Norwegian: oppsiktsvekkende.)
      Near-synonyms: amazing, spectacular, startling,
        surprising, unusual.
Antonyms and contrasted words are not included in the latter entry, since the
method only allows the derivation of relations of semantic similarity (synonymy,
hyperonymy and hyponymy) from the parallel corpus data. The entry displays a
major division into two senses (of which the first one in this case has no
information associated with it apart from a Norwegian translation), and
furthermore a division into subsenses within the more informative second sense.
Sense 1 in this example is probably a spurious consequence of sparsity of data
in the corpus. A better example of a major division into senses (although even
there we would have liked sense 1 to have been merged with sense 4) is
provided by the following automatically derived entry for the Norwegian noun
rett, which is contrastively ambiguous between a number of senses, among which
we find 'course in a meal' and 'court of law'. Some of the related words listed in
this entry are surprising, while most of them are to the point:
rett N
Sense 1
(English: course.)
Sense 2
(English: court, justification.)
Near-synonyms:
argument, begrunnelse, berettigelse, domstolsbehandling,
gård, gårdsplass, plass, sak, ting.
Sense 3
Subsense (i) (English: option.)
Hyponyms: tilbud.
Subsense (ii) (English: rightN.)
Hyponyms: adgang, rettighet.
Subsense (iii) (English: order.)
Near-synonyms:
bestemmelse, klasse, krav, lov, løsning, måte, orden, regel,
regelverk, stand, system, vedtak.
Sense 4
(English: dish, food, supper.)
Near-synonyms:
aftens, aftensmat, fat, føde, gryte, kar, kopp, kosthold,
kveldsmat, lunsj, mat, matvare, middag, måltid, næring, skål,
tallerken.
1.3 Semantic lattices
The thesaurus entries above are generated from semantic lattices, which in their
turn are derived automatically from the translational data. Figure 1 below is an
example of such a lattice, representing the semantic field associated with sense 4
of rett in the entry above (labelled rettN2 in the lattice):
Figure 1. A semantic lattice
According to the hypothesis behind the method, senses on dominating nodes are
hyperonyms of senses on dominated nodes. Thus, a sense of mat 'food'
dominates senses of rett 'dish', middag 'dinner', måltid 'meal', lunsj 'lunch',
kveldsmat 'supper', aftensmat 'supper', and aftens 'supper', all of which are
plausible hyponyms of mat. Less convincingly, lunsj also dominates aftensmat.
Formally the lattice expresses inclusion and overlap relations among sets
of translationally derived features, as described in section 2.3 below.
1.4 The parallel corpus
The English-Norwegian Parallel Corpus (ENPC), from which the above results
are derived, comprises approximately 2.6 million words, originals and
translations included. The corpus contains fiction as well as non-fiction and
English originals translated into Norwegian as well as the other way around. The
corpus is aligned at sentence level (Johansson et al. 1996), while it is a part of our
present project to align the ENPC at word level, in order to be able to extract the
sets of translations of a given word automatically. Our present data has been
derived from the sentence-aligned corpus, however, which means that the
translational data for each word in our data set has been extracted manually.
For example, searching for the Norwegian word form bemerkelsesverdig
returns the sentences containing bemerkelsesverdig coupled with the
corresponding English sentences in the parallel text (translation or original).
Based on a set of heuristic criteria to decide whether a word can be said to
correspond to a given word in the translation or not, the set of translations of
bemerkelsesverdig is extracted by the human analyser:
(bemerkelsesverdig (amazing notable remarkable spectacular surprising))
Sets of such lemmas with their associated sets of translations from the corpus
constitute the input to the procedure deriving semantic lattices and thesaurus
entries, by principles which we now proceed to describe.
2. Semantic mirrors
2.1 Separation of senses
We assume that contrastive ambiguity, such as the ambiguity between the two
unrelated senses of the English noun bank ('money institution' and 'riverside'),
tends to be a historically accidental and idiosyncratic property of individual
words. That is, we don't expect to find instances of the same contrastive
ambiguity replicated by other words in the language or by words in other
languages. Furthermore, we don't expect words with unrelated meanings to share
translations into another language, except in cases where the shared word is
contrastively ambiguous between the two meanings. By the first assumption there
should then be at most one such shared word.
Given these assumptions contrastive ambiguity should be discoverable in
the patterns of translational relations. We may consider the Norwegian noun tak,
contrastively ambiguous between the meanings 'roof' and 'grip'. Figure 2 shows
the first t-image of tak in the right-hand box, and the first t-images of each of
those English words again in the left-hand box. We refer to the last-mentioned set
of sets as the inverse t-image of tak.
Figure 2. The first and inverse t-images of tak.
The point worth noticing is that the images of roof and ceiling overlap in hvelving
in addition to tak, while the images of grip and hold overlap in grep in addition to
tak. This indicates that roof and ceiling are semantically related, and similarly
grip and hold, while no overlap (apart from tak) unites grip/hold and roof/ceiling.
Grip/hold and roof/ceiling hence seem to represent unrelated meanings, and the
conclusion is that tak is ambiguous.
Figure 3. The second t-image of tak

The overlap patterns are necessarily preserved within the first t-image of tak
when we make our third movement and find all the first t-images in English of
the words in the inverse t-image, as shown in Figure 3. We refer to this set of sets
as the second t-image of tak.
As shown in Figure 3, the second t-image can be divided into three
clusters or groups of sets, each group being held together by overlap relations (we
only consider overlaps in the restriction of the second t-image to the members of
the first t-image). On the basis of these groups the first t-image of tak can be
partitioned into the three sense partitions shown in Figure 4.
Figure 4. The sense partitions of tak's first t-image
By this method the main senses of lemmas are individuated.
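
The partitioning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the translation sets below are a toy reconstruction of the tak example (the singleton translation cover is hypothetical, added to give a third partition), and the function names are ours. Overlaps are computed on the second t-image restricted to the first t-image, and tak itself is ignored as a source of overlap.

```python
from collections import defaultdict

# Toy Norwegian -> English translation sets (illustrative, not ENPC data).
# "cover" is a hypothetical translation with no other Norwegian partner.
NO_EN = {
    "tak": {"roof", "ceiling", "grip", "hold", "cover"},
    "hvelving": {"roof", "ceiling"},
    "grep": {"grip", "hold"},
}


def invert(lexicon):
    """Turn a source -> targets mapping into a target -> sources mapping."""
    inverse = defaultdict(set)
    for source, targets in lexicon.items():
        for target in targets:
            inverse[target].add(source)
    return dict(inverse)


def sense_partitions(lemma, lexicon):
    """Partition the first t-image of `lemma` into overlap-connected groups."""
    first = lexicon[lemma]        # first t-image
    inverse = invert(lexicon)     # gives the inverse t-image per translation
    # Second t-image, restricted to the first t-image; the lemma itself is
    # skipped because it trivially links all of its own translations.
    restricted = [
        lexicon[back] & first
        for translation in first
        for back in inverse[translation] - {lemma}
    ]
    groups = [{translation} for translation in first]
    for overlap_set in restricted:
        merged, rest = set(overlap_set), []
        for group in groups:
            if group & merged:
                merged |= group
            else:
                rest.append(group)
        groups = rest + [merged]
    return sorted(sorted(group) for group in groups)


print(sense_partitions("tak", NO_EN))
```

With this toy input, roof and ceiling fall into one partition and grip and hold into another, while cover ends up alone.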

The limited size of the corpus is a source of error: a translation t of a lemma a
that occurs only once in the corpus, or that occurs only as a translation of a,
will give rise to a separate sense partition containing only t, and hence to
a potentially spurious sense of a (cf. the doubtful sense 1 of the examples
conspicuous and rett in Section 1.2). A larger corpus might display more
alternative translations of t, and thereby include t in one of the other sense
partitions. A frequency filter excluding hapax legomena from consideration might
reduce this problem.
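
Such a frequency filter might look as follows. This is a minimal sketch assuming that the manual extraction also records how often each lemma/translation pair was observed; the counts in the usage example and the name filter_hapaxes are hypothetical.

```python
def filter_hapaxes(pair_counts, min_count=2):
    """Keep only translation pairs observed at least `min_count` times.

    `pair_counts` maps (lemma, translation) tuples to observed frequencies;
    the result maps each lemma to its set of surviving translations.
    """
    kept = {}
    for (lemma, translation), count in pair_counts.items():
        if count >= min_count:
            kept.setdefault(lemma, set()).add(translation)
    return kept


# Hypothetical counts, for illustration only.
counts = {
    ("conspicuous", "avstikkende"): 1,
    ("conspicuous", "synlig"): 3,
    ("conspicuous", "tydelig"): 2,
}
print(filter_hapaxes(counts))
```

A lemma whose translations are all hapaxes disappears from the output, which may or may not be desirable in practice.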
2.2 Semantic fields
Once senses are individuated in the manner described, they can be grouped into
semantic fields. Traditionally, a semantic field is a set of senses that are directly
or indirectly related to each other by a relation of semantic closeness.
In our translational approach, the semantic fields are isolated on the basis
of overlaps among the first t-images of the senses. Since we treat translational
correspondence as a symmetric relation (disregarding the direction of translation),
we get paired semantic fields in the two languages involved, each field assigning
a subset structure to the other. Figure 5 gives a rough illustration of the principle
(arrows indicate the t-image of each sense; for simplicity, the indicated sets are
just suggested and in no way reflect the corpus data accurately).
Figure 5. Paired semantic fields (simplified illustration)
The subset structure of a semantic field, assigned by its partner field in the other
language, contains rich information about the semantic relations among its
members. For example, senses with a wide meaning (such as good) will in
general have a larger number of alternative translations than words with a
narrower meaning (such as tasty). The number of translations is of course directly
reflected in the number of subsets of which the sense is a member. Thus the
senses at the peaks in the semantic fields will have the widest meanings.
We may illustrate this by means of a constructed and artificially simple
example. Assume that we find the translational pattern illustrated in Figure 6,
where hingst 'stallion' is found translated into animal, horse and stallion, while
dyr 'animal' is translated into animal, horse, stallion, mare and dog, etc.
Figure 6. A constructed example
Since animal1 is translationally related to every member of the Norwegian field,
animal1 becomes the peak of the English field, being a member of all the
subsets, with horse1 ranked immediately below it, etc. By symmetry, the
Norwegian field gets a corresponding subset structure (cf. Figure 7).
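
The ranking by subset membership can be illustrated with a short Python sketch. Only the translation sets for hingst and dyr are given in the text; the sets for hest, hoppe and hund are our assumptions about what the "etc." covers, and sense indices are omitted.

```python
from collections import Counter

# Constructed example. The sets for hest, hoppe and hund are assumed.
NO_EN = {
    "dyr": {"animal", "horse", "stallion", "mare", "dog"},
    "hest": {"animal", "horse", "stallion", "mare"},
    "hingst": {"animal", "horse", "stallion"},
    "hoppe": {"animal", "horse", "mare"},
    "hund": {"animal", "dog"},
}


def subset_ranks(field):
    """Count, for each translation, the number of t-images it belongs to."""
    return Counter(t for translations in field.values() for t in translations)


english_ranks = subset_ranks(NO_EN)
peak = max(english_ranks, key=english_ranks.get)
print(english_ranks.most_common(), peak)
```

Under these assumptions animal is a member of all five subsets and is therefore the peak of the English field, with horse ranked immediately below it.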
2.3 Feature assignment
The next step is to encode, for each sense, its position within the semantic field,
along with its translational relations to the members of the other field. This is
done by means of feature sets, automatically derived from the set structure. In
accordance with traditional semantic componential analysis, the intention is that
wide senses should have few features, while more specific senses should have
more features, some of which are inherited from wider, superordinate senses. This
is achieved by starting from the tops in two paired fields (i.e. the sense pair
which is both translationally interrelated and whose members belong to the
largest number of subsets), which in Figure 7 gives us the pair dyr1 and animal1.
A feature [dyr1|animal1] is constructed from this pair and assigned to both its
members dyr1 and animal1. Then the feature is inherited (non-transitively) by
lower senses according to the following principle: all senses in the first t-image
of animal1 and ranked lower than dyr1 (i.e. belonging to fewer subsets than dyr1)
inherit the feature, and conversely, all senses in the first t-image of dyr1 and
ranked lower than animal1 inherit the feature. Then the procedure moves to the
next highest, translationally interrelated, peaks hest1 and horse1, constructs a
feature from that pair, and assigns it according to the same principle. The result is
shown in Figure 7.
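
The inheritance principle can be sketched as follows. This is a hedged illustration rather than the project's program: the translation sets other than those for hingst and dyr are assumed, sense indices are dropped, and the way successive peak pairs are chosen (each Norwegian sense, in descending rank, is paired with its highest-ranked English translation not yet used in a feature) is our simplification of "the next highest, translationally interrelated, peaks". The sketch also assumes that no word form occurs in both languages.

```python
# Constructed example. The sets for hest, hoppe and hund are assumed.
NO_EN = {
    "dyr": {"animal", "horse", "stallion", "mare", "dog"},
    "hest": {"animal", "horse", "stallion", "mare"},
    "hingst": {"animal", "horse", "stallion"},
    "hoppe": {"animal", "horse", "mare"},
    "hund": {"animal", "dog"},
}


def invert(lexicon):
    """Turn a source -> targets mapping into a target -> sources mapping."""
    inverse = {}
    for source, targets in lexicon.items():
        for target in targets:
            inverse.setdefault(target, set()).add(source)
    return inverse


def assign_features(no_en):
    """Assign peak-pair features and let lower-ranked senses inherit them."""
    en_no = invert(no_en)
    # Rank of a sense = size of its first t-image (number of subsets it is in).
    rank = {w: len(img) for table in (no_en, en_no) for w, img in table.items()}
    features = {w: set() for w in rank}
    used = set()
    for no in sorted(no_en, key=lambda w: (-rank[w], w)):
        free = sorted(no_en[no] - used, key=lambda w: (-rank[w], w))
        if not free:
            continue
        en = free[0]
        used.add(en)
        feature = f"[{no}|{en}]"
        features[no].add(feature)
        features[en].add(feature)
        # Senses in the t-image of `en` ranked lower than `no` inherit ...
        for other in en_no[en]:
            if rank[other] < rank[no]:
                features[other].add(feature)
        # ... and conversely for senses in the t-image of `no`.
        for other in no_en[no]:
            if rank[other] < rank[en]:
                features[other].add(feature)
    return features


features = assign_features(NO_EN)
print(sorted(features["hingst"]))
```

In this toy field the wide sense dyr carries a single feature, while hingst carries three, two of which are inherited from dyr and hest.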
Figure 7. Feature assignment in semantic fields
The feature sets in Figure 7 define a lattice based on inclusion relations among
them, as shown in Figure 8.
Figure 8. Lattices defined by the feature sets
In Figure 8 the daughters of a node N have supersets of the feature set associated
with N. In this constructed example the lattices evidently also reflect hyperonym /
hyponym relations among the senses.
The lattices in Figure 8 are simple trees, while actual derived lattices tend
to be more complex. In the first place, senses may inherit features from more than
one peak in the semantic field, which gives rise to multiple mothers in the
lattice. In the second place, nodes may have intersecting feature sets without
either of the sets including the other, so that there is no mother/daughter
relationship between the nodes in question. When no actual sense is associated
with the intersection, x-nodes (cf. Figure 1) are introduced, carrying the
intersection of the feature sets of their daughters. Thus the x-nodes can intuitively
be seen as virtual hyperonyms of their daughters. It is the presence of x-nodes
which guarantees that the structure is a semilattice (i.e. all nodes with intersecting
feature sets are guaranteed to be dominated by a node carrying the intersection).
In the semilattice, two senses are assumed to be more closely related the more of
their features they share, i.e. the shorter the distance is to their common
dominating node.
Returning now to the actual corpus-based lattice in Figure 1, it is defined
by the feature sets on the nodes according to the principles just described. For
instance, mat2 is associated with the singleton feature set {[mat2|supper3]},
kveldsmat1 with {[mat2|supper3], [kveldsmat1|meal1]}, and aftensmat2 with
{[mat2|supper3], [kveldsmat1|meal1], [lunsj2|meal1], [aftensmat2]}. In Figure 1,
x-nodes with only one feature (such as x1) are displayed with the feature beside
them.
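
How inclusion among feature sets yields mother/daughter relations can be shown with a small sketch. The feature sets for mat2, kveldsmat1 and aftensmat2 are those quoted above; the set for lunsj2 is an assumption made for the sake of the example, and the function names are ours. A node counts as a mother of another if its feature set is a proper subset of the other's with no third set in between; an x-node would be required wherever two sets intersect without inclusion and no sense carries the intersection.

```python
from itertools import combinations

FEATURES = {
    "mat2": {"[mat2|supper3]"},
    "kveldsmat1": {"[mat2|supper3]", "[kveldsmat1|meal1]"},
    # Assumed for illustration; not given in the text.
    "lunsj2": {"[mat2|supper3]", "[lunsj2|meal1]"},
    "aftensmat2": {"[mat2|supper3]", "[kveldsmat1|meal1]",
                   "[lunsj2|meal1]", "[aftensmat2]"},
}


def mother_edges(feature_sets):
    """Return (mother, daughter) pairs based on immediate proper inclusion."""
    edges = set()
    for a, fa in feature_sets.items():
        for b, fb in feature_sets.items():
            between = any(fa < fc < fb for fc in feature_sets.values())
            if fa < fb and not between:
                edges.add((a, b))
    return edges


def required_x_nodes(feature_sets):
    """Return intersections that no existing node carries."""
    existing = {frozenset(f) for f in feature_sets.values()}
    needed = set()
    for fa, fb in combinations(feature_sets.values(), 2):
        common = frozenset(fa & fb)
        if common and not (fa <= fb or fb <= fa) and common not in existing:
            needed.add(common)
    return needed


print(sorted(mother_edges(FEATURES)))
print(required_x_nodes(FEATURES))
```

Here aftensmat2 has two mothers, kveldsmat1 and lunsj2, and no x-node is needed because the intersection of their feature sets is already carried by mat2.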
2.4 Derivation of thesaurus entries
Derivation of thesaurus entries involves determining subsenses, hyperonyms,
near-synonyms and hyponyms of each sense on the basis of the information in the
semilattices. The semilattices are in some cases extremely complex, showing
intricate networks of connections between the word senses. Much of this
complexity should probably be considered as noise resulting from accidental
biases and gaps in the corpus. In the transition to a wordnet database or a
thesaurus we therefore want to abstract away from much detail in the lattices, and
this can obviously be done in more than one way. We presently use two
parameters to regulate the generation of thesaurus entries: OverlapThreshold and
SynsetLimit.
The value of the parameter OverlapThreshold decides the granularity of
the division into subsenses in the thesaurus entry. This does not concern the
division into main senses described above (tak1, tak2, tak3 etc.); those senses
usually end up in different semantic fields and hence in different lattices. Division
into subsenses is a further subdivision of each sense into related shades of
meaning. We assume that there is no final and universal answer to the question of
how many related subsenses a word sense has (cf. Kilgarriff 1997). By means of
the parameter OverlapThreshold we may attune that kind of semantic granularity
to our purposes.
We may illustrate the procedure by means of an example: the adjective
sweet. Figure 9 shows a small sublattice of the large lattice including the sense
sweet1.
Figure 9: A sublattice containing sweet1
Sweet1 is also dominated by several nodes outside this sublattice; size limitations
prevent displaying a more complete graph. The node sweet1 is associated with the
following feature set: {[god3|good1], [fin2|nice2], [pen1|gentle3],
[vakker1|soft2], [snill1|pleasant1], [deilig1|splendid3], [frisk4|sweet1],
[blid3|sweet1]}. Finding hyperonyms, near-synonyms and hyponyms of sweet1
now first involves considering which other senses in the lattice share features
with sweet1. The features in question are assigned to the following senses in the
complete semilattice (we will refer to the sets of senses as the denotations of the
features):
[god3|good1]:
(able1 accurate1 adept1 adequate2 affectionate1 all_right2 amiable2 appropriate5
attractive4 beautiful2 beneficial1 benign3 bright2 burning3 charming2 clean1 clear1 close3
comfortable2 comforting3 competent2 confident2 correct1 cozy2 cute1 decent2 delicious1
delightful2 detailed3 dishy1 easy1 efficient2 elegant3 excellent2 fair2 fancy1 favourable1
fine1 firmA1 first-class3 first-rate2 fit3 fortunate1 fresh3 friendly2 full2 genuine2 good1
handsome2 happy3 healthy2 high3 hot2 joyful2 kind1 kindly1 long3 lovely2 lucky2
magnificent3 marvellous1 neat2 nice2 okay1 peaceful1 perfect3 placid2 pleasant1 pleased2
pleasing1 pleasurable1 plentiful1 plenty1 polite2 positive1 pretty2 proficient1 quite_certain1
real2 reassuring2 respectable3 right2 ripe1 safe2 satisfactory1 satisfying1 secure2 sizeable1
smart2 smooth3 soft2 solid2 sound2 spectacular2 steady1 strong3 successful2 suited1
superb2 superior5 sure1 sweet1 talented2 thorough1 tidy1 well2 whole2 wholesome1
wonderful3 worthy2)
[fin2|nice2]:
(attractive4 beautiful2 breathtaking2 charming2 comfortable2 cute1 delicate3 dishy1 easy1
elegant3 enchanting1 excellent2 fancy1 fine1 first-class3 gentle3 glorious4 graceful2
handsome2 impressive2 lovely2 magnificent3 marvellous1 neat2 nice2 okay1 perfect3
pleasurable1 polite2 pretty2 pure2 slight3 smart2 soft2 splendid3 sweet1 thin2 wonderful3)
[pen1|gentle3]:
(attractive4 beautiful2 charming2 clean1 cute1 dishy1 elegant3 enchanting1 fancy1 fine1
first-class3 formal1 gentle3 graceful2 handsome2 lovely2 neat2 pleasant1 polite2 pretty2
soft2 sweet1 tidy1)
[vakker1|soft2]:
(attractive4 charming2 cute1 delightful2 dishy1 enchanting1 fair2 fancy1 graceful2
handsome2 lovely2 magnificent3 mild2 ornate2 pleasant1 pleasurable1 pretty2 soft2 sweet1)
[snill1|pleasant1]:
(all_right2 amiable2 benign3 friendly2 good-humoured1 good-natured3 jolly1 kind1 kindly1
mild3 pleasant1 pleasing1 polite2 smiling2 sweet1)
[deilig1|splendid3]:
(beautiful2 charming2 cute1 enchanting1 delicious1 delightful2 pleasurable1 splendid3
sweet1)
[frisk4|sweet1]:
(all_right2 brisk5 eager2 fit3 fresh3 healthy2 new1 pert2 sweet1 well2)
[blid3|sweet1]:
(amiable2 blithe3 cheerful4 cheery1 good-humoured1 good-natured3 jolly1 kind1 kindly1
merry1 mild3 smiling2 sweet1)
The most general features, [god3|good1], [fin2|nice2] and [pen1|gentle3], denote
a large number of senses each, especially [god3|good1]. This reflects the fact
that they are constructed from wide senses such as god3 and good1. As a result,
many of the senses carrying those features are not sufficiently close to sweet1 to
be called near-synonyms. Therefore we do not want to consider all the senses
sharing such general features as near-synonyms of each other. The value of the
parameter SynsetLimit defines the maximal size which the set denoted by a
feature can have in order to be included among the near-synonyms. With
SynsetLimit = 20, the sets of senses denoted by [god3|good1], [fin2|nice2] and
[pen1|gentle3] are not included among the near-synonyms of sweet1 (unless they
are denoted by other features as well). On the other hand, good1, nice2 and
gentle3 (the English senses from which the wide features were constructed) are
recorded as hyperonyms of sweet1.
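
The effect of SynsetLimit can be sketched as follows, using two of the feature denotations listed above. The function is our illustration, not the project's code; it assumes that the limit is inclusive, that the target sense counts towards the size of a denotation, and it ignores the special treatment of features constructed from the target sense itself. It is not meant to reproduce the published entry.

```python
DENOTATIONS = {
    "[pen1|gentle3]": (
        "attractive4 beautiful2 charming2 clean1 cute1 dishy1 elegant3 "
        "enchanting1 fancy1 fine1 first-class3 formal1 gentle3 graceful2 "
        "handsome2 lovely2 neat2 pleasant1 polite2 pretty2 soft2 sweet1 tidy1"
    ).split(),
    "[deilig1|splendid3]": (
        "beautiful2 charming2 cute1 enchanting1 delicious1 delightful2 "
        "pleasurable1 splendid3 sweet1"
    ).split(),
}


def split_by_synset_limit(denotations, target, synset_limit=20):
    """Split features into near-synonym sources and hyperonym sources."""
    near_synonyms, hyperonyms = set(), set()
    for feature, senses in denotations.items():
        if len(senses) <= synset_limit:
            # Small denotation: its members are candidate near-synonyms.
            near_synonyms |= set(senses) - {target}
        else:
            # Wide feature: record the English sense it was built from.
            hyperonyms.add(feature.strip("[]").split("|")[1])
    return near_synonyms, hyperonyms


near, hyper = split_by_synset_limit(DENOTATIONS, "sweet1")
print(sorted(near), sorted(hyper))
```

With SynsetLimit = 20 the 23 senses denoted by [pen1|gentle3] are excluded and gentle3 is recorded as a hyperonym, while the nine senses denoted by [deilig1|splendid3] are retained.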
Intuitively, the features represent different aspects of the sense sweet1,
and the question now is whether those aspects are sufficiently different from
each other to be considered different subsenses. Their distinctness can be
measured in terms of the degree of overlap among the sets of senses they denote.
If the features denote strongly overlapping sets of senses, the favoured
conclusion is that there is no division into subsenses. On the other hand, the less
the denotations of the features overlap, the more a division into subsenses is
motivated. The degree of overlap in a set of sets can be measured as a value
between 0 and 1, with 0 indicating no overlap and 1 full overlap (full overlap
meaning that for each set s, every set either includes s or is included in s). In
calculating the degree of overlap among feature denotations we disregard the
sense sweet1 itself, since it is necessarily a member of all the feature denotations.
The value of the parameter OverlapThreshold is a number between 0 and
1. A feature belongs to subsense n if the overlap between its denotation and the
denotation of at least one other feature in subsense n is equal to or greater than
OverlapThreshold. Hence, the higher the OverlapThreshold, the more subsenses
tend to be distinguished.
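
The grouping rule can be illustrated with a sketch. The exact overlap measure is not spelled out here, so the code assumes a pairwise overlap coefficient (size of the intersection divided by the size of the smaller set), which is 0 for disjoint sets and 1 when one set includes the other; this measure and the function names are our assumptions. Features constructed from sweet1 itself are treated like any other feature, so the output is not expected to match the entries generated by the project's program.

```python
from itertools import combinations

DENOTATIONS = {
    "[snill1|pleasant1]": (
        "all_right2 amiable2 benign3 friendly2 good-humoured1 good-natured3 "
        "jolly1 kind1 kindly1 mild3 pleasant1 pleasing1 polite2 smiling2 sweet1"
    ).split(),
    "[blid3|sweet1]": (
        "amiable2 blithe3 cheerful4 cheery1 good-humoured1 good-natured3 "
        "jolly1 kind1 kindly1 merry1 mild3 smiling2 sweet1"
    ).split(),
    "[deilig1|splendid3]": (
        "beautiful2 charming2 cute1 enchanting1 delicious1 delightful2 "
        "pleasurable1 splendid3 sweet1"
    ).split(),
    "[frisk4|sweet1]": (
        "all_right2 brisk5 eager2 fit3 fresh3 healthy2 new1 pert2 sweet1 well2"
    ).split(),
}


def overlap(a, b):
    """Overlap coefficient: 0 = disjoint, 1 = one set includes the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def subsenses(denotations, target, threshold):
    """Group features whose denotations overlap by at least `threshold`."""
    # The target sense is disregarded: it belongs to every denotation.
    sets = {f: set(s) - {target} for f, s in denotations.items()}
    groups = [{f} for f in sets]
    for f, g in combinations(sets, 2):
        if overlap(sets[f], sets[g]) >= threshold:
            group_f = next(x for x in groups if f in x)
            group_g = next(x for x in groups if g in x)
            if group_f is not group_g:
                groups.remove(group_g)
                group_f |= group_g
    return sorted(sorted(group) for group in groups)


print(subsenses(DENOTATIONS, "sweet1", 0.05))
print(subsenses(DENOTATIONS, "sweet1", 0.2))
```

Under this assumed measure, raising the threshold from 0.05 to 0.2 splits [frisk4|sweet1] off from the group containing [snill1|pleasant1] and [blid3|sweet1], which illustrates why a higher OverlapThreshold tends to yield more subsenses.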
The last two features in the set above are constructed from sweet1 itself,
and we assume that senses sharing these features are hyponyms of sweet1: they have
inherited the feature from sweet1 and must have been ranked lower in the
semantic field.
Setting the parameter values at SynsetLimit = 20 and OverlapThreshold =
0.05, we consequently generate the following entry for sweet:
OverlapThreshold = 0.05:
sweet
Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: blid, deilig, fin, god, pen,
snill, søt, vakker.)
Near-synonyms:
amiable, amused, attractive, beautiful, benign, blithe, charming,
cheerful, cheery, cute, delicious, delightful, dishy, easygoing,
enchanting, fair, fancy, friendly, good-humoured, good-natured,
graceful, handsome, jolly, kind, kindly, lovely, magnificent, merry,
mild, ornate, picturesque, pleasant, pleasing, pleasurable, polite,
pretty, smiling, soft.
Hyponyms: all_right.
Subsense (ii) includes near-synonyms referring to personal character (e.g.
amiable) as well as synonyms referring to appearance (e.g. beautiful). Raising the
OverlapThreshold to 0.1 leads to the separation of those two kinds of near-
synonyms:
OverlapThreshold = 0.1:
sweet
Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: deilig, fin, god, pen, søt,
vakker.)
Near-synonyms:
attractive, beautiful, charming, cute, delicious, delightful, dishy,
enchanting, fair, fancy, graceful, handsome, lovely, magnificent,
ornate, picturesque, pleasant, pleasurable, pretty, soft.

Subsense (iii) (Norwegian: blid, snill.)
Near-synonyms:
amiable, amused, benign, blithe, cheerful, cheery, easygoing,
friendly, good-humoured, good-natured, jolly, kind, kindly, merry,
mild, pleasant, pleasing, polite, smiling.
Hyponyms: all_right.
3. Conclusion
We have given an illustration of the method employed in the project 'From
Parallel Corpus to Wordnet'. The method is implemented in a computer program
taking words with their sets of translations from the parallel corpus as input and
returning semantic lattices and thesaurus entries as output. The presentation has
been based on examples of the results obtained on the basis of manually extracted
data from the parallel corpus ENPC.
The examples have only served as illustrations and have not been
subjected to a critical analysis in this paper. An important task within the project
is the evaluation of the results, part of which involves comparisons with existing
sources like the Princeton WordNet and Merriam-Webster's Thesaurus. Another
task is the alignment of the corpus ENPC at word level, which will make it
possible to extract lemmas with their sets of translations automatically.
Based on our results so far we feel able to conclude that the method merits
further exploration.
Notes
1. The analyses in this paper are based on corpus data resulting from work by
Martha Thunes, Gunn Inger Lyse and the author. The software producing the
semantic analyses has been developed by the author and reimplemented and
improved by Paul Meurer. I am grateful to Martha Thunes for useful
comments on an earlier version of this article.
References
Aijmer, K., B. Altenberg, and M. Johansson (eds.). 1996. Languages in contrast.
Papers from a symposium on text-based cross-linguistic studies in Lund,
4-5 March 1994, 73-85. Lund: Lund University Press.
Diab, M. and P. Resnik (2002): An Unsupervised Method for Word Sense
Tagging using Parallel Corpora. 40th Anniversary Meeting of the
Association for Computational Linguistics (ACL-02), Philadelphia, July,
2002.
Dyvik, H. (1998a): A translational basis for semantics. In: Stig Johansson and
Signe Oksefjell (eds.) 1998. 51-86.
Dyvik, H. (1998b): Translations as semantic mirrors. In Proceedings of Workshop
W13: Multilinguality in the lexicon II. 24-44, Brighton, UK. The 13th
biennial European Conference on Artificial Intelligence ECAI 98.
Fellbaum, C. (ed.) (1998), WordNet. An electronic lexical database. Cambridge:
The MIT Press.
Grefenstette, G. (1994): Explorations in Automatic Thesaurus Discovery,
Boston/Dordrecht/London: Kluwer.
Hearst, M. A. (1998): Automated Discovery of WordNet Relations. In Fellbaum
(1998). 131 - 151.
Ide, N. (1999): Word sense disambiguation using cross-lingual information. In:
Proceedings of ACH-ALLC '99 International Humanities Computing
Conference, Charlottesville, Virginia. http://jefferson.village.virginia.edu
/ach-allc.99/proceedings
Ide, N. (1999): Parallel translations as sense discriminators. In: SIGLEX99:
Standardizing Lexical Resources, ACL99 Workshop, College Park,
Maryland. 52-61.
Ide, N., T. Erjavec and D. Tufis (2002), Sense discrimination with parallel
corpora. Proceedings of ACL'02 Workshop on Word Sense
Disambiguation: Recent Successes and Future Directions, Philadelphia,
54-60.
Johansson, S. (1997), Using the English-Norwegian Parallel Corpus - a corpus
for contrastive analysis and translation studies, in: B. Lewandowska-
Tomaszczyk and P.J. Melia (eds), Practical applications in language
corpora. Lodz: Lodz University. 282-296.
Johansson, S., J. Ebeling, and K. Hofland (1996), Coding and aligning the
English-Norwegian Parallel Corpus, in: K. Aijmer, B. Altenberg and M.
Johansson (eds), Languages in contrast. Papers from a symposium on text-
based cross-linguistic studies in Lund, 4-5 March 1994. Lund: Lund
University Press. 87-112.
Johansson, S. and S. Oksefjell (eds.) (1998): Corpora and Crosslinguistic
Research: Theory, Method, and Case Studies. Amsterdam: Rodopi.
Kilgarriff, A. (1997), I don't believe in word senses, Computers and the
Humanities 31 (2): 91-113.
Resnik, P.S. and D. Yarowsky (1997), A perspective on word sense
disambiguation methods and their evaluation. Position paper presented at
the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics:
Why, What, and How?, held April 4-5, 1997 in Washington, D.C., USA in
conjunction with ANLP-97.
Turcato, D. (1998): Automatically Creating Bilingual Lexicons for Machine
Translation from Bilingual Text. In: Proceedings of the 17th International
Conference on Computational Linguistics (COLING-98) and of the 36th
Annual Meeting of the Association for Computational Linguistics (ACL-
98), Montreal.
Physical contact verbs in English and Swedish from the
perspective of crosslinguistic lexicology
Åke Viberg
Uppsala University
Abstract
The major English physical contact verbs strike, hit and beat are compared with
their primary Swedish translation equivalent slå on the basis of data from the
English-Swedish Parallel Corpus. The analysis is carried out within two
theoretical frameworks concerning the underlying conceptual representation and
the linguistic cues that can be used for word sense identification. In addition to a
rather detailed account of points of contrast in the fairly extensive patterns of
polysemy that are characteristic of the verbs, an attempt is made to provide a
general characterisation in contrastive terms. In comparison with the English
verbs, the conceptual representation of slå is grounded more firmly in
sensorimotor experience and the fact that hitting prototypically is a hand action.
As in other languages such as Chinese, the main verb of hitting in Swedish has
extended senses that refer to other types of hand actions. With respect to word
sense identification, the semantic classification of the subject and object is a
prominent cue for the distinction between the major meanings of the main
physical contact verbs but to various degrees in English and Swedish. Several
examples are also given of cases where linguistic cues are not sufficient and
disambiguation must be based on topical or pragmatic information.
1. Introduction
This paper will present a contrastive lexical analysis of the major English
physical contact verbs strike, hit and beat in comparison to the Swedish verb slå,
which is the closest equivalent to all three English verbs. The semantic analysis is
based on an earlier paper on the verbs of physical contact in Swedish (Viberg
1999). The verb slå has a complex pattern of polysemy and many extended
meanings which require a wide range of translations in English. The rich
polysemy tends to be characteristic of verbs with the same prototypical meaning
across a wide range of languages (for Chinese, see Gao 2001).
The comparison of Swedish and English that will be presented in this
paper is based on the English-Swedish Parallel Corpus, ESPC (Aijmer et al. 1996,
Altenberg and Aijmer 2000), which contains original text samples in English and
Swedish together with their translations. The text samples represent both fiction
and non-fiction and the total number of words from each source language is about
half a million. The corpus will be used for contrastive purposes, whereas matters
such as translation problems or the general characteristics of translated texts will
not be dealt with (see Johansson 1998 on the various uses of parallel corpora).
The aim of the present paper is primarily to present a systematic
contrastive account of the data but the general theoretical significance will be
briefly indicated within two frameworks. The first concerns the conceptual
representation of lexical items accounting for the patterns of polysemy and their
cognitive motivations. This will be oriented towards cognitive semantics and in
particular prototype theory (Taylor 1989). Another important cognitive semantic
idea is the notion of embodiment which implies that our concepts to a large extent
are shaped by our bodies and brains (Lakoff and Johnson 1999). In particular,
bodily movement will be shown to play an important role for the conceptual
representation of the main verbs of physical contact.
The second framework concerns the contextual representation of lexical
items and the process of word sense identification accounting for the interaction
between word meaning and cues in the linguistic context in the disambiguation
process and in the choice of translation equivalents. According to Miller and
Leacock (2000), each meaning of a word must be associated with a contextual
representation, which can be either local or topical. Experimental work has shown
that people can identify various meanings of a polysemous word with a relatively
high degree of success if they are presented with a window of ±2 words of
context, but local context is not always enough. Local cues turned out to be very
precise when they occurred but all too often they simply did not occur (op. cit.
p. 156). Miller and Leacock also give an account of the use of topical context
which refers to the general topic of a text or conversation. Topical context has
been tested with various statistical classifiers run on computers. In one such
experiment, only the words occurring in the same sentence as the target word
were presented (in random order). With three or more senses to distinguish for
words such as line and serve, the statistical classifiers reached close to 75%
correctness. Human subjects who were presented with lists of words co-occurring
with line in reverse alphabetical order only managed to identify the correct sense
a little better than the statistical classifiers, which justified the conclusion that the
result obtained with the classifiers was close to the ceiling for what can be
achieved with topical context alone.
Table 1 shows the most frequent Swedish equivalents of strike, hit, beat
and knock. Due to the relatively limited number of occurrences, originals and
translations in each language have been pooled together, which is not ideal, but a
separate account would be difficult to grasp. (Originals and translations are
separately coded in the underlying analysis of the data.) The row named Total
English verbs shows the total number of occurrences of the four verbs in the
ESPC. The following rows show the most frequent Swedish equivalents. It turns
out that the most frequent translation equivalent of all these verbs except knock is
the verb slå, which is clearly the dominant physical contact verb in Swedish. The
two verbs strike and hit share the verbs drabba 'affect negatively' and träffa in
the sense 'hit a target' as the second and third most frequent equivalents, whereas
beat and knock only share the verb slå. As for knock, the verb knacka serves as
the major equivalent when the verb refers to knocking on a door; otherwise slå is
the major equivalent. The rightmost column shows the total number of
occurrences of the Swedish verbs in the corpus.
Table 1. Major Swedish equivalents of strike, hit, beat and knock

                                strike     hit    beat   knock  Total Swedish verbs
Total English verbs                134     115      67      64
Swedish equivalents:
slå 'strike, hit, beat'             63      39      29      14                  754
drabba 'affect negatively'          11      19                                  182
träffa 'hit a target'                9      11                                  325
knacka 'knock (on a door)'                                  35                   60

Table 1 rather clearly reflects the fact that the semantic field of physical contact
verbs has one central member in Swedish, the verb slå, which is the major
equivalent of the three verbs strike, hit and beat in English. In percentage terms,
slå accounts for between 47% (strike) and 33% (hit) of the equivalents of these
three verbs. On the other hand, these verbs account for only a small proportion of
the English equivalents of slå: together they account for only 18% of them. In
spite of this, at least strike and hit are usually experienced
as the closest equivalents of slå by Swedes who know English; this is probably
due to the fact that these two verbs account for close to half (47%) of the
equivalents of slå in its prototypical meaning as a physical contact verb. In
addition, as many as 29 other English verbs which can be regarded as physical
contact verbs are used as equivalents of slå (e.g. bang, pound, punch, slam, slap).
As will be shown below, there are also many English equivalents which belong to
other semantic fields than physical contact due to the extensive patterns of
polysemy which characterize slå. The next section provides an analysis of the
most frequent meanings of the major English physical contact verbs. This is
followed by an account of the extensive pattern of polysemy of Swedish slå and
how it is reflected in the English equivalents.
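The per-verb percentages quoted for Table 1 can be recomputed from the raw counts. The following is a minimal sketch, not part of the original analysis: the counts are copied from Table 1, while the variable and function names are illustrative assumptions. The results agree with the quoted range up to rounding.

```python
# Counts copied from Table 1 (ESPC, originals and translations pooled).
total_english = {"strike": 134, "hit": 115, "beat": 67, "knock": 64}
sla_as_equivalent = {"strike": 63, "hit": 39, "beat": 29, "knock": 14}


def share_of_sla(verb: str) -> float:
    """Proportion of a verb's occurrences that have slå as Swedish equivalent."""
    return sla_as_equivalent[verb] / total_english[verb]


# Hypothetical helper structure: one share per English verb.
shares = {verb: share_of_sla(verb) for verb in total_english}

for verb, share in shares.items():
    print(f"{verb:<6} {share:.1%}")
```

Among strike, hit and beat, strike has the highest share and hit the lowest, which is the range referred to in the text.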
2. English physical contact verbs
In Table 2, an attempt is made to show the relationships between the major senses
of strike, hit and beat as they are reflected in the ESPC. Unfortunately, the
number of occurrences is rather limited but it is still possible to sketch the basic
semantic relationships. The frequencies (F) given for each verb in the last three
columns refer to the total number of occurrences with a certain meaning and
typical subject and include some cases where the major Swedish equivalent is not
used.
Table 2. Main senses of strike, hit and beat with their major Swedish equivalents

Semantic field       Typical subject             Major Swedish equivalent   F strike  F hit  F beat
PHYSICAL CONTACT
Bodily action        Human                       slå                              35     40      26
Physical event       Mechanical devices:
                       car, vehicle              köra på 'drive on'                       2
                       clock                     slå                               4
                     Natural forces:
                       lightning                 slå                              11
                       wind, rain, waves         slå                                              5
                     Projectiles:
                       bullet, anything moving   träffa 'hit a target'             4     27       0
                       with force
                     Sense impressions:
                       light                     träffa                            4
ABSTRACT MEANINGS
Defeat               Human                       slå (besegra 'defeat',                          13
                                                 överträffa 'surpass')
Negative experience  Natural disaster, disease,  drabba 'afflict'                 14     27       0
                     economic crisis
Mental event         Thought, proposition:       slå                              34
                     it struck me that-S
Various other cases                                                                               2
Total (above)                                                                    106     96      46
Total (corpus)                                                                   134    115      67

The verbs strike, hit and beat can all be used about a human being moving the
arm and bringing the hand (or something held in the hand) into contact with
something in order to have an impact on it. This use as a bodily action verb can
be taken as prototypical. When the object is also a human being, which is
frequently the case, the intention is usually antagonistic: to hurt (or even to kill) or
defeat the other human, not just to touch in a friendly way (cf. pat, stroke,
caress). It is hard to find any clear semantic contrast between strike and hit in this
use, whereas beat is frequentative and generally indicates a more intensive effect.
The dominant Swedish equivalent of this use is slå. Equivalents clearly
expressing the intention are also used, in particular as equivalents of beat (e.g.
misshandla 'batter', klå upp 'beat up, thrash', ge stryk 'give a beating, lick').
The verbs can also be used with various classes of inanimate subjects to
describe various types of physical events (i.e. events which can be experienced
with our senses). In this case, there are several clear contrasts between hit, strike
and beat. Since the database is so limited, it is useful to compare the patterns in
the ESPC with the large BNC corpus. Table 3 shows which nouns are salient as
subjects according to Kilgarriff's WASPBENCH, a tool which shows which
collocates appear with more than chance frequency together with a certain target
word according to a statistical formula producing a salience index (Kilgarriff and
Tugwell 2002; see also the demo at http://www.itri.bton.ac.uk/peopleindex.html).
The columns marked F show the frequency of the noun as subject of the verb and
the columns marked Sal. show the salience index. The subjects are listed in
descending order of this index.
The type of subject is also important for the choice of Swedish translation.
In particular, projectiles such as bullets influence the choice of Swedish
translations in the direction of träffa 'hit a target'. When used as a physical
contact verb, träffa focuses the moment when contact occurs, whereas slå (see
below) prototypically describes a complete bodily action (stretching of arm
followed by contact between hand and target):1
(1) A building contractor called Peter Kemp had been standing next to him and
    he said Martin dropped the gun at the moment the bullet struck him. (RR)
    En byggnadsentreprenör vid namn Peter Kemp hade stått bredvid honom och
    han hade sagt att Martin tappade vapnet i samma ögonblick som kulan
    träffade honom.
As can be observed in Table 3, bullet appears as one of the most salient subjects
both of strike and hit and it is reasonable to regard it as a prototypical projectile.
(Among the salient subjects of hit, there are further examples: ball, shot, bomb,
missile, shell, pellet. Hit is the dominant alternative when the subject is a
projectile even in the ESPC according to Table 2.) However, not only nouns that
are lexically marked as projectiles favour the choice of träffa in Swedish. Any
concrete object that forcefully moves through the air can be interpreted as a
projectile:
(2) [...] when another crust came flying out the shed door and hit the side of
    the seagull's head. (RDO)
    [...] när ännu en brödkant kom flygande ur skjulet och träffade huvudet på
    måsen från sidan.
(3) Hade hon kommit bara lite tidigare kunde hon ha träffats i huvudet av
    istappen (MG)
    If she had come out just a little earlier, the icicle might have hit her.
Textually salient subjects such as bullet can serve as prototypical subjects of
träffa in the sense that is relevant here but the limits of the range of subjects that
serve as cues to the choice of Swedish equivalent are set by semantic and
pragmatic factors.
The verb träffa is also the preferred Swedish equivalent when the subject
refers to a human who sets a projectile such as a bullet in motion. In this case, the
projectile may be implied and left unexpressed:
(4) Mannen började springa och Kollberg sköt igen och den här gången träffade
    han honom i knävecket. (SW)
    The man started running, and Kollberg shot again and this time hit him in
    the knee.
(5) We try to aim as close as possible without actually hitting them. (MA)
    Vi försöker sikta så nära som möjligt utan att verkligen träffa dem.
The verbs meaning 'shoot' and 'aim', respectively, which form part of the topical
context, serve as the major cues to the choice of Swedish equivalent of hit.
The typical and most frequent object of strike, hit and beat in the ESPC is
a human being when the verbs appear in their prototypical use as bodily action
verbs. This is, however, only a tendency, whereas it is more or less a requirement
of Swedish slå (see below). There are a number of more abstract uses where these
verbs have an object which refers to a human experiencer. In prototypical uses
such as Harry struck/hit/beat Peter, there is usually an implication that the agent
wants to dominate or defeat the object. This implication tends to be strongest with
beat and this may be the reason why beat is used when only the abstract sense
'defeat' is present. The most frequent Swedish equivalent is slå but even more
abstract verbs such as besegra 'defeat' can be used:
(6) He was quick and good at tic-tac-toe and checkers, and cunning and
    aggressive; he easily beat me. (OS)
    Han var snabb och duktig i luffarschack och damspel, och listig och
    offensiv; han slog mig utan besvär.
(7) I was better at maths and science and practical things; you only had to
    show him a lathe in the metal workshop for him to pretend he had a
    fainting fit; but when he wanted to beat me, he beat me. (JB)
    Jag var bättre i matte och naturvetenskap och praktiska övningsämnen; man
    behövde bara visa honom en revolversvarv på metallslöjden för att han
    skulle låtsas svimma; men när han ville besegra mig så gjorde han det.
Table 3. Salient subject collocates of strike, hit and beat according to
Kilgarriff's WASPBENCH

strike         F  Sal.   hit            F  Sal.   beat           F  Sal.
Total BNC   7149                     9777                     7552
subject     4417   0.6   subject     6106   0.7   subject     3987   0.5
lightning     65  24.6   smash         33  24.0   heart        198  27.5
disaster      52  22.7   recession     99  23.7   drum          15  14.7
clock         80  22.0   bullet        45  19.3   pulse         19  13.0
thought       95  19.7   car           90  14.0   side          50  12.1
bullet        21  14.5   ball          42  13.7   stick         11  11.3
tragedy       17  14.2   shot          23  12.3   England       27  11.1
contrast      14  12.5   bomb          24  12.0   sun           31  11.0
blow          13  12.2   missile       14  11.9   team          52  10.8
similarity    11  12.1   squall         7  11.4   wing          15  10.1
bargain       10  11.8   downturn       7  11.3   rain          20   9.4
thing         74  11.3   blast         11  10.9   keeper         7   8.3
lightening     4  10.9   drought        8  10.6   gang          10   8.1
band          22  10.4   shell         13  10.6   whites         7   7.9
cyclone        6  10.4   wave          27  10.3   United         9   7.4
it           511   9.9   cyclone        5   9.9   Surrey         4   7.1
fact          28   9.4   chart         11   9.6   goal          13   7.1
burglar       13   9.2   loss          21   9.3   man           67   7.1
deal          15   9.2   hurricane      7   9.2   they         368   6.9
jinx           4   9.1   blow           9   9.0   Liverpool      6   6.8
raider         7   9.0   crisis        14   8.8   Rangers        5   6.8
thief         11   9.0   pellet         6   8.7
earthquake     6   8.7   slowdown       4   8.6
right         26   8.5   kick           8   8.3
sun           19   8.3   depression     8   8.3
plague         6   8.2   header         7   8.2

These two examples also illustrate how the meaning and the choice of translation
in certain cases can be identified only pragmatically by the wider discourse
context. When both the subject and object are human, the meaning 'beat
physically' is possible but ruled out by the fact that a game such as tic-tac-toe has
been mentioned earlier as in the first example. On many occasions, the cues are
even more indirect, for example when they reflect the general topic of
conversation such as sports. The meaning 'defeat', however, is also represented in
the list of salient subjects of beat in Table 3. Many of the subjects are (parts of)
names of teams (England, United, Surrey, Liverpool, Rangers). In addition, there
is the noun team itself and a relatively large proportion of the examples of they
also refer to teams. Most of the examples of the salient subject side also belong
here (e.g. Skem boss Dave Maloney, who watched his side beat Glossop 2-1 on
Saturday).
A prominent class of subjects that appear with hit and strike but not with
beat are nouns referring to events with negative effects for humans such as
natural disasters, economic crises, wars and diseases. Several of the salient
subjects in Table 3 are of this semantic type (strike: disaster, tragedy, cyclone,
earthquake, plague; hit: recession, downturn, drought, cyclone, loss, hurricane,
crisis, slowdown, depression). The object typically refers to human groups and
institutions of various types. The dominant Swedish equivalent in this case is
drabba, which basically means 'affect negatively':
(8) When a severe drought struck the land towards the end of his reign [...]
    (KAR)
    Mot slutet av Ahabs styre, när en svår torka drabbade landet [...]
Since the negative consequences of the event for humans are in focus, the verb
very often appears in the passive, which places the human experiencer in subject
position:
(9) Därtill drabbades landet av lågkonjunktur med åtföljande penningknapphet
    och politisk oro. (KF)
    In addition, the country was hit by a depression, resulting in political
    unrest.
There are a number of alternative Swedish equivalents such as hemsöka 'afflict'
and the evaluatively neutral inträffa 'occur', but these are not very frequent:
(10) In 1665 yet another plague hit the capital (SUG)
     1665 hemsöktes London av ännu en pest
(11) I slutet av 1870-talet inträffade en mycket svår lågkonjunktur med en lång
     rad svenska konkurser som följd. (TR)
     Sweden was hit by a very deep recession at the end of the 1870s,
     resulting in a large number of Swedish bankruptcies.
A peculiar fact about the use of hit in this meaning is that around 50% of the
occurrences in the ESPC have the passive form. (The passive forms are not as
prominent, 3 out of 14, with strike used with the same meaning, but this will not
be discussed in detail due to the relatively small number of examples.) One
reason for this is the general tendency of human arguments to be realized as
subject. At the same time, the frequent use of the passive form serves as an
indication that hit is being used as a psychological predicate rather than a physical
action verb. A comparison with Swedish drabba is interesting. There are 182
occurrences of drabba in the ESPC corpus, 103 (62%) of which are passive.
Besides hit and strike, its English correspondences are verbs which have a basic
meaning close to 'affect (negatively)' such as affect (23 examples), afflict (12)
and befall (5). The most frequent equivalent is actually the verb suffer (from)
(33), which takes a human Experiencer as subject in an active sentence:
(12) Men Joe var för tidigt född och hade drabbats [Passive] av syrebrist under
     förlossningen. (SCO)
     But Joe was born too early and had suffered from lack of oxygen during
     his birth.
Negative events of the type just described are in principle observable with our
senses, even if the psychological reaction of the Experiencer is in focus. The
subject can also refer to a purely mental event. A clear case is when the noun
thought is used as subject.
(13) Den första tanken slog mig när jag vaknade nästa morgon och tände
     ljuset. (RJ)
     That thought struck me the following morning when I woke up and switched
     on the light.
In the ESPC, only strike is used with this meaning (the sudden appearance of a
thought). The dominant equivalent in Swedish is slå. In both languages, this
meaning is usually tied to the construction it + Verb + NP + that-S (or wh-S):
(14) I know that at one stage it struck me how utterly out of place I was in
     that cathedral. (BR)
     Jag vet att det vid ett tillfälle slog mig hur ytterligt malplacerad jag
     var i den där katedralen.
The use of strike with a mental meaning is also reflected in the list of salient
subjects in Table 3. The noun thought appears close to the top. Among the other
salient subjects, the nouns thing and fact tend to serve as the abstract head of
sentential complements (e.g. The first thing that struck me about Dana's poems
was his incredibly tiny script and I was struck by the fact that there were no
spokes) and the salience of it as a subject of strike is no doubt due to expressions
of the type it struck me that-S.
The verb strike (often in combination with as) can also be used to describe
how something appears to a human Experiencer. In this case, Swedish slå
cannot be used as an equivalent and various mental verbs are
preferred instead, such as te sig or tyckas 'appear':
(15) Det enda som tycktes honom avvikande var ett litet krucifix som satt på
     väggen intill dörren till pentryt. (HM)
     The only thing that struck him as being odd was a little crucifix on the
     wall by the kitchen door.
Another alternative is to use a mental verb where the Experiencer appears as
subject, such as uppleva 'experience':
(16) Yes, I think that's how she struck me. (JB)
     Ja. Det var väl ungefär så jag upplevde henne.
To sum up, an important cue for word sense identification and for the choice of
Swedish translation of strike and hit is the semantic class of the subject. However,
there is a wide range of other linguistic cues, some of which will be dealt with in
the following account of slå, but as will become evident these cues are not as
prominent as for the Swedish verb. There are also cases where only the wider
discourse context or general pragmatic factors are decisive. With respect to the
conceptual representation, the Bodily action component of strike, hit and beat is
less prominent than in Swedish as will be demonstrated in the next section.
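The role of the subject as a cue can be illustrated with a small lookup. This is only a sketch under stated assumptions: the class labels (`human`, `projectile`, `negative_event`, `thought`) and the function name are hypothetical, and the mapping encodes only the default tendencies summarized in this section, not a disambiguation procedure used in the study.

```python
from typing import Optional

# Default equivalents per subject class, following the prose summary of Table 2.
# The class labels are illustrative names, not categories coded in the ESPC.
CUES = {
    "human": {"strike": "slå", "hit": "slå", "beat": "slå"},
    "projectile": {"strike": "träffa", "hit": "träffa"},
    "negative_event": {"strike": "drabba", "hit": "drabba"},
    "thought": {"strike": "slå"},
}


def major_equivalent(verb: str, subject_class: str) -> Optional[str]:
    """Return the major Swedish equivalent suggested by the subject cue, if any."""
    return CUES.get(subject_class, {}).get(verb)


print(major_equivalent("hit", "projectile"))       # träffa
print(major_equivalent("strike", "thought"))       # slå
print(major_equivalent("beat", "negative_event"))  # None
```

A lookup of this kind returns only a default. As examples (4) to (7) show, the topical and wider discourse context can override the subject cue, for instance when a human subject fires a projectile or when a game has been mentioned earlier.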
3. Swedish physical contact verbs
In Swedish, there is one nuclear physical contact verb, slå, which has a much
higher frequency than any other verb in the field. The meaning of Swedish slå is
analyzed in greater detail in Viberg (1999). In brief, slå in its prototypical use as a
physical contact verb involves Intentional action, Body movement, primarily with
the arm and hand, which results in contact between the hand and some (optionally
specified) part of the body of some other human being, as in the following corpus
example: Mor slog far i ansiktet (IB) 'Mother struck father in the face'. The
various aspects of the meaning of slå can be related to a number of experiential
levels as outlined in Table 4.
Table 4. Aspects of the meaning of slå

Experiential level     Concept                            Prototype
Cognitive              Intentionality                     Intentional
Sensorimotor           Limb movement                      Arm + hand
Spatial perception     Motion through the air to target   Target: other human
Mechanical reasoning   Force directed towards target      Strong force
Effects:                                                  Affected object:
Psychological effect   Defeating, hurting                 Human
Biological effect      Killing                            Human, animal
Physical effect        Setting target in motion           Physical object
                       Breaking target
                       Producing sound
                       (Producing artefact)

At the cognitive level, slå refers to an intentional action by a human agent in the
prototypical case. This is reflected in the fact that 70% of the grammatical
subjects of slå refer to a human, which is higher than for strike (41%) and hit
(48%) but relatively similar to beat (72%) in the ESPC corpus. Only in a few
cases is slå unintentional when the subject is human, as when it refers to hurting
oneself. In this meaning the verb is reflexive (slå sig) or has one of the subject's
body parts as its object:
(17) I fallet slog han huvudet i en nyuppslagen såpkagge [...] (KE2)
     He hit his head on a freshly-opened barrel of cleaning soap [...]
As mentioned in the introduction, cognitive linguists such as Lakoff and Johnson
(1980, 1999) have stressed the importance of bodily movement and perception for
concept formation. More specifically, Bailey (1997) presents a computational
model of motor control and word learning using verbs of hand action as an
example. Bailey also refers to the fact that brain imaging studies (Damasio and
Tranel 1993) indicate that there is an intimate connection between language and
the sensorimotor areas of the brain: verbs activate motor control regions, while
nouns do not (Bailey 1997: 12). At the sensorimotor level, slå refers to a limb
movement, in the prototypical case with arm and hand. Usually, this part of the
meaning is not explicitly marked. Only occasionally is the bodily motion
specified in greater detail as in the following example:
(18) Min femåriga arm som med all kraft lyfter handen för att slå tillbaka. (MS)
     and my five-year-old arm raising my hand to hit back with all its might.
The use of the body part as subject in this rather exceptional use also backgrounds
the cognitive level, conceptualizing the hitting as an uncontrolled event.
Hitting can be experienced both from within as a sensorimotor activity and
from outside as motion through space. The similarity between the visual
perception of the fist moving through the air and a projectile moving through the
air and hitting its target links examples like Harry hit Peter and A bullet hit Peter
in English. This example also shows that languages exploit potential links
differently in polysemy. As described earlier, Swedish would use slå in the first
case (Harry slog Peter) and träffa (Kulan träffade Peter) in the second. The verb
träffa, however, is not completely ruled out when referring to bodily action in
examples such as Harry träffade Peter med ett välriktat slag 'Harry hit Peter with
a well-aimed blow'. What motivates the use of träffa in this example is that the
trajectory of the fist and in particular the exact location of its end-point is
focused. Examples where the meaning of slå is based primarily on spatial
perception will be presented later in this section.
One characteristic of Swedish slå is that the direct object is usually also
human unless there is a verbal particle (see below). When it is non-human, the
target of the contact is usually realized by a formally more marked form as a
prepositional phrase (often på 'on' or i 'in') as in the following examples. The
direct object in the English examples is not possible as an alternative in Swedish
examples of this type:
(19) Vi började slå på flaskan för att hjälpa honom. (RJ)
     We started to hit the bottle to help him.
(20) She hit the blackboard. (RDO)
     Hon slog på svarta tavlan.
There is a strong implication that the contact has a clear effect or impact on the
object. This distinguishes hitting from touching. When the object is human, the
effect is usually psychological. The agent's intention to hurt or defeat the other
human is part of the prototypical meaning of slå. Swedish slå can also be used
when the result is death. The object in this case refers to a human or an animal
(cf. the meaning of the English cognate slay) but in this case slå is usually
combined with the particle ihjäl (etymologically 'into Hel', the kingdom of the
dead in Old Norse mythology). Slå ihjäl is in most of the cases translated by kill,
which is unmarked for manner, but the more direct equivalent beat to death also
occurs:
(21) Han kunde slå ihjäl mig utan att blinka. (SG)
     He'd kill me without giving it a second thought.
(22) Klappar det på porten är hans första impuls att gripa yxan och rusa ut och
     slå ihjäl. (IU)
     A knock at the door? His first impulse is to seize an axe, rush out and
     beat his visitor to death.
As in many of the other cases where slå is combined with a particle, the particle
signals the result, whereas the verb primarily contributes a manner component. A
sentence such as Peter slog ihjäl ormen can be paraphrased as 'Peter killed the
snake (by hitting it)'. However, slå without a particle has the conventional
meaning 'kill' when the subject refers to a bear: Björnen slog ett lamm 'The bear
got a lamb'.
The verb slå is associated with an extensive pattern of polysemy. The
relationships between a number of the most basic meanings are shown in Figure 1
(see Viberg 1999 for discussion) and the major English equivalents tied to various
meanings are shown in Table 5. In Figure 1, the prototype is shown in the box in
the middle. Above the prototype, a number of uses are displayed where some part
of the prototypical meaning is focused. A relatively frequent use focuses on the
limb movement without any resulting contact. The typical English equivalent is a
motion verb:
(23) Pastor Tureson slog uppgivet ut med händerna. (HM)
     Pastor Tureson threw up his hands in acknowledgment.
(24) Zablonsky spread his hands. (FF)
     Zablonsky slog ut med händerna.
Table 5. Major meanings of slå with their major English correspondences

Semantic field                                Freq.  Major English correspondences
Physical contact                                130  strike (27), hit (33), beat (8)
Body movement                                    35  Motion verbs: throw, fling, wave
Postural                                         46  sit (down) (27), take a seat
Settlement                                       37  settle (27)
Kill (slå ihjäl)                                 17  kill (13), beat to death, swat (a mosquito)
Defeat                                           20  beat (5), defeat (2), repulse (2), suppress (2)
Fighting (slåss)                                 45  fight (25), struggle (3)
Subject-centered motion                          19  fight one's way (3), set (3), push (2)
Object-centered motion: liquid                    8  pour (6), cast
Disconnection (slå sönder/av)                    38  break (9), smash (4), cut, demolish, destroy
Joining (slå samman, ihop)                       15  merge (8), join (2)
Open/close                                       69  open (28), close (10), slam (8)
Look up (slå upp)                                10  look up (6)
Dialling                                         11  dial (10)
Switch on/off (slå på/av)                        17  switch on/off (7), turn on/off (5)
Non-human subject
Physical object                                  10
Natural forces: lightning                        12  strike (10)
Natural forces: rain, waves                      13  bang, batter, beat, crash, hammer
Sound source                                     16  strike (clock), slam (door)
Heart, pulse                                     10  beat (6), thump (2)
Mental meanings
Impersonal construction: it struck me that-S     31  strike (13), occur to (7), come to (4), cross s.o.'s mind (2)
slå fast                                         11  establish (2), specify, state
slå vakt om                                      13  protect (5), safeguard (5)
Total (above)                                   633
Total (corpus)                                  754

(25) Hon for upp och sprang runt i köket, slog armarna runt kroppen, och
     hulkade och snyftade. (AP)
     She leapt up and ran round the kitchen, flinging her arms round her body,
     sobbing and sniffing.
The verbs strike, hit and beat only have a few uses where limb movement is
focused, as in the following example:
(26) Han hade börjat skaka av köld och slog armarna om sig själv. (KE)
     He had begun shaking with cold, so he kept beating his arms round his
     chest [...]
Examples such as Per slog ut med armarna 'Per spread his arms', where slå
describes limb movement, serve as a model for the conventionalized use of slå to
describe the motion of petals in expressions like Blommorna slog ut 'The flowers
came out'. In the corpus, there is one example which shows that similar
extensions are productive to some extent:
(27) Stockholmarna märker det ofta först när främmande flaggor slår ut på
     Norrbro. (GAPG)
     Stockholmers usually become aware of a state visit only when foreign
     flags fold out along Norrbro bridge.
An example like this one is based on the spatial perception of a movement that
looks like a certain type of arm movement (perhaps via the conventionalized
extension describing flowers coming out). There is no direct connection to the
sensorimotor experience in this example.
The result of defeating someone can also be focused. In English, this is
possible only with beat. In the following example, the discourse context makes it
clear that the physical part of the meaning of slå and beat should be suppressed:
(28) Genom en rad glänsande aktioner slog Karl XII ryssarna vid Narva år 1700
     och polackerna vid Klissow år 1702. (AA)
     In a series of brilliant actions Charles XII beat the Russians at Narva
     in 1700 and the Poles at Kliszow in 1702.
According to the interpretation presented in this paper, the intention to defeat or
hurt is part of the prototypical meaning of slå. The meaning 'defeat' is thus
a case of focusing (and strengthening) rather than some kind of metaphor.
Figure 1. Major meanings of slå
[Figure: network of meanings around the prototype; labels and examples as follows]
Prototype: PHYSICAL CONTACT: Per slog Pål i magen 'Per hit Pål in the stomach'
Focusing:
  Bodily motion: Per slog ut med armarna 'Per spread his arms'
  Stationary motion: Blommorna slog ut 'The flowers came out'
  Social interaction 1: Fighting: Per och Pål slogs 'Per and Pål were fighting'
  Social interaction 2: Competition: Per slog Pål i schack 'Per beat Pål at chess'
Resultative strengthening:
  Object-centered motion: Per slog bollen över nät. 'Per hit the ball over the net.'
  Disconnection: Per slog gräset. 'Per cut the grass.'
  Sound source: Det slog i dörrarna. 'The doors slammed.'
  Organic life: Björnen slog ett får. 'The bear got a lamb.'
  Postural: Per slog sig ner i soffan. 'Per sat down in the sofa.'
Specialized meanings:
  Open/close: Per slog upp boken. 'Per opened the book.'
    Per slog upp ett ord. 'Per looked up a word.'
  Motion: Liquid: Per slog upp en grogg. 'Per poured a drink.'
  Symbolic: Klockan slog 12. 'The clock struck 12.'
  Settlement: Per slog sig ner i Finland. 'Per settled in Finland.'
Metaphor: Per slog ihjäl tiden. 'Per killed time.'

In addition to the regular passive forms of slå, there are irregular forms
associated with the meaning 'fight'. Basically, the vowel is shortened, which is
reflected in writing in slåss used in the infinitive and present tense. (The regular
passive present form is slås. In the past tense, the difference in vowel length is
not reflected in the written form, slogs.) Slåss is usually treated as a separate
lemma in Swedish, but from a semantic point of view slåss is closely associated
with the prototypical meaning of slå. Basically, it refers to a fight with the fists
(Pojkarna slåss 'The boys are fighting') but it is often extended to a fight with
other physical means and can be extended into abstract domains as evident from
the second example below:
(29) Somliga söp och slogs så det var inte klokt. (SW)
     Some of them used to drink and fight like you wouldn't believe.
(30) Kanske slåss dom mot tystnaden, men mera troligt är att dom följer med den
     tystnad dom upptäckt. (SC)
     They may struggle with the silence but more often they coexist with the
     silence they have discovered.
The most frequent equivalent of slåss is fight but other alternatives such as
struggle, compete, contend, contest, vie and scramble for also occur.
In the construction slå sig ner (slå + Reflexive + 'down'), slå functions
semantically as a postural verb. The dominant English equivalent is sit down as in
the following example:
(31) Dag slog sig ner på golvet bredvid Ludde. (MG)
     Dag sat down on the floor beside Ludde.
Even if the use of slå is completely conventionalized in this construction, which
is characteristic of Swedish postural verbs (sätta sig ner 'sit down', lägga sig ner
'lie down', ställa sig upp 'stand up'), there is a close semantic relationship with
the prototypical meaning of slå. To sit down also involves a kind of limb
movement which, even in this case, results in physical contact between the body
and a seat or something serving as a seat (such as the floor in the example above).
This aspect of the meaning is backgrounded in the use of slå as a postural verb
but is more prominent in examples with various types of animals that can fly
(birds, insects):
(32) A fly alighted on his lower lip [...] (BO)
     En fluga slog sig ner på hans underläpp [...]
The use of slå as a postural verb also serves as a point of departure for an
extension which is characteristic of postural verbs in many languages, namely 'to
settle permanently in a place':
(33) Svenska och finska nybyggare slog sig ner i kolonin, som kallades Nya
     Sverige. (AA)
     Swedes and Finns settled in the colony which received the name of New
     Sweden.
Hitting a physical object can have various physical effects such as setting the
object in motion, breaking it, producing a new object or producing a sound. Such
meanings are based on mechanical reasoning and the transmission of force
(Michotte 1963, Leslie 1994). There are a number of uses of slå where a certain
physical effect has been conventionalized and become part of the meaning
through a process referred to as resultative strengthening in Viberg (1999). There
is often a complex interaction between the verb slå, various verbal particles that
can be combined with the verb and the semantic class of various objects. One
example is the use of slå to express separation into parts or disconnection, for
example by breaking or cutting (Viberg 1985). The verb slå in combination with
the particle sönder 'asunder, apart' is conventionally used to refer to breaking a
physical object by hitting it or (in a more extended meaning) by accidentally
dropping it. The most frequent equivalent of slå sönder is break as in the
following example:
(34) Natalie not caring about the way she makes Jane break plates matters; (FW)
     Att Natalie inte bryr sig om ifall hon får Jane att slå sönder tallrikar
     har också betydelse [...]
In the expression slå sönder, slå rather expresses the manner ('break by hitting')
whereas the result is expressed by the particle. However, with direct objects
referring to hay, grass and other plants, the result 'cut' has been lexicalized as in
the following example (the instrument 'scythe' is also understood):
(35) Vem är det som slagit ert hö, sa främlingen. (SC)
     "Who mows your hay?" asked the stranger.
The verb slå can also be used in phrases with the meaning 'cause to form a unit'
but in that case a verbal particle such as samman 'together' or ihop (etymol. 'in
+ heap') must be used. Even if it is possible to interpret combinations such as slå
ihop or slå samman concretely involving the striking of two objects against one
another, all occurrences in the ESPC have a more abstract meaning. The most
frequent equivalent is merge but join also occurs in a couple of examples:
(36) Produktionen vid Esswells enhet i Toscana slås nu samman med verksamheten vid fabriken i Lucca. (ASSI)
     Production at Esswell's unit in Tuscany will now be merged with operations at the plant in Lucca.
A type of resultative strengthening that is rather marginal in modern Swedish but
presumably more frequent in pre-industrial cultures is hitting as a method of
production. The expression slå mynt 'produce coins by hitting metal' refers to an
obsolete way of producing coins:
(37) I denna stad hade kungen sin gård, och i Sigtuna slogs också de äldsta daterbara mynten i landet. (AA)
     The King had his residence in that town, and the oldest dated coins were minted there.
Interestingly, the expression slå mynt av (lit. 'strike coins out of') has primarily
survived in modern Swedish in a metaphorical sense 'to produce a benefit for
oneself', i.e. to take advantage of a certain situation:
(38) "You'll pay for this," Con said, already seeing opportunities for cashing in on this young fool's misfortune. (JC)
     "Det här ska du få betala för", sa Con, som redan hade insett att det gick att slå mynt av den unge klåparens misslyckade försök.
The verb slå can also be used in the sense 'set in motion by hitting' as in the
example Per slog bollen över nät 'Per hit the ball over the net'. There is also a
more extended use of slå as a motion verb where the object is a liquid. The most
frequent equivalent of slå in this use is pour:
(39) Det fick dra ett tag innan gästgiverskan slog på en skvätt mjölk och lät den koka in. (KE2)
     When they 'd soaked it all up, the innkeeper's wife poured in some milk and let it all putter.
In examples like this one, slå no longer refers to hitting but to a movement with
the arm and hand that is partly similar: to move liquid by tilting a container held
in the hand. (There is also a verb hälla 'pour' in Swedish which has this as its
basic meaning.)
There are several other uses more or less closely linked to the prototypical
meaning where slå refers to some specialized kind of movement with the arm and
hand. One such hand action that is loosely associated with the prototypical
motion of arm and hand is expressed by slå på/slå av, which refer to the turning
of a switch on or off. The two major equivalents are turn on/off or switch on/off:
(40) [han] slog på sina varningsblinkers [...] (JG)
     [he] turned on his emergency blinkers [...]
(41) Kunde det vara så lyckligt att någon helt enkelt hade slagit ifrån huvudbrytaren? (LG)
     With a bit of luck it might just be that someone had simply turned off the main switch!
A rather frequent use of slå refers to opening and closing, which is basically a
hand action that resembles the prototypical act of striking. In this use, slå is
combined with the particle upp 'open' (basically: 'up') and its opposites igen,
ihop, samman referring to various closed states:
(42) Han tog ut en dyrbar och vackert ornamenterad pärm och slog upp den framför sig på skrivbordet. (HM)
     He took out an expensive and beautifully decorated portfolio and opened it before him on the desk.
(43) Jag lade ifrån mig pennan eller slog ihop boken. (AP)
     I put my pen down or closed my book.
The most frequent equivalents are open and close. When the object refers to
books and other physical objects consisting of pages joined together (newspapers,
journals, menus, etc.), slå + particle refers to opening and closing in a neutral
way. There is, however, another large group of objects referring to doors,
windows and other barriers that can be moved to allow passage (such as lids). In
this case, the use of slå + particle indicates that the action is carried out briskly
and forcefully. In addition to the neutral use of the verb open alone, there are
various equivalents that mirror the manner component:
(44) När dörren ut till hallen ånyo slogs [Passive] upp (KOB)
     When the door from the exhibition hall opened again
(45) Plötsligt slogs dörren upp (LH)
     Then the door flew open
(46) I detta nu slogs dörren upp (ARP)
     Then the door crashed open
(47) Djupt inne i mitt medvetande slogs dörrar upp (GT)
     Deep in my consciousness doors were thrown open
The expression slå igen dörren usually implies that the door was closed so
forcefully that a loud noise was produced, and this is mirrored by the frequent
equivalent slam the door:
(48) "När går ni av skiftet i kväll?" frågade han i samma ögonblick som en av dem slog igen bildörren. (JG)
     "When do you get off your shift?" he asked the one in the back as she slammed the car door.
The use of slå upp and slå igen to refer to opening and closing is so
well-established that it can be further extended to uses where hand action is not
involved. Slå upp can be used about the opening of the eyes:
(49) Eriksson slog upp ögonen. (SC)
     Eriksson opened his eyes.
Both slå upp and slå igen can be used with nouns meaning 'door' (or 'movable
barrier' in general) as subject. In examples like the following, there is no clear
implication that a human was involved:
(50) The glass door slammed. (RR)
     Glasdörren slog igen.
Another use expressing a hand action loosely associated with striking is when slå
refers to the dialling of a telephone number. In this case, the direct object is
usually numret 'the number' or siffrorna 'the numbers' and the dominant
equivalent is dial:
(51) Hon låste upp bilen och slog numret till kontoret i Ystad på biltelefonen. (HM2)
     She unlocked the car and dialed the number of the Ystad office on the car phone.
This is also an interesting example illustrating the cues that can be used for sense
identification and the choice of translation. The major cue in this case is the
semantic class of the object, which in addition to nouns meaning 'number' can be
any combination of digits which can serve as a telephone number: Peter slog 112
'Peter dialled 112'. Another example which has been discussed above is the class
of objects that can appear when slå refers to mowing or cutting hay and related
objects.
In Swedish, slå can be combined with a large number of particles. But
even in these cases the semantic class of the object is an important cue. The
combination slå upp, for example, is related to different senses and translations
depending on the semantic class of the object. The meaning 'open' appears when
the object refers to (1) door or other movable barrier, (2) book or other printed
matter consisting of pages joined together or (3) eyes. The meaning 'pour'
appears when the object refers to a liquid, especially a drink or beverage:
(52) Han slår upp vattnet och lägger i några citronklyftor. (MS)
     He pours out the water and puts a few slices of lemon in each glass.
The combination slå upp can also refer to the finding of information by opening a
book or other printed matter. This meaning is metonymically related to the
meaning 'open' which is transformed into a manner component ('find
information by turning the pages in a book'). The usual English equivalent in this
case is look up:
(53) I looked up the name Gahan. (SG)
     Jag slog upp namnet Gahan.
Typical objects in this case are words which refer to verbal or numerical
information such as 'name' and 'telephone number' but in principle any word
used metalinguistically could appear as object: Peter slog upp 'skiftnyckel' (i sin
ordbok) 'Peter looked up "wrench" (in his dictionary)'. In print, (single) quotes are
often used to signal that a word is used metalinguistically but in speech topical or
situational cues must be used.
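
The way the semantic class of the object selects among the senses of slå upp can be sketched as a small lookup. This is only an illustration under stated assumptions: the class names, the toy word lists and the function names are hypothetical and are not taken from the ESPC or from any existing tool; only the mapping from object class to English equivalent follows the discussion above.

```python
# Hypothetical toy lexicon: the semantic classes and their members are
# illustrative assumptions, not categories taken from the ESPC.
OBJECT_CLASSES = {
    "barrier": {"dörren", "fönstret", "locket"},
    "printed_matter": {"boken", "tidningen", "pärmen"},
    "eyes": {"ögonen"},
    "liquid": {"vattnet", "mjölken", "kaffet"},
    "information": {"namnet", "numret", "ordet"},
}

# English equivalents of "slå upp" keyed by the semantic class of the
# direct object, following the senses discussed in the text.
SLA_UPP_EQUIVALENTS = {
    "barrier": "open",
    "printed_matter": "open",
    "eyes": "open",
    "liquid": "pour",
    "information": "look up",
}


def object_class(noun):
    """Return the semantic class of a noun, or None if it is not in the toy lexicon."""
    for name, members in OBJECT_CLASSES.items():
        if noun in members:
            return name
    return None


def translate_sla_upp(direct_object):
    """Pick an English equivalent of 'slå upp' from the class of its direct object."""
    cls = object_class(direct_object.lower())
    # If the object class is unknown, the cue is unavailable and topical or
    # situational information would be needed instead.
    return SLA_UPP_EQUIVALENTS.get(cls, "unresolved")


if __name__ == "__main__":
    for obj in ["ögonen", "vattnet", "namnet", "skiftnyckel"]:
        print(obj, "->", translate_sla_upp(obj))
```

The last item, skiftnyckel, is left unresolved on purpose: a word used metalinguistically cannot be recognised from its semantic class alone, which is the point made above about quotes in print and situational cues in speech.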
In comparison with strike and hit, the semantic class of the subject plays a
less prominent role for the interpretation of slå since human subjects dominate so
strongly. Inanimate physical objects do occur as subjects but only to a certain
extent. Natural forces occur as subjects of slå to approximately the same extent as
with the English verbs. When the subject refers to lightning, the equivalent is
always strike but when it refers to rain and waves or fire and smoke, a wide range
of physical contact verbs are used (bang, batter, beat, crash, hammer, smack) in
addition to a few motion verbs (gush, sprout, sweep). Usually, various
fine-grained aspects of the manner component, especially forcefulness, are
incorporated into the meaning of the verb used as translation:
(54) Regnbyarna slog mot vindrutan. (HM2)
     Rain squalls hammered against the windshield.
(55) Grått regn slår mot glas. (PCJ)
     Grey rain batters the glass.
The verb slå can also be used as a mental verb and take a proposition or a mental
noun such as tanke 'thought' as subject. (The uses of slå with a mental subject are
treated together with other mental uses in Table 5.) A sentential subject is usually
extraposed and introduced by a dummy subject (det 'it') as in the English
construction it struck me that-S (Swed. det slog mig att-S):
(56) Efteråt slog det mig att det kanske inte går att drömma att man dör. (BL)
     Later it struck me that it is perhaps not possible to dream that you die.
There are 31 occurrences of slå in this construction. The most frequent English
equivalent is strike but there are several other alternatives such as occur to, come
to, cross s.o.'s mind:
(57) Det slog mig att det var mycket länge sedan jag känt mig generad. (LH)
     It occurred to me it had been quite a while since I'd felt embarrassment.
(58) Det slår mig att han antagligen inte alls hör till kongressen. (MS)
     The thought crosses my mind that he probably does n't have anything to do with the convention.
(59) And this, it suddenly came to her, might well be the wages of sin. (FW)
     Och detta, slog det henne plötsligt, skulle mycket väl kunna vara syndastraffet.
Mental nouns such as tanke 'thought, idea' can be used as subjects when the
object is human:
(60) Tanken slog mig att Pekka kanske hade seglat iväg med MacDuffs kvinna (BL)
     It came to my mind that Pekka had perhaps sailed away with MacDuff's woman
Usually, a passive alternative is used as in the following Swedish example:
(61) A thought suddenly struck her. (RR)
     Plötsligt slogs hon av en tanke.
The verb slå also appears in a number of phrasal combinations with a mental
meaning, where the subject is a human agent. The active suppression of a thought
can be described with the phrase slå bort tanken (lit. 'strike the thought away').
The same verb-particle combination is used literally about chasing away disturbing
insects such as mosquitos (slå bort myggen) with sweeping motions of arm and
hand.
(62) Övervägde ett ögonblick att äta frukost men slog bort tanken. (SW)
     For a moment he considered having breakfast, but he dismissed the thought.
There are two phrasal combinations with slå that are relatively frequent in the
ESPC, especially in the non-fiction texts, viz. slå fast and slå vakt om. The phrase
slå fast means literally 'fasten by hitting'. As a mental metaphor it refers to
forming a decision that one sticks to. A number of different equivalents are used,
such as establish, specify, state:
(63) Jag tycker också att man här borde ha tagit chansen att slå fast att parlamentets ordförande skall utses på fem år [...] (ESJO)
     In my view, we should have used this opportunity to establish that the President of Parliament should be elected for five years [...]
The phrasal combination slå vakt om (lit. 'strike guard of') is not transparent in
present-day Swedish. The most frequent equivalents are safeguard and protect:
(64) Det är friheten som vi skall slå vakt om, inte regleringen. (ECED)
     It is the freedom we should safeguard, not the rules.
(65) Det är inte så konstigt att vi lundabor envist slår vakt om vår stads särdrag och om dess lagomhet. (LI)
     It is not surprising that we citizens of Lund stubbornly protect our town's special qualities and its moderation.
To sum up, the Swedish verb slå has an extensive pattern of polysemy
comprising a number of senses that are motivated at various experiential levels
presented above in Table 4. Among these, the sensorimotor level plays a
conspicuous part since many extended meanings are motivated by the fact that slå
is a hand action verb. A similar motivation is found for several of the meaning
extensions of another frequent and polysemous hand action verb in Swedish,
namely dra 'pull' (Viberg 1996). There are also many extended meanings that
can be regarded as cases of resultative strengthening.
4. Conclusion
The present paper is relatively data-oriented and an account has been given of a
rather large number of cases where English and Swedish contrast. However, an
attempt has also been made to characterize the contrasts between the two
languages in general terms based on two different frameworks. With respect to
the conceptual representation, Swedish slå is grounded more firmly in
sensorimotor experience of limb movement than strike, hit and beat, even if
sensorimotor experience plays an important role also for the conceptualization of
the English verbs. At a general level, the extensions of the major verb of hitting to
other types of hand action probably represent a universal tendency. The polysemy
of the Chinese equivalent dǎ 'hit' is to a great extent motivated by the fact that
the prototypical meaning refers to hand action according to Gao (2001).
However, a comparison at a more detailed level with Swedish slå shows that
there appears to be great variation with respect to the specific hand actions (out of
the many potential ones) that are conventionally associated with the verb whose
prototypical meaning is 'hit'.
With respect to the process of word sense identification, there is also a
general tendency. In both English and Swedish, there are many types of linguistic
disambiguation cues. It appears, however, that the major equivalents of strike and
hit can be identified with the help of the semantic class of the subject, whereas the
semantic class of the subject is helpful in fewer cases in Swedish due to the
relative dominance of human subjects of slå. The semantic class of the object, on
the other hand, is utilized as a cue to distinguish a rather great number of senses
of slå and appears to be more important for slå than it is for hit, strike and beat.
The relative importance of various types of cues varies a great deal within a
language depending on the type of lexical item. The major meanings of Swedish
få 'get; may' such as Possession, Modal, Causative can be identified with the help
of the syntactic frame (or construction), whereas the subtle but important contrast
between the two modal meanings Permission and Obligation is identified
primarily with the help of pragmatic factors (Viberg 2002).
The semantic class of the subject and object referred to in this paper can be
compared to the notion of local context (Miller and Leacock 2000; see the
introduction). To a large extent it will be available within such a narrow window
as ±2 words and is local in that sense. The concept of argument structure of which
subject and object form a central part is, however, different from simple
co-occurrence. In a lexical study, it appears to be justified to provide the more
structured information even if it is still an open question exactly how this
information is used by human or machine. As has been exemplified several times
in this paper, topical and pragmatic information will be needed in many cases to
reach the correct interpretation.
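
The difference between a narrow co-occurrence window and argument structure can be made concrete with a minimal sketch. It assumes naive whitespace tokenization and a window of two tokens on either side of the verb; the function name local_context is hypothetical, and the sentence is the Swedish original of example (51).

```python
def local_context(tokens, index, window=2):
    """Return the tokens within +/- `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left, right


# Swedish original of example (51), tokenized naively on whitespace.
tokens = ("Hon låste upp bilen och slog numret till kontoret "
          "i Ystad på biltelefonen").split()
target = tokens.index("slog")
left, right = local_context(tokens, target)

# The window is an unstructured list of neighbours: it happens to contain
# the object noun, but nothing marks it as the object of the verb.
print(left, right)
```

In this sentence the object numret falls inside the window, whereas the subject Hon does not, and the window itself does not say which neighbour fills which role. That is the sense in which argument structure provides more structured information than simple co-occurrence.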
The comparison of Swedish and English has turned up many differences in
semantic structure in spite of the fact that the two languages are rather closely
related. As a matter of fact, most of the verbs treated in this paper have cognates
in the other language: slå - slay, strike - stryka 'stroke', hit - hitta 'find (a
concrete object)'. However, on each point where a contrast is found, it remains an
open question whether Swedish or English exhibits a language-specific pattern,
and on points where the languages are similar, it is an open question whether this
reflects a universal tendency or is due to the close genetic relatedness of Swedish
and English. To answer this type of question, more languages must be brought
into the comparison. Some data of this type have already been analyzed in a
restricted pilot corpus consisting of translations of Swedish originals into four
other languages. A simple example is presented in Table 6.
Table 6. Translations of Swedish originals into four other European languages

| Swedish | English | German | French | Finnish |
|---|---|---|---|---|
| Mor slog far i ansiktet (IB) | She struck him in the face | Mutter schlug Vater ins Gesicht | Mère a frappé père au visage | Äiti löi isää kasvoihin |
| Det slog honom att hon visste allting om honom. (KE) | It struck him that she knew everything about him | Ihm ging durch den Kopf, daß sie alles über ihn wußte | L'idée le frappa qu'elle savait tout de lui. | Johan tajusi että Gudrun tiesi hänestä kaiken |
| Åke slog upp dörren. (KE) | Åke flung open the door. | Åke riß die Tür auf. | Åke ouvrit la porte. | Åke avasi oven. |
| Han slog halva bägarn full (MF) | He poured out half a beaker | Er goß den Becher halbvoll | Il remplit le gobelet à moitié | Hän kaatoi puolillaan olevan maljan täyteen |

As can be seen, the extension of slå into the mental domain (it struck him that-S)
has a parallel in French in addition to English, whereas the extensions to meanings
such as opening and pouring appear to be language-specific characteristics of
Swedish in spite of the fact that they represent natural extensions from the
prototypical conceptual representation of Swedish slå. To be able to say what is
universal, languages that are genetically and geographically more distant from
Swedish must be taken into consideration, but as already mentioned certain types
of extension such as the extension from hitting to various other hand actions have
parallels in non-European languages such as Chinese.
Note
1. In the following corpus examples the original text is placed first. For an
explanation of the text codes, see
http://www.englund.lu.se/research/corpus/corpus/webtexts.html.
References
Aijmer, K., B. Altenberg and M. Johansson (1996), Text-based contrastive
studies in English. Presentation of a project, in: K. Aijmer, B. Altenberg
and M. Johansson (eds), Languages in contrast. Papers from a symposium
on text-based cross-linguistic studies. Lund: Lund University Press. 73-85.
Altenberg, B. and K. Aijmer (2000), The English-Swedish Parallel Corpus: A
resource for contrastive research and translation studies, in: C. Mair and
M. Hundt (eds), Corpus linguistics and linguistic theory. Amsterdam and
Atlanta: Rodopi. 15-33.
Bailey, D.R. (1997), When Push comes to Shove: A computational model of the
role of motor control in the acquisition of action verbs. PhD dissertation,
Computer Science Division, EECS Department, University of California,
Berkeley.
Damasio, A.R. and D. Tranel (1993), Nouns and verbs are retrieved with
differently distributed neural systems. Proceedings of The National
Academy of Sciences 90, 4757-4760.
Gao, Hong (2001), The physical foundation of the patterning of physical action
verbs. A study of Chinese verbs. [Travaux de l'institut de linguistique de
Lund XLI]. PhD dissertation, Department of Linguistics, University of
Lund.
Johansson, S. (1998), On the role of corpora in cross-linguistic research, in: S.
Johansson and S. Oksefjell (eds), Corpora and cross-linguistic research.
Theory, method, and case studies. Amsterdam: Rodopi. 3-24.
Kilgarriff, A. and D. Tugwell (2002), Sketching words, in: M.-H. Corréard
(ed.), Lexicography and natural language processing. A festschrift in
honour of B.T.S. Atkins, 125-137. Distribution: EURALEX
http://www.ims.uni-stuttgart.de/euralex/
Lakoff, G. and M. Johnson (1980), Metaphors we live by. Chicago: University of
Chicago Press.
Lakoff, G. and M. Johnson (1999), Philosophy in the flesh. The embodied mind
and its challenge to western thought. New York: Basic Books.
Leslie, A. (1994), ToMM, ToBY, and agency: Core architecture and domain
specificity, in: L. Hirschfeld and S. Gelman (eds), Mapping the mind.
Domain specificity in cognition and culture. Cambridge: Cambridge
University Press.
Michotte, A. (1963), The perception of causality. London: Methuen. (Original in
French 1946.)
Miller, G.A. and C. Leacock (2000), Lexical representations for sentence
processing, in: Y. Ravin and C. Leacock (eds), Polysemy. Theoretical and
computational approaches. Oxford: Oxford University Press. 152-160.
Taylor, J. (1989), Linguistic categorization: prototypes in linguistic theory.
Oxford: Oxford University Press.
Viberg, Å. (1985), Hel och trasig. En skiss av några verbala semantiska fält i
svenskan, in: Svenskans beskrivning 15: 529-554. Göteborg: Göteborgs
universitet.
Viberg, Å. (1996), The meanings of Swedish dra 'pull': a case study of lexical
polysemy. EURALEX'96. Proceedings. Part I, 293-308. Department of
Swedish, University of Göteborg.
Viberg, Å. (1999), Polysemy and differentiation in the lexicon. Verbs of physical
contact in Swedish, in: J. Allwood and P. Gärdenfors (eds), Cognitive
semantics. Meaning and cognition. Amsterdam: Benjamins. 87-129.
Viberg, Å. (2002), Polysemy and disambiguation cues across languages. The
case of Swedish få and English get, in: B. Altenberg and S. Granger (eds),
Lexis in contrast. Amsterdam: Benjamins. 119-150.
Exploring theme contrastively: the choice of model
Anna-Lena Fredriksson
Göteborg University
Abstract
The aims of this paper are to discuss different approaches to the notion of theme
and to show how parallel corpora can successfully be used for cross-linguistic
analyses of theme.1 The realisation of theme is language-specific, which can be
problematic for contrastive studies of thematic structures. In this paper, I start by
describing theme in English following Systemic Functional Grammar (Halliday
1994) and discuss questions concerning the delimitation of the theme from the
rheme in English, which is relevant also for monolingual and cross-linguistic
studies. In a brief overview of various approaches to theme in other languages,
monolingual as well as cross-linguistic, I then demonstrate that the positions
taken to theme differ and the original approach, which is English-based, may
have to be modified to suit other languages simply because different languages
have different ways of realising this function.
1. Introduction
Parallel corpora offer great possibilities for contrastive text analysis.2 In recent
years studies have covered a variety of features in the languages involved and
often combined a syntactic and a textual feature. Studies have for example
focussed on the thematic uses of non-referential there in English-Finnish texts
(Mauranen 1999), sentence openings and textual progression (English-Swedish)
(Svensson 2000), connectors and sentence openings (English-Swedish)
(Altenberg 1998), word order and thematic structure in English and Norwegian
(Hasselgård 1998, 2000), and thematic development in English-German texts
(Ventola 1995). To my knowledge, Ghadessy and Gao (2001), investigating
English and Chinese, is the only purely quantitative study of thematic
development in parallel texts. The usefulness of this kind of research for
translators and translator training as well as for machine translation is often
stressed.
The present paper originates in problems that I have encountered in my
ongoing thesis work on passives from a corpus-based contrastive English-
Swedish perspective. It is well-known that the passive is a multifunctional
structure that provides a useful way of omitting the agentive subject where it can
be ignored, or of postponing an agentive subject by making it the agent in cases
where we want to give it end focus. At the same time, it gives thematic status to
the affected entity (cf. Svartvik 1966, Granger 1983, Quirk et al. 1985: 1390f.,
Péry-Woodley 1991, Teleman et al. 1999: 4: 379ff. among others). Such
operations facilitate a smooth development of the text. Its important role in text
organisation gives rise to the question of how passive sentences in original texts
are treated by translators. To what extent is the thematic structure preserved or
altered in translation? Baker points out that "[r]endering a passive structure by an
active structure, or conversely an active structure by a passive structure in
translation can affect the amount of information given in the clause, the linear
arrangement of semantic elements such as agent and affected entity, and the focus
of the message" (1992: 106). But how can we compare thematic structure across
languages?
Due to the simple fact that language systems and their realisations differ,
difficulties often arise when we want to study text structure across languages. We
can assume that in all languages the clause has some kind of text-related
organisation, and we can acknowledge theme and rheme as basic notions for the
organisation of the message presented in clauses. However, the realisation of
these notions may be specific to each language (e.g. Fries 1995a: 15). Even in
English and Swedish, which are both SVO languages, it is sometimes difficult to
determine which elements are to be considered thematic. Consider (1):
(1) (a) EO: Recently, some £2 billion has been invested in the area; (SUG1)3
(b) ST: Nyligen har ca 2 miljarder pund investerats i Docklands;
Lit: Recently has approximately £2 billion invested-PAST-PASS in
Docklands.
In the Swedish translation (1b) the finite operator precedes the subject. The
inversion occurs because Swedish, like many other Germanic languages, is a
verb-second (V2) language which requires the verb to occupy second position in
declarative main clauses. Consequently, each time a non-subject occurs in initial
position, subject-predicate inversion takes place. Such a typological difference
may influence the choice of model for a thematic analysis.
In cross-language research we need descriptions of the way languages
organise the clause thematically and syntactically, and from there we may
proceed to finding a model of analysis that fits the languages compared. The
present paper discusses the theme-rheme system within Systemic Functional
Grammar (SFG) (Halliday 1967, 1994) which provides a much used model for
thematic analysis in English. Despite the fact that SFG has a strong orientation
towards English which is a potential problem for using it in other languages, the
theory has had considerable influence on translation theorists and on translation
studies of various kinds (cf. Hatim and Mason 1990, 1997, Baker 1992, House
1997, Steiner 2001, Teich 2001), and it has been applied to a variety of
languages. The main focus of this paper is on cross-linguistic descriptions of the
theme-rheme structure. How has the theme been interpreted, defined, and
delimited from the rheme in various languages? Can the notion of theme be
modified for contrastive purposes? I will show that studies of this kind need to be
corpus-based, and that parallel corpora prove useful for describing the theme-
rheme structure both monolingually and contrastively.
The paper is organised as follows. Section 2 gives a presentation of the
concept theme in English following Halliday (1994) and also discusses how far
into the clause the theme reaches. Section 3 contains a brief overview of some
approaches to theme in other languages, and Section 4 discusses different models
used in cross-linguistic theme-rheme analysis. Concluding remarks are given in
Section 5.
2. Theme in English
As explained above, SFG identifies two textual units in the clause in English: the
theme and the rheme, which appear in the clause in that order.4 The theme can be
described positionally and functionally. Basically, the theme can be identified by
its initial position in the clause. Functionally, Halliday defines the theme as "[t]he
element which serves as the point of departure of the message; it is that with
which the clause is concerned" (1994: 37). In other words, "[i]t is the element the
speaker selects for grounding what he is going to say" (Halliday 1994: 34).
Although thematic structure and information structure (Given-New) are
separate notions in SFG, there is a strong correlation between them, and we may
say that the theme typically contains information that is contextually or otherwise
retrievable (given information) (Halliday 1994: 299). The rheme, on the other
hand, consists of that which the speaker says about the theme. In terms of
newsworthiness, the rheme typically has a higher degree of newsworthiness than
the theme. The notion of theme is connected with the mood system in that the
choice of theme depends on the choice of mood. For example, in the unmarked
case in declaratives, the theme is conflated with the subject as in (2):
(2) EO: We [Exp-Th/Pa] had never seen builders work like this. Everything
[Exp-Th/Pa] was done on the double: scaffolding [Exp-Th/Pa] was erected and a
ramp of planks [Exp-Th/Pa] was built before the sun was fully up, the
kitchen window and sink [Exp-Th/Pa] disappeared minutes later [...] (PM1)5
Every unit given in bold in (2) is an unmarked theme. The concept of markedness
can be understood as a scale on which an unmarked theme is the option
representing the most typical choice in terms of probability and frequency of
usage. An unmarked theme is placed at one end of the scale and the further we
move away from the unmarked option(s), the more marked the choice is.
According to Halliday, the most marked theme of a declarative clause functions
as complement as in (my emphasis and notation) A bag-pudding [Exp-Th/Pa] the
King did make (Halliday 1994: 44). At an intermediate position we find clause-
initial circumstantial adjuncts (adverbial groups and prepositional phrases) which
make up the entire theme:
(3) EO: A few months later [Exp-Th/C] Henry was called in to Detroit again [...]
(RL1)
The themes we have seen so far are all experiential themes denoting
participants or circumstantial phenomena. This theme type belongs within the
experiential metafunction which constitutes one of the three metafunctions of
language according to Halliday. The other two are the interpersonal metafunction
and the textual metafunction, both of which may also contribute to forming a
theme. According to Halliday (1994: 52ff.), the theme always includes one and
only one experiential element, which is called the topical theme, but this item
may be preceded by one or several textual and/or interpersonal elements resulting
in a multiple theme. Figure 1 illustrates an extended multiple theme in English
with subtypes of the textual and interpersonal components.
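| well | but | then | Ann | surely | wouldn't | the best idea | be to join the group |
|---|---|---|---|---|---|---|---|
| continuative | structural | conjunctive | vocative | modal | finite | topical | |
| textual | textual | textual | interpersonal | interpersonal | interpersonal | experiential | |
| Theme | Theme | Theme | Theme | Theme | Theme | Theme | Rheme |

Figure 1. Extended multiple theme (Halliday 1994: 55).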
What are the principles behind this stacking of thematic items? First, some textual
and interpersonal elements (e.g. connectors, modal adjuncts, and relative
pronouns) regularly take clause-initial position, and because of this their thematic
status is somewhat attenuated (Halliday 1994: 52). Second, their overall
function can be regarded as orienting (cf. Gómez-González 1998: 83, Mauranen
1993) and as a consequence it is difficult to say that they express what the clause
is about. Therefore, when such elements occur in initial position, they do not
exhaust the thematic potential of the clause but allow a referential element to be
part of the theme. According to Halliday, the unmarked order of components
within the structure of a multiple theme is textual < interpersonal <
experiential/topical. While the experiential element typically comes last in the
theme and constitutes topical theme, the order of the textual and interpersonal
components may be switched. Finally, everything that follows the topical theme
constitutes the rheme. Example (4) illustrates a multiple theme of a more modest
length than that in Figure 1:
(4) EO: Unfortunately [Int-Th/Mo], part two of the lecture (Why The Earth Is
Becoming Flatter) [Exp-Th/Pa] was interrupted by a crack of another burst
pipe, and [Txt-Th/St] my education [Exp-Th/Pa] was put aside for some virtuoso
work with the blow-lamp. (PM1)
Here the modal adjunct Unfortunately is an interpersonal theme which precedes
the topical/experiential theme part two of the lecture (Why The Earth Is Becoming
Flatter) which is also the subject. Further, the conjunction and is a textual theme
preceding the topical my education.
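The Hallidayan delimitation rule described above (everything up to and including the first experiential element is theme, the remainder is rheme) can be sketched in a few lines of Python. This is only an illustration and not part of the analysis reported here: the function name hallidayan_theme and the representation of a clause as hand-tagged (text, metafunction) pairs are assumptions made for the example, and the tags have to be supplied by the analyst.

```python
from typing import List, Tuple

# A clause is a list of (text, metafunction) pairs. The metafunction is one of
# "textual", "interpersonal", "experiential" or "other". Tags are assigned by
# hand; nothing here performs automatic parsing.
Element = Tuple[str, str]


def hallidayan_theme(clause: List[Element]) -> Tuple[List[Element], List[Element]]:
    """Split a clause into (theme, rheme) at the first experiential element."""
    for index, (_, metafunction) in enumerate(clause):
        if metafunction == "experiential":
            # The first experiential element is the topical theme and
            # closes the theme; everything after it is rheme.
            return clause[: index + 1], clause[index + 1 :]
    # Assumption of this sketch: with no experiential element, the theme is
    # left empty and the whole clause is returned as rheme.
    return [], clause


# First clause of example (4), tagged by hand.
example_4 = [
    ("Unfortunately", "interpersonal"),
    ("part two of the lecture", "experiential"),
    ("was interrupted by a crack of another burst pipe", "other"),
]

theme, rheme = hallidayan_theme(example_4)
print([text for text, _ in theme])
print([text for text, _ in rheme])
```

For the first clause of (4) the theme comes out as the modal adjunct plus the subject, which matches the analysis given above.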
As we have seen, multiple themes come in slightly different shapes, which
opens the question of where the transition between theme and rheme takes place.6
Matthiessen suggests that the boundary of the theme be moved. Consider (5)
(adapted from Matthiessen 1992: 51):
(5) A. "Do you mean we're overdressed?" said the charming father of the
Family.
B. [Place:] In England, [Time:] at this moment, [Purpose:] for this
occasion, [Participant:] we would be quite over-dressed.
The beginning of (5B) has a number of experiential adjuncts, of which, in
Halliday's approach, only the first element, Place, counts as theme since it is
the first experiential element and thereby topical theme. However, according to
Matthiessen this is a complex theme consisting of three circumstantial elements
and a participant, and all of them are important for the thematic perspective.
There is a continuum in that the thematic prominence of the clause gradually
decreases as the clause unfolds (Matthiessen 1992: 51). We may then ask
whether there is a clear cut-off point between theme and rheme. If there is, where
is it best placed (cf. also Fries 1995a: 14)? As we have seen, Halliday argues that
there is always one, and only one, experiential element in the theme, and the
theme ends after this element.7 However, several researchers have suggested a
modification of the theme to include more than one experiential element.
Downing (1991) argues that initial circumstantial elements such as
temporal and spatial adverbials do not always express what the clause is about,
and should therefore not receive the status of topical themes. Thus, in (6) the
second experiential element, i.e. Freud, is part of the theme as well:
(6) Towards the end of his life [Exp-Th/C], Freud [Exp-Th/Pa] concluded that he
was not a great man (Downing 1991: 127).
Downing's approach is used also by Svensson (2000) in a corpus-based
contrastive study on sentence openings in Swedish and English.
While Halliday allows topical themes to be preceded but not followed by
textual and/or interpersonal elements, Gómez-González (1998, 2001), working
with spoken data from the Lancaster/IBM Spoken English Corpus, allows topical
themes to be both preceded and followed by such elements (Gómez-González
1998: 85). The structure in which this may occur is called Extended multiple
Theme. Example (7), which is an instance of this type of theme, has an
experiential theme which is followed by a modal adjunct as interpersonal theme
(Gómez-González 1998: 85):8
(7) This of course was not because the government failed in its supposed duty
as provider but largely because energy prices rose considerably in relation
to other prices
Further, just as there may be more than one textual and/or interpersonal item in a
multiple theme, an Extended multiple Theme may contain not only one but
several experiential elements, marked or unmarked, resulting in complex topical
themes.
It is important to consider the significance of the theme in the overall
development of the text. A number of studies (e.g. Francis 1989, Fries 1983) have
shown that the theme plays an important role in the organisation of discourse, or
as Halliday puts it, "[t]he choice of Theme, clause by clause, is what carries
forward the development of the text as a whole" (1994: 336). As shown by Daneš
(1974) the thematic progression (or "method of development", Fries 1983) of a
piece of text tends to follow certain identifiable patterns. Thus, this discourse
perspective supports Matthiessen's (1992) proposal for an extension of theme.
Consider (8) from Matthiessen (1992: 51):
(8) Autumn passed and winter [passed], and in the spring the Boy went out
to play in the wood. While he was playing, two rabbits crept out from the
bracken and peeped at him.
The third theme in (8), in the spring, is a circumstantial temporal theme. In
contrast to the first two themes, it does not also serve as subject. Instead it is
followed by the subject the Boy which is not thematic according to Halliday.
Matthiessen argues that "[y]et the Subject still seems to have some thematic
value: it introduces the Boy as theme, which is then retained as theme in the
subsequent clause (while he was playing)" (1992: 52). Hence, this subject is
relevant for the thematic development of the text. Rose (2001: 126f.) argues along
similar lines emphasising that circumstances and participants contribute in
different ways to the thematic progression of a text: circumstances to the staging
of sequences and participants to creating identity chains, and both should be
identified as theme. A theme may of course refer to any element in a previous
clause, regardless of whether this element occurs in the theme or in the rheme.
This is also shown by the various thematic patterns discussed by Daneš. Still,
attested examples supporting Matthiessen's and Rose's point are not hard to find.
The following examples (9-10) from the ESPC may serve as illustration:
(9) EO: The Pope himself probably survived only because he isolated himself
from everybody else in his huge palace. I suppose isolation was a very
natural impulse. Everywhere in Europe [Exp-Th/C] people [Exp-Th/Pa] resorted
to it, whether [Txt-Th/St] they [Exp-Th/Pa] were noblemen or priests or
intellectuals or ordinary peasants. (ABR1)
The topical theme of the subclause (they) has the same referent as the second
theme of the main clause (people), and following Matthiessen the latter is part of
a complex theme, whereas Halliday has it as part of the rheme. Example (10)
starts with a multiple theme consisting of one interpersonal and one experiential
component. Here we find the first mention of the participant I in this stretch of
text. The next sentence has a complex experiential theme in the first clause (The
next morning and I) and I is taken up as theme both in the subclause and in the
subsequent sentences:
(10) EO: With regret [Int-Th/Mo] I [Exp-Th/Pa] put the diary into my other trouser
pocket. The next morning [Exp-Th/C] I [Exp-Th/Pa] supposed, I [Exp-Th/Pa] would
have to telephone his office with the dire news. I [Exp-Th/Pa] couldn't
forewarn anyone as I [Exp-Th/Pa] didn't know the names, let alone the phone
numbers, of the people who worked for him. I [Exp-Th/Pa] knew only that he
had no partners, as he had said several times that the only way he could
run his business was by himself. (DF1)
As we have seen, the proposals for a change in the linear extension of
theme in the clause seem to be justified. It should however be kept in mind that
the various interpretations of theme we have looked at so far are based on
English. When we turn to other languages it becomes obvious that the SFG
approach sometimes creates problems. This is reflected in the different
approaches to theme presented in monolingual and contrastive studies.
3. Theme in other languages
Again, a safe starting point seems to be to assume that in all languages the clause
has some kind of text-related organisation. The concept of theme is thought of as
a language universal, meaning that there is always one unit expressing what the
clause is concerned with (or is about), and one unit, the rheme, saying
something about the other unit. The realisation of the theme, however, is
language-specific: in English it is realised by initial position, whereas in Japanese
for example, it is expressed by the postposition particle wa (Halliday 1994: 36f.;
see also Rose 2001). Basically then, theme can be viewed from at least two
different angles: from its functional definition and from its realisation.
In their account of theme in Danish, Andersen et al. (2001) find that both
the aboutness aspect and the position aspect apply to Danish in the same way as
they apply to English. In other words, theme represents the point of departure of
the clause as message and all theme types occur in clause-initial position.
However, the Danish system of theme differs radically from the English system
in at least one respect: no distinction is made between topical and interpersonal
theme since it is found that a theme may consist of interpersonal information
only. Consider the following examples taken from Andersen et al. (2001: 175f.):
(11) (a) Han [Exp-Th/Pa] kommer måske.
Lit: 'He comes maybe.' 'Maybe he is coming.'
(b) Måske [Int-Th/Mo] kommer han.
Lit: 'Maybe comes he.' 'Maybe he is coming.'
(c) Kommer [Exp-Th/Pr] han?
Lit: 'Comes he?' 'Is he coming?'
(d) Vil [Int-Th/Fi] han komme?
Lit: 'Will he come?' 'Is he coming?'
Being experiential in meaning, the themes in examples (11a) and (11c) are
analogous with English themes. The difference between (11a) and (11b) is that
the latter has a fronted modal adjunct which is interpersonal in meaning, and in
contrast to any English model, this element can and does make up the whole
theme. A further contrast here is that the subject is placed in postverbal position
in accordance with the V2 constraint. Example (11d) is another instance in which
the theme, here the finite operator, is primarily interpersonal and forms the
entire theme (Andersen et al. 2001: 177). A multiple theme in Danish may
encompass textual items followed by an interpersonal or experiential item.
Andersen et al. follow the initial position criterion and describe how theme is
realised in Danish in different clause types, but do not further discuss the
functional definition.
Steiner and Ramm (1995) offer an account of theme in German, also a V2
language, in which they establish a close connection between theme and the
traditional notion of Vorfeld in German grammar, and as a consequence there is
no stipulation that there is always an ideational element in the theme (1995: 62).
They find that a simple theme may consist of a constituent from either the textual,
the interpersonal, or the experiential metafunction. The theme in (12) can be
either textual (trotzdem), or interpersonal (vielleicht) (1995: 81):
(12) Trotzdem [Txt-Th/Cj]/vielleicht [Int-Th/Mo] haben wir eine grosse Aufgabe.
Lit: 'Nevertheless/possibly have we a big task.' 'Nevertheless/possibly we
have a big task.'
However, it is doubtful whether we can say that textual and interpersonal items
such as trotzdem and vielleicht, and the interpersonal måske in Danish, express
what the clause is about, or "that with which the clause is concerned" in
Halliday's (1994: 37) wording. Rather, they only serve an orienting function
(Gómez-González 1998: 83, Mauranen 1993).
4. Theme from a contrastive perspective
There is no doubt that a parallel corpus may provide data for modelling a way of
analysing theme-rheme structures contrastively. The data obtained often reveal
both the strengths and the weaknesses of the model one is using. Since thematic
structure is clearly discourse-related, it is crucial that the model is tested on
corpus texts. If our model is constructed and tested on intuition or a theoretical
basis only, we run the risk of discovering that it cannot account for a number of
phenomena that occur in natural language.
In my own case, the starting point was Halliday's model, which I applied
to Swedish in order to find out whether it could be used for a contrastive analysis
of the passive. However, the V2 requirement in Swedish gives rise to a different
distribution of elements in cases with a fronted non-subject, and it was not clear
how this could best be dealt with:
(13) (a) EO: Surely [Int-Th/Mo] I [Exp-Th/Pa] 'd been freed from those painful
memories long ago. (ABR1)
(b) ST: Visst [Int-Th/Mo] hade jag [Exp-Th/Pa] för länge sedan blivit befriad från
de där plågsamma minnena.
Lit: Surely had I for long ago become freed from those painful
memories.
Example (13b) shows that the second thematic element of the English text in
(13a) has been postponed to post-auxiliary position. The question is then: where
does the theme end and the rheme begin? As we have seen, Andersen et al.
(2001), as well as Steiner and Ramm (1995), interpret only the interpersonal
modal adjunct as theme in cases like (13b). In many other situations English and
Swedish behave in similar ways, but still the English model is not ideal for an
English-Swedish contrastive analysis. Clearly, a model developed for one
language is not necessarily applicable to another one. A number of researchers
have in fact pointed to the difficulties of finding models that can be used for
contrastive analyses and in the remainder of this section we will look at a few
corpus-based alternative solutions that modify the English definition of theme.
Mauranen (1999), who has investigated English and Finnish on the basis
of a parallel corpus, suggests a model consisting of an orienting theme realised
by fronted material, e.g. connectors and adverbials, and a topical theme realised
by nominal groups (Finnish) and a subject (English) (Mauranen 1999: 72):
(14) (a) In our culture there is no such moment.
(b) Omassa kulttuurissamme tällaista hetkeä ei ole.
Lit: own in-our-culture this-kind moment not exists.
In this model, the cut-off point between the theme and the rheme is placed before
the verb, and the rheme hence contains the verb plus optional constituents.
Despite the fact that English and Finnish are typologically different in many
ways, a cross-linguistic comparison of thematic structure is possible (see also
Mauranen 1993).
English and Norwegian (and Swedish) are more closely related than
English and Finnish. Nevertheless, Hasselgård (1998, 2000) observes difficulties
in applying the SFG model of theme for comparing English and Norwegian, and
has used different definitions of theme. The crucial point is again the V2
constraint requiring the finite verb to occur in second position. The basic
definition in Hasselgård (1998) includes in the theme the initial part of the
sentence up to and including the first experiential constituent. However, since the
finite verb is by default the second constituent, each time a non-subject occurs in
initial position a choice is made to regard this finite verb as a structural theme [a
subtype of textual theme], so that in cases where the fronted non-subject is a
conjunct or a disjunct adverbial, the theme will include the first experiential
element after the finite verb (1998: 148). This is seen as analogous with the
thematic structure of polar interrogatives in Halliday which have a two-part (i.e. a
multiple) theme consisting of the finite verb followed by the subject (Halliday
1994: 46): Is anybody at home? and Can you find me an acre of land? However,
an objection can be raised against this identification of theme, since it may result
in clauses consisting of only a theme and no rheme, as in (15):
(15) (a) SO: Förmodligen [Int-Th/Mo] går [Txt-Th/Str] vi [Exp-Th/Pa] under [Txt-Th/Str].
(BL1)
Lit: Probably go we under.
(b) ET: Must expect we will go under.
The process in (15a) consists of the phrasal verb gå under 'go under' which is to
be treated as a lexical unit, and the theme hence extends over the whole clause.
An alternative approach is to disregard word order differences between the
languages and adhere to the strict Hallidayan definition taking the first
experiential element as the topical theme (Hasselgård 2000). The data, taken from
the English-Norwegian Parallel Corpus, show clearly the differences in the
structure of themes that this approach results in. Consider (16) and (17) from
Hasselgrd (2000: 15):
(16) (a) Of course [Int-Th/Mo] I [Exp-Th/Pa] would return.
(b) Selvfølgelig [Int-Th/Mo] skulle [Int-Th/Fi] jeg [Exp-Th/Pa] vende tilbake.
Lit: Of course would I return.
(17) (a) But [Txt-Th/Str] first [Txt-Th/Cj] I [Exp-Th/Pa] needed this brief withdrawal.
(b) Men [Txt-Th/Str] først [Txt-Th/Cj] trengte [Exp-Th/Pr] jeg denne kortvarige
ensomheten.
Lit: But first needed I this brief withdrawal.
The result is a higher frequency of processes (finite/predicator) as
experiential/topical theme in Norwegian than in English as a consequence of the
V2 constraint. English, on the other hand, more often has a participant subject in
the first experiential slot. For practical purposes, this model of theme might be
very useful, since it needs no modifications. The analyst only has to keep track of
the changes that occur within multiple themes across the languages.
What may be considered a disadvantage of this approach is connected with
the relation between information structure and thematic structure. We may
assume that the subject typically conveys Given information and the predicator
typically New information, and that the unmarked order of these components is
Given before New. Moreover, in the unmarked case, a speaker will choose the
Theme from within what is Given and locate the focus, the climax of the New,
somewhere within the rheme (Halliday 1994: 299). Having a process/verb
typically conveying New information in thematic position is therefore counter-
intuitive. New information in the theme does indeed occur (cf. Fries 1983,
1995b), but is seen as a marked alternative in English. On the other hand, as
Hasselgård points out, word order is not open to speaker choice in this case but is
governed by syntactic rules, and the subject-predicator inversion is not likely to
have any major consequences on the overall thematic structure or information
structure of a text.
An approach similar to that of Hasselgård (2000) is taken by McCabe
(1999) in a contrastive analysis of thematic patterns in English and Spanish
history texts. She counts as thematic everything up to and including the first
experiential element encountered in the clause. As in English, theme in Spanish is
realised by clause-initial position. Because VSO word order is permitted in
Spanish, an unmarked theme can also be realised by a process, creating a pattern
of theme that is different from English.
Teich (2001) draws partly on Steiner and Ramm's account of theme in
German (see Section 3) in her corpus-based English-German analysis. The
English theme is analysed according to the original SFG model, but, due to the
V2 constraint in German, the German theme is equated with Vorfeld which
incorporates anything which comes before the finite verb. Hence, only elements
occurring before the finite verb are seen as thematic. Excluding the finite
auxiliary from the theme when it occurs before the first experiential element (as
in (19b)) deviates from the Hallidayan model. These definitions result in themes
as in (18) and (19) (Teich 2001: 202):
(18) (a) But [Txt-Th/St] he [Exp-Th/Pa] couldn't say so.
(b) Aber [Txt-Th/St] er [Exp-Th/Pa] konnte nicht nein sagen.
Lit: But he could not no say.
(19) (a) Nonetheless [Txt-Th/Cj] he [Exp-Th/Pa] couldn't say so.
(b) Trotzdem [Txt-Th/Cj] konnte er nicht nein sagen.
Lit: Nonetheless could he not no say.
The results show different theme patterns in English and German. In (19b) a
textual adjunct forms the entire theme, whereas in English (19a) there is a
multiple theme with a textual adjunct and a subject. In contrast to some other
contrastive approaches discussed here, Teich, like McCabe, does not attempt to
find one model that fits both languages, but chooses to use two different
interpretations of theme. Since theme in German is realised differently from
theme in English, two different definitions are used.
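The difference between the two definitions can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not Teich's procedure: the Constituent type, the function names vorfeld_theme and first_experiential_theme, and the hand-assigned tags for (19a) and (19b) are all introduced here for the example only.

```python
from typing import List, NamedTuple


class Constituent(NamedTuple):
    text: str
    metafunction: str      # "textual", "interpersonal", "experiential" or "other"
    finite: bool = False   # True for the finite verb


def vorfeld_theme(clause: List[Constituent]) -> List[str]:
    """Theme = everything before the finite verb (the definition used for German)."""
    theme = []
    for constituent in clause:
        if constituent.finite:
            break
        theme.append(constituent.text)
    return theme


def first_experiential_theme(clause: List[Constituent]) -> List[str]:
    """Theme = everything up to and including the first experiential element."""
    theme = []
    for constituent in clause:
        theme.append(constituent.text)
        if constituent.metafunction == "experiential":
            break
    return theme


# Examples (19a) and (19b), tagged by hand.
english_19a = [
    Constituent("Nonetheless", "textual"),
    Constituent("he", "experiential"),
    Constituent("couldn't", "interpersonal", finite=True),
    Constituent("say so", "other"),
]
german_19b = [
    Constituent("Trotzdem", "textual"),
    Constituent("konnte", "interpersonal", finite=True),
    Constituent("er", "experiential"),
    Constituent("nicht nein sagen", "other"),
]

print(first_experiential_theme(english_19a))  # English definition on (19a)
print(vorfeld_theme(german_19b))              # German definition on (19b)
print(first_experiential_theme(german_19b))   # English definition on (19b)
```

Applying the first-experiential rule to (19b) as well would pull the finite verb and the subject into the theme, which is exactly the point at which the Vorfeld-based definition departs from the Hallidayan model.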
Finally, I will suggest yet another approach to cross-linguistic analysis of
thematic structure that seems useful for English-Swedish comparisons. It has
been my aim to find a model of analysis that works reasonably well for both
languages. For this reason, the Danish and the German approaches were
abandoned since they are not suitable for an analysis of English. Further, I find it
important to consider theme in a wider perspective that captures its role in chunks
of discourse larger than the clause or sentence (e.g. Fries 1983, Martin 1992,
Halliday 1994: 61). The model I propose takes Halliday (1994) as a point of
departure, but, following Matthiessen among others, includes in the theme all
preverbal elements in English. Let us consider again Matthiessen's example (1992:
52):
(20) Autumn [Exp-Th/Pa] passed and [Txt-Th/St] winter [Exp-Th/Pa] [passed], and
[Txt-Th/St] in the spring [Exp-Th/C] the boy [Exp-Th/Pa] went out to play in the wood.
While [Txt-Th/St] he [Exp-Th/Pa] was playing, two rabbits crept out from the
bracken and peeped at him. [my notations and emphasis added]
This example, as well as (9) and (10), repeated here as (21) and (22), show that
not only clauses or sentences in isolation, but also the context has to be taken into
account when deciding on the theme-rheme transition point.
(21) The Pope himself probably survived only because he isolated himself from
everybody else in his huge palace. I suppose isolation was a very natural
impulse. Everywhere in Europe [Exp-Th/C] people [Exp-Th/Pa] resorted to it,
whether [Txt-Th/St] they [Exp-Th/Pa] were noblemen or priests or intellectuals or
ordinary peasants. (ABR1)
(22) With regret [Int-Th/Mo] I [Exp-Th/Pa] put the diary into my other trouser pocket.
The next morning [Exp-Th/C] I [Exp-Th/Pa] supposed, I [Exp-Th/Pa] would have to
telephone his office with the dire news. I [Exp-Th/Pa] couldn't forewarn
anyone as I [Exp-Th/Pa] didn't know the names, let alone the phone numbers,
of the people who worked for him. I [Exp-Th/Pa] knew only that he had no
partners, as he had said several times that the only way he could run his
business was by himself. (DF1)
An extended theme which includes all preverbal elements thus allows not only
one but several experiential elements in the theme. So how does this work in
Swedish? Altenberg observes that in translations from English into Swedish, the
components of an English multiple theme have to be split up and spread out
beyond the finite verb due to the V2 constraint in Swedish (1998: 138). The
Swedish translation of (20) reads as follows:
(23) Hösten [Exp-Th/Pa] gick och [Txt-Th/St] vintern [Exp-Th/Pa] [gick], och [Txt-Th/St] på
våren [Exp-Th/C] gick pojken [Exp-Th/Pa] ut för att leka i skogen. Medan
[Txt-Th/St] han [Exp-Th/Pa] lekte kröp två kaniner fram ur ormbunken och kikade på
honom.
Lit: Autumn passed and winter [passed], and in the-spring went the-boy
out to play in the-wood. While he played crept two rabbits out of
the-bracken and peeped at him.
Comparing the English text in (20) with (23), we can see that the distribution of
themes differs. Consider the long multiple theme in (20), and in the spring the
boy, which is split into two chunks when translated into Swedish. A preliminary
term for this type of theme is split theme (cf. Hasselgrd 2000: 24). A split theme
(in a declarative clause) can be defined as including all elements preceding the
finite verb plus the postverbal subject. Preverbal elements may be any
combination of textual, interpersonal, and experiential elements occurring in this
position. There is always an experiential element in the theme. Examples
(24)-(26) illustrate the definition of theme suggested here. First, (24) has subjects in
initial position which are simple unmarked themes:
(24) (a) EO: Neighbourhood boys [Exp-Th/Pa] were called up [...] (RF1)
(b) ST: Pojkar från stadsdelen [Exp-Th/Pa] blev inkallade [...]
Lit: Boys from the neighbourhood were called-up [...]
The languages behave in similar ways in such structures. Let us now look at some
multiple themes involving textual, interpersonal, and experiential elements:
(25) (a) EO: Nevertheless [Txt-Th/Cj] he [Exp-Th/Pa] loved her dearly, and [Txt-Th/St]
over the week past [Exp-Th/C] he [Exp-Th/Pa] had come to love her even
more [...]. (RDA1)
(b) ST: Inte desto mindre [Txt-Th/Cj] älskade han [Exp-Th/Pa] henne djupt, och
[Txt-Th/St] under den vecka som gått [Exp-Th/C] hade han [Exp-Th/Pa] kommit
att älska henne ännu mer [...].
Lit: Nevertheless, loved he her dearly, and over the week past had he
come to love her even more [...].
(26) (a) EO: "Frankly [Int-Th/Mo], I [Exp-Th/Pa] 'm assuming somebody killed him."
(SG1)
(b) ST: "Uppriktigt sagt [Int-Th/Mo] är jag [Exp-Th/Pa] övertygad om att någon
dödade honom."
In (25), there are two experiential themes in both English and Swedish. In other
words, all types of theme components, not only textual and interpersonal ones,
may be stacked and they do not necessarily occur in any typical order (Halliday
1994: 53). The English multiple themes in (25a) comprise the elements Textual <
Experiential in the first clause, and Textual < Experiential < Experiential in the
second clause. In Swedish, on the other hand, we have split themes. In the first
clause the theme is made up of the components Textual < non-thematic element <
Experiential, and in the second clause we find the elements Textual <
Experiential < non-thematic element < Experiential. The initial conjunctive
adjunct and time adverbial trigger inversion of the subject and the finite operator,
and the same holds for the modal adjunct in (26b).
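The definition of split theme given above (all elements preceding the finite verb plus the postverbal subject) can be sketched as a short Python function that returns the sequence of theme components. This is a rough illustration only: the Constituent type, the function name split_theme_pattern and the hand-assigned tags are assumptions made for the example, and the sketch presupposes a declarative clause that contains a finite verb.

```python
from typing import List, NamedTuple


class Constituent(NamedTuple):
    text: str
    label: str             # "Textual", "Interpersonal", "Experiential" or "Other"
    finite: bool = False   # True for the finite verb
    subject: bool = False  # True for the subject


def split_theme_pattern(clause: List[Constituent]) -> List[str]:
    """Return the labels of the theme components in linear order.

    All preverbal elements are thematic. If the subject is not among them and
    directly follows the finite verb, it is added to the theme, with the
    finite verb recorded as an intervening non-thematic element.
    """
    # Assumption: the clause contains exactly one finite verb.
    finite_index = next(i for i, c in enumerate(clause) if c.finite)
    preverbal = clause[:finite_index]
    pattern = [c.label for c in preverbal]
    following = clause[finite_index + 1 : finite_index + 2]
    has_preverbal_subject = any(c.subject for c in preverbal)
    if not has_preverbal_subject and following and following[0].subject:
        pattern += ["non-thematic element", following[0].label]
    return pattern


# First clause of (25a) and (25b), tagged by hand.
english_25a = [
    Constituent("Nevertheless", "Textual"),
    Constituent("he", "Experiential", subject=True),
    Constituent("loved", "Other", finite=True),
    Constituent("her dearly", "Other"),
]
swedish_25b = [
    Constituent("Inte desto mindre", "Textual"),
    Constituent("älskade", "Other", finite=True),
    Constituent("han", "Experiential", subject=True),
    Constituent("henne djupt", "Other"),
]

print(" < ".join(split_theme_pattern(english_25a)))
print(" < ".join(split_theme_pattern(swedish_25b)))
```

With these tags the English clause yields a continuous multiple theme, while the Swedish clause yields a split theme in which the finite verb intervenes between the textual element and the subject.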
Since there may be more than one experiential item in a theme it is not
possible to determine whether one element is more topical than another.
Consequently, the concept topical theme has no function in this approach.
Circumstances and participants acting as theme may simply be referred to as
circumstantial theme and participant theme (Rose 2001: 127).
The model proposed here developed out of a need. There simply did not
seem to be a well-functioning model to compare theme in English and Swedish.
The main advantage of this approach is that it is operational and suits the
purposes of my study. A second advantage is that there is an underlying discourse
basis that is larger than the clause - the role of an item in the surrounding context
was taken into consideration in determining the transition point between theme
and rheme. We cannot neglect the fact that themes contribute to the method of
development of a text, which is why we need to take a global view of the notion
of theme (cf. Baker 1992: 129).
5. Concluding remarks
The main purpose of this paper has been to discuss contrastive theme analysis on
the basis of parallel corpora. It has been shown that the theme-rheme definition in
SFG may serve as the basis for an analysis of a number of languages both
monolingually and contrastively, but it is also clear that the original approach has
to be modified when used to analyse languages other than English.
It has been claimed that a parallel corpus can be used for trying out a
suitable method for analysing thematic structure cross-linguistically. A parallel
corpus reveals the ways in which system differences between languages create
differences in the realisation of thematic structure. A parallel corpus is then a
valuable tool for testing existing models and for constructing new ones.
Notes
1. This work was carried out with funding from the Bank of Sweden
Tercentenary Foundation. I am grateful to Karin Aijmer and Bengt Altenberg
for their valuable comments on earlier drafts of this paper, and to Joe Trotta
for proofreading. Any remaining flaws are mine.
2. There is a great deal of terminological confusion concerning the labels
"parallel corpus", "comparable corpus", and "translation corpus" which are
used for different types of monolingual, bilingual, and multilingual corpora
(cf. Baker 1993: 248, 1995: 228ff., Johansson 1998: 4f., McEnery and Wilson
1995: 57f.). In this paper, the expression "parallel corpus" is thought of as an
umbrella term covering both "translation corpus" (original texts and their
translations), and "comparable corpus" (original texts in different languages or
original and translated texts in the same language). Such texts are comparable
in terms of, for example, genre and domain. A majority of the examples in this
paper were taken from the English-Swedish Parallel Corpus (ESPC) which
consists of original texts in English and Swedish together with their respective
translation into the other language. The corpus is described in detail at
http://www.englund.lu.se/research/corpus/index.phtml.
3. The code in parenthesis shows that the example was taken from the ESPC,
and refers to the text from which the example was extracted (see Corpus
texts). EO refers to English original text, ST to Swedish translated text.
Further on, SO refers to Swedish original text and ET to English translated
text. A word-for-word translation of the Swedish sentences is provided.
4. There are a number of approaches to the concepts that SFG calls theme and
rheme. The reader is referred to e.g. Gómez-González (2001) who provides an
extensive overview.
5. See the Appendix for an explanation of the abbreviated theme types within
square brackets. Themes are marked in bold type.
6. Within the concept of communicative dynamism in the Prague School
theory of Functional Sentence Perspective (FSP) a division is made between
the theme and the non-theme in which the non-theme consists of the
transition and the rheme. The transition consists of elements performing
the linking function. The TMEs [the temporal and modal exponents of the
finite verb] are the transitional element [sic] par excellence: They carry the
lowest degree of CD [communicative dynamism] within the non-theme and
are the transition proper. The highest degree of CD, on the other hand, is
carried by the rheme proper (Firbas 1986: 54, italics in the original).
7. As pointed out by Rose (2001: 126), Halliday (1994: 66) does refer to a
participant following a circumstantial theme as a displaced theme and
explains that it is a topical element which would be unmarked Theme (in the
ensuing clause) if the existing marked topical Theme was reworded as a
dependent clause.
8. In this and other examples taken from sources other than the ESPC I have
sometimes removed any original notation and added my own.
Corpus texts
Brink, A. (1984), The wall of the plague. London: Fontana Paperbacks. (ABR1)
Davies, R. (1985), What's bred in the bone. Harmondsworth: Penguin Books.
(RDA1)
Ferguson, R. (1991), Henry Miller: a life. London: Hutchinson. (RF1)
Francis, D. (1989), Straight. London: Michael Joseph. (DF1)
Grafton, S. (1990), D is for deadbeat. London: Pan Books. (SG1)
Lacey, R. (1986), Ford. The man and the machine. Boston: Little, Brown & Co.
(RL1)
Larsson, B. (1992), Den keltiska ringen. Stockholm: Albert Bonniers. (BL1)
Mayle, P. (1989), A year in Provence. London: Hamish Hamilton. (PM1)
References
Altenberg, B. (1998), Connectors and sentence openings in English and
Swedish, in: S. Johansson and S. Oksefjell (eds), Corpora and cross-
linguistic research. Theory, method, and case studies. Amsterdam &
Atlanta, GA: Rodopi. 115-143.
Andersen, T., U. Helm Petersen and F. Smedegaard (2001), Sproget som
ressource. Dansk systemisk funktionel lingvistik i teori og praksis. Odense:
Odense Universitetsforlag.
Baker, M. (1992), In other words. A coursebook on translation. London & New
York: Routledge.
Baker, M. (1995), Corpora in translation studies. An overview and some
suggestions for future research, Target 7: 223-243.
Daneš, F. (1974), Functional sentence perspective and the organization of the
text, in: F. Daneš (ed.), Papers on functional sentence perspective. The
Hague: Mouton. 106-128.
Downing, A. (1991), An alternative approach to theme: A systemic-functional
perspective. Word 40: 119-43.
Firbas, J. (1986), On the dynamics of written communication in the light of the
theory of Functional Sentence Perspective, in: C.R. Cooper and S.
Greenbaum (eds), Studying writing: Linguistic approaches. Beverly Hills,
Ca: Sage Publications. 40-71.
Francis, G. (1989), Thematic selection and distribution in written discourse.
Word 40: 201-221.
Fries, P.H. (1983), On the status of theme in English: Arguments from
discourse, in: J.S. Petőfi and E. Sözer (eds), Micro and macro connexity
of texts. Hamburg: Helmut Buske Verlag.
Fries, P.H. (1995a), A personal view of theme, in: M. Ghadessy (ed), Thematic
development in English texts. London & New York: Pinter. 1-19.
Fries, P.H. (1995b), Patterns of information in initial position in English, in:
P.H. Fries and M. Gregory (eds), Discourse in society: Systemic functional
perspectives. Meaning and choice in language: Studies for Michael
Halliday. Norwood, N.J.: Ablex. 47-66.
Ghadessy, M. and Y. Gao (2001), Small corpora and translation. Comparing
thematic organization in two languages, in: M. Ghadessy, A. Henry and
R.L. Roseberry (eds), Small corpus studies and ELT: Theory and practice.
Amsterdam & Philadelphia: John Benjamins. 335-359.
Gómez-González, M.Á. (1998), A corpus-based analysis of extended multiple
themes in PresE, International Journal of Corpus Linguistics 3: 81-113.
Gómez-González, M.Á. (2001), The theme-topic interface: Evidence from
English. Amsterdam & Philadelphia: John Benjamins.
Granger, S. (1983), The be + past participle construction in spoken English with
special emphasis on the passive. Amsterdam: North-Holland.
Halliday, M.A.K. (1967), Notes on transitivity and theme in English. Part 2,
Journal of Linguistics 3: 199-244.
Halliday, M.A.K. (1994), An introduction to functional grammar. 2nd ed. London:
Edward Arnold.
Hasselgård, H. (1998), Thematic structure in translation between English and
Norwegian, in: S. Johansson and S. Oksefjell (eds), Corpora and
cross-linguistic research. Theory, method, and case studies. Amsterdam &
Atlanta, GA: Rodopi. 145-167.
Hasselgård, H. (2000), English multiple themes in translation, in: A. Klinge
(ed.), Copenhagen studies in language: Contrastive studies in syntax.
Copenhagen: Samfundslitteratur. 11-38.
Hatim, B. and I. Mason (1990), Discourse and the translator. London & New
York: Longman.
Hatim, B. and I. Mason (1997), The translator as communicator. London & New
York: Routledge.
House, J. (1997), Translation quality assessment. A model revisited. Tübingen:
Gunter Narr.
Johansson, S. (1998), On the role of corpora in cross-linguistic research, in: S.
Johansson and S. Oksefjell (eds), Corpora and cross-linguistic research.
Theory, method, and case studies. Amsterdam & Atlanta, GA: Rodopi.
3-24.
Matthiessen, C. (1992), Interpreting the textual metafunction. In M. Davies and
L. Ravelli (eds), Advances in systemic linguistics: Recent theory and
practice. London: Pinter. 37-81.
Martin, J.R. (1992), Theme, method of development and existentiality: the price
of reply. At http://homepage.mac.com/asfla/articles.htm. Also in
Occasional Papers in Systemic Linguistics 6: 147-184.
Mauranen, A. (1993), Cultural differences in academic rhetoric. A textlinguistic
study. Frankfurt am Main: Peter Lang.
Mauranen, A. (1999), What sort of theme is there, Languages in Contrast 2:
57-87.
McCabe, A.M. (1999), Theme and thematic patterns in Spanish and English
history texts, vol. I. PhD thesis, Aston University.
McEnery, T. and A. Wilson (1996), Corpus linguistics. Edinburgh: Edinburgh
University Press.
Péry-Woodley, M.-P. (1991), French and English passives in the construction of
text, Journal of French Language Studies 1: 55-70.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rose, D. (2001), Some variations in theme across languages, Functions of
language 8: 109-145.
Steiner, E. (2001), Intralingual and interlingual versions of a text - how specific
is the notion of translation, in: E. Steiner and C. Yallop (eds), Exploring
translation and multilingual text production: Beyond content. Berlin &
New York: Mouton de Gruyter. 161-190.
Steiner, E. and W. Ramm (1995), On Theme as a grammatical notion for
German, Functions of Language 2: 57-93.
Svartvik, J. (1966), On voice in the English verb. The Hague & Paris: Mouton.
Svensson, M. (2000), Sentence openings and textual progression in English and
Swedish, in: C. Mair and M. Hundt (eds), Corpus linguistics and
linguistic theory. Papers from the Twentieth International Conference on
English Language Research on Computerized Corpora (ICAME 20),
Freiburg im Breisgau 1999. Amsterdam & Atlanta, GA: Rodopi. 355-370.
Teich, E. (2001), Towards a model for the description of cross-linguistic
divergence and commonality in translation, in: E. Steiner and C. Yallop
(eds), Exploring translation and multilingual text production: Beyond
content. Berlin & New York: Mouton de Gruyter. 191-227.
Teleman, U., S. Hellberg and E. Andersson (1999), Svenska Akademiens
grammatik, 1-4. Stockholm: Norstedts.
Ventola, E. (1995), Thematic development and translation, in: M. Ghadessy
(ed.), Thematic development in English texts. London & New York:
Pinter. 85-104.
Appendix: Abbreviations
Exp-Th/Pa   experiential theme/participant
Exp-Th/C    experiential theme/circumstance
Exp-Th/Pr   experiential theme/process
Int-Th/Mo   interpersonal theme/modal
Int-Th/Fi   interpersonal theme/finite
Txt-Th/St   textual theme/structural
Txt-Th/Ct   textual theme/continuative
Txt-Th/Cj   textual theme/conjunctive
Welcoming children, pets and guests: towards functional
equivalence in the languages of Agriturismo and Farmhouse
Holidays1
Elena Tognini Bonelli, Università degli Studi di Siena
Elena Manca, Università degli Studi di Lecce
Abstract
This paper takes a contextual and functional view of translation equivalence; it
aims to define a 'wider' notion of equivalence built on a network of collocates
rather than on single items. Thus, given an initial node N in L1, the
identification of a translation equivalent in L2 will proceed through several
stages of contextualisation relating each item to its environment and identifying
its collocational profile both in L1 and in L2. Furthermore, it will be shown that
systematic enlargement of the unit of meaning in terms of patterns of
co-occurrence helps to define a typology of the extra-linguistic features associated
with it.
1. Introduction
This paper aims to interpret the concept of translation equivalence in terms of
linguistic shifts between two different socio-cultural contexts. We start from the
assumption that the process of translation has to be seen primarily as a statement
of meaning and that to translate means (1) to identify a specific function together
with its formal realisations in L1, (2) to compare it with another set (function +
formal realisation), or other sets, in L2 and finally, in the light of the previous
stage, (3) to attempt to encode the given function into a chosen formal realisation
in the target language. Whereas the first two steps can be seen as linguistic and
descriptive (it is in fact a matter of comparing formal linguistic features across
languages), the third step is strategic, and it involves the input of a translator,
his/her awareness of the extra-linguistic features, such as the ultimate purpose of
the translation, and his/her ability to negotiate a chosen meaning across languages
(Tognini Bonelli 1996a).
This paper will only consider the first two steps in translation and will
concentrate on identifying a chosen function by describing its formal realisations
in English, on the one hand, and comparing it with the way that particular
meaning is encoded in Italian, on the other. We shall consider in what way the
formal realisations of that meaning may differ or whether they are indeed
comparable across the two languages. We shall try to demonstrate that these
differences and/or correspondences can reveal cultural and typological facets and
that these have to be reckoned with in the process of translation.
2. The corpora
Our data is derived from a set of two comparable corpora (Teubert 1996) in
English and Italian in the fields of Agriturismo in Italy and Farmhouse
Holidays in the U.K. Perhaps the easiest way to characterise the common
denominator between these two fields is to say that they offer their customers a
relaxing holiday in the countryside and with it a number of country activities
related to life on the farm. So, guests are often invited to engage in walking,
hiking, riding, fishing, birdwatching, swimming, etc. and are encouraged to enjoy
the proximity and contact with farm animals. One can expect a comparable
typology in terms of the offer and in the way this offer is put across, although, of
course, allowances have to be made for differences, due to geographical location,
national habits and preferences and, in general, for the specific requirements of
the two different markets.2 In spite of these differences, we assume that certain
more general concepts will have a fairly straightforward equivalent in terms of
their linguistic realisations.
We will henceforth refer to our two corpora as the Agriturist corpus in
Italian and the Farmhols corpus in English. We have assembled these corpora
from web pages and the Agriturist corpus now provisionally contains 115,000
words while the Farmhols one stands at 203,000 words. They can be considered
comparable in that the language they represent has a similar function and aims to
sell a similar product.
3. Translating context and function: methodology and assumptions
As a first step we consulted the frequency list for the Farmhols corpus and
identified the word welcome as a particularly frequent one, as Table 1 shows. A
series of interviews with the owners of different www pages for farmhouse
holidays confirmed the centrality of the word which repeatedly appeared in
definitions such as this one:
A Farmhouse holiday can mean different things. It depends on the
accommodation etc. The one thing they should all have in common
is a warm and friendly welcome and the peace and beauty of the
countryside. (...) Other holidays are like what we provide,
self-catering, with the farmer welcoming you to wander on his farm if
you wish and also to buy good local food. (...) People coming here
always comment on the peace and beauty and the warmth of the
welcome. (J. Rider, 2000, personal communication)
Having chosen the word welcome, we faced the first difficulty in
identifying a straightforward equivalence pair. We posited as a prima facie
translation equivalent (TE) in Italian the word benvenuto, which exists both as an
adjective and as an exclamation, but this word had no comparable frequency in
the Agriturist corpus, as Table 1 shows.3
Table 1. Frequencies of welcome and benvenuto
FARMHOLS CORPUS    AGRITURIST CORPUS
Welcome            Benvenuto/a/i/e
324 instances      4 instances
The difference in frequency was so marked that we had to ask ourselves why the
concept of welcoming people, which appears to be equally central in both the
fields of Agriturismo and Farmhouse Holidays, could differ so much in its
formal realizations. In spite of our initial assumptions we had to face up to the
problem of non-equivalence.
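Since the two corpora differ in size, the raw counts in Table 1 are easier to compare once they are normalised. The following is a minimal sketch of that arithmetic, using the counts from Table 1 and the provisional corpus sizes given in Section 2; the function name `per_10k` and the choice of 10,000 words as the base are assumptions made for illustration, not part of the authors' procedure.

```python
def per_10k(hits, corpus_size):
    """Relative frequency: occurrences per 10,000 running words."""
    return hits / corpus_size * 10_000

# Counts from Table 1; corpus sizes are the provisional figures from Section 2.
farmhols = per_10k(324, 203_000)    # welcome in the Farmhols corpus
agriturist = per_10k(4, 115_000)    # benvenuto/a/i/e in the Agriturist corpus

print(f"welcome:   {farmhols:.2f} per 10,000 words")
print(f"benvenuto: {agriturist:.2f} per 10,000 words")
print(f"ratio:     {farmhols / agriturist:.0f} to 1")
```

Normalising in this way shows that the gap is not an effect of the Farmhols corpus being the larger of the two.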
In this context non-equivalence goes beyond the absence of a match
between L1 and L2. Sometimes when we compare languages we recognise non-
equivalence when there is no match to a certain word: take for instance the
English word hangover which needs to be paraphrased in Italian because there is
no direct equivalent. Sometimes a justification for this phenomenon is possible in
cultural terms. In our case the mismatch occurs when a word like welcome, which
is prominent in terms of frequency in L1, appears only very rarely in L2. The
problem we have to consider, then, is how to identify an equivalent function
given that this may be realised in different ways at the formal level. The other
possibility is that, of course, for some reason, whether cultural or ideological, the
word might not have a direct equivalent.
In order to ascertain whether indeed the concept of welcoming is so
dramatically absent in the Italian of Agriturismo or whether it is simply expressed
differently, we adopted a different approach and decided to address the issue of
translating a word starting from the context in which it is most frequently
embedded. We will explain in the sections that follow our assumptions and our
methodology.
The view we take is that equivalence should not, and often cannot, be
established at simple word level; when indeed a certain type of equivalence
exists, this should be established at the wider level of functionally complete units
of meaning (Tognini Bonelli 1996a/b, 2001). Our aim here is to show how a
systematic contextual and co-textual analysis of the data can help the translator to
identify this wider notion of equivalence built on a network of collocates rather
than on single items. This enlargement of the issue is especially necessary when
we face the problem of non-equivalence at word level outlined above. However,
we also recommend it as a more generally applicable method because it allows
the analyst a privileged position for observing and reconciling the contextual
patterning and the overall function of the translation unit.
Our method brings us, therefore, to question the traditional distinction
between item and environment, in favour of a model of meaning and translating
that takes as central the phenomenon of co-selection and sees the context as an
integral part of the text. Co-selection has been widely discussed in relation to
meaning and lexicography (see Sinclair 1987, 1991 and later) and such statements
as the following ones by Tognini Bonelli (2001: 128) can now be taken more or
less for granted:
• That many textual meanings arise from the co-selection of more than one
  word.
• That habitual co-selection tends to specialise the function of one or more
  of the words concerned.
• That co-selection is largely covert and subliminal, which increases its
  importance in communication.
The importance of contextual information for identifying meanings across
languages is elaborated by Sinclair and his associates in a collection of papers on
corpus-to-corpus translation equivalence (Sinclair et al. 1996). In his preface to
this work, Sinclair states that in many cases, when there is no TE for a chosen
word, translation can only be achieved by first of all combining the word with
one or more others; the whole phrase will then equate with a word or phrase in
the other language (Sinclair 1996: 175). He proposes:
A system of describing the shared meanings of languages in terms of
the actual verbal contexts in which each instance is found. The
attraction of the description is the way in which each instance is
assumed to be carrying in its immediate environment sufficient
differential information to indicate which of several possible
meanings is the relevant one, and in the case of translation, what is
the appropriate phraseology. (Sinclair 1996: 174)
This paper aims to take this work on co-selection (see also Francis 1993,
Partington 1998) one step further and considers the implications of its centrality
in translation with particular attention to methodology.
In the process of establishing equivalence, we will also observe how a
systematic enlargement of the unit of meaning in terms of patterns of co-
occurrence can help to define a typology of the extra-linguistic features
associated with it: the type of product offered and also the specific ways in which
it is offered. We will examine differences which are not only due to the different
geographical provenance of the text but also to cultural diversity.
4. Procedure
Our initial word in L1 is welcome which, for lack of space, will be discussed
here only in its adjectival function. The choice of this word is supported by the
fact that the word welcome is very prominent in the Farmhols Corpus. A simple
word-frequency list reveals immediately that welcome is almost top of the list of
lexical words. However, as we mentioned, there is no direct equivalent to it in the
Agriturist corpus, and this in spite of the existence of a prima facie equivalent such
as benvenuto. Tables 2 and 3 illustrate the frequencies of welcome and benvenuto
in the two corpora.
Table 2. Frequencies of welcome in the Farmhols Corpus
WELCOME (324 instances)
Adjective   Exclamation   Noun    Verb
147         104           57      15
(46%)       (32%)         (17%)   (5%)
Table 3. Frequencies of benvenuto in the Agriturist Corpus
BENVENUTO/A/I/E (4 instances)
Adjective (benvenuti)   Exclamation (benvenuti)
1                       3
The mismatch between the frequencies is very clear and, because of this, we shall
try to identify TEs in L2 going through several stages of contextualisation and
relating each item to its environment. We shall identify the collocational profile
of each item both in L1 and in L2 and establish the possible correspondences
between larger units. So, at first, by analysing the concordance to the initial node
in the Farmhols corpus we shall locate the node's most frequent collocates. For
each of the collocates we shall posit a prima facie translation equivalent (TE1,
TE2, TE3, etc.): each of these will be investigated in its own right as a node in the
Agriturist Corpus and it is within their collocational range that we shall try to
locate an equivalent to welcome. Our methodological steps are outlined in
Figure 1.
Node/L1       Collocate1/L1       TE1/L2
(welcome)     (children)          (bambini)
              Collocate2/L1       TE2/L2
              (pets/dogs)         (animali)
              Collocate3/L1       TE3/L2
              (visitors/guests)   (ospiti)
Figure 1. Methodological steps for identifying translation equivalence
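The first two steps of this procedure can be sketched in code: collect the collocates of the node within a fixed span, then map the most frequent lexical collocates to their prima facie TEs, which become the new nodes in L2. This is only an illustrative sketch. The span, the small stoplist and the function names are assumptions introduced here; the TE pairs are the ones named in the text, and the sample is made up of a few citations from the Farmhols concordance.

```python
import re
from collections import Counter

# Illustrative stoplist and dictionary lookup; both are assumptions made for
# this sketch. The TE pairs are the prima facie equivalents named in the text.
FUNCTION_WORDS = {"and", "are", "over", "the", "to", "we"}
PRIMA_FACIE_TE = {
    "children": "bambini",
    "pets": "animali",
    "dogs": "animali",
    "visitors": "ospiti",
    "guests": "ospiti",
}

def collocates(text, node, span=4):
    """Count the words found within `span` tokens either side of `node`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letters only, digits dropped
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - span):i])   # left context
            counts.update(tokens[i + 1:i + 1 + span])   # right context
    return counts

def next_nodes(counts, top=3):
    """Map the `top` most frequent lexical collocates to their L2 nodes."""
    lexical = [w for w, _ in counts.most_common() if w not in FUNCTION_WORDS]
    nodes = []
    for word in lexical[:top]:
        te = PRIMA_FACIE_TE.get(word)
        if te is not None and te not in nodes:
            nodes.append(te)
    return nodes

sample = ("Children over 10 welcome. Pets and children are welcome. "
          "Children and pets welcome. Dogs and children welcome.")
profile = collocates(sample, "welcome", span=3)
print(profile.most_common(3))
print(next_nodes(profile))
```

Note that the window in this sketch runs across sentence boundaries; a fuller implementation might stop at punctuation.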
Starting therefore with the most prominent English collocates of welcome
(children, pets/dogs and visitors/guests) as the node, we shall consider their
prima facie TEs in Italian. This will be done with the help of dictionaries or
on the basis of translators' experience and intuition. However, it is important to
understand that the evidence from the corpus can be invaluable even at this stage:
a frequency list of the Agriturist corpus will show immediately that, in terms of
usage, the equivalent for dogs (given the absence of an equivalent concept to pet
in Italian) is not cani but the superordinate animali.
The next step will see us turning to the Italian TEs of these words and
repeating the same procedure. We shall therefore consider what type of collocational
patterning is associated with each of the terms bambini, animali and ospiti. Our
aim here is to locate, within their collocational range, the patterns belonging to, or
denoting, the same semantic field as welcome or, on the other hand, note their
absence.
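The concordances examined in the following subsections present each occurrence of a node with a stretch of co-text on either side. A minimal keyword-in-context routine along these lines might look as follows; the function name `kwic`, the character-based window and the sample text (pieced together from two Agriturist citations) are assumptions made for illustration and do not describe the software the authors actually used.

```python
import re

def kwic(text, node, width=30):
    """Return one aligned line per occurrence of `node` in `text`."""
    lines = []
    pattern = rf"\b{re.escape(node)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left co-text so that the node always starts
        # in the same column.
        lines.append(f"{left:>{width}} {match.group(0)} {right}")
    return lines

sample = ("Siamo aperti tutto l'anno, animali si accettano previo accordo. "
          "Sono ammessi animali di piccola taglia.")
for line in kwic(sample, "animali"):
    print(line)
```

Aligning the node in a fixed column is what makes recurrent patterns to its left and right visible at a glance.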
4.1 Children and bambini
The first step in contextualisation will consider the word welcome as a unit taken
together with its most frequent collocate, children. A quick examination of the
concordance shows quite clearly two points (a few citations are reported in Table
4). First, the close association between children and pets or dogs; we have not
enough data to discuss this in detail, but it certainly should be noted because it
seems rather unusual to find them in the same category. Second, that when
children do not share this association with pets, there is always some kind of
restriction or limitation to their presence on the farm, whether it be some age
restriction (over 10 .., over 5 ..) or the fact that no discount is available, for
example.
Table 4. children + welcome

number of units used Children over 10 welcome Ample off road parking
available. Children over 5 welcome, baby sitting available
single occupancy. Children are welcome but we cannot offer discounts
kind. Pets and children are welcome. Children will find the
residential caravans. Children and pets welcome. We are members of
with fireplace. Children and pets are welcome - Baby sitting -
heating. Dogs and children welcome. Costwolds Main page
The specific age restriction is confirmed by other citations in the same corpus
where the noun children is not combined with the adjectival use of welcome, as
shown in Table 5.
Table 5. Children + age limitations

Sorry no Pets No smokers Children over 16 welcomed
twin bedroom for sensible children over the age of seven.
to leave them in the car. Children over 7 accepted. Most
number of units used Children over 10 welcomed
Dining Room Non-smoking. No children under the age of 8.
We should remember that this type of holiday on the farm in the U.K. is often
centred around domestic animals and their young and part of the fun offered is to
observe them in their own farm environment. The type of conditioned welcome
that we see in the instances above, rather than qualifying a warm and friendly
reception, seems to function as damage limitation when a face-threatening
situation, such as a restriction on the offer, arises. It also reflects well the
situational and cultural context in Britain where the children are not always
welcomed even in places such as farmhouses, where the presence of farm animals
and pets would seem to be an incentive for their presence.
In three instances we find children associated with discount offers (see
Table 6), but these are fairly rare (2.9%) if compared, as we shall see in Table 7,
with the Agriturist corpus.
Table 6. Children + discounts (2.9%)

there are always good reductions for children. Leave the highways and
We have reduced rates for children sharing with their parents
per night with discounts for children. In addition, we also
Let us now proceed to the second step in contextualisation, that is examining the
patterns of co-selection associated with our prima facie TE of children, viz.
bambini, in the Agriturist Corpus. Table 7 gives some examples.
Table 7. Bambini + discounts (25%)

-RIDUZIONI: Bambini 0-2 anni: -70%; Bambini 2-12 anni: 30%
SCONTI E AGEVOLAZIONI Bambini fino a 3 anni gratis; Sconto ed
agevolazioni: Gratis bambini fino a 2 anni; Sconto 30% pensione
con tariffe speciali per bambini fino a 10 anni
Supplementi e riduzioni: bambini 2/10 anni sconto 35% -
The patterning shown in the citations in Table 7 is very typical. Bambini are
never associated with expressions of welcome or denoting an explicit permission
to stay in the Agriturismo. However, they regularly seem to be connected with the
semantic field of discounts identified by words such as riduzione, sconti e
agevolazioni, gratis and gratuito, which, while pointing to the welcome only
implicitly, certainly show it in tangible and concrete terms. In Table 6 we
reported the only three instances of this type in the Farmhols Corpus. In the
Agriturist corpus this is the most typical pattern associated with bambini.
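Proportions such as the 25% reported for Table 7 amount to counting the concordance lines in which the node co-occurs with a word from a given semantic field. The sketch below shows one way this could be done. Matching on word stems is an assumption of the sketch, not the authors' stated procedure, and since the sample consists only of the five citations in Table 7, the resulting share describes that sample and not the Agriturist corpus as a whole.

```python
# Stems for the semantic field of discounts, taken from the words named in the
# text (riduzione, sconti, agevolazioni, gratis, gratuito). Matching on stems
# is an assumption of this sketch.
DISCOUNT_STEMS = ("riduzion", "scont", "agevolazion", "gratis", "gratuit")

def share_with_field(lines, stems):
    """Proportion of concordance lines containing at least one stem."""
    if not lines:
        return 0.0
    hits = sum(1 for line in lines
               if any(stem in line.lower() for stem in stems))
    return hits / len(lines)

# The five citations for bambini reported in Table 7.
bambini_lines = [
    "-RIDUZIONI: Bambini 0-2 anni: -70%; Bambini 2-12 anni: 30%",
    "SCONTI E AGEVOLAZIONI Bambini fino a 3 anni gratis; Sconto ed",
    "agevolazioni: Gratis bambini fino a 2 anni; Sconto 30% pensione",
    "con tariffe speciali per bambini fino a 10 anni",
    "Supplementi e riduzioni: bambini 2/10 anni sconto 35% -",
]
share = share_with_field(bambini_lines, DISCOUNT_STEMS)
print(f"{share:.0%} of the sample lines contain a discount term")
```

The fourth citation (tariffe speciali) is missed by this stem list although it clearly belongs to the same field, which illustrates why such counts still need to be checked against the concordance by hand.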
As in the Farmhols Corpus bambini are associated with some age
limitations (fino a 3 anni .., da 2 a 6 anni .., 2/10 anni..), but these only refer to
the discounts and the reductions offered and not to the actual acceptance of
bambini in the Agriturismo.
To sum up this section, we can say that the contextual analysis of the data
in the two languages has shown no match for the word welcome in the context of
children. This is true not only in terms of a similar grammatical pattern - we had
started from the lack of correspondence welcome/benvenuto - but also with other
lexical or grammatical patterns that might have realised a similar function. Can
we then ask ourselves whether this absence of welcome in the Italian of
Agriturismo means that children are not really welcomed in Italian Agriturismo
while they are in British farmhouses? We maintain that the analysis should
always be extended to the context and the overall function of the unit. So,
considering the data we have analysed, perhaps the best answer would be to
remind ourselves again of a citation from the Farmhols Corpus where the
welcome certainly cannot be taken as encouragement,
Sorry no pets No smokers Children over 16 welcomed
and to conclude that the English welcome, when applied to children, may not
necessarily convey the warmth and the friendliness that we associate with it; a
qualified welcome is perhaps to be interpreted as discouragement to those
excluded by the qualification. On the other hand, the fact that no explicit welcome
is stated in relation to bambini should also be interpreted in the context of the
regular statements about discounts and reductions made available to children, and
these should be taken as encouragement for the presence of children in the Italian
Agriturismo. It seems to be taken for granted that children are welcome.
4.2 Pets, dogs and animali
Pets and dogs are the recipients of the welcome in 20% of the instances in the
Farmhols corpus. In half of these occurrences, however, this welcome is
accompanied by a limitation on the offer, as was the case with children. As one
can see in Table 8 below, this conditioned welcome is realised here by a variety of
expressions ranging from provided, providing and but to by arrangement and on
payment of. We also find some adjectives such as well-controlled and well
behaved that also signal a limitation on the welcome.
Table 8: Pets and dogs + welcome

and bread oven. Pets are welcome by prior arrangement.
baby bedding is supplied. Pets are welcome but must be kept under control
mountain-bike routes. Your pets are welcome provided they are under control
breakfast. Well behaved pets are welcome in the house or kenneling is
farm-out buildings. Dogs are welcome provided they are kept strictly
year round. Well behaved pets are welcome and short breaks are available.
tranquil. Well controlled dogs are welcome. Pheasant Cottage; Partridge
high chair can be hired. Dogs are welcome on payment of a small fee
breakfast. Well behaved pets are welcome in the house
These restrictions are perhaps more understandable than the limitations we
observed with children because dogs are always perceived as potential dangers on
British farms where they often tend to harass sheep or cattle.
Let us now consider the prima facie equivalent of pets and dogs in the
Agriturist corpus. The word pet/s, with its implication of personal closeness and
affection, has no correspondence in Italian and a quick scan of the frequency list
from the Agriturist corpus identifies the more general term animali as a potential
equivalent. The term animali occurs 65 times in the corpus of Italian, but only 23
instances refer to pets rather than to farm animals. Let us consider some citations
in Table 9.
Table 9. Animali + ammettere/accettare

Accettano. Animali: Ammessi i cani
della prenotazione ANIMALI: ammessi previo accordo
(solo sala ristoro), ammessi animali, angolo lettura, telefono e fax
sconto 15%. Ammessi animali di piccola taglia.
prezzo ridotto. Sono ammessi animali di piccola taglia.
una scuola di parapendio. Animali non ammessi
normalmente in dotazione. Gli animali non son ammessi.
Sono ammessi animali? Si, gli animali sono ammessi con pagamento di
consumo di gas. Non si accettano animali. Tutta la biancheria
Siamo aperti tutto l'anno, animali si accettano previo accordo.
Aperto tutto l'anno. Si accettano animali domestici.
In the co-text of this word, we notice immediately two possible
equivalents to the English welcome: the two verbs ammettere 'admit' and
accettare 'accept' in their different inflected forms, always either in the passive,
as sono ammessi, or in the impersonal, as si accettano. It is interesting to note that
limitations to the presence of animali in the Agriturist corpus exist, although they
are perhaps slightly different from the ones we found in the Farmhols corpus.
Here, we notice for instance the size, di piccola taglia 'of a small size', which
was not mentioned in the English context, or the fact that there should be prior
agreement, previo accordo, which seems to be more prominent in the Agriturist
corpus; in the Farmhols corpus the issue seemed to be more that pets should be
well-behaved or kept under control.
From the point of view of the translation equivalence the result is quite
satisfactory because, while we could not find a one-to-one equivalent for welcome
in general, we were able to locate a perfectly good equivalent for the English pair
welcome-pets in the Italian accettare/ammettere-animali. At the level of
functionally complete units of meaning, the pragmatic dimension of the unit is
realised by the expressions of limitation associated with it both in English and in
Italian. This suggests that the use of welcome in this context in English is just a
euphemism for accepted.
4.3 Guests, visitors and ospiti
The patterning associated with welcome in the context of guests and visitors
differs from the patterning with both children and pets; here we consistently find
the structure Vb-BE + welcome + to-inf., as in 'Our visitors are welcome to
explore the farm'. The concordance in Table 10 groups together some citations for
visitors, guests and also the pronoun you which addresses the potential visitor or
guest in the text from the web pages.
We note here that the structure in which welcome is embedded has a
different impact on the meaning: if with children and pets the welcome conveys
the meaning of permission and implies that they are allowed to join in the
farmhouse holiday, subject to certain specific conditions; with visitors and guests
we find a straight invitation to take advantage of all the leisurely activities offered
by the farmhouse.
Table 10. Guests/visitors + welcome to

and Kilburn. Our visitors are welcome to explore the farm to discover
Caebetran Farm. Visitors are welcome to see the cattle and sheep
bottle fed. All visitors are welcome to join in the farming activities
Visitors are welcome to stroll around the farm. We regret
Guests are welcome to bring their own dogs, if they
Guests are welcome to relax in our victorian lounge
and cattle. Guests are welcome to roam the farm with its
close by. Our guests are welcome to fish the 1/4 mile river bank,
and cattle. Guests are welcome to roam the farm with its pretty
and bathroom. Guests are welcome to use the garden and fields for
where you would be most welcome to join in the family, or
guests to relax in or you are welcome to sit in the garden.
you will be welcome to come carol singing
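The structure Vb-BE + welcome + to-inf. illustrated in Table 10 lends itself to retrieval with a simple pattern. The following sketch uses a regular expression to separate the invitation pattern from the permission pattern found with children and pets. The pattern and the function name are assumptions made for illustration, and the expression is a heuristic: it would also match welcome to followed by a noun, so its output still needs manual inspection.

```python
import re

# A form of BE, an optional intensifier, then "welcome to" and the next word,
# which is taken to be the infinitive. Heuristic sketch only.
BE_WELCOME_TO = re.compile(
    r"\b(?:am|is|are|was|were|be|been|being)\s+"
    r"(?:most\s+|very\s+)?welcome\s+to\s+(\w+)",
    re.IGNORECASE,
)

def invitation_verbs(lines):
    """Return the verbs following 'BE welcome to' in the given lines."""
    return [match.group(1).lower()
            for line in lines
            for match in BE_WELCOME_TO.finditer(line)]

# Three citations from Table 10 and two from Table 8.
citations = [
    "Our visitors are welcome to explore the farm to discover",
    "where you would be most welcome to join in the family, or",
    "you will be welcome to come carol singing",
    "Pets are welcome by prior arrangement.",
    "Dogs are welcome on payment of a small fee",
]
print(invitation_verbs(citations))
```

The two citations from Table 8 are not matched, which reflects the difference between welcome as invitation and welcome as conditioned permission.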
Let us now consider the Italian equivalent of guests and visitors, that is
ospiti. Again, we note the absence of the typical TE of welcome as suggested by
traditional reference books, the fully lexical benvenuto/i. Some examples are
given in Table 11.
Table 11. Ospiti + potere

ampi spazi a disposizione degli amici ospiti che potranno raccogliere la
forno a legna può essere utilizzato dagli ospiti per attività di svago
può essere raccolta personalmente dagli ospiti, che possono anche assistere
di produzione biologica, ove gli ospiti possono raccogliere prodotti
Nella fattoria Poggio Oliveto gli ospiti possono visitare le colture
in bicicletta. Esternamente gli ospiti possono godere della piscina,
Vi è la possibilità per gli ospiti di partecipare alle attività
e nel mese di dicembre i nostri ospiti possono visitare il frantoio
er vacanze tranquille e rilassanti. Gli ospiti potranno godere di una piscina
ediate vicinanze di Poggio Paradiso gli ospiti potranno fruire di attrezzature
In the concordance in Table 11 it is pretty clear that the equivalent of the English
structure Vb-BE + welcome + to-inf. is conveyed in Italian by the modal potere
'to be able to' in its inflected forms. Here we have the example of a fully lexical
word such as welcome in L1 that has primarily a grammatical realisation in L2.
The phrase vi è la possibilità di ('there is the possibility to') carries the same
modal meaning but in a lexicalised form. In spite of this lexical status it belongs
under the same umbrella of modality that in traditional linguistics is usually
understood as 'Grammar'. This is a potential trap for translators because the
lexical choice implicitly carries more weight and as such may become a more
visible, and therefore preferred, option when translating. We can certainly say
that it is the purely lexical meaning that tends to be the focus of traditional
reference books, so welcome is translated as benvenuto, and no guidance is given
about the likely use of the modal potere. In this case a translation corpus could
help us to identify the favourite choices of translators, to verify for instance if the
grammatical translation of welcome is indeed used and if so, if it is used
appropriately.
The noun ospiti shows a frequent association with another expression, also
related to modality: a disposizione di. Let us consider some examples in Table 12.
Table 12. Ospiti + a disposizione.

Toscano, 2 piscine a disposizione degli ospiti con una stupenda vista su
antico forno a legna a disposizione degli ospiti.
inoltre a disposizione degli ospiti vi sono tre laghetti privati
ospiti. A disposizione degli ospiti c'è anche un grande barbecue
agriturismo mette a disposizione degli ospiti quattro camere doppie, due
te all'Oppio mette a disposizione degli ospiti tre appartamenti, mentre
A disposizione degli ospiti, ampia piscina aperta
senese. A disposizione degli ospiti ci sono 3 confortevoli appartamenti
One thing to notice, mentioned only in passing here for lack of space, is
the fact that the phrase a disposizione degli ospiti in the Agriturist corpus is
mainly associated with the type of accommodation offered (e.g. quattro camere
doppie 'four double rooms'), while welcome + to-inf. is connected with the
different leisure activities offered by the farmhouse holiday package. This points
to the specificity of the semantic preference within similar units of meaning and
to the fact that collocational restriction is based on semantic criteria. It is certainly
something that should be investigated further, especially in view of the impact it
can have on the translation process at the level of appropriateness.
5. The typology of the offer
The data discussed in the sections above show that while the single word
denoting welcome cannot be translated satisfactorily in Italian, each of the
collocational pairs welcome-children, welcome-pets and welcome-guests has an
appropriate TE (even if this is 0-equivalence in the case of children) that conveys
welcome either in terms of permission or in terms of invitation.
By enlarging the translation unit to encompass the more systematic
patterning associated with the initial collocation pair, a typology of the offer
specific to each type of guest emerges. We have seen how certain guests (children
and pets in the Farmhols corpus, animali in the Agriturist corpus) invited the
presence of restrictions while others (bambini and ospiti in the Agriturist corpus,
guests in the Farmhols corpus) did not. The types of restriction, as we have seen,
were not the same in the two languages and reflected cultural and ideological
preferences; so while the presence of children was restricted in terms of age in the
Farmhols corpus, in the Agriturist corpus the only qualification was on the type of
discount accorded. With pets the restrictions demanded that they should be under
control and that they should be well-behaved in the Farmhols corpus while the
parallel term animali in the Agriturist corpus seemed to invite restrictions on size
rather than behaviour, and that specific arrangements for their presence should be
made in advance.
The typology of the offer for children included a large safe area, explorer
trails, ample space as well as some specific facilities like cots, highchair and
child minding. The equivalent offer for bambini in the Agriturist corpus showed
predominantly the semantic area of children's games and game-parks with words
such as giochi per bambini, spazi attrezzati per bambini, piscina rotonda per
bambini.
6. Conclusion
This paper started off exploring the notion of translation equivalence at word
level between two items which had similar grammatical, lexical and even
morphological realizations in English and Italian. The assumption of equivalence
appeared very plausible because the concept in question, the idea of welcome in
the field of eco-tourism and farmhouse-style accommodation is central both in
English and in Italian. It seemed therefore likely that there would be a fairly
straight-forward match between welcome in English and its Italian counterpart
benvenuto. The radical mismatch in frequency of occurrence between the two
words was a surprise and we set out to explore and explain it and to see if we
could find ways in which a translator could cope with it.
Our initial assumption defined meaning as function in context, and this
led us to take the context in which a word is embedded as the primary focus of
the translating activity. The traditional distinction between item and environment
was reinterpreted here in the belief that a systematic contextual analysis could
help us to identify a wider type of equivalence where functionally complete units
of meaning are compared across languages. The enlargement of the unit of
translation shed light on some contextual features that proved significant not only
in comparative terms, but also for the identification of a suitable TE. Using
Firthian terminology, we could say that starting from the immediate verbal
co-text we went on to address a wider 'context of situation' and ended up
identifying elements that were related to an even wider 'context of culture'. Our
notion of translation equivalence was similarly enlarged to encompass some
cultural and typological facets that are not usually considered as relevant to the
translation process in itself. In actual fact these elements proved decisive in
the choice of a TE. Perhaps the basic message that comes across from our study is
that the notion of an abstract translation equivalence does not hold and that
functional translation equivalence has to be sought. This is even more true when
the translation in question has a specific purpose, in our case addressing a specific
audience and selling a specific product.
From a methodological point of view, in this paper we proposed a method
of translation that differs rather radically from the traditional ways. We took as
our starting point the recurrent patterns of co-selection of a word and from them
proceeded to search for an adequate TE.
The procedure we proposed for the comparison of units across languages
goes through a three-stage process: (1) from the original word we aim to translate
to the range of collocates that most characteristically accompanies it, (2) from
each collocate to a prima-facie TE in L2, (3) from each TE to the collocational
range that most characteristically accompanies it in L2. This was done with a
view to locating the lexical and grammatical patterns that more characteristically
encode the function of our original node word.
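The three-stage procedure can be illustrated with a minimal Python sketch. It is not the tooling used in this study: the function names, the span of four words and the analyst-supplied mapping prima_facie are assumptions made for illustration only, and raw co-occurrence counts stand in for whatever collocation measure is preferred.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = max(0, i - span)
        # Words to the left and right of the node, excluding the node itself.
        counts.update(tokens[left:i] + tokens[i + 1:i + 1 + span])
    return counts


def three_stage(l1_tokens, l2_tokens, node, prima_facie, span=4, top=3):
    """Hypothetical helper walking through the three stages.

    prima_facie maps an L1 collocate to a candidate L2 equivalent and is
    supplied by the analyst; it is not derived automatically here.
    """
    result = {}
    # Stage 1: from the node word to its most frequent collocates in L1.
    for collocate, _ in collocates(l1_tokens, node, span).most_common(top):
        # Stage 2: from each collocate to a prima-facie equivalent in L2.
        te = prima_facie.get(collocate)
        if te is None:
            continue
        # Stage 3: from the equivalent to its own collocational range in L2.
        result[collocate] = (te, collocates(l2_tokens, te, span).most_common(top))
    return result
```

In practice the L2 collocational range returned at stage 3 is what the analyst inspects for recurrent patterns such as ospiti + potere.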
In practical terms this means identifying and comparing syntagmatic units
that share certain contextual features with a view to identifying a similar
function. The units which constitute the currency for this process are above all
multi-word in that a specific function appears always to require more than one
single item for unambiguous identification. The problem, of course, is that our
conventional notion of translation equivalence does not take fully into account the
contextual circumstances and usually searches for correspondences at word level.
Here we showed that if we cannot find a satisfactory one-to-one TE for the
adjective welcome, for instance, functional equivalence can be established at the
level of the wider units welcome+children, welcome+pets and welcome+guests.
This study has also attempted to show that it must not be taken for granted
that the TE of what appears as a well-formed syntagmatic unit in L1 will be easily
retrieved in L2. So, although both welcome and children can be individually
translated in Italian, this does not mean that the unit of meaning in which they are
combined can be translated.
The upshot of our discussion is that any translating activity should start by
considering very carefully the context in which a certain word or expression is
embedded and the one into which it is going to be transferred. While we cannot
maintain that welcome in general language is always to be translated as accettare
or potere, we can certainly say that welcome should be translated with some form
of the verb accettare when it applies to pets and with some form of the verb
potere when it applies to guests in the specific restricted language of Farmhouse
Holidays in the U.K. That is, if we want our translation to sound natural and
avoid the unmistakable ring of 'translationese' (Gellerstam 1986).
Corpus evidence gives us a privileged start by allowing us to examine
simultaneously the syntagmatic and paradigmatic dimensions of meaning. We
have tried to show that it is only by comparing possible TEs in the presence of
their syntagmatic patterning and their paradigmatic associations in the two
languages that it is possible to identify functional equivalence.
This study has not specifically focused on the typology of the offer in
Italian Agriturismo and British Farm-house holidays. However, in the course of
our observations, it was apparent that some very interesting insights can be
gained from a close look at the data from a typological perspective. In this
context we only want to point to the possibility of identifying the parameters of
this offer in a systematic way. We believe that anybody wanting to advertise their
offer in a foreign language should be aware of the comparable offer available to
their target customers, not only in terms of linguistic realisations but also in terms
of the facilities they advertise. This will be the focus of further research in the
future.
Notes
1. A first version of the work reported here was presented at the A.I.A.
conference in Catania in September 2001 (published in Textus XV, no. 2,
2002). This version, presented at ICAME 2002 (Göteborg), greatly benefited
from the careful and stimulating comments of the editors of this volume,
Karin Aijmer and Bengt Altenberg, as well as the discussion and the questions
that followed the presentation.
2. See for instance the importance of genuine food and the pleasures linked to a
traditional country cuisine which is central in the Agriturist offer in Italy and
has no real equivalent in the Farmhols Corpus.
3. As well as being an adjective and an exclamation, the word welcome is also
used as a verb (see Manca 2001). In this study we will only consider the
adjectival function in some detail.
References
Francis, G. (1993), A corpus-driven approach to grammar. Principles, methods
and examples, in: M. Baker, G. Francis and E. Tognini Bonelli (eds), Text
and technology: in honour of John Sinclair. Amsterdam and Philadelphia:
Benjamins, 137-156.
Gellerstam, M. (1986), Translationese in Swedish novels translated from
English, in: L. Wollin and H. Lindquist (eds), Translation studies in
Scandinavia. Lund: CWK Gleerup, 88-95.
Manca, E. (2001), Il Linguaggio delle Farmhouse Holidays e quello
dell'Agriturismo messi a confronto: realizzazioni linguistiche e tipologia
dell'offerta. Tesi di Laurea in Inglese, Università degli Studi di Lecce.
Partington, A. (1998), Patterns and meanings. Using corpora for English
language research and teaching. Amsterdam and Philadelphia:
Benjamins.
Sinclair, J. (1987), The Nature of the evidence, in: J. Sinclair (ed.), Looking up:
an account of the COBUILD project in lexical computing. London:
Collins, 150-159.
Sinclair, J. (1991), Corpus, concordance, collocation. Oxford: O.U.P.
Sinclair, J. (1996), Corpus to corpus: a study of translation equivalence, in:
Sinclair et al. (eds), 171-196.
Sinclair, J., J. Payne and C. Pérez Hernández (eds) (1996), Corpus to corpus: A
study of translation equivalence, International Journal of Lexicography,
Special Issue, 9 (3).
Teubert, W. (1996), Comparable or parallel corpora?, in: Sinclair et al. (eds),
238-264.
Tognini Bonelli, E. (1996a), Towards translation equivalence from a corpus
linguistics perspective, in: Sinclair et al. (eds), 197-217.
Tognini Bonelli, E. (1996b), Corpus theory and practice. Birmingham: T.W.C.
Tognini Bonelli, E. (2001), Corpus linguistics at work. Amsterdam and
Philadelphia: Benjamins.
Using WebCorp in the classroom for building specialized
dictionaries
Natalie Kübler
University Paris 7 Denis Diderot
Abstract
In this paper, we present an experiment that was carried out to use finite corpora
and WebCorp in the classroom with a pedagogical objective that was different
from language teaching. The use of WebCorp and corpora was embedded within
the wider framework of teaching students how to approach machine translation
by building a customised dictionary with the aid of available tools and resources.
The issue of exploiting finite corpora and the Web as a corpus was raised in this
framework and will be discussed here. Although there is no simple and definite
answer, the experiment led students to investigate the Web as a source of
information and to better understand the issues involved in corpus building and
corpus use.
1. Introduction
In this paper, we present an experiment that was carried out using finite corpora
and WebCorp in the classroom with objectives that were different from mere
language teaching (see section 2.1). Corpus-based, or corpus-driven teaching as
Johns (1988) termed it, can be adapted to using the Web as a corpus; in this
context, WebCorp can be a useful tool for language teachers and students. Our
purpose was however slightly different. Although WebCorp was tested in a
pedagogical situation, its use was embedded within the wider framework of
teaching students how to extract lexical and syntactic information to build
customised dictionaries for machine translation (MT) in languages for specific
purposes (LSPs). In the light of this specific context, we shall tackle the issue of
finite corpus use as opposed (or not) to WebCorp use.
The first part of this paper presents the pedagogic and scientific context of
the experiment. Some details must be given about the project in which the
experiment took place, since it has an impact on the type of results that were
expected from the WebCorp search.
In the second part, the resources and tools that were used are described.
In the third part, samples of the results obtained with WebCorp and with
the finite corpora will be presented and explained. We will show how WebCorp
can be used to complement and update search for linguistic information in finite
corpora. This part will also discuss the benefits of using WebCorp parallel to
querying finite corpora.
The conclusion will deal with future prospects and enhancement
requirements for WebCorp.
2. Experiment context
The experiment took place in a postgraduate syllabus called 'Language Industry
and Specialised Translation'.1 This syllabus is oriented towards computer-mediated
translation. Students have courses in four specific areas, namely
translation: theory and practice;
linguistics: syntax, corpus linguistics, terminology;
cultural studies;
technology: database management systems, HTML, XML, translation
memory, localisation tools, and machine translation.
This translation training is semi-professional since students spend every other
week on work placement with a private company.
WebCorp was used in an introductory course to corpus linguistics and its
application to translation and terminology. As the best way of training students is
to place them in real-life situations, they had to take part in translation projects in
the subject area of computer science. Part of the projects consisted in building
customised dictionaries for machine translation. Students were first shown how to
manually extract terms (Pearson 1998), to use term extraction software, and to
extract lexical and syntactic information in the source and target languages from
comparable and parallel corpora. They then practised extracting linguistic
information from the Web using WebCorp. The two approaches were applied to
dictionary building.
2.1 Pedagogical objectives
The objectives of this project involved not only teaching the students the various
skills which will be described below, but also considering the limits of finite
corpus use versus 'Web as a corpus' use. This approach is very profitable to
young people who are computer-literate, and for whom the Web is regarded as
the fount of all knowledge. Comparison helps them find the advantages and
disadvantages of the two approaches; it is also aimed at showing them that
information extracted from the Web must be carefully examined and not be taken
for granted. This also raised the issues at stake in corpus-building as opposed to
using texts collected without specific criteria, or using the Web.
Below are listed the kinds of competence students should have acquired at
the end of the course; they should be able to:
use a machine translation (MT) system and add appropriate bilingual
dictionaries to improve translation results;
use available term extraction tools, which do not require particular
computing skills;
use available resources, such as Web-based bilingual glossaries, self-made
or Web-based finite corpora and the Web as a corpus;
proofread translation results to produce a professional translation;
analyse the system's translation errors from a linguistic point of view, in
order to grasp the very delicate linguistic issues that are at stake in MT.
This will show students how important the human factor is, whatever tools
and resources are available for each part and step of the translation process.
The whole range of competences was included in the translation project that will
be described below. The workflow of translating documents with customised
machine translation in which corpus use is predominant is fully described in
Kübler (2002).
2.2 Project description
The projects in which WebCorp was used and tested consist in translating texts in
the computer science area, using a customisable machine translation system.
Some texts to be translated from English into French were dictionary definitions,
extracted from a Web-based computing dictionary;2 the other type of texts were
some of the Linux HOWTOs that have not yet been translated. The Linux
HOWTOs are the user manuals of the Linux operating system; they have been
translated into several languages by the various Linux communities.3 The French
Linux community is quite active and has translated most HOWTOs. However, as
new HOWTOs, or updates of previous ones, are regularly released, there are still
some documents that remain to be translated. Our students thus had to translate
some of the most recent HOWTOs.
The machine translation system that was used was Systran, and more
precisely Systranet, which is Systran's customisable on-line translation system. It
allows users to create their own (bilingual or multilingual) term bases to improve
translation results; this feature can give quite good results in specialized
translation. Students had to create their own customised dictionaries, in order to
test them with Systranet.
To create term bases (or customized dictionaries) from scratch, the first
step involved automatically extracting term candidates from the English text to be
translated and then finding their French equivalents. The first dictionary would
then be used to translate the text.
Systranet offers the possibility of aligning the source and target text, and,
in the aligned target text, of highlighting unknown terms in red and the user's
dictionary terms in green. These features make it possible for the user to add to
the dictionary all the words that are not recognized by Systran's home
dictionaries. The second step is more demanding in terms of linguistic work:
students compare source and target texts to complement and modify the
dictionary until no more dictionary change can improve the translation result.
When the dictionary is saturated, i.e. no more change can be made to
improve the translation result, the final translation of the text is achieved; the
result will then be proofread and post-edited to correct the translation errors that
could not be solved by modifying the dictionary.
Finite corpora and the Web as a corpus are key elements in the process of
building and correcting dictionaries, and of proofreading the final translation
result. After extracting term candidates from the source texts, students must
decide which candidates are actual terms. Corpus query must then be applied to
answer this question. Parallel corpora are then necessary to help find the French
equivalents for the terms. Corpus use is not only essential to finding terms and
their equivalents, it is also often the only possible means of finding syntactic
information for the terms, especially for verbs and adjectives; verbs and
adjectives are in fact not always considered terms, and little linguistic information
about these classes can therefore be retrieved.
Finite corpora are not the only resources that are essential to creating
customised dictionaries; it will be shown later how the Web as a corpus can
complete and update the information extracted from finite corpora.
3. Tools and resources
This section describes the tools and resources that were used to fulfil the
assignments in the project. The two most important resources for the tasks under
consideration in this paper are WebCorp and the finite corpora that were used.
3.1 WebCorp
WebCorp is a tool developed in a project that was set up at the Research and
Development Unit for English Studies at the University of Liverpool. Its
objectives were to investigate the usability of the Web as a linguistic resource.
The project also had to identify and address problems of retrieval and analysis.
WebCorp allows the user to type in a request for linguistic information that is processed and
fed into the selected Web search engines. The search engine returns a list of
URLs that WebCorp accesses directly; it then returns concordances or collocates
for the query. We will show below how it can be used to retrieve useful linguistic
information to create bilingual term bases in LSPs. A detailed description of
WebCorp has been given by Renouf (2003) and Kehoe and Renouf (2002).
3.2 Corpora
The finite corpora that were available for the students were first developed at the
Laboratoire de Linguistique Informatique at the University of Paris 13. They have
been augmented and enhanced at the University Denis Diderot Paris 7 for several
years. These corpora, parallel and comparable, are accessible via a Web-based
interface,4 in which a concordancer allows visitors to use Perl-like regular
expressions, as described in Foucou and Kübler (2000). The following corpora
were used by the students:
a) The parallel English-French HOWTO corpus, which has been used for several
years at Paris 7. It is made up of the Linux HOWTOs (user manual files of the
Linux operating system), which were originally written in English. The
HOWTOs have been translated into several languages, including French. The
source language and target language texts were aligned at section level. Each
language component contains approximately 500,000 words. It is possible to ask
for concordances and then have an aligned view of the section in which the term
or expression occurs. Concordances with regular expressions are very useful for
extracting refined linguistic information about terms. Furthermore, by looking at
the equivalent section in French, it is possible to find the French equivalents of
the term or expression.
b) Smaller comparable corpora in English and French representing subdomains
of computing (less than 100,000 words), such as artificial intelligence,
peripherals, computer games, digital cameras, etc. were also made available to the
students. This led us to develop a methodology for querying comparable corpora
to extract French equivalents of an English term.
c) Our students used an experimental version of WebCorp that gives access to
additional features, such as regular expressions and domain filtering. This was
particularly useful as the students were working in a specific subject area, namely
computer science.
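The kind of query described under (a), a regular-expression concordance with access to the aligned French section, can be sketched in Python as follows. This is an illustrative reconstruction and not the Paris 7 interface: the function name, the representation of the corpus as a list of (English, French) section pairs and the context width are all assumptions.

```python
import re


def aligned_concordance(sections, pattern, width=30):
    """Return concordance hits for `pattern` in the English sections.

    sections is assumed to be a list of (english_text, french_text) pairs
    aligned at section level. Each hit carries the aligned French section
    so that an equivalent can be looked for by eye.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for index, (english, french) in enumerate(sections):
        for match in regex.finditer(english):
            hits.append({
                "section": index,
                "left": english[max(0, match.start() - width):match.start()],
                "match": match.group(0),
                "right": english[match.end():match.end() + width],
                "french": french,
            })
    return hits
```

A pattern such as r"\bbuffer(\s+\w+)?" retrieves the node word together with the word that follows it, which is one simple way of surfacing candidate compounds.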
3.3 Tool: machine translation
Apart from WebCorp and the university-developed Web-based interface for
corpus query, the other tools that were used are commercially available, for
example Systranet5 and Terminology Extractor.6
Systranet is an on-line machine translation system, developed by Systran.
It gives access to Systran's more than 35 language pairs and allows users to translate
either a text file, or a formatted file, or a Web page. Users can create their own
customised dictionaries and compile these into the system to help them translate
specialised texts. Users can work in a network of translators, each member of a
group having access to the other members' dictionaries. The interface we used
was adapted to specific pedagogical needs, allowing the teacher to create the
groups and to have access to all the students' dictionaries, as well as partial
access to the logs of the sessions.
The most interesting feature for our project, apart from the translation
engine as such, was the possibility for the user to create and compile customised
dictionaries.
Dictionaries contain more than just a correspondence between a source
word (in this case in English) and a target word (in French), since users can enter
what is called 'advanced linguistic information' in these. The information can be
divided into several levels:
part-of-speech information: basic part-of-speech information can be attached to
the entries, such as verb, noun, proper noun, adjective, and 'sentence', which
deals with adverbs, adverbial phrases, or whole idioms, such as 'your mileage may
vary'.
syntactic information, such as the governed prepositions for nouns, verbs, and
adjectives, or direct objects for verbs. A verb which governs a preposition is
shown in example (1).
(1) access (verb)(noprep)=accéder (verb)(prep:à)
semantic information, such as the conceptual class of the possible direct object
of a verb, as shown in example (2). In this example, the coding for the verb
run indicates that the direct object must belong to the semantic class [OS], which
covers all terms sorted under the operating system class. Below the verb, the
noun Unix is marked as belonging to the [OS] class. This means Unix can be the
direct object of run.
(2) to run (verb)(context:OS)
Unix (noun) (SEMCAT:OS)
morphological information, such as the plural form of a noun in any language,
the gender of a noun in French, or altering the number in the target or source
language. Example (3) shows how the gender of cache can be altered to
masculine. In general French, the noun cache ('hiding place') is feminine,
whereas in computer science French, it is masculine and means 'cache'.
(3) cache (noun) = cache (noun) (masculine)
The term URL takes a plural in -s in English, i.e. URLs, whereas in French, it is
invariable; this type of information can be coded in the dictionary, as is shown in
example (4).
(4) URL (noun) (plural:URLs) = URL (noun) (plural:URL)
translational information, such as DNT, which means that the string must not
be translated, i.e. it must remain as it is in the translation process. This feature is
quite useful in computer science, as there are command names for example that
are never translated, such as the Unix command cd, or mkd.
Figure 1 shows a dictionary sample, in which various types of coding are
presented.
AT&T (company name)
auto-dial (noun)=numérotation automatique (noun)
automatic number identification (noun)=identification de l'appelant (noun)
based (adjective)(noprep)=architecturé (adjective)(prep:autour)
basic language constructs (noun) (plural)=base de construction du langage (noun) (singular)
to log in (verb)=se loger (verb)
to introduce (verb) (context:extensions)=introduire
to carry (verb)(context:digital data)=transmettre (verb)
Figure 1. Dictionary sample
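The entry notation used in examples (1) to (4) and in Figure 1 is regular enough to be read mechanically. The following Python sketch parses one entry as it is printed here. It makes no claim about the file format that Systranet itself expects: the function names and the returned dictionary layout are assumptions for illustration.

```python
import re

# A feature is any parenthesised group, e.g. (noun) or (plural:URLs).
FEATURE = re.compile(r"\(([^()]*)\)")


def parse_side(text):
    """Split one side of an entry into its headword and its feature codes."""
    headword = text.split("(", 1)[0].strip()
    features = [feature.strip() for feature in FEATURE.findall(text)]
    return {"headword": headword, "features": features}


def parse_entry(line):
    """Parse 'source (codes) = target (codes)'; the target may be absent."""
    source, _, target = line.partition("=")
    entry = {"source": parse_side(source)}
    if target.strip():
        entry["target"] = parse_side(target)
    return entry
```

For instance, parse_entry("URL (noun) (plural:URLs) = URL (noun) (plural:URL)") keeps the plural codes on each side separate, so the invariable French plural can be checked against the English one.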
3.4 Tool: term extraction
To extract term candidates from the source texts, a very simple and user-friendly
tool was applied, viz. Terminology Extractor. This tool works for English and
French and gives several types of results. First, it extracts all the words that are
recognised by its dictionaries, plus all the non-words, i.e. words that are not in the
dictionaries. The non-word feature is interesting, as it usually gives a list of very
specialised words which are not in general dictionaries. Then it extracts in a
window of two to ten words all the sequences that appear at least twice in the
text. This feature allowed the students to have a list of term candidates among
which they could choose the actual terms with the help of the various corpora and
WebCorp.
Debian Netscape accelerate

Permedia Dennis XFCE
RedHat Dialogs Corel
RgbPath FAQs
ServerFlags Howto Microdoft
ServerLayour README Linux
XkbLayout XkbModel RealAudio
Solaris ISA
UI KDE GUI
USB LeftOf IRQs
WindowMaker ModulePath NFS
Figure 2. Results of the non-word extraction from a HOWTO document. Apart
from Dennis and accelerate, all the words are terms or product names in
the computer science area.
A sample of the term extraction results is given in Figures 2 and 3. Figure
2 contains the results of the non-word extraction, and Figure 3 the results of the
collocation extraction. They show that substantial linguistic work must be done
on the results to obtain an actual list of terms (single and compound).
Internet Gateway 3 { Looking look } at the Network 3

IP aliasing 3 name server 4
ISA { card cards } 3 Network { Device devices } 4
latest version 3 Linux computer 3
DHCP Server 15 IP { addresses address } 16
Linux gateway 3 Linux box 16
modules file 3 card on the Linux box 4
Scripts / ifcfg 3 DNS { Server servers } 17
server will start 3 interface configuration file 3
{ Network networking }{ Card Cards }12
Figure 3. Results of a collocation extraction from a HOWTO document. The
words in bold are actual terms.
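The two extraction steps described in this section, non-words and sequences of two to ten words occurring at least twice, can be approximated with a short Python sketch. It is not a reimplementation of Terminology Extractor, whose internals are not described here: the tokenisation rule, the function names and the use of a plain word list as the "dictionary" are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: alphabetic-initial strings, keeping & and -."""
    return re.findall(r"[A-Za-z][A-Za-z0-9&-]*", text)


def non_words(tokens, lexicon):
    """Return, in order of first appearance, tokens absent from the lexicon."""
    seen = []
    for token in tokens:
        if token.lower() not in lexicon and token not in seen:
            seen.append(token)
    return seen


def repeated_sequences(tokens, min_len=2, max_len=10, min_freq=2):
    """Count every sequence of min_len..max_len tokens and keep repeated ones."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(seq): freq for seq, freq in counts.items() if freq >= min_freq}
```

As Figures 2 and 3 suggest, both lists are only term candidates: items such as Dennis or server will start pass the mechanical filters and still have to be rejected by hand with the help of the corpora.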
3.5 Other information sources
Finite corpora and the Web as a corpus were the main resources used in the
project. There were also secondary sources, such as on-line glossaries, or on-line
term bases. These were presented to the students to help them understand why
data-driven information is essential to this type of work, and why dictionaries and
glossaries are not always satisfactory. Figure 4 shows the type of information that
can be accessed in a Web-based bilingual term base. The search for the
translation of the English word buffer yielded the translation mémoire tampon,
and three synonyms and translations of these, but no syntactic or phraseological
information. There were no compounds of the word buffer, although it is very
common in computer science English.
ENGLISH                 FRENCH
buffer                  mémoire tampon n. f.
Syn.                    Syn.
buffer storage          tampon n. m.
buffer memory           mémoire intermédiaire n. f.
intermediate memory     zone tampon n. f.
Figure 4. The term buffer and its French translations in Le Grand Dictionnaire
Terminologique.
4. Using finite corpora and WebCorp
Taking our experiment in the classroom into account, we want to show that the
use of finite corpora and the use of WebCorp are neither contradictory nor incompatible.
Available finite corpora, such as the HOWTO corpus and the smaller ones in
subdomains of computing, can give the user a lot of information. But as
computing is a very quickly changing domain, new terms are coined all the time,
which means that available corpora tend to become insufficient or slightly
obsolete, even though they can be regularly updated. In the subject area of
computer science, most neologisms can be found on the Web. So being able to
query the Web as a non-finite corpus is a fruitful way of obtaining missing
information. Taking the above-mentioned example of buffer, we will describe and
discuss this.
4.1. Buffer in the HOWTOs
As shown in Figure 4, the term buffer is translated as mémoire tampon in
French. However, Le Grand Dictionnaire Terminologique did not mention any
compound for this term. Looking for buffer in the HOWTO corpus produces
several multi-word units. Looking at the aligned section in French allowed us to
find French equivalents of these, as shown in Figure 5.
buffer cache (noun)                  mémoire cache (noun)
buffer memory management (noun)      gestion de la mémoire tampon (noun)
buffer store (noun)                  zone tampon (noun)
DRAM write buffer (noun)             buffer d'écriture DRAM (noun)
frame-buffer (noun)                  tampon de trame (noun)
Figure 5. Multi-word units for buffer and their French equivalents.
The problem is that the HOWTO translators have not always translated the whole
text, or they may have modified sentences in such a way that some words just
disappear. As a result, some compounds can be found, but not all, and not always
their French equivalents. This indicates the limitation of finite corpora. New
terms that were created after the collection of the corpus, or translations that have
been radically modified, cannot be found in a finite corpus. Term bases are
generally not complete enough. Because of this, the information must be looked
for on the Web. As not only lexical information but also phraseological and
translational information is necessary, a tool that makes it possible to extract
concordances from the Web is likely to be appropriate. The next sub-sections deal
with examples of Web search, using WebCorp, and demonstrate how the
necessary information can be found.
396 Natalie Kübler
4.2. WebCorp: searching for French equivalents
As the Web is not an aligned corpus, heuristics must be applied to find the French
equivalents for English words. One possibility consisted in searching for an
English term on a French Web-site. In the current state of WebCorp, the only way
of doing that was to look for URLs in the French domain, i.e. ending in .fr. In
French, computer scientists often use the English term for a given concept. Some
translators therefore use the English term and often give its French equivalent in
parentheses at the beginning of the document and then no more. Others use the
French term, but add the English word in parentheses. This permitted us to find
translations and also more terms, as illustrated in Figure 6, which shows a
concordance for buffer extracted with WebCorp. These concordance lines yield
two multi-word units in English, viz. buffer overflow and heap buffer overflow,
and their equivalents in French.
me des débordements de buffer (tampon en français). Pour
com/advisories/bufero.html . Writing buffer overflow exploits a tutorial for
de NOP . débordement de buffer dans le tas (heap buffer overflow)
(buffer overflow) . débordement de buffer sous windows (et oui ;-)) --[
Figure 6. Concordance for buffer.
Not all searches provide the reader with the English source term in parentheses.
In the case of dial-in line, for example, only part of the term is translated into
French, and no indication of the source term is given. Figure 7 shows an
occurrence of ligne de dial-in, in which only part of the term is translated.
However, other occurrences of dial-in in French text show that this is the correct
way of using it in French.
Monter un serveur PPP/POP dial-in Par Hassan Ali AVERTISSEMENT : a
avec une des lignes de dial-in PPP et son adresse IP
assigner dynamiquement aux utilisateurs du dial-in PPP. Ceci, bien sûr
pouvez assignez vos clients de dial-in : # Secrets for authentication using PAP
Doe appelle à l'aide de l'adaptateur dial-in de Windows 95 qui est
Figure 7. Dial-in in French documents.
4.3. WebCorp: searching for linguistic information: to run
As mentioned above, creating a customised dictionary for machine translation
does not only require extracting lexical information from corpora, complemented
by using the Web as a corpus. Phraseological information is also essential and
must be inserted in the dictionary. This type of information is also important
during the proofreading and post-editing process of the translation.
Terms of a domain have specific meanings that are usually unknown in
general English. In computer science, the verb to run has a meaning that differs
greatly from its ordinary meanings in English. Not surprisingly, the French
translation of the verb in computer science French has nothing to do with its
general meaning translation. When to run means to walk quickly, its French
equivalent is courir; to run used in the computing world is translated by tourner,
lancer or excuter, which have nothing in common with courir.
To run in computer science can be followed by a direct object and then
either by the preposition on or by the preposition under, usually depending on the
type of argument that is used. Example (5) shows instances of the syntactic
structure:
(5) You can run a program under an operating system
    You can run a program on a platform + OS
An argument that appears after the preposition under can also be used after on,
but the opposite is quite rare. Building a customised dictionary means listing, as
exhaustively as possible, the different verb arguments that can occur in the
different positions in a sentence. Finite corpora can produce a quite exhaustive
answer, which needs to be complemented and updated by using the Web as a
corpus. Figure 8 shows how the expression run * * on, which uses two
wildcards instead of words before the preposition on, can give significant results
on the arguments that can fill the syntactic positions. These arguments could not
be found in the HOWTO corpus or in the smaller finite corpora.
harm is done if you run cvs init on an already set-up repository.
containing all you need to run Tcl/Tk on a Macintosh. tcl8.0p2.tar
nd showed that it can run equally well on a Sharp or Alcatel telephone
you will be able to run PETSc ONLY on one processor. Also, you will
ith my favorites tools, and run the binary on a real ST. If the
Figure 8. Arguments of the verb to run
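The way a query such as run * * on exposes the arguments of the verb can be sketched offline. The following is a minimal illustration, not WebCorp's own code: it assumes that each free-standing * stands for exactly one whitespace-delimited word, and the function name slot_fillers is hypothetical. The sample lines are taken from Figure 8.

```python
import re

def slot_fillers(query, lines):
    """Return the words matched by each '*' in a query such as 'run * * on'.

    Assumption: a free-standing '*' stands for exactly one word.
    """
    parts = []
    for token in query.split():
        # A lone '*' becomes a capturing one-word slot; other tokens are literal.
        parts.append(r"(\S+)" if token == "*" else re.escape(token))
    pattern = re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)
    found = []
    for line in lines:
        for match in pattern.finditer(line):
            found.append(match.groups())
    return found

# Concordance lines copied from Figure 8.
figure_8_lines = [
    "harm is done if you run cvs init on an already set-up repository.",
    "you will be able to run PETSc ONLY on one processor. Also, you will",
    "ith my favorites tools, and run the binary on a real ST. If the",
]

for fillers in slot_fillers("run * * on", figure_8_lines):
    print(fillers)
```

Each tuple holds the words that filled the two wildcard positions, which is the raw material for listing verb arguments in a customised dictionary.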
Another useful feature offered by WebCorp is the collocate function; it gives the
most frequent collocates of the sequence. Frequent collocates of the verb to run,
for example, are Debian, Alpha and messages, the first two being product names
in computer science. As WebCorp is limited in the number of sites that can be
opened, it is possible to filter out the collocates and discard the URLs in which
they occur. It can be done by using the exclude feature (using the - sign, as in
search engines). This allows WebCorp to extract concordances from other URLs,
which then provide more linguistic information.
The same operation can be applied to extract linguistic information about
the French equivalent of the verb, i.e. tourner. As shown in Figure 9, the first
pass is not always conclusive, since there are occurrences that have nothing to do
with computer science. The sequence tourn* * * sur will find all the words
beginning with tourn, followed by two words, followed by the preposition sur
(on).
First pass without filter apart from .fr and computers:
État de conservation : Ce denier tournois est frappé
japonais. . n'a pas renoncé à tourner son film sur le
sterling bruce subspace sun open : tournoi de golf sur
d'éternité: quatre poules blanches tournant en rond sur une place de village et
Figure 9. Occurrences of tourn without any WebCorp filtering.
In the second pass, a filtering option can be employed to include keywords of
computer science, such as programme, système, Linux and machine, and to
exclude words such as film, napoleon or poule. This makes the
search result much more consistent with the subject, as shown in Figure 10.
fonctionner avec Windows, il peut tourner ou pas sur des cartes vidéo ou
de type Unix qui peut tourner entre autres sur PC. Il est installé par
des ordinateurs distants Pour faire tourner un programme sur une machine distante dont l'adresse
texte ASCII par un module tournant sous Windows (sur PC) et devrait bientôt
Figure 10. Occurrences of tourn using filters.
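The effect of the second pass can be approximated with a simple include/exclude test. The sketch below is only an illustration under stated assumptions: it filters individual concordance lines, whereas WebCorp's own filter may operate on whole pages, and the function name passes_filter is hypothetical. The keyword lists are the ones mentioned above; the sample lines come from Figures 9 and 10.

```python
import re

def passes_filter(line, include, exclude):
    """Keep a line that has at least one include keyword and no exclude keyword."""
    words = {w.lower() for w in re.findall(r"\w+", line)}
    if words & {w.lower() for w in exclude}:
        return False
    return bool(words & {w.lower() for w in include})

include = ["programme", "système", "Linux", "machine"]
exclude = ["film", "napoleon", "poule"]

lines = [
    "n'a pas renoncé à tourner son film sur le",                  # Figure 9
    "sterling bruce subspace sun open : tournoi de golf sur",     # Figure 9
    "Pour faire tourner un programme sur une machine distante",   # Figure 10
]

kept = [line for line in lines if passes_filter(line, include, exclude)]
print(kept)
```

Matching is on whole word forms, so an inflected form such as poules would not be caught by the exclude word poule; a fuller version would need stemming or wildcard support.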
4.4. Discussion
These few examples show occurrences of terms and their phraseological contexts
that could not be found in the finite corpora on computer science. Studying
terminology and phraseology for practical purposes raises issues that are different
from describing the language as such. Describing languages for specific purposes
means working in well-defined subject areas, which does not need huge corpora
as in the study of general language (if there is such a thing as general language).
A few hundred thousand words, sometimes fewer than a hundred thousand,
are enough to describe the characteristics of a language for specific purposes.
However, applying this type of description for practical purposes, such as
creating a dictionary that will be integrated into a machine translation system,
raises the issue of exhaustiveness. Machine translation needs human input to
achieve satisfactory translation results. In this case, a small, specialised corpus is
not enough. Moreover, the issue of up-to-date information arises. WebCorp, as a
tool enabling the user to make daily updates, is ideal for complementing and
updating the information extracted from time-bound specialised finite corpora.
However, using finite corpora presents some advantages over WebCorp
that will be difficult for a concordancer using the Web as a corpus to overcome.
Finite corpora have the significant advantage of presenting controlled and
balanced information. The texts collected in a corpus have been selected in
preference to other candidates. Using the Web as a corpus implies that one has no
control over the content of the documents that are extracted. The huge quantity of
documents is also a problem.
5. Conclusion
While, in our case, finite corpora were used as the basis for the creation of
customised dictionaries, WebCorp provided us with more complete and up-to-
date linguistic information. In the classroom situation, students were faced with
those issues, i.e. finding information in finite corpora, discovering they needed
more, and using WebCorp instead of collecting a bigger corpus in the domain.
Students learned how to use heuristics to find appropriate information using
WebCorp; this also led them to note the advantages of WebCorp over classical
search engines, namely the availability of concordances, collocates, regular
expressions, and the possibility of limiting and filtering the linguistic information.
WebCorp still needs some improvements, such as refining language
identification, and domain filters. Linguistic information extracted with WebCorp
would be more accurate if domain filters could be used to restrict the search to
one domain. Refined regular expressions would allow users to extract more
accurate phraseological information. As these improvements are integrated into
the next release of WebCorp, the next step will be to test them and see if the
results are significantly improved.
Notes
1. The French DESS (Diplôme d'Études Supérieures Spécialisées) is
equivalent to the second year of a vocational M.A.
2. FOLDOC: Free On-Line Dictionary of Computing.
3. Linux is a Unix type operating system that is freely available to the
community.
4. http://wall.jussieu.fr
5. http://www.systranet.com
6. http://www.chamblon.com
References
Foucou, P.-Y. and N. Kübler (2000), A Web-based environment for teaching
technical English, in: L. Burnard and T. McEnery (eds), Rethinking
language pedagogy: papers from the third international conference on
language and teaching. Frankfurt am Main: Peter Lang. 65-73.
Johns, T. (1988), Whence and whither classroom concordancing?, in: T.
Bongaerts, P. de Haan, S. Lobbe and H. Wekker (eds), Computer
applications in language learning. Dordrecht: Foris. 9-27.
Kehoe, A. and A. Renouf (2002), WebCorp: Applying the Web to linguistics and
linguistics to the Web, in: Proceedings of the WWW 2002 Conference,
Honolulu, Hawaii, 7-11 May 2002.
Kübler, N. (2002), Creating a term base to customize an MT system: Reusability
of resources and tools from the translator's point of view, in: E. Yuste
(ed.), Proceedings of the Language Resources for Translation Work and
Research. Workshop of the LREC Conference. Las Palmas de Gran
Canarias: ELRA. 44-48.
Pearson, J. (1998), Terms in context. Amsterdam: John Benjamins.
Renouf, A.J. (2003), WebCorp: providing a renewable energy source for corpus
linguistics, in: S. Granger and S. Petch-Tyson (eds), Extending the scope
of corpus-based research: new applications, new challenges. Amsterdam
& Atlanta: Rodopi. 39-58.
The accidental corpus: some issues in extracting linguistic
information from the Web
Antoinette Renouf, Andrew Kehoe, David Mezquiriz
University of Liverpool
Abstract
The Web is a text store which can potentially supplement traditional corpora as a
source of up-to-date linguistic data. The WebCorp project investigates this
potential, and in its second year tackles some residual problems inherent in the
nature of Web text, thereby refining its retrieval and analysis tool for the
facilitation of corpus linguistic study.
1. Introduction
The Web is a vast, growing store of text-based information which in principle
could meet many of the linguist's needs for evidence of authentic written
language use. Rare, topical, new and changing words and word uses that are not
captured in existing finite corpora can often be found in Web-based text.
However, the nature of the Web as a random accumulation of heterogeneous
texts, many being less conventionally text-like, poses problems for the corpus
linguist who tries to access it through existing search engines. The WebCorp
project (Renouf 2003; Kehoe and Renouf 2002) was set up at the University of
Liverpool in December 2000, with the objectives of investigating the usability of
the Web as a linguistic resource, and of identifying and addressing some of the
problems of retrieval and analysis that it presents. A WebCorp tool has been
developed to demonstrate a set of search functions to users, with a facility for
gathering feedback, and this system has been iteratively enriched according to a
project design and in response to user comments.
In this paper we begin with a brief exposition of the structure and basic
linguistic retrieval functions of the WebCorp tool, before moving on to outline
some of the issues we have encountered in interacting with the Web, some
solutions that we have devised, and other measures that we envisage taking to
enhance the performance of Web linguistic access, retrieval and analysis.
2. The WebCorp system
2.1 Structure of WebCorp tool
Several approaches could be taken to extracting linguistic data from the Web and
processing it online. The WebCorp system has adopted a straightforward
approach, as shown in Figure 1. WebCorp has six basic stages of operation. It
first registers the user's request for linguistic information. Then it translates the
request and feeds it to a search engine. The search engine locates relevant texts,
returning a list of URLs to WebCorp, which accesses these directly, processes the
associated texts in memory, and then returns concordance results to the user
interface.
[Diagram: User Interface, WebCorp, Search Engine and Web Texts, connected by steps numbered 1 to 5]
Figure 1. WebCorp operational diagram
A linguistic extraction system needs a GUI (Graphical User Interface) that
displays its functions clearly and offers a range of options to accommodate the
anticipated needs of different users. WebCorp currently runs two versions of the
GUI. The publicly accessible interface offers a reduced number of the options and
variables displayed in the advanced GUI. The advanced GUI in its latest version
is being tested by ICAME members, and currently looks as in Figure 2.
Figure 2. The WebCorp GUI
2.2 Sample retrieval results from WebCorp
As mentioned, traditional corpora of present-day language are not large enough to
contain rarer usage; nor do they capture the latest coinages, due to the time
required for their creation, and with neologisms flowing into the language on a
daily basis. The neologism Enronomics was not found in existing corpora in May
2002. It is derived from Enron, a US company that in early 2002 was discovered
to have conducted large-scale financial malpractice. The name now carries
connotations of the particular kinds of shady business dealing and poor
management style involved, and is used to characterise companies and practices
exhibiting similar qualities. Contexts for this neologism could already be
extracted from the Web by WebCorp in May 2002. They indicated that the root
form Enron was extremely productive, already appearing in a range of derived
forms. In the sample output for Enronomics in Figure 3, we also find Enronyms,
Enronitis, Enronify, Enronethics, Enronizing, enronish, Enronitize and enronomy.
In addition, we note that Enronomics is probably modelled on Reaganomics, as is
Clintonomics.
attack Bushe economic policies with the term Enronomics (a phrase that originated
to Believe He Knows About the Economy? Enronomics = Contributors Get Richer
corporate malfeasance. Recently spotted Enronyms: Enronitis, Enronify, Enronomics
laid bare by what rivals call Enronomics the political fable of the Enron corporation
slogan and neutralize the Enronomics accusations, may I coin the term Enronethics
Team Bush - talk of Enronomics, or Enronizing Social Security and Medicare
believing their press, watch out. It's Enronomics, folks. The rich seducing the poor
to be enronish and to practice Enronomics. We've seen ugly, enronish sights before
The Looting of America: Reagonomics, Clintonomics and Enronomics
Strategy) . Enronomics Explained (deliberately driving the country into
spent two weeks talking about Bush's Enronomics and Enronizing Social Security.
It blows the lid off Bush's Enronomics, and his plan to Enronitize Social Security
hardest hit by the Bush trickle down enronomics. Now it looks like the Bush enronomy
Figure 3. WebCorp output for search term Enronomics Domain: .uk or .com
Alternatively, one might wish to check the neologistic status of a word through a
Web search. In an article on Health Obsessions in the Observer of 14.04.02, the
vogue term medicalisation is presented in inverted commas as though a
neologism. Though there is no consistent meta-information for date on the Web
to support the chronological extraction of word occurrences, WebCorp can
retrieve at least some in-text dates indicating that the word is not new, but has
been used as early as 1974, as shown in Figure 4.
1. legislation shifted from criminalisation to medicalisation of drug use
2. the causes and effects of the medicalisation of abortion, focusing on the law
3. decriminalisation and legalisation. Medicalisation: prohibited drugs on prescription
4. (1991) medicalisation a more effective way of controlling deviance than legal punishment
5. The psychologisation/ medicalisation of school education
6. A political sociology of lifestyle pharmaceuticals and medicalisation
7. the medicalisation and psychologisation of PMS is done to market
8. over-medicalisation of women's normal physical processes (e.g. menopause);
9. Crawford R (1980) : Healthism and the medicalisation of everyday life
10. RSI exemplifies the medicalisation of work behaviour. Spillane, 2000
11. medicalise, and therefore pathologise, difference. The medicalisation in maternity care
12. Scott (1988) discusses the usefulness of the medicalisation of childbirth
13. BMJ 2002. 324: Education and debate. Has the medicalisation of childbirth gone too far?
14. palliative medicine and the medicalisation of death, European Journal of Cancer Care
15. medicalisation of life's normal processes: ageing, sexuality, unhappiness, and death
16. in 1974, when I wrote Medical Nemesis, I could speak about the medicalisation of death
17. only the very rich can avoid the medicalisation of the end of life (Illich, 1976).
18. Seymour JE. Revisiting medicalisation and "natural" death. Soc Sci Med 1999; 49: 691-704
Figure 4. WebCorp output for search term medicalisation
Figure 4 also includes evidence of the vogue use of medicalisation to mean 'treat
medically a natural condition as if it were a disease', in the context of words such
as ageing, childbirth, everyday life, death, and psychologisation, as well as more
established uses. In the context of abortion or drugs, medicalisation is used to
mean 'decriminalisation'; while in the context of terminal conditions, it can also
mean 'treating with medicine', collocating with such words as palliative. The
rarity of inverted commas here indicates that the word is no longer considered to
be a new coinage, the one use (in 16) being to indicate the novelty of its status
back in 1974.
3. Issues arising in treating the Web as a corpus
During the development phase, we have established many of the needs of users
via our feedback mechanism. These have led us to face a number of retrieval and
processing issues, which we shall outline below, together with solutions that we
have found. The major areas of concern are:
scope (recall in IR terms)
speed, both of access to, and retrieval of, Web text
the state of Web search engines and Web text
the types and formats of linguistic information required
refinement/relevance (precision in IR terms)
3.1 Scope
All things being equal, it seems a good idea to maximise the scope of Web search
in order to garner as many examples as possible. However, a Web search is
limited to the scope of indexing of the various search engines. A report (Bergman
2001) stated that the foremost search engine, Google, had indexed 2 billion Web
pages, but estimated that it only searched 10% of the Deep Web. The use of
multiple search engines (currently Google, AltaVista, Metacrawler, FAST,
Northern Light and SearchEngine.com) is a remedy that we have applied to
increase coverage.
3.2 Speed
Any Web language retrieval system will be subject to speed constraints. These
are imposed by each agent in the loop, including local server, university resources
and Web traffic. An arrangement which allows direct access to the Web via the
index built by one of the search engines is likely to increase speed. In the case of
WebCorp, this improvement is achieved by linking into SearchEngine.com, a
major UK-based system. Speedier processing can also be achieved through the
parallelisation of the downloading and processing of Web pages. Neither measure
brings huge benefits, however; a new order of processing power is required, of
the scale envisaged for the post-Internet era of distributed computing.
3.3 The state of the Web
3.3.1 Handling search engines
Search engines require careful monitoring since they are constantly changing:
opening up, closing down, amalgamating, adding new functionality, and
imposing new restrictions.
A problem in their current functioning that has consequences for corpus
linguists is the fact that they each access different pages, and different pages each
time. Thus the linguistic sample is not constant. The ephemeral nature of the Web
introduces a further dimension into the equation of comparability, the
impossibility of describing more than one phenomenon simultaneously in the
same body of data. The only solution, which means relatively little in linguistic
terms (as we shall explain later in relation to textual data) is to save the particular
download with its given time and date.
3.3.2 Handling Web pages
The Web page is in a state of disorder from every point of view that concerns
linguistic processing. To begin with, the basic unit of the word, even the boundary
between words, is erratic. Then, spelling is variable and presents a problem
analogous with that which has preoccupied generations of historical linguists.
Punctuation is haphazardly sprinkled, and frequently omitted (or suppressed by
some intermediate processing), a tendency that presents a particular dilemma in
that it removes the sole means of processing the surface text for sentence
boundary.
Web pages are a mixture of text and metatext (including URLs and other
links). For some purposes, the linguist requires access to the text itself; for others,
such as the study of meta-terms for specialised dictionary creation (see Kübler
and Foucou 2000), access to the metatext. Scarcely any purpose is served by a
system which retrieves a mixture of both. A partial solution here is to construct a
retrieval routine that identifies and ignores the kind of text, such as link text, on
the Web page which is not required.
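A retrieval routine of this kind might look as follows. This is a minimal sketch, not the WebCorp routine itself: it assumes that link text is whatever sits inside HTML a elements, it also drops script and style content, and it uses the standard-library html.parser module. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class RunningTextExtractor(HTMLParser):
    """Collect page text while ignoring link text and non-text elements."""

    SKIPPED = {"a", "script", "style"}  # assumption: these carry no running text

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def running_text(html):
    parser = RunningTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = '<p>The boat sank.</p><a href="index.html">Home</a><p>It was old.</p>'
print(running_text(page))
```

The solution is only partial, as noted above: a link embedded in the middle of a sentence would also be dropped, leaving a gap in the running text.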
3.4 Linguistic data requirements
3.4.1 Concordance presentation options
There are a number of variables that serve a linguist and are readily producible.
With reference to the WebCorp GUI, we offer options for case
sensitive/insensitive search, URL display and full text hyperlink, specifiable span
(ideally up to a maximum of the total text), and selected formats (including
HTML, ASCII and HTML Tables).
3.4.2 Sentence-length concordances
The production of sentence-length concordances might seem routine to the
linguist, but sentence identification can be problematic in electronic text, where
layers of processing can lead to the full stop (the prime clue to sentence
boundary) being suppressed. As mentioned, in Web text the use of the full stop is
even more erratic. In a grammatically tagged corpus, sentence ending could be
deduced from the grammar itself. With Web text untagged as it is, however, few
clues exist at surface level as to sentence boundary. A WebCorp heuristic
searches backwards from the search term to the previous full stop, until either one
has been traced or a maximum number of characters has been analysed. The
results are often uninformatively long, and look as follows:
owned first quarter losses after cutting costs in its South African and
Scandinavian operations Ananova: Melissa computer virus creator
gets 20 years in prison David Smith, who admitted creating the
Melissa virus that swamped computer networks worldwide and
caused millions of dollars in damage in 1999, was sentenced today to
20 months in prison, prosecutors said.
So another approach to finding sentence boundary has been tested with WebCorp,
in which it simply searches backwards through the text, left of the search term,
for the previous upper-case initial word. This simple measure is surprisingly
successful in identifying a sentence start, or at least a clause start, which is often a
satisfactory compromise in terms of the interpretability of a context. However, its
success is determined by various factors, such as grammar. For instance, it works
well with the verb swamped because the previous upper-case initial word is very
often the noun (or proper name) designating the clause subject. (This word relates
to David Blunkett's unfortunate remark in 2002 about schools being swamped
with immigrants). Our output is shown in Figure 5.
David Smith, who admitted creating the "Melissa" virus that swamped
computer networks worldwide and caused millions of dollars in damage in
1999, was sentenced today to 20 months in prison, prosecutors said.
January 2000 "Swamped!
Technology Summary: Swamped!
By combining research in autonomous character design, automatic camera
control, tangible interfaces and action interpretation, Swamped!
Academic Papers: Swamped!
Sorry, I have been swamped with other stuff but
Or, as with any developer, you're probably swamped with bugs.
Some of the competitors, however, persisted in racing until they were
swamped.
Birmingham City's ticket offices were bracing themselves to be swamped
by eager football fans today hoping for a ticket for the Division One play-off
final.
Call centers of high-tech companies are swamped, and consumers are
fuming
Figure 5. Potentially sentence-length contexts for swamped
In contrast, an adverb like sulkily is less successful, because it often collocates
with reporting verbs, so we find such fragments as Ed sulkily, which, due to
verb-noun inversion, have lost their actual sulky utterance.
He grabbed the stapler, and sulkily asked me to make him a cup of tea.
Her husband, who is driving, frowns sulkily.
"I suppose so," the other sulkily replied, as he crawled out of the umbrella.
"Sorry," they mumbled sulkily.
Cilla: (sulkily) All right, fine
Ed sulkily.
Elinor responded sulkily as she smoothed the folds of her long cambric
overdress.
Figure 6. Potentially sentence-length contexts for sulkily
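The two sentence-start heuristics described in this section can be sketched as follows. This is an illustration only, under several assumptions: the character limit of 300 is arbitrary, a token counts as upper-case initial only if its very first character is a capital (so a quoted "Melissa" does not count), and the step back over a run of adjacent capitalised tokens is our guess at why the output in Figure 5 starts at David rather than Smith. All function names are hypothetical.

```python
import re

def start_after_full_stop(text, term_pos, max_chars=300):
    """Search left of the term for a full stop, giving up after max_chars."""
    floor = max(0, term_pos - max_chars)
    stop = text.rfind(".", floor, term_pos)
    return floor if stop == -1 else stop + 1

def start_at_capitalised_word(text, term_pos):
    """Search left of the term for the nearest upper-case initial word."""
    tokens = list(re.finditer(r"\S+", text[:term_pos]))
    i = len(tokens) - 1
    while i >= 0 and not tokens[i].group()[0].isupper():
        i -= 1
    if i < 0:
        return 0
    # Assumption: step back over a run of capitalised tokens (e.g. a full name).
    while i > 0 and tokens[i - 1].group()[0].isupper():
        i -= 1
    return tokens[i].start()

def context(text, term, start_finder):
    """Return the stretch from the heuristic start to the next full stop."""
    pos = text.find(term)
    end = text.find(".", pos)
    end = len(text) if end == -1 else end + 1
    return text[start_finder(text, pos):end].strip()

# Abridged from the example output shown earlier in this section.
text = ('gets 20 years in prison David Smith, who admitted creating the '
        '"Melissa" virus that swamped computer networks worldwide.')

print(context(text, "swamped", start_after_full_stop))
print(context(text, "swamped", start_at_capitalised_word))
```

With no full stop to the left of the search term, the first heuristic returns everything up to the character limit, while the second trims the context to the clause beginning at David Smith.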
An obvious strategy for improving the output is to download the text for post-
processing, at which point the potential of grammatical and other factors for
sentence identification may be exploited.
3.4.3 Collocational profiles
Collocational information is a standard measure in lexical studies. Ideally,
statistical measures of its strength and significance in relation to the search term
are applied. But this requires a knowledge of the total size of the body of data
from which it is extracted, and the totality of the Web is not measurable.
Statistical significance calculations also require a reasonably large amount of
data, but the corpus created from the relatively small number of Web pages
downloaded in a search is rather small. What can be produced is a frequency-
based profile, a simple frequency count of the collocates within a specified span
to the search node term. A collocational profile for the word minimum is provided
by WebCorp as in Figure 7, showing the frequencies of collocates above a
specified threshold, here one occurrence, and indicating their left-right position
within a +4/-4 span.
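A frequency-based profile of this kind needs nothing more than positional counting. The sketch below assumes simple tokenisation on letters, apostrophes and hyphens, matches the node word case-insensitively, and keeps the case of collocates as found. The function name and the three input lines are invented for illustration and are not WebCorp output.

```python
import re
from collections import Counter, defaultdict

def collocate_profile(lines, node, span=4, stopwords=frozenset()):
    """Count collocates of `node` by position (L4..L1, R1..R4)."""
    profile = defaultdict(Counter)
    for line in lines:
        tokens = re.findall(r"[A-Za-z'-]+", line)
        for i, token in enumerate(tokens):
            if token.lower() != node:
                continue
            for offset in range(-span, span + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(tokens):
                    continue
                word = tokens[j]
                if word.lower() in stopwords:
                    continue
                slot = f"L{-offset}" if offset < 0 else f"R{offset}"
                profile[word][slot] += 1
    return profile

# Invented toy lines, for illustration only.
toy_lines = [
    "the national minimum wage",
    "a new minimum wage rate",
    "set the minimum wage",
]

profile = collocate_profile(toy_lines, "minimum", stopwords={"the", "a"})
for word, slots in sorted(profile.items(), key=lambda kv: -sum(kv[1].values())):
    print(word, sum(slots.values()), dict(slots))
```

Left and right totals follow by summing the L and R slots separately. No significance score is computed, for the reasons given above.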
The collocational profile in Figure 7 in fact also serves as a guide for the
user to the role adopted by the word minimum as a noun modifier. A possible
enhancement would be to extract a fuller collocational profile by retrieving a very
large number of Web texts, so that a significance count might just be derivable.
This could be a larger dataset than the user actually specifies as the desired
number of contexts to be displayed. It would require more computing power,
however.
A collocational dilemma is raised in discontinuous phrasal search via
pattern matching and wildcard use. The operational definition of collocation in
WebCorp is the words that sit to either side of a word. But the wildcard search
assumes that there are also collocational sets in the asterisked spaces within the
variable phrases or lexical (lexico-grammatical) strings. This problem can be
solved by differentiating, in the collocational calculations, between external and
internal phrasal collocates.
Word Total L4 L3 L2 L1 R1 R2 R3 R4 Left-Total Right-Total
wage 36 1 34 1 1 35
national 15 15 15 0
rate 6 3 3 0 6
Please 5 2 1 2 2 3
set 5 1 4 0 5
UK 4 3 1 4 0
National 4 2 2 2 2
standards 4 1 3 1 3
requirements 4 1 2 1 0 4
level 4 4 0 4
guide 3 2 1 3 0
new 3 1 1 1 3 0
rates 3 1 2 1 2
section 2 1 1 2 0
maximum 2 2 0 2
regulations 2 1 1 1 1
Figure 7. Top collocates of minimum (excluding stopwords)

A further complication that arises in the course of producing collocational
information is that a word may occur inordinately often on a single Web page. An
example is the adverb sulkily, which occurs constantly throughout a Web-
retrievable novel, Alice in Wonderland. A solution is to allow the option of
retrieving only one concordance line per site. This is a manipulation of the facts
which may serve for some purposes but not for others. It is a reminder of the
limitations of the linguistic validity of treating the Web as a corpus.
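The one-line-per-site option amounts to de-duplicating hits by host name. A minimal sketch, assuming that a site is identified by the host part of the URL and that hits arrive as (URL, concordance line) pairs; the function name and the example.org URLs are hypothetical:

```python
from urllib.parse import urlparse

def one_line_per_site(hits):
    """Keep only the first concordance line retrieved from each site."""
    seen = set()
    kept = []
    for url, line in hits:
        site = urlparse(url).netloc.lower()
        if site not in seen:
            seen.add(site)
            kept.append((url, line))
    return kept

# Hypothetical URLs; the lines are taken from Figure 6.
hits = [
    ("http://novel.example.org/ch1.html", '"Sorry," they mumbled sulkily.'),
    ("http://novel.example.org/ch2.html", "Her husband, who is driving, frowns sulkily."),
    ("http://other.example.org/page.html", "Cilla: (sulkily) All right, fine"),
]

for url, line in one_line_per_site(hits):
    print(url, line)
```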
3.4.4 Pattern matching
Lexical items are often common combinations of two or more words, in more or
less fixed patterns. It is possible with WebCorp to search on the Web for these,
and also for discontinuous phrases, which can be effected through the use of a
wildcard character. So the * sank retrieves a series of phrases containing some
of the collocational set which sits between the words the and sank, which is:
the boat sank, the ship sank, the ferry sank, etc.
Multiple wildcard characters within the pattern the * * sank can expand the
search to discover some of the members of each of the two collocational sets that
sit between the words the and sank, which include: the unsinkable ship sank, the
Russian submarine sank, etc.
It is also possible to support a search for variable strings using wildcards.
These can match inflections and suffixes, such that run* will represent run,
running, runs, runner, runners, but also runt, rune, rung. However, wildcard use
in the matching of initial word elements (e.g. *ing) is not supported by search
engines, though there are obvious off-line post-editing remedies to apply.
Square brackets and pipe characters (as separators) are additional
measures for introducing grammatical or orthographic variation into the search,
as for instance the boat s[a|u]nk. Square brackets around lexical variants, e.g.
the [boat|ship] sank, allow a search for the alternatives specified.
Brackets can be used to allow more flexibility and/or specificity, so that
run can be explicitly expanded to r[un|an|unning|uns], which will retrieve
instances of run, runs, running, and ran.
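For off-line post-processing, the pattern notation described above can be approximated by translating it into a regular expression. The following Python sketch is an assumption about how such a translation might look, not a description of WebCorp's implementation: a free-standing * is taken to match exactly one word, a * attached to a string matches any continuation of that word, and square brackets with pipes list alternatives (an empty alternative makes the element optional). Alternatives containing spaces are not handled.

```python
import re

WORD_CHARS = r"[\w'-]"  # assumed definition of what counts as part of a word

def token_to_regex(token):
    """Translate one whitespace-delimited pattern token into regex source."""
    if token == "*":
        return WORD_CHARS + "+"          # a whole-word wildcard
    parts = []
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "[":
            close = token.index("]", i)  # raises ValueError if unbalanced
            alternatives = token[i + 1:close].split("|")
            parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = close + 1
        elif ch == "*":
            parts.append(WORD_CHARS + "*")  # variable ending, e.g. run*
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return "".join(parts)

def pattern_to_regex(pattern):
    """Compile a whole search pattern into a case-insensitive regex."""
    body = r"\s+".join(token_to_regex(t) for t in pattern.split())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

sank = pattern_to_regex("the * sank")
run_forms = pattern_to_regex("r[un|an|unning|uns]")
run_any = pattern_to_regex("run*")
```

As in the text, run* over-generates: applied to the string "run runs runt rune" it matches all four words, whereas the bracketed form accepts only the listed inflections.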
Wildcards allow the discovery of new/unconventional forms, of the kind
that supports the testing of a user's hypothesis that electronic communication
encourages greater inflectional variation, especially in youth-speak. For example,
the query formulated as follows: [he|she|I] text* [him|her|me], confirms this
and moreover reveals that text functions not only as a verb but also as an
uninflected past tense verb:
I sent him my picture and he text-ed me back that I look like his wife
I was almost speechless when she text'd me the last one below
Yesterday he texted me in a meeting with you want to go out?
The next time I text him, he didn't reply
I texted her and invited her to meet us
A combination of all these pattern-matching options can be used to represent
complex patterns. For instance, dr[i|o]ve[s|n|] * [a|]round the retrieves the
following phrases:
Start up drives me round the twist
Fury over lorry that drives residents round the bend
Her Majesty was driven twice round the Mews yard
Over used, that stupid drumbeat drove me round the bend
Sick Diana pic drives critics round the Benz
We quit - you've driven us round the bend
We'll drive you round the island or take you shopping
The noise drove her around the bend
The pattern can be further specified in the light of first run results, as in:
dr[i|o]ve[s|n|] * [a|]round the [bend|twist]
Alternatively, patterns can be less extensive, thus allowing variable phraseology
to be retrieved. For example, the pattern dr[i|o]ve[s|n|] [her|him|me],
specifying only the verb and pronoun common to a set of variable phrases,
retrieves not only the prepositional phrases found by dr[i|o]ve[s|n|] * [a|]round
the, as shown above, but also others, e.g. up the wall, as well as some adjective
complements: crazy, mad, insane and nuts.
It is not possible, using search engines, to retrieve lexico-grammatical
patterns, as this requires word-class tagging at the corpus input stage, which
clearly does not exist on the Web. However, as with sentence identification,
better results could be achieved by off-line post-processing. We are working on
this; it requires considerable processing power for fast online search.
3.5 Search refinement
3.5.1 Detailed search specification
A single search term is a blunt instrument in the retrieval of linguistic
information, particularly so from the Web, which is not set up to facilitate
delicate text search. To increase the relevance (or precision) of Web-based
output, the following kinds of basic specification are available and can be
exploited, as they are by our WebCorp system:
Text type and genre can be specified via the Open Directory or Yahoo
Some indication of document date (typically last update) can be identified,
where it is provided, using the WebCorp output option that displays URLs
Search may be limited to the whole or part of a particular URL, such as
bbc.co.uk, or .gov
Search may be limited to certain (and multiple) domains, using Boolean
terms as follows: .net OR .org; .ac.uk OR .edu
A word filter may be used, specifying that the search term, e.g. plant, must
occur in a text also containing, or excluding, a particular word or words,
such as +flower -nuclear
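Two of these refinements, the domain restriction and the word filter, can be sketched in Python as post-retrieval checks. The sketch assumes a filter notation in which + marks a word that must co-occur and - a word that must not, and it treats a domain specification as a host-name suffix; all function names are hypothetical and do not reflect the WebCorp interface.

```python
import re
from urllib.parse import urlparse

def parse_word_filter(spec):
    """Split a filter such as '+flower -nuclear' into required and excluded words."""
    required, excluded = set(), set()
    for item in spec.split():
        if item.startswith("+"):
            required.add(item[1:].lower())
        elif item.startswith("-"):
            excluded.add(item[1:].lower())
    return required, excluded

def passes_word_filter(text, required, excluded):
    """True if the text contains every required word and no excluded word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return required <= words and not (excluded & words)

def in_domains(url, domains):
    """True if the URL's host equals, or is a subdomain of, any listed domain."""
    host = urlparse(url).hostname or ""
    for domain in domains:
        bare = domain.lower().lstrip(".")
        if host == bare or host.endswith("." + bare):
            return True
    return False

required, excluded = parse_word_filter("+flower -nuclear")
```

Passing several domains to in_domains corresponds to the Boolean OR in specifications such as .ac.uk OR .edu.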
3.5.2 Internationalisation
Search can be refined through the specification of relevant language(s), to allow
the user to focus on languages other than English, the primary and most
ubiquitous Web language. The different aspects of this issue include the user's
specification of a particular language for search routines, the system's automatic
identification of particular languages, and its handling and representing of texts in
other languages. We have in the last year or so built some of this functionality
into WebCorp.
3.5.2.1 Refining search by specifying language
The user may wish to refine his/her search by specifying the language of the
context surrounding the chosen search term. One possibility is to specify a
particular country code. However, our findings are that there is no one-to-one
correlation between a country code and its associated language. The country code
can retrieve text in other languages than that associated with the country. A
search on the term gracejar, a Portuguese word meaning to joke, might be
expected to generate relevant output, but even with the specification of a country
code, in this case .pt, it does not, as shown in Figure 8.
que no era bonito gracejar com coisas to s
refere Chris Newell, para depois gracejar com um caso que conheceu
rbara Alexievna, chegando mesmo a gracejar com ela. Mas no
Dem-lhe uma oportunidade para gracejar e trazzz .. a est ele
disposio e gostava de gracejar custa de Pulquria Ivanovna
sei que ele est a gracejar, mas nem por isso menos
tentou em vo brincar, gracejar e conhecer as razes
desatou a rir e a gracejar: deixaram-no vontade e ele aproveitou
aram a rir e a gracejar sobre o caso curiosssimo
explicar, frisar, generalizar, gracejar, imaginar, incitar, informar, insinuar,
Figure 8. GRACEJAR with specified domain .pt (Portugal)
The final example above is actually in Spanish, and presented in a Spanish
context. This happens when the search term is ambiguous in the sense that it
exists in more than one language; e.g. it is an international brand name, global
term, or a term originating from EU legislation. The word gracejar also exists,
rarely, in Spanish.
The word swamped will retrieve English text even with the domain
specification .no, of Norway, as seen in Figure 9. The English language
dominates the Web and the fact that a site is based in a particular country, in this
case Norway, does not mean that the site will necessarily be written in that
country's national language.
letters asking for Syrian intervention swamped our ministries from Lebanese
your query earlier: it got swamped by > xmas xcesses. >
offering cheap Internet connections, completely swamped the systems and
low priority to avoid being swamped (spammed). This is something which
considerable period, this emphasis was swamped by a shift in the
is carried on, and is swamped in the competition with the
at large are already virtually swamped. The proletarian is without property
such. > I basically got swamped with K5 stuff, and became
an organization which was immediately swamped with congratulations from
their staffs were overloaded and swamped with cases involving companies
Figure 9. SWAMPED, domain .no
Perhaps the best one can say is that the country code refines the scope of
reference to one of interest to inhabitants of that country, and this tends to favour
texts in the native language. Ultimately, success in retrieving a particular
language via the country code comes primarily with search terms that are unique
to the language associated with it. The exclusively French word blaguer with
French domain setting retrieves only French language contexts, as in Figure 10.
lui mentir, mme pour blaguer, sans pouffer automatiquement de rire
dis a, cest pour blaguer 27 Si vous avez des
02 16h59 "Ctait pour blaguer, si on ne peut m
tchou tchou Est connu pour blaguer et parfois vomir Un tr
compagnie des enfants, elle aime blaguer avec eux. ge de
monde extrieur. Jaime bien blaguer et mes rponses sont
moral baisse, plus personne nose " blaguer " sur la guerre. Les permissions
est hyper important. On peut blaguer, se prendre des coups de 2000
On peut mme blaguer! 14/12/2000
Jai un besoin daide, pas question de blaguer: dans
Figure 10. BLAGUER with domain specification .fr (France)
Even so, if the search term is cited rather than used, it could occur anywhere, as
we see in Figure 11 below, where we submitted the search term blaguer to
Portuguese text domains and nevertheless managed to retrieve it in Portuguese
contexts.
inclusive o anjo. Era um "blaguer", um adorvel mentiroso. Tinha
inclusive o anjo. Era um " blaguer ", um adorvel mentiroso. Tinha
da Fonseca, <<grande blaguer >>. Para ele, at
Figure 11. BLAGUER with domain specified as .pt (Portugal)
3.5.2.2 Automatic language identification
The second problem of internationalisation is the automatic identification of
different languages. There are two obvious means by which the language of a text
could be identified: one is through the use of the HTTP language identification
protocol heading a Web document; another is through the application of language
feature analysis to the candidate Web text.
HTTP language identification protocol
Using the HTTP language identification protocol, an automated system could
theoretically identify the language of a Web page. However, this protocol is not
yet widely or consistently used, and we have found that fewer than 10% of the
pages listed by Google for any given search term return a language header when
accessed. For the English pages where a language header is returned, this is given
variously as en, en-gb, eng, English, etc.
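Because the header values vary in this way, a system relying on them would need to normalise them first. The Python sketch below shows one possible normalisation, under the assumption that the header in question is a Content-Language style value and that only the primary language matters; the alias table is illustrative, covers only the English variants listed above, and would need extending for other languages.

```python
# Illustrative alias table: only the English variants mentioned in the text.
LANGUAGE_ALIASES = {
    "en": "en",
    "eng": "en",
    "english": "en",
}

def normalise_language(header_value):
    """Reduce a language header value to a single lower-case language code.

    Returns None when no usable value is present.
    """
    if not header_value:
        return None
    first = header_value.split(",")[0].strip().lower()   # first language listed
    primary = first.replace("_", "-").split("-")[0]        # drop region, e.g. -gb
    if not primary:
        return None
    return LANGUAGE_ALIASES.get(primary, primary)

observed = ["en", "en-gb", "eng", "English"]
normalised = [normalise_language(v) for v in observed]
```

Values not found in the alias table are passed through unchanged, so an unrecognised label is preserved rather than silently mapped to a wrong language.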
Feature analysis
Secondly, one could identify a language through Feature Analysis of a
candidate Web text. Much work has been done on the automatic identification of
particular languages, not least by the Leeds team of Eric Atwell, Clive Souter,
and their postgraduate students (Souter et al. 1994). The two approaches that we
have so far isolated as promising are what we shall call Negative Feature
Analysis, and Positive Feature Analysis.
The principle of negative feature analysis is that a text is deemed not to be
in a particular language if it contains features not associated with that language.
The features could be a sequence of characters drawn from text of a given
language. This approach is exemplified by the work of a team of undergraduate
computer scientists at the University of Paris VII (Longuemaux et al. 2001). They
have built exemplar corpora in selected major languages, and they match a Web
email to each in turn, ranking the unlikelihood of the email being in each
language. The text is judged to be more likely to have been written in the
language of which it contains fewest untypical or impossible features. The
advantage of their system is that a one-page corpus furnishes sufficient features
for matching, and the language of the unknown text can be identified after very
few character combinations. The system can also rank the relative probabilities of
the languages in a Web text or page that contains more than one language. This
would differentiate between the main language used and subsidiary languages, say
those occurring in links to text headers in other languages.
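The general idea of negative feature analysis can be illustrated with a small Python sketch. It is not the Paris VII system, whose details are unpublished: here a "feature" is assumed to be a character bigram, a feature counts as untypical of a language if it never occurs in that language's exemplar text, and the two one-sentence exemplars are invented and far smaller than any usable corpus.

```python
def char_ngrams(text, n=2):
    """Return the character n-grams of a text, lower-cased, with spaces collapsed."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profiles(exemplars, n=2):
    """Map each language to the set of n-grams attested in its exemplar text."""
    return {lang: set(char_ngrams(sample, n)) for lang, sample in exemplars.items()}

def rank_by_violations(text, profiles, n=2):
    """Rank languages by how few unattested n-grams the text contains."""
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(1 for g in grams if g not in attested)
        for lang, attested in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Invented toy exemplars, one sentence per language.
exemplars = {
    "en": "the quick brown fox jumps over the lazy dog while the children "
          "are walking to school with their friends",
    "fr": "le renard brun saute par dessus le chien paresseux pendant que "
          "les enfants vont au parc avec leurs amis",
}
profiles = build_profiles(exemplars)
ranking = rank_by_violations("the children are walking", profiles)
```

The ranked list, rather than a single answer, is what would allow a page containing more than one language to be described in terms of a main language and subsidiary ones.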
The principle of positive feature analysis, as devised by Souter and team,
is to build a character-bigram (or trigram) model of text in each of the languages
that it is desirable to identify, then to compare new incoming text against each
letter-bigram/trigram model. This isolates the right language in a few characters,
because each language has specific patterns rarely found in other languages. It
can sometimes function even with a single word as its input data. We are still
finalising our method for the WebCorp tool, but language identification does not
seem to be problematic.
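A character-trigram comparison of this kind can be sketched in Python as follows. This is an illustrative reconstruction rather than the Leeds implementation or the method being finalised for WebCorp: it assumes add-one smoothing and a summed log-probability as the comparison score, and its two training sentences are invented and far too small for real use.

```python
import math
from collections import Counter

def trigrams(text):
    """Return character trigrams, lower-cased and padded with a space at each end."""
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(sample):
    """Build a trigram frequency model: (counts, total number of trigrams)."""
    counts = Counter(trigrams(sample))
    return counts, sum(counts.values())

def log_score(text, model, vocab_size):
    """Summed log-probability of the text's trigrams, with add-one smoothing."""
    counts, total = model
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size)) for g in trigrams(text)
    )

def identify(text, models):
    """Return the language whose model gives the text the highest score."""
    vocabulary = set()
    for counts, _ in models.values():
        vocabulary.update(counts)
    vocab_size = len(vocabulary) + 1  # one extra slot for unseen trigrams
    return max(models, key=lambda lang: log_score(text, models[lang], vocab_size))

# Invented toy training sentences, one per language.
models = {
    "en": train("the quick brown fox jumps over the lazy dog while the "
                "children are walking to school with their friends"),
    "fr": train("le renard brun saute par dessus le chien paresseux pendant "
                "que les enfants vont au parc avec leurs amis"),
}
guess_en = identify("the children", models)
guess_fr = identify("les enfants", models)
```

Unlike the negative approach, every trigram contributes to the score here, so frequent language-specific sequences count for more than merely permissible ones.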
3.5.2.3 Handling/representing texts in other languages
The third aspect of internationalisation basically involves the integration of
Unicode/double-byte characters. We have developed a separate search
mechanism, in collaboration with colleagues in Beijing and Shanghai, which will
be built into WebCorp.
4. Next steps
In the next phase, we will carry on this research within the framework of the
University of Liverpool ULGRID initiative. This is concerned with the design
and implementation of the next generation of the Internet, with reference to the
new types of software, middleware and hardware that are required to facilitate
the larger tasks and greater traffic anticipated for the future. Greater in-university
processing power and distributed processing initiatives will help to increase the
speed of WebCorp response. In terms of improving access to more linguistically
usable Web-based text, we will be making recommendations, to the Semantic
Web and other initiatives, to enrich and standardise Web text mark-up for
document language and linguistically vital information such as date of authorship.
A fledgling markup infrastructure exists, but its adoption and uniform use by
Web page creators is slow.
Acknowledgement
We gratefully acknowledge the EPSRC funding of the WebCorp project.
References
Bergman, M.K. (2001), The deep Web: surfacing hidden value:
http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf.
Kehoe, A. and A. Renouf (2002), WebCorp: applying the Web to linguistics and
linguistics to the Web, in: Proceedings of 11th International World Wide
Web Conference, Honolulu, Hawaii, 7-11 May 2002
(http://www.2002.org/CDROM/poster/67/).
Kübler, N. and P.-Y. Foucou (2000), A Web-based environment for teaching
technical English, in: L. Burnard and T. McEnery (eds), Rethinking
language pedagogy. Papers from the Third International Conference on
Language and Teaching. Frankfurt am Main: Peter Lang. 65-73.
Longuemaux, F., F. Morandeau, A. Riviere, R. Tadayoni-Rouchon, P. Vaz
Martinho (2001), Reconnaissance de la langue à partir de facteurs
interdits. Unpublished manuscript, Univ. Paris VII Denis Diderot.
Renouf, A. (2003), WebCorp: providing a renewable data source for corpus
linguists, in: S. Granger and S. Petch-Tyson (eds), Extending the scope of
corpus-based research: new applications, new challenges. Amsterdam and
Atlanta: Rodopi. 39-58.
Souter, C., G. Churcher, G. Hayes, J. Hughes and S. Johnson (1994), Natural
language identification using corpus-based models, in: K. Lauridsen and
O. Lauridsen (guest eds), HERMES Journal of Linguistics 13: Faculty of
Modern Languages, Aarhus School of Business. 183-203.
