
Early Corpus Linguistics

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.

Language acquisition
The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period. Large sample studies dominated the period from roughly 1927 to 1957: data were gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present. These are again based on collections of utterances, but this time with a smaller sample of children (approximately three) who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).

Spelling conventions
Kading (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.

Language pedagogy
Fries and Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the early half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.

Chomsky
Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance. Competence is best described as our tacit, internalised knowledge of a language.

Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form. Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For example, factors as diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence. Further, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to it we cannot be sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it. However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.

The non-finite nature of language


All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:

- The sentences of a natural language are finite.
- The sentences of a natural language can be collected and enumerated.

The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists...regarded the corpus as the sole explicandum of linguistics" (Leech, 1991). To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time." The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to sentences of unbounded length. (This topic is discussed in further detail in "Corpus Linguistics" Chapter 1, pages 7-8.)

The only way to account for a grammar of a language is by description of its rules - not by enumeration of its sentences. It is the syntactic rules of a language that Chomsky considers finite. These rules in turn give rise to infinite numbers of sentences.

The value of introspection


Even if language were a finite construct, would corpus methodology still be the best method of studying language? Why bother waiting for the sentences of a language to enumerate themselves, when by the process of introspection we can delve into our own minds and examine our own linguistic competence? At times intuition can save us time in searching a corpus. Without recourse to introspective judgements, how can ungrammatical utterances be distinguished from ones that simply haven't occurred yet? If our finite corpus does not contain the sentence

*He shines Tony books

how do we conclude that it is ungrammatical? Indeed, there may be persuasive evidence in the corpus to suggest that it is grammatical if we see sentences such as:

He gives Tony books
He lends Tony books
He owes Tony books

Introspection seems a useful tool for cases such as this, but early corpus linguistics denied its use. Also, ambiguous structures can only be identified and resolved with some degree of introspective judgement; an observation of physical form alone seems inadequate. Consider the sentences:

Tony and Fido sat down - he read a book of recipes.
Tony and Fido sat down - he ate a can of dog food.

It is only with introspection that this pair of ambiguous sentences can be resolved: we know that Fido is the name of a dog, and that it was therefore Fido who ate the dog food and Tony who read the book.

Other criticisms of corpus linguistics


Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1963) summed up the corpus-based approach as being composed of "pseudo-procedures". Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive.

Whatever the merits of Chomsky's criticisms, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time. The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.

Chomsky re-examined
Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence, with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us once our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned against the rejection of performance data as a source of evidence for language acquisition studies.

The revival of corpus linguistics


It is a common belief that corpus linguistics was abandoned entirely in the 1950s, and then adopted once more almost as suddenly in the early 1980s. This is simply untrue, and does a disservice to those linguists who continued to pioneer corpus-based work during this interregnum. For example, Quirk (1960) planned and executed the construction of his ambitious Survey of English Usage (SEU), which he began in 1961. In the same year, Francis and Kucera began work on the now famous Brown corpus, a work which was to take almost two decades to complete. These researchers were in a minority, but they were not universally regarded as peculiar, and others followed their lead. In 1975 Jan Svartvik started to build on the work of the SEU and the Brown corpus to construct the London-Lund corpus. During this period the computer slowly started to become the mainstay of corpus linguistics. Svartvik computerised the SEU, and as a consequence produced what some, including Leech (1991), still believe to be "to this day an unmatched resource for studying spoken English". The availability of the computerised corpus and the wider availability of institutional and private computing facilities do seem to have provided a spur to the revival of corpus linguistics. The table below (from Johansson, 1991) shows how corpus linguistics grew during the latter half of the twentieth century.

Date         Studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1985-1991    320

The machine readable corpus


The term corpus is almost synonymous with the term machine-readable corpus. Interest in the computer for the corpus linguist comes from the ability of the computer to carry out various processes which, when required of humans, ensured that they could only be described as pseudo-techniques. The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.

Processes
Considering the marriage of machine and corpus, it seems worthwhile to consider in slightly more detail what these processes that allow the machine to aid the linguist are. The computer has the ability to search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in a text, we can simply ask the machine to search for this word. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist. The machine can find the relevant text and display it to the user, and it can also calculate the number of occurrences of the word so that information on its frequency may be gathered.

We may then be interested in sorting the data in some way - for example, alphabetically on the words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark. The processes described above are often included in a concordance program. This is the tool most often used in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
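A minimal sketch of this kind of searching, counting and concordancing (a simple keyword-in-context display), assuming a crude regular-expression tokeniser and an invented scrap of text:

import re
from collections import Counter

def tokenise(text):
    # Very crude tokeniser: lower-cased alphabetic strings only (a simplification).
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, window=4):
    # Return every occurrence of `node` with `window` tokens of context on each side.
    hits = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left:>35} [{node}] {right}")
    return hits

text = ("However, the corpus was searched by hand. The results, however, "
        "were useful. However we look at it, the computer is faster.")
tokens = tokenise(text)

print("frequency of 'however':", Counter(tokens)["however"])   # 3
for line in concordance(tokens, "however"):
    print(line)

# Sorting the concordance alphabetically on the context to the right of the node:
for line in sorted(concordance(tokens, "however"), key=lambda l: l.split("] ")[-1]):
    print(line)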

Goals and conclusion


In this section we have

- seen the failure of early corpus linguistics
- examined Chomsky's criticisms
- seen the failings of introspective data
- seen how corpus linguistics was revived

In the remaining sections we will see

- how corpus linguists study syntactic features (Section 2)
- how corpus linguistics balances enumeration with introspection (Section 3)
- how corpora can be used in language studies (Section 4)

Definition of a corpus

The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways. In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

Text Encoding and Annotation

A corpus is said to be unannotated if it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated: it is no longer a body of text in which linguistic information is only implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through concrete annotation. For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb", but this is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus. Leech (1993) describes 7 maxims which should apply in the annotation of text corpora.
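As a minimal sketch of how such word_TAG annotation might be handled by a program: the VVZ code is the one quoted above, while the other tags and the helper function are illustrative assumptions rather than part of any particular tagset documentation.

def split_tagged(token):
    # Split a 'word_TAG' token into (word, tag).
    word, _, tag = token.rpartition("_")
    return word, tag

sentence = "She_PPHS1 gives_VVZ Tony_NP1 books_NN2"   # tags other than VVZ are illustrative
for token in sentence.split():
    word, tag = split_tagged(token)
    print(word, tag)

# Retrieve only the third person singular present tense lexical verbs (VVZ):
vvz = [split_tagged(t)[0] for t in sentence.split() if t.endswith("_VVZ")]
print(vvz)   # ['gives']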

Formats of Annotation
Currently, there are no widely agreed standards for representing information in texts, and in the past many different approaches have been adopted, some more lasting than others. One longstanding annotation practice is known as COCOA references. COCOA was an early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs, notably the OCP (Oxford Concordance Program). The Longman-Lancaster corpus and the Helsinki corpus have also used COCOA references. Very simply, a COCOA reference consists of a balanced set of angled brackets (< >) which contains two entities:

- A code which stands for a particular variable name.
- A string, or set of strings, which are the instantiations of that variable.

For example, the code "A" could be used to refer to the variable "author" and the string would stand for the author's name. Thus COCOA references which indicate the author of a passage of text would look like the following: <A CHARLES DICKENS> <A WOLFGANG VON GOETHE> <A HOMER> COCOA references only represent an informal trend for encoding specific types of textual information, e.g. authors, dates and titles. Current trends are moving more towards more formalised international standards of encoding. The flagship of this current trend is the Text Encoding Iniative (TEI), a project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing and the Association for Computers and the Humanites. Its aim is to provide standardised implementations for machine-readable text interchange. The TEI uses a form of document markup known as SGML (Standard Generalised Markup Language). SGML has the following advantages:

- Clarity
- Simplicity
- Formal rigour
- Already recognised as an international standard
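Returning to the COCOA convention illustrated earlier, here is a minimal sketch of how such references might be extracted from a text; the regular expression assumes only the simple <CODE VALUE> form shown above.

import re

# COCOA-style references of the form <A CHARLES DICKENS>, as shown above.
COCOA = re.compile(r"<([A-Z])\s+([^>]+)>")

text = "<A CHARLES DICKENS> It was the best of times, it was the worst of times..."

for code, value in COCOA.findall(text):
    print(code, "=", value)
# A = CHARLES DICKENS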

The TEI's contribution is a detailed set of guidelines as to how this standard is to be used in text encoding (Sperberg-McQueen and Burnard, 1994). In the TEI, each text (or document) consists of two parts - a header and the text itself. The header contains information such as the following:

- author, title and date
- the edition or publisher used in creating the machine-readable text
- information about the encoding practices adopted

You might also want to read about the EAGLES advisory body in chapter 2 of Corpus Linguistics (page 29).

Textual and extra-textual information


The most basic type of additional information is that which tells us what text or texts we are looking at. A computer file name may give us a clue to what the file contains, but in many cases filenames can only provide us with a tiny amount of information. Information about the nature of the text can often consist of much more than a title and an author. Taken together, such information fields make up a document header, which can be used by retrieval programs to search and sort on particular variables. For example, we might only be interested in looking at texts in a corpus that were written by women, so we could ask a computer program to retrieve texts where the author's gender variable is equal to "FEMALE".

Orthography
It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical scanning task, but even with a basic machine-readable text, issues of encoding are vital, although to English speakers their extent may not be apparent at first. In languages other than English, accents and non-Roman alphabets such as Greek, Russian and Japanese present a problem. IBM-compatible computers are capable of handling accented characters, but many other mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Various strategies have been adopted by native speakers of languages which contain accents when using computers or typewriters which lack these characters. For example, French speakers may omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" or place a double quote mark before the relevant letter, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems - in the case of the French, information is lost, while in the German extraneous information is added. In response to this, the TEI has suggested that these characters are encoded as TEI entities, using the delimiting characters & and ;. Thus, ü would be encoded by the TEI as

&uumlaut;
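A minimal sketch of this kind of entity substitution; the mapping covers only a couple of characters for illustration and simply reuses the entity form quoted above rather than the full TEI entity set.

# Illustrative mapping only - not the full set of TEI entities.
ENTITIES = {
    "ü": "&uumlaut;",   # the umlaut form quoted above
    "é": "&eacute;",    # an assumed additional example
}

def encode_entities(text, table=ENTITIES):
    # Replace accented characters with entity references for interchange.
    for char, entity in table.items():
        text = text.replace(char, entity)
    return text

print(encode_entities("Frühling"))   # Fr&uumlaut;hling
print(encode_entities("Hélène"))     # H&eacute;lène (è is left alone by this illustrative table)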

Read about the handling of non-Roman alphabets and the transcription of spoken data in Corpus Linguistics, chapter 2, pages 34-36.

Types of annotation
Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:

- Part of Speech annotation
- Lemmatisation
- Parsing
- Semantics
- Discoursal and text linguistic annotation
- Phonetic transcription
- Prosody
- Problem-oriented tagging

Multilingual Corpora

Not all corpora are monolingual, and an increasing amount of work is being carried out on the building of multilingual corpora, which contain texts of several different languages. First we must make a distinction between two types of multilingual corpora. The first can really be described as small collections of individual monolingual corpora, in the sense that the same procedures and categories are used for each language, but each contains completely different texts in those several languages. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora; it is not composed of translations of the same texts. The second type of multilingual corpus (and the one which receives the most attention) is the parallel corpus. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times, when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin, Greek etc. A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus, as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
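As a rough sketch, an aligned corpus fragment might be represented along the following lines, using the German-English pair quoted above; the alignment here is supplied by hand purely for illustration, whereas a real aligned corpus would store it explicitly or derive it with an alignment program.

# Sentence-level alignment: pairs of mutual translations.
sentence_alignment = [
    ("Das Buch ist auf dem Tisch", "The book is on the table"),
]

# Word-level alignment for the first pair: indices into each sentence.
# One-to-many cases (e.g. "raucht" aligning with "is smoking") are what
# make alignment non-trivial; this pair happens to align one-to-one.
word_alignment = [
    (0, 0),  # Das   -> The
    (1, 1),  # Buch  -> book
    (2, 2),  # ist   -> is
    (3, 3),  # auf   -> on
    (4, 4),  # dem   -> the
    (5, 5),  # Tisch -> table
]

de, en = sentence_alignment[0]
de_tokens, en_tokens = de.split(), en.split()
for i, j in word_alignment:
    print(de_tokens[i], "<->", en_tokens[j])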

At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.

An example of a bilingual corpus


This example is taken from a parallel French-English corpus, and is aligned at sentence level.
sub d = 22
----------& the location register should as a minimum contain the following information about a mobile station :
-----& l ' enregistreur de localisation doit contenir au moins les renseignements suivants sur une station mobile :

sub d = 386
----------& handover is the action of switching a call in progress from one cell to another ( or radio channels in the same cell ) .
-----& le transfert intercellulaire consiste à commuter une communication en cours d ' une cellule ( ou d ' une voie radioélectrique à l ' autre à l ' intérieur de la même cellule ) .

sub d = 380
----------& the location register , other than the home location register , used by an msc to retrieve information for , for instance , handling of calls to or from a roaming mobile station , currently located in its area .
-----& enregistreur de localisation , autre que l ' enregistreur de localisation nominal , utilisé par un ccm pour la recherche d ' informations en vue , par exemple , de l ' établissement de communication en provenance ou à destination d ' une station mobile en déplacement , temporairement située dans sa zone .

Introduction
In this session we'll be looking at the techniques used to carry out corpus analysis. We'll re-examine Chomsky's argument that corpus linguistics will result in skewed data, and see the procedures used to ensure that a representative sample is obtained. We'll also be looking at the relationship between quantitative and qualitative research. Although the majority of this session is concerned with statistical procedures, which can be said to be quantitative, it is important not to ignore the importance of qualitative analyses.

With the statistical part of this session two points should be made.

First, that this section is of necessity incomplete. Space precludes the coverage of all of the techniques which can be used on corpus data. Second, we do not aim here to provide a "step-by-step" guide to statistics. Many of the techniques used are very complex and to explain the mathematics in full would require a separate session for each one. Other books, notably Language and Computers and Statistics for Corpus Linguistics (Oakes, M. - forthcoming) present these methods in more detail than we can give here. Tony McEnery, Andrew Wilson, Paul Baker.

Qualitative vs Quantitative analysis


Corpus analysis can be broadly categorised as consisting of qualitative and quantitative analysis. In this section we'll look at both types and see the pros and cons associated with each. You should bear in mind that these two types of data analysis form different, but not necessarily incompatible, perspectives on corpus data.

Qualitative analysis: Richness and Precision.


The aim of qualitative analysis is a complete, detailed description. No attempt is made to assign frequencies to the linguistic features which are identified in the data, and rare phenomena receive (or should receive) the same amount of attention as more frequent phenomena. Qualitative analysis allows for fine distinctions to be drawn, because it is not necessary to shoehorn the data into a finite number of classifications. Ambiguities, which are inherent in human language, can be recognised in the analysis. For example, the word "red" could be used in a corpus to signify the colour red, or as a political categorisation (e.g. socialism or communism). In a qualitative analysis both senses of red in the phrase "the red flag" could be recognised. The main disadvantage of qualitative approaches to corpus analysis is that their findings cannot be extended to wider populations with the same degree of certainty that quantitative analyses can. This is because the findings of the research are not tested to discover whether they are statistically significant or due to chance.

Quantitative analysis: Statistically reliable and generalisable results.


In quantitative research we classify features, count them, and even construct more complex statistical models in an attempt to explain what is observed. Findings can be generalised to a larger population, and direct comparisons can be made between two corpora, so long as valid sampling and significance techniques have been used. Thus, quantitative analysis allows us to discover which phenomena are likely to be genuine reflections of the behaviour of a language or variety, and which are merely chance occurrences. The more basic task of just looking at a single language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality. However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type: an item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic terms and phenomena do not belong to simple, single categories: rather, they are more consistent with the recent notion of "fuzzy sets", as in the "red" example. Quantitative analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, it is essential that minimum frequencies are obtained - meaning that categories may have to be collapsed into one another, resulting in a loss of data richness.

A recent trend
From this brief discussion it can be appreciated that both qualitative and quantitative analyses have something to contribute to corpus study. There has been a recent move in social science towards multi-method approaches which tend to reject the narrow analytical paradigms in favour of the breadth of information which the use of more than one method may provide. In any case, as Schmied (1993) notes, a stage of qualitative research is often a precursor for quantitative analysis, since before linguistic phenomena can be classified and counted, the categories for classification must first be identified. Schmied demonstrates that corpus linguistics could benefit as much as any field from multi-method research.

Corpus Representativeness
As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, and argued that it would therefore be skewed and hence unrepresentative of the population as a whole. This is a valid criticism, and it applies not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as drastic as it first appears, as there are many safeguards which may be applied in sampling to ensure maximum representativeness. First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis was a long and painstaking task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it does enter significantly into the factors which must be considered in the production of a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true of those early corpora. However, today we have powerful computers which can store and manipulate many millions of words, so the issue of size is no longer the problem that it used to be.

Random sampling techniques are standard in many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats which the corpus builder must be aware of. Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study before we can define sampling procedures for it. This means that we must rigorously define our sampling frame - the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index - this was the approach taken by the builders of the Lancaster-Oslo/Bergen corpus, who used the British National Bibliography and Willing's Press Guide as their indices. Another approach could be to define the sampling frame as being all the books and periodicals in a particular library which refer to your particular area of interest - for example, all the German-language books in Lancaster University library that were published in 1993. This approach is one which was used in building the Brown corpus. Read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65.

Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This refers to defining the different genres, channels etc. of which it is made up. For example, written German could be made up of genres such as:

- newspaper reporting
- romantic fiction
- legal statutes
- scientific writing
- poetry
- and so on...

Stratificational sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification. Read about optimal lengths and numbers of samples, and the problems of using standard statistical equations to determine these figures, in Corpus Linguistics, chapter 3, page 66.

Frequency Counts
This is the most straight-forward approach to working with quantitative data. Items are classified according to a particular scheme and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme.

For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:

abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128
etc.

More often, however, the use of a classification scheme implies a deliberate act of categorisation on the part of the investigator. Even in the case of word frequency analysis, variant forms of the same lexeme may be lemmatised before a frequency count is made. For instance, in the example above, abandon, abandons and abandoned might all be classed as the lexeme ABANDON. Very often the classification scheme used will correspond to the type of linguistic annotation which will have already been introduced into the corpus at some earlier stage (see Session 2). An example of this might be an analysis of the incidence of different parts of speech in a corpus which had already been part-of-speech tagged.
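A minimal sketch of such a word frequency count, together with the lemmatisation step described above; the lemma table is hand-made for the ABANDON example and the text is invented.

import re
from collections import Counter

def tokenise(text):
    return re.findall(r"[a-z]+", text.lower())

# Hand-made lemma table for illustration (a real study would use a lemmatiser).
LEMMAS = {"abandon": "ABANDON", "abandons": "ABANDON", "abandoned": "ABANDON"}

text = "They abandoned the ship. He abandons hope. Never abandon the corpus."
tokens = tokenise(text)

word_frequencies = Counter(tokens)                                # raw form counts
lemma_frequencies = Counter(LEMMAS.get(t, t.upper()) for t in tokens)  # lemmatised counts

print(word_frequencies.most_common(5))
print(lemma_frequencies["ABANDON"])   # 3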

Working with Proportions


Frequency counts are useful, but they have certain disadvantages, particularly when one wishes to compare one data set with another - for example, a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type in terms of a proportion of the total number of tokens in the text. This is not a problem when the two corpora that are being compared are of the same size, but when they are of different sizes frequency counts are little more than useless. The following example compares two such corpora, looking at the frequency of the word boot:

Type of corpus     Number of words    Number of instances of boot
English Spoken     50,000             50
English Written    500,000            500

A brief look at the table seems to show that boot is more frequent in written rather than spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:

spoken English: 50/50,000 x 100 = 0.1%
written English: 500/500,000 x 100 = 0.1%

Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora. Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing fractions of unusual numbers like 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurrences of the type under investigation is:

ratio = number of occurrences of the type / number of tokens in the entire sample

This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy-looking small number (in the above example it would be 0.001), the ratio can then be multiplied by 100 and represented as a percentage.
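Applied, as a sketch, to the made-up boot figures above, the calculation looks like this:

def proportion(occurrences, corpus_size, as_percentage=True):
    # Ratio of a type's occurrences to the total number of tokens in the sample.
    ratio = occurrences / corpus_size
    return ratio * 100 if as_percentage else ratio

print(proportion(50, 50_000))     # 0.1 (% in the spoken corpus)
print(proportion(500, 500_000))   # 0.1 (% in the written corpus)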

Significance Testing
Significance tests allow us to determine whether or not a finding is the result of a genuine difference between two (or more) items, or whether it is just due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John, and we are looking at how third person singular speech is represented. Specifically, we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form of the verb is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

Text                            Matthew    John
No. of occurrences of dicit     46         118
No. of occurrences of dixit     107        119

From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more certain that this is not just due to coincidence, we need to perform a further calculation - the significance test. There are several types of significance test available to the corpus linguist: the chi-squared test, the (Student's) t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test, as it is the most commonly used significance test in corpus linguistics. This is a non-parametric test which is easy to calculate, even without a computer statistics package, and can be used with data in 2 x 2 tables, such as the example above. However, it should be noted that the chi-squared test is unreliable where very small numbers are involved and should not therefore be used in such cases. Also, proportional data (percentages etc.) cannot be used with the chi-squared test.

The test compares the difference between the actual frequencies (the observed frequencies in the data) with those which one would expect if no factor other than chance had been operating (the expected frequencies). The closer these two results are to each other, the greater the probability that the observed frequencies are influenced by chance alone. Having calculated the chi-squared value (we will omit this here and assume it has been done with a computer statistical package) we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom, which is simply:

(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)

In the example above this is equal to (2-1) x (2-1) = 1. We then look at the table of chi-square values in the row for the relevant number of degrees of freedom until we find the nearest chi-square value to the one which was calculated, and read off the probability value for that column. The closer to 0 the value, the more significant the difference is - i.e. the more unlikely that it is due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to assign a cut-off point which is taken to be the difference between a significant result and an "insignificant" result. This is usually taken to be 0.05 (probability values of less than 0.05 are written as "p < 0.05" and are assumed to be significant). In our example about the use of dicit and dixit above we calculate a chi-squared value of 14.843. The table below shows the significant p values for the first 3 degrees of freedom:

Degrees of Freedom    p = 0.05    p = 0.01    p = 0.001
1                     3.84        6.63        10.83
2                     5.99        9.21        13.82
3                     7.81        11.34       16.27

The number of degrees of freedom in our example is 1, and our result is higher than 10.83 (see the final column in the table) so the probability value for this chi-square value is 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001, and we can therefore say with a high degree of certainty that this difference is a true reflection of variation in the two texts and not due to chance.
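As a sketch, the same test can be run with a statistics package (here scipy, as one possible choice), using the dicit/dixit counts from the table above:

from scipy.stats import chi2_contingency

# Observed frequencies: rows = Matthew, John; columns = dicit, dixit.
observed = [[46, 107],
            [118, 119]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 3))   # ~14.843, the value quoted in the text
print(dof)              # 1 degree of freedom
print(p < 0.001)        # True: significant at p < 0.001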

Collocations
The idea of collocations is an important one to many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.

Given a text corpus it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered formulae are mutual information and the Z-score. Both tests provide similar data, comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that they co-occur simply as the result of chance. For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots), while the words formula and borrowed may simply occur together because of a one-off juxtaposition and have no special relationship. For each pair of words a score is given - the higher the score, the greater the degree of collocality. Mutual information and the Z-score are useful in the following ways:

- They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly specialist technical translation.
- We can group similar collocates of words together to help to identify different senses of a word. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment, indicating the financial sense of the word.
- We can discriminate the differences in usage between words which are similar. For example, Church et al (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.
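A minimal sketch of the mutual information score for a word pair, using the standard formula MI = log2( P(x,y) / (P(x) P(y)) ) with probabilities estimated as relative frequencies from corpus counts; the counts below are invented purely to show the arithmetic.

import math

def mutual_information(pair_count, x_count, y_count, corpus_size):
    # MI = log2( P(x,y) / (P(x) * P(y)) ), with probabilities estimated
    # as simple relative frequencies from the corpus.
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Invented counts for illustration, in a corpus of 1,000,000 tokens:
# "riding" occurs 120 times, "boots" 300 times, "riding boots" 40 times.
print(round(mutual_information(40, 120, 300, 1_000_000), 2))   # a high score

# A chance juxtaposition such as "formula borrowed" would give a score
# near (or below) zero.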

Multiple Variables
The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features); they cannot provide a picture of the complex interrelationship of similarity and difference between a large number of samples and large numbers of variables. To perform such comparisons we need to consider multivariate techniques. Those most commonly encountered in linguistic research are:

- factor analysis
- principal components analysis
- multidimensional scaling
- cluster analysis

The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set on the basis of statistical similarities between the original variables, whilst at the same time losing the minimal amount of information about their differences.

Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work. All the techniques begin with a basic cross-tabulation of the variables and samples.

For factor analysis, an intercorrelation matrix is then calculated from the cross-tabulation, which is used to attempt to "summarise" the similarities between the variables in terms of a smaller number of reference factors which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of variables (the factors) which can help explain better why the observed frequency differences occur. Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts one might find that words in a certain conceptual field (e.g. religion) received high loadings on one factor, whereas those in another field (e.g. government) loaded highly on another factor. Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.

Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values, e.g. the highest correlation value receives a rank order of 1, the next highest receives a rank order of 2 and so on. MDS then attempts to plot and arrange these variables so that the more closely related items are plotted closer together than the less closely related items.

Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items. A matrix is created in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.
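As a sketch of the last of these techniques, here is a hierarchical cluster analysis (using scipy, one possible package) over a tiny invented cross-tabulation of word frequencies in four texts; with real data the table would of course be far larger.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented cross-tabulation: rows = variables (words), columns = samples (texts).
words = ["god", "faith", "prayer", "minister", "parliament", "vote"]
frequencies = np.array([
    [30, 25,  2,  1],   # god
    [22, 18,  1,  0],   # faith
    [15, 20,  0,  2],   # prayer
    [ 1,  2, 25, 30],   # minister
    [ 0,  1, 18, 22],   # parliament
    [ 2,  0, 20, 25],   # vote
])

# Build the distance-based tree and group the variables into two clusters.
tree = linkage(frequencies, method="average", metric="correlation")
clusters = fcluster(tree, t=2, criterion="maxclust")

for word, cluster in zip(words, clusters):
    print(word, cluster)
# The religion words should fall into one cluster and the government
# words into the other, mirroring the factor-loading example above.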

Log-linear Models
Here we will consider a different technique which deals with the interrelationships of several variables. As linguists, we often want to go beyond the simple description of a phenomenon and explain what it is that causes the data to behave in a particular way. A log-linear analysis allows us to take a standard frequency cross-tabulation and find out which variables seem statistically most likely to be responsible for a particular effect. For example, let us imagine that we are interested in the factors which influence whether the word for is present or omitted from phrases of duration such as She studied [for] three years in Munich. We may hypothesise several factors which could have an effect on this, e.g. the text genre, the semantic category of the main verb and whether or not the verb is separated by an adverb from the phrase of duration. Any one of these factors might be solely responsible for the omission of for, or it might be the case that a combination of factors is culpable. Finally, all the factors working together could be responsible for the presence or omission of for. A log-linear analysis provides us with a number of models which take these points into account.

The way that we test the models in log-linear analysis is first to test the significance of associations in the most complex model - that is, the model which assumes that all of the variables are working together. Then we take away one variable at a time from the model and see whether significance is maintained in each case, until we reach the model with the lowest possible dimensions. So in the above example, we would start with a model that posited three variables (e.g. genre, verb class and adverb separation) and test the significance of a three-variable model. Then we would test each of the two-variable models (taking away one variable in each case) and finally each of the three one-variable models. The best model would be taken to be the one with the fewest variables which still retained statistical significance.
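A sketch of how such a model comparison might be set up in practice, here with statsmodels (one possible package) and invented cell counts; the log-linear model is fitted as a Poisson model of the counts, and dropping an association term tests whether that term is needed.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Invented cell counts: presence/omission of "for" by genre and adverb separation.
data = pd.DataFrame({
    "genre":    ["press", "press", "press", "press",
                 "fiction", "fiction", "fiction", "fiction"],
    "adverb":   ["yes", "yes", "no", "no", "yes", "yes", "no", "no"],
    "for_used": ["present", "omitted"] * 4,
    "count":    [40, 10, 35, 25, 20, 30, 15, 45],
})

# Fuller model: all two-way associations between the three variables.
full = smf.glm("count ~ (for_used + genre + adverb)**2", data=data,
               family=sm.families.Poisson()).fit()

# Reduced model: drop the genre/for_used association.
reduced = smf.glm("count ~ (for_used + genre + adverb)**2 - for_used:genre",
                  data=data, family=sm.families.Poisson()).fit()

# Likelihood-ratio test: does removing the term significantly worsen the fit?
lr = 2 * (full.llf - reduced.llf)
dof = full.df_model - reduced.df_model
print("likelihood ratio =", round(lr, 2), " p =", round(chi2.sf(lr, dof), 4))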

Introduction
In this session we will examine the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language. Empirical data also enable us to study language varieties such as dialects, or earlier periods in a language, for which it is not possible to carry out a rationalist approach. It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety. Corpus linguistics proper should be seen as a subset of the activity within an empirical approach to linguistics. Although corpus linguistics entails an empirical approach, empirical linguistics does not always entail the use of a corpus. In the following pages we'll consider the roles which corpora may play in a number of different fields of study related to language. We will focus on the conceptual issues of why corpus data are important to these areas, and how they can contribute to the advancement of knowledge in each, providing real examples of corpus use. In view of the huge amount of corpus-based linguistic research, the examples are necessarily selective - you can consult further reading for additional examples.

Corpora in Speech Research


A spoken corpus is important because of the following useful features:

- It provides a broad sample of speech, extending over a wide selection of variables such as:
  o speaker gender
  o speaker age
  o speaker class
  o genre (e.g. newsreading, poetry, legal proceedings etc.)
  This allows generalisations to be made about spoken language, as the corpus is as wide and as representative as possible. It also allows for variations within a given spoken language to be studied.
- It provides a sample of naturalistic speech rather than speech elicited under artificial conditions. The findings from the corpus are therefore more likely to reflect language as it is spoken in "real life", since the data are less likely to be subject to production monitoring by the speaker (such as trying to suppress a regional accent).
- Because the (transcribed) corpus has usually been enhanced with prosodic and other annotations, it is easier to carry out large-scale quantitative analyses than with fresh raw data. Where more than one type of annotation has been used, it is possible to study the interrelationships between, say, phonetic annotations and syntactic structure.

Prosodic annotation of spoken corpora


Because much phonetic corpus annotation has been at the level of prosody, this has been the focus of most of the phonetic and phonological research in spoken corpora. This work can be divided roughly into three types:

1. How do prosodic elements of speech relate to other linguistic levels?
2. How does what is actually perceived and transcribed relate to the acoustic reality of speech?
3. How does the typology of the text relate to the prosodic patterns in the corpus?

Corpora in Lexical Studies

Empirical data were used in lexicography long before the discipline of corpus linguistics was invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature, and in the nineteenth century the Oxford English Dictionary used citation slips to study and illustrate word usage. Corpora, however, have changed the way in which linguists can look at language.

A linguist who has access to a corpus, or other (non-representative) collection of machine-readable text, can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can therefore be produced and revised much more quickly than before, providing up-to-date information about language. Also, definitions can be more complete and precise, since a larger number of natural examples is examined.

Examples extracted from corpora can be easily organised into more meaningful groups for analysis, for example by sorting the right-hand context of the word alphabetically so that it is possible to see all instances of a particular collocate together. Furthermore, because corpus data contain a rich amount of textual information - regional variety, author, date, genre, part-of-speech tags etc. - it is easier to tie down usages of particular words or phrases as being typical of particular regional varieties, genres and so on.

The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building, as it enables lexicographers to keep on top of new words entering the language, or existing words changing their meanings, or the balance of their use according to genre etc. However, finite corpora also have an important role in lexical studies - in the area of quantification. It is possible to rapidly produce reliable frequency counts and to subdivide these counts across various dimensions according to the varieties of language in which a word is used.

Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words (see Session 3), mean that we can treat phrases and collocations more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.

Corpora and Grammar


Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research which have used corpora. Corpora make a useful tool for syntactic research because of:

- The potential for the representative quantification of a whole language variety.
- Their role as empirical data for the testing of hypotheses derived from grammatical theory.

Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk and de Haan (1994a) are aiming to analyse the frequency of the various English clause types. Since the 1950s the rational-theory-based/empiricist-descriptive division in linguistics (see Session One) has often meant that these two approaches have been viewed as separate and in competition with each other. However, there is a group of researchers who have used corpora in order to test essentially rationalist grammatical theory, rather than use them for pure description or the inductive generation of theory. At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. The grammar is then loaded into a computer parser and is run over a corpus to test how far it accounts for the data in the corpus. The grammar is then modified to take account of those analyses which it missed or got wrong.

Corpora and Semantics


The main contribution that corpus linguistics has made to semantics has been to help establish an approach to semantics which is objective and which takes account of indeterminacy and gradience. Mindt (1991) demonstrates how a corpus can be used to provide objective criteria for assigning meanings to linguistic terms. Mindt points out that frequently in semantics, meanings of terms are described by reference to the linguist's own intuitions - the rationalist approach that we mentioned in the section on Corpora and Grammar. Mindt argues that semantic distinctions are associated in texts with characteristic observable contexts - syntactic, morphological and prosodic - so that, by considering the environments of the linguistic entities, an empirical objective indicator for a particular semantic distinction can be arrived at. Another role of corpora in semantics has been in establishing more firmly the notions of fuzzy categories and gradience. In theoretical linguistics, categories are usually seen as being hard and fast - either an item belongs to a category or it does not. However, psychological work on categorisation suggests that cognitive categories are not usually "hard and fast" but instead have fuzzy boundaries, so it is not so much a question of whether an item belongs to one category or the other, but how often it falls into one category as opposed to the other. In looking empirically at natural language in corpora it is clear that this "fuzzy" model accounts better for the data: clear-cut boundaries do not exist; instead there are gradients of membership which are connected with frequency of inclusion.

Corpora in Pragmatics and Discourse Analysis


The amount of corpus-based research in pragmatics and discourse analysis has been relatively small up to now. This is partly because these fields rely on context (Myers 1991), and the small samples of texts used in corpora tend to mean that they are somewhat removed from their social and textual contexts. Sometimes relevant social information (gender, class, region) is encoded within the corpus, but it is still not always possible to infer context from corpus texts. Much of the work that has been carried out in this area has used the London-Lund corpus, which was until recently the only truly conversational corpus. The main contribution of such research has been to the understanding of how conversation works, with respect to lexical items and phrases which have conversational functions. Stenström (1984) correlated discourse items such as well, sort of and you know with pauses in speech and showed that such correlations related to whether or not the speaker expects a response from the addressee. Another study by Stenström (1987) examined "carry-on signals" such as right, right-o and all right. These signals were classified according to the typology of their various functions, e.g.:

right was used in all functions, but especially as a response, to evaluate a previous response or terminate an exchange. All right was used to mark a boundary between two stages in discourse.

that's right was used as an emphasiser. it's alright and that's alright were responses to apologies.
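The kind of counting behind such findings can be sketched very roughly as follows. The pause symbol "(.)", the marker list and the three-line transcript are assumptions for illustration; real transcription schemes, including that of the London-Lund corpus, are considerably richer.

```python
# Count how often each discourse item is immediately followed by a pause
# symbol in a (toy) transcribed conversation.
from collections import Counter

transcript = """well (.) I think you know it was fine
sort of (.) difficult to say really
you know I just left well before it ended"""

MARKERS = ["well", "sort of", "you know"]
followed_by_pause = Counter()
total = Counter()

for line in transcript.splitlines():
    for marker in MARKERS:
        idx = line.find(marker)
        while idx != -1:
            total[marker] += 1
            after = line[idx + len(marker):].lstrip()
            if after.startswith("(.)"):
                followed_by_pause[marker] += 1
            idx = line.find(marker, idx + 1)

for marker in MARKERS:
    if total[marker]:
        print(f"{marker}: {followed_by_pause[marker]} of {total[marker]} followed by a pause")
```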

The availability of new conversational corpora, such as the spoken part of the BNC (British National Corpus), should provide a greater incentive both to extend and to replicate such studies, since both the amount of conversational data available and the social and geographical range of the speakers recorded will have increased. At present, issues in pragmatics have been poorly served by quantitative corpus-based analysis; it is to be hoped that this is one area which linguists will exploit in the near future.

Corpora and Sociolinguistics


Although sociolinguistics is an empirical field of research, it has hitherto relied primarily upon the collection of research-specific data which is often not intended for quantitative study and is thus not often rigorously sampled. Sometimes the data are also elicited rather than naturalistic. A corpus can provide what these kinds of data cannot - a representative sample of naturalistic data which can be quantified. Although corpora have not as yet been used to a great extent in sociolinguistics, there is evidence that this is a growing field. The majority of studies in this area have concerned themselves with lexical studies in the area of language and gender. Kjellmer (1986), for example, used the Brown and LOB corpora to examine the masculine bias in American and British English. He looked at the occurrence of masculine and feminine pronouns, and at the occurrence of the items man/men and woman/women. As one would expect, the frequencies of the female items were much lower than those of the male items in both corpora. Interestingly, however, the female items were more common in British English than in American English. Another of Kjellmer's hypotheses was not supported by the corpora - that women would be less "active", that is, that they would more frequently be the objects rather than the subjects of verbs. In fact men and women had similar subject/object ratios. Holmes (1994) makes two important points about the methodology of these kinds of study, which are worth bearing in mind. First, when classifying and counting occurrences, the context of the lexical item should be considered. For instance, whilst there is a non-gender-marked alternative for policeman/policewoman, namely police officer, there is no such alternative for the -ess form in Duchess of York. The latter form should therefore be excluded from counts of "sexist" suffixes when looking at gender bias in writing. Second, Holmes points out the difficulty of classifying a form when it is actively undergoing semantic change. She argues that the word man can refer to a single male (as in the phrase A 35 year old man was killed) or can have a generic meaning which refers to mankind (as in Man has engaged in warfare for centuries). In phrases such as we need the right man for the job it is difficult to decide whether man is gender specific or could be replaced by person. These simple points should encourage a more critical approach to data classification in further sociolinguistic work using corpora, both within and beyond the area of gender studies.
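A minimal sketch of the kind of counting Kjellmer describes might look as follows, here using the Brown corpus as distributed with NLTK (after running nltk.download('brown')). The word lists are deliberately simplified, and, as Holmes' points remind us, a real study would also inspect each occurrence in context.

```python
# Count masculine and feminine pronouns and the items man/men, woman/women
# in the Brown corpus and report the overall male/female ratio.
from collections import Counter
from nltk.corpus import brown

male_items = {"he", "him", "his", "man", "men"}
female_items = {"she", "her", "hers", "woman", "women"}

counts = Counter()
for word in brown.words():
    w = word.lower()
    if w in male_items:
        counts["male"] += 1
    elif w in female_items:
        counts["female"] += 1

print(counts)
print("male/female ratio: %.2f" % (counts["male"] / counts["female"]))
```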

Corpora and Stylistics

Stylistics researchers are usually interested in individual texts or authors rather than the more general varieties of a language, and tend not to be large-scale users of corpora. Nevertheless, some stylisticians are interested in investigating broader issues such as genre, and others have found corpora to be important sources of data in their research. In order to define an author's particular style, we must, in part, examine the degree to which the author leans towards different ways of putting things (technical vs non-technical vocabulary, long sentences vs short sentences, and so on). This task requires comparisons to be made not only internally within the author's own work, but also with other authors or with the norms of the language or variety as a whole. As Leech and Short (1981) point out, stylistics often demands the use of quantification to back up judgements which may otherwise appear subjective rather than objective. This is where corpora can play a useful role. Another type of stylistic variation is the more general variation between genres and channels: for example, one of the most common uses of corpora has been in looking at the differences between spoken and written language. Altenberg (1984) examined differences in the ordering of cause-result constructions, while Tottie (1991) looked at differences in negation strategies. Other work has looked at variation between genres, using subsamples of corpora as a database. For example, Wilson (1992) used sections from the LOB and Kolhapur corpora, the Augustan Prose Sample and a sample of modern English conversation to examine the usage of since, and found that causal since had evolved from being the main causal connective in late seventeenth-century writing to being characteristic of formal learned writing in the twentieth century.
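A small sketch of the sort of quantification stylistics can draw on - comparing average sentence length and type/token ratio for two texts - is given below. The two sample texts are placeholders; in practice one would compare an author's work with other authors or with a reference corpus.

```python
# Compute two simple stylistic indicators for a text: average sentence length
# (in tokens) and type/token ratio (vocabulary richness).
import re

def style_profile(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_length": len(tokens) / len(sentences),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

text_a = "The cat sat. The cat slept. The cat sat again."
text_b = ("Notwithstanding the inclement conditions, the expedition "
          "proceeded towards the distant, snow-covered ridge.")

print("A:", style_profile(text_a))
print("B:", style_profile(text_b))
```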

Corpora in the Teaching of Languages and Linguistics


Resources and practices in the teaching of languages and linguistics tend to reflect the division between the empirical and rationalist approaches. Many textbooks contain only invented examples and their descriptions are based upon intuition or second-hand accounts. Other books, however, are explicitly empirical and use examples and descriptions from corpora or other sources of real-life language data. Corpus examples are important in language learning as they expose students to the kinds of sentences that they will encounter when using the language in real-life situations. Students who are taught with traditional syntax textbooks which contain sentences such as Steve puts his money in the bank are often unable to analyse more complex sentences such as The government has welcomed a report by an Australian royal commission on the effects of Britain's atomic bomb testing programme in the Australian desert in the fifties and early sixties (from the Spoken English Corpus). Apart from being a source of empirical teaching data, corpora can be used to look critically at existing language teaching materials. Kennedy (1987a, 1987b) has looked at ways of expressing quantification and frequency in ESL (English as a second language) textbooks. Holmes (1988) has examined ways of expressing doubt and certainty in ESL textbooks, while Mindt (1992) has looked at future time expressions in German textbooks of English.
These studies have similar methodologies - they analyse the relevant constructions or vocabulary both in the sample textbooks and in standard English corpora, and then compare the findings between the two sets. Most studies found that there were considerable differences between what textbooks teach and how native speakers actually use language, as evidenced in the corpora. Some textbooks gloss over important aspects of usage, or foreground less frequent stylistic choices at the expense of more common ones. The general conclusion from these studies is that non-empirically based teaching materials can be misleading and that corpus studies should be used to inform the production of materials, so that the more common choices of usage are given more attention than those which are less common. Read about language teaching for "special purposes" in Corpus Linguistics, Chapter 4, pages 104-105. Corpora have also been used in the teaching of linguistics. Kirk (1994) requires his students to base their projects on corpus data which they must analyse in the light of a model such as Brown and Levinson's politeness theory or Grice's co-operative principle. In taking this approach, Kirk is using corpora not only as a way of teaching students about variation in English but also to introduce them to the main features of a corpus-based approach to linguistic analysis. A further application of corpora in this field is their role in computer-assisted language learning. Recent work at Lancaster University has looked at the role of corpus-based computer software for teaching undergraduates the rudiments of grammatical analysis (McEnery and Wilson 1993). This software - Cytor - reads in an annotated corpus (either part-of-speech tagged or parsed) one sentence at a time, hides the annotation and asks the student to annotate the sentence him- or herself. Students can call up help in the form of a list of tag mnemonics, a frequency lexicon or concordances of examples. McEnery, Baker and Wilson (1995) carried out an experiment over the course of a term to determine how effective Cytor was at teaching part-of-speech analysis, by comparing two groups of students - one taught with Cytor, and another taught via traditional lecturer-based methods. In general the computer-taught students performed better than the human-taught students throughout the term.
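A very small sketch in the spirit of Cytor is given below: it takes a sentence from a part-of-speech tagged corpus, hides the tags and asks the user to supply them. It uses the tagged Brown corpus from NLTK as a stand-in; the real Cytor software differed in its corpora, tagset handling and help facilities.

```python
# Show a tagged sentence with its annotation hidden, prompt the student for
# each tag, and report a score at the end.
from nltk.corpus import brown

sentence = brown.tagged_sents()[0]          # list of (word, tag) pairs
words = [w for w, t in sentence]
print("Tag each word:", " ".join(words))

correct = 0
for word, tag in sentence:
    guess = input(f"{word}: ").strip().upper()
    if guess == tag:
        correct += 1
    else:
        print(f"  correct tag was {tag}")

print(f"Score: {correct}/{len(sentence)}")
```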

Corpora and Historical Linguistics


Historical linguistics can be seen as a species of corpus linguistics, since the texts of a historical period or a "dead" language form a closed corpus of data which can only be extended by the (re)discovery of previously unknown manuscripts or books. In some cases it is possible to use (almost) all of the closed corpus of a language for research - something which can be done for ancient Greek, for example, using the Thesaurus Linguae Graecae corpus, which contains most extant ancient Greek literature. However, in practice historical linguistics has not tended to follow a strict corpus linguistic paradigm, instead taking a selective approach to empirical data, looking for evidence of particular phenomena and making rough estimates of frequency. No real attempts were made to produce samples that were representative.

In recent years, however, some historical linguists have changed their approach, resulting in an upsurge in strictly corpus-based historical linguistics and the building of corpora for this purpose. The most widely known English historical corpus is the Helsinki corpus, which contains approximately 1.6 million words of English dating from the earliest Old English period (before AD 850) to the end of the Early Modern English period (1710). It is divided into three main periods - Old English, Middle English and Early Modern English - and each period is subdivided into a number of 100-year subperiods (or 70-year subperiods in some cases). The Helsinki corpus is representative in that it covers a range of genres, regional varieties and sociolinguistic variables such as gender, age, education and social class. The Helsinki team have also produced "satellite" corpora of early Scots and early American English. Other examples of English historical corpora in development are the Zürich Corpus of English Newspapers (ZEN), the Lampeter Corpus of Early Modern English Tracts (a sample of English pamphlets from between 1640 and 1740) and the ARCHER corpus (a corpus of British and American English from 1650 to 1990). The work which is carried out on historical corpora is qualitatively similar to that which is carried out on modern language corpora, although it is also possible to carry out work on the evolution of language through time. For example, Peitsara (1993) used four subperiods from the Helsinki corpus and calculated the frequencies of the different prepositions introducing agent phrases. She found that the most common prepositions of this type were of and by, which were of almost equal frequency at the beginning of the period; by the fifteenth century, however, by was three times more common than of, and by 1640 it was eight times as common. Studies like this have particular importance in the context of Halliday's (1991) conception of language evolution as a motivated change in the probabilities of the grammar. However, it is important to be aware of the limitations of corpus linguistics, as Rissanen (1989) has pointed out. Rissanen identifies three main problems associated with using historical corpora: (1) the "philologist's dilemma" - the danger that the use of a corpus and a computer may supplant the in-depth knowledge of language history which is to be gained from the study of original texts in their context; (2) the "God's truth fallacy" - the danger that a corpus may be used to provide representative conclusions about the entire language period, without an understanding of its limitations in terms of which genres it does and does not cover; and (3) the "mystery of vanishing reliability" - the more variables which are used in sampling and coding the corpus (periods, genres, age, gender etc.), the harder it is to represent each one fully and achieve statistical reliability. The most effective way of solving this last problem is, of course, to build larger corpora. Rissanen's reservations are valid and important, but they should not diminish the value of corpus-based linguistics; rather they should serve as warnings of possible pitfalls which scholars need to take on board, since with appropriate care they are surmountable.
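The kind of diachronic frequency tracking Peitsara carried out can be sketched schematically as follows. The subperiod texts here are tiny invented placeholders, and counting every occurrence of of and by is only a stand-in: her study counted these prepositions only where they introduced agent phrases.

```python
# Track the relative frequency of two forms across subperiod subcorpora and
# report the by/of ratio for each subperiod.
import re
from collections import Counter

subcorpora = {
    "1350-1420": "the letter was sent by the king and of the council",
    "1420-1500": "the town was taken by the duke by force of arms",
    "1500-1570": "the book was written by a scholar praised by many",
}

for period, text in sorted(subcorpora.items()):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in ("of", "by"))
    ratio = counts["by"] / counts["of"] if counts["of"] else float("inf")
    print(f"{period}: of={counts['of']} by={counts['by']} by/of ratio={ratio:.1f}")
```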

Corpora in Dialectology and Variation Studies


In this section we are concerned with geographical variation - corpora have long been recognised as a valuable source of comparison between language varieties, as well as for the description of those varieties themselves. Certain corpora have tried to follow as far as possible the same sampling procedures as other corpora in order to maximise the degree of comparability. For example, the LOB corpus contains roughly the same genres and sample sizes as the Brown corpus and is sampled from the same year (i.e. 1961). The Kolhapur Indian corpus is also broadly parallel to Brown and LOB, although the sampling year is 1978. One of the earliest pieces of work using the LOB and Brown corpora in tandem was the production of a word frequency comparison of American and British written English. These corpora have also been used as the basis for studying more complex aspects of language, such as the use of the subjunctive (Johansson and Norheim 1988). One role for corpora in national variation studies has been as a testbed for two theories of language variation: Quirk et al.'s (1985) "common core" hypothesis, and Braj Kachru's conception of national varieties as forming many unique "Englishes" which differ in important ways from one another. Most work on lexis and grammar comparing the Kolhapur Indian corpus with Brown and LOB has supported the common core hypothesis (Leitner 1991). However, there is still scope for the extension of such work. Few dialect corpora exist at present - two examples are the Helsinki corpus of English dialects and Kirk's Northern Ireland Transcribed Corpus of Speech (NITCS). Both corpora consist of conversations with a fieldworker - in Kirk's corpus from Northern Ireland, and in the Helsinki corpus from several English regions. Dialectology is an empirical field of linguistics, although it has tended to concentrate on elicitation experiments and less controlled sampling rather than on corpora. Such elicitation experiments tend to focus on vocabulary and pronunciation, neglecting other aspects of linguistics such as syntax. Dialect corpora allow these other aspects to be studied, and because the corpora are sampled so as to be representative, quantitative as well as qualitative conclusions can be drawn about the target population as a whole.
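A word-frequency comparison between two varieties can be sketched as below. Since the LOB corpus is not freely bundled with NLTK, two categories of the Brown corpus are used here as stand-ins for two comparable corpora; a real study would compare corpora with parallel sampling designs, such as Brown and LOB.

```python
# Compare the frequency per million words of selected items in two subcorpora.
from collections import Counter
from nltk.corpus import brown

def per_million(words):
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

# Stand-ins for two comparable corpora (e.g. Brown vs LOB in a real study).
variety_a = per_million(brown.words(categories='news'))
variety_b = per_million(brown.words(categories='romance'))

for word in ["shall", "will", "must", "maybe"]:
    print(f"{word:6s}  A: {variety_a.get(word, 0):7.1f} per million"
          f"   B: {variety_b.get(word, 0):7.1f} per million")
```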

Corpora and Psycholinguistics


Although psycholinguistics is inherently a laboratory subject, measuring mental processes such as the length of time it takes to position a syntactic boundary in reading or how eye movements change, corpora can still have a part to play in this field. One important use is as a source of data from which materials for laboratory experiments can be developed. Schreuder and Kerkman (1987) point out that frequency is an important consideration in a number of cognitive processes, including word recognition. The psycholinguist should not go blindly into experiments in areas such as this with only a vague notion of frequency to guide the selection and analysis of
materials. Sampled corpora can provide psycholinguists with more concrete and reliable information about frequency, including the frequencies of different senses and parts of speech of ambiguous words (if the corpora are annotated). A more direct example of the role of corpora in psycholinguistics can be seen in Garnham et al.'s (1981) study, which used the London-Lund corpus to examine the occurrence of speech errors in natural conversational English. Before the study was carried out, nobody knew how frequent speech errors were in everyday language, because such an analysis required adequate amounts of natural conversation, and previous work on speech errors had been based on the gradual ad hoc accumulation of data from many different sources. The spoken corpus, however, was able to provide exactly the kind of data that was required. Garnham's study was able to classify and count the frequencies of different error types and hence provide some estimate of the general frequency of these in relation to speakers' overall output. A third role for corpora lies in the analysis of language pathologies, where an accurate picture of abnormal data must be constructed before it is possible to hypothesise and test what may be wrong with the human language processing system. Although little work has been done with sampled corpora to date, it is important to stress their potential for these analyses. Studies of the language of linguistically impaired people, and of the language of children who are developing their (normal) linguistic skills, have hitherto lacked quantified, representative descriptions. In the last decade, however, there has been a move towards the empirical analysis of machine-readable data in these areas. For example, the Polytechnic of Wales (POW) corpus is a corpus of children's language; a corpus of impaired and normal language development has been collected at Reading University; and the CHILDES database contains a large amount of impaired and normal child language in several languages.
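As a sketch of the first of these uses - drawing on corpus frequencies when selecting experimental materials - the fragment below picks candidate stimuli within a target frequency band from the Brown corpus (via NLTK). The frequency band and word-length filter are arbitrary illustrative choices.

```python
# Select candidate stimulus words whose corpus frequency falls within a given
# band, so that stimulus lists can be matched for frequency.
from collections import Counter
from nltk.corpus import brown

freq = Counter(w.lower() for w in brown.words() if w.isalpha())

LOW, HIGH = 50, 100          # occurrences in the corpus; an arbitrary band
candidates = sorted(w for w, c in freq.items() if LOW <= c <= HIGH and len(w) == 5)

print(len(candidates), "five-letter words in the band, e.g.:", candidates[:10])
```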

Corpora and Cultural Studies


It is only recently that the role of a corpus in telling us about culture has really begun to be explored. After the completion of the LOB corpus of British English, one of the earliest pieces of work to be carried out was a comparison of its vocabulary with the vocabulary of the American Brown corpus (Hofland and Johansson 1982). This revealed interesting differences which went beyond the purely linguistic ones such as spelling (colour/color) or morphology (got/gotten). Leech and Fallon (1992) used the results of these earlier studies, along with KWIC concordances of the two corpora, to check the senses in which words were being used. They then grouped the differences which were statistically significant into fifteen broad categories. The frequencies of concepts in these categories revealed differences between the two countries which were primarily cultural rather than linguistic. For example, travel words were more frequent in American English than in British English, perhaps suggestive of the larger size of the United States. Words in the domains of crime and the military were also more common in the American data, as was "violent crime" within the crime category, perhaps suggestive of the American "gun culture". In general, the findings seemed to suggest a picture of American culture at the time of the two corpora (1961) that was more macho and dynamic than British culture. Although this work is in its infancy and requires methodological refinement, it seems to be an interesting and promising
area of study, which could also integrate more closely work in language learning with that in national cultural studies.
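The statistical step behind such comparisons can be sketched as follows: for a given word, a chi-squared test on a 2x2 contingency table indicates whether its frequency differs significantly between two corpora. The counts below are invented for illustration; Leech and Fallon's actual figures and test procedure may have differed.

```python
# Test whether a word's frequency differs significantly between two corpora
# using a chi-squared test on word-vs-rest counts.
from scipy.stats import chi2_contingency

word_in_a, total_a = 250, 1_000_000      # occurrences of the word / corpus size
word_in_b, total_b = 140, 1_000_000

table = [
    [word_in_a, total_a - word_in_a],
    [word_in_b, total_b - word_in_b],
]
chi2, p, dof, expected = chi2_contingency(table)
print("chi-squared = %.2f, p = %.4f" % (chi2, p))
```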

Corpora and Social Psychology


Although linguists are the main users of corpora, they are not the sole users. Researchers in other fields which make use of language data have also recently taken an interest in the exploitation of corpus data - perhaps the most important of these have been social psychologists. Social psychologists require access to naturalistic data which cannot be reproduced in laboratory conditions (unlike many other psychology-related fields), while at the same time they are under pressure to quantify and test their theories rather than rely on qualitative data. This places them in a curious position. One area of research in social psychology is that of how and why people attempt to explain things. Explanations (or attributions) are important to the psychologist because they reveal the ways in which people regard their environment. To obtain data for studying explanations, researchers have relied on naturally occurring texts such as newspapers, diaries, company reports and so on. However, these are written texts, and most everyday human interaction takes place through the medium of speech. To solve this problem Antaki and Naji (1987) used the London-Lund corpus (of spoken language) as a source of data for explanations in everyday conversation. They took 200,000 words of conversation and retrieved all instances of the commonest causal conjunction because (and its variant cos). An analysis of a pilot sample yielded a classification scheme for the data, which was then used to classify all the explanations according to what was being explained, for example "actions of speaker or speaker's group", "general states of affairs" and so on. A frequency analysis of the explanation types in the corpus showed that explanations of general states of affairs were the most common type of explanation (33.8%), followed by actions of speaker and speaker's group (28.8%) and actions of others (17.7%). This refuted previous theories that the prototypical type of explanation is the explanation of a person's single action. Work such as Antaki and Naji's shows clearly the potential of corpora to test and modify theory in subjects which require naturalistic quantifiable language data, and one may expect other social psychologists to make use of corpora in the future.
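The retrieval step in a study like Antaki and Naji's can be sketched as below: every utterance containing because or cos is pulled out of a transcript so that the explanations can then be classified by hand. The four-line transcript is invented for illustration.

```python
# Retrieve all utterances containing a causal marker from a (toy) transcript.
import re

transcript = [
    "I left early because I was tired",
    "we stayed in cos it was raining",
    "she laughed and then went home",
    "prices went up because of the strike",
]

pattern = re.compile(r"\b(because|cos)\b")
hits = [line for line in transcript if pattern.search(line)]

for line in hits:
    print(line)
print(f"{len(hits)} of {len(transcript)} utterances contain a causal marker")
```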

Conclusion
In this session we have seen how a number of areas of language study have benefited from exploiting corpus data. To summarise, the main advantages of corpora are:

Sampling and quantification. Because a corpus is sampled so as to be maximally representative of the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification, because it can tell us about an entire variety of language, not just the sample being analysed.

Ease of access. As the data collection has been dealt with by someone else, the researcher does not have to go through the work of sampling, collection and encoding. The majority of corpora are readily available, either free or at low cost. Once the corpora have been obtained, it is usually easy to access the data within them, e.g. by using a concordance program.

Enriched data. Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.

Naturalistic data. Corpus data are not always completely unmonitored: the people producing the spoken or written texts are sometimes aware that their language is being collected for a corpus. But for the most part, the data are largely naturalistic, unmonitored and the product of real social contexts. Thus the corpus provides one of the most reliable sources of naturally occurring data that can be examined.
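As an illustration of the concordance programs mentioned above, a minimal key-word-in-context (KWIC) concordancer can be written in a few lines of plain Python; real concordance programs add sorting, regular-expression search and far larger contexts.

```python
# Print every occurrence of a keyword with a fixed window of context on
# either side, in the familiar KWIC layout.
def kwic(tokens, keyword, width=4):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>35} [{tok}] {right}")
    return lines

text = ("the bank raised its rates and the river bank was flooded "
        "after the storm the bank reopened").split()
print("\n".join(kwic(text, "bank")))
```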

Glossary

common core hypothesis: The theory that all varieties of English have central fundamental properties in common with each other, and that the varieties differ quantitatively rather than qualitatively.

dialect: The term "dialect" is more difficult to define than "national variety", since dialects cannot be readily distinguished from languages on solely empirical grounds. However, "dialect" is most commonly used to mean sub-national linguistic variation which is geographically motivated. Thus Australian English might not be considered a dialect of English, while Scottish English can be regarded this way, as Scotland is a part of the United Kingdom. A smaller subset of Scottish English - such as that which is spoken in Glasgow - would almost certainly be termed a dialect.

elicited: Elicited data is data which is gained under non-naturalistic conditions, for example in a laboratory experiment, or when subjects are asked to "role-play" a situation.

KWIC: Key Word In Context.

prosody: Prosody refers to all aspects of the sound system above the level of segmental sounds, e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora typically follow widely accepted descriptive frameworks for prosody, such as that of O'Connor and Arnold (1961). Usually only the most prominent intonations are annotated, rather than the intonation of every syllable.

transitive and intransitive: Transitive verbs can take an object, while intransitive verbs can never take a direct object.

empiricism: An empiricist approach to language is dominated by the observation of naturally occurring data, typically through the medium of the corpus. For example, we may decide to determine whether sentence x is a valid sentence of language y by looking in a corpus of the language in question and gathering evidence for the grammaticality, or otherwise, of the sentence.

scientific method: No theory of science is ever complete. Popper states that empirical theories must be capable not only of being verified but also of being falsified (falsification being the process of testing a rule by looking for exceptions to it). Science proceeds by speculation and hypothesis, which form theories with predictive power.

rationalism: Rationalist theories are based, in the case of linguistics, on the development of a theory of mind, and have as a fundamental goal cognitive plausibility. The aim is to develop a theory of human language processing which actively claims to represent how the processing is actually undertaken.

homographs: Homographs are words which have the same spelling but differ in meaning, derivation or pronunciation. E.g. "boot" can mean an item of footwear, or it can mean "to kick".

lexeme: The head word form that one would look up if one were seeking the word in a dictionary. For example, the forms kicks, kicked and kicking would all be reduced to the lexeme KICK; these variant forms make up the lemma of the lexeme KICK.

part-of-speech: A way of describing a lexical item in grammatical terms, e.g. singular common noun, comparative adjective, past participle.

collocations: Collocations are characteristic co-occurrence patterns of words. For example, "Christmas" may collocate with "tree", "angel" and "presents".

cross-tabulation: Put simply, this is just a table showing the frequencies for each variable across each sample. For example, the following table gives a cross-tabulation of modal verbs across four genres of text (labelled A, B, C and D).

Modal verb     A     B     C     D
can          210   148    59    89
could        120    49    36    23
may          100    86    15    46
might         24    29    13     4
must          43    34    12    28
ought          3     4     0     1
shall          0    10    12     4

intercorrelation matrix: This is calculated from a cross-tabulation (see above) and shows how statistically similar all pairs of variables are in their distributions across the various samples. The table below shows the intercorrelations (Pearson product-moment correlation coefficients) between can, could, may, might, must, ought and shall, taken from the table above.

          can    could  may    might  must   ought  shall
can       1      0.544  0.798  0.765  0.796  0.717  0.118
could     0.544  1      0.186  0.782  0.807  0.528  0.026
may       0.798  0.186  1      0.521  0.637  0.554  0.601
might     0.765  0.782  0.521  1      0.795  0.587  0.032
must      0.796  0.807  0.637  0.795  1      0.816  0.306
ought     0.717  0.528  0.554  0.587  0.816  1      0.078
shall     0.118  0.026  0.601  0.032  0.306  0.078  1

The closer the score is to 1, the better the correlation between the two variables. The correlation between can and can is 1, as they are identical. Some variables show a greater similarity in their distributions than others: for instance, can shows a greater similarity to may (0.798) than it does to shall (0.118).
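For readers who wish to reproduce this kind of table, the sketch below computes a Pearson intercorrelation matrix from a cross-tabulation using NumPy. Note that, with only the four genre columns shown in the glossary's example, the resulting coefficients will not match the figures above, which appear to derive from a fuller set of samples.

```python
# Compute a Pearson correlation matrix over the rows of a cross-tabulation:
# each modal verb's row of genre frequencies is correlated with every other
# verb's row.
import numpy as np

verbs = ["can", "could", "may", "might", "must", "ought", "shall"]
freqs = np.array([
    [210, 148, 59, 89],
    [120,  49, 36, 23],
    [100,  86, 15, 46],
    [ 24,  29, 13,  4],
    [ 43,  34, 12, 28],
    [  3,   4,  0,  1],
    [  0,  10, 12,  4],
])

corr = np.corrcoef(freqs)        # 7 x 7 Pearson correlation matrix
for verb, row in zip(verbs, corr):
    print(verb.ljust(6), " ".join(f"{v:6.3f}" for v in row))
```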

non-parametric test: All statistical tests of significance belong to one of two distinct groups: parametric and non-parametric. Parametric tests make certain assumptions about the data on which the test is performed: first, that the data are drawn from a normal distribution (see below); second, that the data are measured on an interval scale (i.e. any interval between two measurements is meaningful - such as a person's height in centimetres); and third, they make use of parameters such as the mean and standard deviation. Non-parametric tests make no assumptions about the population from which the data are drawn, and knowledge of parameters is not necessary. These tests are generally easier to learn and apply.

normal distribution: A variable follows a normal distribution if it is continuous and if its frequency graph follows the characteristic symmetrical, bell-shaped form in which the values of mean, median and mode all coincide.

Type I and Type II errors: Although we can be reasonably confident in the results of a significance test, there is always a small chance that the decision made is wrong. There are two ways in which this can occur:

A Type I error occurs when we decide that the difference is significant (due to factors other than chance) when in fact it is not. The probability of this happening is the same as the significance level of the test. This is the more serious type of error to make (equivalent to a judge finding an innocent suspect guilty). A Type II error occurs when we decide that the difference is due to chance when in fact it is not. This is relatively less serious.
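The relationship between the significance level and the Type I error rate can be illustrated with a small simulation: when two samples are drawn from the same population, a test at the 5% level should wrongly report a significant difference in roughly 5% of runs. The choice of a t-test, the sample size and the number of runs here are arbitrary illustrative assumptions.

```python
# Simulate repeated significance tests on samples drawn from the same
# population and count how often a "significant" difference is (wrongly) found.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, runs, false_alarms = 0.05, 2000, 0

for _ in range(runs):
    a = rng.normal(size=50)          # two samples from the same population,
    b = rng.normal(size=50)          # so any significant result is a Type I error
    _, p = ttest_ind(a, b)
    if p < alpha:
        false_alarms += 1

print("observed Type I error rate: %.3f (significance level: %.2f)"
      % (false_alarms / runs, alpha))
```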
