"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.
Language acquisition
The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period: large sample studies covered the period from roughly 1927 to 1957, in which data was gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present, again based on collections of utterances, but this time with a smaller sample of children (approximately three) who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).
Spelling conventions
Käding (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.
Language pedagogy
Fries and Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the early half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.
Chomsky
Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance. Competence is best described as our tacit, internalised knowledge of a language.
Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form. Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For example, factors as diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence. Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it. However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.
Early corpus linguistics assumed that the sentences of a natural language are finite, and that they can therefore be collected and enumerated.
The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists...regarded the corpus as the sole explicandum of linguistics" (Leech, 1991). To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time." The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to infinitely long sentences. (This topic is discussed in further detail in "Corpus Linguistics" Chapter 1, pages 7-8).
The only way to account for a grammar of a language is by description of its rules - not by enumeration of its sentences. It is the syntactic rules of a language that Chomsky considers finite. These rules in turn give rise to infinite numbers of sentences.
Whatever Chomsky's criticisms were, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time. The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.
Chomsky re-examined
Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence, with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us when our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned against the rejection of performance data as a source of evidence for language acquisition studies.
Processes
Considering the marriage of machine and corpus, it seems worthwhile to consider in slightly more detail what these processes that allow the machine to aid the linguist are. The computer has the ability to search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in the text, we can simply ask the machine to search for this word in the text. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist. The machine can find the relevant text and display it to the user. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark. The processes described above are often included in a concordance program. This is the tool most often implemented in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
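The search, retrieval and counting operations described above can be sketched in a few lines. The following toy concordancer (function name and sample text invented for illustration) retrieves every occurrence of however together with its surrounding context:

```python
import re

def concordance(text, target, context=4):
    """Return KWIC (key word in context) lines for every occurrence of target."""
    # Split into word and punctuation tokens, lowercased for matching.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{target}] {right}")
    return lines

sample = ("We planned to go. However, it rained. "
          "However we went anyway, however late it was.")
hits = concordance(sample, "however")
print(len(hits))        # the frequency of the word
for line in hits:
    print(line)         # each occurrence shown in context
```

Sorting the returned lines (e.g. alphabetically on the right-hand context) and filtering them (e.g. keeping only lines where the key word is followed by a punctuation mark) are then ordinary list operations.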
seen the failure of early corpus linguistics
examined Chomsky's criticisms
seen the failings of introspective data
seen how corpus linguistics was revived
how corpus linguists study syntactic features (Section 2)
how corpus linguistics balances enumeration with introspection (Section 3)
how corpora can be used in language studies (Section 4)
Definition of a corpus
The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways. In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.
A corpus is said to be unannotated if it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated: it is no longer a body of text in which linguistic information is only implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of annotation. For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb", but this is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus. Leech (1993) describes 7 maxims which should apply in the annotation of text corpora.
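The word_TAG convention illustrated by "gives_VVZ" is straightforward to process mechanically. The sketch below splits such tokens back into word and tag; the function name is invented, and the extra tags in the example are assumed CLAWS-style codes used purely for illustration:

```python
def split_tagged(tagged_text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the LAST underscore, so words
        # containing underscores would still tag correctly.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = split_tagged("He_PPHS1 gives_VVZ generously_RR")
print(pairs)   # [('He', 'PPHS1'), ('gives', 'VVZ'), ('generously', 'RR')]
```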
Formats of Annotation
Currently, there are no widely agreed standards for representing information in texts, and in the past many different approaches have been adopted, some more lasting than others. One long-standing annotation practice is known as COCOA references. COCOA was an early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs, notably the OCP (Oxford Concordance Program). The Longman-Lancaster corpus and the Helsinki corpus have also used COCOA references. Very simply, a COCOA reference consists of a balanced set of angled brackets (< >) which contains two entities:
a code, which stands for a particular variable name
a string, or set of strings, which are the instantiations of that variable
For example, the code "A" could be used to refer to the variable "author" and the string would stand for the author's name. Thus COCOA references which indicate the author of a passage of text would look like the following: <A CHARLES DICKENS> <A WOLFGANG VON GOETHE> <A HOMER> COCOA references only represent an informal trend for encoding specific types of textual information, e.g. authors, dates and titles. Current trends are moving towards more formalised international standards of encoding. The flagship of this current trend is the Text Encoding Initiative (TEI), a project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Its aim is to provide standardised implementations for machine-readable text interchange. The TEI uses a form of document markup known as SGML (Standard Generalised Markup Language), which has a number of advantages as an encoding standard.
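A COCOA reference of the form <CODE VALUE> can be extracted with a simple pattern. This hypothetical parser assumes single-letter variable codes, as in the author examples above:

```python
import re

def parse_cocoa(text):
    """Collect COCOA-style <CODE VALUE> references, grouped by code."""
    refs = {}
    # A single capital-letter code, whitespace, then everything up to '>'.
    for code, value in re.findall(r"<([A-Z])\s+([^>]+)>", text):
        refs.setdefault(code, []).append(value)
    return refs

refs = parse_cocoa("<A CHARLES DICKENS> <A HOMER> the text itself follows here")
print(refs["A"])   # ['CHARLES DICKENS', 'HOMER']
```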
The TEI's contribution is a detailed set of guidelines as to how this standard is to be used in text encoding (Sperberg-McQueen and Burnard, 1994). In the TEI, each text (or document) consists of two parts - a header and the text itself. The header contains information such as the following:
author, title and date
the edition or publisher used in creating the machine-readable text
information about the encoding practices adopted
You might also want to read about the EAGLES advisory body in chapter 2 of Corpus Linguistics (page 29).
Orthography
It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical scanning task, but even with a basic machine-readable text, issues of encoding are vital, although to English speakers their extent may not be apparent at first. In languages other than English, accents and non-Roman alphabets such as Greek, Russian and Japanese present a problem. IBM-compatible computers are capable of handling accented characters, but many other mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Various strategies have been adopted by native speakers of languages which contain accents when using computers or typewriters which lack these characters. For example, French speakers may omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" or place a double quote mark before the relevant letter, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems - in the case of the French, information is lost, while in the German extraneous information is added. In response to this the TEI has suggested that these characters are encoded as entities, using the delimiting characters & and ;. Thus, ü would be encoded by the TEI as
&uuml;
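The entity substitution itself is mechanical. In the sketch below, the replacement table uses the conventional SGML entity names (uuml, eacute, egrave), which are assumed here for illustration; a real implementation would use a complete entity table:

```python
# Accented character -> SGML/TEI-style entity, using the & and ; delimiters.
ENTITIES = {"ü": "&uuml;", "é": "&eacute;", "è": "&egrave;"}

def encode_entities(text):
    """Replace accented characters with their entity encodings."""
    for char, entity in ENTITIES.items():
        text = text.replace(char, entity)
    return text

print(encode_entities("Frühling"))   # Fr&uuml;hling
print(encode_entities("Hélène"))     # H&eacute;l&egrave;ne
```

Unlike omitting the accent or inserting an extra "e", this encoding loses no information and adds none: the original character is always recoverable from the entity.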
Read about the handling of non-Roman alphabets and the transcription of spoken data in Corpus Linguistics, chapter 2, pages 34-36.
Types of annotation
Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:
Part of speech annotation
Lemmatisation
Parsing
Semantics
Discoursal and text linguistic annotation
Phonetic transcription
Prosody
Problem-oriented tagging
Multilingual Corpora
Not all corpora are monolingual, and an increasing amount of work is being carried out on the building of multilingual corpora, which contain texts of several different languages. First we must make a distinction between two types of multilingual corpora: the first can really be described as small collections of individual monolingual corpora, in the sense that the same procedures and categories are used for each language, but each contains completely different texts in those several languages. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora; it is not made up of translations of the same texts. The second type of multilingual corpus (and the one which receives the most attention) is the parallel corpus. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times, when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin, Greek etc. A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus, as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
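A minimal way to represent such an aligned corpus is as explicit links between sentence pairs and between token indices. The structure below is an invented sketch, not a standard alignment format; it uses the German-English example above, and a one-to-many link of the kind needed for "raucht" / "is smoking" would simply map one German index to two English indices:

```python
# One aligned sentence pair: German source, English translation.
sentence_pairs = [
    ("Das Buch ist auf dem Tisch", "The book is on the table"),
]

# Word alignment for the pair: German token index -> English token indices.
word_alignment = {
    0: [0],   # Das  -> The
    1: [1],   # Buch -> book
    2: [2],   # ist  -> is
}

de, en = sentence_pairs[0]
de_tokens, en_tokens = de.split(), en.split()
for de_idx, en_idxs in sorted(word_alignment.items()):
    print(de_tokens[de_idx], "->", " ".join(en_tokens[i] for i in en_idxs))
```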
At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.
Introduction
In this session we'll be looking at the techniques used to carry out corpus analysis. We'll re-examine Chomsky's argument that corpus linguistics will result in skewed data, and see the procedures used to ensure that a representative sample is obtained. We'll also be looking at the relationship between quantitative and qualitative research. Although the majority of this session is concerned with statistical procedures which can be said to be quantitative, it is important not to ignore the importance of qualitative analyses.
Regarding the statistical part of this session, two points should be made.
First, that this section is of necessity incomplete. Space precludes the coverage of all of the techniques which can be used on corpus data. Second, we do not aim here to provide a "step-by-step" guide to statistics. Many of the techniques used are very complex and to explain the mathematics in full would require a separate session for each one. Other books, notably Language and Computers and Statistics for Corpus Linguistics (Oakes, M. - forthcoming) present these methods in more detail than we can give here. Tony McEnery, Andrew Wilson, Paul Baker.
language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality. However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type: an item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic terms and phenomena do not belong to simple, single categories: rather they are more consistent with the recent notion of "fuzzy sets", as in the "red" example. Quantitative analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, it is essential that minimum frequencies are obtained - meaning that categories may have to be collapsed into one another, resulting in a loss of data richness.
A recent trend
From this brief discussion it can be appreciated that both qualitative and quantitative analyses have something to contribute to corpus study. There has been a recent move in social science towards multi-method approaches which tend to reject the narrow analytical paradigms in favour of the breadth of information which the use of more than one method may provide. In any case, as Schmied (1993) notes, a stage of qualitative research is often a precursor for quantitative analysis, since before linguistic phenomena can be classified and counted, the categories for classification must first be identified. Schmied demonstrates that corpus linguistics could benefit as much as any field from multi-method research.
Corpus Representativeness
As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, and therefore skewed and unrepresentative of the population as a whole. This is a valid criticism, and it applies not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as drastic as it first appears, as there are many safeguards which may be applied in sampling to ensure maximum representativeness. First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis was a long and painstaking task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it does enter significantly into the factors which must be considered in the production of a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true of those early corpora. However, today we have powerful computers which can store and manipulate many millions of words. The issue of size is no longer the problem that it used to be.
Random sampling techniques are standard to many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats which the corpus builder must be aware of. Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study, before we can define sampling procedures for it. This means that we must rigorously define our sampling frame - the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index - this was the approach taken by the builders of the Lancaster-Oslo/Bergen corpus, who used the British National Bibliography and Willing's Press Guide as their indices. Another approach is to define the sampling frame as being all the books and periodicals in a particular library which refer to your particular area of interest - for example, all the German-language books in Lancaster University library that were published in 1993. This approach is one which was used in building the Brown corpus. Read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65. Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This refers to defining the different genres, channels etc. that it is made up of. For example, written German could be made up of genres such as:
newspaper reporting
romantic fiction
legal statutes
scientific writing
poetry
and so on...
Stratificational sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification. Read about optimal lengths and number of sample sizes, and the problems of using standard statistical equations to determine these figures, in Corpus Linguistics, chapter 3, page 66.
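Stratified sampling of the kind described can be sketched as sampling each stratum independently, in proportion to its share of the population. The genre names and sizes below are invented purely for illustration:

```python
import random

# A toy "population": each genre (stratum) maps to its member texts.
population = {
    "newspaper reporting": list(range(400)),
    "legal statutes": list(range(100)),
    "poetry": list(range(50)),
}
total = sum(len(texts) for texts in population.values())
sample_size = 55

random.seed(0)  # fixed seed so the sketch is reproducible
sample = {}
for genre, texts in population.items():
    # Each stratum contributes in proportion to its share of the population.
    k = round(sample_size * len(texts) / total)
    sample[genre] = random.sample(texts, k)

for genre, texts in sample.items():
    print(genre, len(texts))   # 40, 10 and 5 texts respectively
```

Because each stratum is sampled separately, no genre can be accidentally missed, which is why this is never less representative than sampling the whole population at once.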
Frequency Counts
This is the most straightforward approach to working with quantitative data. Items are classified according to a particular scheme and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme.
For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:

abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128

More often, however, the use of a classification scheme implies a deliberate act of categorisation on the part of the investigator. Even in the case of word frequency analysis, variant forms of the same lexeme may be lemmatised before a frequency count is made. For instance, in the example above, abandon, abandons and abandoned might all be classed as the lexeme ABANDON. Very often the classification scheme used will correspond to the type of linguistic annotation which will have already been introduced into the corpus at some earlier stage (see Session 2). An example of this might be an analysis of the incidence of different parts of speech in a corpus which had already been part-of-speech tagged.
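A word frequency count with an optional lemmatisation step might be sketched as follows; the lemma table here is a toy stand-in for a real lemmatiser, and the token counts mirror the abandon example above:

```python
from collections import Counter

# Toy lemma table: variant form -> lexeme. A real system would use
# a lemmatiser or a lemmatised/tagged corpus instead.
LEMMAS = {"abandons": "abandon", "abandoned": "abandon"}

def frequency_count(tokens, lemmatise=False):
    """Count tokens, optionally collapsing variant forms into lexemes."""
    if lemmatise:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)

tokens = ["abandon"] * 5 + ["abandoned"] * 3 + ["abandons"] * 2 + ["able"] * 28

print(frequency_count(tokens)["abandon"])                   # 5
print(frequency_count(tokens, lemmatise=True)["abandon"])   # 10, i.e. 5 + 3 + 2
```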
A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:

spoken English: 50/50,000 x 100 = 0.1%
written English: 500/500,000 x 100 = 0.1%
Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora. Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing fractions of unusual numbers like 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurrences of the type under investigation is:

ratio = number of occurrences of the type / number of tokens in the entire sample

This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy-looking small number (in the above example it would be 0.001), the ratio can then be multiplied by 100 and represented as a percentage.
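The ratio calculation can be wrapped in a small helper, reproducing the boot figures above:

```python
def relative_frequency(occurrences, corpus_size, as_percent=False):
    """Occurrences of a type divided by total tokens in the sample."""
    ratio = occurrences / corpus_size
    return ratio * 100 if as_percent else ratio

# boot in the made-up spoken and written corpora:
print(relative_frequency(50, 50_000))                      # 0.001 as a decimal
print(relative_frequency(50, 50_000, as_percent=True))     # 0.1 (i.e. 0.1%)
print(relative_frequency(500, 500_000, as_percent=True))   # also 0.1%
```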
Significance Testing
Significance tests allow us to determine whether or not a finding is the result of a genuine difference between two (or more) items, or whether it is just due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John and we are looking at how third person singular speech is represented. Specifically we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form of the verb is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

                               Matthew   John
no. of occurrences of dicit         46    107
no. of occurrences of dixit        118    119
From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more certain that this is not just due to coincidence, we need to perform a further calculation - the significance test. There are several types of significance test available to the corpus linguist: the chi-squared test, the [Student's] t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test, as it is the most commonly used significance test in corpus linguistics. This is a non-parametric test which is easy to calculate, even without a computer statistics package, and can be used with data in 2 x 2 tables, such as the example above. However, it should be noted that the chi-squared test is unreliable where very small numbers are involved and should not therefore be used in such cases. Also, proportional data (percentages etc.) cannot be used with the chi-squared test.

The test compares the difference between the actual frequencies (the observed frequencies in the data) with those which one would expect if no factor other than chance had been operating (the expected frequencies). The closer these two results are to each other, the greater the probability that the observed frequencies are influenced by chance alone. Having calculated the chi-squared value (we will omit this here and assume it has been done with a computer statistical package), we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom, which is simply:

(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)

In the example above this is equal to (2-1) x (2-1) = 1. We then look at the table of chi-square values in the row for the relevant number of degrees of freedom until we find the nearest chi-square value to the one which was calculated, and read off the probability value for that column. The closer to 0 the value, the more significant the difference is - i.e. the less likely it is to be due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to assign a cut-off point which is taken to be the difference between a significant result and an "insignificant" result. This is usually taken to be 0.05 (probability values of less than 0.05 are written as "p < 0.05" and are assumed to be significant). In our example about the use of dicit and dixit above we calculate a chi-squared value of 14.843. The table below shows the significant p values for the first 3 degrees of freedom:

Degrees of freedom   p = 0.05   p = 0.01   p = 0.001
        1              3.84       6.63      10.83
        2              5.99       9.21      13.82
        3              7.81      11.34      16.27
The number of degrees of freedom in our example is 1, and our result is higher than 10.83 (see the final column in the table), so the probability value for this chi-square value is less than 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001, and we can therefore say with a high degree of certainty that this difference is a true reflection of variation in the two texts and not due to chance.
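The chi-squared value of 14.843 quoted above can be reproduced directly from the 2 x 2 table of observed frequencies. The helper below (a sketch, written out by hand rather than using a statistics package) derives the expected frequencies from the row and column totals:

```python
def chi_squared(table):
    """Chi-squared statistic for a table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency if only chance were operating.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

observed = [[46, 107],   # dicit: Matthew, John
            [118, 119]]  # dixit: Matthew, John

print(round(chi_squared(observed), 3))   # 14.843
```

With 1 degree of freedom, this exceeds the 10.83 critical value for p = 0.001, matching the conclusion in the text.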
Collocations
The idea of collocation is an important one in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.
Given a text corpus it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered formulae are mutual information and the Z-score. Both tests provide similar data, comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that they occur together simply as the result of chance. For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots), while the words formula and borrowed may simply occur because of a one-off juxtaposition and have no special relationship. For each pair of words a score is given - the higher the score, the greater the degree of collocality. Mutual information and the Z-score are useful in the following ways:
o They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly in specialist technical translation.
o We can group similar collocates of a word together to help identify its different senses. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment, indicating the financial sense.
o We can discriminate the differences in usage between words which are similar. For example, Church et al (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students learning English as a foreign language.
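As a sketch of how such a score can be computed, the function below calculates a simple mutual information score for adjacent word pairs, comparing the observed probability of the bigram with the probability expected if the two words occurred independently. The toy corpus in the usage note is invented for illustration.

```python
import math
from collections import Counter

def bigram_mutual_information(tokens, w1, w2):
    """Mutual information score for the adjacent word pair (w1, w2).

    Compares the joint probability of the bigram with the product of
    the individual word probabilities; a high positive score suggests
    that the two words "belong together".
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_joint = bigrams[(w1, w2)] / (n - 1)
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_joint / (p_w1 * p_w2))

# Invented toy corpus in which "riding" and "boots" always co-occur.
tokens = ["riding", "boots", "are", "nice", "riding", "boots", "fit"]
score = bigram_mutual_information(tokens, "riding", "boots")
# score is positive, reflecting the "glue" between the two words
```

A real collocation tool would also use a wider co-occurrence window and guard against pairs that never occur together; this sketch only handles adjacent pairs that do.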
Multiple Variables
The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features), but they cannot provide a picture of the complex interrelationship of similarity and difference between a large number of samples and large numbers of variables. To perform such comparisons we need to consider multivariate techniques. Those most commonly encountered in linguistic research are: factor analysis, correspondence analysis, multidimensional scaling and cluster analysis.
The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set on the basis of statistical similarities between the original variables, while losing as little information as possible about their differences.
Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work. All the techniques begin with a basic cross-tabulation of the variables and samples.

For factor analysis an intercorrelation matrix is then calculated from the cross-tabulation, which is used to attempt to "summarise" the similarities between the variables in terms of a smaller number of reference factors which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of variables (the factors) which can better explain why the observed frequency differences occur. Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts one might find that words in a certain conceptual field (e.g. religion) received high loadings on one factor, whereas those in another field (e.g. government) loaded highly on another factor.

Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.

Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values, e.g. the highest correlation value receives a rank order of 1, the next highest receives a rank order of 2, and so on. MDS then attempts to plot and arrange the variables so that the more closely related items are plotted closer together than the less closely related items.

Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items.
A matrix is created, in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.
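The first stage that all of these techniques share - computing an intercorrelation matrix from the cross-tabulation - can be sketched as follows, using the Pearson product-moment correlation between each pair of variables. The example frequencies in the usage note are purely illustrative.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two frequency rows."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def intercorrelation_matrix(rows):
    """Correlate every pair of variables (rows of a cross-tabulation)."""
    return [[pearson(a, b) for b in rows] for a in rows]

# Illustrative frequencies of three variables across four samples.
rows = [[210, 148, 59, 89], [120, 49, 36, 23], [100, 86, 15, 46]]
matrix = intercorrelation_matrix(rows)
# matrix[0][0] is 1.0: each variable correlates perfectly with itself,
# and the matrix is symmetric about that diagonal.
```

Factor analysis, MDS and cluster analysis would then each process this matrix in their own way, as described above.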
Log-linear Models
Here we will consider a different technique which deals with the interrelationships of several variables. As linguists, we often want to go beyond the simple description of a phenomenon, and explain what it is that causes the data to behave in a particular way. A loglinear analysis allows us to take a standard frequency cross-tabulation and find out which variables seem statistically most likely to be responsible for a particular effect. For example, let us imagine that we are interested in the factors which influence whether the word for is present or omitted from phrases of duration such as She studied [for] three years in Munich. We may hypothesise several factors which could have an effect on this, e.g. the text genre, the semantic category of the main verb and whether or not the verb is separated by an adverb from the phrase of duration. Any one of these factors might be solely responsible for the
omission of for, or it might be the case that a combination of factors is responsible. Finally, all the factors working together could be responsible for the presence/omission of for. A loglinear analysis provides us with a number of models which take these points into account. The way that we test the models in loglinear analysis is first to test the significance of associations in the most complex model - that is, the model which assumes that all of the variables are working together. Then we take away one variable at a time from the model and see whether significance is maintained in each case, until we reach the model with the lowest possible dimensions. So in the above example, we would start with a model that posited three variables (e.g. genre, verb class and adverb separation) and test the significance of a three-variable model. Then we would test each of the two-variable models (taking away one variable in each case) and finally each of the three one-variable models. The best model would be taken to be the one with the fewest variables which still retained statistical significance.
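Model comparison of this kind rests on a goodness-of-fit statistic for each candidate model. As an illustrative sketch (not the full loglinear machinery), the function below computes the likelihood-ratio statistic G-squared comparing the independence model with the saturated model for a two-way frequency table; the counts in the usage note are invented.

```python
import math

def g_squared(table):
    """Likelihood-ratio statistic (G-squared) for a two-way table.

    Compares the observed counts with those expected under the
    independence model; a value close to 0 means the simpler model
    fits the data well and the association term can be dropped.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
    return g2

# An invented table with no association at all: G-squared is 0, so
# removing the association term loses nothing and the simpler model wins.
g2 = g_squared([[10, 20], [20, 40]])
```

In a full loglinear analysis this comparison is repeated as each variable is removed, as the text describes, retaining the simplest model that still fits significantly.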
Introduction
In this session we will examine the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language. Empirical data also allow us to study language varieties such as dialects, or earlier periods in a language, for which it is not possible to carry out a rationalist approach. It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety. Corpus linguistics, proper, should be seen as a subset of the activity within an empirical approach to linguistics. Although corpus linguistics entails an empirical approach, empirical linguistics does not always entail the use of a corpus. In the following pages we'll consider the roles which corpora may play in a number of different fields of study related to language. We will focus on the conceptual issues of why corpus data are important to these areas, and how they can contribute to the advancement of knowledge in each, providing real examples of corpus use. In view of the huge amount of corpus-based linguistic research, the examples are necessarily selective - you can consult further reading for additional examples.
It provides a broad sample of speech, extending over a wide selection of variables such as:
o speaker gender
o speaker age
o speaker class
o genre (e.g. newsreading, poetry, legal proceedings etc)
This allows generalisations to be made about spoken language as the corpus is as wide and as representative as possible. It also allows for variations within a given spoken language to be studied.
It provides a sample of naturalistic speech rather than speech elicited under artificial conditions. The findings from the corpus are therefore more likely to reflect language as it is spoken in "real life", since the data are less likely to be subject to production monitoring by the speaker (such as trying to suppress a regional accent). Because the (transcribed) corpus has usually been enhanced with prosodic and other annotations, it is easier to carry out large-scale quantitative analyses than with fresh raw data. Where more than one type of annotation has been used it is possible to study the interrelationships between, say, phonetic annotations and syntactic structure.
author, date, genre, part-of-speech tags etc it is easier to tie down usages of particular words or phrases as being typical of particular regional varieties, genres and so on. 9. The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building, as it enables lexicographers to keep on top of new words entering the language, existing words changing their meanings, or the balance of their use shifting according to genre etc. However, finite corpora also have an important role in lexical studies - in the area of quantification. It is possible to rapidly produce reliable frequency counts and to subdivide these across various dimensions according to the varieties of language in which a word is used. 10. Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words (see Session 3), mean that we can treat phrases and collocations more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.
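The kind of frequency counting described in point 9 can be sketched very simply; the tiny "corpus" below is invented for illustration.

```python
from collections import Counter

def frequency_list(tokens, top=None):
    """Frequency count of word forms, most frequent first."""
    counts = Counter(token.lower() for token in tokens)
    return counts.most_common(top)

tokens = "the cat sat on the mat and the dog sat too".split()
freqs = frequency_list(tokens, top=2)
# [('the', 3), ('sat', 2)]
```

With an annotated corpus, the same counting could be subdivided by genre, part of speech or word sense, as the text suggests.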
o The potential for the representative quantification of a whole language variety.
o Their role as empirical data for the testing of hypotheses derived from grammatical theory.
Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk and de Haan (1994a) are aiming to analyse the frequency of the various English clause types. Since the 1950s the rational-theory-based/empiricist-descriptive division in linguistics (see Session One) has often meant that these two approaches have been viewed as separate and in competition with each other. However, there is a group of researchers who have used corpora in order to test essentially rationalist grammatical theory, rather than use it for pure description or the inductive generation of theory. At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. The grammar is then loaded into a computer parser and is run over a corpus to test how far it accounts for the data in the corpus. The grammar is then modified to take account of those analyses which it missed or got wrong.
o right was used in all functions, but especially as a response, to evaluate a previous response or terminate an exchange.
o all right was used to mark a boundary between two stages in discourse.
o that's right was used as an emphasiser.
o it's alright and that's alright were responses to apologies.
The availability of new conversational corpora, such as the spoken part of the BNC (British National Corpus), should provide a greater incentive both to extend and to replicate such studies, since both the amount of conversational data available and the social/geographical range of people recorded will have increased. At present, pragmatics has been poorly served by quantitative corpus-based analysis. Hopefully this is one area which will be exploited by linguists in the near future.
Stylistics researchers are usually interested in individual texts or authors rather than the more general varieties of a language, and tend not to be large-scale users of corpora. Nevertheless, some stylisticians are interested in investigating broader issues such as genre, and others have found corpora to be important sources of data in their research. In order to define an author's particular style, we must, in part, examine the degree to which the author leans towards different ways of putting things (technical vs non-technical vocabulary, long sentences vs short sentences and so on). This task requires comparisons to be made not only internally within the author's own work, but also with other authors or the norms of the language or variety as a whole. As Leech and Short (1981) point out, stylistics often demands the use of quantification to back up judgements which may otherwise appear subjective rather than objective. This is where corpora can play a useful role. Another type of stylistic variation is the more general variation between genres and channels. For example, one of the most common uses of corpora has been in looking at the differences between spoken and written language. Altenberg (1984) examined the differences in the ordering of cause-result constructions, while Tottie (1991) looked at the differences in negation strategies. Other work has looked at variations between genres, using subsamples of corpora as a database. For example, Wilson (1992) used sections from the LOB and Kolhapur corpora, the Augustan Prose Sample and a sample of modern English conversation to examine the usage of since, and found that causal since had evolved from being the main causal connective in late seventeenth-century writing to being characteristic of formal learned writing in the twentieth century.
looked at future time expressions in German textbooks of English. These studies have similar methodologies - they analyse the relevant constructions or vocabularies, both in the sample textbooks and in standard English corpora, and then compare the findings from the two sets. Most studies found that there were considerable differences between what textbooks are teaching and how native speakers actually use language, as evidenced in the corpora. Some textbooks gloss over important aspects of usage, or foreground less frequent stylistic choices at the expense of more common ones. The general conclusion from these studies is that non-empirically based teaching materials can be misleading, and that corpus studies should be used to inform the production of material so that the more common choices of usage are given more attention than those which are less common. Read about language teaching for "special purposes" in Corpus Linguistics, Chapter 4, pages 104-105. Corpora have also been used in the teaching of linguistics. Kirk (1994) requires his students to base their projects on corpus data which they must analyse in the light of a model such as Brown and Levinson's politeness theory or Grice's co-operative principle. In taking this approach, Kirk is using corpora not only as a way of teaching students about variation in English but also to introduce them to the main features of a corpus-based approach to linguistic analysis. A further application of corpora in this field is their role in computer-assisted language learning. Recent work at Lancaster University has looked at the role of corpus-based computer software for teaching undergraduates the rudiments of grammatical analysis (McEnery and Wilson 1993). This software - Cytor - reads in an annotated corpus (either part-of-speech tagged or parsed) one sentence at a time, hides the annotation and asks the student to annotate the sentence him- or herself.
Students can call up help in the form of the list of tag mnemonics, a frequency lexicon or concordances of examples. McEnery, Baker and Wilson (1995) carried out an experiment over the course of a term to determine how effective Cytor was at teaching part-of-speech annotation, by comparing two groups of students - one taught with Cytor, and another taught via traditional lecturer-based methods. In general the computer-taught students performed better than the human-taught students throughout the term.
In recent years, however, some historical linguists have changed their approach, resulting in an upsurge in strictly corpus-based historical linguistics and the building of corpora for this purpose. The most widely known English historical corpus is the Helsinki corpus. The Helsinki corpus contains approximately 1.6 million words of English dating from the earliest Old English period (before AD 850) to the end of the Early Modern English period (1710). It is divided into three main periods - Old English, Middle English and Early Modern English - and each period is subdivided into a number of 100-year subperiods (or 70-year subperiods in some cases). The Helsinki corpus is representative in that it covers a range of genres, regional varieties and sociolinguistic variables such as gender, age, education and social class. The Helsinki team have also produced "satellite" corpora of early Scots and early American English. Other examples of English historical corpora in development are the Zürich Corpus of English Newspapers (ZEN), the Lampeter Corpus of Early Modern English Tracts (a sample of English pamphlets from between 1640 and 1740) and the ARCHER corpus (a corpus of British and American English from 1650-1990). The work which is carried out on historical corpora is qualitatively similar to that which is carried out on modern language corpora, although it is also possible to carry out work on the evolution of language through time. For example, Peitsara (1993) used four subperiods from the Helsinki corpus and calculated the frequencies of different prepositions introducing agent phrases. Throughout the period she found that the most common prepositions of this type were of and by, which were of almost equal frequency at the beginning of the period, but by the fifteenth century by was three times more common than of, and by 1640 by was eight times as common.
Studies like this have particular importance in the context of Halliday's (1991) conception of language evolution as a motivated change in the probabilities of the grammar. However, it is important to be aware of the limitations of corpus linguistics, as Rissanen (1989) pointed out. Rissanen identifies three main problems associated with using historical corpora:
1. The "philologist's dilemma" - the danger that the use of a corpus and a computer may supplant the in-depth knowledge of language history which is to be gained from the study of original texts in their context.
2. The "God's truth fallacy" - the danger that a corpus may be used to provide representative conclusions about the entire language period, without understanding its limitations in terms of which genres it does and does not cover.
3. The "mystery of vanishing reliability" - the more variables which are used in sampling and coding the corpus (periods, genres, age, gender etc), the harder it is to represent each one fully and achieve statistical reliability. The most effective way of solving this problem is, of course, to build larger corpora.
Rissanen's reservations are valid and important, but they should not diminish the value of corpus-based linguistics; rather, they should serve as warnings of possible pitfalls which need to be taken on board by scholars, since with appropriate care they are surmountable.
materials. Sampled corpora can provide psycholinguists with more concrete and reliable information about frequency, including the frequencies of different senses and parts of speech of ambiguous words (if the corpora are annotated). A more direct example of the role of corpora in psycholinguistics can be seen from Garnham et al's (1981) study, which used the London-Lund corpus to examine the occurrence of speech errors in natural conversational English. Before the study was carried out nobody knew how frequent speech errors were in everyday language, because such an analysis required adequate amounts of natural conversation, while previous work on speech errors had been based on the gradual ad hoc accumulation of data from many different sources. However, the spoken corpus was able to provide exactly the kind of data that was required. Garnham's study was able to classify and count the frequencies of different error types and hence provide some estimate of the general frequency of these in relation to speakers' overall output. A third role for corpora lies in the analysis of language pathologies, where an accurate picture of abnormal data must be constructed before it is possible to hypothesise and test what may be wrong with the human language processing system. Although little work has been done with sampled corpora to date, it is important to stress their potential for these analyses. Studies of the language of linguistically impaired people, and of the language of children who are developing their (normal) linguistic skills, lack the quantified representative descriptions which corpora can make available. In the last decade, however, there has been a move towards the empirical analysis of machine-readable data in these areas.
For example, the Polytechnic of Wales (POW) corpus is a corpus of children's language; a corpus of impaired and normal language development has been collected at Reading University; and the CHILDES database contains a large amount of impaired and normal child language in several languages.
area of study, which could also integrate more closely work in language learning with that in national cultural studies.
Conclusion
In this session we have seen how a number of areas of language study have benefited from exploiting corpus data. To summarise, the main advantages of corpora are:
Sampling and quantification. Because a corpus is sampled to be maximally representative of the population, findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification, because it can tell us about a variety of language, not just the sample being analysed.
Ease of access. As the data collection has been dealt with by someone else, the researcher does not have to work through the issues of sampling, collection and encoding. The majority of corpora are readily available, either free or at low cost. Once a corpus has been obtained, it is usually easy to access the data within it, e.g. by using a concordance program.

Enriched data. Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.

Naturalistic data. Corpus data are not always completely unmonitored, in the sense that speakers and writers are not always unaware that their language is being collected. But for the most part, the data are largely naturalistic, unmonitored and the product of real social contexts. Thus the corpus provides one of the most reliable sources of naturally occurring data that can be examined.
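As an illustration of the ease-of-access point, a minimal concordance look-up (KWIC, "key word in context") can be sketched in a few lines; the sentence in the usage note is invented.

```python
def kwic(tokens, keyword, context=3):
    """Key Word In Context: list each occurrence of keyword with a
    few words of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{token}] {right}")
    return lines

# Invented example: both occurrences of "the" are returned in context.
tokens = "the cat sat on the mat".split()
concordance = kwic(tokens, "the")
```

Real concordance programs add sorting on left or right context, regular-expression search and annotation-aware queries, but the underlying idea is this simple.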
Glossary
common core hypothesis: The theory that all varieties of English have central fundamental properties in common with each other, which differ quantitatively rather than qualitatively.

dialect: The term "dialect" is more difficult to define than "national variety", since dialects cannot be readily distinguished from languages on solely empirical grounds. However, "dialect" is most commonly used to mean sub-national linguistic variation which is geographically motivated. Therefore, Australian English would not be considered a dialect of English, while Scottish English can be regarded this way, as Scotland is a part of the United Kingdom. A smaller subset of Scottish English - such as that which is spoken in Glasgow - would almost certainly be termed a dialect.

elicited: Elicited data are data gained under non-naturalistic conditions, for example in a laboratory experiment, or when subjects are asked to "role-play" a situation.

KWIC: Key Word In Context.

prosody: Prosody refers to all aspects of the sound system above the level of segmental sounds, e.g. stress, intonation and rhythm. The annotations in prosodically annotated corpora typically follow widely accepted descriptive frameworks for prosody such as that of O'Connor and Arnold (1961). Usually, only the most prominent intonations are annotated, rather than the intonation of every syllable.

transitive and intransitive: Transitive verbs can take an object, while intransitive verbs can never take a direct object.

empiricism: An empiricist approach to language is dominated by the observation of naturally occurring data, typically through the medium of the corpus. For example, we may decide to determine whether sentence x is a valid sentence of language y by looking in a corpus of the language in question and gathering evidence for the grammaticality, or otherwise, of the sentence.

scientific method: No theory of science is ever complete.
Popper states that empirical theories have to have the property not only of being verifiable, but of being falsifiable (falsification being the process of testing a rule by looking for exceptions to it). Science proceeds by speculation and hypothesis, which form theories with predictive power.

rationalism: Rationalist theories are based, in the case of linguistics, on the development of a theory of mind, and have cognitive plausibility as a fundamental goal. The aim is to develop a theory of human language processing which actively claims to represent how the processing is actually undertaken.

homographs: Homographs are words which have the same spelling but differ in meaning, derivation or pronunciation. E.g. "boot" can mean an item of footwear, or it can mean "to kick".

lexeme: The head word form that one would look up if one were looking for the word in a dictionary. For example, the forms kicks, kicked and kicking would all be reduced to the lexeme KICK. These variants form the lemma of the lexeme KICK.

part-of-speech: A way of describing a lexical item in grammatical terms, e.g. singular common noun, comparative adjective, past participle.
collocations: Collocations are characteristic co-occurrence patterns of words. For example, "Christmas" may collocate with "tree", "angel" and "presents".

cross-tabulation: Put simply, this is just a table showing the frequencies for each variable across each sample. For example, the following table gives a cross-tabulation of modal verbs across 4 genres of text (labelled A, B, C and D):

Genre    A      B      C      D
can      210    148    59     89
could    120    49     36     23
may      100    86     15     46
might    24     29     13     4
must     43     34     12     28
ought    3      4      0      1
shall    0      10     12     4
intercorrelation matrix: This is calculated from a cross-tabulation (see above) and shows how statistically similar all pairs of variables are in their distributions across the various samples. The table below shows the Pearson product moment correlation coefficients of can and could with each of the modal verbs from the table above:

Word     can      could
can      1
could    0.544    1
may      0.798    0.186
might    0.765    0.782
must     0.796    0.807
ought    0.717    0.528
shall    0.118    0.026
The closer the score is to 1, the better the correlation between the two variables. The relationship between can and can is 1, as they are identical. Some variables show a greater similarity in their distributions than others: for instance, can shows a greater similarity to may (0.798) than it does to shall (0.118). non-parametric test: All statistical tests of significance belong to one of two distinct groups: parametric and non-parametric.
Parametric tests make certain assumptions about the data on which the test is performed. First, there is the assumption that the data are drawn from a normal distribution (see below); second, that the data are measured on an interval scale (i.e. any interval between two measurements is meaningful - such as a person's height in cms); third, parametric tests make use of parameters such as the mean and standard deviation. Non-parametric tests make no assumptions at all about the population from which the data are drawn, and knowledge of parameters is not necessary either. These tests are generally easier to learn and apply.

normal distribution: A variable follows a normal distribution if it is continuous and if its frequency graph follows the characteristic, symmetrical, bell-shaped form in which the values of the mean, median and mode all coincide.

Type I and Type II errors: Although we can be confident that the results of a significance test are accurate, there is always a small chance that the decision made might be wrong. There are two ways that this can occur:
A Type I error occurs when we decide the difference is significant (due to factors other than chance) when in fact it is not. The probability of this happening is the same as the significance level of the test. This is the more serious type of error to make (equivalent to a judge finding an innocent suspect guilty). A Type II error occurs when we decide that the difference is due to chance when in fact it is not. This is relatively less serious.