The International Corpus of Learner English

Teachers of English to Speakers of Other Languages, Inc.
(TESOL)
The International Corpus of Learner English: A New Resource for Foreign Language Learning
and Teaching and Second Language Acquisition Research
Author(s): Sylviane Granger
Reviewed work(s):
Source: TESOL Quarterly, Vol. 37, No. 3 (Autumn, 2003), pp. 538-546
Published by: Teachers of English to Speakers of Other Languages, Inc. (TESOL)
Stable URL: http://www.jstor.org/stable/3588404 .
Accessed: 13/02/2012 18:57
The International Corpus of Learner English:
A New Resource for Foreign Language
Learning and Teaching and Second Language
Acquisition Research
SYLVIANE GRANGER
University of Louvain
Louvain-la-Neuve, Belgium
In the late 1950s, when corpus linguistics made its debut on the
linguistic scene, it was a very modest enterprise in the hands of a small
group of enthusiasts. Looking back on this period, Leech (1991), one of
the pioneers of corpus linguistics, recalls that "for years, corpus linguistics was the obsession of a small group which received little or no
recognition from either linguistics or computer science" (p. 25). Since
that time, the group of enthusiasts has grown considerably and corpus
linguistics has progressively infiltrated most, if not all, language-related
disciplines. One of its major contributions has been in the field of
variation studies. The diversification of corpora has given linguists a firm
basis for comparing language varieties distinguished in terms of the
medium (spoken vs. written), the field (general vs. specialized), and
geographical status (World Englishes).
For years, foreign/second language learner varieties remained conspicuously absent from corpus-based research. Only in the early 1990s
did publishers and academics, concurrently but independently, start
collecting and analyzing learner data. Two learner English corpora
originated in that early period: the Longman Learners' Corpus (see
Longman Corpus Network, 2003) and the International Corpus of
Learner English (ICLE; see Granger, n.d.). As the latter is now being
made available to the academic community (Granger, Dagneaux, &
Meunier, 2002), it seems only fitting to describe the corpus in detail and,
more importantly, to highlight the benefits it offers ESOL researchers
and teachers.
DESIGN CRITERIA
A computer learner corpus (CLC) is an electronic collection of
authentic texts produced by foreign or second language learners.
Although all corpora need to be assembled according to explicit design
criteria (Atkins & Clear, 1992), extra care has to be taken in collecting
the data for learner corpora given the large number of variables
affecting the learning/acquisition process. The ICLE is a very richly
documented corpus: More than 20 task and learner variables have been
recorded for each of the texts in the corpus through a detailed profile
questionnaire completed by all learners. As shown in Figure 1, some of
these variables (medium, genre, average length, learner proficiency
level) were used as corpus design criteria and are therefore shared by all
texts in the corpus whereas others (gender, mother tongue background,
essay topic) differ from text to text. All the variables have been stored in
a database and can be used by researchers as queries to compile
subcorpora that match certain criteria, thus allowing for interesting
comparisons (e.g., female vs. male learners, German- vs. Spanish-speaking
learners).
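The idea of compiling subcorpora from stored variables can be illustrated with a minimal sketch. It does not reproduce the ICLE query interface, which is not described here; the record layout, the field names (essay_id, mother_tongue, gender), and the four records are all hypothetical and serve only to show how variable-based selection yields comparable subcorpora.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EssayProfile:
    """One essay's profile; all fields are hypothetical, not the ICLE schema."""
    essay_id: str
    mother_tongue: str
    gender: str


def select_subcorpus(profiles, **criteria):
    """Return the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]


# Invented records, for illustration only.
PROFILES = [
    EssayProfile("e1", "German", "female"),
    EssayProfile("e2", "Spanish", "male"),
    EssayProfile("e3", "German", "male"),
    EssayProfile("e4", "Spanish", "female"),
]

german = select_subcorpus(PROFILES, mother_tongue="German")
spanish_female = select_subcorpus(PROFILES, mother_tongue="Spanish", gender="female")

print([p.essay_id for p in german])
print([p.essay_id for p in spanish_female])
```

Any pair of selections made this way (e.g., German-speaking vs. Spanish-speaking learners) can then be passed to the same analysis, so that the comparison varies only in the chosen variable.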
Learner Variables
The learners who have contributed data to the ICLE have a great deal
in common. All are young adults (about 20 years old) who study English
in a non-English-speaking country; that is, they are EFL rather than ESL
learners. They are all university undergraduates specializing in English
in their second, third, or fourth year, and their level can be roughly
described as advanced, although individual learners and learner groups
differ in proficiency. The corpus focuses on advanced interlanguage
partly because of the wish to compensate for its relative neglect in
comparison with lower proficiency levels, resulting in a dearth of
pedagogical materials for the advanced learner.
FIGURE 1
ICLE Task and Learner Variables
In spite of their similarities in terms of age, L2 status, and proficiency
level, the learners display some significant differences, the most important one being mother tongue. The ICLE database covers 11 different
mother tongue backgrounds: Bulgarian, Czech, Dutch, Finnish, French,
German, Italian, Polish, Russian, Spanish, and Swedish. These groups
are further subcategorized according to the geographical provenance of
the learners, thus distinguishing between Dutch-speaking learners from
the Netherlands and Belgium, or Finnish-speaking learners from Finland and Sweden. In addition, the learners' knowledge of other foreign
languages is recorded. Another variable with a potentially significant
impact on learner output is the amount of time learners have spent in an
English-speaking country. ICLE learners differ considerably in this
respect: 40% have never stayed in an English-speaking country whereas
some 30% have lived in an English-speaking environment for 3 months
or more. A last relevant variable is gender: The corpus contains data
from both male and female learners, although the latter clearly constitute the majority (80%).
Task Variables
The ICLE data share a large number of task attributes. They consist
exclusively of written productions of a particular genre, namely, essay
writing, and represent general English rather than English for specific
purposes. They are, on average, 700 words in length, unabridged. The
topics are extremely varied, although the majority of them (85%) are
argumentative. (Given the difficulty in collecting a sufficient number of
argumentative essays, we allowed for the inclusion of a small portion, 25% at most, of literary essays in the data.) The query system allows
researchers to select essays on the same or similar topics. For instance, by
entering the key word women as a search term, the researcher can retrieve
a subcorpus of essays on topics such as "Feminists have done more harm
to the cause of women than good," "Women's Liberation," "Single
women should not be allowed to have artificial insemination," and "Have
real women disappeared?"
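A key word search over essay topics of this kind can be sketched as follows. The four matching titles are those quoted above; the fifth title, the function name topics_matching, and the whole-word, case-insensitive matching rule are assumptions made for illustration and are not taken from the ICLE query system.

```python
import re

# The first four topics are quoted in the text; the last one is invented
# so that the example contains a non-matching title.
TOPICS = [
    "Feminists have done more harm to the cause of women than good",
    "Women's Liberation",
    "Single women should not be allowed to have artificial insemination",
    "Have real women disappeared?",
    "Crime does not pay",
]


def topics_matching(keyword, topics):
    """Return the topics containing the key word as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [topic for topic in topics if pattern.search(topic)]


women_topics = topics_matching("women", TOPICS)
for topic in women_topics:
    print(topic)
```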
The essays also include certain differences in task settings. Recorded
variables pertaining to the task are whether there was a time limit for
writing, whether the essay was part of an exam, and whether the learners
were allowed to use language reference tools such as grammars or
dictionaries.
Size and Representativeness

The ICLE database contains 3,640 essays, totaling 2.5 million words.
Each of the 11 national (i.e., L1-differentiated) varieties comprises
around 330 essays totaling approximately 200,000 words. Compared with
current very large corpora, such as the British National Corpus (100
million words) or the Bank of English (450 million words), the ICLE is
very small. However, when it comes to learner language, size cannot
simply be assessed in terms of the number of words. Equally important is
the number of learners, and, in this respect, the ICLE, which contains
writing by well over 3,000 learners, constitutes a solid empirical basis for
second language acquisition (SLA) and foreign language teaching
research. (There are slightly more essays than learners, as some learners
contributed more than one essay without exceeding the maximum limit
of 1,000 words per learner.) However, because of its limited number of
words, the ICLE cannot be used for all types of linguistic investigation. It
lends itself well to the analysis of high-frequency phenomena at all
linguistic levels (morphology, grammar, lexis, discourse) but is unsuited
for the study of infrequent linguistic items.
ANALYSIS OF THE CORPUS
Contrastive Interlanguage Analysis

The method most frequently used so far to analyze the ICLE is
contrastive interlanguage analysis, an approach that consists in carrying
out either a comparison of learner data with native speaker data (L2 vs.
L1) or a comparison between different types of learner data (L2 vs. L2)
(see Granger, 1996).
The first type of comparison makes it possible to uncover the patterns
of use distinguishing learner data from native data. These fall into two
categories: qualitative differences (misuse) and quantitative differences
(over- and underuse). This type of analysis is greatly facilitated by text
retrieval programs such as WordSmith Tools (Scott, 1996), which uses
the compare lists function to give researchers immediate access to the
words or phrases that are significantly under- or overused by learners.
The concord and collocate display functions are also extremely valuable as
they shed light on the recurring patterns or collocates that learners use,
whether correctly or incorrectly.
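To make the notion of significant over- and underuse concrete, the sketch below compares the frequency of one word in a learner corpus and a native corpus using the log-likelihood ratio, a statistic widely used for frequency comparison in corpus linguistics. Whether this matches the computation performed by the compare lists function is not stated here and is an assumption; the function names and all counts are invented for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio for one word's frequency in two corpora."""
    total = freq_a + freq_b
    # Expected frequencies if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def usage_direction(freq_learner, size_learner, freq_native, size_native):
    """Label the learner frequency relative to the native frequency."""
    learner_rate = freq_learner / size_learner
    native_rate = freq_native / size_native
    if learner_rate > native_rate:
        return "overuse"
    if learner_rate < native_rate:
        return "underuse"
    return "equal"


# Invented counts: a word occurring 150 times in 200,000 learner words
# and 50 times in 200,000 native words.
score = log_likelihood(150, 200_000, 50, 200_000)
direction = usage_direction(150, 200_000, 50, 200_000)

# 3.84 is the chi-square critical value for p < .05 with 1 degree of freedom.
print(direction, "significant" if score > 3.84 else "not significant")
```

The statistic indicates only how unlikely the frequency difference is under equal use; the direction of the difference has to be read from the relative frequencies, which is why the two are computed separately.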
The second type of comparison is essential to establish whether the
differences uncovered are developmental or transfer related. With its
wide range of mother-tongue backgrounds, the ICLE is an ideal resource
to establish the importance of transfer in SLA. According to Odlin
(1989), if the phenomenon of transfer is still incompletely understood, it
is largely due to the heterogeneity of the data used:
A brief look at the studies cited will show considerable variation in the
numbers of subjects, in the backgrounds of the subjects, and in the empirical
data, which come from tape-recorded samples of speech, from student
writing, from various types of tests, and from other sources. (p. 151)
A highly controlled learner corpus such as the ICLE, with its strict design
criteria and rich documentation, should go some way in answering
Odlin's call for "improvements in data gathering" (p. 151).
Using this computer-aided contrastive approach, researchers have
been able to uncover a wide range of patterns of under-, over-, and
misuse in learner lexis, (lexico-) grammar, and discourse (see Centre for
English Corpus Linguistics, 2002, for a comprehensive learner corpus
bibliography based on the ICLE or other learner corpora). Among the
many topics that have been analyzed so far on the basis of ICLE data are
high-frequency words, Romance words, recurrent combinations, collocations and formulae, prefabricated language, lexical profiling, lexical
variation, adjective intensification, the verb make, progressives, passives,
modality, noun phrase complexity, demonstratives, contractions, logical
connectors, causal links, conjunctions, participle clauses, direct questions, tense errors, lexical errors, part-of-speech tagging, and parsing.
Computer-Aided Error Analysis
Differences in frequency patterns are not the only differences between learner and native corpora. Learner writing, even at an advanced
proficiency level, is characterized by a much higher error rate than
native writing (e.g., in the French subcorpus of the ICLE, the rate is
1 error in every 16 words). As current spelling and grammar-checking
programs are not capable of detecting, let alone correcting, the majority
of these errors (Granger & Meunier, 1994), error annotation is the only
solution for the time being. This time-consuming but highly rewarding
process consists in annotating all errors (or errors in a particular
category, e.g., verb complementation or modals) in the text files using a
standardized system of error tags and an error editor to speed up the
process (see Dagneaux, Denness, & Granger, 1998, for a detailed
description).
Once files have been error tagged, it is possible to search for any error
category using a text retrieval program such as WordSmith Tools, sort the
errors in various ways, and analyze them in the full context of the text.
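The retrieval step can be sketched in a few lines. The inline tag syntax used below (an err element with a cat attribute), the category names, and the sample sentence are all invented for illustration; they are not the tagging system described in Dagneaux, Denness, and Granger (1998), and the code stands in for, rather than reproduces, a text retrieval program.

```python
import re
from collections import Counter

# Hypothetical tag format: <err cat="CATEGORY">erroneous text</err>
ERROR_TAG = re.compile(r'<err cat="(?P<cat>[a-z_]+)">(?P<text>.*?)</err>')


def find_errors(tagged_text, category=None, context=30):
    """Return (category, error text, left context, right context) tuples."""
    hits = []
    for match in ERROR_TAG.finditer(tagged_text):
        if category is None or match.group("cat") == category:
            left = tagged_text[max(0, match.start() - context):match.start()]
            right = tagged_text[match.end():match.end() + context]
            hits.append((match.group("cat"), match.group("text"), left, right))
    return hits


# Invented learner sentence with invented annotations.
SAMPLE = (
    'Last year I <err cat="tense">have visited</err> London, and I '
    '<err cat="lexis">made</err> many photos. When I was there I '
    '<err cat="tense">am living</err> with a family.'
)

tense_errors = find_errors(SAMPLE, category="tense")
category_counts = Counter(cat for cat, _, _, _ in find_errors(SAMPLE))

# Sort the tense errors alphabetically by the erroneous form.
for cat, text, left, right in sorted(tense_errors, key=lambda hit: hit[1]):
    print(f"{left}[{text}]{right}")
print(category_counts)
```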
Although error tagging has not been used on a large scale yet, preliminary work shows the tremendous potential of the approach (see Granger,
1999, for an analysis of verb tense errors).
ENGLISH LANGUAGE TEACHING APPLICATIONS
Learner corpus research opens up exciting pedagogical perspectives
in a wide range of areas of English language teaching (ELT) pedagogy:
materials design, syllabus design, language testing, and classroom methodology. Here I limit discussion to the first area (for the use of learner
corpora in language testing, see Hasselgren, 2002; for classroom methodology, see Seidlhofer, 2002).
The link between corpus-based research and teaching is based on the
idea that corpus evidence suggests "which language items and processes
are most likely to be encountered by language users, and which
therefore may deserve more investment of time in instruction" (Kennedy,
1998, p. 281). The area where corpus information is used most extensively, to the point of having become standard practice, is ELT lexicography: All monolingual learners' dictionaries are now corpus based. Work
has also started on the production of corpus-informed textbooks, although progress in this area is rather slow (however, see Carter, Hughes,
& McCarthy, 2000; Thurstun & Candlin, 1997). As regards grammar,
although a corpus-based ELT grammar has yet to be written, the
frequency information contained in Biber, Johansson, Leech, Conrad,
and Finegan's (1999) corpus-based grammar of spoken and written
English could be used (and, it is hoped, will soon be used) to design
one.
Although the benefit of a corpus approach to teaching is evident,
linguists are keen to point out that it is not a panacea (cf. Conrad, 1999,
p. 17; McCarthy & Carter, 2001, p. 338). Perusal of native corpus data,
however detailed, will never tell anything about the degree of difficulty
of words and structures for learners. Learner corpora are the resource
par excellence to access this type of information. Evidence of learner
under-, over-, and misuse can help materials designers and teachers
select and rank ELT material at a particular proficiency level.
The benefits that can be derived from using learner corpora are
apparent from the few CLC-informed ELT resources that exist. The
Longman Essential Activator (LEA, 1997) is the first learners' dictionary to
incorporate CLC data. The compilers of the dictionary used the Longman
Learners' Corpus to find out how learners used the words covered in the
LEA. They then turned the information into help boxes designed to warn
learners against typical errors (Gillard & Gadsby, 1998). Although the
LEA targets all EFL/ESL learners irrespective of their L1 background,
some CLC-based tools are tailor-made for particular groups of learners.
Milton's WordPilot (n.d.) software is a writing kit especially designed for
Hong Kong learners of English (see Milton, 1998). It contains error
recognition exercises intended to sensitize learners to the most common
errors attested in a Hong Kong learners' corpus. The program also
includes a concordance tool and native corpora of specific genres
intended to provide learners with authentic native examples of words
with which they have difficulty. Allan's (2002) Web-based TeleNex network
(see http://www.telenex.hku.hk) is designed to provide support to
secondary-level English teachers in Hong Kong. The Web site contains
both students' problems files that describe areas of learner difficulty
extracted from a learner corpus and teaching implications files intended to
help teachers deal with these problems in the classroom. The IWiLL
language learning environment (Wible, Kuo, Chien, Liu, & Tsao, 2001)
is a highly interactive tool that allows students and teachers to create and
use an online database of Taiwanese learners' essays and teachers' error
annotations. Other Web-based projects (see Cowan, Choi, & Kim, 2003;
Kindt & Wright, 2001) bear witness to the tremendous potential of the
Internet for CLC-based ELT applications.
Although the ICLE has not yet resulted in concrete ELT resources, its
tremendous potential in this respect is obvious. Thanks to its differentiation of mother tongue backgrounds, users can distinguish problem
areas shared by all learners at an advanced level from those that are
specific to a particular learner group and can fine-tune teaching materials accordingly.
CONCLUSION
In the preface to the first volume devoted to learner corpora, Leech
(1998) states that "the concept of a learner corpus is an idea 'whose
hour' has come" (p. xvi). At the time, however, most efforts were still
expended on collecting data, establishing methodologies for learner
corpus research, and trying them out in various case studies. The release
of a learner corpus such as the ICLE marks the beginning of a new stage
in the evolution of learner corpus research. The time has come to use
the resource on a wider scale in both SLA and ELT.
On a more theoretical level, the ICLE data can be used alongside
other data types of a more experimental nature to give SLA theories a
more solid empirical foundation, in particular as regards the important
question of L1 transfer. On a practical level, the ICLE can help produce
more learner-aware pedagogical material designed for advanced EFL
learners in general or focused on the needs of one national learner
population.
ACKNOWLEDGMENTS
I acknowledge the support in this research provided by the Belgian
National Scientific Research Fund.
THE AUTHOR
Sylviane Granger is a professor of English language and linguistics and director of
the Centre for English Corpus Linguistics at the University of Louvain. Her edited
publications include Learner English on Computer (Longman, 1998) and Computer
Learner Corpora, Second Language Acquisition and Foreign Language Teaching (with
J. Hung and S. Petch-Tyson; Benjamins, 2002).
REFERENCES
Allan, Q. G. (2002). The TELEC Secondary Learner Corpus: A resource for teacher
development. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner
corpora, second language acquisition and foreign language teaching (pp. 195-211).
Amsterdam: Benjamins.
Atkins, S., & Clear, J. (1992). Corpus design criteria. Literary and Linguistic Computing,
7(1), 1-16.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman
grammar of spoken and written English. Harlow, England: Longman.
Carter, R., Hughes, R., & McCarthy, M. (2000). Exploring grammar in context.
Cambridge: Cambridge University Press.
Centre for English Corpus Linguistics. (2002). List of publications. Retrieved May 22,
2003, from http://juppiter.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/publications.html
Conrad, S. (1999). The importance of corpus-based research for language teachers.
System: An International Journal of Educational Technology and Applied Linguistics, 27,
1-18.
Cowan, R., Choi, H. E., & Kim, D. H. (2003). Four questions for error diagnosis and
correction in CALL. CALICO Journal, 20, 451-463.
Dagneaux, E., Denness, S., & Granger, S. (1998). Computer-aided error analysis.
System: An International Journal of Educational Technology and Applied Linguistics, 26,
163-174.
Gillard, P., & Gadsby, A. (1998). Using a learners' corpus in compiling ELT
dictionaries. In S. Granger (Ed.), Learner English on computer (pp. 159-171).
London: Addison Wesley Longman.
Granger, S. (1996). From CA to CIA and back: An integrated approach to
computerized bilingual and learner corpora. In K. Aijmer, B. Altenberg, &
M. Johansson (Eds.), Languages in contrast (Lund Studies in English 88, pp. 37-51). Lund, Sweden: Lund University Press.
Granger, S. (Ed.). (1998). Learner English on computer. London: Addison Wesley
Longman.
Granger, S. (1999). Use of tenses by advanced EFL learners: Evidence from an error-tagged computer corpus. In H. Hasselgard & S. Oksefjell (Eds.), Out of corpora:
Studies in honour of Stig Johansson (pp. 191-202). Amsterdam: Rodopi.
Granger, S. (n.d.). International Corpus of Learner English. Retrieved May 22, 2003,
from http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm
Granger, S., Dagneaux, E., & Meunier, F. (Eds.). (2002). The international corpus of
learner English: Handbook and CD-ROM. Louvain-la-Neuve, Belgium: Presses
Universitaires de Louvain. (Available from http://www.i6doc.com)
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer learner corpora, second
language acquisition and foreign language teaching. Amsterdam: Benjamins.
Granger, S., & Meunier, F. (1994). Towards a grammar checker for learners of
English. In U. Fries & G. Tottie (Eds.), Creating and using English language corpora
(pp. 79-91). Amsterdam: Rodopi.
Hasselgren, A. (2002). Learner corpora and language testing: Smallwords as markers
of learner fluency. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer
learner corpora, second language acquisition and foreign language teaching (pp. 143-173). Amsterdam: Benjamins.
Kennedy, G. (1998). An introduction to corpus linguistics. London: Longman.
Kindt, D., & Wright, M. (2001). Integrating language learning and teaching with the
construction of computer learner corpora. Retrieved May 22, 2003, from http://www.nufs.ac.jp/~dukindt/pages/SOCCpapers.html
Leech, G. (1991). The state of the art in corpus linguistics. In B. Altenberg &
K. Aijmer (Eds.), English corpus linguistics (pp. 8-29). London: Longman.
Leech, G. (1998). Learner corpora: What they are and what can be done with them.
In S. Granger (Ed.), Learner English on computer (pp. xiv-xx). London: Addison
Wesley Longman.
Longman Corpus Network. (2003). The Longman learners' corpus. Retrieved May 22,
2003, from http://www.longman-elt.com/dictionaries/corpus/lcleam.html
Longman essential activator. (1997). Harlow, England: Addison Wesley Longman.
McCarthy, M., & Carter, R. (2001). Size isn't everything: Spoken English, corpus, and
the classroom. TESOL Quarterly, 35, 337-340.
Milton, J. (1998). Exploiting L1 and interlanguage corpora in the design of an
electronic language learning and production environment. In S. Granger (Ed.),
Learner English on computer (pp. 186-198). London: Longman.
Milton, J. (n.d.). WordPilot [Computer software]. (Available from http://home.ust.hk/~autolang/download_WP.htm)
Odlin, T. (1989). Language transfer: Cross-linguistic influence in language learning.
Cambridge: Cambridge University Press.
Scott, M. (1996). WordSmith Tools [Computer software]. Oxford: Oxford University
Press.
Seidlhofer, B. (2002). Pedagogy and local learner corpora: Working with learning-driven data. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner
corpora, second language acquisition and foreign language teaching (pp. 213-234).
Amsterdam: Benjamins.
Thurstun, J., & Candlin, C. (1997). Exploring academic English: A workbook for student
essay writing. Sydney, Australia: National Centre for English Language Teaching
and Research.
Wible, D., Kuo, C.-H., Chien, F.-Y., Liu, A., & Tsao, N.-L. (2001). A Web-based EFL
writing environment: Integrating information for learners, teachers, and researchers. Computers and Education, 37, 297-315.
The Multimedia Adult ESL Learner Corpus

STEPHEN REDER, KATHRYN HARRIS, and KRISTEN SETZLER
Portland State University
Portland, Oregon, United States
This report describes an innovative corpus project that will add several
important dimensions to the emerging connections between corpus
linguistics and TESOL. A multimedia learner corpus, the Multimedia
Adult ESL Learner Corpus (MAELC), is being collected within an adult
ESL instructional environment. This Lab School environment (see
http://www.labschool.pdx.edu) is jointly operated by the Applied Linguistics
Department at Portland State University and Portland Community College, an adult ESL provider. Low-level adult ESL classrooms
within a regular program are continuously recorded with multiple video
cameras and microphones. By the end of the 5-year project period