
Introduction to

corpus linguistics

BTANT 129 w5
Corpus
• The old school concept
– A collection of texts especially if complete and
self-contained: the corpus of Anglo-Saxon verse
The Oxford Companion to the English Language

• The modern view


– A collection of naturally occurring language text
chosen to characterize a state or variety of a
language
• John Sinclair, Corpus, Concordance, Collocation (OUP, 1991)

Corpus vs. archive
• Text archive
• Collection of texts in their original format
(Oxford Text Archive:
http://ota.ox.ac.uk/)
• Corpus
• texts collected and processed in a unified,
systematic manner
British National Corpus:
http://www.natcorp.ox.ac.uk/

Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of about 2,000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– British English replica of Brown

Cobuild
• Major corpus initiative by Collins and the
University of Birmingham, led by John Sinclair
• 1991: 20 m words
• -> Bank of English, currently 450 m words
• http://www.cobuild.collins.co.uk

British National Corpus
• 100 m words, carefully selected
• 10% spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• text samples of 40,000-50,000 words
• TEI-compliant SGML encoding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/

International Corpus of English
• 20 corpora of 1 m words devoted to
varieties of English around the world
• 500 texts (300 spoken, 200 written) of
about 2,000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool
ICECUP

Corpus processing: tokenization
• Preprocessing
– tokenization: segmenting the text into
• sentences (sometimes tricky: sentence delimiters
can occur in mid-sentence positions)
• words (multi-word units are a problem)
– normalization
• restoring clitics, abbreviations ("can't", "I've")
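
The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the abbreviation list and the clitic table are hypothetical and far too small for real text, and the function names are made up for this example.

```python
import re

# Hypothetical abbreviation list; a real system needs a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "univ."}

# Hypothetical clitic table mapping contracted forms to their full forms.
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"], "don't": ["do", "not"]}


def split_sentences(text):
    """Split after . ! ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # text that does not end in a delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize_words(sentence):
    """Separate punctuation from words, then expand known clitics."""
    raw = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
    tokens = []
    for tok in raw:
        tokens.extend(CLITICS.get(tok.lower(), [tok]))
    return tokens


print(split_sentences("Dr. Smith can't come. I've seen him."))
print(tokenize_words("I've seen him."))
```

Note that the full stop after "Dr." is a sentence delimiter in mid-sentence position: without the abbreviation list, the first sentence would be cut after the title.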

Corpus processing: tagging
• Tagging
– labelling every word with its Part of Speech
category
– Problem: ambiguity
• out of context, a word can belong to different
parts of speech or have different analyses within
the same POS
– set N vs. set V
– Hungarian bánt: past tense of 'bánik' (VBD) or present tense of 'bánt' (VBZ)
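
The ambiguity problem can be made concrete with a lexicon lookup. The sketch below assumes a hypothetical mini-lexicon with Penn-style tags (NN, VB, DT); a real tagger would derive its lexicon from a tagged corpus or a morphological analyser.

```python
# Hypothetical mini-lexicon: each word form maps to all of its possible tags.
LEXICON = {
    "the": {"DT"},
    "a": {"DT"},
    "set": {"NN", "VB"},
    "table": {"NN", "VB"},
}


def candidate_tags(word):
    """Return every tag the lexicon allows for this word, out of context."""
    return sorted(LEXICON.get(word.lower(), {"UNK"}))


def ambiguous_words(words):
    """List the words that have more than one possible tag."""
    return [w for w in words if len(candidate_tags(w)) > 1]


print(candidate_tags("set"))
print(ambiguous_words(["the", "set"]))
```

The lookup alone cannot decide between the candidates; choosing one of them requires the context of the word.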

Corpus processing: disambiguation
• Disambiguation
– defining the correct analysis in context
• Two approaches:
• both need a manually corrected training corpus
– statistical
• Hidden Markov model
• calculating probability within a span of usually one or two
words
• rate of success can be around 98%
– rule-based
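
The statistical approach can be illustrated with a toy bigram Hidden Markov Model, where each tag is conditioned on the one preceding tag. This is a sketch under stated assumptions: the three-sentence training corpus is invented for the example, add-one smoothing is one simple choice among many, and the function names are hypothetical. Nothing here reproduces a particular tagger.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged training corpus (far too small for real use).
TRAINING = [
    [("the", "DT"), ("set", "NN"), ("is", "VBZ"), ("complete", "JJ")],
    [("they", "PRP"), ("set", "VBP"), ("the", "DT"), ("table", "NN")],
    [("we", "PRP"), ("set", "VBP"), ("the", "DT"), ("rules", "NNS")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in corpus:
        prev = "<s>"              # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability, so unseen events are not impossible."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for the given words."""
    tags = list(emit)
    n_words = len({w for counts in emit.values() for w in counts})
    # best[tag] = (log probability, tag path) of the best path ending in tag
    best = {"<s>": (0.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            e = log_prob(emit[tag], word, n_words + 1)  # +1 for unknown words
            new_best[tag] = max(
                (p + log_prob(trans[prev], tag, len(tags)) + e, path + [tag])
                for prev, (p, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


trans, emit = train(TRAINING)
print(viterbi(["the", "set"], trans, emit))   # "set" after a determiner
print(viterbi(["they", "set"], trans, emit))  # "set" after a pronoun
```

With this training data the ambiguous form "set" comes out as NN after "the" and as VBP after "they", which is the kind of contextual decision the slide describes.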

Syntactic annotation
• Difficult to do on such a scale
• shallow parsing
• Treebank:
collection of syntactically analyzed
sentences
• Penn treebank
• http://www.cis.upenn.edu/~treebank/
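
Treebanks commonly store each analysed sentence as a labelled bracketing. The sketch below reads one such bracketed string into nested tuples and recovers the tagged words; the example sentence and its analysis are made up for illustration, and the function names are hypothetical.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())   # a sub-constituent
            else:
                children.append(tokens[pos])    # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse_node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the lowest level of the tree."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


tree = parse_brackets("(S (NP (DT the) (NN set)) (VP (VBZ is) (ADJP (JJ complete))))")
print(tree[0])
print(tagged_words(tree))
```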

Recent trends
• Word sense disambiguation (SENSEVAL)
• http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
• making information on the web understandable
for machines
• a vision requiring a huge effort, not clear
whether feasible at all

Representative sample?
• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
– sampling speakers – demographic sampling
– sampling their output – text type sample

The notion of representativeness
• Sample vs. population
• sample should be proportional to the
population for a given feature
– example of demographic sampling:
if we know from census figures that 48% of people
living in Budapest are male,
we should compile our sample so that 48% of the
informants are male
-> our sample is representative of Budapest
residents for gender
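
The proportional principle amounts to simple quota arithmetic. The sketch below turns population shares into informant counts for a given sample size. The 52% female share is an assumption (the complement of the 48% figure above), and the largest-remainder rounding is one possible way to make the quotas add up to the sample size.

```python
def proportional_quotas(population_shares, sample_size):
    """Allocate informants to groups in proportion to their population share."""
    exact = {g: share * sample_size for g, share in population_shares.items()}
    quotas = {g: int(x) for g, x in exact.items()}  # round down first
    leftover = sample_size - sum(quotas.values())
    # Hand the remaining places to the groups with the largest remainders.
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


# 48% male from the census example; 52% female is assumed as the complement.
shares = {"male": 0.48, "female": 0.52}
print(proportional_quotas(shares, 200))
```

For a sample of 200 informants this gives 96 male and 104 female informants, which keeps the sample proportional to the population for gender.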

Trouble with representativeness

• What should be the units of sampling?


• Registers, text types, genres etc.
• But no independent evidence about their
ratio in the totality of language output
-> representativeness is an ideal but
impossible to implement

Approaches to Representativeness

• Douglas Biber:
• Rejects notion of proportional sampling
• Sample should be as varied as possible
• Representativeness measured in terms
of wide variety of text types included in
the sample

The Web as a corpus?
• Pro:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
• Cons:
– lots of rubbish, irrelevant data
– difficult to extract hits
– no language analysis
– only string query, which is crude

One quick example
• Representativity or representativeness?
• Throw the two words at Google and have
a look at the figures
• Think about the conclusions
• There are special front-end sites
