Introduction To Corpus Linguistics

Introduction to
corpus linguistics
BTANT 129 w5
Corpus
• The old school concept
– A collection of texts, especially if complete and
self-contained: the corpus of Anglo-Saxon verse
• The Oxford Companion to the English Language
• The modern view

– A collection of naturally occurring language text
chosen to characterize a state or variety of a
language
• John Sinclair, Corpus, Concordance, Collocation (OUP)
Corpus vs. archive
• Text archive
• Collection of texts in their original format
(Oxford Text Archive:
http://ota.ox.ac.uk/)
• Corpus
• texts collected and processed in a unified,
systematic manner
British National Corpus:
http://www.natcorp.ox.ac.uk/
Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown
Cobuild
• Major corpus initiative by Collins and
Birmingham Univ. (John Sinclair)
• 1991: 20 m words
• -> Bank of English, currently 450 m
words
• http://www.cobuild.collins.co.uk
British National Corpus
• 100 m words, carefully selected
• 10 % spoken material
• time span: 1960 (fiction) – 1975 (non-fiction)
• 40-50 000 word texts
• TEI compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/
International Corpus of English
• 20 corpora of 1 m words devoted to
varieties of English around the world
• 500 texts (300 spoken, 200 written) of
2000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool
ICECUP
Corpus processing: tokenization
• Preprocessing
– tokenization: segmenting the text into
• sentences
– sometimes tricky: sentence delimiters in mid-sentence positions
• words
– multi-word units are a problem
– Normalization
• restoring clitics and contracted forms to their full forms ("can't", "I've")
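
The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the abbreviation list and the contraction table are hypothetical and far too small for real text, and multi-word units are not handled at all.

```python
import re

# Hypothetical abbreviation list; a real tokenizer needs a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."}

# Hypothetical contraction table for the normalization step.
CONTRACTIONS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Close a sentence at a token ending in . ! or ?, unless the token
    is a known abbreviation (a delimiter in mid-sentence position)."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize_words(sentence):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(tokens):
    """Replace contracted forms with their full forms where the table knows them."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return out


if __name__ == "__main__":
    for sent in split_sentences("Dr. Smith arrived. I've seen him!"):
        print(normalize(tokenize_words(sent)))
```

Note that the full stop after "Dr" does not end the first sentence here only because "dr." is in the abbreviation list; any abbreviation missing from the list would still cause a wrong split.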
Corpus processing: tagging
• Tagging
– labelling every word with its Part of Speech
category
– Problem: ambiguity
• out of context, words can belong to different
parts of speech or have different analyses within
the same POS
– set N vs. set V
– Hungarian bánt: past tense (VBD) of 'bánik' or present tense (VBZ) of 'bánt'
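
The ambiguity problem can be made concrete with a toy lexicon lookup. The sketch below assumes a hypothetical mini-lexicon (the entries and tag names are chosen only for illustration) and simply lists which words have more than one possible tag out of context.

```python
# Hypothetical mini-lexicon: each word form maps to the tags it can take
# out of context (DT = determiner, NN = noun, VB = verb, PRP = pronoun).
LEXICON = {
    "the": {"DT"},
    "set": {"NN", "VB"},
    "table": {"NN", "VB"},
    "they": {"PRP"},
}


def candidate_tags(word):
    """All tags the lexicon allows for this word; UNK if the word is unknown."""
    return sorted(LEXICON.get(word.lower(), {"UNK"}))


def ambiguous_words(tokens):
    """Words that cannot be tagged by lexicon lookup alone."""
    return [w for w in tokens if len(candidate_tags(w)) > 1]


if __name__ == "__main__":
    sentence = ["They", "set", "the", "table"]
    for word in sentence:
        print(word, candidate_tags(word))
    print("ambiguous:", ambiguous_words(sentence))
```

Lexicon lookup alone leaves "set" and "table" undecided, which is why a separate disambiguation step is needed.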
Corpus processing: disambiguation
• Disambiguation
– defining the correct analysis in context
• Two approaches:
• both need a manually corrected training corpus
– statistical
• Hidden Markov model
• calculating probability within a span of usually one or two
words
• rate of success can be around 98%
– rule-based
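
As a rough illustration of the statistical approach, here is a minimal sketch of a bigram Hidden Markov model tagger using the Viterbi algorithm, so each tag depends on a span of one preceding word. All probabilities are invented for the example; in practice they would be estimated from the manually corrected training corpus. The tag set and vocabulary are likewise hypothetical, and the input is assumed to be a non-empty list of lowercase words.

```python
TAGS = ["DT", "NN", "VB", "PRP"]

# Invented probabilities, for illustration only (not estimated from any corpus).
START = {"DT": 0.5, "PRP": 0.4, "NN": 0.05, "VB": 0.05}
TRANS = {  # TRANS[previous_tag][tag]
    "DT": {"NN": 0.9, "VB": 0.05, "DT": 0.025, "PRP": 0.025},
    "PRP": {"VB": 0.9, "NN": 0.05, "DT": 0.025, "PRP": 0.025},
    "NN": {"VB": 0.5, "NN": 0.3, "DT": 0.1, "PRP": 0.1},
    "VB": {"DT": 0.5, "NN": 0.3, "PRP": 0.1, "VB": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "DT": {"the": 1.0},
    "PRP": {"they": 1.0},
    "NN": {"set": 0.5, "table": 0.5},
    "VB": {"set": 0.6, "table": 0.4},
}


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[t] = (probability, tag path) of the best path that ends in tag t
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            # choose the best previous tag for reaching t
            prob, path = max(
                ((best[p][0] * TRANS[p][t], best[p][1]) for p in TAGS),
                key=lambda item: item[0],
            )
            new_best[t] = (prob * EMIT[t].get(word, 0.0), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(viterbi(["the", "set"]))
    print(viterbi(["they", "set", "the", "table"]))
```

With these toy numbers the preceding word decides the reading of "set": after a determiner the noun reading wins, after a pronoun the verb reading wins.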
Syntactic annotation
• Difficult to do on such a scale
• shallow parsing
• Treebank:
collection of syntactically analyzed
sentences
• Penn treebank
• http://www.cis.upenn.edu/~treebank/
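
Treebank sentences are commonly stored as labelled bracketings. The sketch below assumes that notation and a made-up example sentence; it reads one bracketed analysis into nested Python tuples and recovers the words from it. It is a simplified reader for illustration, not the format specification of any particular treebank.

```python
def parse_brackets(text):
    """Parse a bracketed analysis such as "(S (NP ...) (VP ...))" into
    nested (label, children) tuples; leaf words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return parse_node()


def leaves(tree):
    """The words of the sentence, read off the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


if __name__ == "__main__":
    tree = parse_brackets("(S (NP (PRP They)) (VP (VBD set) (NP (DT the) (NN table))))")
    print(tree[0], leaves(tree))
```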
Recent trends
• Word sense disambiguation (SENSEVAL)
• http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• SEMANTIC WEB
• making information on the web understandable
for machines
• a vision requiring a huge effort, not clear
whether feasible at all
Representative sample?
• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
– sampling speakers – demographic sampling
– sampling their output – text type sample
The notion of representativeness
• Sample vs. population
• sample should be proportional to the
population for a given feature
– example for demographic sampling
if we know from census figures that 48% of people
living in Budapest are male
we should compile our sample so that 48% of the
informants are male
-> our sample is representative of Budapest
residents for gender
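
The proportional reasoning in the example can be written out as a small calculation. The sketch below takes the 48% figure from the slide; the sample size of 200 informants and the 52% remainder are assumptions made only for the illustration.

```python
def quotas(population_shares, sample_size):
    """Number of informants to recruit per category so that the sample
    mirrors the population shares (rounded to whole persons)."""
    return {
        category: round(share * sample_size)
        for category, share in population_shares.items()
    }


def sample_share(sample, category):
    """Proportion of a category in an actual sample, for checking it afterwards."""
    return sample.count(category) / len(sample)


if __name__ == "__main__":
    # 48% male comes from the example above; the rest is assumed to be 52%.
    budapest = {"male": 0.48, "female": 0.52}
    print(quotas(budapest, 200))
```

With more than two categories the rounded quotas may not add up exactly to the sample size, so the total should be checked and adjusted by hand.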
Trouble with representativeness
• What should be the units of sampling?

• Registers, text types, genres etc.
• But no independent evidence about their
ratio in the totality of language output
-> representativeness is an ideal but
impossible to implement
Approaches to Representativeness
• Douglas Biber:
• Rejects notion of proportional sampling
• Sample should be as varied as possible
• Representativeness measured in terms
of wide variety of text types included in
the sample
The Web as a corpus?
• Pro:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
• Cons:
– lots of rubbish, irrelevant data
– difficult to extract hits
– no language analysis
– only string query, which is crude
One quick example
• Representativity or representativeness
• Throw the two words at Google and have
a look at the figures
• Think about the conclusions
• There are special front-end sites