
Corpus Linguistics
Session 4: Annotation

More than the Text:
Annotation - what, why, how?

Ylva Berglund Prytz and Martin Wynne
IT Services

20th February 2013
http://tinyurl.com/669o4zt

You can only find what is in the corpus...

...so unless what you are searching for is a surface feature of
the text, or someone has marked it with a tag, you cannot easily
find it.
What can you find automatically?
(if not, what do you need to annotate?)

• All instances of ‘work’
• All instances of ‘WORK’ (lemma)
• All instances of ‘work’ as a verb
• All instances of ‘work’ in fiction
• All instances of ‘work’ spoken by women
• All instances of ‘work’ at end of clause
• All instances of ‘work’ in jokes
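
Only the first of those is a purely surface search. A minimal Python sketch (assuming the corpus is a plain-text file, here hypothetically named corpus.txt) shows where surface searching stops and annotation has to take over:

import re

text = open("corpus.txt", encoding="utf-8").read()

# Surface feature: every occurrence of 'work' as a whole word.
surface_hits = re.findall(r"\bwork\b", text, flags=re.IGNORECASE)

# The lemma WORK can be approximated by listing its inflected forms,
# but only because this lemma happens to be regular:
lemma_hits = re.findall(r"\bwork(?:s|ed|ing)?\b", text, flags=re.IGNORECASE)

# 'work' as a verb, in fiction, spoken by women, or in jokes is not a
# surface feature: without POS tags and genre or speaker metadata in
# the corpus, no pattern over the raw text can find those instances.
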
Annotation: what is it?

The practice of adding interpretative linguistic information to a corpus.

It can be useful to differentiate:
Metadata - information about the text
Structural markup - information about the text structure
Linguistic annotation - information about the linguistic categories identified in the text
(but the terms are not always used in this way, or consistently at all...)
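
A schematic fragment (invented for illustration, not the actual BNC encoding) showing all three layers at once:

<text>
  <!-- Metadata: information about the text -->
  <header title="Example text" date="1991" genre="fiction"/>
  <!-- Structural markup: information about the text structure -->
  <p>
    <s n="1">
      <!-- Linguistic annotation: categories identified in the text -->
      <w pos="PNP">She</w> <w pos="VVD" lemma="work">worked</w>
    </s>
  </p>
</text>
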
Metadata in the British National Corpus

An example text from the BNC: CA2.xml


Types of linguistic annotation

– Morphosyntactic / wordclass / part-of-speech (POS)
– Syntactic (e.g. phrase, clause, mood...)
– Semantic
– Pragmatic
– Discourse
– Phonetic
– Phonological
– ...
Hands-on Exercise

'Borrow': Exercise 1.5 'More search features'

Exploring further (optional): look at collocates of commit and deed, this time including inflected forms.
Potential problems with annotation

It can:
• be incorrect
• be inconsistent
• follow the ‘wrong’ theory
• have the 'wrong' level of granularity
• use the 'wrong' tag-set
• introduce subjective interpretations
How do you annotate?
• What are you marking up? (POS, lemma, clause?)
• How are you annotating? (manually/automatically?)
• With which tag-set? (CLAWS, Penn Treebank?)
• Format of annotation? (HTML, XML, Chat?)
• Whose linguistic analysis? (mine, or a more established,
standard and 'consensus-backed' way of doing it?)
• How are you going to use the annotations for your
analysis?
• How are your annotations going to be shared with other
researchers?
Good practice in annotation

• Annotations should be separable
• Detailed and explicit documentation should be provided
• Annotation practices should be linguistically consensual
• Annotation should observe standards
(Leech 2005)
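
'Separable' means the annotation can be stripped away, leaving the original text intact. One common way to achieve this is stand-off annotation: the raw text is never modified, and a separate file points into it by character offset. A schematic (invented) example:

Raw text (text.txt):
Joe and Harry stood on the platform.

Stand-off annotation (pos.tsv), one record per token (start, end, POS, lemma):
0   3   NP0  joe
4   7   CJC  and
8   13  NP0  harry
14  19  VVD  stand
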
Example: non-standard
A01 0010 The Fulton County Grand Jury said Friday an investigation
A01 0020 of Atlanta's recent primary election produced "no evidence" that
A01 0030 any irregularities took place. The jury further said in term-end
A01 0040 presentments that the City Executive Committee, which had over-all
A01 0050 charge of the election, "deserves the praise and thanks of the
A01 0060 City of Atlanta" for the manner in which the election was conducted.
Example: non-standard

|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN
said_VBD Friday_NR an_AT investigation_NN of_IN Atlanta's_NP$
recent_JJ primary_NN election_NN produced_VBD no_AT
evidence_NN that_CS any_DTI irregularities_NNS took_VBD
place_NN ._.
|SA01:2 the_AT jury_NN further_RBR said_VBD in_IN term-
end_NN presentments_NNS that_CS the_AT City_NN Executive_JJ
Committee_NN ,_, which_WDT had_HVD over-all_JJ charge_NN
of_IN the_AT election_NN ,_, deserves_VBZ the_AT praise_NN
and_CC thanks_NNS of_IN the_AT City_NN of_IN Atlanta_NP
for_IN the_AT manner_NN in_IN which_WDT the_AT election_NN
was_BEDZ conducted_VBN ._.
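
The format above is non-standard but still machine-readable. A minimal Python sketch that recovers word/tag pairs from one line, splitting each token on its last underscore so that hyphenated forms like term-end survive:

line = ("the_AT jury_NN further_RBR said_VBD in_IN "
        "term-end_NN presentments_NNS")
pairs = [tuple(token.rsplit("_", 1)) for token in line.split()]
# -> [('the', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ...]

The catch is that every such home-grown format needs its own one-off parser, which is exactly what standards are meant to avoid.
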
Example: standard?
<text>
<file id=A01>
<p>
<s c="0000003 002" n=00001>
<w AT>The <w NP1>Fulton <w NN1>County <w JJ>Grand <w
NN1>Jury <w VVD>said <w NPD1>Friday <w AT1>an <w
NN1>investigation <w IO>of <w NP1>Atlanta<w GE>'s <w
JJ>recent <w JJ>primary <w NN1>election <w VVD>produced
<quote>
<w AT>no <w NN1>evidence
</quote>
<w CST>that <w DD>any <w NN2>irregularities <w VVD>took
<w NN1>place<c YSTP>.
Example: standard
<body>
<pb n="85"/>
<div1 n="2" type="u">
<head><s n="1"><w type="CRD" lemma="1937">1937</w></s>
</head>
<pb n="87"/>
<div2 n="1" type="u">
<p><s n="2"><w type="NP0" lemma="joe">Joe </w><w type="CJC"
lemma="and">and </w><w type="NP0" lemma="harry">Harry
</w><w type="VVD" lemma="stand">stood </w><w type="PRP"
lemma="on">on </w><w type="AT0" lemma="the">the </w><w
type="NN1" lemma="platform">platform </w><w type="NN1"
lemma="side">side </w><w type="PRP" lemma="by">by </w><w
type="NN1" lemma="side">side</w><c type="PUN">.</c></s>
Annotation standards?

Use of standards can help to ensure successful:
• interpretation,
• interchange,
• preservation,
• incorporation into other resources,
• processing by generic software.
Standards are also a way of resolving tricky encoding decisions, of justifying and documenting those decisions, and of not reinventing the wheel every time.
Some online taggers
Example sentence: “John was very offended by her remarks”

Free CLAWS WWW trial service (http://ucrel.lancs.ac.uk/claws/trial.html)

C5: John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.

C7: John_NP1 was_VBDZ very_RG offended_JJ by_II her_APPGE remarks_NN2 ._.

Cognitive Computation Group, University of Illinois at Urbana-Champaign
(http://l2r.cs.uiuc.edu/~cogcomp/eoh/posdemo.html)

(NNP John) (VBD was) (RB very) (VBN offended) (IN by) (PP$ her) (NNS remarks) (. .)

CST's Part-Of-Speech tagger (http://www.cst.dk/online/pos_tagger/uk/)

John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.

Infogistics tTAG (http://www.infogistics.com/posdemo.htm)

([ John_NNP ])
<: was_VBD :>
very_RB offended_VBN by_IN ([ her_PRP$ remarks_NNS ])._.
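
The same kind of tagging can be reproduced locally; a minimal sketch with NLTK (Penn Treebank tag-set; the output shown is indicative, as tags depend on the tagger model):

import nltk
# one-off setup (resource names vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("John was very offended by her remarks")
print(nltk.pos_tag(tokens))
# e.g. [('John', 'NNP'), ('was', 'VBD'), ('very', 'RB'), ('offended', 'VBN'),
#       ('by', 'IN'), ('her', 'PRP$'), ('remarks', 'NNS')]

Note that the taggers above already disagree about 'offended' (an adjective in CLAWS C5 and C7, a past participle in the others): a reminder that annotation always involves interpretation.
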
Next week: Creating a corpus

Same time, same place
Please register via the IT Services webpage

Reading tip:
Developing Linguistic Corpora: a Guide to Good Practice,
edited by Martin Wynne
http://ota.ox.ac.uk/documents/creating/dlc/index.htm
