extracting information
ignoring information
What is Chunking?
Chunking is partial parsing. A chunker:
assigns a partial syntactic structure to a sentence.
yields flatter structures than full parsing (fixed tree depth,
usually max. 2, vs. arbitrarily deep trees)
only deals with chunks (simplified constituents which
usually only capture constituents up to their head)
Doesn't try to deal with all of language
Doesn't attempt to resolve all semantically significant
decisions
Uses deterministic grammars for easy-to-parse pieces, and
other methods for other pieces, depending on task.
Chunks vs Constituents
1. Parsing
[
[ G.K. Chesterton ],
[
[ author ] of
[
[ The Man ] who was
[ Thursday ]
]
]
]
2. Chunking:
[ G.K. Chesterton ],
[ author ] of
[ The Man ] who was
[ Thursday ]
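The difference in depth between the two bracketings can be checked mechanically. The sketch below is only an illustration: it encodes both structures as nested Python lists (an encoding assumed here, not one the slides use), counts the sentence itself as one level, and measures how deeply the brackets nest.

```python
def depth(node):
    """Return the bracket nesting depth of a nested-list structure."""
    if isinstance(node, str):
        return 0  # a bare word adds no bracket level
    return 1 + max((depth(child) for child in node), default=0)


# 1. Parsing: the outermost list is the outermost bracket of the parse.
parse = [
    ["G.K. Chesterton"], ",",
    [["author"], "of", [["The Man"], "who was", ["Thursday"]]],
]

# 2. Chunking: the outer list is the sentence, inner lists are the chunks.
chunks = [
    ["G.K. Chesterton"], ",",
    ["author"], "of",
    ["The Man"], "who was",
    ["Thursday"],
]

print(depth(parse))   # 4
print(depth(chunks))  # 2
```

The chunked version stays at depth 2 however long the sentence gets, while the depth of the full parse grows with each embedded constituent.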
Extracting Information: Coreference Annotation
Extracting Information: Message Understanding
Ignoring Information: Lexical Acquisition
studying syntactic patterns, e.g. finding verbs in a corpus,
displaying possible arguments
e.g. gave, in 100 files of the Penn Treebank corpus
replaced internal details of each noun phrase with NP
gave NP
gave up NP in NP
gave NP up
gave NP help
gave NP to NP
use in lexical acquisition, grammar development
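A minimal sketch of this idea, assuming chunked sentences are stored as lists in which an NP chunk is a list of (word, tag) pairs and every other token is a bare (word, tag) pair. The function name `verb_patterns`, the stop tags, and the two toy sentences are hypothetical; they are not taken from the Penn Treebank.

```python
from collections import Counter


def verb_patterns(chunked_sents, verb, stop_tags=(".", ",")):
    """Count the argument patterns that follow `verb`, with NP internals hidden."""
    counts = Counter()
    for sent in chunked_sents:
        # Collapse every NP chunk (a list) into the single symbol "NP".
        flat = [("NP", "NP") if isinstance(item, list) else item for item in sent]
        for i, (word, _tag) in enumerate(flat):
            if word != verb:
                continue
            pattern = [word]
            # Collect what follows the verb up to the next punctuation mark.
            for next_word, next_tag in flat[i + 1:]:
                if next_tag in stop_tags:
                    break
                pattern.append(next_word)
            counts[" ".join(pattern)] += 1
    return counts


# Toy data, invented for illustration.
sents = [
    [[("She", "PRP")], ("gave", "VBD"), [("the", "DT"), ("book", "NN")],
     ("to", "TO"), [("Mary", "NNP")], (".", ".")],
    [[("They", "PRP")], ("gave", "VBD"), ("up", "RP"),
     [("the", "DT"), ("plan", "NN")], (".", ".")],
]

for pattern, n in verb_patterns(sents, "gave").items():
    print(n, pattern)
```

Because the noun phrases are reduced to a single symbol, sentences with very different wording fall into the same pattern, which is what makes the counts useful.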
Analogy with Tokenising and Tagging
fundamental in NLP: segmentation and labelling
tokenization and tagging
other similarities: finite-state; application specific
What are chunks?
Abney (1994):
[when I read] [a sentence], [I read it]
[a chunk] [at a time]
Chunks are non-overlapping regions of text:
[walk] [straight past] [the lake]
(Usually) each chunk contains a head, with the possible
addition of some preceding function words and modifiers
[ walk ] [straight past ] [the lake ]
Chunks are non-recursive:
rhythmic patterns
Chunking might be a first step in full parsing
Representing Chunks: Tags vs Trees
Chunks can be represented by:
IOB tags: each token is tagged with one of three special
chunk tags, INSIDE, OUTSIDE, or BEGIN
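As a rough illustration of the tag representation, the sketch below converts a chunked sentence into IOB-tagged tokens. It assumes a chunk is stored as a list of (word, tag) pairs and an unchunked token as a bare pair; the helper name `chunks_to_iob` and the `B-NP` / `I-NP` / `O` spelling of the three tags are assumptions following a common convention, not something fixed by the slides.

```python
def chunks_to_iob(chunked_sent, label="NP"):
    """Flatten a chunked sentence into (word, pos_tag, iob_tag) triples."""
    triples = []
    for item in chunked_sent:
        if isinstance(item, list):
            # First token of a chunk is BEGIN, the rest are INSIDE.
            for position, (word, tag) in enumerate(item):
                prefix = "B" if position == 0 else "I"
                triples.append((word, tag, f"{prefix}-{label}"))
        else:
            # Tokens that belong to no chunk are OUTSIDE.
            word, tag = item
            triples.append((word, tag, "O"))
    return triples


chunked = [
    [("Rapunzel", "NNP")],
    ("let", "VBD"),
    ("down", "RP"),
    [("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")],
]

for word, tag, iob in chunks_to_iob(chunked):
    print(word, tag, iob)
```

The BEGIN tag is what lets two adjacent chunks be told apart; with only INSIDE and OUTSIDE they would merge into one.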
less context-dependence
<DT><JJ>?<NN>
<NN|JJ>+
<NN.*>
chunk a sequence of words matching a tag pattern
grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
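To make the tag-pattern notation concrete, here is a small pure-Python sketch that turns a tag pattern into an ordinary regular expression over a string of tags. It is a simplified stand-in written for this illustration, not NLTK's own implementation; the helper names `tag_pattern_to_regex` and `find_chunks` are hypothetical.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern such as '<DT><JJ>?<NN>' into a compiled regex."""
    def convert(match):
        # Inside <...>, '.' must not run across tag boundaries.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, tag_pattern))


def find_chunks(tags, tag_pattern):
    """Return (start, end) token spans whose tags match the tag pattern."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = "".join(f"<{t}>" for t in tags)
    spans = []
    for m in regex.finditer(tag_string):
        start = tag_string.count("<", 0, m.start())  # tags before the match
        spans.append((start, start + m.group().count("<")))
    return spans


tags = ["DT", "JJ", "NN", "VBD", "DT", "NN", "NNS"]
print(find_chunks(tags, "<DT><JJ>?<NN>"))  # [(0, 3), (4, 6)]
print(find_chunks(tags, "<NN|JJ>+"))       # [(1, 3), (5, 6)]
print(find_chunks(tags, "<NN.*>"))         # [(2, 3), (5, 6), (6, 7)]
```

The angle brackets act as tag delimiters, so quantifiers such as `?` and `+` apply to whole tags rather than to single characters.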
A Simple Chunk Parser
import nltk

grammar = r"""
NP: {<PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
    {<NNP>+}             # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
                 ("down", "RP"), ("her", "PP$"), ("long", "JJ"),
                 ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(tagged_tokens))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
Rule ordering and defaults
If a tag pattern matches at multiple overlapping locations, the
first match takes precedence. For example, if we apply a rule
that matches two consecutive nouns to a text containing three
consecutive nouns, then the first two nouns will be chunked.
When a chunk rule is applied to a chunking hypothesis, it will
only create chunks that do not partially overlap with chunks
in the hypothesis.
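Both points can be imitated with Python's `re` module, which also scans left to right and returns non-overlapping matches. The sketch below is a simplified model of the behaviour described, not NLTK's code: the function `apply_rule` is hypothetical, rules are written directly as regular expressions over a tag string, and existing chunks are given as (start, end) token spans.

```python
import re


def apply_rule(tags, regex, existing=()):
    """Add chunks matching `regex`, searching only outside existing chunks."""
    spans = sorted(existing)
    new = []
    gap_start = 0
    # Walk over the unchunked gaps; the sentinel span closes the last gap.
    for start, end in spans + [(len(tags), len(tags))]:
        gap_string = "".join(f"<{t}>" for t in tags[gap_start:start])
        for m in re.finditer(regex, gap_string):
            first = gap_start + gap_string.count("<", 0, m.start())
            new.append((first, first + m.group().count("<")))
        gap_start = end
    return sorted(spans + new)


two_nouns = r"(?:<NN>){2}"
# Three consecutive nouns: the leftmost match wins, the third noun is left over.
print(apply_rule(["NN", "NN", "NN"], two_nouns))  # [(0, 2)]

# With [DT NN] already chunked, a noun rule can only use the remaining noun.
print(apply_rule(["DT", "NN", "NN"], r"(?:<NN>)+", existing=[(0, 2)]))
# [(0, 2), (2, 3)]
```

In the second call the rule would match both nouns on a fresh sentence, but the earlier chunk blocks that match, so only the uncovered noun is chunked.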
Tracing
The trace argument specifies whether debugging output
should be shown during parsing.
The debugging output shows the rules that are applied, and
shows the chunking hypothesis at each stage of processing.
Developing Chunk Parsers
tagged_tokens = [("The", "DT"), ("enchantress", "NN"),
("clutched", "VBD"), ("the", "DT"), ("beautiful", "JJ"),
cp1 = nltk.RegexpParser(r"""
NP: {<DT><JJ><NN>} # Chunk det+adj+noun
{<DT|NN>+} # Chunk sequences of NN and DT
""")
cp2 = nltk.RegexpParser(r"""
NP: {<DT|NN>+} # Chunk sequences of NN and DT
{<DT><JJ><NN>} # Chunk det+adj+noun
""")
>>> print(cp1.parse(tagged_tokens, trace=1))
# Input:
<DT> <NN> <VBD> <DT> <JJ> <NN>
# Chunk det+adj+noun:
<DT> <NN> <VBD> {<DT> <JJ> <NN>}
# Chunk sequences of NN and DT:
{<DT> <NN>} <VBD> {<DT> <JJ> <NN>}
(S
(NP The/DT enchantress/NN)
More Chunking Rules: Chinking
chink: sequence of stopwords
chinking: process of removing a sequence of tokens from a
chunk
A chink rule chinks anything that matches a given tag pattern.
Entire chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink a/DT big/JJ cat/NN
Output: a/DT big/JJ cat/NN
Middle of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink big/JJ
Output: [a/DT] big/JJ [cat/NN]
End of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink cat/NN
Output: [a/DT big/JJ] cat/NN
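The three cases can be reproduced with a few lines of list slicing. In this sketch a chunk is a plain Python list of `word/TAG` strings and the chink is given by token positions; the function name `chink` and this representation are assumptions made for the illustration.

```python
def chink(chunk, start, end):
    """Remove chunk[start:end] from a chunk.

    Returns a sequence in which surviving pieces are lists (still chunks)
    and the removed tokens are bare strings (no longer inside any chunk).
    """
    result = []
    if chunk[:start]:
        result.append(chunk[:start])   # left remainder stays a chunk
    result.extend(chunk[start:end])    # chinked tokens become bare tokens
    if chunk[end:]:
        result.append(chunk[end:])     # right remainder stays a chunk
    return result


chunk = ["a/DT", "big/JJ", "cat/NN"]
print(chink(chunk, 0, 3))  # entire chunk: ['a/DT', 'big/JJ', 'cat/NN']
print(chink(chunk, 1, 2))  # middle:       [['a/DT'], 'big/JJ', ['cat/NN']]
print(chink(chunk, 2, 3))  # end:          [['a/DT', 'big/JJ'], 'cat/NN']
```

Chinking the middle is the only case that increases the number of chunks, since it splits one chunk into two.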
Chinking Example
The following grammar puts the entire sentence into a single chunk, then
excises the chink:
import nltk

grammar = r"""
NP:
  {<.*>+}       # Chunk everything
  }<VBD|IN>+{   # Chink sequences of VBD and IN
"""
tagged_tokens = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                 ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                 ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(tagged_tokens))
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
Evaluating Chunk Parsers
Process:
1. take some already chunked text
2. strip off the chunks
3. rechunk it
4. compare the result with the original chunked text
Metrics:
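The four-step process can be sketched as below. Precision, recall and F-measure over exact chunk spans are assumed here as the metrics, since they are the customary choice for this comparison; the toy chunker, the helper names, and the two gold sentences are hypothetical stand-ins for a real chunker and a real chunked corpus.

```python
def chunk_spans(chunked_sent):
    """Return the set of (start, end) token spans covered by chunks."""
    spans, i = set(), 0
    for item in chunked_sent:
        if isinstance(item, list):
            spans.add((i, i + len(item)))
            i += len(item)
        else:
            i += 1
    return spans


def strip_chunks(chunked_sent):
    """Step 2: remove the chunk structure, keeping the tagged tokens."""
    flat = []
    for item in chunked_sent:
        flat.extend(item if isinstance(item, list) else [item])
    return flat


def toy_chunker(tagged):
    """Step 3 stand-in: chunk every maximal run of DT/JJ/NN tokens."""
    out, run = [], []
    for token in tagged:
        if token[1] in {"DT", "JJ", "NN"}:
            run.append(token)
        else:
            if run:
                out.append(run)
                run = []
            out.append(token)
    if run:
        out.append(run)
    return out


def evaluate(gold_sents, chunker):
    """Step 4: compare rechunked text with the gold chunks."""
    correct = guessed = actual = 0
    for gold in gold_sents:
        gold_spans = chunk_spans(gold)
        guess_spans = chunk_spans(chunker(strip_chunks(gold)))
        correct += len(gold_spans & guess_spans)
        guessed += len(guess_spans)
        actual += len(gold_spans)
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Step 1: some already chunked text (toy data).
gold = [
    [[("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")],
     ("barked", "VBD"), ("at", "IN"), [("the", "DT"), ("cat", "NN")]],
    [[("Rapunzel", "NNP")], ("let", "VBD"), ("down", "RP"),
     [("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]],
]

precision, recall, f_measure = evaluate(gold, toy_chunker)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

Scoring by exact span means a chunk with one wrong boundary counts as both a false positive and a miss, which is why the toy chunker gets no credit for the second sentence.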
Information Extraction
Question Answering