
Natural Language Processing Applications

Lecture 5: Chunking with NLTK


Claire Gardent
CNRS/LORIA
Campus Scientifique,
BP 239,
F-54506 Vandœuvre-lès-Nancy, France
2007/2008
1 / 45
Chunking vs. Parsing
What are chunks?
Representing chunks
Chunker Input and Output
Chunking in NLTK
Evaluating Chunkers
Summary and Reading
2 / 45
Syntax, Grammars and Syntactic Analysis (Parsing)
Syntax captures structural relationships between words and
phrases, i.e., it describes the constituent structure of NL
expressions
Grammars are used to describe the syntax of a language
Syntactic analysers assign a syntactic structure to a string on
the basis of a grammar
A syntactic analyser is also called a parser
3 / 45
Syntactic tree example
(S (NP John)
   (VP (V (Adv often) (V gives))
       (NP (Det a) (n book))
       (PP (Prep to) (NP Mary))))
4 / 45
Why parse sentences in the first place?
Parsing is usually an intermediate stage in a larger processing
framework.
It is useful, e.g., for interpreting a string (assigning it a
meaning representation) or for comparing strings (machine
translation)
5 / 45
But Parsing has its problems ...
Coverage
No complete grammar of any language
Sapir: "All grammars leak."
Ambiguity
As coverage increases, so does ambiguity.
Problem of ranking parses by degree of plausibility
6 / 45
Problems with Full Parsing, 2
Complexity of rule-based chart parsing is O(n^3) in the length
of the sentence, multiplied by a factor O(G^2), where G is the
size of the grammar.
Practical results are often better, but parsing is still too slow
to process large corpora (e.g., the web) in reasonable time.
Finite state machines have worst-case complexity O(n) in
length of string.
7 / 45
Chunking vs Parsing
Chunking is a popular alternative to full parsing:
more efficient: based on finite-state techniques (finite state
machines have worst-case complexity O(n) in the length of the string)
more robust (always gives a solution)
often deterministic (gives only one solution)
often sufficient when the application involves:
extracting information
ignoring information
8 / 45
What is Chunking?
Chunking is partial parsing. A chunker:
assigns a partial syntactic structure to a sentence.
yields flatter structures than full parsing (fixed tree depth,
usually max. 2, vs arbitrarily deep trees)
only deals with chunks (simplified constituents which
usually only capture constituents up to their head)
Doesn't try to deal with all of language
Doesn't attempt to resolve all semantically significant
decisions
Uses deterministic grammars for easy-to-parse pieces, and
other methods for other pieces, depending on task.
9 / 45
Chunks vs Constituents
1. Parsing:
[ [ G.K. Chesterton ],
  [ [ author ] of
    [ [ The Man ] who was [ Thursday ] ] ] ]
2. Chunking:
[ G.K. Chesterton ], [ author ] of [ The Man ] who was [ Thursday ]
10 / 45
Extracting Information: Coreference Annotation
11 / 45
Extracting Information: Message Understanding
12 / 45
Ignoring Information: Lexical Acquisition
studying syntactic patterns, e.g. finding verbs in a corpus and
displaying their possible arguments
e.g. gave, in 100 files of the Penn Treebank corpus
replaced internal details of each noun phrase with NP
gave NP
gave up NP in NP
gave NP up
gave NP help
gave NP to NP
useful in lexical acquisition and grammar development (a rough sketch of this extraction follows)
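A rough sketch of this kind of extraction (not the original script): it uses NLTK's chunked CoNLL 2000 corpus, introduced later in this lecture, rather than the Penn Treebank files mentioned above, and collapses every NP chunk to the placeholder NP before printing what follows the verb. The file id 'train.txt' is the one used by current NLTK and may differ in older versions.

import nltk
from nltk.corpus import conll2000

def verb_patterns(verb='gave'):
    for tree in conll2000.chunked_sents('train.txt', chunk_types=('NP',)):
        tokens = []
        for child in tree:
            if isinstance(child, nltk.Tree):   # an NP chunk
                tokens.append('NP')
            else:                              # an unchunked (word, tag) pair
                tokens.append(child[0])
        if verb in tokens:
            i = tokens.index(verb)
            print(' '.join(tokens[i:i + 4]))   # e.g. "gave NP to NP"

verb_patterns('gave')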
13 / 45
Analogy with Tokenising and Tagging
fundamental in NLP: segmentation and labelling
tokenization and tagging
other similarities: finite-state; application-specific
14 / 45
What are chunks?
Abney (1994):
[when I read] [a sentence], [I read it]
[a chunk] [at a time]
Chunks are non-overlapping regions of text:
[walk] [straight past] [the lake]
(Usually) each chunk contains a head, with the possible
addition of some preceding function words and modifiers:
[ walk ] [ straight past ] [ the lake ]
Chunks are non-recursive:
a chunk cannot contain another chunk of the same category
15 / 45
What are chunks?
Chunks are non-exhaustive:
some words in a sentence may not be grouped into a chunk
[take] [the second road] that [is] on [the left hand side]
NP postmodifiers (e.g., PPs, relative clauses) are often
recursive and/or structurally ambiguous:
they are not included in noun chunks.
Chunks are typically subsequences of constituents (they don't
cross constituent boundaries):
noun groups: everything in the NP up to and including the
head noun
verb groups: everything in the VP (including auxiliaries) up to
and including the head verb (a small sketch of both follows)
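A small illustrative sketch of these two chunk types, written with the NLTK RegexpParser introduced later in the lecture; the labels NG/VG, the tag patterns and the example sentence are my own simplification, not taken from the slides.

import nltk

grammar = r"""
  NG: {<DT|PRP\$>?<JJ>*<NN.*>+}   # noun group: determiner/possessive, adjectives, head noun(s)
  VG: {<MD>?<VB.*>+}              # verb group: auxiliaries up to and including the head verb
"""
cp = nltk.RegexpParser(grammar)
tagged = [("he", "PRP"), ("will", "MD"), ("have", "VB"), ("been", "VBN"),
          ("reading", "VBG"), ("the", "DT"), ("new", "JJ"), ("book", "NN")]
print(cp.parse(tagged))
# Expected (roughly):
# (S he/PRP (VG will/MD have/VB been/VBN reading/VBG) (NG the/DT new/JJ book/NN))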
16 / 45
Psycholinguistic Motivations
Chunks as processing units: evidence that humans tend to
read texts one chunk at a time
Chunks are phonologically relevant:
prosodic phrase breaks
rhythmic patterns
Chunking might be a first step towards full parsing
17 / 45
Representing Chunks: Tags vs Trees
Chunks can be represented by:
IOB tags: each token is tagged with one of three special
chunk tags, INSIDE, OUTSIDE, or BEGIN:
A token is tagged as BEGIN if it is at the beginning of a
chunk, and contained within that chunk
Subsequent tokens within the chunk are tagged INSIDE
All other tokens are tagged OUTSIDE
trees spanning the entire text
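A small illustration of the two representations, using the helper functions of current NLTK (nltk.chunk.conlltags2tree and nltk.chunk.tree2conlltags; the 2007-era API may have differed):

import nltk

iob = [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('cat', 'NN', 'I-NP'),
       ('sat', 'VBD', 'O'), ('on', 'IN', 'O'),
       ('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]

tree = nltk.chunk.conlltags2tree(iob)    # IOB-tagged tokens -> two-level chunk tree
print(tree)                              # (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))
print(nltk.chunk.tree2conlltags(tree))   # and back to (word, tag, IOB-tag) triples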
18 / 45
Tag Representation
19 / 45
Tree Representation
20 / 45
Chunker Input and Output
Input: Chunk parsers usually operate on tagged texts.
[ the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN ]
Output: the chunker combines chunks along with the intervening
tokens into a chunk structure. A chunk structure is a two-level tree
that spans the entire text, and contains both chunks and
un-chunked tokens.
(S: (NP: I)
saw
(NP: the big dog)
on
(NP: the hill))
21 / 45
Viewing NLTK Chunked corpora
The CoNLL 2000 corpus contains 270k words of Wall Street
Journal text, annotated with chunk tags in the IOB format.
>>> print nltk.corpus.conll2000.chunked_sents('train')[99]
(S
(PP Over/IN)
(NP a/DT cup/NN)
(PP of/IN)
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
(VP told/VBD)
(NP his/PRP story/NN)
./.)
22 / 45
Viewing NLTK Chunked corpora
We can also select which chunk types to read (here, only NPs):
>>> print nltk.corpus.conll2000.chunked_sents('train', chunk_types=('NP',))[99]
(S
Over/IN
(NP a/DT cup/NN)
of/IN
(NP coffee/NN)
,/,
(NP Mr./NNP Stone/NNP)
told/VBD
(NP his/PRP story/NN)
./.)
23 / 45
Chunk Parsing: Accuracy
Chunk parsing attempts to do less, but does it more accurately.
Smaller solution space
Less word-order flexibility within chunks than between chunks.
Better locality:
doesn't attempt to deal with unbounded dependencies
less context-dependence
doesn't attempt to resolve ambiguity: only does those things
which can be done reliably
[the boy] [saw] [the man] [with a telescope]
less error propagation
24 / 45
Chunk Parsing: Domain Independence
Chunk parsing can be relatively domain independent, in that
dependencies involving lexical or semantic information tend to
occur at levels higher than chunks:
attachment of PPs and other modifiers
argument selection
constituent re-ordering
25 / 45
Chunk Parsing: Efficiency
Chunk parsing is more efficient:
smaller solution space
relevant context is small and local
chunks are non-recursive
can be implemented with a finite state automaton (FSA)
can be applied to very large text sources
26 / 45
Chunking with Regular Expressions, 1
Assume input is tagged.
Identify chunks (e.g., noun groups) by sequences of tags:
announce any new policy measures in his . . .
VB DT JJ NN NNS IN PRP$
27 / 45
Chunking with Regular Expressions, 2
Assume input is tagged.
Identify chunks (e.g., noun groups) by sequences of tags:
announce any new policy measures in his . . .
VB DT JJ NN NNS IN PRP$
Define rules in terms of tag patterns
grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
28 / 45
Extending the example
Extending the example:
in his Mansion House speech
IN PRP$ NNP NNP NN
DT or PRP$: <DT|PRP$><JJ><NN><NNS>
JJ and NN are optional: <DT|PRP$><JJ>*<NN>*<NNS>
we can have NNPs: <DT|PRP$><JJ>*<NNP>*<NN>*<NNS>
NN or NNS: <DT|PRP$><JJ>*<NNP>*<NN>*<NN|NNS>
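A sketch of the final pattern above plugged into NLTK's RegexpParser (introduced on the next slides). Note that the $ of PRP$ must be escaped inside a tag pattern; the tokens and the expected output are my own illustration.

import nltk

grammar = r"NP: {<DT|PRP\$><JJ>*<NNP>*<NN>*<NN|NNS>}"
cp = nltk.RegexpParser(grammar)
tagged = [("in", "IN"), ("his", "PRP$"), ("Mansion", "NNP"),
          ("House", "NNP"), ("speech", "NN")]
print(cp.parse(tagged))
# Expected (roughly): (S in/IN (NP his/PRP$ Mansion/NNP House/NNP speech/NN))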
29 / 45
Tag Strings and Tag Patterns
A tag string is a string consisting of tags delimited with
angle-brackets, e.g., <DT><JJ><NN><VBD><DT><NN>
NLTK tag patterns are a special kind of regular expression
over tag strings:
the angle brackets group their contents into atomic units:
<NN>+ matches one or more repetitions of the tag string <NN>
<NN|JJ> matches the tag strings <NN> or <JJ>
the wildcard . is constrained not to cross tag boundaries, e.g.
<NN.*> matches any single tag starting with NN
whitespace is ignored in tag patterns, e.g.
<NN | JJ> is equivalent to <NN|JJ>
30 / 45
Chunk Parsing in NLTK
regular expressions over part-of-speech tags
Tag string: a string consisting of tags delimited with angle-brackets,
e.g. <DT><JJ><NN><VBD><DT><NN>
Tag pattern: a regular expression over tag strings, e.g.:
<DT><JJ>?<NN>
<NN|JJ>+
<NN.*>
chunk a sequence of words matching a tag pattern
grammar = r"""
NP: {<DT><JJ><NN><NNS>}
"""
31 / 45
A Simple Chunk Parser
grammar = r"""
NP: {<PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
("down", "RP"), ("her", "PP$"), ("long", "JJ"),
("golden", "JJ"), ("hair", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Rapunzel/NNP)
let/VBD
down/RP
(NP her/PP$ long/JJ golden/JJ hair/NN))
32 / 45
Rule ordering and defaults
If a tag pattern matches at multiple overlapping locations, the
first match takes precedence. For example, if we apply a rule
that matches two consecutive nouns to a text containing three
consecutive nouns, then only the first two nouns will be chunked.
When a chunk rule is applied to a chunking hypothesis, it will
only create chunks that do not partially overlap with chunks
in the hypothesis
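A minimal sketch of the "two consecutive nouns" case described above (the example words are mine):

import nltk

cp = nltk.RegexpParser(r"NP: {<NN><NN>}")   # rule matching exactly two consecutive nouns
tagged = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
print(cp.parse(tagged))
# Expected (roughly): (S (NP money/NN market/NN) fund/NN) -- the first match wins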
33 / 45
Tracing
The trace argument specifies whether debugging output
should be shown during parsing.
The debugging output shows the rules that are applied, and
shows the chunking hypothesis at each stage of processing.
34 / 45
Developing Chunk Parsers
tagged_tokens = [("The", "DT"), ("enchantress", "NN"),
("clutched", "VBD"), ("the", "DT"), ("beautiful", "JJ"),
cp1 = nltk.RegexpParser(r"""
NP: {<DT><JJ><NN>} # Chunk det+adj+noun
{<DT|NN>+} # Chunk sequences of NN and DT
""")
cp2 = nltk.RegexpParser(r"""
NP: {<DT|NN>+} # Chunk sequences of NN and DT
{<DT><JJ><NN>} # Chunk det+adj+noun
""")
>>> print cp1.parse(tagged_tokens, trace=1)
# Input:
<DT> <NN> <VBD> <DT> <JJ> <NN>
# Chunk det+adj+noun:
<DT> <NN> <VBD> {<DT> <JJ> <NN>}
# Chunk sequences of NN and DT:
{<DT> <NN>} <VBD> {<DT> <JJ> <NN>}
(S
(NP The/DT enchantress/NN)
35 / 45
More Chunking Rules: Chinking
chink: a sequence of stopwords, i.e., tokens that are not
included in any chunk
chinking: the process of removing a sequence of tokens from a
chunk
A chink rule chinks (removes) anything that matches a given tag
pattern; in an NLTK grammar, chink rules are written with reversed
braces, }<...>{, as in the example two slides below.
36 / 45
More Chunking Rules: Chinking
Entire chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink a/DT big/JJ cat/NN
Output: a/DT big/JJ cat/NN
Middle of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink big/JJ
Output: [a/DT] big/JJ [cat/NN]
End of a chunk
Input: [a/DT big/JJ cat/NN]
Operation: Chink cat/NN
Output: [a/DT big/JJ] cat/NN
37 / 45
Chinking Example
The following grammar puts the entire sentence into a single chunk, then
excises the chink:
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
tagged_tokens = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the",
cp = nltk.RegexpParser(grammar)
>>> print cp.parse(tagged_tokens)
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
38 / 45
Evaluating Chunk Parsers
Process:
1. take some already chunked text
2. strip off the chunks
3. rechunk it
4. compare the result with the original chunked text
Metrics:
precision: what fraction of the returned chunks were correct?
recall: what fraction of correct chunks were returned?
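A worked example with made-up numbers: suppose the chunker returns 8 chunks, 6 of which also occur in the gold standard, and the gold standard contains 10 chunks.

precision = 6 / 8.0                     # fraction of returned chunks that are correct: 0.75
recall    = 6 / 10.0                    # fraction of correct chunks that were returned: 0.6
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean: ~0.667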


39 / 45
Evaluating Chunk Parsers in NLTK
First, flatten a chunk structure into a tree consisting only of a root node
and leaves:
>>> correct = nltk.chunk.tagstr2tree(
...     "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
>>> correct.flatten()
(S: (the, DT) (little, JJ) (cat, NN) (sat, VBD)
(on, IN) (the, DT) (mat, NN))
>>> grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> tagged_tokens = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
... ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> chunkscore = nltk.chunk.ChunkScore()
>>> guess = cp.parse(correct.flatten())
>>> chunkscore.score(correct, guess)
>>> print chunkscore
ChunkParse score:
Precision: 100.0%
Recall: 100.0%
F-Measure: 100.0%
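The same chunker can also be scored against a whole chunked corpus. A sketch using current NLTK, where a RegexpParser inherits an evaluate() method from ChunkParserI and the CoNLL 2000 test section has the file id 'test.txt' (both names may differ in older NLTK versions):

import nltk

grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"   # the grammar from the slide above
cp = nltk.RegexpParser(grammar)
test_sents = nltk.corpus.conll2000.chunked_sents('test.txt', chunk_types=('NP',))
print(cp.evaluate(test_sents))               # prints precision, recall and F-measure over the test set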
40 / 45
Cascaded Chunking
the chunks so far are flat
it is possible to build chunks of arbitrary depth by connecting
the output of one chunker to the input of another.
41 / 45
Cascaded chunking
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|S>+$} # Chunk rightmost verbs and arguments/adjuncts
S: {<NP><VP>} # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"),
("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> print cp.parse(tagged_tokens)
(S
(NP Mary/NN)
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
42 / 45
Repeated Cascaded Chunking
In the previous output, saw is not grouped into a VP: the embedded S
it takes as its argument had not yet been built when the VP rule applied.
We can repeat the process by adding an optional second argument loop to
specify the number of times the set of patterns should be run (here on the
longer sentence "John thinks Mary saw the cat sit on the mat"):
>>> cp = nltk.RegexpParser(grammar, loop=2)
>>> print cp.parse(tagged_tokens)
(S
(NP John/NNP)
thinks/VBZ
(S
(NP Mary/NN)
(VP
saw/VBD
(S
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
43 / 45
Summary
Chunking is less ambitious than full parsing, but more
efficient.
It may be sufficient for many practical tasks:
Information Extraction
Question Answering
Extracting subcategorization frames
Providing features for machine learning, e.g., for building
Named Entity recognizers.
44 / 45
Reading
Jurafsky and Martin, Section 10.5
NLTK Book chapter on Chunking
Steven Abney. Parsing By Chunks. In: Robert Berwick,
Steven Abney and Carol Tenny (eds.), Principle-Based
Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
Steven Abney. Partial Parsing via Finite-State Cascades. J. of
Natural Language Engineering, 2(4): 337-344. 1996.
Abney's publications:
http://www.vinartus.net/spa/publications.html
45 / 45
