
Identifying Phrasal Verbs Using Many Bilingual Corpora

Karl Pichotta                              John DeNero
Department of Computer Science             Google, Inc.
University of Texas at Austin              denero@google.com
pichotta@cs.utexas.edu

Abstract

We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

1 Introduction

A multiword expression (MWE), or noncompositional compound, is a sequence of words whose meaning cannot be composed directly from the meanings of its constituent words. These idiosyncratic phrases are prevalent in the lexicon of a language; Jackendoff (1993) estimates that their number is on the same order of magnitude as that of single words, and Sag et al. (2002) suggest that they are much more common, though quantifying them is challenging (Church, 2011). The task of identifying MWEs is relevant not only to lexical semantics applications, but also to machine translation (Koehn et al., 2003; Ren et al., 2009; Pal et al., 2010), information retrieval (Xu et al., 2010; Acosta et al., 2011), and syntactic parsing (Sag et al., 2002). Awareness of MWEs has empirically proven useful in a number of domains: Finlayson and Kulkarni (2011), for example, use MWEs to attain a significant performance improvement in word sense disambiguation; Venkatapathy and Joshi (2006) use features associated with MWEs to improve word alignment.

We focus on a particular subset of MWEs: English phrasal verbs. A phrasal verb consists of a head verb followed by one or more particles, such that the meaning of the phrase cannot be determined by combining the simplex meanings of its constituent words (Baldwin and Villavicencio, 2002; Dixon, 1982; Bannard et al., 2003).¹ Examples of phrasal verbs include count on [rely], look after [tend], and take off [remove], the meanings of which do not involve counting, looking, or taking. In contrast, there are verbs followed by particles that are not phrasal verbs, because their meaning is compositional, such as walk towards, sit behind, or paint on.

We identify phrasal verbs by using frequency statistics calculated from parallel corpora, consisting of bilingual pairs of documents such that one is a translation of the other, with one document in English. We leverage the observation that a verb will translate in an atypical way when occurring as the head of a phrasal verb. For example, the word look in the context of look after will tend to translate differently from how look translates generally. In order to characterize this difference, we calculate a frequency distribution over translations of look, then compare it to the distribution of translations of look when followed by the word after. We expect that idiomatic phrasal verbs will tend to have unexpected translations of their head verbs, as measured by the Kullback-Leibler divergence between those distributions.

Our polyglot ranking approach is motivated by the hypothesis that using many parallel corpora of different languages will help determine the degree of semantic idiomaticity of a phrase. In order to combine evidence from multiple languages, we develop a novel boosting algorithm tailored to the task of ranking multiword expressions by their degree of idiomaticity. We train and evaluate on disjoint subsets of the phrasal verbs in English Wiktionary.² In our experiments, the set of phrasal verbs identified automatically by our method achieves held-out recall that nears the performance of the phrasal verbs in WordNet 3.0, a human-curated set. Our approach strongly outperforms a monolingual system, and continues to improve when incrementally adding translation statistics for 50 different languages.

2 Identifying Phrasal Verbs

The task of identifying phrasal verbs using corpus information raises several issues of experimental design. We consider four central issues below in motivating our approach.

Types vs. Tokens. When a phrase is used in context, it takes a particular meaning among its possible senses. Many phrasal verbs admit compositional senses in addition to idiomatic ones: contrast idiomatic look down on him for his politics with compositional look down on him from the balcony. In this paper, we focus on the task of determining whether a phrase type is a phrasal verb, meaning that it frequently expresses an idiomatic meaning across its many token usages in a corpus. We do not attempt to distinguish which individual phrase tokens in the corpus have idiomatic senses.

Ranking vs. Classification. Identifying phrasal verbs involves relative, rather than categorical, judgments: some phrasal verbs are more compositional than others, but retain a degree of noncompositionality (McCarthy et al., 2003). Moreover, a polysemous phrasal verb may express an idiosyncratic sense more or less often than a compositional sense in a particular corpus. Therefore, we should expect a corpus-driven system not to classify phrases as strictly idiomatic or compositional, but instead assign a ranking or relative scoring to a set of candidates.

Candidate Phrases. We distinguish between the task of identifying candidate multiword expressions and the task of ranking those candidates by their semantic idiosyncrasy. With English phrasal verbs, it is straightforward to enumerate all desired verbs followed by one or more particles, and rank the entire set.

Using Parallel Corpora. There have been a number of approaches proposed for the use of multilingual resources for MWE identification (Melamed, 1997; Villada Moirón and Tiedemann, 2006; Caseli et al., 2010; Tsvetkov and Wintner, 2012; Salehi and Cook, 2013). Our approach differs from previous work in that we identify MWEs using translation distributions of verbs, as opposed to 1-1, 1-m, or m-n word alignments, most-likely translations, bilingual dictionaries, or distributional entropy. To the best of our knowledge, ours is the first approach to use translational distributions to leverage the observation that a verb typically translates differently when it heads a phrasal verb.

3 The Polyglot Ranking Approach

Our approach uses bilingual and monolingual statistics as features, computed over unlabeled corpora. Each statistic characterizes the degree of idiosyncrasy of a candidate phrasal verb, using a single monolingual or bilingual corpus. We combine features for many language pairs using a boosting algorithm that optimizes a ranking objective using a supervised training set of English phrasal verbs. Each of these aspects of our approach is described in detail below; for reference, Table 1 provides a list of the features used.

Feature     Description
φL (×50)    KL divergence for each language L
π1          frequency of phrase given verb
π2          PMI of verb and particles
π3          π1 with interposed pronouns

Table 1: Features used by the polyglot ranking system.

3.1 Bilingual Statistics

One of the intuitive properties of an MWE is that its individual words likely do not translate literally when the whole expression is translated into another language (Melamed, 1997). We capture this effect

∗ Research conducted during an internship at Google.
¹ Nomenclature varies: the term verb-particle construction is also used to denote what we call phrasal verbs; further, the term phrasal verb is sometimes used to denote a broader class of constructions.
² http://en.wiktionary.org
by measuring the divergence between how a verb translates generally and how it translates when heading a candidate phrasal verb.

A parallel corpus is a collection of document pairs ⟨DE, DF⟩, where DE is in English, DF is in another language, one document is a translation of the other, and all documents DF are in the same language. A phrase-aligned parallel corpus aligns those documents at a sentence, phrase, and word level. A phrase e aligns to another phrase f if some word in e aligns to some word in f and no word in e or f aligns outside of f or e, respectively. As a result of this definition, the words within an aligned phrase pair are themselves connected by word-level alignments.

Given an English phrase e, define F(e) to be the set of all foreign phrases observed aligned to e in a parallel corpus. For any f ∈ F(e), let P(f|e) be the conditional probability of the phrase e translating to the phrase f. This probability is estimated as the relative frequency of observing f and e as an aligned phrase pair, conditioned on observing e aligned to any phrase in the corpus:

    P(f|e) = N(e, f) / Σ_{f′} N(e, f′)

with N(e, f) the number of times e and f are observed occurring as an aligned phrase pair.

Next, we assign statistics to individual verbs within phrases. The first word of a candidate phrasal verb e is a verb. For a candidate phrasal verb e and a foreign phrase f, let α1(e, f) be the subphrase of f that is most commonly word-aligned to the first word of e. As an example, consider the phrase pair e = talk down to and f = hablar con menosprecio. Suppose that when e is aligned to f, the word talk is most frequently aligned to hablar. Then α1(e, f) = hablar.

For a phrase e and its set F(e) of aligned translations, we define the constituent translation probability of a foreign subphrase x as:

    Pe(x) = Σ_{f ∈ F(e)} P(f|e) · δ(α1(e, f), x)    (1)

where δ is the Kronecker delta function, taking value 1 if its arguments are equal and 0 otherwise. Intuitively, Pe assigns the probability mass for every f to its subphrase most commonly aligned to the verb in e. It expresses how this verb is translated in the context of a phrasal verb construction.³ Equation (1) defines a distribution over all phrases x of a foreign language.

We also assign statistics to verbs as they are translated outside of the context of a phrase. Let v(e) be the verb of a phrasal verb candidate e, which is always its first word. For a single-word verb phrase v(e), we can compute the constituent translation probability P_{v(e)}(x), again using Equation (1). The difference between Pe(x) and P_{v(e)}(x) is that the latter sums over all translations of the verb v(e), regardless of whether it appears in the context of e:

    P_{v(e)}(x) = Σ_{f ∈ F(v(e))} P(f|v(e)) · δ(α1(v(e), f), x)

For a one-word phrase such as v(e), α1(v(e), f) is the subphrase of f that most commonly directly word-aligns to the one word of v(e).

Finally, for a phrase e and its verb v(e), we calculate the Kullback-Leibler (KL) divergence between the translation distribution of v(e) and e:

    D_KL(P_{v(e)} ‖ Pe) = Σ_x P_{v(e)}(x) ln [ P_{v(e)}(x) / Pe(x) ]    (2)

where the sum ranges over all x such that P_{v(e)}(x) > 0. This quantifies the difference between the translations of e's verb when it occurs in e, and when it occurs in general. Figure 1 illustrates this computation on a toy corpus.

Smoothing. Equation (2) is defined only if, for every x such that P_{v(e)}(x) > 0, it is also the case that Pe(x) > 0. In order to ensure that this condition holds, we smooth the translation distributions toward uniform. Let D be the set of phrases with non-zero probability under either distribution:

    D = {x : P_{v(e)}(x) > 0 or Pe(x) > 0}

Then, let U_D be the uniform distribution over D:

    U_D(x) = 1/|D| if x ∈ D, 0 otherwise

³ To extend this statistic to other types of multiword expressions, one could compute a similar distribution for other content words in the phrase.
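The constituent translation probability of Equation (1) and the divergence of Equation (2) can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code; the aligned-phrase counts below are toy stand-ins for N(e, f) and the verb-aligned subphrases α1(e, f).

```python
import math
from collections import defaultdict

def constituent_distribution(aligned_pairs):
    """Equation (1): P_e(x) accumulates, for each aligned foreign phrase f,
    the relative frequency P(f|e) onto the subphrase of f most often
    word-aligned to the verb of e (the Kronecker delta selects a1(e, f))."""
    total = sum(count for count, _ in aligned_pairs)
    dist = defaultdict(float)
    for count, verb_subphrase in aligned_pairs:
        dist[verb_subphrase] += count / total
    return dict(dist)

def kl_divergence(p, q):
    """Equation (2): D_KL(p || q), summing over the support of p.
    Only defined when q(x) > 0 wherever p(x) > 0, hence the smoothing
    step described in the text."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy counts in the spirit of Figure 1: (N(e, f), a1(e, f)) pairs.
p_verb = constituent_distribution([(5, "mirando"), (3, "buscando")])
p_phrase = constituent_distribution([(3, "mirando"), (1, "deseando")])
```

On these counts, p_verb assigns mirando probability 5/8 = 0.625 and buscando 3/8 = 0.375, matching the unsmoothed P_{v(e)} row of Figure 1.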
Aligned Phrase Pair                       N(e, f)   α1(e, f)
looking forward to ↔ deseando                1      deseando
looking forward to ↔ mirando adelante a      3      mirando
looking ↔ mirando                            5      mirando
looking ↔ buscando a                         3      buscando

               mirando       deseando      buscando
P_{v(e)}(x)    5/8 = 0.625   0             3/8 = 0.375
P′_{v(e)}(x)   0.610         0.02          0.373
P_e(x)         3/4 = 0.75    1/4 = 0.25    0
P′_e(x)        0.729         0.254         0.02

D_KL(P′_{v(e_i)} ‖ P′_{e_i}) = −0.109 − 0.045 + 1.159 = 1.005

Figure 1: The computation of D_KL(P′_{v(e_i)} ‖ P′_{e_i}) using a toy corpus, for e = looking forward to. Note that the second aligned phrase pair contains the third, so the second's count of 3 must be included in the third's count of 5.

When computing divergence in Equation (2), we use the smoothed distributions P′_e and P′_{v(e)}:

    P′_e(x) = λ Pe(x) + (1 − λ) U_D(x)
    P′_{v(e)}(x) = λ P_{v(e)}(x) + (1 − λ) U_D(x).

We use λ = 0.95, which distributes 5% of the total probability mass evenly among all events in D.

Morphology. We calculate statistics for morphological variants of an English phrase. For a candidate English phrasal verb e (for example, look up), let E denote the set of inflections of that phrasal verb (for look up, this will be [look|looks|looked|looking] up). We extract the variants in E from the verb entries in English Wiktionary. The final score computed from a phrase-aligned parallel corpus translating English sentences into a language L is the average KL divergence of smoothed constituent translation distributions over the inflected forms e_i ∈ E:

    φ_L(e) = (1/|E|) Σ_{e_i ∈ E} D_KL(P′_{v(e_i)} ‖ P′_{e_i})

3.2 Monolingual Statistics

We also collect a number of monolingual statistics for each phrasal verb candidate, motivated by the considerable body of previous work on the topic (Church and Hanks, 1990; Lin, 1999; McCarthy et al., 2003). The monolingual statistics are designed to identify frequent collocations in a language. This set of monolingual features is not comprehensive, as we focus our attention primarily on bilingual features in this paper.

As above, define E to be the set of morphologically inflected variants of a candidate e, and let V be the set of inflected variants of the head verb v(e) of e. We define three statistics calculated from the phrase counts of a monolingual English corpus.

First, we define π1(e) to be the relative frequency of the candidate e, given e's head verb, summed over morphological variants:

    π1(e) = ln P(E|V) = ln [ Σ_{e_i ∈ E} N(e_i) / Σ_{v_i ∈ V} N(v_i) ]

where N(x) is the number of times phrase x was observed in the monolingual corpus.

Second, define π2(e) to be the pointwise mutual information (PMI) between V (the event that one of the inflections of the verb in e is observed) and R, the event of observing the rest of the phrase:

    π2(e) = PMI(V, R)
          = lg P(V, R) − lg (P(V) P(R))
          = lg P(E) − lg (P(V) P(R))
          = lg Σ_{e_i ∈ E} N(e_i) − lg Σ_{v_i ∈ V} N(v_i) − lg N(r) + lg N

where N is the total number of tokens in the corpus, and logarithms are base-2. This statistic characterizes the degree of association between a verb and its phrasal extension. We only calculate π2 for two-word phrases, as it did not prove helpful for longer phrases.
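The λ-interpolation above can be sketched as follows; applied to the toy distributions of Figure 1 it reproduces the smoothed values shown there. This is an illustrative reimplementation, not the authors' code.

```python
import math

def smooth(p, support, lam=0.95):
    """P'(x) = lam * P(x) + (1 - lam) * U_D(x), where U_D is the uniform
    distribution over D, the union of the supports of both distributions."""
    uniform_mass = (1.0 - lam) / len(support)
    return {x: lam * p.get(x, 0.0) + uniform_mass for x in support}

# Unsmoothed distributions from the Figure 1 toy corpus.
p_verb = {"mirando": 0.625, "buscando": 0.375}    # P_{v(e)}
p_phrase = {"mirando": 0.75, "deseando": 0.25}    # P_e
support = {"mirando", "deseando", "buscando"}     # D

sp_verb = smooth(p_verb, support)
sp_phrase = smooth(p_phrase, support)
# KL divergence of Equation (2) over the smoothed distributions.
kl = sum(px * math.log(px / sp_phrase[x]) for x, px in sp_verb.items())
```

With λ = 0.95 this yields P′_{v(e)}(mirando) ≈ 0.610 and a divergence of ≈ 1.005, matching the bottom line of Figure 1.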
Finally, define π3(e) to be the relative frequency of the phrasal verb e augmented by an accusative pronoun, conditioned on the verb. Let A be the set of phrases in E with an accusative pronoun (it, them, him, her, me, you) optionally inserted either at the end of the phrase or directly after the verb. For e = look up, A = {look up, look X up, look up X, looks up, looks X up, looks up X, . . . }, with X an accusative pronoun. The π3 statistic is similar to π1, but allows for an intervening or following pronoun:

    π3(e) = ln P(A|V) = ln [ Σ_{e_i ∈ A} N(e_i) / Σ_{v_i ∈ V} N(v_i) ].

This statistic is designed to exploit the intuition that phrasal verbs frequently have accusative pronouns either inserted into the middle (e.g. look it up) or at the end (e.g. look down on him).

3.3 Ranking Phrasal Verb Candidates

Our goal is to assign a single real-valued score to each candidate e, by which we can rank candidates according to semantic idiosyncrasy. For each language L for which we have a parallel corpus, we defined, in section 3.1, a function φL(e) assigning real values to candidate phrasal verbs e, which we hypothesize is higher on average for more idiomatic compounds. Further, in section 3.2, we defined real-valued monolingual functions π1, π2, and π3 for which we hypothesize the same trend holds. Because each score individually ranks all candidates, it is natural to view each φL and πi as a weak ranking function that we can combine with a supervised boosting objective. We use a modified version of AdaBoost (Freund and Schapire, 1995) that optimizes for recall.

For each φL and πi, we compute a ranked list of candidate phrasal verbs, ordered from highest to lowest value. To simplify learning, we consider only the top 5000 candidate phrasal verbs according to π1, π2, and π3. This pruning procedure excludes candidates that do not appear in our monolingual corpus.

We optimize the ranker using an unranked, incomplete training set of phrasal verbs. We can evaluate the quality of a ranker by outputting the top N ranked candidates and measuring recall relative to this gold-standard training set. We choose this recall-at-N metric so as to not directly penalize precision errors, as our training set is incomplete.

Define H to be the set of N-element sets containing the top proposals for each weak ranker (we use N = 2000). That is, each element of H is a set containing the 2000 highest values for some φL or πi. We define the baseline error ε_B to be 1 − E[R], with R the recall-at-N of a ranker ordering the candidate phrases in the set H at random. The value E[R] is estimated by averaging the recall-at-N of 1000 random orderings of H.

Algorithm 1 Recall-Oriented Ranking AdaBoost
 1: for i = 1 : |X| do
 2:   w[i] ← 1/|X|
 3: end for
 4: for t = 1 : T do
 5:   for all h ∈ H do
 6:     ε_h ← 0
 7:     for i = 1 : |X| do
 8:       if x_i ∉ h then
 9:         ε_h ← ε_h + w[i]
10:       end if
11:     end for
12:   end for
13:   h_t ← argmax_{h ∈ H} |ε_B − ε_h|
14:   α_t ← ln(ε_B / ε_{h_t})
15:   for i = 1 : |X| do
16:     if x_i ∈ h_t then
17:       w[i] ← (1/Z) w[i] exp(−α_t)
18:     else
19:       w[i] ← (1/Z) w[i] exp(α_t)
20:     end if
21:   end for
22: end for

Algorithm 1 gives the formulation of the AdaBoost training algorithm that we use to combine weak rankers. The algorithm maintains a weight vector w (summing to 1) which contains a positive real number for each gold standard phrasal verb in the training set X. Initially, w is uniformly set to 1/|X|. At each iteration of the algorithm, w is modified to take higher values for recently misclassified examples. We repeatedly choose weak rankers h_t ∈ H (and corresponding real-valued coefficients α_t) that correctly rank examples with high w values.
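Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: weak rankers are represented only by their top-N candidate sets, the baseline error ε_B is passed in rather than estimated from random orderings, and the degenerate case ε_h = 0 is not handled.

```python
import math

def boost_rankers(gold, rankers, eps_baseline, iterations):
    """Recall-oriented ranking AdaBoost in the spirit of Algorithm 1.

    gold: list of gold-standard phrasal verbs (the training set X).
    rankers: dict mapping a weak ranker's name to its set of top-N candidates.
    Returns [(alpha_t, ranker name), ...] for the chosen weak rankers.
    """
    w = {x: 1.0 / len(gold) for x in gold}
    chosen = []
    for _ in range(iterations):
        # Weighted error of each ranker: total weight of gold items it misses.
        errors = {name: sum(w[x] for x in gold if x not in top)
                  for name, top in rankers.items()}
        # Pick the ranker whose error is farthest from the random baseline.
        best = max(errors, key=lambda name: abs(eps_baseline - errors[name]))
        alpha = math.log(eps_baseline / errors[best])
        chosen.append((alpha, best))
        # Downweight items the chosen ranker got right, upweight its misses.
        for x in gold:
            w[x] *= math.exp(-alpha if x in rankers[best] else alpha)
        z = sum(w.values())
        w = {x: wx / z for x, wx in w.items()}
    return chosen
```

A ranker with error below ε_B receives a positive coefficient α_t; one worse than the random baseline receives a negative coefficient, exactly as line 14 of Algorithm 1 prescribes.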
Lines 5–12 of Algorithm 1 calculate the weighted error value ε_h for every weak ranker set h ∈ H. The error ε_h will be 1 if h contains none of X and 0 if h contains all of X, as w always sums to 1. Line 13 picks the ranker h_t ∈ H whose weighted error is as far as possible from the random baseline error ε_B. Line 14 calculates a coefficient α_t for h_t, which will be positive if ε_{h_t} < ε_B and negative if ε_{h_t} > ε_B. Intuitively, α_t encodes the importance of h_t: it will be high if h_t performs well, and low if it performs poorly. The Z in lines 17 and 19 is the normalizing constant ensuring the vector w sums to 1.

After termination of Algorithm 1, we have weights α_1, . . . , α_T and lists h_1, . . . , h_T. Define f_t as the function that generated the list h_t (each f_t will be some φL or πi). Now, we define a final combined function ψ, taking a phrase e and returning a real number:

    ψ(e) = Σ_{t=1}^{T} α_t f_t(e).

We standardize the scores of individual weak rankers to have mean 0 and variance 1, so that their scores are comparable.

The final learned ranker outputs a real value, instead of the class labels frequently found in AdaBoost. This follows previous work using boosting for learning to rank (Freund et al., 2003; Xu and Li, 2007). Our algorithm differs from previous methods because we are seeking to optimize for Recall-at-N, rather than a ranking loss.

4 Experimental Evaluation

4.1 Training and Test Set

In order to train and evaluate our system, we construct a gold-standard list of phrasal verbs from the freely available English Wiktionary. We gather phrasal verbs from three sources within Wiktionary:

1. Entries labeled as English phrasal verbs⁴,
2. Entries labeled as English idioms⁵, and
3. The derived terms⁶ of English verb entries.

⁴ http://en.wiktionary.org/wiki/Category:English_phrasal_verbs
⁵ http://en.wiktionary.org/wiki/Category:English_idioms
⁶ For example, see http://en.wiktionary.org/wiki/take#Derived_terms

about    across   after    against  along
among    around   at       before   behind
between  beyond   by       down     for
from     in       into     like     off
on       onto     outside  over     past
round    through  to       towards  under
up       upon     with     within   without

Table 2: Particles and prepositions allowed in phrasal verbs gathered from Wiktionary.

Many of the idioms and derived terms are not phrasal verbs (e.g. kick the bucket, make-or-break). We filter out any phrases not of the form V P⁺, with V a verb and P⁺ denoting one or more occurrences of particles and prepositions from the list in Table 2. We omit prepositions that do not productively form English phrasal verbs, such as amid and as. This process also omits some compounds that are sometimes called phrasal verbs, such as light verb constructions, e.g. have a go (Butt, 2003), and noncompositional verb-adverb collocations, e.g. look forward.

There are a number of extant phrasal verb corpora. For example, McCarthy et al. (2003) present graded human compositionality judgments for 116 phrasal verbs, and Baldwin (2008) presents a large set of candidates produced by an automated system, with false positives manually removed. We use Wiktionary instead, in an attempt to construct a maximally comprehensive data set that is free from any possible biases introduced by automatic extraction processes.

4.2 Filtering and Data Partition

The merged list of phrasal verbs extracted from Wiktionary included some common collocations that have compositional semantics (e.g. know about), as well as some very rare constructions (e.g. cheese down). We removed these spurious results systematically by filtering out very frequent and very infrequent entries. First, we calculated the log probability of each phrase, according to a language model built from a large monolingual corpus of news documents and web documents, smoothed with stupid backoff (Brants et al., 2007). We sorted all Wiktionary phrasal verbs according to this value. Then, we selected the contiguous 75% of the sorted phrases that minimize the variance of this statistic. This method
removed a few very frequent phrases and a large number of rare phrases. The remaining phrases were split randomly into a development set of 694 items and a held-out test set of 695 items.

4.3 Corpora

Our monolingual English corpus consists of news articles and documents collected from the web. Our parallel corpora from English to each of 50 languages also consist of documents collected from the web via distributed data mining of parallel documents based on the text content of web pages (Uszkoreit et al., 2010).

The parallel corpora were segmented into aligned sentence pairs and word-aligned using two iterations of IBM Model 1 (Brown et al., 1993) and two iterations of the HMM-based alignment model (Vogel et al., 1996) with posterior symmetrization (Liang et al., 2006). This training recipe is common in large-scale machine translation systems.

4.4 Generating Candidates

To generate the set of candidate phrasal verbs considered during evaluation, we exhaustively enumerated the Cartesian product of all verbs present in the previously described Wiktionary set (V), all particles in Table 2 (P), and a small set of second particles T = {with, to, on, ε}, with ε the empty string. The set of candidate phrasal verbs we consider during evaluation is the product V × P × T, which contains 96,880 items.

4.5 Results

We optimize a ranker using the boosting algorithm described in section 3.3, using the features from Table 1, optimizing performance on the Wiktionary development set described in section 4.2. Monolingual and bilingual statistics are calculated using the corpora described in section 4.3, with candidate phrasal verbs being drawn from the set described in section 4.4.

We evaluate our method of identifying phrasal verbs by computing recall-at-N. This statistic is the fraction of the Wiktionary test set that appears in the top N phrasal verbs proposed by the method, where N is an arbitrary number of top-ranked candidates held constant when comparing different approaches (we use N = 1220). We do not compute precision, because the test set to which we compare is not an exhaustive list of phrasal verbs, due to the development/test split, frequency filtering, and omissions in the original lexical resource. Proposing a phrasal verb not in the test set is not necessarily an error, but identifying many phrasal verbs from the test set is an indication of an effective method. Recall-at-N is a natural way to evaluate a ranking system where the gold-standard data is an incomplete, unranked set.

                          Recall-at-1220
                           Dev    Test
Baselines:
  Frequent Candidates      17.0   19.3
  WordNet 3.0 Frequent     41.6   43.7
  WordNet 3.0 Filtered     49.4   48.8
Boosted:
  Monolingual Only         30.1   30.2
  Bilingual Only           47.1   43.9
  Monolingual+Bilingual    50.8   47.9

Table 3: Our boosted ranker combining monolingual and bilingual features (bottom) compared to three baselines (top) gives comparable performance to the human-curated upper bound.

Table 3 compares our approach to three baselines using the Recall-at-1220 metric evaluated on both the development and test sets. As a lower bound, we evaluated the 1220 most frequent candidates in our monolingual corpus (Frequent Candidates). As a competitive baseline, we evaluated the set of phrasal verbs in WordNet 3.0 (Fellbaum, 1998). We selected the most frequent 1220 out of 1781 verb-particle constructions in WordNet (WordNet 3.0 Frequent). A stronger baseline resulted from applying the same filtering procedure to WordNet that we did to Wiktionary: sorting all verb-particle entries by their language model score and retaining the 1220 consecutive entries that minimized language model variance (WordNet 3.0 Filtered). WordNet is a human-curated resource, and yet its recall-at-N compared to our Wiktionary test set is only 48.8%, indicating substantial divergence between the two resources. Such divergence is typical: lexical resources often disagree about what multiword expressions to include (Lin, 1999).

The three final lines in Table 3 evaluate our
boosted ranker. Automatically detecting phrasal verbs using monolingual features alone strongly outperformed the frequency-based lower bound, but underperformed the WordNet baseline. Bilingual features, using features from 50 languages, proved substantially more effective. The combination of both types of features yielded the best performance, outperforming the human-curated WordNet baseline on the development set (on which our ranker was optimized) and approaching its performance on the held-out test set.

4.6 Feature Analysis

The solid line in Figure 2 shows the recall-at-1220 for a boosted ranker using all monolingual statistics and k bilingual statistics, for increasing k. Bilingual statistics are added according to their individual recall, from best-performing to worst. That is, the point at k = 0 uses only π1, π2, and π3; the point at k = 1 adds the best individually-performing bilingual statistic (Spanish) as a weak ranker; the next point adds the second-best bilingual statistic (German); etc. Boosting maximizes performance on the development set, and evaluation is performed on the test set. We use T = 53 (equal to the total number of weak rankers).

[Figure 2: plot not reproduced here.]
Figure 2: The solid line shows recall-at-1220 when combining the k best-performing bilingual statistics and three monolingual statistics. The dotted line shows the individual performance of the kth best-performing bilingual statistic, when applied in isolation to rank candidates.

The dotted line in Figure 2 shows that individual bilingual statistics have recall-at-1220 ranging from 34.4% to 5.0%. This difference reflects the different sizes of parallel corpora and the usefulness of different languages in identifying English semantic idiosyncrasy. Combining together the signal of multiple languages is clearly beneficial, and including many low-performing languages still offers overall improvements.

                          Recall-at-1220
                           Dev    Test
  Bilingual only           47.1   43.9
  Bilingual+π1             48.1   46.9
  Bilingual+π2             50.1   48.3
  Bilingual+π3             48.4   46.3
  Bilingual+π1+π2          50.2   47.9
  Bilingual+π1+π3          49.0   47.4
  Bilingual+π2+π3          50.4   49.4
  Bilingual+π1+π2+π3       50.8   47.9

Table 4: An ablation of monolingual statistics shows that they are useful in addition to the 50 bilingual statistics combined, and no single statistic provides maximal performance.

Table 4 shows the effect of adding different subsets of the monolingual statistics to the set of all 50 bilingual statistics. Monolingual statistics give a performance improvement of up to 5.5% recall on the test set, but the comparative behavior of the various combinations of the πi is somewhat unpredictable when training on the development set and evaluating on the test set. The pointwise mutual information of a verb and its particles (π2) appears to be the most useful feature. In fact, the test set performance of using π2 alone outperforms the combination of all three. The best combination even outperforms the WordNet 3.0 baseline on the test set, though optimizing on the development set would not select this model.

4.7 Error Analysis

Table 5 shows the 100 highest ranked phrasal verb candidates by our system that do not appear in either the development or test sets. Most of these candidates are in fact English phrasal verbs that happened to be missing from Wiktionary, some are present in Wiktionary but were removed from the reference
pick up pat on tap into fit for charge with suit against
catch up burst into muck up haul up give up get off
get through get up get in tack on buzz about do like
plump for haul in keep up with strap on catch up with suck into
get round chop off slap on pitch into get into inquire into
drop behind get on catch up on pass on cue from carry around
get around get over shoot at pick over shoot by shoot in
make up to get past cast down set up with rule off hand round
piss on hit by break down move for lead off pluck off
flip through edge over strike off plug into keep up go past
set off pull round see about stay on put up sidle up to
buzz around take off set up slap in head towards shoot past
inquire for tuck up lie with well before go on with reel from
drive along snap off barge into whip on put down instance through
bar from cut down on let in tune in to move off suit in
lean against well beyond get down to go across sail into lie over
hit with chow down on look after catch at

Table 5: The highest ranked phrasal verb candidates from our full system that do not appear in either Wiktionary set.
Candidates are presented in decreasing rank; pat on is the second highest ranked candidate.

sets during filtering, and the remainder are in fact not phrasal verbs (true precision errors).

These errors fall largely into two categories. Some candidates are compositional, but contain polysemous verbs, such as hit by, drive along, and head towards. In these cases, prepositions disambiguate the verb, which naturally affects translation distributions. Other candidates are not phrasal verbs, but instead phrases that tend to have a different syntactic role, such as suit against, instance through, fit for, and lie over (conjugated as lay over). A careful treatment of part-of-speech tags when computing corpus statistics might address this issue.

5 Related Work

The idea of using word-aligned parallel corpora to identify idiomatic expressions has been pursued in a number of different ways. Melamed (1997) tests candidate MWEs by collapsing them into single tokens, training a new translation model with these tokens, and using the performance of the new model to judge candidates' noncompositionality. Villada Moiron and Tiedemann (2006) use word-aligned parallel corpora to identify Dutch MWEs, testing the assumption that the distributions of alignments of MWEs will generally have higher entropies than those of fully compositional compounds. Caseli et al. (2010) generate candidate multiword expressions by picking out sufficiently common phrases that align to single target-side tokens. Tsvetkov and Wintner (2012) generate candidate MWEs by finding one-to-one alignments in parallel corpora which are not in a bilingual dictionary, and ranking them based on monolingual statistics. The system of Salehi and Cook (2013) is perhaps the closest to the current work, judging noncompositionality using string edit distance between a candidate phrase's automatic translation and its components' individual translations. Unlike the current work, their method does not use distributions over translations or combine individual bilingual values with boosting; however, they find, as we do, that incorporating many languages is beneficial to MWE identification.

A large body of work has investigated the identification of noncompositional compounds from monolingual sources (Lin, 1999; Schone and Jurafsky, 2001; Fazly and Stevenson, 2006; McCarthy et al., 2003; Baldwin et al., 2003; Villavicencio, 2003). Many of these monolingual statistics could be viewed as weak rankers and fruitfully incorporated into our framework.

There has also been a substantial amount of work addressing the problem of differentiating between literal and idiomatic instances of phrases in context (Katz and Giesbrecht, 2006; Li et al., 2010;
Sporleder and Li, 2009; Birke and Sarkar, 2006; Diab and Bhutada, 2009). We do not attempt this task; however, techniques for token identification could be used to improve type identification (Baldwin, 2005).

6 Conclusion

We have presented the polyglot ranking approach to phrasal verb identification, using parallel corpora from many languages to identify phrasal verbs. We proposed an evaluation metric that acknowledges the inherent incompleteness of reference sets, but distinguishes among competing systems in a manner aligned to the goals of the task. We developed a recall-oriented learning method that integrates multiple weak ranking signals, and demonstrated experimentally that combining statistical evidence from a large number of bilingual corpora, as well as from monolingual corpora, produces the most effective system overall. We look forward to generalizing our approach to other types of noncompositional phrases.

Acknowledgments

Special thanks to Ivan Sag, who argued for the importance of handling multi-word expressions in natural language processing applications, and who taught the authors about natural language syntax once upon a time. We would also like to thank the anonymous reviewers for their helpful suggestions.

References

Otavio Acosta, Aline Villavicencio, and Viviane Moreira. 2011. Identification and treatment of multiword expressions applied to information retrieval. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of the Sixth Conference on Natural Language Learning.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin. 2005. Deep lexical acquisition of verb-particle constructions. Computer Speech & Language, Special Issue on Multiword Expressions.

Timothy Baldwin. 2008. A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions.

Colin Bannard, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In Proceedings of the ACL Workshop on Multiword Expressions.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.

Miriam Butt. 2003. The light verb jungle. In Proceedings of the Workshop on Multi-Verb Constructions.

Helena de Medeiros Caseli, Carlos Ramisch, Maria das Gracas Volpe Nunes, and Aline Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1).

Kenneth Church. 2011. How many multiword expressions do people know? In Proceedings of the ACL Workshop on Multiword Expressions.

Mona T. Diab and Pravin Bhutada. 2009. Verb noun construction MWE token supervised classification. In Proceedings of the ACL Workshop on Multiword Expressions.

Robert Dixon. 1982. The grammar of English phrasal verbs. Australian Journal of Linguistics.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. The MIT Press.

Mark Finlayson and Nidhi Kulkarni. 2011. Detecting multiword expressions improves word sense disambiguation. In Proceedings of the ACL Workshop on Multiword Expressions.

Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Conference on Computational Learning Theory.
Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research.

Ray Jackendoff. 1993. The Architecture of the Language Faculty. MIT Press.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Linlin Li, Benjamin Roth, and Caroline Sporleder. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the Association for Computational Linguistics.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the Association for Computational Linguistics.

Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL Workshop on Multiword Expressions.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Santanu Pal, Sudip Kumar Naskar, Pavel Pecina, Sivaji Bandyopadhyay, and Andy Way. 2010. Handling named entities and compound verbs in phrase-based statistical machine translation. In Proceedings of the COLING 2010 Workshop on Multiword Expressions.

Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the ACL Workshop on Multiword Expressions.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the CICLING Conference on Intelligent Text Processing and Computational Linguistics.

Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Caroline Sporleder and Linlin Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Yulia Tsvetkov and Shuly Wintner. 2012. Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the Conference on Computational Linguistics.

Sriram Venkatapathy and Aravind K. Joshi. 2006. Using information about multi-word expressions for the word-alignment task. In Proceedings of the ACL Workshop on Multiword Expressions.

Begona Villada Moiron and Jorg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL Workshop on Multiword Expressions in a Multilingual Context.

Aline Villavicencio. 2003. Verb-particle constructions and lexical resources. In Proceedings of the ACL Workshop on Multiword Expressions.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the Conference on Computational Linguistics.

Jun Xu and Hang Li. 2007. AdaRank: a boosting algorithm for information retrieval. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval.

Ying Xu, Randy Goebel, Christoph Ringlstetter, and Grzegorz Kondrak. 2010. Application of the tightness continuum measure to Chinese information retrieval. In Proceedings of the COLING Workshop on Multiword Expressions.
