Académique Documents
Professionnel Documents
Culture Documents
Machine Translation
Charles Schafer
David Smith
Johns Hopkins University
{
(~1K words)
…
… …
Chinese French Arabic Italian Danish Overview Serbian Uzbek Chechen Khmer
AMTA 2006 Finnish of Statistical MTBengali 8
Resource Availability
P(E|F) = P(F|E)P(E)
P(F)
- General Method:
1. Identify NE’s via classifier
2. Transliterate name
3. Translate/reorder honorifics
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
Train a probabilistic finite-state
transducer to model this ambiguous
transformation
AMTA 2006 Overview of Statistical MT 23
Transliteration
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
- Number/Date Handling
- Morphological Analysis
- Analyze a word to its root form
(at least for word alignment)
was -> is believing -> believe
ruminerai -> ruminer ruminiez -> ruminer
- As a dimensionality reduction technique
- To allow lookup in existing dictionary
NULL
NULL
NULL
head-swapping
null siblings
AMTA 2006 Overview of Statistical
monotonicMT 38
Free Translation
Bad
dependencies
NULL
Parent-ancestors?
NULL
P(parent-child)
P(breakage)
P(I | ich)
I did not unfortunately receive an answer to this question
Deshalb haben wir allen Grund , die Umwelt in die Agrarpolitik zu integrieren
That is why we have every reason to integrate the environment in the agricultural policy
that is why we have every reason the environment in the agricultural policy to integrate
agricultural policy
therefore have we every reason the environment in the to integrate
,
that is why we have all reason , which environment in agricultural policy parliament
have therefore us all the reason of the environment into the agricultural policy successfully integrated
hence our any why that outside at agricultural policy too woven together
therefore , it of all reason for , the completion into that agricultural policy be
Deshalb haben wir allen Grund , die Umwelt in die Agrarpolitik zu integrieren
the
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in medias res immediately . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am in medias res immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-
12,12-12
as i have little time , i am in medias res immediately . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am in medias res immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-
12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-
12,12-12
as i have little time , i would immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-
12,12-12
because i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-
12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-
12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-
11,11-11 12-12,12-12
as i have little time , i am in res medias immediately . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,11-11 10-10,10-10 11-11,8-8 12-12,12-12
because i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in res medias immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,11-11 10-10,10-10 11-11,8-8 12-12,12-12
the
póll’ oîd’ alṓpēx
fox NN/NN
the fox knows many things knows
VB/VB
many
JJ/JJ
things
A variant of CKY chart parsing.
NP V’ NP the
póll’ oîd’ alṓpēx NP/NP
fox
NP V’ NP many
NP/NP
things
NP V’ NP the
póll’ oîd’ alṓpēx NP/NP
fox
NP V’ NP many
VP/VP
VP
things
NP V’ NP the
póll’ oîd’ alṓpēx
fox
things
S
did
not
unfortunately
receive
an
answer
to
this
question
Auf
diese
Frage
habe
ich
leider
keine
Antwort
bekommen
Assume word alignments are given.
AMTA 2006 Overview of Statistical MT 61
Phrase Models
I
did
not
unfortunately
receive
an
answer
to
this
question
Auf
diese
Frage
habe
ich
leider
keine
Antwort
bekommen
Some good phrase pairs.
AMTA 2006 Overview of Statistical MT 62
Phrase Models
I
did
not
unfortunately
receive
an
answer
to
this
question
Auf
diese
Frage
habe
ich
leider
keine
Antwort
bekommen
Some bad phrase pairs.
AMTA 2006 Overview of Statistical MT 63
“Count and Normalize”
• Usual approach: treat relative frequencies
of source phrase s and target phrase t as
probabilities
count ( s, t ) count ( s, t )
p( s | t ) p(t | s )
count (t ) count ( s )
• This leads to overcounting when not all
segmentations are legal due to unaligned
words.
AMTA 2006 Overview of Statistical MT 64
Hidden Structure
• But really, we don’t observe word
alignments.
• How are word alignment model
parameters estimated?
• Find (all) structures consistent with
observed data.
– Some links are incompatible with others.
– We need to score complete sets of links.
* Intrinsic
Human evaluation
* Extrinsic
Je suis fatigué.
Adequacy Fluency
Tired is I. 5 2
I am exhausted. 5 5
PRO
High quality
CON
Expensive!
PRO
CON
Bounded above
N-Gram
precision pn
n -gramhyp
countclip (n - gram ) by highest count
of n-gram in any
n -gramhyp
count (n - gram ) reference sentence
brevity
penalty
B= { e
1
(1- |ref| / |hyp|) if |ref| > |hyp|
otherwise
Bleu score:
1 N
Bleu= B exp pn
brevity penalty,
geometric
mean of N-Gram N n1
precisions
AMTA 2006 Overview of Statistical MT 73
Automatic Evaluation: Bleu Score
hypothesis 1 I am exhausted
hypothesis 2 Tired is I
reference 1 I am tired
reference 1 I am tired
1
p (ti | s ) exp( θ f ) Exponentiation makes it positive.
Z
for any given hypothesis i.
Gujarati
Nepali HINDI
Cognate Selection
Italian
Spanish
Catalan Romanian
Galician
some cognates
Arabic
Inuktitut
Memoryless Transducer
(Ristad & Yianilos 1997)
Farsi
ENGLISH
Turkish
Kazakh Uzbek
Kyrgyz
AMTA 2006 Overview of Statistical MT 90
* Multi-family bridge languages
Similarity Measures
for re-ranking cognate/transliteration hypotheses
2. Context similarity
2. Context similarity
independence vector
3 1 10 0 479 836 191 0
Construction of
context term vector
freedom vector
Construction of 681 184 104 0 21 4 141 0
context term vector
2. Context similarity
(correct) independence
nezavisnost
(incorrect) freedom
2. Context similarity
fC (wF)
F
fC (wE)
E
rf(wE)=
|CE| [min-ratio method]
* Adding more
bridge languages
helps
1. Data collection
- Bitext
- Monolingual text for language model (LM)
3. Tokenization
- Separation of punctuation
- Handling of contractions
5. Additional filtering
- Sentence length
- Removal of free translations
6. Training…