Kevin Knight
USC/Information Sciences Institute USC/Computer Science Department
Machine Translation
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Source: Ethnologue
2003
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
2005
news broadcast
e he e of the fof ingcmpnqsnwf cv fpn owoktvcv e f o e o oe t hu ihgzsnwfv rqcffnw cw owgcnwf ef kowazoanv ...
e he e is the sis ingcmpnqsnwf cv fpn owoktvcv e s i e i ie t hu ihgzsnwfv rqcffnw cw owgcnwf es kowazoanv ...
decipherment is the analysis ingcmpnqsnwf cv fpn owoktvcv of documents written in ancient hu ihgzsnwfv rqcffnw cw owgcnwf languages ... kowazoanv ...
The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish
When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. - Warren Weaver, March 1947
... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful. - Norbert Wiener, April 1947
Spanish/English corpus
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Translate: Clients do not sell pharmaceuticals in Europe.
???
???
10b. wat nnat gat mat bat hilat . 11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat . 12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat .
process of elimination
cognate?
zero fertility
Broken English What hunger have I, Hungry I am so, I am so hungry, Have me that hunger
J' ai si faim
I am so hungry
Mathematical Formulation
Given source sentence f:

$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e)\,P(e)}{P(f)} = \arg\max_e P(f \mid e)\,P(e)$$

The first step is Bayes' rule; the second holds because P(f) is the same for all e.
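The argmax above can be sketched in a few lines of Python. This is only an illustration: the candidate list and every probability below are invented toy numbers, not values from a trained model, and `decode` is a hypothetical helper name.

```python
import math

# Toy stand-ins for P(f | e), with f = "J' ai si faim" (invented numbers).
translation_model = {
    "what hunger have i": 0.30,
    "hungry i am so": 0.30,
    "i am so hungry": 0.30,
    "have me that hunger": 0.10,
}

# Toy stand-ins for P(e): only fluent English gets a non-tiny score.
language_model = {
    "what hunger have i": 1e-6,
    "hungry i am so": 1e-7,
    "i am so hungry": 1e-4,
    "have me that hunger": 1e-8,
}

def decode(candidates):
    """Return the candidate e maximizing log P(f | e) + log P(e)."""
    return max(
        candidates,
        key=lambda e: math.log(translation_model[e]) + math.log(language_model[e]),
    )

best = decode(list(translation_model))
print(best)
```

The translation model alone cannot separate the first three candidates here; the language model breaks the tie, which is the division of labor the formula suggests.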
[Figure: French ("J' ai si faim"), Translation Model P(f | e), Broken English, Language Model P(e), English ("I am so hungry").]
Language Modeling
Goal of a language model for MT: choose between candidate outputs such as
- He is on the soccer field
- He is in the soccer field
and
- Is table the on cup the
- The cup is on the table
Need to make these decisions, because the translation model may not have a lot of context information!
Probabilities can be tabulated from an online English corpus, just like the letter-pair frequency table in the Turkish decipherment case.
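A minimal sketch of tabulating such probabilities, assuming a bigram model with add-one smoothing (both are assumptions made here for brevity; the slides do not say which n-gram order or smoothing is used). The four-sentence corpus is invented for illustration.

```python
from collections import Counter
import math

# Tiny invented corpus standing in for "an online English corpus".
corpus = [
    "the cup is on the table",
    "he is on the soccer field",
    "the ball is on the field",
    "she is in the house",
]

def train_bigram_lm(sentences):
    """Count unigrams (as bigram histories) and bigrams with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    unigrams, bigrams, v = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(words[:-1], words[1:])
    )

model = train_bigram_lm(corpus)
for s in ["the cup is on the table", "is table the on cup the"]:
    print(s, log_prob(s, model))
```

With this corpus the well-ordered sentence scores higher than the scrambled one, and "on the soccer field" scores higher than "in the soccer field", simply because the relevant word pairs were seen more often.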
with the stressed relationship part own longstanding its its for chinese boeing , ,
for its part, stressed the longstanding relationship with its own, chinese boeing
boeing, for its part, stressed its own longstanding relationship with the chinese
[Soricut & Marcu, 05]
Translation Model?
Process model of translation:
Mary did not slap the green witch
Mary not slap slap slap the green witch
Mary not slap slap slap NULL the green witch
Maria no dió una bofetada a la verde bruja
Maria no dió una bofetada a la bruja verde
n(3 | slap): 50k entries
P-Null: 1 entry
t(la | the): 25m entries
d(j | i): 2500 entries
Trainable?
Maria no dió una bofetada a la bruja verde
Still trainable!
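P(f | e) =

$$\sum_a \binom{m-\phi_0}{\phi_0}\, P_{\text{Null}}^{\,m-2\phi_0}\,(1-P_{\text{Null}})^{\phi_0}\; \prod_{i=0}^{l}\phi_i!\;\frac{1}{\phi_0!}\; \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j})\; \prod_{j:\,a_j \neq 0} d(j \mid a_j, l, m)$$

fertility: n(φ_i | e_i)
word translation: t(f_j | e_aj)
re-ordering: d(j | a_j, l, m)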
Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.
Unsupervised EM Training
... la maison ... la maison bleue ... la fleur ...
... the house ... the blue house ... the flower ...
"maison" co-occurs with both "the" and "house", but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of "la" (pigeonhole principle).
Inherent hidden structure revealed by EM training!
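The EM effect on this three-sentence corpus can be reproduced with a short sketch. It assumes the simplest word-translation-only model (IBM Model 1 style, no fertility, re-ordering, or NULL word), which is a deliberate simplification of the fuller model described earlier; `train_model1` is a hypothetical name.

```python
from collections import defaultdict

corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_model1(pairs, iterations=20):
    """EM for a word-translation table t[(f, e)] = P(f | e), uniform start."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked
        # E-step: split each French word's unit of count among the
        # English words in the same sentence, in proportion to t.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

t = train_model1(corpus)
for f, e in [("la", "the"), ("maison", "house"), ("maison", "the"), ("fleur", "flower")]:
    print(f"P({f} | {e}) = {t[(f, e)]:.3f}")
```

No alignment is ever given to the program; the preference for la/the, maison/house, and fleur/flower emerges from co-occurrence alone.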
A Statistical MT Tutorial Workbook (Knight, 1999). Promises free beer.
The Mathematics of Statistical Machine Translation (Brown et al, 1993)
Software: GIZA++
Translation Model [Brown et al 93]

potential translation e    f               P(f | e)
national                   nationale       0.47
                           national        0.42
                           nationaux       0.05
                           nationales      0.03
the                        le              0.50
                           la              0.21
                           les             0.16
                           l'              0.09
                           ce              0.02
                           cette           0.01
farmers                    agriculteurs    0.44
                           les             0.42
                           cultivateurs    0.05
                           producteurs     0.02

Language Model P(e)

w1      w2         P(w2 | w1)
of      the        0.13
        a          0.09
        another    0.01
        some       0.01
hong    kong       0.98
        said       0.01
        stated     0.01
you shut !
you shut up !
[Figure: search graph of partial translation hypotheses, from start to end.]
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
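The fields listed above map naturally onto a small record type. The sketch below is one possible layout, not the decoder's actual data structure: the class name, field names, the `extend` method, and the use of log scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

@dataclass(frozen=True)
class Hypothesis:
    last_word: str                 # last English word chosen
    last_covered: FrozenSet[int]   # source positions covered by last_word
    prev_word: Optional[str]       # next-to-last English word chosen
    coverage: FrozenSet[int]       # all source positions covered so far
    lm_score: float                # language model log score so far
    tm_score: float                # translation model log score so far

    def extend(self, word: str, positions: Iterable[int],
               lm_logp: float, tm_logp: float) -> "Hypothesis":
        """Add one English word that translates the given source positions."""
        positions = frozenset(positions)
        if positions & self.coverage:
            raise ValueError("source word already covered")
        return Hypothesis(
            last_word=word,
            last_covered=positions,
            prev_word=self.last_word,
            coverage=self.coverage | positions,
            lm_score=self.lm_score + lm_logp,
            tm_score=self.tm_score + tm_logp,
        )

START = Hypothesis("<s>", frozenset(), None, frozenset(), 0.0, 0.0)

h1 = START.extend("mary", [0], lm_logp=-1.0, tm_logp=-0.5)
h2 = h1.extend("slap", [2, 3, 4], lm_logp=-2.0, tm_logp=-1.5)
print(h2)
```

Keeping only the last two English words (rather than the whole output so far) is enough to score the next word under a trigram language model, which is presumably why the hypothesis stores exactly those.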
Classic Results
(Foreign Original) nous avons signé le protocole .
(Human Translation) we did sign the memorandum of agreement .
(MT) we have signed the protocol .
(Foreign Original) où était le plan solide ?
(Human Translation) but where was the solid plan ?
(MT) where was the economic base ?
the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and very slow = one page per day
Okay!
I know, so far, this talk should be called
Further Developments
Follow-on projects
Hong Kong Aachen Behavior Design Corporation
[Figure: timeline, 1994 to 2004.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU in Action
(Reference Translation) the gunman was shot to death by the police .
#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .
green = 4-gram match (good!)
red = word not matched (bad!)
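The n-gram matching behind the coloring can be sketched as clipped ("modified") n-gram precision against a single reference. This is only one ingredient of BLEU: the full metric also combines several n-gram orders with a geometric mean and applies a brevity penalty, neither of which is implemented here. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / total

reference = "the gunman was shot to death by the police ."
outputs = {
    "#3": "the gunman was shot dead by the police .",
    "#6": "the gunman was shot to death by the police .",
    "#10": "police killed the gunman .",
}
for name, mt in outputs.items():
    print(name, modified_precision(mt, reference, 1), modified_precision(mt, reference, 4))
```

Output #10 is a perfectly good translation but shares no 4-gram with this one reference, which illustrates why scores against a single reference are noisy at the sentence level.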
[Figure: BLEU Score plotted against Human Judgments, for Adequacy and Fluency; R2 = 88.0% and R2 = 90.2%.]
Experiment-Driven Progress
Evaluate new MT research ideas every day! (and be alerted about bugs)
[Figure: BLEU (axis 15 to 35) over time, Mar 1 to May 1, 2005.]
[Figure: BLEU score (axis 0 to 0.25) plotted against 10k, 20k, 40k, 80k, 160k, 320k.]
Flaws of Word-Based MT
Can't translate multiple English words to one French word
Can't translate phrases
real estate, note that, interest in
The MT Triangle
[Figure: the MT triangle, with levels words, syntax, logical form, and interlingua on the SOURCE and TARGET sides.]
Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text
Well, these all seem like good ideas. Which one had the most dramatic effect on MT quality? None of them!
Phrases
How do you translate "real estate" into French?
real estate
real number
dance number
dance card
memory card
memory stick
[Figure: the MT triangle with a phrases level added between words and syntax, SOURCE and TARGET sides.]
Phrase-Based Statistical MT
Morgen fliege ich nach Kanada zur Konferenz
Tomorrow
will fly
to the conference
In Canada
Phrases are probabilistically re-ordered.
See [Koehn et al, 2003] for an overview.
Mary | did not | slap | the green witch
Phrase pairs collected from the word alignment, smallest first:
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch)
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
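One common way to collect such pairs is to keep every pair of contiguous spans that is consistent with the word alignment, meaning no alignment link crosses the boundary of the pair. The sketch below assumes that criterion and assumes a particular word alignment for this sentence ("no" linked to both "did" and "not", "a" unaligned); the alignment grid itself is not recoverable from the text above, so treat it as illustrative.

```python
def extract_phrases(f_words, e_words, alignment):
    """Return all (foreign phrase, English phrase) pairs consistent with
    the alignment, given as a set of (foreign index, English index) links."""
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, len(f_words)):
            for e1 in range(len(e_words)):
                for e2 in range(e1, len(e_words)):
                    # Need at least one link inside the candidate box.
                    inside = any(f1 <= f <= f2 and e1 <= e <= e2 for f, e in alignment)
                    # Every link must be fully inside or fully outside.
                    consistent = all(
                        (f1 <= f <= f2) == (e1 <= e <= e2) for f, e in alignment
                    )
                    if inside and consistent:
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs

f_words = "Maria no dió una bofetada a la bruja verde".split()
e_words = "Mary did not slap the green witch".split()
# Assumed alignment links (foreign index, English index).
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}

pairs = extract_phrases(f_words, e_words, alignment)
for f, e in sorted(pairs, key=lambda p: (len(p[0].split()), p[0])):
    print(f"({f}, {e})")
```

The brute-force loop over all span pairs is quartic in sentence length, which is fine for a demonstration; real extractors bound the phrase length and derive the English span from the links.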
P(f-f-f | e-e-e) =
Phrase-Based MT
This is currently the best way to do Statistical MT! What took so long to move from words to phrases?
Missing RAM
25m parameters to billions of parameters
Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
Now solved in commercial deployments
works better!
IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system
20.64% BLEU
17.96% BLEU
Nesting
does not X itself becomes an X
CKY decoder
John Cocke
Need much more grammatical output
Need accurate control over re-ordering
Need accurate insertion of function words
String Output
Tree Output
SBAR
x0:S
x0 x1 x2 x0 x1 "," x2
0.82 0.02
(Chinese/ English)
x1:VB
S x0:NP VP x2:NP
x0 x1 x2 x1 x0 x2
0.54 0.44
x1:VB
(Arabic/ English)
subject-verb inversion
Format is Expressive
Phrasal Translation VP VBZ VBG is singing est, cantando PRO Non-constituent Phrases S VP x0:NP Non-contiguous Phrases VP VB x0:NP put PRT on poner, x0
hay, x0
there VB are
PP
P x1:NP of
x1,
, x0
Applications
Summary
Making good progress
Algorithms + Data + Evaluation + Computers
Interdisciplinary work
- Natural language processing
- Machine learning
- Linguistics
- Automata theory
Thank you
Syntax-Based vs Phrase-Based
[Figure: BLEU (axis 15 to 35) over time, Mar 1 to May 1, 2005, with a curve labeled "phrase-based system".]
Summary
Phrase-based models are state-of-the-art
- Word alignments
- Phrase pair extraction & probabilities
- N-gram language models
- Beam search decoding
- Feature functions & learning weights
Available Resources
Bilingual corpora
- 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
- Lots of French/English, Spanish/French/English, LDC
- European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
Sentence alignment
- Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
- Xiaoyi Ma, LDC (Champollion)
Word alignment
- GIZA, JHU Workshop 99 (www.clsp.jhu.edu/ws99/projects/mt/)
- GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
- Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
- Shared task, NAACL-HLT03 workshop
Decoding
- ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
- ISI Pharaoh phrase-based decoder

Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)
AMTA
[Soricut et al, 2002] [Al-Onaizan & Knight, 1998]
EACL
[Cmejrek et al, 2003]
Computational Linguistics
[Brown et al, 1993] [Knight, 1999] [Wu, 1997]
AAAI
[Koehn & Knight, 2000]
IWNLG
[Habash, 2002]
EMNLP
[Marcu & Wong, 2002] [Fox, 2002] [Munteanu & Marcu, 2002]
MT Summit
[Charniak, Knight, Yamada, 2003]
NAACL
[Koehn, Marcu, Och, 2003] [Germann, 2003] [Graehl & Knight, 2004] [Galley, Hopkins, Knight, Marcu, 2004]
AI Magazine
[Knight, 1997]
www.isi.edu/~knight
[MT Tutorial Workbook]
[Figure: plot by year, 1994 to 2004 (vertical axis ticks 0, 20, 40); series Chinese/English, Arabic/English, French/English; annotation "One Billion?".]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).
Sentence Alignment

Raw text:
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

Split into sentences:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Aligned:
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
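The merge-and-discard step can be sketched as follows, assuming the alignment itself has already been found by some sentence aligner and is handed over as a list of index groups. The `beads` format and the function name are hypothetical conventions chosen for this example.

```python
def build_sentence_pairs(src, tgt, beads):
    """Turn alignment beads into training pairs.

    Each bead is (source indices, target indices). Beads with an empty
    side are unaligned sentences and are thrown out; n-to-m beads are
    merged by joining their sentences.
    """
    pairs = []
    for src_ids, tgt_ids in beads:
        if not src_ids or not tgt_ids:
            continue
        pairs.append((" ".join(src[i] for i in src_ids),
                      " ".join(tgt[j] for j in tgt_ids)))
    return pairs

english = [
    "The old man is happy.",
    "He has fished many times.",
    "His wife talks to him.",
    "The fish are jumping.",
    "The sharks await.",
]
spanish = [
    "El viejo está feliz porque ha pescado muchas veces.",
    "Su mujer habla con él.",
    "Los tiburones esperan.",
]
# Assumed aligner output: a 2-to-1 bead, a 1-to-1 bead, an unaligned
# English sentence, and another 1-to-1 bead.
beads = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]

pairs = build_sentence_pairs(english, spanish, beads)
for en, es in pairs:
    print(en, "|||", es)
```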
Chinese
Input (byte stream): Output:
Lower-Casing
English
Input (7 words): " There , " said Bob .
Output (7 words): " there , " said bob .
Idea of tokenizing and lower-casing: surface variants such as "The" and "the", with or without attached punctuation, all map to the single token "the".
Smaller vocabulary size.
More robust counting and learning.
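A minimal sketch of both steps, assuming a very simple tokenizer that splits off every punctuation character as its own token. That rule is an assumption for illustration; it would also split "don't" into three tokens, so real preprocessing needs more care.

```python
import re

def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(text):
    """Tokenize, then lower-case every token."""
    return [tok.lower() for tok in tokenize(text)]

tokens = normalize('"There," said Bob.')
print(" ".join(tokens))

# Different surface forms collapse onto one vocabulary item.
vocab = {tok for s in ['The', 'the', '"The', "'the"] for tok in normalize(s)}
print(vocab)
```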
Language model
Given an English string e, assigns P(e) by formula
good English string -> high P(e)
random word sequence -> low P(e)
Decoding algorithm
Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) P(f | e).