
What's New in Statistical Machine Translation

Kevin Knight
USC/Information Sciences Institute USC/Computer Science Department

Machine Translation

The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

Thousands of Languages Are Spoken


MANDARIN       885,000,000      TURKISH           59,000,000
SPANISH        332,000,000      URDU              58,000,000
ENGLISH        322,000,000      MIN NAN (China)   49,000,000
BENGALI        189,000,000      JINYU (China)     45,000,000
HINDI          182,000,000      GUJARATI          44,000,000
PORTUGUESE     170,000,000      POLISH            44,000,000
RUSSIAN        170,000,000      ARABIC            42,500,000
JAPANESE       125,000,000      UKRAINIAN         41,000,000
GERMAN          98,000,000      ITALIAN           37,000,000
WU (China)      77,175,000      XIANG (China)     36,015,000
JAVANESE        75,500,800      MALAYALAM         34,022,000
KOREAN          75,000,000      HAKKA (China)     34,000,000
FRENCH          72,000,000      KANNADA           33,663,000
VIETNAMESE      67,662,000      ORIYA             31,000,000
TELUGU          66,350,000      PANJABI           30,000,000
YUE (China)     66,000,000      SUNDA             27,000,000
MARATHI         64,783,000
TAMIL           63,075,000

Source: Ethnologue

Recent Progress in Statistical MT


2002
insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.

2003
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

2005
news broadcast -> foreign language speech recognition -> English translation -> searchable archive

Warren Weaver (1947)

ingcmpnqsnwf cv fpn owoktvcv

hu ihgzsnwfv rqcffnw cw owgcnwf


kowazoanv ...

Deciphering step by step:
- Guess that cipher letter n stands for e.
- Then fpn is probably "the", which also gives p = h and f = t.
- Try cv = "of": owoktvcv would then end in "fof". Back up.
- Try cv = "is": owoktvcv then ends in "sis". Keep going.

ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...
decipherment is the analysis of documents written in ancient languages ...

Warren Weaver (1947)


Can this be computerized?

The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish, collected mechanically from a Turkish body of text, or corpus.
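A letter-pair frequency table of this kind can be collected in a few lines. The sketch below is only an illustration: the function name `letter_pair_frequencies` and the three-word stand-in corpus are assumptions, and a usable table would need a large Turkish corpus.

```python
from collections import Counter

def letter_pair_frequencies(corpus: str) -> dict:
    """Relative frequency of each adjacent letter pair inside the words of a text."""
    counts = Counter()
    for word in corpus.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        # Count every pair of neighbouring letters within the word.
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Tiny stand-in corpus (hypothetical); pairs: bi, ir, ik, ki, bi, ir.
table = letter_pair_frequencies("bir iki bir")
print(sorted(table.items(), key=lambda kv: -kv[1]))
```

A decipherer can then prefer the letter assignments whose decoded text has likely letter pairs according to this table.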

"When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." - Warren Weaver, March 1947

"... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful." - Norbert Wiener, April 1947

Spanish/English corpus
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zanzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .

Spanish/English corpus
Translate: Clients do not sell pharmaceuticals in Europe.

Centauri/Arcturan [Knight 97]


Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Clues noted on successive builds of this slide: ???, process of elimination, cognate?

Centauri/Arcturan [Knight 97]


Your assignment, put these words in order:
{ jjat, arrat, mat, bat, oloat, at-yurp }

zero fertility

When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. - Warren Weaver, March 1947

The required statistical tables have millions of entries?

Too much for the computers of Weaver's day.


Not enough RAM!

IBM Candide Project (1988-1994)


How to get quantities of human translation in computer-readable form?
parallel corpus
IBM's John Cocke, inventor of CKY parsing & RISC processors
Canadian bureaucrat


IBM Candide Project


[Brown et al 93]
French/English Bilingual Text -> Statistical Analysis
English Text -> Statistical Analysis

French -> Broken English -> English

J'ai si faim -> { What hunger have I, Hungry I am so, I am so hungry, Have me that hunger } -> I am so hungry

Mathematical Formulation
Given source sentence f:

$$\arg\max_e P(e \mid f) \;=\; \arg\max_e \frac{P(f \mid e)\,P(e)}{P(f)} \;=\; \arg\max_e P(f \mid e)\,P(e)$$

The first step is by Bayes' Rule; the second holds because P(f) is the same for all e.

French -> Translation Model P(f | e) -> Broken English -> Language Model P(e) -> English

J'ai si faim -> Decoding algorithm: argmax_e P(e) P(f | e) -> I am so hungry

Language Modeling
Goal of a language model for MT: choose between alternatives such as
- He is on the soccer field / He is in the soccer field
- Is table the on cup the / The cup is on the table
- American shrine / American company

Need to make these decisions, because translation model may not have a lot of context information!

The Classic Language Model


Word Bigrams
Process model of English: Generate each word based only on the previous word.

P(I saw water on the table) =
    P(I | START) P(saw | I) P(water | saw) P(on | water) P(the | on) P(table | the) P(END | table)

Probabilities can be tabulated from an online English corpus, just like Weaver's Turkish case.
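As a rough sketch of how such a bigram table could be tabulated and applied, the following counts word pairs in a toy corpus and multiplies the conditional probabilities. The three-sentence corpus and the function names are assumptions made for illustration, and no smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count word bigrams, padding each sentence with START and END."""
    pair_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["START"] + sentence.split() + ["END"]
        for prev, word in zip(words, words[1:]):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts

def sentence_probability(sentence, pair_counts, context_counts):
    """P(sentence) as a product of P(word | previous word), unsmoothed."""
    words = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0  # previous word never seen as a context
        prob *= pair_counts[(prev, word)] / context_counts[prev]
    return prob

# Hypothetical toy corpus standing in for a large online English corpus.
corpus = ["I saw water on the table", "I saw the table", "the cup is on the table"]
pairs, contexts = train_bigram_model(corpus)
p = sentence_probability("I saw water on the table", pairs, contexts)
print(p)  # 2/3 * 1 * 1/2 * 1 * 1 * 3/4 * 1 = 0.25
```

A real language model would smooth these estimates so that unseen word pairs do not zero out a whole sentence.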


Trigram Language Model


to the said royal purchase plan trustco part operations of its its is international expand banking
the banking trustco is said to expand its purchase part of its royal international plan operations
royal trustco said the purchase is part of its plan to expand its international banking operations

N-grams have a lot of semantics in them!

[Soricut & Marcu, 05]

Trigram Language Model


with the stressed relationship part own longstanding its its for chinese boeing , ,
for its part, stressed the longstanding relationship with its own, chinese boeing
boeing, for its part, stressed its own longstanding relationship with the chinese

[Soricut & Marcu, 05]

Translation Model?
Process model of translation:

Mary did not slap the green witch


Source-language morphological analysis -> Source parse tree -> Semantic representation -> Generate target structure

Maria no dió una bofetada a la bruja verde

What are all the possible moves and what probability tables control those moves?

The Classic Translation Model


Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:

Mary did not slap the green witch
    n(3|slap)     (50k entries)
Mary not slap slap slap the green witch
    P-Null        (1 entry)
Mary not slap slap slap NULL the green witch
    t(la|the)     (25m entries)
Maria no dió una bofetada a la verde bruja
    d(j|i)        (2500 entries)
Maria no dió una bofetada a la bruja verde

Trainable

The Classic Translation Model


Word Substitution/Permutation [Brown et al., 1993]
Process model of translation:

Mary did not slap the green witch
    ?
Maria no dió una bofetada a la bruja verde

Tables: n(3|slap) (50k entries), P-Null (1 entry), t(la|the) (25m entries), d(j|i) (2500 entries)

Still trainable!

Classic Formula for P(f | e)


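$$P(f \mid e) = \sum_{a} \binom{m-\phi_0}{\phi_0}\,\text{P-Null}^{\,m-2\phi_0}\,(1-\text{P-Null})^{\phi_0}\;\prod_{i=0}^{l}\phi_i!\;\frac{1}{\phi_0!}\;\prod_{i=1}^{l} n(\phi_i \mid e_i)\;\prod_{j=1}^{m} t(f_j \mid e_{a_j})\;\prod_{j:\,a_j \neq 0} d(j \mid a_j, l, m)$$

- sum over alignment possibilities: the sum over a
- NULL stuff: the binomial coefficient, the P-Null factors, and 1/φ_0!
- fertility: n(φ_i | e_i)
- word translation: t(f_j | e_{a_j})
- re-ordering: d(j | a_j, l, m)

Here l and m are the lengths of e and f, a_j is the position in e that f_j is aligned to (0 means NULL), and φ_i is the fertility of e_i.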

Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.

Unsupervised EM Training
la maison        la maison bleue        la fleur
the house        the blue house         the flower

All P(french-word | english-word) equally likely

"la" and "the" observed to co-occur frequently, so P(la | the) is increased.

"maison" co-occurs with both "the" and "house", but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of "la" (pigeonhole principle)

settling down after another iteration

Inherent hidden structure revealed by EM training!

A Statistical MT Tutorial Workbook (Knight, 1999). Promises free beer.
The Mathematics of Statistical Machine Translation (Brown et al, 1993)
Software: GIZA++
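The EM walk-through on the three sentence pairs can be reproduced with a small sketch in the style of IBM Model 1: word translation probabilities only, no NULL word, fertility, or re-ordering. This is a simplified illustration and not the GIZA++ implementation; the function name `train_model1` and the iteration count are assumptions.

```python
from collections import defaultdict

def train_model1(sentence_pairs, iterations=10):
    """EM estimate of t(f | e) from (French, English) sentence pairs."""
    corpus = [(f.split(), e.split()) for f, e in sentence_pairs]
    french_vocab = {f for fs, _ in corpus for f in fs}
    # Start with all P(french-word | english-word) equally likely.
    t = {(f, e): 1.0 / len(french_vocab)
         for fs, es in corpus for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f being generated by e
        total = defaultdict(float)   # expected count of e generating anything
        for fs, es in corpus:
            for f in fs:
                # E-step: share one unit of count for f among the English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize so that t(. | e) sums to 1 for each e.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

pairs = [("la maison", "the house"),
         ("la maison bleue", "the blue house"),
         ("la fleur", "the flower")]
t = train_model1(pairs)
for f, e in [("la", "the"), ("maison", "house"), ("bleue", "blue"), ("fleur", "flower")]:
    print(f"t({f} | {e}) = {t[(f, e)]:.3f}")
```

Printing the table after each iteration shows the same progression as the captions: uniform at first, then "la"/"the" rising, then the remaining words settling.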

Sample Translation Probabilities


Translation Model [Brown et al 93]

e           f               P(f | e)
national    nationale       0.47
            national        0.42
            nationaux       0.05
            nationales      0.03
the         le              0.50
            la              0.21
            les             0.16
            l'              0.09
            ce              0.02
            cette           0.01
farmers     agriculteurs    0.44
            les             0.42
            cultivateurs    0.05
            producteurs     0.02

Language Model

w1      w2         P(w2 | w1)
of      the        0.13
        a          0.09
        another    0.01
        some       0.01
hong    kong       0.98
        said       0.01
        stated     0.01

new French sentence f + potential translation e
-> Translation Model gives P(f | e)
-> Language Model gives P(e)
-> P(f | e) P(e) = score for e
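Scoring a potential translation with P(f | e) P(e) can be sketched as below. The three t(f | e) values come from the sample table [Brown et al 93]; the bigram values in `LM`, the 1e-9 floor for unseen entries, and the function names are hypothetical. The translation score is a Model-1-style simplification (each French word may come from any English word, length constants dropped), so it is only comparable between candidates of the same length.

```python
import math

# t(f | e) entries taken from the sample translation table.
T = {
    ("les", "the"): 0.16,
    ("les", "farmers"): 0.42,
    ("agriculteurs", "farmers"): 0.44,
}

# Hypothetical bigram probabilities P(w2 | w1), for illustration only.
LM = {
    ("START", "the"): 0.2, ("the", "farmers"): 0.01, ("farmers", "END"): 0.1,
    ("START", "farmers"): 0.001, ("farmers", "the"): 0.001, ("the", "END"): 0.0001,
}

FLOOR = 1e-9  # assumed probability for entries missing from a table

def log_translation_score(f_words, e_words):
    """log P(f | e) up to a constant, ignoring word order entirely."""
    return sum(math.log(sum(T.get((f, e), FLOOR) for e in e_words))
               for f in f_words)

def log_language_score(e_words):
    """log P(e) under the bigram language model."""
    words = ["START"] + e_words + ["END"]
    return sum(math.log(LM.get(pair, FLOOR)) for pair in zip(words, words[1:]))

def best_translation(f, candidates):
    """Pick the candidate e with the highest P(f | e) P(e), in log space."""
    f_words = f.split()
    return max(candidates,
               key=lambda e: log_translation_score(f_words, e.split())
               + log_language_score(e.split()))

print(best_translation("les agriculteurs", ["farmers the", "the farmers"]))
```

Both candidates get the same translation score here, so the language model alone decides the word order.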

Search for Best Translation


voulez vous vous taire !

you you you quiet !
you you quiet !
quiet you you you !
shut you you you !
you shut !
you shut up !

Classic Decoding Algorithm


Given f, find the English string e that maximizes P(e) P(f | e)
NP-Complete [Knight 99].
Brown et al 93: "In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper."

Beam Search Decoding


[Brown et al US Patent #5,477,451]

start -> 1st English word -> 2nd English word -> 3rd English word -> 4th English word -> end (all source words covered)

Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)

[Jelinek 69; Och, Ueffing, and Ney, 01]

Each hypothesis also records its best predecessor link.
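A much-simplified sketch of beam search over partial hypotheses follows. It is not the patented decoder: it assumes one English word per source word, a bigram language model (so only the last English word is stored, not the next-to-last), no re-ordering model, and a floor probability for unseen bigrams. The table contents, `FLOOR`, `Hypothesis`, and `decode` are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional

FLOOR = 1e-6  # assumed probability for bigrams missing from the toy table

@dataclass
class Hypothesis:
    last_word: str                 # last English word chosen
    coverage: frozenset            # source positions covered so far
    score: float                   # log LM + log TM score so far
    back: Optional["Hypothesis"]   # best predecessor link

def decode(source, t_table, lm, beam_size=5):
    """Return the best English word sequence found for a list of source words."""
    # stacks[n] holds hypotheses covering n source words, keyed for recombination.
    stacks = [{} for _ in range(len(source) + 1)]
    stacks[0][("START", frozenset())] = Hypothesis("START", frozenset(), 0.0, None)
    for n in range(len(source)):
        # Beam pruning: expand only the best few hypotheses in this stack.
        survivors = sorted(stacks[n].values(), key=lambda h: h.score, reverse=True)
        for hyp in survivors[:beam_size]:
            for j, f in enumerate(source):
                if j in hyp.coverage:
                    continue
                for e, p_tm in t_table.get(f, {}).items():
                    p_lm = lm.get((hyp.last_word, e), FLOOR)
                    score = hyp.score + math.log(p_tm) + math.log(p_lm)
                    key = (e, hyp.coverage | {j})
                    # Recombination: keep only the best way to reach this state.
                    if key not in stacks[n + 1] or score > stacks[n + 1][key].score:
                        stacks[n + 1][key] = Hypothesis(e, key[1], score, hyp)
    finals = list(stacks[-1].values())
    if not finals:
        return []
    best = max(finals, key=lambda h: h.score
               + math.log(lm.get((h.last_word, "END"), FLOOR)))
    words = []
    while best.back is not None:   # follow predecessor links back to start
        words.append(best.last_word)
        best = best.back
    return words[::-1]

# Toy tables: t_table maps source word -> {English word: t(f | e)}.
t_table = {"la": {"the": 0.9}, "maison": {"house": 0.9}, "bleue": {"blue": 0.9}}
lm = {("START", "the"): 0.5, ("the", "blue"): 0.2,
      ("blue", "house"): 0.3, ("house", "END"): 0.4}
print(decode("la maison bleue".split(), t_table, lm))
```

Because source words may be covered in any order, the language model is free to place "blue" before "house" even though "bleue" follows "maison" in the source.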

Classic Results
nous avons signé le protocole .                  (Foreign Original)
we did sign the memorandum of agreement .        (Human Translation)
we have signed the protocol .                    (MT)

où était le plan solide ?                        (Foreign Original)
but where was the solid plan ?                   (Human Translation)
where was the economic base ?                    (MT)

the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars

... and very slow = one page per day

Okay!
I know, so far, this talk should be called

What's Old in Statistical Machine Translation!!

Further Developments
Follow-on projects
- Hong Kong
- Aachen
- Behavior Design Corporation

JHU Summer Workshop 1999
- Build & distribute statistical MT tools
- Create standard training & testing data
- Disseminate tutorial material
- "MT in a Day"
- Ask new questions

How Much Data Do We Need?


[Plot: quality of automatically trained machine translation system (y-axis) vs. amount of bilingual training data (x-axis)]

Advances in Statistical MT 2000-2004

Ready-to-Use Online Bilingual Data


[Chart: millions of words (English side), scale 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English]

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

+ European parliament data [Koehn 05]

BLEU Evaluation Metric


(Papineni et al 02)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

N-gram precision (score is between 0 & 1)
- What percentage of machine n-grams can be found in the reference translation?
- Gross measure over 1000 test sentences.
- Not allowed to use same portion of reference translation twice (can't cheat by typing out "the the the the the")

Brevity penalty
- Can't just type out single word "the" (and get precision 1.0)

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
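The two ingredients described for BLEU, clipped n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified sentence-level version against a single reference with whitespace tokens; the metric of Papineni et al is computed over a whole test set and may use several references. The function names are assumptions.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of the n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the gunman was shot to death by the police .".split()
print(modified_precision("the the the the the".split(), reference, 1))  # 2 of 5 match
print(bleu("the gunman was shot dead by the police .".split(), reference))
```

With unigrams only, the single-word output "the" has precision 1.0, and it is the brevity penalty that pushes its score close to zero.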

BLEU in Action
(Reference Translation) the gunman was shot to death by the police .
#1  the gunman was police kill .
#2  wounded police jaya of
#3  the gunman was shot dead by the police .
#4  the gunman arrested by police kill .
#5  the gunmen were killed .
#6  the gunman was shot to death by the police .
#7  gunmen were killed by police ?SUB>0 ?SUB>0
#8  al by the police .
#9  the ringer is killed by the police .
#10 police killed the gunman .

green = 4-gram match (good!)
red = word not matched (bad!)

BLEU Tends to Predict Human Judgments


[Scatter plot: BLEU Score (y-axis) vs. Human Judgments (x-axis), both axes from -2.5 to 2.5, with linear fits: Adequacy R2 = 88.0%, Fluency R2 = 90.2%]

slide from G. Doddington (NIST)

Experiment-Driven Progress
Evaluate new MT research ideas every day! (and be alerted about bugs)

[Chart: BLEU (scale 15 to 35) over time, Mar 1 to May 1, 2005. ISI Syntax-Based MT, Chinese/English, NIST 2002 Test Set]

Draw Learning Curves


[Plot: BLEU score (0 to 0.35) vs. # of sentence pairs used in training (10k, 20k, 40k, 80k, 160k, 320k), for Swedish/English, French/English, German/English, Finnish/English]

Experiments by Philipp Koehn

Flaws of Word-Based MT
- Can't translate multiple English words to one French word
- Can't translate phrases: "real estate", "note that", "interest in"
- Isn't sensitive to syntax
  - Adjectives/nouns should swap order
  - Verb comes at the beginning in Arabic
- Doesn't understand the meaning (?)

The MT Triangle
                     interlingua
          logical form         logical form
       syntax                        syntax
    words                               words
    SOURCE                              TARGET

The MT Swimming Pool


[Diagram: the same levels as the MT Triangle (words, syntax, logical form, interlingua)]

Commercial Rule-Based Systems


[Diagram: the MT Triangle levels (words, syntax, logical form, interlingua) between SOURCE and TARGET]

interlingua

Knight et al 95 - meaning-based translation - composition rules

logical form

logical form

syntax

syntax

Language Model words SOURCE TARGET words

Wu 97, Alshawi 98: inducing syntactic structure as a by-product of aligning words in bilingual text
[Diagram: MT triangle (interlingua; logical form, syntax, and words on the SOURCE and TARGET sides) with a Language Model.]

Yamada/Knight (01,02): tree/string model, used existing target language parser
[Diagram: MT triangle (interlingua; logical form, syntax, and words on the SOURCE and TARGET sides) with a Language Model.]

Well, these all seem like good ideas. Which one had the most dramatic effect on MT quality? None of them!

Phrases
How do you translate "real estate" into French?
real estate, real number, dance number, dance card, memory card, memory stick
[Diagram: MT triangle with a "phrases" level added between words and syntax on both the SOURCE and TARGET sides; logical form and interlingua above.]

Phrase-Based Statistical MT
Morgen | fliege | ich | nach Kanada | zur Konferenz
Tomorrow | I | will fly | to the conference | in Canada

Foreign input segmented into phrases
"phrase" just means "word sequence"

Each phrase is probabilistically translated into English
P(to the conference | zur Konferenz)
P(into the meeting | zur Konferenz)

Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.

How to Learn the Phrase Translation Table?


One method: alignment templates [Och et al 99]

Start with word alignment


Collect all phrase pairs that are consistent with the word alignment

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch)

Word Alignment Induced Phrases


Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch)
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
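Collecting the phrase pairs consistent with a word alignment can be sketched as follows. This is a minimal sketch of the usual consistency check, not necessarily the exact procedure of [Och et al 99]. The word alignment below is an assumption reconstructed from the phrase pairs listed on the slides ("a" is left unaligned, "no" links to both "did" and "not"), the verb is written "dió", and `extract_phrases` and `max_len` are illustrative names.

```python
def extract_phrases(foreign, english, alignment, max_len=9):
    """Collect phrase pairs consistent with a word alignment.

    alignment is a set of (foreign_index, english_index) links. A pair of
    spans is consistent when it contains at least one link and no word in
    either span is linked to a word outside the other span.
    """
    aligned_f = {f for f, _ in alignment}
    pairs = set()
    for e1 in range(len(english)):
        for e2 in range(e1, min(e1 + max_len, len(english))):
            # Foreign positions linked to the English span [e1, e2].
            linked = [f for f, e in alignment if e1 <= e <= e2]
            if not linked:
                continue
            f1, f2 = min(linked), max(linked)
            # Reject if a foreign word in [f1, f2] links outside [e1, e2].
            if any(f1 <= f <= f2 and not e1 <= e <= e2 for f, e in alignment):
                continue
            # Also emit variants that absorb neighbouring unaligned foreign words.
            start = f1
            while True:
                end = f2
                while True:
                    if end - start < max_len:
                        pairs.add((" ".join(foreign[start:end + 1]),
                                   " ".join(english[e1:e2 + 1])))
                    end += 1
                    if end >= len(foreign) or end in aligned_f:
                        break
                start -= 1
                if start < 0 or start in aligned_f:
                    break
    return pairs

foreign = "Maria no dió una bofetada a la bruja verde".split()
english = "Mary did not slap the green witch".split()
# Assumed links (foreign index, English index); "a" (index 5) is unaligned.
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)}
pairs = extract_phrases(foreign, english, alignment)
for pair in sorted(pairs):
    print(pair)
```

Because "a" carries no link, it can attach to the phrase on either side, which is why both (la, the) and (a la, the) come out.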

Phrase Pair Probabilities


A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
No EM training. Just relative frequency:

$$P(\text{f-f-f} \mid \text{e-e-e}) = \frac{\mathrm{count}(\text{f-f-f},\ \text{e-e-e})}{\mathrm{count}(\text{e-e-e})}$$
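The relative-frequency estimate is a pair of counts and a division. A minimal sketch, assuming the extracted phrase pairs arrive as a flat list with one entry per occurrence; the counts below and the phrase "zu der Konferenz" are invented for illustration.

```python
from collections import Counter

def phrase_probabilities(extracted_pairs):
    """Estimate P(f | e) = count(f, e) / count(e).

    extracted_pairs holds (foreign_phrase, english_phrase) tuples,
    one per occurrence in the bilingual corpus.
    """
    extracted_pairs = list(extracted_pairs)
    pair_counts = Counter(extracted_pairs)
    english_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / english_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical occurrence counts, not taken from any real corpus.
occurrences = ([("zur Konferenz", "to the conference")] * 3
               + [("zu der Konferenz", "to the conference")] * 1
               + [("zur Konferenz", "into the meeting")] * 1)
probs = phrase_probabilities(occurrences)
for pair, p in sorted(probs.items()):
    print(pair, p)
```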

Phrase-Based MT
This is currently the best way to do Statistical MT!
What took so long to move from words to phrases?

Missing RAM
From 25m parameters to billions of parameters
Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
Now solved in commercial deployments

Missing computing power

Many competing ideas to shake out
Koehn 03 summarizes several variations

Empirical effectiveness even better than intuition would predict

This is not building a ladder to the moon!
If you can't translate "real estate" into French, you are sunk

Advanced Training Methods


$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_e P(e) \times P(f \mid e)$$

Advanced Training Methods


$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)}$$

$$\arg\max_e P(e)^{2.4} \times P(f \mid e)$$

works better!

Advanced Training Methods


$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \times P(f \mid e)}{P(f)}$$

$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1}$$


Rewards longer hypotheses, since these are unfairly punished by P(e)

Advanced Training Methods


$$\arg\max_e P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1} \times \mathrm{FEAT}^{3.7}$$

Lots of features vote on every potential translation.
Exponential model.
Problem: How to set the exponent weights?

IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system

[Plot by Emil Ettelaie. Annotations: 20.64% BLEU, 17.96% BLEU, WTM fixed at 1.0.]
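In log space the exponents become weights on a sum, which is how such a model is usually scored. A minimal sketch: the weights 2.4 and 1.1 come from the slides, the translation-model weight of 1.0 is the implicit exponent on P(f | e), and every feature value and candidate string below is made up.

```python
import math

def score(features, weights):
    """Log of the weighted product: sum of weight_i * log(feature_i)."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

# Exponents from the slides; "tm" has an implicit exponent of 1.0.
weights = {"lm": 2.4, "tm": 1.0, "length": 1.1}

# Hypothetical candidates with invented P(e), P(f | e), and length(e).
candidates = {
    "the gunman was shot": {"lm": 1e-6, "tm": 1e-4, "length": 4},
    "gunman shot": {"lm": 1e-5, "tm": 1e-7, "length": 2},
}

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best)
```

Further features would enter the same way, as one more named value with its own weight. Choosing the weights is the open problem the slide raises; this sketch only shows how a fixed set of weights ranks candidates.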

Maximum BLEU Training


Novel algorithm developed by [Och 03]
Opened gates to feature hacking
Word-based feature to smooth phrase pair counts (Model1 Inverse)
Phrase-specific propensities to re-order

Currently limited to ~25 features

Advances in Statistical MT 2005

Google's Language Model

Previously, largest language model was trained on 1b words of English
20b words of news: significant impact on news translation
200b words of web: helpful

Maryland's Hiero system [Chiang 05]

Previously:
ne mange pas → does not eat

New phrase pairs with variables and reordering
ne X pas → does not X
le X1 du X2 → X2 's X1

Nesting
"does not X" itself becomes an X

CKY decoder

John Cocke
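Phrase pairs with variables can be applied recursively, which is what makes nesting possible. A minimal sketch using the two rules from the slide; it takes the first derivation it finds by simple backtracking and is not a CKY decoder. The single-word entries (mange, livre, professeur) and the names `match` and `translate` are assumptions for illustration.

```python
VARIABLES = {"X", "X1", "X2"}

def match(pattern, tokens):
    """Yield bindings of variables to non-empty spans so that pattern covers tokens."""
    if not pattern:
        if not tokens:
            yield {}
        return
    head, rest = pattern[0], pattern[1:]
    if head in VARIABLES:
        for i in range(1, len(tokens) + 1):
            for binding in match(rest, tokens[i:]):
                yield {head: tokens[:i], **binding}
    elif tokens and tokens[0] == head:
        yield from match(rest, tokens[1:])

def translate(tokens, rules):
    """Return the first derivation found, or None if no rules cover the input."""
    for source, target in rules:
        for binding in match(source, tokens):
            # Translate each variable's span recursively (nesting).
            filled = {var: translate(span, rules) for var, span in binding.items()}
            if None in filled.values():
                continue
            output = []
            for symbol in target:
                output.extend(filled.get(symbol, [symbol]))
            return output
    return None

rules = [
    (["ne", "X", "pas"], ["does", "not", "X"]),
    (["le", "X1", "du", "X2"], ["X2", "'s", "X1"]),
    # Hypothetical single-word entries, added so the example runs end to end.
    (["mange"], ["eat"]),
    (["livre"], ["book"]),
    (["professeur"], ["teacher"]),
]

print(translate("ne mange pas".split(), rules))
print(translate("le livre du professeur".split(), rules))
```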

ISI's Syntax-Based MT System

First strong showing for an SMT system that knows what nouns and verbs are!
Why syntax?
Frequent high-tech exports are bright spots for foreign trade growth of Guangdong has made important contributions.

Need much more grammatical output
Need accurate control over re-ordering
Need accurate insertion of function words

String Output
The gunman killed by police .

Tree Output
The gunman killed by police .
POS tags: DT NN VBD IN NN
Constituents: NPB, PP, NP-C, VP, S

Tree Output
Gunman by police shot .
POS tags: NN IN NN VBD
Constituents: NPB, PP, NP-C, VP, S

Tree Output
The gunman was killed by police .
POS tags: DT NN AUX VBN IN NN
Constituents: NPB, PP, NP-C, VP, S

Sample Rules Learned from Data


Rule: VP(VB(said) SBAR(IN(that) x0:S))
Foreign-side strings: "" "," x0 "" x0 "" "" "," x0 "" "," x0 x0
Probabilities: 0.57 0.09 0.02 0.02 0.02

Rule: NP(x0:NP PP(IN(from) x1:NP))
Foreign-side strings: x1 x0 "" x1 x0 x1 "" x0 "" x1 x0 "" x1 "" x0
Probabilities: 0.27 0.15 0.06 0.06 0.06

Sample Rules Learned from Data


S(x0:NP VP(x1:VB x2:NP)) (Chinese/English)
x0 x1 x2: 0.82
x0 x1 "," x2: 0.02

S(x0:NP VP(x1:VB x2:NP)) (Arabic/English)
x0 x1 x2: 0.54
x1 x0 x2: 0.44 (subject-verb inversion)

Format is Expressive
Phrasal Translation: VP(VBZ(is) VBG(singing)) → está, cantando
Non-constituent Phrases: S(PRO(there) VP(VB(are) x0:NP)) → hay, x0
Non-contiguous Phrases: VP(VB(put) x0:NP PRT(on)) → poner, x0
Context-Sensitive Word Insertion: NPB(DT(the) x0:NNS) → x0
Multilevel Re-Ordering: S(x0:NP VP(x1:VB x2:NP2)) → x1, x0, x2
Lexicalized Re-Ordering: NP(x0:NP PP(P(of) x1:NP)) → x1, , x0

[Knight & Graehl, 2005]

Story Gets More Interesting


Applications: MT
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting


Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting


Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT (05), Compression (01), QA (03), Generation (00)
Automata Theory: Tree Transducers (Rounds 70)

Story Gets More Interesting


Linguistic Theory: Transformational Grammar (Chomsky 57)
Applications: MT (05), Compression (01), QA (03), Generation (00)
Automata Theory: Tree Transducers (Rounds 70)
Algorithms: Efficient Transducer Algorithms, Generic Tree Toolkits

Summary
Making good progress
Algorithms + Data + Evaluation + Computers
Interdisciplinary work: natural language processing, machine learning, linguistics, automata theory

Hope that more people will join!

Thank you

Syntax-Based vs Phrase-Based
[Figure: BLEU (axis from 15 to 35) plotted over time, Mar 1 to May 1, 2005, with the phrase-based system marked for comparison. Chinese/English, NIST 2002 Test Set.]

Future PhD Theses?


Syntax-based Language Models for Improving Statistical MT
Discriminative Training of Millions of Features for MT
Semantic Representations Induced from Multilingual EU and UN Data
What Makes One Language Pair More Difficult to Translate Than Another
A State-of-the-Art MT System Based on Syntactic Transformations
New Training Methods for High-Quality Word Alignment
+ many unpredictable ones

Summary
Phrase-based models are state-of-the-art
Word alignments
Phrase pair extraction & probabilities
N-gram language models
Beam search decoding
Feature functions & learning weights

But the output is not English
Fluency must be improved
Better translation of person names, organizations, locations
More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
Need good accuracy across a variety of domains/languages

Available Resources
Bilingual corpora
100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
Lots of French/English, Spanish/French/English, LDC
European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)

Sentence alignment
Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
Xiaoyi Ma, LDC (Champollion)

Word alignment
GIZA, JHU Workshop 99 (www.clsp.jhu.edu/ws99/projects/mt/)
GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
Shared task, NAACL-HLT03 workshop

Decoding
ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
ISI Pharaoh phrase-based decoder

Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

Some Papers Referenced on Slides


ACL
[Och, Tillmann, & Ney, 1999] [Och & Ney, 2000] [Germann et al, 2001] [Yamada & Knight, 2001, 2002] [Papineni et al, 2002] [Alshawi et al, 1998] [Collins, 1997] [Koehn & Knight, 2003] [Al-Onaizan & Knight, 2002] [Och & Ney, 2002] [Och, 2003] [Koehn et al, 2003]

AMTA
[Soricut et al, 2002] [Al-Onaizan & Knight, 1998]

EACL
[Cmejrek et al, 2003]

Computational Linguistics
[Brown et al, 1993] [Knight, 1999] [Wu, 1997]

AAAI
[Koehn & Knight, 2000]

IWNLG
[Habash, 2002]

EMNLP
[Marcu & Wong, 2002] [Fox, 2002] [Munteanu & Marcu, 2002]

MT Summit
[Charniak, Knight, Yamada, 2003]

NAACL
[Koehn, Marcu, Och, 2003] [Germann, 2003] [Graehl & Knight, 2004] [Galley, Hopkins, Knight, Marcu, 2004]

AI Magazine
[Knight, 1997]

www.isi.edu/~knight
[MT Tutorial Workbook]

Ready-to-Use Online Bilingual Data


[Chart: millions of words (English side), axis from 0 to 140, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Ready-to-Use Online Bilingual Data


[Chart: millions of words (English side), axis from 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

+ 1m-20m words for many language pairs

Ready-to-Use Online Bilingual Data


[Chart: millions of words (English side), axis from 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English, with the continuation beyond 2004 marked "???".]

One Billion?

From No Data to Sentence Pairs


Easy way: Linguistic Data Consortium (LDC)

Really hard way: pay $$$
Suppose one billion words of parallel data were sufficient
At 20 cents/word, that's $200 million

Pretty hard way: Find it, and then earn it!
De-formatting
Remove strange characters
Character code conversion
Document alignment
Sentence alignment
Tokenization (also called Segmentation)

Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.

Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.

1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.

1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
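The merge-and-drop step can be sketched as follows. This is a minimal sketch that assumes the alignment itself is already known (here written by hand to match the example); finding the alignment automatically is a separate problem that this code does not attempt. `build_sentence_pairs` is an illustrative name.

```python
def build_sentence_pairs(english, spanish, alignment):
    """Turn aligned sentence groups into sentence pairs.

    alignment lists (english_indices, spanish_indices) groups. Sentences in
    the same group are merged (n-to-m, with n, m > 0); sentences that appear
    in no group are dropped.
    """
    pairs = []
    for e_idx, s_idx in alignment:
        if not e_idx or not s_idx:
            continue  # a group must have at least one sentence on each side
        pairs.append((" ".join(english[i] for i in e_idx),
                      " ".join(spanish[j] for j in s_idx)))
    return pairs

english = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The fish are jumping.",
           "The sharks await."]
spanish = ["El viejo está feliz porque ha pescado muchas veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]
# Hand-written alignment: 2-to-1, 1-to-1, (sentence 4 unaligned), 1-to-1.
alignment = [([0, 1], [0]), ([2], [1]), ([4], [2])]

pairs = build_sentence_pairs(english, spanish, alignment)
for e, s in pairs:
    print(e, "|||", s)
```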

Tokenization (or Segmentation)


English
Input (some byte stream): "There," said Bob.
Output (7 tokens or words): " There , " said Bob .

Chinese
Input (byte stream):
Output:

Lower-Casing
English
Input (7 words): " There , " said Bob .
Output (7 words): " there , " said bob .

Idea of tokenizing and lower-casing:
[Diagram: surface variants such as "The" and "the" all map to the single token "the".]
Smaller vocabulary size.
More robust counting and learning.
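For the English example, tokenization and lower-casing can be sketched with a regular expression. This is a minimal sketch that splits off every punctuation mark as its own token; it does not address Chinese segmentation, and the function names are illustrative.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def lowercase(tokens):
    """Lower-case every token."""
    return [token.lower() for token in tokens]

tokens = tokenize('"There," said Bob.')
print(len(tokens), " ".join(tokens))
print(" ".join(lowercase(tokens)))
```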

Recent Progress in Statistical MT


Why is that?
Better algorithms that learn patterns from data
More data
Faster, cheaper computers with more RAM
Community-wide test sets
Novel automated evaluation methods
Shared software tools

Three Problems for Statistical MT


Translation model
Given a pair of strings <f,e>, assigns P(f | e) by formula
<f,e> look like translations → high P(f | e)
<f,e> don't look like translations → low P(f | e)

Language model
Given an English string e, assigns P(e) by formula
good English string → high P(e)
random word sequence → low P(e)

Decoding algorithm
Given a language model, a translation model, and a new sentence f, find translation e maximizing P(e) x P(f | e)
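The language-model requirement (good English gets high P(e), a random word sequence gets low P(e)) can be illustrated with a tiny bigram model. This is a minimal sketch: the choice of a bigram model with add-one smoothing is an assumption made here for brevity, and the three training sentences are invented.

```python
import math
from collections import Counter

class BigramLM:
    """Bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.contexts = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Words that can follow a context, including the end marker.
        self.vocab = {w for s in sentences for w in s.split()} | {"</s>"}

    def log_prob(self, sentence):
        """Return log P(e) as a sum of smoothed bigram log-probabilities."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        v = len(self.vocab)
        return sum(math.log((self.bigrams[(a, b)] + 1) / (self.contexts[a] + v))
                   for a, b in zip(tokens, tokens[1:]))

# Invented training data, for illustration only.
lm = BigramLM(["the gunman was shot by the police",
               "the police killed the gunman",
               "the gunman was killed"])

good = lm.log_prob("the gunman was shot by the police")
scrambled = lm.log_prob("police the by shot was gunman the")
print(good, scrambled)
```

Both strings contain exactly the same words, so only the word order separates them; the model prefers the order it has seen.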

Web Language Models


French input
?
She has a lot of nerve. [20]
It has a lot of nerve. [3]

[Soricut, Knight, Marcu, 02]

Used by Google in 2005 to increase performance of their research MT system!
