
Attention-based Neural Machine Translation - Assignment 2 Report

Sourav Mishra
Dept.: CDS
Course: MTech
SR No.: 15700
Date of Submission: 28/03/2019
souravmishra@iisc.ac.in

1 Introduction

This document studies the use of attention mechanisms for neural machine translation. The following mechanisms are studied:

• Additive Attention (Bahdanau et al., 2014)

• Multiplicative Attention (Luong et al., 2015)

• Scaled Dot Product Attention (Vaswani et al., 2017)

2 Datasets

The training dataset was obtained from http://www.statmt.org/europarl/ (Koehn, 2005). The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The parallel German-English corpus is used for training. There are two files, europarl-v7.de-en.de and europarl-v7.de-en.en, containing the German and English sides respectively. These are of sizes 328.5 MB and 287.3 MB.

The test dataset was obtained from http://www.statmt.org/wmt14/test-full.tgz. It contains several files. Since we are training a German-to-English machine translation model, we only use the required files: newstest2014-deen-src.de.sgm for German and newstest2014-deen-src.en.sgm for English. These are of sizes 466.2 kB and 426.7 kB respectively.

The sentences in these files start and end with SGML tags, unlike the training data, so we use a different sentence-level preprocessing function for the test data. However, due to an error in preprocessing, the recommended test set could not be used, so the training set is split into training and testing sets.

3 Procedure

3.1 Libraries Used

The following libraries of Python 3.6 are used:

• matplotlib
• unicodedata
• numpy
• os
• io
• time
• nltk
• tensorflow
• re

3.2 Preprocessing

The following preprocessing steps are applied (a sketch of the resulting preprocessing function is given after this list):

• Remove the SGML tags at the beginning and end of sentences.

• Add a start and end token to each sentence.

• Clean the sentences by removing special characters.

• Create a word index and a reverse word index (dictionaries mapping from word to id and from id to word).

• Pad each sentence to a maximum length.
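As a minimal sketch of these steps, assuming cleaning rules in the style of the TensorFlow NMT tutorial that this report follows (the exact regular expressions, the <start>/<end> token spellings, and the function names are assumptions):

```python
import re
import unicodedata

import tensorflow as tf


def unicode_to_ascii(s):
    # Strip accent marks, e.g. "ä" becomes "a".
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def preprocess_sentence(sentence):
    s = unicode_to_ascii(sentence.lower().strip())
    s = re.sub(r'<[^>]*>', ' ', s)        # remove SGML tags (test data only)
    s = re.sub(r'([?.!,])', r' \1 ', s)   # separate punctuation from words
    s = re.sub(r'[^a-z?.!,]+', ' ', s)    # remove special characters
    s = s.strip()
    return '<start> ' + s + ' <end>'      # add start and end tokens


def tokenize(sentences):
    # Build the word index / reverse word index and pad to the maximum length.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    tokenizer.fit_on_texts(sentences)
    tensor = tokenizer.texts_to_sequences(sentences)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, tokenizer


# Example: preprocess_sentence("Herr Präsident!") -> '<start> herr prasident ! <end>'
```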

In order to speed up the training process, the first 100000 lines were taken from the training set, and of these, the sentence pairs with length greater than 15 were dropped. This resulted in a training dataset of around 14200 pairs.
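A small sketch of this filtering step, assuming the corpus has already been read into a list of (German, English) sentence-pair strings; the function and parameter names are illustrative:

```python
def filter_pairs(pairs, max_pairs=100000, max_len=15):
    """Keep the first max_pairs sentence pairs and drop any pair in which
    either side is longer than max_len tokens."""
    kept = []
    for de, en in pairs[:max_pairs]:
        if len(de.split()) <= max_len and len(en.split()) <= max_len:
            kept.append((de, en))
    return kept
```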
After training on this set, a larger set was carved out of the original training set, and the first 14200 lines were removed after preprocessing to obtain the test set. The resulting test set was of size 615.

3.3 Model Architecture

In order to have a faster training process and to mimic the human way of translating, a unidirectional GRU encoder-decoder network is used together with the attention models listed earlier, as opposed to the commonly used biLSTM encoder. Unlike a unidirectional model, a biLSTM has to look at the entire sequence before predicting.

The network looks like the image shown below (a code sketch of the encoder follows the figure):

Figure 1: Encoder Decoder Network (Courtesy: TensorFlow tutorials)
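As a concrete illustration, the sketch below shows how such a unidirectional GRU encoder can be written with tf.keras, in the style of the TensorFlow tutorial that Figure 1 is taken from; the class name is illustrative and the default sizes match the training parameters listed in Section 3.4.

```python
import tensorflow as tf


class Encoder(tf.keras.Model):
    """Unidirectional GRU encoder: embeds the source tokens and returns the
    per-timestep outputs (consumed by the attention layer) and the final state."""

    def __init__(self, vocab_size, embedding_dim=256, enc_units=1024):
        super(Encoder, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)                              # (batch, src_len, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)  # (batch, src_len, units), (batch, units)
        return output, state
```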

The attention mechanisms differ in how the scores are calculated. The following figures show the attention equations and their mechanisms; a sketch of the additive score function as a layer is given after the figures:

Figure 2: Attention Computations (Courtesy: TensorFlow tutorials)

Figure 3: Attention Score equations (Courtesy: Lilian Weng)

Figure 4: Attention Mechanisms (Courtesy: Medium)
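To make the score functions summarized in Figure 3 concrete, here is a minimal sketch of the additive (Bahdanau) variant as a layer, again in the style of the TensorFlow tutorial; the comments note how the multiplicative and scaled dot-product scores differ. Layer and variable names are illustrative.

```python
import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(s_t, h_i) = v^T tanh(W1 s_t + W2 h_i)."""

    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state s_t, shape (batch, units)
        # values: encoder outputs h_1..h_n, shape (batch, src_len, units)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        # Multiplicative attention would instead use s_t^T W h_i, and scaled
        # dot-product attention (s_t^T h_i) / sqrt(units).
        attention_weights = tf.nn.softmax(score, axis=1)   # softmax over source positions
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```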

3.4 Training Parameters

The following settings were used for training the model:

• training set size = 14133
• maximum length of sentences = 14
• GRU encoder and decoder units = 1024
• embedding size = 256
• epochs = 5
• batch size = 32
• input language = German
• target language = English
• English vocabulary size = 10583
• German vocabulary size = 7164
• loss function = Sparse Categorical Cross Entropy

The models were trained on Google Colab. For comparing performance, the BLEU score is chosen; specifically, the function nltk.translate.bleu_score.corpus_bleu is used, as sketched below.
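A minimal sketch of this evaluation call, assuming the reference translations and model outputs have already been tokenized into word lists (the example sentence is taken from the Results section; the smoothing function is an assumption for very short outputs):

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# One list of reference translations per sentence, and one hypothesis per sentence.
references = [[['i', 'have', 'still', 'not', 'had', 'a', 'reply',
                'to', 'these', 'questions', '.']]]
hypotheses = [['i', 'have', 'no', 'reason', 'for', 'the', 'council', '.']]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print('corpus BLEU:', score)
```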
4 Results
Three models were trained, one for each attention mechanism. The observations are tabulated in Table 1.
The following plots (Figures 5, 6 and 7) show
the learning curves for the attention models:

Figure 5: Learning Curve for Additive Attention

Figure 6: Learning Curve for Multiplicative Attention

Figure 7: Learning Curve for Scaled Dot Product Attention

The lines below are obtained from the machine translation task on the training and test sets respectively, for the different models.

Training set:
Input: b'<start> herr prasident ! die intermodalitat steckt noch in den kinderschuhen . <end>'
The correct translation is: <start> mr president , intermodality is still in its infancy . <end>
Additive Attention translation: mr president , the question is programmed for the question of the question of
Multiplicative Attention translation: mr president , on january , on january , on january , on january
Scaled Dot Product Attention translation: mr president , the rules of order of order . <end>

Test set:
Input: b'<start> ich habe noch immer keine antwort auf diese fragen erhalten . <end>'
The correct translation is: <start> i have still not had a reply to these questions . <end>
Additive Attention translation: i have no reason for the council . <end>
Multiplicative Attention translation: i have no objection of the commissioner . <end>
Scaled Dot Product Attention translation: i have no objection to be a few reasons in this issue . <end>

The figures below show the attention weights for the above sentences. Figures 8, 9 and 10 denote the attention weights for the sentences on the training set.
Attention type        Training time (s)    Cost      BLEU (training set)    BLEU (test set)
Additive                   7728.728        1.2272         0.722671              0.722237
Multiplicative             6517.002        1.0979         0.722671              0.722237
Scaled Dot Product         6718.238        1.1746         0.722671              0.722237

Table 1: Results obtained from the different attention models.

Figure 8: Attention weights for Additive Attention

Figure 9: Attention weights for Multiplicative Attention

Figure 10: Attention weights for Scaled Dot Product Attention

Figures 11, 12 and 13 denote the attention weights for the sentences on the test set.

Figure 11: Attention weights for Additive Attention

Figure 12: Attention weights for Multiplicative Attention

Figure 13: Attention weights for Scaled Dot Product Attention

5 Conclusion

The BLEU scores are not satisfactory and are the same for all attention models. This is due to the small amount of training data and the small number of training epochs used to speed up training.

The attention weight plots show which parts of the input sentences were crucial for the translation task. This shows that attention-based models learn how to align sequences on their own, in a manner somewhat similar to humans.

Currently only very small overlaps with the reference translations are obtained, as seen from the outputs above, but improvements are expected with a larger number of training epochs and more data.

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86. AAMT.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

