
Neural Machine Translation

Alok Debnath
March 12, 2019

1 Sutskever, 2014: Seq2Seq Learning with Neural Networks
1.1 Approach
• Encoder-decoder model
• Encoder uses a multilayered LSTM to map the input sentence to a vector of fixed dimensionality.
• Decoder uses another multilayered LSTM to "decode" the target sequence from that vector.

1.2 Introduction
LSTMs are known to solve sequence-to-sequence problems. The idea used here is that one LSTM encodes a sentence into a fixed-dimensional vector, while another LSTM acts as a language model conditioned on the input sequence.

One of the major modifications made in their model is that they reversed the order of the source sentence, but not the target sentence.1

Sentence meaning is aggregated at each step along the network, so a variable-length sentence is mapped to a fixed-length vector representation.

1 This works under the assumption that reversing the source sentence order introduces some short-term dependencies that are not captured in the original order.

1.3 Model Architecture
RNNs compute a sequence of outputs {y_1, y_2, ..., y_T} from a sequence of inputs {x_1, x_2, ..., x_T} by iterating the following equations for t = 1 to T:

    h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
    y_t = W^{yh} h_t
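
A minimal sketch of this recurrence in Python/NumPy; the matrix shapes, initialization, and sequence here are illustrative assumptions, not taken from the paper:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Illustrative dimensions (assumptions).
    input_dim, hidden_dim, output_dim = 8, 16, 10
    rng = np.random.default_rng(0)
    W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    W_yh = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

    def rnn_forward(xs):
        """Run the vanilla RNN recurrence over a list of input vectors xs."""
        h = np.zeros(hidden_dim)
        ys = []
        for x in xs:
            h = sigmoid(W_hx @ x + W_hh @ h)   # h_t = sigm(W^hx x_t + W^hh h_{t-1})
            ys.append(W_yh @ h)                # y_t = W^yh h_t
        return ys, h  # h is the final hidden state (sentence summary)

    ys, summary = rnn_forward([rng.normal(size=input_dim) for _ in range(5)])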
Note that the information in each hidden state is an aggregation of the sentence meaning up to that point. Therefore the hidden state at the <EOS> token holds the aggregated meaning of the entire sentence.
We use LSTMs or GRUs instead of plain RNNs in order to mitigate the vanishing gradient problem.2 "LSTMs" (used here as a cover term for both LSTMs and GRUs) preserve more than just short-range dependencies.
The LSTM estimates the conditional probability p(y_1, ..., y_{T'} | x_1, ..., x_T), where (x_1, ..., x_T) is the input sequence and (y_1, ..., y_{T'}) is the output sequence, whose length T' may differ from T.
The standard LSTM-LM formulation is as follows:

    p(y_1, ..., y_{T'} | x_1, ..., x_T) = ∏_{t=1}^{T'} p(y_t | v, y_1, ..., y_{t-1})

where v is the fixed-dimensional representation of the input sequence computed by the encoder. Each p(y_t | v, y_1, ..., y_{t-1}) is a softmax over all the words in the vocabulary.
The paper reiterates that the input sequence in their experiments is reversed, so that the first source words, which would otherwise be farthest from the decoder, now end up close to the first words of the output sequence. The model uses two LSTMs of four hidden layers each, and due to this reversal it depends heavily on the language model for producing the later words of the translation.
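
A rough end-to-end sketch of the encode/decode flow with input reversal. This is only a schematic under stated assumptions: plain tanh RNN cells stand in for the paper's four-layer LSTMs, greedy decoding replaces beam search, and all sizes and weights are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    V, H = 12, 16  # illustrative vocabulary and hidden sizes (assumptions)
    EOS = 0

    def make_rnn():
        return {"Wx": rng.normal(scale=0.1, size=(H, V)),
                "Wh": rng.normal(scale=0.1, size=(H, H)),
                "Wy": rng.normal(scale=0.1, size=(V, H))}

    enc, dec = make_rnn(), make_rnn()

    def one_hot(i):
        v = np.zeros(V); v[i] = 1.0; return v

    def step(p, x, h):
        h = np.tanh(p["Wx"] @ x + p["Wh"] @ h)
        logits = p["Wy"] @ h
        probs = np.exp(logits - logits.max()); probs /= probs.sum()  # softmax
        return h, probs

    def translate(source_ids, max_len=10):
        # Encode the *reversed* source sentence into a fixed vector v.
        h = np.zeros(H)
        for i in reversed(source_ids):
            h, _ = step(enc, one_hot(i), h)
        v = h
        # Decode greedily, conditioning on v through the initial hidden state.
        out, prev, h = [], EOS, v
        for _ in range(max_len):
            h, probs = step(dec, one_hot(prev), h)
            prev = int(np.argmax(probs))
            if prev == EOS:
                break
            out.append(prev)
        return out

    print(translate([3, 5, 7, 2]))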

1.4 Experiments
1.4.1 Experiment Statistics
• Dataset: WMT’14
• Model details: 4 layers, 1000 cells per layer, 1000 dimensional word em-
beddings.
• Initialization: parameters drawn from a uniform distribution between -0.08 and 0.08
• Epochs and SGD learning rate: fixed at 0.7 for the first 5 epochs, then halved every half epoch thereafter; 7.5 epochs of training in total.
2 The vanishing gradient problem means that an RNN captures only the short-range dependencies of a language model or information-aggregation model, as there is no mechanism to indicate that an input further in the past is more important than a more recent one.

• Gradient clipping is done to avoid exploding gradients: with g the gradient divided by 128 (the minibatch size), compute s = ||g||_2; if s > 5, set g = 5g/s (see the sketch after this list).
• Minibatches were built so that all sentences within a minibatch have roughly the same length, yielding a 2x speedup.
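
A minimal sketch of this norm-based clipping rule in Python/NumPy; the threshold of 5 and the division by the minibatch size of 128 follow the bullet above, everything else is illustrative:

    import numpy as np

    def clip_gradient(raw_grad, batch_size=128, threshold=5.0):
        """Rescale the gradient so its L2 norm never exceeds `threshold`."""
        g = raw_grad / batch_size          # g: gradient divided by the minibatch size
        s = np.linalg.norm(g)              # s = ||g||_2
        if s > threshold:
            g = threshold * g / s          # rescale so that ||g||_2 == threshold
        return g

    g = clip_gradient(np.random.default_rng(2).normal(size=1000) * 100)
    print(np.linalg.norm(g))  # <= 5.0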

1.4.2 Model Information


Given a source sentence S, and a translation T , the goal of training was to
maximize the log probability of the correct translation. Therefore:
• Training objective (the sum runs over the source-translation pairs in the training set S):

    1/|S| Σ_{(T,S)∈S} log p(T | S)

• After training, decoding finds the most likely translation:

    T̂ = argmax_T p(T | S)

• Partial hypotheses are stored and the most likely ones are extended by a left-to-right beam-search decoder; the model's log probability is used to rank the hypotheses kept on the beam (see the sketch after this list).
• Reversing the sentence leads to a significant drop in perplexity.3
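
A rough sketch of left-to-right beam search over a next-word scoring function; the `log_prob_next` callable and the toy vocabulary are hypothetical placeholders, not the paper's decoder:

    def beam_search(log_prob_next, vocab, eos, beam_size=12, max_len=20):
        """log_prob_next(prefix, word) -> log p(word | prefix); returns the best hypothesis."""
        beams = [((), 0.0)]                      # (partial hypothesis, cumulative log-prob)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for w in vocab:
                    candidates.append((prefix + (w,), score + log_prob_next(prefix, w)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:beam_size]:
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:
                break
        return max(finished + beams, key=lambda c: c[1])

    # Toy usage with a hypothetical scoring function.
    vocab = ["a", "b", "c", "<eos>"]
    def toy_scorer(prefix, w):
        return -0.1 if w == "<eos>" and len(prefix) >= 3 else -len(prefix) - vocab.index(w)
    print(beam_search(toy_scorer, vocab, "<eos>"))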

1.5 Major Drawbacks
• Aggregating the whole sentence meaning into a single vector at the end does not capture the role each source word plays in its translation.
• The fixed-length representation leads to information loss on longer sentences.

2 Bahdanau et al., 2014: NMT by Jointly Learning to Align and Translate
2.1 Approach
• Attention mechanism on top of the encoder-decoder model
• Variable-length, localized context vectors used for meaning aggregation
• Introduced a bidirectional encoder and an explicit alignment function.
3 The hypothesis provided here is that multiple short-term dependencies are introduced which do not normally exist. I do not believe that. I think that reversing the sentence makes the conditional generation of the first few words of the LSTM-LM much easier, and the later words then depend more on the language model than on the input.

2.2 Introduction
The paper introduces a bidirectional RNN with explicit weights for the two directions as an encoder, and a search-based decoder that emulates searching through the source sentence.
Unlike the model presented above, this model has two major improvements:
• The encoder captures both forward and backward context, allowing context aggregation from both sides and letting the model weigh context words that occur after the focus word as well.
• Individual words from the entire source sentence are assigned a weight, which determines their relative importance to a given focus word.

2.3 Model Architecture
2.3.1 Decoder Design
The paper defines the conditional probability of each output word as

    p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

where y_i is the i-th output word, x is the input sequence, s_i is the RNN hidden state at time i, and c_i is the context vector. The hidden state itself is computed as

    s_i = f(s_{i-1}, y_{i-1}, c_i)

The context vector c_i is a weighted sum of the encoder annotations h_j. Note, however, that each h_j here is an "annotation" that represents the aggregated meaning of the sentence around position j, with a strong focus on the j-th word:
    c_i = Σ_{j=1}^{T_x} α_{ij} h_j

The weight α_{ij} of each annotation h_j is computed by

    α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})

where

    e_{ij} = a(s_{i-1}, h_j)

is an alignment model.
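
A minimal NumPy sketch of this attention step, computing α_{ij} and c_i from given alignment scores; the dimensions and the random annotations are illustrative assumptions:

    import numpy as np

    def attention_context(scores, annotations):
        """scores: e_{i1..iTx} for one decoder step i; annotations: (T_x, d) matrix of h_j."""
        e = scores - scores.max()                 # for numerical stability
        alpha = np.exp(e) / np.exp(e).sum()       # alpha_{ij}: softmax over source positions
        return alpha @ annotations, alpha         # c_i = sum_j alpha_{ij} h_j

    rng = np.random.default_rng(3)
    T_x, d = 6, 8                                 # illustrative source length and annotation size
    h = rng.normal(size=(T_x, d))                 # annotations h_1 .. h_Tx
    e = rng.normal(size=T_x)                      # alignment scores e_{ij} = a(s_{i-1}, h_j)
    c_i, alpha = attention_context(e, h)
    print(alpha.sum())  # 1.0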

2.3.2 Encoder Design
The encoder is a BiRNN with two distinct sets of hidden states: the forward hidden states (→h_1, →h_2, ..., →h_{T_x}) and the backward hidden states (←h_1, ←h_2, ..., ←h_{T_x}). The annotation of each word is the concatenation of its forward and backward hidden states:

    h_j = [→h_j^T ; ←h_j^T]^T

These are referred to as local annotation summaries.
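
A NumPy sketch of building the annotations from a forward and a backward pass; simple tanh RNN cells stand in for the paper's gated units, and all sizes and weights are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(7)
    input_dim, n = 8, 16                     # illustrative sizes
    Wf, Uf = rng.normal(scale=0.1, size=(n, input_dim)), rng.normal(scale=0.1, size=(n, n))
    Wb, Ub = rng.normal(scale=0.1, size=(n, input_dim)), rng.normal(scale=0.1, size=(n, n))

    def run_rnn(xs, W, U):
        h, states = np.zeros(n), []
        for x in xs:
            h = np.tanh(W @ x + U @ h)
            states.append(h)
        return states

    def annotate(xs):
        """Return h_j = [forward h_j ; backward h_j] for every source position j."""
        fwd = run_rnn(xs, Wf, Uf)                                   # left-to-right pass
        bwd = list(reversed(run_rnn(list(reversed(xs)), Wb, Ub)))   # right-to-left pass
        return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

    xs = [rng.normal(size=input_dim) for _ in range(6)]
    annotations = annotate(xs)               # six 2n-dimensional annotations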

2.4 Experiments
2.4.1 Experiment Statistics
• Dataset: WMT’14

• Models: encoder-decoder and RNNsearch, 1000 hidden units, single maxout hidden layer for the conditional probability.

2.4.2 Model Information
Gated Hidden Units A gated hidden unit is used in order to determine the relative importance of the s_{i-1} term in the computation of the new state s_i:

    s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) ∘ s_{i-1} + z_i ∘ s̃_i

where ∘ is element-wise multiplication, z_i is the output of the update gate, and s̃_i is the proposed (candidate) state:

    s̃_i = tanh(W e(y_{i-1}) + U [r_i ∘ s_{i-1}] + C c_i)

This is a simple MLP-style update, with e(y_{i-1}) being an m-dimensional embedding of the previous target word y_{i-1}. The paper also introduces the reset gate r_i and the update gate z_i:

    z_i = σ(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i)
    r_i = σ(W_r e(y_{i-1}) + U_r s_{i-1} + C_r c_i)

where σ(·) is the logistic sigmoid function.
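
A sketch of this gated update in NumPy, with all weight matrices randomly initialized for illustration; the shapes are assumptions, and the update follows the equations above:

    import numpy as np

    rng = np.random.default_rng(4)
    m, n, d = 8, 16, 10            # embedding, hidden, and context sizes (illustrative)
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    W, U, C = (rng.normal(scale=0.1, size=s) for s in [(n, m), (n, n), (n, d)])
    Wz, Uz, Cz = (rng.normal(scale=0.1, size=s) for s in [(n, m), (n, n), (n, d)])
    Wr, Ur, Cr = (rng.normal(scale=0.1, size=s) for s in [(n, m), (n, n), (n, d)])

    def gated_step(s_prev, e_y_prev, c_i):
        """One decoder state update with reset gate r_i and update gate z_i."""
        z = sigma(Wz @ e_y_prev + Uz @ s_prev + Cz @ c_i)             # update gate
        r = sigma(Wr @ e_y_prev + Ur @ s_prev + Cr @ c_i)             # reset gate
        s_tilde = np.tanh(W @ e_y_prev + U @ (r * s_prev) + C @ c_i)  # candidate state
        return (1 - z) * s_prev + z * s_tilde                         # s_i

    s_i = gated_step(np.zeros(n), rng.normal(size=m), rng.normal(size=d))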

Alignment Model The alignment model needs to be evaluated a total of T_x × T_y times for a source sentence of length T_x and an output sentence of length T_y. The paper therefore uses a simple MLP (see the sketch below):

    a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)

where W_a and U_a are weight matrices and v_a is a weight vector.
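
A NumPy sketch of this single-hidden-layer alignment scorer; the dimensions n (decoder state), 2n (annotation), and n' (alignment hidden size) are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    n, two_n, n_prime = 16, 32, 12                 # illustrative sizes
    W_a = rng.normal(scale=0.1, size=(n_prime, n))
    U_a = rng.normal(scale=0.1, size=(n_prime, two_n))
    v_a = rng.normal(scale=0.1, size=n_prime)

    def alignment_score(s_prev, h_j):
        """e_ij = a(s_{i-1}, h_j): how well inputs around position j match output position i."""
        return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)

    # Score every source annotation against the current decoder state.
    s_prev = rng.normal(size=n)
    annotations = rng.normal(size=(6, two_n))      # h_1 .. h_Tx for a 6-word source sentence
    e_i = np.array([alignment_score(s_prev, h_j) for h_j in annotations])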

3 Luong et al., 2015: Effective Approaches to Attention-based NMT
A basic form of NMT consists of an encoder, which computes a representation s for each source sentence, and a decoder, which generates one target word at a time and hence decomposes the conditional probability as:

    log p(y|x) = Σ_{j=1}^{m} log p(y_j | y_{<j}, s)

Decoders use an RNN architecture (as seen above); encoder computations, however, differ significantly between models. The probability of decoding each word y_j is:

    p(y_j | y_{<j}, s) = softmax(g(h_j))

where g is a transformation function that outputs a vocabulary-sized vector, and h_j is computed as:

    h_j = f(h_{j-1}, s)

where f computes the current hidden state given a previous hidden state.
In the attention-based models, the idea is to determine the importance of each of the hidden states throughout the network when producing each target word.4

The global attention model's idea is to consider all hidden states of the encoder when deriving the context vector c_t. In this model, a variable-length alignment vector a_t, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state h̄_s:

    a_t(s) = align(h_t, h̄_s)
           = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))
The score function can take one of three alternative forms:

    score(h_t, h̄_s) = h_t^T h̄_s                      (dot)
    score(h_t, h̄_s) = h_t^T W_a h̄_s                  (general)
    score(h_t, h̄_s) = v_a^T tanh(W_a [h_t ; h̄_s])    (concat)
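
A NumPy sketch of the three scoring variants; the dimensions, random weights, and the v_a^T tanh form of concat are assumptions spelled out above:

    import numpy as np

    rng = np.random.default_rng(6)
    d = 16                                        # illustrative hidden size
    W_a_gen = rng.normal(scale=0.1, size=(d, d))
    W_a_cat = rng.normal(scale=0.1, size=(d, 2 * d))
    v_a = rng.normal(scale=0.1, size=d)

    def score(h_t, h_bar_s, mode="general"):
        """Return a scalar alignment score between a target state and a source state."""
        if mode == "dot":
            return h_t @ h_bar_s
        if mode == "general":
            return h_t @ W_a_gen @ h_bar_s
        if mode == "concat":
            return v_a @ np.tanh(W_a_cat @ np.concatenate([h_t, h_bar_s]))
        raise ValueError(mode)

    h_t, h_bar_s = rng.normal(size=d), rng.normal(size=d)
    print([score(h_t, h_bar_s, m) for m in ("dot", "general", "concat")])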

4 This is defined as global attention. Local attention is also defined in this paper, but that is for languages with the same word order.

Another alignment mode can be computed solely from the target hidden states, as follows:

    a_t = softmax(W_a h_t)

The third score function, concat, was the only one used by the second paper above (Bahdanau et al.), while the other alternatives have since been shown to perform better.

4 Tu et al., 2016: Modeling Coverage for NMT


This paper proposes a coverage mechanism to alleviate the over-translation and under-translation problems. Over-translation is the problem of a word being translated over and over needlessly; under-translation is the problem of some words being left untranslated. The paper builds on Bahdanau et al., 2015.
The term coverage is used in SMT to keep track of which source terms have already been translated. In NMT, the coverage vector summarizes the attention record for h_j, and therefore for a small neighborhood centered at the j-th source word. It discourages further attention to a position that has already been heavily attended, and implicitly pushes the attention toward the less-attended segments of the source sentence.
The model is described by the following update (see the sketch after the definitions below):

    C_{i,j} = g_update(C_{i-1,j}, α_{i,j}, Φ(h_j), Ψ)

where

• g_update(·) is the function that updates C_{i,j} with the new attention α_{i,j} at time step i of the decoding process;
• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention on h_j up to time step i;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.
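
A minimal sketch of one simple instantiation of this idea: a linguistic-style coverage that just accumulates attention mass per source position. The paper explores richer neural g_update functions; this accumulative variant, the names, and the toy attention weights are illustrative assumptions:

    import numpy as np

    def update_coverage(coverage, alpha):
        """Accumulate attention mass: C_{i,j} = C_{i-1,j} + alpha_{i,j} for every source j."""
        return coverage + alpha

    # Toy decoding loop over a 5-word source sentence with made-up attention weights.
    T_x = 5
    coverage = np.zeros(T_x)                       # C_{0,j} = 0 for all j
    for alpha in ([0.7, 0.1, 0.1, 0.05, 0.05],     # decoder step 1 attends mostly to word 0
                  [0.6, 0.2, 0.1, 0.05, 0.05]):    # step 2 attends to word 0 again
        coverage = update_coverage(coverage, np.array(alpha))

    # Positions with high coverage can then be penalized in the next alignment scores,
    # discouraging further attention to already-translated words.
    print(coverage)   # word 0 has accumulated ~1.3 of attention mass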
