Alok Debnath
March 12, 2019
1.2 Introduction
LSTMs are known to solve sequence-to-sequence problems. The idea used here
is that one LSTM encodes a sentence into a fixed-dimensional vector, while
another LSTM acts as a language model conditioned on the input sequence.
One of the major modifications made in their model is that they reversed
the order of the source sentence, but not the target sentence.1
Sentence meaning is aggregated at each step along a network, so a variable
length sentence is mapped to a fixed length vector representation.
1 This works under the assumption that reversing the source sentence order introduces
short-term dependencies between the start of the source and the start of the target which
are not captured in the original order.
1.3 Model Architecture
RNNs compute a sequence of outputs {y_1, y_2, ..., y_T} from a sequence of inputs
{x_1, x_2, ..., x_T} by iteratively computing the equations

h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
y_t = W^{yh} h_t

for t = 1 to T.
Note that the information in each hidden state is an aggregation of the sen-
tence meaning up to that point. Therefore the node that has seen the <EOS> token
has the aggregated meaning of the entire sentence.
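The recurrence above can be sketched in a few lines; the weight shapes and random inputs below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Iterate h_t = sigm(W_hx x_t + W_hh h_{t-1}), y_t = W_yh h_t for t = 1..T."""
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = np.zeros(W_hh.shape[0])        # h_0 = 0
    hs, ys = [], []
    for x in xs:
        h = sigm(W_hx @ x + W_hh @ h)  # hidden state aggregates the prefix so far
        hs.append(h)
        ys.append(W_yh @ h)
    return hs, ys

# illustrative sizes: 3-dim inputs, 4-dim hidden state, 2-dim outputs, T = 5
rng = np.random.default_rng(0)
hs, ys = rnn_forward(rng.normal(size=(5, 3)),
                     rng.normal(size=(4, 3)),
                     rng.normal(size=(4, 4)),
                     rng.normal(size=(2, 4)))
```

The final entry of `hs` plays the role of the fixed-length vector that summarizes the whole input.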
We use LSTMs or GRUs instead of plain RNNs in order to get rid of the vanishing
gradient problem.2 LSTMs (used here as a cover term for both LSTMs and GRUs) are used
in order to preserve more than just short-range dependencies.
The LSTM estimates the conditional probability p(y_1, y_2, ..., y_{T'} | x_1, x_2, ..., x_T),
where (x_1, ..., x_T) is the input sequence and (y_1, ..., y_{T'}) is the output
sequence, whose length T' may differ from T.
The standard LSTM-LM formulation is as follows:

p(y_1, y_2, ..., y_{T'}) = prod_{t=1}^{T'} p(y_t | v, y_1, ..., y_{t-1})

where v is the initial hidden representation of the input sequence. Each p(y_t | v, y_1, ..., y_{t-1})
is a softmax over all the words in the vocabulary.
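The factorization can be checked numerically. The sketch below assumes a hypothetical matrix of per-step vocabulary logits standing in for the conditioned LSTM-LM:

```python
import numpy as np

def sequence_logprob(step_logits, target_ids):
    """log p(y_1..y_T) = sum_t log softmax(logits_t)[y_t]: the chain-rule
    factorization, where each factor is a softmax over the vocabulary."""
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        total += log_probs[y]
    return total

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))      # 4 output steps, toy vocabulary of 10
lp = sequence_logprob(logits, [3, 1, 7, 2])
```

Because each factor is a proper distribution, the result is always a valid (negative) log-probability.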
The paper reiterates that the input sequence in their experiments is reversed,
such that the first source word, which would otherwise be farthest from its
translation, is now retained closer to the first word of the output sequence. The model uses two LSTMs
of four hidden layers each, and due to this reversal, depends heavily on the
language model for producing later words.
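As a preprocessing sketch, the reversal amounts to the following (the (source, target) pair format is an illustrative assumption, not the paper's data layout):

```python
def reverse_source(pairs):
    """Reverse each source sentence; leave the target sentence untouched,
    so the encoder reads the source in reverse order."""
    return [(src[::-1], tgt) for src, tgt in pairs]

# a single toy (source, target) token pair
pairs = [(["I", "am", "here"], ["je", "suis", "ici"])]
reversed_pairs = reverse_source(pairs)
```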
1.4 Experiments
1.4.1 Experiment Statistics
• Dataset: WMT’14
• Model details: 4 layers, 1000 cells per layer, 1000-dimensional word embeddings.
• Initialization: uniform distribution between -0.08 and 0.08
• Epochs and SGD learning rate: fixed at 0.7 for the first 5 epochs, then halved
every half epoch after this; 7.5 epochs of training in total.
2 The vanishing gradient problem means that only the short-range dependencies of a language
can be learned, since gradients shrink exponentially as they are propagated back through long sequences.
• Gradient scaling is done to avoid exploding gradients: let s = ||g||_2, where g is
the gradient divided by 128 (the minibatch size); if s > 5, then g = 5g/s.
• Minibatch training grouped sentences of roughly the same length into each
minibatch, yielding a 2x speedup.
• After training, decoding produces the most likely translation:

T^ = arg max_T p(T | S)
• Partial hypotheses are stored, and the most likely translation is found using a
left-to-right beam-search decoder. The model's log probability ranks the
hypotheses maintained by the beam search.
• Reversing the source sentence leads to a significant drop in perplexity.
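The gradient-scaling rule from the list above can be sketched as follows; the batch size of 128 and the threshold of 5 are the constants stated above:

```python
import numpy as np

def clip_gradient(summed_grad, batch_size=128, threshold=5.0):
    """Average the summed gradient over the minibatch, then rescale:
    s = ||g||_2; if s > 5, set g = 5g / s so the norm never exceeds 5."""
    g = summed_grad / batch_size
    s = np.linalg.norm(g)
    if s > threshold:
        g = threshold * g / s
    return g
```

Rescaling (rather than element-wise truncation) preserves the gradient direction while bounding its magnitude.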
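The left-to-right beam-search decoding described above can be sketched as follows; `step_log_probs` is a hypothetical interface standing in for the trained model:

```python
import numpy as np

def beam_search(step_log_probs, beam_size, max_len):
    """Left-to-right beam search: keep the beam_size most likely partial
    hypotheses at every step. step_log_probs(prefix) is assumed to return a
    vector of log-probabilities over the vocabulary given the prefix."""
    beams = [((), 0.0)]                       # (partial hypothesis, log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + (tok,), score + lp))
        # retain only the highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# toy stand-in for the model: a fixed distribution over a 3-word vocabulary
toy_model = lambda prefix: np.log(np.array([0.7, 0.2, 0.1]))
best = beam_search(toy_model, beam_size=2, max_len=3)
```

With beam size 1 this reduces to greedy decoding; larger beams trade compute for better approximations of arg max_T p(T | S).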
2.2 Introduction
The paper introduces a bidirectional RNN with explicit weights for the two di-
rections as an encoder, and a search-based decoder that emulates searching
through the source sentence.
Unlike the model presented above, this model has two major improvements:
• The model captures both forward and backward context, therefore allowing con-
text aggregation from both sides and determining the importance of con-
text words that occur after the focus word as well.
• Individual words from the entire search space are provided a weight, which
determines their relative importance to a given focus word.
2.3 Model Architecture
2.3.1 Decoder Design
The paper defines a conditional probability

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

where y_i is the ith output word, x is the input sequence, s_i is the RNN
hidden state for time i, and c_i is the context vector, computed as a weighted
sum of the encoder hidden states:

s_i = f(s_{i-1}, y_{i-1}, c_i)

c_i = sum_{j=1}^{T_x} α_{ij} h_j

α_{ij} = exp(e_{ij}) / sum_{k=1}^{T_x} exp(e_{ik})

where

e_{ij} = a(s_{i-1}, h_j)

is an alignment model.
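A minimal sketch of this alignment computation, assuming the common additive parameterization a(s_{i-1}, h_j) = v_a · tanh(W_a s_{i-1} + U_a h_j) and random illustrative weights:

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Alignment scores e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j), normalized
    with a softmax over source positions j; the context vector c_i is the
    attention-weighted sum of the source hidden states h_j."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()               # softmax: the alpha_ij sum to 1
    c = alpha @ H                      # context vector c_i
    return alpha, c

rng = np.random.default_rng(0)
d = 4                                  # illustrative hidden size
alpha, c = additive_attention(rng.normal(size=d),
                              rng.normal(size=(6, d)),   # 6 source states h_j
                              rng.normal(size=(d, d)),
                              rng.normal(size=(d, d)),
                              rng.normal(size=d))
```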
2.4 Experiments
2.4.1 Experiment Statistics
• Dataset: WMT’14
2.4.2 Model Information
Gated Hidden Units A gated hidden unit is used in order to understand
the relative importance of the s_{i-1} term in the generation of the s_i term:

s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) ∘ s_{i-1} + z_i ∘ s̃_i

which is a simple MLP with the modification of an update function, and e(y_{i-1})
being an m-dimensional embedding of the input word y_{i-1}.
The paper also introduces a reset gate r_i and an update gate z_i, which
are:

z_i = sigm(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i)
r_i = sigm(W_r e(y_{i-1}) + U_r s_{i-1} + C_r c_i)

and the proposed state s̃_i = tanh(W e(y_{i-1}) + U (r_i ∘ s_{i-1}) + C c_i).
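One decoder step with these gates can be sketched as follows; the weight dictionary `P` and its shapes are illustrative assumptions:

```python
import numpy as np

def gru_decoder_step(s_prev, e_y_prev, c, P):
    """One gated update s_i = (1 - z_i) * s_{i-1} + z_i * s_tilde_i, where the
    update gate z_i, reset gate r_i, and proposal s_tilde_i are all computed
    from the embedded previous word e(y_{i-1}), the previous state s_{i-1},
    and the context vector c_i, using the weight matrices in P."""
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
    z = sigm(P["Wz"] @ e_y_prev + P["Uz"] @ s_prev + P["Cz"] @ c)
    r = sigm(P["Wr"] @ e_y_prev + P["Ur"] @ s_prev + P["Cr"] @ c)
    s_tilde = np.tanh(P["W"] @ e_y_prev + P["U"] @ (r * s_prev) + P["C"] @ c)
    return (1.0 - z) * s_prev + z * s_tilde

rng = np.random.default_rng(0)
n, m = 4, 3                            # illustrative state and embedding sizes
P = {k: rng.normal(size=(n, m if k.startswith("W") else n))
     for k in ["W", "Wz", "Wr", "U", "Uz", "Ur", "C", "Cz", "Cr"]}
s = gru_decoder_step(rng.normal(size=n), rng.normal(size=m),
                     rng.normal(size=n), P)
```

When z_i is near 0 the previous state is copied through unchanged, which is what lets the gated unit keep long-range information.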
where f computes the current hidden state given a previous hidden state.
In attention-based models, the idea is to determine the importance of each of
the hidden states throughout the network.4
The global attention model's idea is to consider all hidden states of the encoder
when deriving the context vector c_t. In this model type, a variable-length align-
ment vector a_t, whose size equals the number of time steps on the source side,
is derived by comparing the current target hidden state h_t with each source
hidden state h̄_s.
4 This is defined as global attention; local attention is also defined in this paper, but that
is not covered here.
Another mode of alignment can be generated solely from the target hidden
states as follows:

a_t = softmax(W_a h_t)
The third alignment function was the only one used by the second paper
above, while the other alternatives have been shown to be better.
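The alignment alternatives can be sketched together; the function below is an illustrative sketch of the dot, general, and location forms, not the papers' code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_alignment(h_t, H_src, W_a, mode):
    """Alignment vector a_t over the source positions. 'dot' and 'general'
    compare the target state h_t with every source state in H_src; 'location'
    uses the target state alone, a_t = softmax(W_a h_t), with W_a mapping
    h_t to a vector of source length."""
    if mode == "dot":
        return softmax(H_src @ h_t)
    if mode == "general":
        return softmax(H_src @ (W_a @ h_t))
    if mode == "location":
        return softmax(W_a @ h_t)
    raise ValueError(mode)

rng = np.random.default_rng(0)
H_src = rng.normal(size=(5, 4))        # 5 source states, hidden size 4
h_t = rng.normal(size=4)
a_dot = luong_alignment(h_t, H_src, None, "dot")
a_loc = luong_alignment(h_t, H_src, rng.normal(size=(5, 4)), "location")
```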
The coverage-based model updates a coverage vector at each decoding step:

C_{i,j} = g_update(C_{i-1,j}, α_{i,j}, Φ(h_j))

where
• g_update(·) is the function that updates C_{i,j} after the new attention α_{i,j} at
time step i in the decoding process;
• C_{i,j} is a d-dimensional coverage vector summarizing the history of atten-
tion up to time step i on h_j;
• Φ(h_j) is a word-specific feature with its own parameters.
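A minimal illustrative instance of the update (with d = 1), assuming a simple fertility-normalized additive form for g_update; the paper allows a learned neural function here:

```python
import numpy as np

def update_coverage(C_prev, alpha_i, phi_h):
    """One coverage update per source position j: accumulate the new attention
    alpha_{i,j}, normalized by the word-specific feature phi(h_j) (acting as a
    fertility estimate). This additive form is only one possible g_update."""
    return C_prev + alpha_i / phi_h
```

Coverage accumulated this way lets the decoder see how much attention each source word has already received, discouraging over- and under-translation.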