On the Equivalence Between Deep NADE and Generative Stochastic Networks
Li Yao, Sherjil Ozair, Kyung Hyun Cho, Yoshua Bengio
(Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi; Yoshua Bengio is a CIFAR Senior Fellow)

1. Motivation
Neural Autoregressive Distribution Estimators (NADEs), a recent
family of models, have been successful in modeling high-dimensional
multimodal distributions.

NADEs are uniquely advantageous in that they

have a tractable density function,
have a tractable sampling procedure, and still
are rich enough to model complex distributions.

However, one issue with NADEs is that they rely on a single fixed
order of factorization of the input.

Deep Orderless NADE was proposed to address this issue. As
the name suggests, deep orderless NADEs do not rely on a
particular ordering.

Deep Orderless NADEs share NADE's advantages of tractability. But, in
practice, Deep NADEs are slower than other density estimators, as
demonstrated in the following table:

Number of layers      Time per pass
single-layer          O(#V * #H)
double-layer          O(#V * #H^2)
multi-layer (#L)      O(#V * #L * #H^2)

In this work, we show that

the training criterion for Deep Orderless NADE also trains a GSN,
this re-interpretation of Deep Orderless NADE allows us to speed up
sampling, and
the other advantages of NADEs are retained.

2. NADE
NADEs use #V multi-layer perceptrons, one to
model each P(Vj | V<j), i.e. the distribution of the j-th
dimension conditioned on the previous dimensions.

NADE models thus trained rely on the
particular ordering of the input dimensions.

Further experiments suggested that the input ordering crucially
determines the performance of the model, yet there is no efficient
algorithm for finding the optimal ordering.
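
To make the fixed-ordering factorization concrete, the sketch below draws one sample by ancestral sampling under a single left-to-right ordering. This is a minimal NumPy illustration, not the paper's code: the one-hidden-layer parameters are random placeholders rather than a trained model, and all variable names are ours. The incremental update of the hidden pre-activation is what gives the O(#V * #H) cost per pass quoted above for the single-layer case.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy binary NADE with #V visible and #H hidden units (hypothetical random weights).
    nV, nH = 8, 16
    W = rng.normal(0, 0.1, size=(nH, nV))   # input-to-hidden weights; column j is used once v_j is known
    U = rng.normal(0, 0.1, size=(nV, nH))   # hidden-to-output weights, one row per conditional P(Vj | V<j)
    b = np.zeros(nV)                        # output biases
    c = np.zeros(nH)                        # hidden biases

    def nade_sample(order):
        """Ancestral sampling: draw v_j ~ P(v_j | v_<j) following `order`."""
        v = np.zeros(nV)
        a = c.copy()                        # running hidden pre-activation, updated incrementally
        for j in order:
            h = sigmoid(a)                  # hidden state given the dimensions sampled so far
            p_j = sigmoid(b[j] + U[j] @ h)  # P(v_j = 1 | v_<j)
            v[j] = rng.random() < p_j
            a += W[:, j] * v[j]             # O(#H) update, so one full pass is O(#V * #H)
        return v

    sample = nade_sample(order=np.arange(nV))   # one fixed left-to-right ordering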

3. Deep Orderless NADE
Deep Orderless NADEs use a different random ordering for each
gradient step. The model thus learnt is equivalent to an ensemble of
NADEs trained on different orderings, sharing weights among them.

Thus, although
NADE models P(Vj | V<j),
Deep Orderless NADE models P(V≥j | V<j),

since for each k ≥ j it models P(Vk | V<j): the sequence
[1, ..., j-1, k] is a valid prefix of an ordering of the input dimensions,
on which the model has been trained.
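
As an illustration of the training procedure, here is a minimal NumPy sketch of one stochastic term of the orderless criterion: sample a random ordering and a split point, hide the "missing" dimensions, and score their reconstruction, reweighted so that its expectation over the split point matches the full negative log-likelihood. `predict_missing` is a placeholder name standing in for the deep network, not the paper's API.

    import numpy as np

    rng = np.random.default_rng(0)

    def orderless_nade_loss(v, predict_missing):
        """One stochastic term of the Deep Orderless NADE criterion for a single
        binary example v.  `predict_missing(v_masked, mask)` should return
        per-dimension probabilities P(v_j = 1 | observed dims)."""
        D = len(v)
        order = rng.permutation(D)           # a fresh random ordering for this gradient step
        d = rng.integers(0, D)               # number of dimensions treated as observed
        mask = np.zeros(D)
        mask[order[:d]] = 1.0                # 1 = observed, 0 = missing
        p = predict_missing(v * mask, mask)  # network sees the observed values plus the mask
        miss = mask == 0
        nll = -np.sum(v[miss] * np.log(p[miss]) +
                      (1 - v[miss]) * np.log(1 - p[miss]))
        return nll * D / (D - d)             # reweight so the expectation over d matches the full NLL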

4. Generative Stochastic Networks
Generative Stochastic Networks (GSNs) are generalizations of denoising
autoencoders to arbitrary corruption functions.

Modeling the reconstruction P(V | X̃), where X̃ is a corrupted version of V
obtained by passing V through a corruption function C, is expected to be
simpler than modeling P(V) directly.
Given a model of P(V | X̃), samples from P(V) can be obtained by
alternately sampling from C(X̃ | V) and P(V | X̃).
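
A minimal sketch of this alternating chain (Python; `corrupt` and `reconstruct_sample` are placeholder callables standing in for C and the learned conditional, not an existing API):

    def gsn_chain(v0, corrupt, reconstruct_sample, n_steps=100):
        """Run the GSN Markov chain: alternately corrupt the current state with
        C(X̃ | V) and resample from the learned reconstruction P(V | X̃)."""
        v = v0
        samples = []
        for _ in range(n_steps):
            x_tilde = corrupt(v)             # X̃ ~ C(X̃ | V)
            v = reconstruct_sample(x_tilde)  # V ~ P(V | X̃)
            samples.append(v)
        return samples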

Our core insight in this work is that the training criterion for Deep
Orderless NADE also trains a GSN, where

the corruption function is a missingness function, i.e. the
corruption removes certain input dimensions and leaves the others
intact, and
the target is to reconstruct the missing dimensions from the
known dimensions. This is the same as reconstructing the full input
from a corrupted input in which a subset of the input dimensions
has been removed.

The GSN thus models P(V | X̃) = P(V | V<j) = P(V<j, V≥j | V<j) = P(V≥j | V<j)
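
Under this view, sampling from a trained Deep Orderless NADE reduces to the following GSN-style chain. This is a NumPy sketch under our assumptions: `predict_missing`, the missingness probability, and the step count are illustrative placeholders, not values from the paper. Because all missing dimensions are resampled in a single network pass, each step costs one forward pass rather than #V of them, which is the source of the speedup reported below.

    import numpy as np

    rng = np.random.default_rng(0)

    def gsn_sample_from_orderless_nade(predict_missing, nV, p_missing=0.5, n_steps=200):
        """Sample from a Deep Orderless NADE via the equivalent GSN chain:
        randomly drop a subset of dimensions (the missingness corruption C),
        then resample them all at once from P(V_missing | V_observed)."""
        v = (rng.random(nV) < 0.5).astype(float)               # arbitrary initial state
        for _ in range(n_steps):
            mask = (rng.random(nV) > p_missing).astype(float)  # 1 = kept, 0 = dropped
            p = predict_missing(v * mask, mask)                # P(v_j = 1 | kept dims)
            resampled = (rng.random(nV) < p).astype(float)
            v = mask * v + (1 - mask) * resampled              # only the dropped dims change
        return v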





5. Evaluation
We trained a Deep Orderless NADE on MNIST, and drew samples from the
same model using both NADE's ancestral sampling and GSN sampling.

We observe that
GSN sampling can be 3-13x faster than NADE sampling.
Both GSN and NADE sampling produce spurious modes: in GSN sampling,
the Markov chain can get stuck in low-probability regions; in NADE
sampling, unlikely values for the initially sampled pixels degrade
the rest of the sample.
The best GSN samples are better than the best NADE samples with respect
to visual proximity to the training data.
GSN samples are smoother than NADE samples, since GSN resamples
multiple pixels simultaneously, which makes the pixels more mutually
correlated.

[Poster banner: DEEP NADE = GENERATIVE STOCHASTIC NETWORK]
[Figure: MNIST samples drawn with NADE sampling and with GSN sampling]