Figure 2: Overview of the current state-of-the-art methods (left top) and our proposed method (right) for text-to-image synthesis at low-resolution scale. By default, the current state-of-the-art methods adopt conditioning augmentation (CA), which introduces the variable c ∼ p(c|ϕt), in addition to the variable z ∼ N(0, 1), as the inputs for the image generator Gx. The removal of z (left bottom) does not affect the model performance (see Figure 9 in the supplementary material for quantitative evaluations). In our method (right), we incorporate the inference mechanism, where Gz,c encodes both z and c, and the discriminator D(x,z)/(x,c) distinguishes between joint pairs. For the cycle consistency, sampled ẑ and ĉ are also used to reconstruct x′.
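The conditioning augmentation named in the caption (sampling c ∼ p(c|ϕt) rather than feeding ϕt directly) is commonly implemented with the reparameterization trick so that gradients flow through µ and Σ. The following is a minimal NumPy sketch, not the authors' code; the linear maps standing in for the µ and Σ networks are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioning_augmentation(phi_t, W_mu, W_logvar):
    """Sample c ~ N(mu(phi_t), Sigma(phi_t)) via the reparameterization
    trick: c = mu + sigma * eps, with eps ~ N(0, I) and a diagonal Sigma
    parameterized as exp(logvar). W_mu and W_logvar stand in for the
    learned mu/Sigma networks (illustrative assumption)."""
    mu = phi_t @ W_mu                      # mean of the conditional Gaussian
    logvar = phi_t @ W_logvar              # log-variance of the diagonal Sigma
    eps = rng.standard_normal(mu.shape)    # eps ~ N(0, I)
    return mu + np.exp(0.5 * logvar) * eps

# Toy usage: a 4-d "text embedding" mapped to a 2-d latent code c.
phi = rng.standard_normal((1, 4))
c = conditioning_augmentation(phi,
                              rng.standard_normal((4, 2)),
                              rng.standard_normal((4, 2)))
```

Sampling c this way, instead of conditioning on the deterministic ϕt, gives the generator a smooth conditioning manifold rather than a sparse set of embedding points.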
deterministic text embedding, which cannot plausibly cover all content variations, and as a result, one can assume that information belonging to the content could severely contaminate the style. In our work, we learn style from the data itself as opposed to the generated images. This allows us to learn style while updating the generator and effectively incorporate style information from the data into the generator.

Adversarial inference methods  Various papers have explored learning representations through adversarial training. Notable mentions are BiGANs [6, 7], where a bidirectional discriminator acts on pairs (x, z) of data and generated points. While these models assume that a single random variable z encodes the data representation, in this work we extend adversarial inference to two random variables that are disentangled from each other. Our model is also closely related to [19], where the authors incorporate an adversarial reconstruction loss into the BiGAN framework. They show that the additional loss term results in better reconstructions and more stable training. Although Dumoulin et al. [7] show results for conditional image generation, in their model the conditioning factor is discrete, fully observed and not inferred through the inference model. In our model, however, c can be a continuous conditioning variable that we infer from the text and image.

Relation to InfoGAN  While the matching-aware loss (Section 3.1) used in many text-to-image works can also be viewed as maximizing mutual information between the two modalities (i.e., text and image), the way it is approximated is different. InfoGAN [3] uses the variational mutual information maximization technique, whereas the matching-aware loss uses the concept of matched and mismatched pairs. In addition, InfoGAN concentrates all semantic features on the latent code c, which contains both content and style, whereas in this work we only maximize mutual information on the content, since we consider text as our content.

3. Methods

3.1. Preliminaries

We start by describing text-to-image synthesis. Let ϕt be the text embedding of a given text description associated with image x. The goal of text-to-image synthesis is to generate a variety of visually plausible images that are text-matched. Reed et al. [26] first propose a conditional GAN-based framework, where a generator Gx takes as input a noise vector z sampled from p(z) = N(0, 1) and ϕt as the conditioning factor to generate an image x̃ = Gx(z, ϕt). A matching-aware discriminator Dx,ϕt is then trained not only to judge between real and fake images, but also to discriminate between matched and mismatched image-text pairs. The minimax objective function for the text-to-image (subscript t2i) framework is given as:

\min_G \max_D V_{t2i}(D_{x,\phi_t}, G_x) =
    \mathbb{E}_{(x_a,t_a) \sim p_{data}} [\log D_{x,\phi_t}(x_a, \phi_{t_a})]
  + \tfrac{1}{2}\, \mathbb{E}_{(x_a,t_b) \sim p_{data}} [\log(1 - D_{x,\phi_t}(x_a, \phi_{t_b}))]
  + \mathbb{E}_{z \sim p(z),\, t_a \sim p_{data}} [\log(1 - D_{x,\phi_t}(G_x(z, \phi_{t_a}), \phi_{t_a}))],    (1)

where (xa, ta) is a matched pair and (xa, tb) is a mismatched pair.
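For concreteness, the value of objective (1) can be estimated from the discriminator's probabilities on the three kinds of pairs it sees. The sketch below is illustrative only (the function name and the stand-in probability inputs are our assumptions, not the authors' implementation):

```python
import numpy as np

def v_t2i(d_real_matched, d_real_mismatched, d_fake_matched):
    """Monte-Carlo estimate of objective (1) from discriminator
    probabilities for the three pair types:
      real matched      (x_a, phi_{t_a}),
      real mismatched   (x_a, phi_{t_b})  -- weighted by 1/2,
      fake matched      (G(z, phi_{t_a}), phi_{t_a})."""
    return (np.mean(np.log(d_real_matched))
            + 0.5 * np.mean(np.log(1.0 - d_real_mismatched))
            + np.mean(np.log(1.0 - d_fake_matched)))
```

The discriminator ascends this value while the generator descends the last term, as in the standard GAN minimax game.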
To augment the text data, Zhang et al. [32] replace the deterministic text embedding ϕt in the generator with a latent variable c, which is sampled from a learned Gaussian distribution p(c|ϕt) = N(µ(ϕt), Σ(ϕt)), where µ and Σ are functions of ϕt parameterized by neural networks. For simplicity in notation, we denote p(c|ϕt) as p(c). As a result, the objective function (1) is updated to:

\min_G \max_D V_{t2i}(D_{x,\phi_t}, G_x) =
    \mathbb{E}_{(x_a,t_a) \sim p_{data}} [\log D_{x,\phi_t}(x_a, \phi_{t_a})]
  + \tfrac{1}{2}\, \mathbb{E}_{(x_a,t_b) \sim p_{data}} [\log(1 - D_{x,\phi_t}(x_a, \phi_{t_b}))]
  + \mathbb{E}_{z \sim p(z),\, c \sim p(c),\, t_a \sim p_{data}} [\log(1 - D_{x,\phi_t}(G_x(z, c), \phi_{t_a}))].    (2)

In addition to the matching-aware pair loss that guarantees the semantic consistency, Zhang et al. [35] propose another type of adversarial loss that focuses on the image fidelity (i.e., image loss), further updating (2) to:

\min_G \max_D V_{t2i}(D_x, D_{x,\phi_t}, G_x) =
    \mathbb{E}_{x_a \sim p_{data}} [\log D_x(x_a)]
  + \mathbb{E}_{z \sim p(z),\, c \sim p(c)} [\log(1 - D_x(G_x(z, c)))]
  + \mathbb{E}_{(x_a,t_a) \sim p_{data}} [\log D_{x,\phi_t}(x_a, \phi_{t_a})]
  + \tfrac{1}{2}\, \mathbb{E}_{(x_a,t_b) \sim p_{data}} [\log(1 - D_{x,\phi_t}(x_a, \phi_{t_b}))]
  + \mathbb{E}_{z \sim p(z),\, c \sim p(c),\, t_a \sim p_{data}} [\log(1 - D_{x,\phi_t}(G_x(z, c), \phi_{t_a}))],    (3)

where Dx is a discriminator distinguishing between images sampled from pdata and those sampled from the distribution parameterized by the generator (i.e., pmodel).

Consider two general probability distributions q(x) and p(z) over two domains x ∈ X and z ∈ Z, where q(x) represents the empirical data distribution and p(z) is usually specified as a simple random distribution, e.g., a standard normal N(0, 1). Adversarial inference [6, 7] aims to match the two joint distributions q(x, z) = q(z|x)q(x) and p(x, z) = p(x|z)p(z), which in turn implies that q(z|x) matches p(z|x). To achieve this, an encoder Gz(x): ẑ = Gz(x), x ∼ q(x), is introduced in the generation phase, in addition to the standard generator Gx(z): x̃ = Gx(z), z ∼ p(z). The discriminator D is trained to distinguish joint pairs (x, ẑ) from (x̃, z). The minimax objective of adversarial inference can be written as:

\min_G \max_D V(D, G_x, G_z) =
    \mathbb{E}_{x \sim q(x),\, \hat z \sim q(z|x)} [\log D(x, \hat z)]
  + \mathbb{E}_{\tilde x \sim p(x|z),\, z \sim p(z)} [\log(1 - D(\tilde x, z))].    (4)

3.2. Dual adversarial inference

As described in Section 3.1, the current state-of-the-art methods for text-to-image synthesis can be viewed as variants of conditional GANs, where the conditioning is initially on ϕt itself [26] and later updated to the latent variable c sampled from a distribution learned through ϕt [32, 35, 30, 24]. The generator then has two latent variables z and c: z ∼ p(z), c ∼ p(c) (left, Figure 2). The priors can be Gaussian or non-Gaussian distributions such as the Bernoulli distribution¹. To learn disentangled representations for style (z) and content (c) and to enforce the separation between these two variables, we incorporate dual adversarial inference into the current framework for text-to-image synthesis (right, Figure 2). In this dual inference process, we are interested in matching the conditional q(z, c|x) to the posterior p(z, c|x), which under the independence assumption can be factorized as follows:

    q(z, c | x) = q(z | x) q(c | x),
    p(z, c | x) = p(z | x) p(c | x).

This formulation allows us to match q(z|x) with p(z|x) and q(c|x) with p(c|x), respectively. Similar to previous work [7, 6], we achieve this by matching the two pairs of joint distributions:

    q(z, x) = p(z, x),
    q(c, x) = p(c, x).

The encoder for our dual adversarial inference then encodes both z and c: ẑ, ĉ = Gz,c(x), x ∼ q(x), while the generator decodes z and c, sampled from their corresponding prior distributions, into an image: x̃ = Gx(z, c), z ∼ p(z), c ∼ p(c). To compete with Gx and Gz,c, the discrimination phase also has two components: the discriminator Dx,z, trained to discriminate (x, z) pairs sampled from either q(x, z) or p(x, z), and the discriminator Dx,c, for the discrimination of (x, c) pairs sampled from either q(x, c) or p(x, c). Given the above setting, the original adversarial inference objective (4) is updated as:

\min_G \max_D V_{dual}(D_{x,z}, D_{x,c}, G_x, G_{z,c}) =
    \mathbb{E}_{x \sim q(x),\, (\hat z, \hat c) \sim q(z,c|x)} [\log D_{x,z}(x, \hat z) + \log D_{x,c}(x, \hat c)]
  + \mathbb{E}_{\tilde x \sim p(x|z,c),\, z \sim p(z),\, c \sim p(c)} [\log(1 - D_{x,z}(\tilde x, z)) + \log(1 - D_{x,c}(\tilde x, c))].    (5)

3.3. Cycle consistency

In unsupervised learning, cycle-consistency refers to the ability of the model to reconstruct the original image x from its inferred latent variable z. It has been reported

¹ In this paper, we experiment with both Gaussian and Bernoulli distributions.
Figure 3: Disentangling content and style on MNIST-CB dataset. (a) Generated samples given digit identities as the content
c. Each column uses the same style z sampled from N (0, 1). (b) The t-SNE visualizations of inferred content ĉ and inferred
style ẑ. (c) Reconstructed samples using inferred content ĉ (in rows) and inferred style ẑ (in columns) from image sources.
that bidirectional adversarial inference models often have difficulties in reproducing faithful reconstructions, as they do not explicitly include any reconstruction loss in the objective function [7, 6, 19]. The cycle-consistency criterion, as demonstrated in many previous works such as CycleGAN [36], DualGAN [31], DiscoGAN [14] and augmented CycleGAN [1], enforces a strong connection between domains (here x and z) by constraining the models (e.g., encoder and decoder) to be consistent with one another. Li et al. [19] show that integrating the cycle-consistency objective stabilizes the learning of adversarial inference, thus yielding better reconstruction results.

With the above in mind, we integrate cycle-consistency into our dual adversarial inference framework in a similar fashion to [19]. More concretely, we use another discriminator Dx,x′ to distinguish between x and its reconstruction x′ = Gx(ẑ, ĉ), where ẑ, ĉ = Gz,c(x), by optimizing:

\min_G \max_D V_{cycle}(D_{x,x'}, G_x, G_{z,c}) =
    \mathbb{E}_{x \sim q(x)} [\log D_{x,x'}(x, x)]
  + \mathbb{E}_{x \sim q(x),\, (\hat z, \hat c) \sim q(z,c|x)} [\log(1 - D_{x,x'}(x, G_x(\hat z, \hat c)))].    (6)

We later show in an ablation study (Section 4.6) that using an l2 loss for cycle-consistency leads to blurriness in the generated images, which agrees with previous studies [17, 31].

3.4. Full objective

Taking (3), (5) and (6) into account, our full objective is:

\min_G \max_D V_{full}(D, G) = V_{t2i}(D_x, D_{x,\phi_t}, G_x) + V_{dual}(D_{x,z}, D_{x,c}, G_x, G_{z,c}) + V_{cycle}(D_{x,x'}, G_x, G_{z,c}),    (7)

where G and D are the sets of all generators and discriminators in our method: G = {Gx, Gz,c} and D = {Dx, Dx,ϕt, Dx,z, Dx,c, Dx,x′}.

Note that, in addition to the latent variable c, the encoded ẑ and ĉ in our method are also sampled from the inferred posterior distributions through the reparameterization trick [16], i.e., ẑ ∼ q(z|x) and ĉ ∼ q(c|x). In order to encourage smooth sampling over the latent space, we regularize the posterior distributions q(z|x) and q(c|x) to match their respective priors by minimizing the KL divergence. We apply a similar regularization term to p(c), e.g., λ DKL(p(c) || N(0, 1)) for a normal distribution prior, as done in previous text-to-image synthesis works [32, 35]. Our preliminary experiments² showed that without the above regularization, the training became unstable and the gradients typically exploded after a certain number of epochs.

4. Experiments

4.1. Proof-of-concept study

To evaluate the effectiveness of our proposed dual adversarial inference on the disentanglement of content and style, we first validate our method on a toy dataset, MNIST-CB [8], where we formulate digit generation as a text-to-image synthesis problem by considering the digit identity as the text content. In this setup, the digit font and background color represent styles, learned in an unsupervised manner through adversarial inference. We add a cross-entropy regularization term to the content inference objective, since the content in this case is discrete (i.e., a one-hot vector for the digit identity). As shown in Figure 3 (a), content and style are disentangled in the generation phase: the generator has learned to assign the same style to different digit identities when the same z is used. More importantly, the t-SNE visualizations (Figure 3 (b)) of our inferred content and style (ĉ and ẑ) indicate that our

² We also experimented with minimizing the cosine similarity between ẑ and ĉ, but did not observe improved performance in terms of the inception score and FID.
Method              Inception Score                              FID
                    Oxford-102    CUB           COCO            Oxford-102     CUB            COCO
GAN-INT-CLS [26]    2.66 ± 0.03   2.88 ± 0.04   7.88 ± 0.07     79.55          68.79          60.62
GAWWN [27]          —             3.10 ± 0.03   —               —              53.51          —
StackGAN [32, 33]   2.73 ± 0.03   3.02 ± 0.03   8.35 ± 0.11     43.02          35.11          33.88
HDGAN [35]          —             3.53 ± 0.03   —               —              —              —
HDGAN mean*         2.90 ± 0.03   3.58 ± 0.03   8.64 ± 0.37     40.02 ± 0.55   20.60 ± 0.96   29.13 ± 3.76
Ours mean*          2.90 ± 0.03   3.58 ± 0.05   8.94 ± 0.20     37.94 ± 0.39   18.41 ± 1.07   27.07 ± 2.55

* Mean calculated over three experiments at five different epochs (600, 580, 560, 540, 520), or three different epochs (200, 190, 180) for the COCO dataset.

Table 1: Comparison of inception score and FID at the 64×64 resolution scale. Higher inception score and lower FID mean better performance.
Figure 4: Examples of generated images on Oxford-102 (top), CUB (middle) and COCO (bottom) datasets; columns show, from left to right, the ground truth (GT), the baseline and our method.
dual adversarial inference has successfully separated the information on content (digit identity) and style (font and background color). This is further validated in Figure 3 (c), where we show our model's ability to infer style and content from different image sources and fuse them to generate hybrid images, using content from one source and style from the other.

4.2. Text-to-image setup

Once validated on the toy example, we move to the original text-to-image synthesis task. We evaluate our method based on model architectures similar to HDGAN [35], one of the current state-of-the-art methods for text-to-image synthesis, making HDGAN our baseline method. The architecture designs are the same as described in [35], keeping in mind that we only consider the 64×64 resolution. Three quantitative metrics are used to evaluate our method: inception score [28], Fréchet inception distance (FID) [10] and visual-semantic similarity [35]. It has been noticed in our experiments, and also reported by others [21], that due to the variations in the training of GAN models it is unfair to draw a conclusion from one single experiment that achieves the best result; therefore, we perform three independent experiments for each method, with averages reported as final results. More implementation, dataset and evaluation details can be found in the supplementary material.

4.3. Quantitative results

To get a global overview of how our method, the baseline method and its variants (obtained by either fixing or removing the noise vector z) behave throughout training, we evaluate each model at 20-epoch intervals. Figure 9 (supplementary material) shows the inception score (left axis) and FID (right axis) for both the Oxford-102 and CUB datasets. Consistent with the qualitative results presented in Figure 8 (supplementary material), we quantitatively show that by either fixing or removing z, the baseline models retain unimpaired performance, suggesting that z makes no contribution in the baseline models. However, with our proposed dual adversarial inference, the model performance is significantly improved on FID scores for both datasets (red curves, Figure 9), indicating the proposed method's ability to produce better-quality images. Table 1 summarizes the comparison of our results to the baseline method and to other reported results of previous state-of-the-art methods for the 64×64 resolution task on the three benchmark datasets: Oxford-102, CUB and COCO. Our method achieves the best performance based on the mean scores for both metrics on all datasets; on the FID score, it shows a 5.2% improvement (from 40.02 to 37.94) on the Oxford-102 dataset, and a 10.6% improvement (from 20.60 to 18.41) on the CUB dataset. In addition, we achieve comparable results on visual-semantic similarity (Table 3, supplementary material).
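For reference, FID compares Gaussian fits to Inception features of real and generated images via the Fréchet distance. Below is a minimal sketch assuming the feature statistics (means and covariances) are precomputed elsewhere; this is not the evaluation code used in the paper:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 @ sigma2)^(1/2)).
    FID evaluates this on Inception-feature statistics of real vs.
    generated image sets (statistics assumed computed beforehand)."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny numerical imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower values mean the generated feature distribution is closer to the real one, which is why the FID drops reported above indicate better image quality.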
Figure 5: Examples of reconstructed images by interpolation of inferred content ĉ and inferred style ẑ from sources to targets.
The learned style information includes: (a) quantity, (b) pose, (c) size and (d) background.
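The interpolation grids of the kind shown in Figure 5 can be produced by linearly blending the inferred codes of a source and a target image and decoding each blend with the generator. A small illustrative sketch (the decoding by Gx is omitted; names are our own):

```python
import numpy as np

def interpolate_codes(code_src, code_tgt, steps=8):
    """Linear interpolation between the inferred codes (z-hat or c-hat)
    of a source and a target image. Each row is one blend, which would
    then be decoded by the generator G_x to render one step of the
    transition."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]  # column of blend weights
    return (1.0 - alphas) * code_src + alphas * code_tgt
```

Interpolating in ĉ while holding ẑ fixed sweeps the content, and vice versa, which is exactly the row/column layout of the figure.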
Figure 6: Disentangling content (in rows) and style (in columns) on Oxford-102 dataset by using content sources either from
text descriptions (left) or images (right). More results are provided in the supplementary material (Section 6.8).
4.4. Qualitative results

In this subsection, we present qualitative results on text-to-image generation and an interpolation analysis based on inferred content (ĉ) and inferred style (ẑ).

First, we visually compare the quality and diversity of images generated by our method against the baseline. Figure 4 shows one example for each dataset, illustrating that our method is able to generate better-quality images compared to the baseline method, which agrees with our quantitative results in Table 1. We provide more examples in the supplementary material (Section 6.7).

To make sure we are not overfitting, and to investigate whether we have learned a representative latent space, we look at interpolations of projected locations in the latent space. Interpolations also enable us to examine whether the model has indeed learned to separate style from content in an unsupervised way. To do this, we provide the trained inference model with two images, the source and the target, and extract their projections ẑ and ĉ for interpolation analysis. As shown in Figure 5, the rows correspond to reconstructed images of linear interpolations in ĉ from the source to the target image, and the same for ẑ as displayed in columns. The smooth transitions of both the content, represented by ĉ, from left to right and the style, represented by ẑ, from top to bottom indicate that our model generalizes well in both latent spaces; more interestingly, we find promising results showing that ẑ is indeed controlling some meaningful style information, e.g., the number and pose of flowers, the size of birds and the background (Figure 5; more examples in the supplementary material).

4.5. Disentanglement constraint

Despite the promising results evidenced by many examples such as those shown in Figure 5, we notice that the information
Figure 7: Disentangling content (in rows) and style (in columns) on CUB dataset by using content sources either from text
descriptions (left) or images (right). More results are provided in the supplementary material (Section 6.8).
captured by the inferred style (ẑ) is not always consistent and faithful when we use Gaussian priors for both content and style. Inspired by theories from independent component analysis (ICA) for separating a multivariate signal into additive subcomponents [4], we use a Bernoulli distribution for the content representation to satisfy the non-Gaussianity constraint. This provides us with a better disentanglement of content and style. Note that an alternative approach to ICA has also recently been explored in [13]. As shown in Figure 6 and Figure 7, our models learn to synthesize images by combining content and style information from different sources while preserving their respective properties (e.g., color for the content; and location, pose, quantity, etc. for the style), which suggests the disentanglement of content and style. Note that the content information can either come directly from a text description (left, Figure 6 and Figure 7) or be inferred from an image source (right, Figure 6 and Figure 7). More examples and discussions are provided in the supplementary material (Section 6.8).

Higgins et al. [11] and Zhang et al. [34] have proposed quantitative metrics for disentanglement analysis, which involve classification of the style attributes or comparison of the distance between generated style and true style. However, in our case the dataset does not contain any labeled attribute that can be used to evaluate a captured style; as a result, their proposed metrics would not be suitable. One possible solution would be to artificially create a new dataset that has the same content over multiple known styles. We leave this exploration for future work.

Method                  Inception Score   FID
ours                    3.58 ± 0.05       18.41 ± 1.07
ours without Vt2i       3.31 ± 0.04       20.65 ± 0.47
ours without Vcycle     3.53 ± 0.06       19.29 ± 0.90
l2 loss for Vcycle      1.73 ± 0.15       149.8 ± 16.4

Table 2: Ablation study on the CUB dataset. Note that the ablation of Vdual eventually turns the model into the baseline.

4.6. Ablation study

In our method, we have multiple components, each of which is optimized by its corresponding objective. The previous works [26, 32, 35] for text-to-image synthesis use the discriminator Dx,ϕt to discriminate whether the image x matches its text embedding ϕt. However, with the integration of adversarial inference, where a new discriminator Dx,c is designed to match the joint distributions of (x, ĉ) and (x̃, c), we now question whether the discriminator Dx,ϕt is still required, given that c is learned from ϕt. To answer this question, we remove the objective Vt2i(D, G) from our method; as seen in Table 2, the performance on the CUB dataset drops significantly for both inception score and FID, indicating that Dx,ϕt is not redundant in our method, as it provides strong supervision over the text embeddings. Similarly, we examine the role of the cycle-consistency loss in our method by removing Vcycle(D, G) from the objective. We observe a slight drop in both inception score and FID (Table 2), suggesting that cycle-consistency can further improve the learning of adversarial inference, which is in agreement with [19]. It is also worth mentioning that our method without cycle-consistency still achieves better FID scores than the baseline method on the CUB dataset (Table 1 and Table 2), which additionally supports our proposal to integrate the inference mechanism in the current text-to-image framework.

We also examine the model performance by using an l2 loss for cycle-consistency instead of the adversarial loss. The resulting degradation in quality is unexpectedly dramatic (Table 2). Figure 10 (supplementary material) shows the generated images using the adversarial loss compared with those using the l2 loss, and it is clear that the latter gives blurrier images.

5. Conclusion

In this paper, we incorporate a dual adversarial inference procedure in order to learn disentangled representations of content and style in an unsupervised way, which we show improves text-to-image synthesis. It is worth noting that the content is learned both in a supervised way, through the text embedding, and in an unsupervised way, through the adversarial inference. The style, however, is learned solely in an unsupervised manner. Despite the challenges of the task, we show promising results on interpreting what has been learned for style. With the proposed inference mechanism, our method achieves improved quality and comparable variability in generated images evaluated on the Oxford-102, CUB and COCO datasets.

Acknowledgements  This work was supported by Mitacs project IT11934. The authors thank Nicolas Chapados for his constructive comments, and Gabriel Chartrand, Thomas Vincent, Andrew Jesson, Cecile Low-Kam and Tanya Nair for their help and review.

References

[1] Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, and Aaron Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In ICML, 2018.
[2] Miriam Cha, Youngjune L Gown, and HT Kung. Adversarial learning of semantic relevance in text to image synthesis. In AAAI, 2019.
[3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[4] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[5] Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal. Tac-gan: Text conditioned auxiliary classifier generative adversarial network. arXiv preprint arXiv:1703.06412, 2017.
[6] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
[7] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
[8] Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross-domain disentanglement. In NIPS, 2018.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
[11] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[12] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
[13] Ilyes Khemakhem, Diederik P Kingma, and Aapo Hyvärinen. Variational autoencoders and nonlinear ica: A unifying framework. 2019.
[14] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
[15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[16] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[17] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
[18] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
[19] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In NIPS, 2017.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[21] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. In NIPS, 2018.
[22] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[23] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[24] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In CVPR, 2019.
[25] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
[26] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[27] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In NIPS, 2016.
[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
[29] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
[30] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
[31] Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
[32] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[33] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916, 2017.
[34] Yexun Zhang, Ya Zhang, and Wenbin Cai. Separating style and content for generalized style transfer. In CVPR, 2018.
[35] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018.
[36] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
6. Supplementary Material
6.1. Problem
The current state-of-the-art methods for text-to-image synthesis normally have two sources of randomness: one capturing the
text embedding variability, and the other, a noise vector z drawn from a normal distribution, capturing image variability. Our
empirical investigation of some previously published methods reveals that these two sources overlap: because of the randomness
already present in the text embedding, the noise vector z contributes meaningfully to neither the variability nor the quality of
the generated images, and can be discarded. This is illustrated qualitatively in Figure 8 and quantitatively in Figure 9.
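The overlap between the two randomness sources can be seen in a minimal sketch of conditioning augmentation (CA). Here, the linear maps producing the mean and log-variance from the text embedding are hypothetical stand-ins for the learned CA layers in StackGAN-style models; the point is only that c ∼ p(c|ϕt) is itself stochastic, so the extra noise z has little variability left to contribute.

```python
import numpy as np

def conditioning_augmentation(phi_t, rng):
    # Hypothetical heads producing mu and log-variance from the text
    # embedding phi(t); in CA these are learned layers.
    mu = 0.1 * phi_t
    log_var = -2.0 * np.ones_like(phi_t)
    # Reparameterized sample: c = mu + sigma * eps, eps ~ N(0, 1).
    eps = rng.standard_normal(phi_t.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
phi_t = np.ones(128)                         # toy text embedding
c1 = conditioning_augmentation(phi_t, rng)
c2 = conditioning_augmentation(phi_t, rng)
# c1 and c2 differ even for the same text: the conditioning variable
# already injects randomness into the generator.
```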
[Figure 8 image grid: rows show GT, Baseline, Fix z, and Remove z results for two text descriptions: "This flower is white and pink in color, with petals that have small veins." and "A greyish-blue colored bird with a white-grey face, and big blue feet."]
Figure 8: Generated images from a previous state-of-the-art method (Baseline), from the baseline with the noise vector z held
fixed (Fix z), and from the baseline with z removed (Remove z). Removing the randomness from the noise source, by either
fixing or removing z, affects neither the variability nor the quality of the generated images, indicating that the noise vector z
makes no contribution to the synthesis process.
Figure 9: Inception score (left axis, top curves) and FID (right axis, bottom curves) for the baseline method, its variants (Fix
z and Remove z), and our method on the Oxford-102 (left) and CUB (right) datasets. Each curve is the mean of three
independent experiments. A higher Inception score and a lower FID indicate better performance.
6.2. Implementation details
For the encoder Gz,c , we first extract a 1024-dimensional feature vector from a given image x (note that the text embeddings
are also 1024-dimensional vectors), and apply the reparameterization trick to sample ẑ and ĉ. To reduce the complexity of our
models, we use the same discriminator for both Dx,z and Dx,c in our experiments, re-denoted as D(x,z)/(x,c) , and share the
weights between Dx and Dx,ϕt . By default, λ = 4. For training, we iteratively update the discriminators Dx,ϕt ,
D(x,z)/(x,c) , Dx,x′ and then the generators Gx , Gz,c for 600 epochs (Oxford-102 and CUB) or 200 epochs (COCO), with
Adam optimization [15]. The initial learning rate is set to 0.0002 and halved every 100 epochs.
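The two mechanical pieces above, the reparameterized sampling in the encoder and the step-wise learning-rate decay, can be sketched as follows. The 1024-dimensional feature and the encoder heads (`mu`, `log_var`) are illustrative stand-ins, not the paper's actual network layers.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z_hat = mu + sigma * eps with eps ~ N(0, 1); sampling stays
    # differentiable with respect to mu and log_var in a learned encoder.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def learning_rate(epoch, base_lr=2e-4):
    # Initial rate 0.0002, halved every 100 epochs.
    return base_lr * 0.5 ** (epoch // 100)

rng = np.random.default_rng(0)
feature = rng.standard_normal(1024)      # stand-in for the image feature
mu, log_var = feature, np.zeros(1024)    # hypothetical encoder heads
z_hat = reparameterize(mu, log_var, rng)
```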
6.3. Datasets
The Oxford-102 dataset [23] contains 8,189 flower images from 102 categories, and the CUB dataset [29] contains 11,788
bird images from 200 categories. For both datasets, each image is annotated with 10 text descriptions provided by [25].
Following the same experimental setup as previous works [26, 32, 35], we preprocess and split both datasets into disjoint
training and test sets: 82 + 20 classes for the Oxford-102 dataset and 150 + 50 classes for the CUB dataset. For the COCO
dataset [20], we use the 82,783 training images and the 40,504 validation images for training and testing, respectively, with
each image annotated with 5 text descriptions. We also use the same text embeddings pretrained by a char-CNN-RNN
encoder [25].
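The class-disjoint splits (82 + 20 for Oxford-102, 150 + 50 for CUB) mean that no category appears in both training and test sets. The papers cited above use fixed, published splits; the sketch below only illustrates the disjointness property with a random assignment.

```python
import random

def split_by_class(num_classes, num_train, seed=0):
    # Shuffle class ids and cut them into two disjoint sets; a stand-in
    # for the fixed splits published with Oxford-102 and CUB.
    classes = list(range(num_classes))
    random.Random(seed).shuffle(classes)
    return set(classes[:num_train]), set(classes[num_train:])

train_cls, test_cls = split_by_class(102, 82)   # Oxford-102: 82 + 20
```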
Figure 10: Examples of images generated using either the adversarial loss or the l2 loss for the cycle consistency.
[Figure 12 image grid: inferred style (ẑsource, ẑtarget), inferred content (ĉsource, ĉtarget), and Target image.]
Figure 12: Example of the inferred style controlling the number of petals and the pose of flowers, from facing upright to
facing front.
[Figure 13 image grid: Source image, inferred style (ẑsource, ẑtarget), inferred content (ĉsource, ĉtarget), and Target image.]
Figure 13: Example of the inferred style controlling the number of flowers, from a single flower to multiple flowers.
[Figure 14 image grid: Source image, inferred style (ẑsource, ẑtarget), inferred content (ĉsource, ĉtarget), and Target image.]
Figure 14: Example of the inferred style controlling the pose of birds, from sitting to flying.
[Figure 15 image grid: Source image, inferred style (ẑsource, ẑtarget), inferred content (ĉsource, ĉtarget), and Target image.]
Figure 15: Example of the inferred style controlling the pose of birds, from facing right to facing left, and the emergence of
tree branches in the background.
[Figure: example text descriptions used for synthesis:
- This flower is white and yellow in color, with only one large petal.
- Flower is big and round, petals are orange and the pistil is orange.
- This flower has light, purple petals and a bee on its stigma.
- This flower has petals that are white and are bunched together.
- This flower has a light purple bottom petal with three other petals that are dark purple with green streaks.]
Figure 18: Style interpolations with synthetic style sources: moving flowers and a growing number of flowers.