
FrankenGAN: Guided Detail Synthesis for Building Mass-Models

Using Style-Synchronized GANs


Tom Kelly, Paul Guerrero, Anthony Steed, Peter Wonka, and Niloy J. Mitra
arXiv:1806.07179v1 [cs.GR] 19 Jun 2018

[Figure 1 panels: input (style and mass models); output (detailed geometry + textures)]

Figure 1: We present FrankenGAN, a method to add detail to coarse mass models. Diverse geometry
and texture detail can automatically be added to a mass model (left), while giving the user control over the
resulting style. Details are generated by multiple generative adversarial networks with synchronized style.
A more detailed view of this model is shown in Figure 14.

Abstract

Coarse building mass models are now routinely generated at scales ranging from individual buildings through to whole cities. For example, they can be abstracted from raw measurements, generated procedurally, or created manually. However, these models typically lack any meaningful semantic or texture details, making them unsuitable for direct display. We introduce the problem of automatically and realistically decorating such models by adding semantically consistent geometric details and textures. Building on the recent success of generative adversarial networks (GANs), we propose FrankenGAN, a cascade of GANs to create plausible details across multiple scales over large neighborhoods. The various GANs are synchronized to produce consistent style distributions over buildings and neighborhoods. We provide the user with direct control over the variability of the output. We allow her to interactively specify style via images and manipulate style-adapted sliders to control style variability. We demonstrate our system on several large-scale examples. The generated outputs are qualitatively evaluated via a set of user studies and are found to be realistic, semantically-plausible, and style-consistent.

1 Introduction

We propose a framework to add geometric and texture details to coarse building models, commonly referred to as mass models. There are multiple sources of such coarse models: they can be generated from procedural models, reconstructed from aerial images and LIDAR data, or created manually by urban planners, artists and architects.

By themselves, these mass models lack geometric and texture details, and hence look unrealistic when
directly displayed. For many applications it is desirable to decorate these models by adding details that are automatically generated. However, naive attempts to decorate the mass models lead to unconvincing results. Examples include assigning uniform colors to different building fronts, 'attaching' rectified street-view images (when available) to the mass model faces, or synthesizing facade textures using current generative adversarial networks (see Figure 2). A viable alternative is to use rule-based procedural models, both for geometric and texture details. However, such an approach requires time and expertise to author. Instead, we would like to build on the recent success of machine learning using generative adversarial networks (GANs) to simplify the modeling process and learn facade details directly from real-world data. There are two issues with current work that we would like to improve upon to make GANs suitable for our application. First, current networks are only concerned with the generation of textures, but we also require semantic and geometric details for our urban modeling applications. Second, we would like to improve the style control of the facade details. For example, a user may wish to decorate a mass model in the style of a given example facade, or specify how similar facades should be within an urban neighborhood.

In this paper, we consider the problem of automatically and realistically detailing mass models using user-supplied building images for style guidance, with the option to adjust style variety. By details, we refer to both geometric and texture details. Geometric details include balconies, window frames, roof types, chimneys, etc., while texture detail refers to realistic facade appearances that are consistent with (latent) semantic elements such as windows, sills, etc. Further, we require the added details to be stylistically consistent and guided by the user-provided input image(s). Note that we do not expect the input images to be semantically-annotated or to correspond to the target mass models. The output of our algorithm (see Figure 1) is a mixture of 2.5D geometry with textures and realistic variations.

Technically, we perform detailing in stages via a cascade of GANs. We engineer the individual GANs to generate particular styles which are encoded as latent vectors. As a result we can synchronize the different GAN outputs by selecting style vectors from appropriate distributions. We demonstrate how to perform such style-guided synthesis for both geometric and texture details. By allowing style vectors to be guided by input images, we allow users to perform, at interactive rates, drag-and-drop stylization by simply providing different target images (see supplementary video). Our system FrankenGAN ensures that the resultant detailed models are realistic and are stylistically consistent both within individual buildings and between buildings in a neighborhood.

We demonstrate our system by detailing a variety of mass models over large-scale city neighborhoods, and compare the quality of detailed models against baseline alternatives using a user study. In summary, our main contributions are: (i) introducing the problem of realistically detailing mass models with semantically consistent geometric and texture details; (ii) presenting FrankenGAN as an interactive system to this end that utilizes latent style vectors via a cascade of synchronized GANs guided by exemplar style images; and (iii) demonstrating the system on several large-scale examples and qualitatively evaluating the generated output.

[Figure 2 panels: uniform colors, Pix2Pix, FrankenGAN]

Figure 2: Naive options to texture a mass model give unconvincing results. Pix2Pix shows heavy mode collapse, discussed in Section 6.3.

[Figure 3 panels: mass model, Google Earth, low variation, medium variation, high variation]

Figure 3: In contrast to photogrammetric reconstruction (second column), our method can be used to synthesize new facade layouts and textures. Style and variation can be controlled by the user; in columns 3 to 5, we show details generated by FrankenGAN with low to high style variation.

2 Related Work

In the following, we review related work in the computational design of facade layouts and generative adversarial networks.


Computational facade layouts. Rule-based procedural modeling can be used to model facades and mass models [46, 29, 34]. An alternative to pure procedural modeling is the combination of optimization with declarative or procedural descriptions. There exist multiple recent frameworks specifically targeted at the modeling of facades and buildings [5, 26, 2, 15, 6] and urban layouts [42]. There are also multiple approaches that target more general procedural modeling [37, 48, 32].

Another important avenue of recent work is the combination of machine learning and procedural modeling. One goal is inverse procedural modeling, where grammar rules and grammar parameters are learned from data. One example approach is Bayesian model merging [36], which was adopted by multiple authors for learning grammars [38, 28]. While this approach shares some goals with our project, the learning techniques employed were not powerful enough to encode design knowledge and thus only very simple examples were demonstrated. Recently, deep learning was used for shape and scene synthesis of architectural content. Nishida et al. [30] proposed an interactive framework to interpret sketches as the outer shell of 3D building models. More recently, Wang et al. [18] used deep learning to predict furniture placement inside a room. Automatic detailing of mass models, as is the focus of this paper, has not yet been attempted using a data-driven approach.

Generative adversarial networks (GANs). Supervised deep learning has played a crucial role in recent developments in several computer vision tasks, e.g., object recognition [14, 23], semantic segmentation [27], and activity recognition/detection [10, 39]. In contrast, weakly-supervised or unsupervised deep learning has been popular for image and texture synthesis tasks. In this context, Generative Adversarial Networks (GANs) have emerged as a promising family of unsupervised learning techniques that have recently shown the ability to model simple natural images (e.g., faces and flowers) [11]. They can learn to emulate the training set, enable sampling from that domain, and use the learned knowledge for useful applications. Since their introduction, many variations of GANs have been put forward to overcome some of the impediments they face (e.g., instability during training) [1, 4, 7, 12, 33, 35, 45, 49]. Three versions of GANs are of particular interest for our work. First, the Pix2Pix framework using conditional GANs [16] is useful for image-to-image translation; e.g., the paper showed results for translating a facade label image into a textured facade image. Second, cycleGAN [51] is useful to learn image-to-image translation tasks without requiring a corresponding pair of images in the two styles. Third, bicycleGAN [52] is another extension of image-to-image translation that improves the variations that can be generated.

Interestingly, GANs have proven useful in several core image processing and computer vision tasks, including image inpainting [47], style transfer [17], super-resolution [16, 25], manifold traversing [50], hand pose estimation [44], and face recognition [40].
Figure 4: FrankenGAN overview. Individual GANs are denoted by G, and yellow rectangles denote GAN
chains or geometry generation modules. Given a mass model and an optional style reference (any part can
be replaced by a random style), the model is detailed in three steps. First, two chains of GANs generate
texture and label maps for facades and roofs. Then, the resolution of the generated textures is increased
by a dedicated window generation chain and two super-resolution GANs for roofs and facades. Finally, 3D
details are generated for roofs, windows, and facades based on the generated textures and label maps.

However, the applications of GANs are not limited to the computer vision and image processing communities; adversarial models are being explored for graphics applications. Examples include street network synthesis [13], volume rendering engines [3], and adaptive city-scene generation [43]. In this paper, we build on GANs and adapt them to detail synthesis for building mass models using exemplar images for style guidance.

3 Overview

The input to our method is a coarse building mass model with a given set of flat facades and a roof with known hip- and ridge-lines. Our goal is to generate texture and geometric details on the facades and the roof. These details are generated with a cascade of GANs. Our generative approach contrasts with the reconstruction performed by photogrammetric approaches; see Figure 3 for a comparison. In contrast to traditional GAN setups, our method allows control over the style of the synthesized outputs, and generates geometric detail in addition to texture detail. Geometric detail is generated by training the GANs to output an additional label map that can be used to select the type of geometry to place at each location on the facade and roof (if any).

We interleave GANs which output structural labels and those which output textures. There are several desirable properties of cascading the GANs in this manner that are difficult to obtain with standard GAN setups. Firstly, the output should exhibit a plausible structure. For example, windows tend to be arranged in grids, and the ground floor usually has a different structure than the remaining floors. In our experiments we found that training a GAN end-to-end makes it harder to obtain plausible structure; the structure is never given explicitly as an objective and it must be deduced from the facade texture. The structural labels also allow us to map the output bitmaps to 3D geometry, regularize the outputs using known priors, and permit users to manually modify the structure. Secondly, there should be some control over the style of the output. Facades of the same building usually have the same
style, while the amount of style variation in a block of buildings depends on the city and area. Generating realistic city blocks therefore requires control over the style. See Figure 3 for a few examples. Thirdly, we wish to improve upon the quality achievable with a generic GAN architecture like Pix2Pix [16]. While recent work has shown remarkable quality and resolution [19], achieving this quality with a single network trained end-to-end comes at a prohibitive resource cost.

We improve these properties in our outputs at a reasonable resource cost by splitting the traditional single-GAN setup into multiple smaller steps that can be trained and evaluated separately. Synchronization across different steps, using a low-dimensional embedding of style, ensures the consistency of outputs. Additionally, the style embedding can be manipulated by the user, giving control over the style distribution on a building, a block or multiple blocks.

Figure 4 shows an overview of the steps performed to generate facade and roof details. In the following we assume the user is working on a block of buildings, but similar steps also apply to multiple building blocks or a single building. First, the user defines style distributions for the building block. Style distributions can be provided for several building properties, such as facade texture, roof texture, and window layouts. Each distribution is modeled as a mixture of Gaussians in a low-dimensional style space, where the user may choose n modes by providing n reference images (which do not need to be consistent with the building shape or each other), and optionally a custom variance. Each building in the block samples one style vector from this distribution and uses it for all windows, facades and the roof. Specifying styles gives more control over the result, but is optional; the style for any building property can instead be entirely random. Section 5.1 provides details.

After each building style is defined, two separate chains of GANs with similar architectures generate the facade and roof textures, as well as the corresponding label maps. Each GAN in the chain performs an image-to-image mapping based on BicycleGAN [52]. We extend this architecture with several conditional inputs, including a mask of the empty facade or roof, and several metrics describing the input, including its approximate scale and a distance transform of the input boundary. This information about global context makes it easier for a network that only operates on local patches to make global decisions, such as generating details at the correct scale, or placing objects such as doors at the correct height. Details about the GAN architecture are provided in Section 4.

To generate facade textures and label maps, three of these GANs are chained together, each performing one step in the construction of the final facade details. The first GAN generates window labels from a blank facade mask, the second GAN transforms these labels into the facade texture, and the final GAN detects non-window labels in the facade texture to generate a second, detailed label map. Similar steps are performed for roof textures and label maps. As we will show, this multi-step approach results in higher quality details than an end-to-end approach.

The resolution of the facade and roof textures is limited by the memory requirements of the GANs. To obtain higher-resolution textures without significantly increasing the memory requirements, we employ two strategies: first, since windows are prominent features that usually exhibit fine details, we texture windows individually. A GAN chain creates window-pane labels in a first step, and window textures from these labels in a second step. Second, we increase the resolution of the roof and wall textures using a super-resolution GAN applied to wall patches of fixed size. This GAN has the same architecture as the GANs used in all other steps. Details on all GAN chains are given in Section 5.2.

Finally, geometric details are lifted to 3D using procedural geometry based on the generated label maps, and the resulting detailed mass models are textured using the generated textures. We also use label maps and textures to define decorative 3D details and material property maps. Details are provided in Section 5.4.

4 GAN architecture

Our texture and label maps are generated in multiple steps, where each step is an image-to-image transformation, implemented with a GAN.
[Figure 5 diagram: evaluation (top): a sampled style Z drives G(A, Z); training (bottom): the encoder E(B) provides the style, with losses LKL, LLR, LL1 and LGAN via the discriminator D.]

Figure 5: GAN architecture. The setup used during evaluation is shown in the top row, and the training setup in the bottom row. Dotted lines denote random sampling.

Traditional image-to-image GANs such as Pix2Pix [16] can learn a wide range of image-to-image transformations, but the variation of outputs that can be generated for a given input is limited. Since we aim to generate multiple different output styles for a given input, we base our GAN architecture on the recently introduced BicycleGAN architecture [52], which explicitly encodes the style in a low-dimensional style space and allows generating outputs with multiple styles. In this section we will briefly recap the BicycleGAN setup and describe our modifications.

Image-to-image GANs train a generator function

$$B = G(A, Z) : \mathbb{R}^n \times \mathbb{R}^k \rightarrow \mathbb{R}^n,$$

that transforms an input image A to an output image B. For example, A might be an image containing color-coded facade labels, such as windows, and B a corresponding facade texture. The second input Z is a vector of latent variables that describe properties of the output image that are not given by the input image, such as the wall color. We call Z the style vector and the space containing Z the style space $\mathcal{Z}$. The embedding of properties into this space is learned by the generator during training. Typically Z is chosen to be random during training and evaluation, effectively randomizing the style.

The generator's goal is to approximate some desired but unknown joint distribution p(A, B), by training it with known samples from this distribution, in the form of two datasets A and B of matching input/output pairs. For example, p(A, B) may be the distribution of matching pairs of facade labels and facade textures. During generator training, the difference between the generated distribution p(A, G(A, Z)) and the desired unknown distribution p(A, B) is measured in an adversarial setup, where a discriminator function D is trained to distinguish between samples from the two distributions, with the following cross-entropy classification loss:

$$\mathcal{L}^{D}_{\mathrm{GAN}}(G, D) = \mathbb{E}_{A,B \sim p(A,B)}\big[-\log D(A, B)\big] + \mathbb{E}_{A,B \sim p(A,B),\, Z \sim p(Z)}\big[-\log\big(1 - D(A, G(A, Z))\big)\big], \qquad (1)$$

where p(Z) = N(0, I) is the prior over style vectors, defined to be a standard normal distribution. The generator is trained to output samples that are misclassified by the discriminator as being from the desired distribution:

$$\mathcal{L}^{G}_{\mathrm{GAN}}(G, D) = \mathbb{E}_{A,B \sim p(A,B),\, Z \sim p(Z)}\big[-\log D(A, G(A, Z))\big]. \qquad (2)$$

Additionally, an L1 or L2 loss term between the generated output and the ground truth is usually included:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{A,B \sim p(A,B),\, Z \sim p(Z)}\, \big\| B - G(A, Z) \big\|_1. \qquad (3)$$

In general, the conditional distribution p(B|A) for a fixed A may be a multi-modal distribution with large variance; for example, there is a wide range of possible facade textures for a given facade label image. However, previous work [16] has shown that in typical image-to-image GAN setups, the style vector Z is largely ignored, resulting in a generator output that is almost fully determined by A, and restricting p(B|A) to have low variance. To solve this problem, BicycleGAN uses an encoder E that obtains the style from an image, and combines additional loss terms introduced in previous work [8, 9, 24] to ensure the style is not ignored by the generator.

First, using ideas from Variational Autoencoders [21], the encoder outputs a distribution E(B) of styles for each image instead of a single style vector. In Equations 1 to 3, p(Z) = E(B) is used instead of
the standard normal distribution. The distribution E(B) is regularized to be close to a standard normal distribution to encourage style vectors to form a large contiguous region in style space that can easily be sampled:

$$\mathcal{L}_{\mathrm{KL}}(E) = \mathbb{E}_{B \sim p(B)}\big[ D_{\mathrm{KL}}\big(E(B)\,\|\,\mathcal{N}(0, I)\big) \big], \qquad (4)$$

where D_KL is the KL-divergence. Second, the generator is encouraged not to ignore style by including a style reconstruction term:

$$\mathcal{L}_{\mathrm{LR}}(E) = \mathbb{E}_{A \sim p(A),\, Z \sim \mathcal{N}(0,I)}\, \big\| Z - \hat{E}(G(A, Z)) \big\|_1, \qquad (5)$$

where Ê denotes the mean of the distribution output by E. Intuitively, this term measures the reconstruction error between the style given to the generator as input and the style obtained from the generated image. The full loss for the generator and encoder is then:

$$\mathcal{L}^{G}(G, D, E) = \lambda_{\mathrm{GAN}} \mathcal{L}^{G}_{\mathrm{GAN}}(G, D) + \lambda_{L1} \mathcal{L}_{L1}(G) + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}}(E) + \lambda_{\mathrm{LR}} \mathcal{L}_{\mathrm{LR}}(E). \qquad (6)$$

The hyper-parameters λ control the relative weight of each loss. A diagram of this architecture is shown in Figure 5.
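For illustration, the combined generator/encoder objective of Equations 1 to 6 could be computed as in the following minimal sketch. It assumes a PyTorch-style setup in which G, D and E are user-supplied modules, D outputs probabilities, and E returns the mean and log-variance of the style distribution E(B); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), cf. Eq. (4)
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()

def generator_encoder_loss(G, D, E, A, B, weights=(1.0, 10.0, 0.01, 0.5)):
    """Combined loss of Eq. (6) for one batch of matched images (A, B)."""
    l_gan, l_l1, l_kl, l_lr = weights

    # Encode the ground-truth image into a style distribution and sample from it
    mu, logvar = E(B)
    z_enc = mu + (0.5 * logvar).exp() * torch.randn_like(mu)

    # Adversarial and L1 terms (Eqs. 2 and 3, with p(Z) = E(B))
    fake = G(A, z_enc)
    d_fake = D(A, fake)
    loss_gan = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_l1 = (B - fake).abs().mean()

    # KL regularization of the encoded style distribution (Eq. 4)
    loss_kl = kl_to_standard_normal(mu, logvar)

    # Style reconstruction: a random style must be recoverable from the output (Eq. 5)
    z_rand = torch.randn_like(mu)
    mu_rec, _ = E(G(A, z_rand))
    loss_lr = (z_rand - mu_rec).abs().mean()

    return l_gan * loss_gan + l_l1 * loss_l1 + l_kl * loss_kl + l_lr * loss_lr
```

The discriminator is trained separately with the cross-entropy loss of Equation 1, alternating with this generator/encoder objective (see Section 6.1).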
FrankenGAN trains a BicycleGAN for each individual step, with one exception that we will discuss in Section 5. In addition to style, we also need to encode the real-world scale of the input mass model, so that the output details can be generated in the desired scale. We condition the GANs on an additional constant input channel that contains the scale of the facade in real-world units. This channel is appended to A.

In our experiments, we observed that using this BicycleGAN setup to directly go from inputs to outputs in each step often resulted in implausible global structure, such as badly misaligned windows or ledges on facades. Due to the limited receptive field of outputs in both the generator and the discriminator, coordination between distant output pixels is difficult, making it hard to create globally consistent structure. Increasing the depth of the networks to increase the receptive field alleviates the problem but has a significant resource cost and destabilizes training. We found it to be more efficient to condition the GANs on additional information about the global context of each pixel. More specifically, we condition the GANs on 5 additional channels that are appended to A: the distance in real world units to each side of the bounding box and to the nearest boundary of a facade or roof. Examples are shown in Figure 6.

Figure 6: GANs are conditioned on additional channels that include information about the global context at each pixel. Given a facade/roof mask we include the distance to the facade boundary and the distance to each bounding box side, making it easy for the network to decide how far it is from the boundary and at what height on the facade.
5 Franken-GAN

Detail generation is performed by a cascade of textures and label maps, as shown in Figure 4. These are generated by FrankenGAN in several separate chains of GANs, where each GAN is trained and run independently. Splitting up this task into multiple steps instead of training end-to-end has several advantages. First, we can provide more guidance in the form of intermediate objectives, for example window layouts, or low-resolution textures. In our experiments, we show that this provides a significant advantage over omitting this guidance. While in theory there are several ways to provide similar intermediate objectives for an end-to-end network, for example by concatenating our current GANs, this would result in extremely large networks, leading to the second point: GAN training is notoriously unstable. An end-to-end network with intermediate objectives would need to have a very large generator with multiple discriminators, making it hard to achieve stable training. Third, splitting up the network reduces resource costs during training. Instead of a single very large

Figure 7: FrankenGAN details. Each GAN chain (yellow rectangles) consists of several GANs (G) that
each perform an image-to-image transformation. GANs are usually conditioned on additional inputs (arrows
along the bottom), and are guided by a reference style (arrows along the top). Label outputs are regularized
(R) to obtain clean label rectangles. Figure 4 shows these chains in context.

network, we can separately train multiple smaller networks. Note that training a very large network one part at a time would require storing and loading parts of the network from disk in each forward and backward pass, which is prohibitive in terms of training times. Finally, we can regularize intermediate results with operations that are not differentiable or would not provide a good gradient signal.

Figure 8: Specifying a style distribution gives control over the generated details. The top row shows three results generated with the same user-specified style distribution, while the bottom row uses the style prior, giving random styles. Note how the buildings in the top row have a consistent style while still allowing for some variation (depending on the variance chosen by the user), while the bottom row does not have a consistent style.

5.1 Style Control

One difficulty with using separate GANs is achieving stylistically consistent results. For example, windows on the same facade usually have a similar style, as do ledges or window sills. Style control is also necessary beyond single roof or facade textures: adjacent facades on a building usually look similar, and city blocks may have buildings with similar facade and roof styles. A comparison of generated details with and without style control is given in Figure 8.

In FrankenGAN, style can be specified for eight properties: the coarse facade and roof texture, facade and roof texture details, such as brick patterns, the window layout on the facade, the glass pane layout in windows, the window texture, and the layout of chimneys and windows on a roof. The user can describe the style distribution of a property over a building block with a mixture of isotropic Gaussians in style space $\mathcal{Z}$:

$$p(Z \mid S, \sigma) = \sum_{i=1}^{m} \phi_i\, \mathcal{N}\big(E(S_i), \sigma_i\big), \qquad (7)$$

where Z ∈ $\mathcal{Z}$ is the style vector, N is the normal distribution, and the weights φi must sum to 1. Z provides a compact representation of style; we use an 8-dimensional style space in our experiments.
The means of the Gaussians are specified by encoding m style reference images Si ∈ S with the encoder described in the previous section. The variance σ = (σ1, . . . , σm) specifies the diversity of the generated details and can be adjusted per reference image with a slider. One of these distributions can be specified per property and the styles are sampled independently.

In many cases however, the styles of different properties are dependent. For example, the color of roofs and facades may be correlated. To specify these dependencies, several sets of property distributions may be specified; each set Si = {(Sp, σp)}p=1...8 contains one mixture model per property p. For each building, one of these sets is chosen at random. The special case of having a single Gaussian (m = 1) per property in each set effectively gives a Gaussian mixture model over the joint space of all properties, each set being one component. It is important to note that the user does not need to provide the style for all properties. Any number of properties may be left unspecified, in which case the style vector is sampled from the GAN's style prior, which is a standard normal distribution.
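As an illustration, drawing a per-building style vector from the mixture of Equation 7, or from the prior when a property is left unspecified, could look like the sketch below. The encoder outputs E(S_i) are assumed to be precomputed 8-dimensional vectors, and σ_i is treated as the standard deviation of the isotropic component; names and defaults are ours.

```python
import numpy as np

def sample_style(ref_means, sigmas, weights=None, dim=8, rng=np.random):
    """Sample one style vector from the mixture in Eq. (7).
    `ref_means` are encoded reference images E(S_i) (each a `dim`-vector),
    `sigmas` the per-reference spreads chosen with the sliders."""
    if not ref_means:                                  # property left unspecified:
        return rng.standard_normal(dim)                # fall back to the style prior
    k = len(ref_means)
    weights = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, float)
    i = rng.choice(k, p=weights / weights.sum())       # pick a mixture component
    return np.asarray(ref_means[i]) + sigmas[i] * rng.standard_normal(dim)
```

One such vector is drawn per building and per property (or per set of properties when dependencies are modeled), and reused for all of that building's windows, facades and roof.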
5.2 Detail Generation

FrankenGAN uses five chains of GANs, which can be split into two groups: two chains for generating initial coarse details (textures and label maps) for roofs and facades, and three chains for increasing the resolution given the coarse outputs of the first group. Details of these chains are shown in Figure 7. Most of the chains have intermediate results, which are used for the geometry synthesis we will describe in Section 5.4. Each GAN takes an input image and outputs a transformed image. In addition to the input image, all GANs except the super-resolution networks are conditioned on the scale and context information described in the previous section, making it easier to generate consistent global structure. Each GAN is also guided by a style that is drawn from a distribution, as described above. Figure 7 shows reference images Si that are used to specify the distribution. Images output by GANs are either label maps L, or textures T. Each label map output by a GAN is passed through a regularizer R, denoted by the red rectangles in Figure 7, to produce a clean set of boxes R(L) before being passed to the next step. We will now describe each chain in detail. The regularizers will be described in Section 5.3.

Roofs The roof chain generates roof detail labels and the coarse roof texture. The chain starts with a coarse label map of the roof as input image. This label map includes ridge and valley lines of the roof, which are obtained directly from the mass model. Flat roofs are labeled with a separate color. The first GAN adds chimneys and pitched roof windows. These labels are regularized and then used by the second GAN to generate the coarse roof texture.

Facades The facade chain generates window labels, full facade labels, and the coarse facade texture. Window labels are generated separately from the full labels, since they may be occluded by some of the other labels. The first GAN starts by creating window and door labels from facade boundaries. These are regularized before being used as input in the second GAN to generate the coarse facade texture. The third GAN detects the full set of facade labels from the facade texture, including ledges, window sills, and balconies, which are also regularized to give a cleaner set of labels. The third GAN has a different architecture: since we expect there to be a single correct label map for each facade texture, we do not need style input, which simplifies the GAN to the Pix2Pix architecture [16]. Since the window and door labels are known at this point, we also condition this GAN on these labels. Detecting the full label set from the facade texture, instead of generating it beforehand and using it as input for the texture generation step, is a design choice that we made after experimenting with both variants. Detail outlines in the generated texture tend to follow the input labels very closely, and constraining the details in this way results in unrealistic shapes and a reduced variability. For all three networks, areas occluded by nearby facades are set to the background colour; this ensures that the feature distribution takes into account the nearby geometry.

Since the layout of dormer windows needs to be
consistent with the facade layout, we create them in the facade chain. More specifically, we extend the facade mask given as input to this chain with a projection of the roof to the facade plane. This allows us to treat the roof as part of the facade and generate a window layout that extends to the roof. Roof features (chimneys or pitched windows) which intersect dormer windows are removed.

Windows To obtain high resolution window textures, the window chain is applied to each window separately, using a consistent style. Each window is cut out from the window label map that was generated in the facade chain, and scaled up to the input resolution of the GAN. The steps in the window chain are then similar to the roof chain. We generate glass pane labels from the window region, regularize them, and generate the high-resolution window texture from the glass pane labels.

Figure 9: Super-resolution. Given inputs (green, magenta), the super-resolution network creates high quality textures for the walls, while the window GAN chain provides high quality windows. Note that the window label and window texture networks run once for every window.

Super-resolution High resolution roof and facade textures are obtained with two GANs that are trained to generate texture detail, such as bricks or roof tiles, from a given low-resolution input. Roof and facade textures are split into a set of patches that are processed separately. Each patch is scaled up to the input size of the GANs before generating the high-resolution output. Consistency can be maintained by fixing the style over the building. The output patches are then assembled to obtain the high-resolution roof and facade textures. Boundaries between patches are blended linearly to avoid seams. Examples are shown in Figure 9. Our interleaved GANs allow us to augment the super-resolution texture map with texture cues from the label maps. For example, window sills are lightened, and roof crests are drawn; these augmentations take the form of drawing the labels in a single colour with a certain alpha. Note that because of the large scale of the super-resolution bitmaps, we explicitly state which figures use these two networks.
5.3 Regularizers

Our GANs are good at producing varied label maps that follow the data distribution in our training set; however, alignment between individual elements is usually not perfect. For example, window sills may not be rectangular or have different sizes in adjacent windows, or ledges may not be perfectly straight. Although the discrepancy is usually small, it is still noticeable in the final output. Our multi-step approach allows us to use any non-differentiable (or otherwise) regularization. We exploit domain-specific knowledge to craft simple algorithms to improve the alignment of the label maps, and provide 3D locations for geometric features. In the following we describe our regularizers in detail.

Roof detail labels Chimneys and pitched roof windows are regularized by fitting boxes that are aligned to the ridge lines. More specifically, we take the bounding box for each connected component of a given label, crop its size to the typical size range for the given label class, align it to the closest ridge line, and shrink it so that it lies entirely within the roof extent.
Facade window and door labels The window and door layout on a facade has to be regularized without removing desirable irregularities introduced by the GAN that reflect the actual data distribution, such as different window sizes on different floors, or multiple overlayed grid layouts. We start by fitting axis-aligned bounding boxes to doors and windows. We then collect a set of properties for each window, including the extent in x- and y-direction and their size, and run mean-shift with a small kernel size on each property in turn, until convergence. This ensures that these properties can have a multi-modal distribution, preserving desirable irregularities, but that small-scale irregularities in these properties are removed.
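As an illustration, this per-property mean-shift regularization could be sketched as below using scikit-learn's mean-shift; the bandwidth value and helper names are assumptions for the example, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import MeanShift

def snap_property(values, bandwidth):
    """Cluster one 1D window property (e.g. all window widths on a facade) with
    mean-shift and snap each value to its cluster centre. Genuine multi-modal
    variation (e.g. different floors) survives; small jitter is removed."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    ms = MeanShift(bandwidth=bandwidth).fit(values)
    return ms.cluster_centers_[ms.labels_, 0]

def regularize_window_boxes(boxes, bandwidth=0.15):
    """`boxes` is an (n, 4) array of axis-aligned window boxes (x, y, w, h) in
    facade-relative units; each property is regularized in turn."""
    boxes = np.asarray(boxes, dtype=float).copy()
    for col in range(4):                       # x, y, width, height
        boxes[:, col] = snap_property(boxes[:, col], bandwidth)
    return boxes
```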
Facade detail labels Since adjacent details are often not perfectly aligned, we snap nearby details, such as window sills and windows, to improve the alignment. We also observed that in the generated label map, the placement of small details such as window sills and moldings is sometimes not coordinated over larger distances on the facade. To improve regularity, we propagate details such as window sills and moldings that are present in more than 50% of the windows in a row to all remaining windows.

Window detail labels The glass pane layout in a window is usually more regular than the window layout on facades, allowing for a simpler regularization: we transform the label map into a binary glass pane mask and approximate this 2D mask by the outer product of two 1D masks, one for the columns, and one for the rows of the 2D mask. This representation ensures that the mask can only contain a grid of square glass panes. The two 1D masks are created by taking the mean of the 2D mask in x- and y-direction, and thresholding them at 0.33. An example is shown in Figure 10.

Figure 10: Window detail regularization. All our label maps are regularized before being used by the next GAN in the chain. Here, we regularize window glass pane labels using the outer product of two 1D label masks.
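A direct transcription of this outer-product regularization might look as follows (illustrative sketch; the array conventions are ours).

```python
import numpy as np

def regularize_glass_panes(pane_mask, threshold=0.33):
    """Approximate a noisy 2D glass-pane mask by the outer product of two 1D
    masks (one per axis), so the result is always a clean grid of panes.
    `pane_mask` is a 2D boolean/float array over the window region."""
    pane_mask = np.asarray(pane_mask, dtype=float)
    cols = pane_mask.mean(axis=0) > threshold   # 1D mask over columns (x)
    rows = pane_mask.mean(axis=1) > threshold   # 1D mask over rows (y)
    return np.outer(rows, cols)                 # grid-structured 2D mask
```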
Figure 11: Generated window textures and label maps are merged back into the facade texture, increasing the fidelity of window textures, normals and materials.

5.4 Geometry Synthesis

As output of the five GAN chains, we have high-resolution roof, facade, and window textures, as well as regularized label maps for roof details, facade details, and window panes. These are used to generate the detailed mass model. First, geometry for details is generated procedurally based on the label maps. For details such as window sills and ledges we use simple extrusions, while balconies and chimneys are generated with small procedural programs to fit the shape given by the label map.

To apply the generated textures to the detailed mass models, UV maps are generated procedurally along with the geometry. In addition to textures, we also define building materials based on the label maps. Each label is given a set of material properties: windows, for example, are reflective and have high glossiness, while walls are mostly diffuse. To
further increase the fidelity of our models, textures and label maps are used to heuristically generate normal maps. The intensity of the generated texture is treated as a height field that allows us to compute normals. While this does not give us accurate normals, it works well in practice to simulate the roughness of a texture. Per-label roughness weights ensure that details such as glass panes still remain flat. Finally, generated window textures and label maps are merged back into the facade textures; an example is given in Figure 11.
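A heuristic of this kind can be sketched in a few lines; treating intensity as height follows the text, while the packing into an RGB normal map and the strength factor are illustrative assumptions.

```python
import numpy as np

def normals_from_texture(texture, strength=1.0):
    """Heuristic normal map: treat texture intensity as a height field and
    derive normals from its gradients. `texture` is an HxWx3 array in [0, 1]."""
    height = texture.mean(axis=2)                    # intensity as height
    dy, dx = np.gradient(height)                     # finite differences
    normals = np.dstack([-dx * strength, -dy * strength, np.ones_like(height)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return 0.5 * (normals + 1.0)                     # pack into RGB [0, 1]
```

Per-label roughness weights (e.g. flattening glass panes) would then scale `strength` per pixel according to the label map.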
5.5 User interface

Our system contains a complete framework for interactively using FrankenGAN. A user may select an urban city block and specify a distribution, then the system adds the additional geometry and textures to the 3D view. At this point the user can edit semantic details (such as window locations), while seeing texture updates in real-time. Of note is our interface to build our joint distributions (Figure 12), which continually shows the user new examples drawn from the current distribution. Please see the accompanying video for a full description. The source code of our system, and weights for the accompanying networks, will be made available online.

Figure 12: The distribution designer UI (see supplemental video). Given a style distribution (a), the system continuously shows evaluations of that distribution (c). By clicking on an image, the user can see the network inputs (b). Different networks can be selected (d). The style distribution for any network is a Gaussian mixture model that may have multiple modes (e), the mean of which is given by an exemplar image.

6 Results

We evaluate our method on several scenes, consisting of procedurally generated mass models. We qualitatively show the fidelity of our output and the effect of style and scale control. Style and scale control are also evaluated quantitatively with a user study, and we provide comparisons to existing end-to-end networks, both qualitatively and quantitatively, through another user study.

6.1 Datasets and Training Setup

Each GAN in our framework performs an image-to-image transformation that is trained with a separate dataset of matched image pairs (we will release these annotated datasets or provide links to them based on permission). Matched images for these pairs are obtained from three datasets:

The facade dataset consists of the CMP dataset [41] and a larger dataset of labeled facades that has not yet been released, but that has been made available to us by the authors. The combined dataset contains 3941 rectified facades with labels for several types of details, including doors, windows, window sills, and balconies. We further refined this dataset by removing heavily occluded facades, and by annotating the height of a typical floor in each facade, to obtain the real-world scale that our GANs are conditioned on, as described in Section 4. From this dataset, we create matched pairs of images for each GAN in the facade chain.

The roof dataset consists of 585 high-quality roof images with labeled roof area, ridge/valley lines, pitched windows and chimneys. The images are part of an unreleased dataset. We contracted professional labellers to create high-quality labels. From this dataset, we create matched pairs of images for the two GANs in the roof chain.

Figure 13: Datasets used to train our GANs. We use four datasets of labeled images; a few examples are shown here.

The window dataset contains 1376 rectified window images with labeled window areas and glass
panes. These images were obtained from Google
Street View, and high quality labels were created by
professional labellers. From this dataset, we create
matched pairs of images for the two GANs in the
window chain. Examples from all datasets are shown
in Figure 13.
In addition to these three datasets, the super-
resolution GANs were trained with two separate
datasets. These datasets were created by taking a set
of high-quality roof/wall texture patches downloaded
from the internet, for example, brick patterns or roof
shingles, and blurring them by random amounts. The
networks are then trained to transform the blurred
image to the original image. We found that it in-
creased the performance of our networks to add in
a second texture at a random location in the image.
This accounts for changes in the texture over a facade
or roof that occur in natural images.
To train each GAN, we alternate between dis-
criminator optimization steps that minimize Eq. 1
and generator/encoder optimization steps that min-
imize Eq. 6. The optimization is performed with
Adam [22]. The weights (λGAN , λL1 , λKL , λLR ) in
Equation 6 are set to (1, 10, 0.01, 0.5). A large λL1
encourages results that are close to the average over
all training images. This helps stabilize training for
textures. For label map generation, however, there is
usually one dominant label, such as the wall or the
roof label, and a large L1 loss encourages the gener-
ation of this label over other labels. Lowering the L1
weight to 1 improves results for label maps. Statistics
for our GANs are summarized in Table 1.
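For illustration, one alternating training iteration might look like the sketch below, reusing the generator_encoder_loss function sketched in Section 4; the Adam optimizers, the assumption that D outputs probabilities, and the 8-dimensional random style are illustrative choices consistent with the text, not the released code.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, E, opt_D, opt_GE, A, B, weights=(1.0, 10.0, 0.01, 0.5)):
    """One alternating optimization step: discriminator (Eq. 1), then generator/encoder (Eq. 6).
    opt_D and opt_GE are e.g. torch.optim.Adam over D's and G+E's parameters."""
    # --- discriminator step, Eq. (1) ---
    z = torch.randn(A.size(0), 8, device=A.device)          # 8-dimensional style space
    fake = G(A, z).detach()
    d_real, d_fake = D(A, B), D(A, fake)
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- generator/encoder step, Eq. (6), using the loss sketched in Section 4 ---
    loss_GE = generator_encoder_loss(G, D, E, A, B, weights)
    opt_GE.zero_grad(); loss_GE.backward(); opt_GE.step()
    return loss_D.item(), loss_GE.item()
```

Per the text, the L1 weight in `weights` would be lowered to 1 when training the label-map networks.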

Table 1: GAN statistics: the size of their training data (n), resolution (in pixels square), number of epochs trained, and whether the network takes style as an input.

network                   n      resolution  epochs  style
roof labels               555    512         400     yes
roof textures             555    512         400     yes
facade window labels      3441   256         400     yes
facade textures           3441   256         150     yes
facade full labels        3441   256         335     no
window labels             1176   256         200     yes
window textures           1176   256         400     yes
facade super-resolution   2015   256         600     yes
roof super-resolution     1122   256         600     yes

At test time, we evaluate our networks on mass models created through procedural reconstruction from photogrammetric meshes [20]. Facades and roofs on these mass models are labeled, and roof ridges and valleys are known, providing all necessary inputs for FrankenGAN: facade masks and coarse roof label maps. Note that it is also feasible to obtain these inputs automatically from any type of reasonably clean mass model.

6.2 Qualitative Results

We show the quality of our results on one area of Madrid and one area of London, both spanning several blocks. As reference style, we take images of facades, windows and roofs, some of which are taken from our dataset, and some from the web. Figures 1 and 14 show the result of detailing the London scene. Reference style textures and input mass models are shown on the left, our detailed result on the right. We produce varied details that are not completely random, but guided by the style given as input. In this example, several sets of style textures are used. Note the varied window layouts and textures: each building has unique layouts and textures that are never repeated.

Figure 15 shows our results of detailing the Madrid scene. Reference style images and a detail view of the input mass models are shown on the left, our output is shown in the center and on the right, including a detail view of the generated geometry. Note several modes of matching styles in the result, including red buildings with black roofs and yellow buildings that often have several ledges.

Statistics of the London and Madrid scenes are shown in Table 2; each of the scenes has 10 blocks, containing an average of 21 buildings (the number of buildings equals the number of roofs). All of the generated details, including their textures, are unique in the entire scene. In the London scene, we generate approximately 860 Megapixels of texture and 2.68 million triangles, in the Madrid scene approximately 560 Megapixels and 1.17 million triangles. The time taken to generate the scenes is shown in the last column of Table 2. These timings were taken on a standard desktop PC with an Intel 7700k processor, 32GB of main memory, and an NVidia GTX 1070 GPU. The FrankenGAN implementation has two modes: when minimizing GPU memory, a single network is loaded at a time and batch-processes all applicable tasks in a scene; when GPU memory is not a constraint, the whole network can be loaded at once. These modes use 560Mb and 2371Mb of GPU memory respectively, leaving sufficient space on the graphics card to interactively display the textures. These figures compare favourably with larger end-to-end networks that require more memory [19].

6.3 Qualitative Comparisons

In this section, we qualitatively evaluate the style control of FrankenGAN and compare our results to two end-to-end GANs: Pix2Pix [16] and BicycleGAN [52].

Figure 14: Detailed London area. The output of our method is shown at the top, and the input style images
at the bottom left. This is followed, from left to right, by close-ups of the input mass models, detail geometry
generated by our method, and two detail views of the generated model using super-resolution textures.

Similar quantitative evaluations and comparisons, based on two user studies, are provided in the next section.

To evaluate the effects of style control on our outputs, we compare four different types of style distributions applied to similar mass models in Figure 16. On the left, a constant style vector is used for all houses. This results in buildings with similar style, which may, for example, be required for row houses or blocks of apartment buildings. A fully random style is obtained by sampling the style prior, a standard normal distribution. The diversity in this case approximates the diversity in our training set. Using a Gaussian mixture model (GMM) as style distribution, as described in Section 5.1, gives more control over the style. Note the two modes of white and red buildings in the third column, corresponding to two reference style images. However, styles for different building properties, such as roof and facade textures, are sampled independently, resulting in randomly re-mixed property styles; both white and red facades, for example, are mixed with black and red roofs. When specifying multiple sets of GMMs, one set is randomly picked for each house, allowing the user to model dependencies between property styles, as shown in the last column.

Examples for different super-resolution styles are

Figure 15: Detailed Madrid area. Input style images and mass models are shown on the left, an overview of
our output in the center, and close-ups of our output and the generated geometry (green), showing details
like balconies and window moldings, on the right. Lower right panel uses super-resolution textures.

shown in Figure 17. Since a low-resolution texture contains little information about fine texture detail, like the shingles on a roof or the stone texture on a wall, there is a diverse range of possible styles for this fine texture detail, as shown in the middle and right buildings.

Figure 20 shows a qualitative comparison to Pix2Pix and BicycleGAN. We trained both end-to-end networks on our facade and roof datasets, transforming empty facade and roof masks to textures in one step. As discussed in Section 4, Pix2Pix, like most image-to-image GANs, has a low output diversity for a given input image. In our case, the input image, being a blank facade or roof rectangle, does not provide much variation either, resulting in strong mode collapse for Pix2Pix, shown by repeated patterns such as the 'glass front' pattern across multiple buildings. Like FrankenGAN, BicycleGAN takes a style vector as input, which we set to a similar multi-modal distribution as for our method. There is more diversity than for Pix2Pix, but we observe inconsistent scale and style across different buildings, or across different facades of the same building, less realistic window layouts, and less diversity than for our results. Splitting the texture generation task into multiple steps allows us to provide more training signals, such as an explicit ground truth for the window layout, without requiring a very large network, to regularize the intermediate results of the network, and to use intermediate results as label maps that can be used to generate geometric detail and assign materials. Additionally, giving the user more precise control over the style results in more consistent details across buildings or parts of buildings. This allows us to generate more diverse and realistic building details. In the next section we will quantify the comparison discussed here with a user study.
Table 2: Statistics for the London and Madrid scenes. We show the number of roofs, facades and windows in each block, as well as the time taken to generate the block.

London:
block  roofs  facades  windows  time (s)
1      29     145      1075     490
2      20     204      541      351
3      15     87       536      222
4      25     133      1040     400
5      47     243      2107     809
6      27     171      1622     597
7      10     65       1144     403
8      7      40       559      199
9      8      42       786      271
10     26     158      1566     577
total  214    1288     10976    4322

Madrid:
block  roofs  facades  windows  time (s)
1      22     146      773      315
2      20     110      559      239
3      17     103      471      219
4      12     67       399      166
5      7      66       230      102
6      22     116      758      291
7      25     139      571      292
8      22     125      611      255
9      35     240      1219     495
10     37     222      738      350
total  219    1334     6329     2722

6.4 User Study

We performed two user studies to quantify the following questions:

(A) How visible is the effect of style and scale control?

(B) How does the realism of FrankenGAN compare to Pix2Pix [16] and BicycleGAN [52]?

To investigate question A, we tested if users could reliably tell which of two generated facades was guided by a given reference style facade. The style and scale were randomized for the other facade. Question B was investigated by comparing facades generated by FrankenGAN and one of the other two methods side-by-side and asking users to compare the realism of the two facades. We performed both studies on Amazon Mechanical Turk (AMT). Screenshots of both studies are shown in Figure 18.

For study A, we created 172 triples of facades, each consisting of a reference facade, a facade with style and scale given by the reference facade, and a third facade with randomized style. Users were asked which of the two facades was more similar to the reference. Each triple was evaluated by an average of 11 users, and a total of 116 unique users participated in the study. Figure 18 (green vs. blue) shows the average probability of a user choosing either the style-guided facade or the facade with randomized style as more similar to the reference facade. Our results show that the style-guided facade is consistently chosen, implying good visibility of style and scale control.

The second study was performed in two parts: in the first part we compare to Pix2Pix, and in the second part to BicycleGAN. Users were shown one facade generated with FrankenGAN, and one facade with the other method, trained end-to-end to transform facade masks to facade textures. Each pair was evaluated by an average of 17.6 and 17.5 users, for Pix2Pix and BicycleGAN, respectively, and a total of 86 unique users participated in the study. Results are shown in Figure 18 (red vs. blue and yellow vs. blue). FrankenGAN is chosen as more realistic in 66.6% of the comparisons with Pix2Pix and 57.6% of the times for BicycleGAN. 95% confidence intervals are given by the small bars in Figure 18. Note that this advantage in realism is in addition to the advantage of having fine-grained style and scale control and obtaining label maps as intermediate results, which are necessary to generate 3D details.

There are also some limitations to our framework. Replicating the framework from scratch requires extensive training data. Similar to other GANs, our results look very good from a certain range of distances, but there is a limit to how far it is possible to zoom in

Figure 16: Different types of style distributions. From left to right, a constant style is used for all houses, a
style chosen randomly from the style prior, a style sampled independently for all building properties, and a
dependently sampled property style. Note that in the last two columns, there are several separate modes for
properties such as wall and building color, that are either mixed randomly for each building (third column)
or sampled dependently (last column).

There are also some limitations to our framework. Replicating the framework from scratch requires extensive training data. Similar to other GANs, our results look very good from a certain range of distances, but there is a limit to how far one can zoom in before noticing a lack of detail. Addressing this would require the synthesis of displacement maps and additional material layers in addition to textures. Data-driven texturing is inherently dependent on representative datasets. For example, our facade label dataset had missing windows due to occlusions by trees and other flora, so we occasionally see missing windows in our results. We chose not to fill in windows in the regularisation stage. The facade-texturing network then learned to associate these missing windows with flora, and dutifully adds green "ivy" to the building's walls (Figure 19, left). Finally, our system uses a shared style to synchronize the appearance of adjacent building facades; however, this compact representation does not contain sufficient detail to guarantee seamless textures at boundaries (Figure 19, right).

7 Conclusion

We presented FrankenGAN, a system for adding realistic geometric and texture details to large-scale coarse building mass models, guided by example images for style. Our system is based on a cascaded set of GANs that are individually controlled by generator vectors for style synchronization. We evaluated our system on a range of large-scale urban mass models and qualitatively evaluated the quality of the detailed models via a set of user studies.

The most direct way to improve the quality of the output generated by FrankenGAN is to retrain the GANs using richer and more accurately annotated facade and roof datasets, and to extend the system to generate street textures and furniture, as well as trees and other botanical elements. This is a natural improvement, as many research efforts are now focused on capturing higher-quality facade data with better annotations. A more challenging extension is to explore a GAN-based approach that directly generates textured 3D geometry; the challenge here is to find a way to compare rendered procedural models with photograph-based images in a manner that captures facade structure rather than just texture details. Finally, it would be interesting to combine our proposed approach with methods for authoring realistic building mass models, so that both the mass models and their detailing can be created directly from rough user input in the form of sketches [30] or high-level functional specifications [31].

Figure 17: Different super-resolution styles. The original low-resolution facade and roof textures are shown on the left; the middle and right buildings show two different super-resolution styles, resulting in different textures for the roof tiles or stone wall.

Figure 18: User studies comparing our method with and without style guidance (left) and against Pix2Pix and BicycleGAN (middle). The average probability of each method being judged more similar to the reference or more realistic by the users is shown on the right; the black bars are 95% confidence intervals.
References
[1] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN, 2017. arXiv:1701.07875.

[2] Bao, F., Schwarz, M., and Wonka, P. Procedural facade variations from a single layout. ACM Transactions on Graphics 32, 1 (2013), 8.

[3] Berger, M., Li, J., and Levine, J. A. A generative model for volume rendering. CoRR abs/1710.09545 (2017).

[4] Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary equilibrium generative adversarial networks. CoRR abs/1703.10717 (2017).

[5] Bokeloh, M., Wand, M., Seidel, H.-P., and Koltun, V. An algebraic model for parameterized shape editing. ACM TOG 31, 4 (2012), 78:1–78:10.

[6] Dang, M., Ceylan, D., Neubert, B., and Pauly, M. SAFE: Structure-aware facade editing. Computer Graphics Forum 33, 2 (2014), 83–93.

Figure 19: Cyan: the system is prone to generating green vegetation over missing windows. Red: texture discontinuities at boundaries.

Figure 20: Qualitative comparison to end-to-end GANs. The left column shows results of Pix2Pix trained to transform empty facade and roof masks into textures, and the middle column shows BicycleGAN trained similarly; the last column shows our method. Note how Pix2Pix suffers from mode collapse, while BicycleGAN produces less realistic window layouts and lacks scale and style consistency. FrankenGAN provides better style control, and splitting the problem into multiple steps opens up several avenues for increasing the realism of our models.

[7] Denton, E. L., Chintala, S., Szlam, A., and Fergus, R. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1486–1494.

[8] Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. In ICLR (2016).

[9] Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. Adversarially learned inference. In ICLR (2016).

[10] Escorcia, V., Heilbron, F. C., Niebles, J. C., and Ghanem, B. DAPs: Deep action proposals for action understanding. In European Conference on Computer Vision (2016), Springer, pp. 768–784.

[11] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.

[12] Gurumurthy, S., Kiran Sarvadevabhatla, R., and Venkatesh Babu, R. DeLiGAN: Generative adversarial networks for diverse and limited data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).

[13] Hartmann, S., Weinmann, M., Wessel, R., and Klein, R. StreetGAN: Towards road network synthesis with generative adversarial networks. In International Conference on Computer Graphics, Visualization and Computer Vision (2017).

[14] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

[15] Ilčík, M., Musialski, P., Auzinger, T., and Wimmer, M. Layer-based procedural design of façades. Computer Graphics Forum 34, 2 (2015), 205–216.

[16] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. arXiv (2016).

[17] Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In ECCV (2016).

[18] Wang, K., Savva, M., Chang, A. X., and Ritchie, D. Deep convolutional priors for indoor scene synthesis. ACM TOG (2018).

[19] Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In ICLR (2018).

[20] Kelly, T., Femiani, J., Wonka, P., and Mitra, N. J. BigSUR: Large-scale structured urban reconstruction. ACM SIGGRAPH Asia 36, 6 (2017).

[21] Kingma, D., and Welling, M. Auto-encoding variational Bayes. In ICLR (2014).

[22] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014).

[23] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

[24] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. In ICML (2016), vol. 48, pp. 1558–1566.

[25] Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., and Shi, W. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).

[26] Lin, J., Cohen-Or, D., Zhang, H., Liang, C., Sharf, A., Deussen, O., and Chen, B. Structure-preserving retargeting of irregular 3D architecture. ACM TOG 30, 6 (2011), 183:1–183:10.

[27] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3431–3440.

[28] Martinovic, A., and Van Gool, L. Bayesian grammar learning for inverse procedural modeling. In Proceedings of CVPR 2013 (2013), pp. 201–208.

[29] Mueller, P., Wonka, P., Haegler, S., Ulmer, A., and Van Gool, L. Procedural modeling of buildings. ACM TOG 25, 3 (2006), 614–623.

[30] Nishida, G., Garcia-Dorado, I., Aliaga, D. G., Benes, B., and Bousseau, A. Interactive sketching of urban procedural models. ACM TOG (2016).

[31] Peng, C.-H., Yang, Y.-L., Bao, F., Fink, D., Yan, D.-M., Wonka, P., and Mitra, N. J. Computational network design from functional specifications. ACM Transactions on Graphics 35, 4 (2016), 131.

[32] Ritchie, D., Mildenhall, B., Goodman, N. D., and Hanrahan, P. Controlling procedural modeling programs with stochastically-ordered sequential Monte Carlo. ACM TOG 34, 4 (2015), 105:1–105:11.

[33] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 2234–2242.

[34] Schwarz, M., and Müller, P. Advanced procedural modeling of architecture. ACM TOG 34, 4 (2015), 107:1–107:12.

[35] Springenberg, J. T. Unsupervised and semi-supervised learning with categorical generative adversarial network. In ICLR (2016).

[36] Stolcke, A., and Omohundro, S. Inducing probabilistic grammars by Bayesian model merging. In Proceedings of ICGI-94 (1994), pp. 106–118.

[37] Talton, J. O., Lou, Y., Lesser, S., Duke, J., Měch, R., and Koltun, V. Metropolis procedural modeling. ACM TOG 30, 2 (2011), 11:1–11:14.

[38] Talton, J. O., Yang, L., Kumar, R., Lim, M., Goodman, N. D., and Měch, R. Learning design patterns with Bayesian grammar induction. In Proceedings of UIST '12 (2012), pp. 63–74.

[39] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4489–4497.

[40] Tran, L., Yin, X., and Liu, X. Disentangled representation learning GAN for pose-invariant face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).

[41] Tyleček, R., and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR (Saarbrücken, Germany, 2013).

[42] Vanegas, C. A., Garcia-Dorado, I., Aliaga, D. G., Benes, B., and Waddell, P. Inverse design of urban procedural models. ACM TOG 31, 6 (2012), 168:1–168:11.

[43] Veeravasarapu, V. S. R., Rothkopf, C. A., and Ramesh, V. Adversarially tuned scene generation.

[44] Wan, C., Probst, T., Van Gool, L., and Yao, A. Crossing nets: Combining GANs and VAEs with a shared latent space for hand pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).

[45] Warde-Farley, D., and Bengio, Y. Improving generative adversarial networks with denoising feature matching. In CVPR (2017).

[46] Wonka, P., Wimmer, M., Sillion, F., and Ribarsky, W. Instant architecture. ACM Transactions on Graphics 22, 3 (2003), 669–677.

[47] Yeh, R., Chen, C., Lim, T., Hasegawa-Johnson, M., and Do, M. N. Semantic image inpainting with perceptual and contextual losses. CoRR abs/1607.07539 (2016).

[48] Yeh, Y.-T., Breeden, K., Yang, L., Fisher, M., and Hanrahan, P. Synthesis of tiled patterns using factor graphs. ACM TOG 32, 1 (2013), 3:1–3:13.

[49] Zhao, J., Mathieu, M., and LeCun, Y. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016).

[50] Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV) (2016).

[51] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV (2017).

[52] Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A., Wang, O., and Shechtman, E. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems 30 (2017).
