X, DECEMBER 2018 1
arXiv:1901.00596v1 [cs.LG] 3 Jan 2019
Abstract—Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video
processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the
Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and
are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has
imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning
approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in
data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different
categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these
learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal
networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes
and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this
fast-growing field.
Index Terms—Deep Learning, graph neural networks, graph convolutional networks, graph representation learning, graph
autoencoder, network embedding
1 INTRODUCTION
meaningful features that are shared with the entire datasets
for various image analysis tasks.
Fig. 6: Differences between graph convolutional networks and graph attention networks.
(a) Graph Convolution Networks [14] explicitly assign a non-parametric weight
$a_{ij} = \frac{1}{\sqrt{\deg(v_i)\deg(v_j)}}$ to the neighbor $v_j$ of $v_i$ during the aggregation process.
(b) Graph Attention Networks [15] implicitly capture the weight $a_{ij}$ via an end-to-end neural network
architecture, so that more important nodes receive larger weights.

Graph Generative Networks aim to generate plausible structures from data. Generating graphs given a graph
empirical distribution is fundamentally challenging, mainly because graphs are complex data structures. To
address this problem, researchers have explored factoring the generation process into alternately forming
nodes and edges [64], [65], and employing generative adversarial training [66], [67]. One promising
application domain of graph generative networks is chemical compound synthesis. In a chemical graph, atoms
are treated as nodes and chemical bonds are treated as edges. The task is to discover new synthesizable
molecules which possess certain chemical and physical properties.

Graph Spatial-temporal Networks aim to learn unseen patterns from spatial-temporal graphs, which are
increasingly important in many applications such as traffic forecasting and human activity prediction. For
instance, the underlying road traffic network is a natural graph where each key location is a node whose
traffic data is continuously monitored. By developing effective graph spatial-temporal network models, we
can accurately predict the traffic status over the whole traffic system [70], [71]. The key idea of graph
spatial-temporal networks is to consider spatial dependency and temporal dependency at the same time. Many
current approaches apply GCNs to capture the spatial dependency, together with some RNN [70] or CNN [71] to
model the temporal dependency.

3.2 Frameworks
Graph neural networks, graph convolution networks (GCNs) in particular, try to replicate the success of CNN
in graph data by defining graph convolutions via graph spectral theory or spatial locality. With graph
structure and node content information as inputs, the outputs of GCN can focus on different graph analytics
tasks with one of the following mechanisms:
• Node-level outputs relate to node regression and classification tasks. As a graph convolution module
  directly gives nodes' latent representations, a multi-layer perceptron or softmax layer is used as the
  final layer of GCN. We review graph convolution modules in Section 4.1 and Section 4.2.
• Edge-level outputs relate to the edge classification and link prediction tasks. To predict the
  label/connection strength of an edge, an additional function takes two nodes' latent representations
  from the graph convolution module as inputs.
• Graph-level outputs relate to the graph classification task. To obtain a compact representation on the
  graph level, a pooling module is used to coarsen a graph into sub-graphs or to sum/average over the node
  representations. We review graph pooling modules in Section 4.3.
In Table 3, we list the details of the inputs and outputs of the main GCN methods. In particular, we
summarize output mechanisms in between each GCN layer and in the final layer of each method. The output
mechanisms may involve several pooling operations, which are discussed in Section 4.3.

End-to-end Training Frameworks. Graph convolutional networks can be trained in a (semi-)supervised or purely
unsupervised way within an end-to-end learning framework, depending on the learning tasks and label
information available at hand.
• Semi-supervised learning for node-level classification. Given a single network with some nodes labeled
  and others remaining unlabeled, graph convolutional networks can learn a robust model that effectively
  identifies the class labels for the unlabeled nodes [14]. To this end, an end-to-end framework can be
  built by stacking a couple of graph convolutional layers followed by a softmax layer for multi-class
  classification.
• Supervised learning for graph-level classification. Given a graph dataset, graph-level classification
  aims to predict the class label(s) for an entire graph [55],
  [56], [74], [75]. The end-to-end learning for this task can be done with a framework which combines both
  graph convolutional layers and the pooling procedure [55], [56]. Specifically, by applying graph
  convolutional layers, we obtain a representation with a fixed number of dimensions for each node in each
  single graph. Then, we can get the representation of an entire graph through pooling, which summarizes
  the representation vectors of all nodes in a graph. Finally, by applying the MLP layers and a softmax
  layer which are commonly used in existing deep learning frameworks, we can build an end-to-end framework
  for graph classification. An example is given in Fig 5a.
• Unsupervised learning for graph embedding. When no class labels are available in graphs, we can learn
  the graph embedding in a purely unsupervised way in an end-to-end framework. These algorithms exploit
  the edge-level information in two ways. One simple way is to adapt an autoencoder framework where the
  encoder employs graph convolutional layers to embed the graph into the latent representation, upon which
  a decoder is used to reconstruct the graph structure [59], [61]. Another way is to utilize the negative
  sampling approach, which samples a portion of node pairs as negative pairs while existing node pairs
  with links in the graphs are positive pairs. Then a logistic regression layer is applied after the
  convolutional layers for end-to-end learning [24].

4 GRAPH CONVOLUTION NETWORKS
In this section, we review graph convolution networks (GCNs), the foundation of many complex graph neural
network models. GCN approaches fall into two categories, spectral-based and spatial-based. Spectral-based
approaches define graph convolutions by introducing filters from the perspective of graph signal processing
[76], where the graph convolution operation is interpreted as removing noise from graph signals.
Spatial-based approaches formulate graph convolutions as aggregating feature information from neighbors.
While GCNs operate on the node level, graph pooling modules can be interleaved with the GCN layer to coarsen
graphs into high-level sub-structures. As shown in Fig 5a, such an architecture design can be used to
extract graph-level representations and to perform graph classification tasks. In the following, we
introduce spectral-based GCNs, spatial-based GCNs, and graph pooling modules separately.

4.1 Spectral-based Graph Convolutional Networks
Spectral-based methods have a solid foundation in graph signal processing [76]. We first give some basic
background on graph signal processing, after which we review the representative research on the
spectral-based GCNs.

4.1.1 Backgrounds
A robust mathematical representation of a graph is the normalized graph Laplacian matrix, defined as
$L = I_n - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ is a diagonal matrix of node degrees,
$D_{ii} = \sum_j A_{i,j}$. The normalized graph Laplacian matrix is real, symmetric, and positive
semidefinite. With this property, the normalized Laplacian matrix can be factored as $L = U \Lambda U^T$,
where $U = [u_0, u_1, \cdots, u_{n-1}] \in \mathbb{R}^{N \times N}$ is the matrix of eigenvectors ordered by
eigenvalues and $\Lambda$ is the diagonal matrix of eigenvalues, $\Lambda_{ii} = \lambda_i$. The eigenvectors
of the normalized Laplacian matrix form an orthonormal basis, that is, $U^T U = I$. In graph signal
processing, a graph signal $x \in \mathbb{R}^N$ is a feature vector of the nodes of the graph, where $x_i$ is
the value of the $i$th node. The graph Fourier transform of a signal $x$ is defined as $F(x) = U^T x$ and the
inverse graph Fourier transform is defined as $F^{-1}(\hat{x}) = U \hat{x}$, where $\hat{x}$ represents the
signal resulting from the graph Fourier transform. From its definition, the graph Fourier transform projects
the input graph signal onto the orthonormal space whose basis is formed by the eigenvectors of the
normalized graph Laplacian. Elements of the transformed signal $\hat{x}$ are the coordinates of the graph
signal in the new space, so that the input signal can be represented as $x = \sum_i \hat{x}_i u_i$, which is
exactly the inverse graph Fourier transform. Now the graph convolution of the input signal $x$ with a filter
$g \in \mathbb{R}^N$ is defined as

$$x *_G g = F^{-1}(F(x) \odot F(g)) = U(U^T x \odot U^T g) \qquad (1)$$

where $\odot$ denotes the Hadamard product. If we denote a filter as $g_\theta = \mathrm{diag}(U^T g)$, then
the graph convolution is simplified as

$$x *_G g_\theta = U g_\theta U^T x \qquad (2)$$

Spectral-based graph convolution networks all follow this definition. The key difference lies in the choice
of the filter $g_\theta$.

4.1.2 Methods of Spectral based GCNs
Spectral CNN. Bruna et al. [20] propose the first spectral convolution neural network (Spectral CNN).
Assuming the filter $g_\theta = \Theta^k_{i,j}$ is a set of learnable parameters and considering
multi-dimensional graph signals, they define a graph convolution layer as

$$X^{k+1}_{:,j} = \sigma\Big(\sum_{i=1}^{f_{k-1}} U \Theta^k_{i,j} U^T X^k_{:,i}\Big) \quad (j = 1, 2, \cdots, f_k) \qquad (3)$$

where $X^k \in \mathbb{R}^{N \times f_{k-1}}$ is the input graph signal, $N$ is the number of nodes,
$f_{k-1}$ is the number of input channels and $f_k$ is the number of output channels, $\Theta^k_{i,j}$ is a
diagonal matrix filled with learnable parameters, and $\sigma$ is a non-linear transformation.

Chebyshev Spectral CNN (ChebNet). Defferrard et al. [12] propose ChebNet, which defines a filter as
Chebyshev polynomials of the diagonal matrix of eigenvalues, i.e.,
$g_\theta = \sum_{i=0}^{K} \theta_i T_i(\tilde{\Lambda})$, where
$\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_N$. The Chebyshev polynomials are defined recursively by
$T_i(x) = 2x T_{i-1}(x) - T_{i-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. As a
result, the convolution of a graph signal $x$ with the defined filter $g_\theta$ is

$$x *_G g_\theta = U\Big(\sum_{i=0}^{K} \theta_i T_i(\tilde{\Lambda})\Big) U^T x = \sum_{i=0}^{K} \theta_i T_i(\tilde{L}) x \qquad (4)$$

where $\tilde{L} = 2L/\lambda_{max} - I_N$.
From Equation 4, ChebNet implicitly avoids the computation of the graph Fourier basis, reducing the
computation complexity from $O(N^3)$ to $O(KM)$. Since $T_i(\tilde{L})$ is a polynomial of $\tilde{L}$ of
$i$th order, $T_i(\tilde{L})x$ operates locally on each node. Therefore, the filters of ChebNet are localized
in space.

First order of ChebNet (1stChebNet; see footnote 2). Kipf et al. [14] introduce a first-order approximation
of ChebNet. Assuming $K = 1$ and $\lambda_{max} = 2$, Equation 4 is simplified as

$$x *_G g_\theta = \theta_0 x - \theta_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x \qquad (5)$$

To restrain the number of parameters and avoid over-fitting, 1stChebNet further assumes
$\theta = \theta_0 = -\theta_1$, leading to the following definition of graph convolution,

$$x *_G g_\theta = \theta (I_n + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}) x \qquad (6)$$

In order to incorporate multi-dimensional graph input signals, 1stChebNet proposes a graph convolution layer
which modifies Equation 6,

$$X^{k+1} = \tilde{A} X^k \Theta \qquad (7)$$

where $\tilde{A} = I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.
The graph convolution defined by 1stChebNet is localized in space. It bridges the gap between spectral-based
methods and spatial-based methods. Each row of the output represents the latent representation of each node,
obtained by a linear transformation of aggregated information from the node itself and its neighboring nodes
with weights specified by the corresponding row of $\tilde{A}$. However, the main drawback of 1stChebNet is
that the computation cost increases exponentially with the number of 1stChebNet layers during batch
training. Each node in the last layer has to expand its neighborhood recursively across previous layers.
Chen et al. [45] assume the rescaled adjacency matrix $\tilde{A}$ in Equation 7 comes from a sampling
distribution. Under this assumption, Monte Carlo and variance reduction techniques are used to facilitate
the training process. Chen et al. [46] reduce the receptive field size of the graph convolution to an
arbitrarily small scale by sampling neighborhoods and using historical hidden representations. Huang et al.
[54] propose an adaptive layer-wise sampling approach to accelerate the training of 1stChebNet, where
sampling for the lower layer is conditioned on the top one. This method is also applicable for explicit
variance reduction.

Adaptive Graph Convolution Network (AGCN). To explore hidden structural relations unspecified by the graph
Laplacian matrix, Li et al. [22] propose the adaptive graph convolution network (AGCN). AGCN augments a
graph with a so-called residual graph, which is constructed by computing a pairwise distance of nodes.
Despite being able to capture complementary relational information, AGCN incurs expensive $O(N^2)$
computation.

4.1.3 Summary
Spectral CNN [20] relies on the eigen-decomposition of the Laplacian matrix. This has three consequences.
First, any perturbation to a graph results in a change of eigenbasis. Second, the learned filters are domain
dependent, meaning they cannot be applied to a graph with a different structure. Third, eigen-decomposition
requires $O(N^3)$ computation and $O(N^2)$ memory. Filters defined by ChebNet [12] and 1stChebNet [14] are
localized in space. The learned weights can be shared across different locations in a graph. However, a
common drawback of spectral methods is that they need to load the whole graph into memory to perform graph
convolution, which is not efficient in handling big graphs.

4.2 Spatial-based Graph Convolutional Networks
Imitating the convolution operation of a conventional convolution neural network on an image, spatial-based meth-

2. Due to its impressive performance in many node classification tasks, 1stChebNet is simply termed GCN and
is considered a strong baseline in the research community.
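
The claim around Equation 4, that the Chebyshev filter can be applied without the graph Fourier basis, can be checked numerically. The sketch below is illustrative only and is not the ChebNet authors' implementation: it uses NumPy, a hypothetical 4-node path graph, arbitrary coefficients `theta`, and assumes the sum over Chebyshev terms starts at order 0 (as the θ0 and θ1 terms of Equation 5 suggest). The helper names `normalized_laplacian`, `chebyshev_terms`, `cheb_filter_spectral`, and `cheb_filter_recurrence` are made up for this example.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix without isolated nodes."""
    d_inv_sqrt = A.sum(axis=1) ** -0.5
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_terms(apply_op, t0, order):
    """Yield T_0, ..., T_order applied to t0 via T_i = 2 * op(T_{i-1}) - T_{i-2}."""
    t_prev, t_curr = t0, apply_op(t0)
    yield t_prev
    if order >= 1:
        yield t_curr
    for _ in range(2, order + 1):
        t_prev, t_curr = t_curr, 2.0 * apply_op(t_curr) - t_prev
        yield t_curr

def cheb_filter_spectral(L, x, theta):
    """First form of Equation 4: explicit eigendecomposition, O(N^3)."""
    lam, U = np.linalg.eigh(L)
    lam_max = lam.max()
    lam_t = 2.0 * lam / lam_max - 1.0            # rescaled eigenvalues
    terms = chebyshev_terms(lambda v: lam_t * v, np.ones_like(lam_t), len(theta) - 1)
    g = sum(th * t for th, t in zip(theta, terms))   # diagonal of the filter
    return U @ (g * (U.T @ x)), lam_max

def cheb_filter_recurrence(L, x, theta, lam_max):
    """Second form of Equation 4: only matrix-vector products with the rescaled Laplacian."""
    L_t = 2.0 * L / lam_max - np.eye(L.shape[0])
    terms = chebyshev_terms(lambda v: L_t @ v, x, len(theta) - 1)
    return sum(th * t for th, t in zip(theta, terms))

# Hypothetical example: path graph 0-1-2-3 with an arbitrary signal and coefficients.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
x = np.array([1.0, -2.0, 0.5, 3.0])
theta = np.array([0.5, -0.3, 0.2])

y_spec, lam_max = cheb_filter_spectral(L, x, theta)
y_rec = cheb_filter_recurrence(L, x, theta, lam_max)

# First-order special case (Equations 5 and 6): K = 1, lambda_max = 2, theta_0 = -theta_1.
S = np.eye(4) - L                                 # D^{-1/2} A D^{-1/2}
y_first = cheb_filter_recurrence(L, x, np.array([0.7, -0.7]), 2.0)
y_eq6 = 0.7 * (x + S @ x)
```

Both routes give the same filtered signal up to floating-point error, and the first-order case reproduces Equation 6. In the recurrence route `lam_max` is passed in by the caller; it could also be fixed to 2, as 1stChebNet assumes.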
LGCN proposes a subgraph training strategy, which puts the sampled subgraphs into a mini-batch.

Mixture Model Network (MoNet) [25] unifies standard CNN with convolutional architectures on non-Euclidean
domains. While several spatial-based approaches ignore the relative positions between a node and its
neighbors when aggregating neighborhood feature information, MoNet introduces pseudo-coordinates and weight
functions to let the weight of a node's neighbor be determined by the relative position (pseudo-coordinates)
between the node and its neighbor. Under such a framework, several approaches on manifolds such as Geodesic
CNN (GCNN) [84], Anisotropic CNN (ACNN) [85], Spline CNN [86], and on graphs such as GCN [14], DCNN [44] can
be generalized as special instances of MoNet. However, these approaches under the framework of MoNet have
fixed weight functions. MoNet instead proposes a Gaussian kernel with learnable parameters to freely adjust
the weight function.

4.2.4 Summary
Spatial-based methods define graph convolutions via aggregating feature information from neighbors.
According to different ways of stacking graph convolution layers, spatial-based methods are split into two
groups, recurrent-based and composition-based. While recurrent-based approaches try to obtain nodes' steady
states, composition-based approaches try to incorporate higher orders of neighborhood information. In each
layer, both groups have to update hidden states over all nodes during training. However, this is not
efficient, as all the intermediate states have to be stored in memory. To address this issue, several
training strategies have been proposed, including sub-graph training for composition-based approaches such
as GraphSage [24] and stochastically asynchronous training for recurrent-based approaches such as SSE [19].

4.3 Graph Pooling Modules
When generalizing convolutional neural networks to graph-structured data, another key component, the graph
pooling module, is also of vital importance, particularly for graph-level classification tasks [55], [56],
[87]. According to Xu et al. [88], pooling-assisted GCNs are as powerful as the Weisfeiler-Lehman test [82]
in distinguishing graph structures. Similar to the original pooling layer which comes with CNNs, the graph
pooling module can easily reduce the variance and computation complexity by down-sampling from original
feature data. Mean/max/sum pooling is the most primitive and most effective way of implementing this, since
calculating the mean/max/sum value in the pooling window is rapid.

$$h_G = \mathrm{mean/max/sum}(h_1^T, h_2^T, \ldots, h_n^T) \qquad (15)$$

Henaff et al. [21] prove that performing a simple max/mean pooling at the beginning of the network is
especially important to reduce the dimensionality in the graph domain and mitigate the cost of the expensive
graph Fourier transform operation.
Defferrard et al. optimize max/min pooling and devise an efficient pooling strategy in their approach
ChebNet [12]. Input graphs are first processed by the coarsening process described in Fig 5a. After
coarsening, the vertices of the input graph and its coarsened versions are reformed in a balanced binary
tree. Arbitrarily ordering the nodes at the coarsest level and then propagating this ordering to the lower
level in the balanced binary tree finally produces a regular ordering in the finest level. Pooling such a
rearranged 1D signal is much more efficient than the original.
Zhang et al. also propose a framework DGCNN [55] with a similar pooling strategy named SortPooling, which
performs pooling by rearranging vertices to a meaningful order. Unlike ChebNet [12], DGCNN sorts vertices
according to their structural roles within the graph. The graph's unordered vertex features from spatial
graph convolutions are treated as continuous WL colors [82], and they are then used to sort vertices. In
addition to sorting the vertex features, it unifies the graph size to $k$ by truncating/extending the
graph's feature tensor. The last $n - k$ rows are deleted if $n > k$, otherwise $k - n$ zero rows are added.
This method enhances the pooling network to improve the performance of GCNs by solving one challenge
underlying graph-structured tasks, which is referred to as permutation invariance. Verma and Zhang propose
graph capsule networks [89], which further explore permutation invariance for graph data.
Recently a pooling module, DIFFPOOL [56], is proposed which can generate hierarchical representations of
graphs and can be combined with not only CNNs, but also various graph neural network architectures in an
end-to-end fashion. Compared to all previous coarsening methods, DIFFPOOL does not simply cluster the nodes
in one graph, but provides a general solution to hierarchically pool nodes across a broad set of input
graphs. This is done by learning a cluster assignment matrix $S$ at layer $l$, referred to as
$S^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$. Two separate GNNs, both taking the input cluster node features
$X^{(l)}$ and the coarsened adjacency matrix $A^{(l)}$, are used to generate the assignment matrix $S^{(l)}$
and embedding matrices $Z^{(l)}$ as follows:

$$Z^{(l)} = \mathrm{GNN}_{l,embed}(A^{(l)}, X^{(l)}) \qquad (16)$$

$$S^{(l)} = \mathrm{softmax}(\mathrm{GNN}_{l,pool}(A^{(l)}, X^{(l)})) \qquad (17)$$

Equations 16 and 17 can be implemented with any standard GNN module, which processes the same input data but
has distinct parametrizations since the roles they play in the framework are different. $\mathrm{GNN}_{l,embed}$
produces new embeddings while $\mathrm{GNN}_{l,pool}$ generates a probabilistic assignment of the input nodes
to $n_{l+1}$ clusters. The softmax function is applied in a row-wise fashion in Equation 17. As a result,
each row of $S^{(l)}$ corresponds to one of the $n_l$ nodes (or clusters) at layer $l$, and each column of
$S^{(l)}$ corresponds to one of the $n_{l+1}$ clusters at the next layer. Once we have $Z^{(l)}$ and
$S^{(l)}$, the pooling operation comes as follows:

$$X^{(l+1)} = {S^{(l)}}^T Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d} \qquad (18)$$

$$A^{(l+1)} = {S^{(l)}}^T A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}} \qquad (19)$$

Equation 18 takes the cluster embeddings $Z^{(l)}$ and then aggregates these embeddings according to the
cluster assignments $S^{(l)}$ to calculate the embedding for each of the $n_{l+1}$
clusters. The initial cluster embedding would be the node representation. Similarly, Equation 19 takes the
adjacency matrix $A^{(l)}$ as input and generates a coarsened adjacency matrix denoting the connectivity
strength between each pair of the clusters.
Overall, DIFFPOOL [56] redefines the graph pooling module by using two GNNs to cluster the nodes. Any
standard GCN module can be combined with DIFFPOOL, not only to achieve enhanced performance, but also to
speed up the convolution operation.

4.4 Comparison Between Spectral and Spatial Models
As the earliest convolutional networks for graph data, spectral-based models have achieved impressive
results in many graph-related analytics tasks. These models are appealing in that they have a theoretical
foundation in graph signal processing. By designing new graph signal filters [23], we can theoretically
design new graph convolution networks. However, there are several drawbacks to spectral-based models. We
illustrate this in the following from three aspects: efficiency, generality, and flexibility.

In terms of efficiency, the computational cost of spectral-based models increases dramatically with the
graph size because they either need to perform eigenvector computation [20] or handle the whole graph at the
same time, which makes them difficult to parallelize or scale to large graphs. Spatial-based models have the
potential to handle large graphs as they directly perform the convolution in the graph domain via
aggregating the neighbor nodes. The computation can be performed in a batch of nodes instead of the whole
graph. When the number of neighbor nodes increases, sampling techniques [24], [27] can be developed to
improve efficiency.
In terms of generality, spectral-based models assume a fixed graph, making them generalize poorly to new or
different graphs. Spatial-based models, on the other hand, perform graph convolution locally on each node,
where weights can be easily shared across different locations and structures.
In terms of flexibility, spectral-based models are limited to work on undirected graphs. There is no clear
definition of the Laplacian matrix on directed graphs, so the only way to apply spectral-based models to
directed graphs is to convert directed graphs to undirected graphs. Spatial-based models are more flexible
in dealing with multi-source inputs such as edge features and edge directions, because these inputs can be
incorporated into the aggregation function (e.g. [13], [17], [51], [52], [53]).
As a result, spatial models have attracted increasing attention in recent years [25].

5 BEYOND GRAPH CONVOLUTIONAL NETWORKS
In this section, we review other graph neural networks including graph attention neural networks, graph
auto-encoders, graph generative networks, and graph spatial-temporal networks. In Table 4, we provide a
summary of main approaches under each category.

5.1 Graph Attention Networks
Attention mechanisms have almost become a standard in sequence-based tasks [90]. The virtue of attention
mechanisms is their ability to focus on the most important parts of an object. This property has proven
useful for many tasks, such as machine translation and natural language understanding. Thanks to the
increased model capacity of attention mechanisms, graph neural networks also benefit from this by using
attention during aggregation, integrating outputs from multiple models, and generating importance-oriented
random walks. In this section, we will discuss how attention mechanisms are being used in graph-structured
data.

5.1.1 Methods of Graph Attention Networks
Graph Attention Network (GAT) [15] is a spatial-based graph convolution network where the attention
mechanism is involved in determining the weights of a node's neighbors when aggregating feature information.
The graph convolution operation of GAT is defined as,

$$h_i^t = \sigma\Big(\sum_{j \in N_i} \alpha(h_i^{t-1}, h_j^{t-1}) W^{t-1} h_j^{t-1}\Big) \qquad (20)$$

where $\alpha(\cdot)$ is an attention function which adaptively controls the contribution of a neighbor $j$
to the node $i$. In order to learn attention weights in different subspaces, GAT uses multi-head attention.

$$h_i^t = \big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_k(h_i^{t-1}, h_j^{t-1}) W_k^{t-1} h_j^{t-1}\Big) \qquad (21)$$

where $\Vert$ denotes concatenation.

Gated Attention Network (GAAN) [28] also employs the multi-head attention mechanism in updating a node's
hidden state. However, rather than assigning an equal weight to each head, GAAN introduces a self-attention
mechanism which computes a different weight for each head. The updating rule is defined as,

$$h_i^t = \phi_o\Big(x_i \oplus \big\Vert_{k=1}^{K} g_i^k \sum_{j \in N_i} \alpha_k(h_i^{t-1}, h_j^{t-1}) \phi_v(h_j^{t-1})\Big) \qquad (22)$$

where $\phi_o(\cdot)$ and $\phi_v(\cdot)$ denote feedforward neural networks and $g_i^k$ is the attention
weight of the $k$th attention head.

Graph Attention Model (GAM) [57] proposes a recurrent neural network model to solve graph classification
problems, which processes informative parts of a graph by adaptively visiting a sequence of important nodes.
The GAM model is defined as

$$h_t = f_h(f_s(r_{t-1}, v_{t-1}, g; \theta_s), h_{t-1}; \theta_h) \qquad (23)$$

where $f_h(\cdot)$ is an LSTM network and $f_s$ is the step network, which takes a step from the current node
$v_{t-1}$ to one of its neighbors $c_t$, prioritizing those whose type has a higher rank in $r_{t-1}$, which
is generated by a policy network:

$$r_t = f_r(h_t; \theta_r) \qquad (24)$$

where $r_t$ is a stochastic rank vector which indicates which node is more important and thus should be
further explored
TABLE 4: Summary of Alternative Graph Neural Networks (Graph Convolutional Networks Excluded). We summarize
methods based on their inputs, outputs, targeted tasks, and whether a method is GCN-based. Inputs indicate
whether a method suits attributed graphs (A), directed graphs (D), and spatial-temporal graphs (S).

| Category                        | Approach                    | A | D | S | Outputs                                | Tasks                           | GCN Based |
|---------------------------------|-----------------------------|---|---|---|----------------------------------------|---------------------------------|-----------|
| Graph Attention Networks        | GAT (2017) [15]             | ✓ | ✓ | ✗ | node labels                            | node classification             | ✓         |
|                                 | GAAN (2018) [28]            | ✓ | ✓ | ✗ | node labels                            | node classification             | ✓         |
|                                 | GAM (2018) [57]             | ✓ | ✓ | ✗ | graph labels                           | graph classification            | ✗         |
|                                 | Attention Walks (2018) [58] | ✗ | ✗ | ✗ | node embedding                         | network embedding               | ✗         |
| Graph Auto-encoder              | GAE (2016) [59]             | ✓ | ✗ | ✗ | reconstructed adjacency matrix         | network embedding               | ✓         |
|                                 | ARGA (2018) [61]            | ✓ | ✗ | ✗ | reconstructed adjacency matrix         | network embedding               | ✓         |
|                                 | NetRA (2018) [62]           | ✗ | ✗ | ✗ | reconstructed sequences of random walks | network embedding              | ✗         |
|                                 | DNGR (2016) [41]            | ✗ | ✗ | ✗ | reconstructed PPMI matrix              | network embedding               | ✗         |
|                                 | SDNE (2016) [42]            | ✗ | ✓ | ✗ | reconstructed adjacency matrix         | network embedding               | ✗         |
|                                 | DRNE (2018) [63]            | ✓ | ✗ | ✗ | reconstructed node embedding           | network embedding               | ✗         |
| Graph Generative Networks       | MolGAN (2018) [66]          | ✓ | ✗ | ✗ | new graphs                             | graph generation                | ✓         |
|                                 | DGMG (2018) [65]            | ✗ | ✗ | ✗ | new graphs                             | graph generation                | ✓         |
|                                 | GraphRNN (2018) [64]        | ✗ | ✗ | ✗ | new graphs                             | graph generation                | ✗         |
|                                 | NetGAN (2018) [67]          | ✗ | ✗ | ✗ | new graphs                             | graph generation                | ✗         |
| Graph Spatial-Temporal Networks | DCRNN (2018) [70]           | ✗ | ✗ | ✓ | node value vectors                     | spatial-temporal forecasting    | ✓         |
|                                 | CNN-GCN (2017) [71]         | ✗ | ✗ | ✓ | node value vectors                     | spatial-temporal forecasting    | ✓         |
|                                 | ST-GCN (2018) [72]          | ✗ | ✗ | ✓ | graph labels                           | spatial-temporal classification | ✓         |
|                                 | Structural RNN (2016) [73]  | ✗ | ✗ | ✓ | node labels/value vectors              | spatial-temporal forecasting    | ✗         |
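
The contrast drawn in Fig. 6 and Equation 20, fixed degree-based weights versus learned attention weights, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the method of any cited paper: the survey leaves the attention function α(·) generic, so the scoring function used here (a LeakyReLU over a learned vector applied to the concatenated transformed features, softmax-normalized over each neighborhood) is an assumed choice, σ is taken to be ReLU, self-loops are added to the hypothetical adjacency matrix so every node has at least one neighbor, and all parameters are random rather than trained.

```python
import numpy as np

def gcn_weights(A):
    """Fixed weights a_ij = 1 / sqrt(deg(v_i) deg(v_j)) on existing edges (Fig. 6a)."""
    deg = A.sum(axis=1)
    return A / np.sqrt(np.outer(deg, deg))

def attention_weights(A, H, W, a, slope=0.2):
    """Attention weights alpha(h_i, h_j), softmax-normalized over each node's neighbors.

    Assumed scoring function: LeakyReLU(a^T [W h_i || W h_j]).
    """
    Z = H @ W.T                                   # row i holds W h_i, shape (N, d_out)
    d_out = Z.shape[1]
    scores = (Z @ a[:d_out])[:, None] + (Z @ a[d_out:])[None, :]
    scores = np.where(scores > 0, scores, slope * scores)    # LeakyReLU
    scores = np.where(A > 0, scores, -np.inf)                 # keep only neighbors j in N_i
    scores = scores - scores.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def aggregate(weights, H, W):
    """h_i^t = sigma(sum_j weights_ij W h_j^{t-1}) with sigma = ReLU (Equation 20)."""
    return np.maximum(weights @ (H @ W.T), 0.0)

# Hypothetical 4-node graph with self-loops, random features and parameters.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))      # node features h^{t-1}, one row per node
W = rng.normal(size=(2, 3))      # shared linear transform W^{t-1}
a = rng.normal(size=4)           # attention vector for the assumed scoring function

fixed = gcn_weights(A)
alpha = attention_weights(A, H, W, a)
H_gcn = aggregate(fixed, H, W)
H_att = aggregate(alpha, H, W)
```

The fixed weights depend only on node degrees and are symmetric, whereas each row of `alpha` is a probability distribution over that node's neighbors that changes with the features. Multi-head attention (Equation 21) would repeat `attention_weights` and `aggregate` with separate `W` and `a` per head and concatenate the results.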
with high priority, and $h_t$ contains historical information that the agent has aggregated from exploration
of the graph and is used to make a prediction for the graph label.

Attention Walks [58] learns node embeddings through random walks. Unlike DeepWalk [40], which uses fixed a
priori weights, Attention Walks factorizes the co-occurrence matrix with differentiable attention weights.

$$E[D] = \tilde{P}^{(0)} \sum_{k=1}^{C} a_k (P)^k \qquad (25)$$

where $D$ denotes the co-occurrence matrix, $\tilde{P}^{(0)}$ denotes the initial position matrix, and $P$
denotes the probability transition matrix.

5.1.2 Summary
Attention mechanisms contribute to graph neural networks in three different ways, namely assigning attention
weights to different neighbors when aggregating feature information, ensembling multiple models according to
attention weights, and using attention weights to guide random walks. Although GAT [15] and GAAN [28] are
categorized under the umbrella of graph attention networks, they can also be considered spatial-based graph
convolution networks. The advantage of GAT [15] and GAAN [28] is that they can adaptively learn the
importance weights of neighbors, as illustrated in Fig 6. However, the computation cost and memory
consumption increase rapidly, as the attention weights between each pair of neighbors must be computed.

5.2 Graph Auto-encoders
Graph auto-encoders are one class of network embedding approaches which aim at representing network vertices
in a low-dimensional vector space by using neural network architectures. A typical solution is to leverage
multi-layer perceptrons as the encoder to obtain node embeddings, where a decoder reconstructs a node's
neighborhood statistics such as positive pointwise mutual information (PPMI) [41] or the first and second
order of proximities [42]. Recently, researchers have explored the use of GCN [14] as an encoder, combining
GCN [14] with GAN [91], or combining LSTM [7] with GAN [91] in designing a graph auto-encoder. We will first
review GCN-based auto-encoders and then summarize other variants in this category.

5.2.1 GCN Based Auto-encoders
Graph Auto-encoder (GAE) [59] first integrates GCN [14] into a graph auto-encoder framework. The encoder is
defined as

$$Z = \mathrm{GCN}(X, A) \qquad (26)$$

while the decoder is defined as

$$\hat{A} = \sigma(Z Z^T) \qquad (27)$$

The framework of GAE is also depicted in Fig 5b. The GAE can be trained in a variational manner, i.e., to
optimize the variational lower bound $L$:

$$L = E_{q(Z|X,A)}[\log p(A|Z)] - KL[q(Z|X,A) \,\|\, p(Z)] \qquad (28)$$

Adversarially Regularized Graph Autoencoder (ARGA) [61] employs the training scheme of generative
adversarial networks (GANs) [91] to regularize a graph auto-encoder. In ARGA, an encoder encodes a node's
structural information with its features into a hidden representation by GCN [14], and a decoder
reconstructs the adjacency matrix from the outputs of the encoder. The GANs play a min-max game between a
generator and a discriminator in training generative models. A generator generates “faked samples” as real
as possible while a discriminator does its best to distinguish
the “faked samples” from the real ones. GAN helps ARGA to regularize the learned hidden representations of nodes to follow a prior distribution. In detail, the encoder, working as a generator, tries to make the learned node hidden representations indistinguishable from a real prior distribution. A discriminator, on the other side, tries to identify whether the learned node hidden representations are generated from the encoder or from a real prior distribution.

5.2.2 Miscellaneous Variants of Graph Auto-encoders
Network Representations with Adversarially Regularized Autoencoders (NetRA) [62] is a graph auto-encoder framework which shares a similar idea with ARGA. It also regularizes node hidden representations to comply with a prior distribution via adversarial training. Instead of reconstructing the adjacency matrix, it recovers node sequences sampled from random walks by a sequence-to-sequence architecture [92].

Deep Neural Networks for Graph Representations (DNGR) [41] uses the stacked denoising autoencoder [93] to reconstruct the positive pointwise mutual information (PPMI) matrix. The PPMI matrix intrinsically captures node co-occurrence information when a graph is serialized as sequences by random walks. Formally, the PPMI matrix is defined as

$$\mathrm{PPMI}_{v_1,v_2} = \max\left(\log\left(\frac{\mathrm{count}(v_1,v_2)\cdot|D|}{\mathrm{count}(v_1)\,\mathrm{count}(v_2)}\right), 0\right) \tag{29}$$

where $|D| = \sum_{v_1,v_2} \mathrm{count}(v_1,v_2)$ and $v_1, v_2 \in V$. The stacked denoising autoencoder is able to learn highly non-linear regularity behind data. Different from conventional neural autoencoders, it adds noise to inputs by randomly switching entries of inputs to zero. The learned latent representation is more robust, especially when there are missing values present.

Structural Deep Network Embedding (SDNE) [42] uses a stacked auto-encoder to preserve nodes’ first-order proximity and second-order proximity jointly. The first-order proximity is defined as the distance between a node’s hidden representation and its neighbor’s hidden representation. The goal for the first-order proximity is to drive representations of adjacent nodes as close to each other as possible. Specifically, the loss function $L_{1st}$ is defined as

$$L_{1st} = \sum_{i,j=1}^{n} A_{i,j} \|h_i^{(k)} - h_j^{(k)}\|^2 \tag{30}$$

The second-order proximity is defined as the distance between a node’s input and its reconstructed input, where the input is the corresponding row of the node in the adjacency matrix. The goal for the second-order proximity is to preserve a node’s neighborhood information. Concretely, the loss function $L_{2nd}$ is defined as

$$L_{2nd} = \sum_{i=1}^{n} \|(\hat{x}_i - x_i) \odot b_i\|^2 \tag{31}$$

The role of vector $b_i$ is to penalize non-zero elements more than zero elements since the inputs are highly sparse. In detail, $b_{i,j} = 1$ if $A_{i,j} = 0$ and $b_{i,j} = \beta > 1$ if $A_{i,j} = 1$. Overall, the objective function is defined as

$$L = L_{2nd} + \alpha L_{1st} + \lambda L_{reg} \tag{32}$$

where $L_{reg}$ is the L2 regularization term.

Deep Recursive Network Embedding (DRNE) [63] directly reconstructs a node’s hidden state instead of the whole graph statistics. Using an aggregation function as the encoder, DRNE designs the loss function as

$$L = \sum_{v \in V} \|h_v - \mathrm{aggregate}(h_u \mid u \in N(v))\|^2 \tag{33}$$

One innovation of DRNE is that it chooses LSTM as the aggregation function, where the neighbor sequence is ordered by node degree.

5.2.3 Summary
DNGR and SDNE learn node embeddings given only the topological structures, while GAE, ARGA, NetRA, and DRNE learn node embeddings when both topological information and node content features are available. One challenge of graph auto-encoders is the sparsity of the adjacency matrix A, causing the number of positive entries of the decoder to be far less than the negative ones. To tackle this issue, DNGR reconstructs a denser matrix, namely the PPMI matrix; SDNE imposes a larger penalty on reconstruction errors of the non-zero entries of the adjacency matrix; GAE reweights the terms in the adjacency matrix; and NetRA linearizes graphs into sequences.

5.3 Graph Generative Networks
The goal of graph generative networks is to generate graphs given an observed set of graphs. Many approaches to graph generative networks are domain specific. For instance, in molecular graph generation, some works model a string representation of molecular graphs called SMILES [94], [95], [96], [97]. In natural language processing, generating a semantic or a knowledge graph is often conditioned on a given sentence [98], [99]. Recently, several general approaches have been proposed. Some works factor the generation process as forming nodes and edges alternately [64], [65], while others employ generative adversarial training [66], [67]. The methods in this category either employ GCN as building blocks or use different architectures.

5.3.1 GCN Based Graph Generative Networks
Molecular Generative Adversarial Networks (MolGAN) [66] integrates relational GCN [100], improved GAN [101], and a reinforcement learning (RL) objective to generate graphs with desired properties. The GAN consists of a generator and a discriminator, competing with each other to improve the authenticity of the generator. In MolGAN, the generator tries to propose a faked graph along with its feature matrix while the discriminator aims to distinguish the faked sample from the empirical data. Additionally, a reward network is introduced in parallel with the discriminator to encourage the generated graphs to possess certain properties according to an external evaluator. The framework of MolGAN is described in Fig. 9.
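To make the PPMI definition in Eq. (29) concrete, the following is a minimal sketch in plain Python rather than the implementation used by DNGR [41]. It assumes that co-occurrences are counted symmetrically within a fixed window over toy random walks, and that count(v) is the row marginal of the pair counts; the text does not specify either choice, and the function names `cooccurrence_counts` and `ppmi` are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(walks, window=1):
    """Count symmetric co-occurrences of node pairs within `window` steps."""
    pair_counts = Counter()
    for walk in walks:
        for i, u in enumerate(walk):
            for j in range(i + 1, min(i + window + 1, len(walk))):
                v = walk[j]
                pair_counts[(u, v)] += 1
                pair_counts[(v, u)] += 1
    return pair_counts


def ppmi(pair_counts):
    """Eq. (29): max(log(count(v1, v2) * |D| / (count(v1) * count(v2))), 0)."""
    total = sum(pair_counts.values())  # |D|
    node_counts = Counter()  # assumption: count(v) is the row marginal
    for (u, _), c in pair_counts.items():
        node_counts[u] += c
    return {
        (u, v): max(math.log(c * total / (node_counts[u] * node_counts[v])), 0.0)
        for (u, v), c in pair_counts.items()
    }


# Toy random walks over four nodes (illustrative data only).
walks = [["a", "b", "c"], ["a", "b", "d"], ["c", "b", "a"]]
counts = cooccurrence_counts(walks, window=1)
matrix = ppmi(counts)
```

Pairs that never co-occur are simply absent from the returned dictionary, which corresponds to a zero entry. The resulting matrix is typically denser than the raw adjacency matrix once longer walks and wider windows are used, which is the property the summary in Section 5.2.3 refers to.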
Fig. 9: Framework of MolGAN [66]. A generator first samples an initial vector from a standard normal distribution. Passing this initial vector through a neural network, the generator outputs a dense adjacency matrix A and a corresponding feature matrix X. Next, the generator produces a sampled discrete Ã and X̃ from categorical distributions based on A and X. Finally, GCN is used to derive a vector representation of the sampled graph. This graph representation is fed to two distinct neural networks, a discriminator and a reward network, each of which outputs a score between zero and one that is used as feedback to update the model parameters.
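The discrete sampling step in the caption above can be illustrated with a small sketch. This is not MolGAN's code: it only shows, under stated assumptions, how a dense probabilistic output can be turned into a discrete graph by drawing one category per node and per unordered node pair. The function name `sample_discrete_graph`, the convention that edge type 0 means "no edge", and the toy probabilities are all assumptions of this example.

```python
import random


def sample_discrete_graph(edge_probs, node_probs, rng):
    """Draw one category per node and per unordered node pair.

    edge_probs[i][j] is a list of probabilities over edge types for the pair
    (i, j); index 0 is taken to mean "no edge" (an assumption of this sketch).
    node_probs[i] is a list of probabilities over node types.
    """
    n = len(node_probs)
    node_types = [
        rng.choices(range(len(p)), weights=p, k=1)[0] for p in node_probs
    ]
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probs[i][j]
            edge_type = rng.choices(range(len(p)), weights=p, k=1)[0]
            adjacency[i][j] = adjacency[j][i] = edge_type  # keep it symmetric
    return adjacency, node_types


rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
node_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
uniform = [1 / 3, 1 / 3, 1 / 3]
edge_probs = [[uniform for _ in range(3)] for _ in range(3)]
adjacency, node_types = sample_discrete_graph(edge_probs, node_probs, rng)
```

Sampling only the upper triangle and mirroring it keeps the sampled adjacency matrix symmetric with an empty diagonal, which suits undirected graphs without self-loops. Plain categorical sampling is not differentiable, so a trainable generator would need an additional mechanism that this sketch does not cover.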
Deep Generative Models of Graphs (DGMG) [65] utilizes spatial-based graph convolution networks to obtain a hidden representation of an existing graph. The decision process of generating nodes and edges is conditioned on the resultant graph representation. Briefly, DGMG recursively proposes a node to a growing graph until a stopping criterion is invoked. In each step after adding a new node, DGMG repeatedly decides whether to add an edge to the added node until the decision turns to false. If the decision is true, it evaluates the probability distribution of connecting the newly added node to all existing nodes and samples one node from the probability distribution. After a new node and its connections are added to the existing graph, DGMG updates the graph representation again.

5.3.2 Miscellaneous Graph Generative Networks
GraphRNN [64] exploits deep graph generative models through two-level recurrent neural networks. The graph-level RNN adds a new node each time to a node sequence, while the edge-level RNN produces a binary sequence indicating connections between the newly added node and previously generated nodes in the sequence. To linearize a graph into a sequence of nodes for training the graph-level RNN, GraphRNN adopts the breadth-first-search (BFS) strategy. To model the binary sequence for training the edge-level RNN, GraphRNN assumes a multivariate Bernoulli or conditional Bernoulli distribution.

NetGAN [67] combines LSTM [7] with Wasserstein GAN [102] to generate graphs from a random-walk-based approach. The GAN framework consists of two modules, a generator and a discriminator. The generator makes its best effort to generate plausible random walks through an LSTM network while the discriminator tries to distinguish faked random walks from the real ones. After training, a new graph is obtained by normalizing a co-occurrence matrix of nodes which occur in a set of random walks.

5.3.3 Summary
Evaluating generated graphs remains a difficult problem. Unlike synthesized images or audio, which can be directly assessed by human experts, the quality of generated graphs is difficult to inspect visually. MolGAN and DGMG make use of external knowledge to evaluate the validity of generated molecule graphs. GraphRNN and NetGAN evaluate generated graphs by graph statistics (e.g., node degrees). Whereas DGMG and GraphRNN generate nodes and edges sequentially, MolGAN and NetGAN generate nodes and edges jointly. According to [68], the disadvantage of the former approaches is that when graphs become large, modelling a long sequence is not realistic. The challenge of the latter approaches is that global properties of the graph are difficult to control. A recent approach [68] adopts a variational auto-encoder to generate a graph by proposing the adjacency matrix, imposing penalty terms to address validity constraints. However, as the output space of a graph with $n$ nodes is $n^2$, none of these methods is scalable to large graphs.

5.4 Graph Spatial-Temporal Networks
Graph spatial-temporal networks capture spatial and temporal dependencies of a spatial-temporal graph simultaneously. Spatial-temporal graphs have a global graph structure with inputs to each node that change across time. For instance, in traffic networks, each sensor, taken as a node, continuously records the traffic speed of a certain road, and the edges of the traffic network are determined by the distance between pairs of sensors. The goal of graph spatial-temporal networks can be forecasting future node values or labels, or predicting spatial-temporal graph labels. Recent studies have explored the use of GCNs alone [72], a combination of GCNs with RNN [70] or CNN [71], and a recurrent architecture tailored to graph structures [73]. In the following, we introduce these methods.

5.4.1 GCN Based Graph Spatial-Temporal Networks
Diffusion Convolutional Recurrent Neural Network (DCRNN) [70] introduces diffusion convolution as graph convolution for capturing spatial dependency and uses a sequence-to-sequence architecture [92] with gated recurrent units (GRU) [79] to capture temporal dependency.
Diffusion convolution models a truncated diffusion process with forward and backward directions. Formally, the diffusion convolution is defined as
DBLP can be found on https://dblp.uni-trier.de. A processed version of the DBLP paper-citation network is updated continuously by https://aminer.org/citation.

Social Networks are formed by user interactions from online services such as BlogCatalog, Reddit, and Epinions. The BlogCatalog dataset is a social network which consists of bloggers and their social relationships. The labels of bloggers represent their personal interests. The Reddit dataset is an undirected graph formed by posts collected from the Reddit discussion forum. Two posts are linked if they contain comments by the same user. Each post has a label indicating the community to which it belongs. The Epinions dataset is a multi-relation graph collected from an online product review website where commenters can have more than one type of relation, such as trust, distrust, co-review, and co-rating.

Chemical/Biological Graphs Chemical molecules and compounds can be represented by chemical graphs with atoms as nodes and chemical bonds as edges. This category of graphs is often used to evaluate graph classification performance. The NCI-1 and NCI-9 datasets contain 4100 and 4127 chemical compounds respectively, labeled as to whether they are active in hindering the growth of human cancer cell lines. The MUTAG dataset contains 188 nitro compounds, labeled as to whether they are aromatic or heteroaromatic. The D&D dataset contains 1178 protein structures, labeled as to whether they are enzymes or non-enzymes. The QM9 dataset contains 133885 molecules labeled with 13 chemical properties. The Tox21 dataset contains 12707 chemical compounds labeled with 12 types of toxicity. Another important dataset is the Protein-Protein Interaction (PPI) network. It contains 24 biological graphs with nodes represented by proteins and edges represented by the interactions between proteins. In PPI, each graph is associated with a human tissue. Each node is labeled with its biological states.

Unstructured Graphs To test the generalization of graph neural networks to unstructured data, the k-nearest-neighbor graph (k-NN graph) has been widely used. The MNIST dataset contains 70000 images of size 28×28 labeled with 10 digits. A typical way to convert an MNIST image to a graph is to construct an 8-NN graph based on its pixel locations. The Wikipedia dataset is a word co-occurrence network extracted from the first million bytes of the Wikipedia dump. Labels of words represent part-of-speech (POS) tags. The 20-NewsGroup dataset consists of around 20,000 News Group (NG) text documents categorized by 20 news types. The graph of the 20-NewsGroup dataset is constructed by representing each document as a node and using the similarities between nodes as edge weights.

Others There are several other datasets worth mentioning. METR-LA is a traffic dataset collected from the highways of Los Angeles County. The MovieLens-1M dataset from the MovieLens website contains 1 million item ratings given by 6k users. It is a benchmark dataset for recommender systems. The NELL dataset is a knowledge graph obtained from the Never-Ending Language Learning project. It consists of facts, each represented by a triplet which involves two entities and their relation.

6.2 Benchmarks & Open-source Implementations
Of the datasets listed in Table 5, Cora, Pubmed, Citeseer, and PPI are the most frequently used. They are often used to compare the performance of graph convolution networks in node classification tasks. In Table 6, we report the benchmark performance on these four datasets, all of which use standard data splits. Open-source implementations facilitate baseline experiments in deep learning research. Due to the vast number of hyperparameters, it is difficult to achieve the same results as reported in the literature without using published code. In Table 7, we provide the hyperlinks of open-source implementations of the graph neural network models reviewed in Sections 4-5. Notably, Fey et al. [86] published a geometric learning library in PyTorch named PyTorch Geometric (footnote 3), which implements several graph neural networks including ChebNet [12], 1stChebNet [14], GraphSage [24], MPNNs [13], GAT [15] and SplineCNN [86]. Most recently, the Deep Graph Library (DGL) (footnote 4) was released, which provides a fast implementation of many graph neural networks with a set of functions on top of popular deep learning platforms such as PyTorch and MXNet.

6.3 Practical Applications
Graph neural networks have a wide range of applications across different tasks and domains. Besides the general tasks at which each category of GNNs is specialized, including node classification, node representation learning, graph classification, graph generation, and spatial-temporal forecasting, GNNs can also be applied to node clustering, link prediction [119], and graph partition [120]. In this section, we mainly introduce practical applications according to the general domains to which they belong.

Computer Vision One of the biggest application areas for graph neural networks is computer vision. Researchers have explored leveraging graph structures in scene graph generation, point cloud classification and segmentation, action recognition, and many other directions.
In scene graph generation, semantic relationships between objects facilitate the understanding of the semantic meaning behind a visual scene. Given an image, scene graph generation models detect and recognize objects and predict semantic relationships between pairs of objects [121], [122], [123]. Another application inverts the process by generating realistic images given scene graphs [124]. As natural language can be parsed as semantic graphs where each word represents an object, it is a promising solution to synthesize images given textual descriptions.
In point cloud classification and segmentation, a point cloud is a set of 3D points recorded by LiDAR scans. Solutions for this task enable LiDAR devices to see the surrounding environment, which is typically beneficial for unmanned vehicles. To identify objects depicted by point clouds, [125], [126], [127] convert point clouds into k-nearest neighbor graphs or superpoint graphs, and use graph convolution networks to explore the topological structure.

3. https://github.com/rusty1s/pytorch_geometric
4. https://www.dgl.ai/
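Both the 8-NN graph over MNIST pixel locations and the k-nearest neighbor graphs built from point clouds rest on the same construction. The sketch below shows one straightforward way to build such a graph in plain Python; it is an illustration under assumptions, not the procedure of any cited work. The function name `knn_graph`, the Euclidean metric, and the 4×4 toy grid standing in for the 28×28 pixel grid are choices made for this example.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points.

    Distances are Euclidean; ties are broken by the smaller index.
    """
    neighbors = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors


# Pixel coordinates of a 4x4 grid stand in for the 28x28 MNIST grid.
pixels = [(r, c) for r in range(4) for c in range(4)]
graph = knn_graph(pixels, k=8)
```

For an interior pixel, the eight nearest neighbors are exactly the surrounding pixels, so the 8-NN graph reproduces the familiar 8-connected grid away from the border. This brute-force version compares every pair of points; for large point clouds a spatial index such as a k-d tree would usually be preferred.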
TABLE 6: Benchmark performance on the four most frequently used datasets. The listed methods use the same training, validation, and test data for evaluation.

Method                       Cora       Citeseer   Pubmed     PPI
1stChebnet (2016) [14]       81.5       70.3       79.0       -
GraphSage (2017) [24]        -          -          -          61.2
GAT (2017) [15]              83.0±0.7   72.5±0.7   79.0±0.3   97.3±0.2
Cayleynets (2017) [23]       81.9±0.7   -          -          -
StoGCN (2018) [46]           82.0±0.8   70.9±0.2   79±0.4     97.9±0.04
DualGCN (2018) [49]          83.5       72.6       80.0       -
GAAN (2018) [28]             -          -          -          98.71±0.02
GraphInfoMax (2018) [118]    82.3±0.6   71.8±0.7   76.8±0.6   63.8±0.2
GeniePath (2018) [48]        -          -          78.5       97.9
LGCN (2018) [27]             83.3±0.5   73.0±0.6   79.5±0.2   77.2±0.2
SSE (2018) [19]              -          -          -          83.6

In action recognition, recognizing human actions contained in videos facilitates a better understanding of video content from a machine perspective. One group of solutions detects the locations of human joints in video clips. Human joints which are linked by skeletons naturally form a graph. Given the time series of human joint locations, [72], [73] apply spatial-temporal neural networks to learn human action patterns.
In addition, the number of possible directions in which to apply graph neural networks in computer vision is still growing. This includes few-shot image classification [128], [129], semantic segmentation [130], [131], visual reasoning [132] and question answering [133].

Recommender Systems Graph-based recommender systems take items and users as nodes. By leveraging the relations between items and items, users and users, users and items, as well as content information, graph-based recommender systems are able to produce high-quality recommendations. The key to a recommender system is to score the importance of an item to a user. As a result, it can be cast as a link prediction problem. The goal is to predict the missing links between users and items. To address this problem, van den Berg et al. [9] and Ying et al. [11] propose a GCN-based graph auto-encoder. Monti et al. [10] combine GCN and RNN to learn the underlying process that generates the known ratings.

Traffic Traffic congestion has become a hot social issue in modern cities. Accurately forecasting traffic speed, volume, or the density of roads in traffic networks is fundamentally important in route planning and flow control. [28], [70], [71], [134] adopt a graph-based approach with spatial-temporal neural networks. The input to their models is a spatial-temporal graph. In this spatial-temporal graph, nodes are represented by sensors placed on roads, edges are represented by the distance of pair-wise nodes above a threshold, and each node contains a time series as features. The goal is to forecast the average speed of a road within a time interval. Another interesting application is taxi-demand prediction. This greatly helps intelligent transportation systems make use of resources and save energy effectively. Given historical taxi demands, location information, weather data, and event features, Yao et al. [135] incorporate LSTM, CNN, and node embeddings trained by LINE [136] to form a joint representation for each location to predict the number of taxis demanded for a location within a time interval.

Chemistry In chemistry, researchers apply graph neural
networks to study the graph structures of molecules. In a molecular graph, atoms function as nodes and chemical bonds function as edges. Node classification, graph classification, and graph generation are the three main tasks targeting molecular graphs in order to learn molecular fingerprints [53], [80], to predict molecular properties [13], to infer protein interfaces [137], and to synthesize chemical compounds [65], [66], [138].

Others There have been initial explorations into applying GNNs to other problems such as program verification [18], program reasoning [139], social influence prediction [140], adversarial attack prevention [141], electronic health records modeling [142], [143], event detection [144] and combinatorial optimization [145].

7 FUTURE DIRECTIONS
Though graph neural networks have proven their power in learning graph data, challenges still exist due to the complexity of graphs. In this section, we provide four future directions of graph neural networks.

Go Deep The success of deep learning lies in deep neural architectures. In image classification, for example, an outstanding model named ResNet [146] has 152 layers. However, when it comes to graphs, experimental studies have shown that with the increase in the number of layers, the model performance drops dramatically [147]. According to [147], this is due to the effect of graph convolutions in that they essentially push representations of adjacent nodes closer to each other so that, in theory, with an infinite number of convolutions, all nodes’ representations will converge to a single point. This raises the question of whether going deep is still a good strategy for learning graph-structured data.

Receptive Field The receptive field of a node refers to a set of nodes including the central node and its neighbors. The number of neighbors of a node follows a power law distribution. Some nodes may only have one neighbor, while other nodes may have as many as thousands of neighbors. Though sampling strategies have been adopted [24], [26], [27], how to select a representative receptive field of a node remains to be explored.

Scalability Most graph neural networks do not scale well for large graphs. The main reason for this is that when stacking multiple layers of a graph convolution, a node’s final state involves a large number of its neighbors’ hidden states, leading to high complexity of backpropagation. While several approaches try to improve their model efficiency by fast sampling [45], [46] and sub-graph training [24], [27], they are still not scalable enough to handle deep architectures with large graphs.

Dynamics and Heterogeneity The majority of current graph neural networks tackle static homogeneous graphs. On the one hand, graph structures are assumed to be fixed. On the other hand, nodes and edges from a graph are assumed to come from a single source. However, these two assumptions are not realistic in many scenarios. In a social network, a new person may enter the network at any time and an existing person may quit the network as well. In a recommender system, products may have different types where their inputs may have different forms such as texts or images. Therefore, new methods should be developed to handle dynamic and heterogeneous graph structures.

8 CONCLUSION
In this survey, we conduct a comprehensive overview of graph neural networks. We provide a taxonomy which groups graph neural networks into five categories: graph convolutional networks, graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. We provide a thorough review, comparisons, and summarizations of the methods within or between categories. Then we introduce a wide range of applications of graph neural networks. Datasets, open source codes, and benchmarks for graph neural networks are summarized. Finally, we suggest four future directions for graph neural networks.

ACKNOWLEDGMENT
This research was funded by the Australian Government through the Australian Research Council (ARC) under grants 1) LP160100630 partnership with Australia Government Department of Health and 2) LP150100671 partnership
with Australia Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA). We acknowledge the support of NVIDIA Corporation and MakeMagic Australia with the donation of GPU used for this research.

REFERENCES
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[3] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[6] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[9] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” stat, vol. 1050, p. 7, 2017.
[10] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 3697–3707.
[11] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 974–983.
[12] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[13] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 1263–1272.
[14] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proceedings of the International Conference on Learning Representations, 2017.
[15] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in Proceedings of the International Conference on Learning Representations, 2017.
[16] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of the International Joint Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 729–734.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[18] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in Proceedings of the International Conference on Learning Representations, 2015.
[19] H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song, “Learning steady-states of iterative algorithms over graphs,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 1114–1122.
[20] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proceedings of International Conference on Learning Representations, 2014.
[21] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[22] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolutional neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 3546–3553.
[23] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” arXiv preprint arXiv:1705.07664, 2017.
[24] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[25] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[26] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 2014–2023.
[27] H. Gao, Z. Wang, and S. Ji, “Large-scale learnable graph convolutional networks,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1416–1424.
[28] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “Gaan: Gated attention networks for learning on large and spatiotemporal graphs,” in Proceedings of the Uncertainty in Artificial Intelligence, 2018.
[29] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
[30] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention models in graphs: A survey,” arXiv preprint arXiv:1807.07984, 2018.
[31] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” arXiv preprint arXiv:1812.04202, 2018.
[32] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” IEEE Transactions on Knowledge and Data Engineering, 2017.
[33] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[34] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” IEEE Transactions on Big Data, 2018.
[35] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, 2018.
[36] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[37] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” in Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 2016, pp. 1895–1901.
[38] X. Shen, S. Pan, W. Liu, Y.-S. Ong, and Q.-S. Sun, “Discrete network embedding,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2018, pp. 3549–3555.
[39] H. Yang, S. Pan, P. Zhang, L. Chen, D. Lian, and C. Zhang, “Binarized attributed network embedding,” in IEEE International Conference on Data Mining. IEEE, 2018.
[40] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
[41] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 1145–1152.
[42] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1225–1234.
[43] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using lanczos method,” arXiv preprint arXiv:1509.04537, 2015.
[44] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 20
[45] J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph [67] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann, “Net-
convolutional networks via importance sampling,” in Proceedings gan: Generating graphs via random walks,” in Proceedings of the
of the International Conference on Learning Representations, 2018. International Conference on Machine Learning, 2018.
[46] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph [68] T. Ma, J. Chen, and C. Xiao, “Constrained generation of semanti-
convolutional networks with variance reduction,” in Proceedings cally valid graphs via regularizing variational autoencoders,” in
of the International Conference on Machine Learning, 2018, pp. 941– Advances in Neural Information Processing Systems, 2018, pp. 7110–
949. 7121.
[47] F. P. Such, S. Sah, M. A. Dominguez, S. Pillai, C. Zhang, [69] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Struc-
A. Michael, N. D. Cahill, and R. Ptucha, “Robust spatial filter- tured sequence modeling with graph convolutional recurrent
ing with graph convolutional neural networks,” IEEE Journal of networks,” arXiv preprint arXiv:1612.07659, 2016.
Selected Topics in Signal Processing, vol. 11, no. 6, pp. 884–896, 2017. [70] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional
[48] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, and L. Song, “Geniepath: recurrent neural network: Data-driven traffic forecasting,” in
Graph neural networks with adaptive receptive paths,” arXiv Proceedings of International Conference on Learning Representations,
preprint arXiv:1802.00910, 2018. 2018.
[49] C. Zhuang and Q. Ma, “Dual graph convolutional networks for [71] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional
graph-based semi-supervised classification,” in Proceedings of the networks: A deep learning framework for traffic forecasting,”
World Wide Web Conference on World Wide Web. International in Proceedings of the International Joint Conference on Artificial
World Wide Web Conferences Steering Committee, 2018, pp. 499– Intelligence, 2017, pp. 3634–3640.
508. [72] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph con-
[50] T. Derr, Y. Ma, and J. Tang, “Signed graph convolutional net- volutional networks for skeleton-based action recognition,” in
work,” arXiv preprint arXiv:1808.06354, 2018. Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[51] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh, “Column [73] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn:
networks for collective classification,” in Proceedings of the AAAI Deep learning on spatio-temporal graphs,” in Proceedings of the
Conference on Artificial Intelligence, 2017, pp. 2485–2491. IEEE Conference on Computer Vision and Pattern Recognition, 2016,
[52] M. Simonovsky and N. Komodakis, “Dynamic edgeconditioned pp. 5308–5317.
filters in convolutional neural networks on graphs,” in Proceed- [74] S. Pan, J. Wu, X. Zhu, C. Zhang, and P. S. Yu, “Joint structure
ings of the IEEE conference on computer vision and pattern recognition, feature exploration and regularization for multi-task graph clas-
2017. sification,” IEEE Transactions on Knowledge and Data Engineering,
[53] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, vol. 28, no. 3, pp. 715–728, 2016.
“Molecular graph convolutions: moving beyond fingerprints,” [75] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Task sensitive fea-
Journal of computer-aided molecular design, vol. 30, no. 8, pp. 595– ture exploration and learning for multitask graph classification,”
608, 2016. IEEE transactions on cybernetics, vol. 47, no. 3, pp. 744–758, 2017.
[54] W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling [76] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Van-
towards fast graph representation learning,” in Advances in Neu- dergheynst, “The emerging field of signal processing on graphs:
ral Information Processing Systems, 2018, pp. 4563–4572. Extending high-dimensional data analysis to networks and other
[55] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end irregular domains,” IEEE Signal Processing Magazine, vol. 30,
deep learning architecture for graph classification,” in Proceedings no. 3, pp. 83–98, 2013.
of the AAAI Conference on Artificial Intelligence, 2018. [77] L. B. Almeida, “A learning rule for asynchronous perceptrons
[56] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, with feedback in a combinatorial environment.” in Proceedings of
“Hierarchical graph representation learning with differentiable the International Conference on Neural Networks, vol. 2. IEEE, 1987,
pooling,” in Advances in Neural Information Processing Systems, pp. 609–618.
2018, pp. 4801–4811. [78] F. J. Pineda, “Generalization of back-propagation to recurrent
[57] J. B. Lee, R. Rossi, and X. Kong, “Graph classification using struc- neural networks,” Physical review letters, vol. 59, no. 19, p. 2229,
tural attention,” in Proceedings of the ACM SIGKDD International 1987.
Conference on Knowledge Discovery & Data Mining. ACM, 2018, [79] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau,
pp. 1666–1674. F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase rep-
[58] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. A. Alemi, “Watch resentations using rnn encoder-decoder for statistical machine
your step: Learning node embeddings via graph attention,” in translation,” in Proceedings of the Conference on Empirical Methods
Advances in Neural Information Processing Systems, 2018, pp. 9197– in Natural Language Processing, 2014, pp. 1724–1734.
9207. [80] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell,
[59] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional
arXiv preprint arXiv:1611.07308, 2016. networks on graphs for learning molecular fingerprints,” in
[60] C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang, “Mgae: Marginal- Advances in Neural Information Processing Systems, 2015, pp. 2224–
ized graph autoencoder for graph clustering,” in Proceedings of 2232.
the ACM on Conference on Information and Knowledge Management. [81] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and
ACM, 2017, pp. 889–898. A. Tkatchenko, “Quantum-chemical insights from deep tensor
[61] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang, “Adver- neural networks,” Nature communications, vol. 8, p. 13890, 2017.
sarially regularized graph autoencoder for graph embedding.” [82] B. Weisfeiler and A. Lehman, “A reduction of a graph to a
in Proceedings of the International Joint Conference on Artificial canonical form and an algebra arising during this reduction,”
Intelligence, 2018, pp. 2609–2615. Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968.
[62] W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song, B. Zong, [83] B. L. Douglas, “The weisfeiler-lehman method and graph isomor-
H. Chen, and W. Wang, “Learning deep network representations phism testing,” arXiv preprint arXiv:1101.5211, 2011.
with adversarially regularized autoencoders,” in Proceedings of [84] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst,
the ACM SIGKDD International Conference on Knowledge Discovery “Geodesic convolutional neural networks on riemannian mani-
& Data Mining. ACM, 2018, pp. 2663–2671. folds,” in Proceedings of the IEEE International Conference on Com-
[63] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu, “Deep recursive puter Vision Workshops, 2015, pp. 37–45.
network embedding with regular equivalence,” in Proceedings of [85] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein, “Learning
the ACM SIGKDD International Conference on Knowledge Discovery shape correspondence with anisotropic convolutional neural net-
and Data Mining. ACM, 2018, pp. 2357–2366. works,” in Advances in Neural Information Processing Systems, 2016,
[64] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, pp. 3189–3197.
“Graphrnn: A deep generative model for graphs,” Proceedings of [86] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast
International Conference on Machine Learning, 2018. geometric deep learning with continuous b-spline kernels,” in
[65] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning Proceedings of the IEEE Conference on Computer Vision and Pattern
deep generative models of graphs,” in Proceedings of the Interna- Recognition, 2018, pp. 869–877.
tional Conference on Machine Learning, 2018. [87] S. Pan, J. Wu, and X. Zhu, “Cogboost: Boosting for fast cost-
[66] N. De Cao and T. Kipf, “Molgan: An implicit generative model sensitive graph classification,” IEEE Transactions on Knowledge &
for small molecular graphs,” arXiv preprint arXiv:1805.11973, Data Engineering, no. 1, pp. 1–1, 2015.
2018.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 21
[88] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are [111] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J.
graph neural networks,” arXiv preprint arXiv:1810.00826, 2018. Shusterman, and C. Hansch, “Structure-activity relationship of
[89] S. Verma and Z.-L. Zhang, “Graph capsule convolutional neural mutagenic aromatic and heteroaromatic nitro compounds. cor-
networks,” arXiv preprint arXiv:1805.08090, 2018. relation with molecular orbital energies and hydrophobicity,”
[90] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Journal of medicinal chemistry, vol. 34, no. 2, pp. 786–797, 1991.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” [112] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures
in Advances in Neural Information Processing Systems, 2017, pp. from non-enzymes without alignments,” Journal of molecular biol-
5998–6008. ogy, vol. 330, no. 4, pp. 771–783, 2003.
[91] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- [113] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilien-
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- feld, “Quantum chemistry structures and properties of 134 kilo
sarial nets,” in Advances in neural information processing systems, molecules,” Scientific data, vol. 1, p. 140022, 2014.
2014, pp. 2672–2680. [114] T. Joachims, “A probabilistic analysis of the rocchio algorithm
[92] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence with tfidf for text categorization.” Carnegie-mellon univ pitts-
learning with neural networks,” in Advances in Neural Information burgh pa dept of computer science, Tech. Rep., 1996.
Processing Systems, 2014, pp. 3104–3112. [115] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M.
[93] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Ex- Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its tech-
tracting and composing robust features with denoising autoen- nical challenges,” Communications of the ACM, vol. 57, no. 7, pp.
coders,” in Proceedings of the international conference on Machine 86–94, 2014.
learning. ACM, 2008, pp. 1096–1103. [116] B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan, and J. Riedl,
[94] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. “Movielens unplugged: experiences with an occasionally con-
Farias, and A. Aspuru-Guzik, “Objective-reinforced generative nected recommender system,” in Proceedings of the international
adversarial networks (organ) for sequence generation models,” conference on Intelligent user interfaces. ACM, 2003, pp. 263–266.
arXiv preprint arXiv:1705.10843, 2017. [117] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr,
[95] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar and T. M. Mitchell, “Toward an architecture for never-ending
variational autoencoder,” arXiv preprint arXiv:1703.01925, 2017. language learning.” in Proceedings of the AAAI Conference on
[96] H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song, “Syntax-directed Artificial Intelligence, 2010, pp. 1306–1313.
variational autoencoder for molecule generation,” in Proceedings [118] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Ben-
of the International Conference on Learning Representations, 2018. gio, and R. D. Hjelm, “Deep graph infomax,” arXiv preprint
[97] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández- arXiv:1809.10341, 2018.
Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera- [119] M. Zhang and Y. Chen, “Link prediction based on graph neural
Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, networks,” in Advances in Neural Information Processing Systems,
“Automatic chemical design using a data-driven continuous 2018.
representation of molecules,” ACS central science, vol. 4, no. 2, [120] T. Kawamoto, M. Tsubaki, and T. Obuchi, “Mean-field theory
pp. 268–276, 2018. of graph neural networks in graph partitioning,” in Advances in
[98] B. Chen, L. Sun, and X. Han, “Sequence-to-action: End-to-end Neural Information Processing Systems, 2018, pp. 4362–4372.
semantic graph generation for semantic parsing,” in Proceedings of [121] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation
the Annual Meeting of the Association for Computational Linguistics, by iterative message passing,” in Proceedings of the IEEE Confer-
2018, pp. 766–777. ence on Computer Vision and Pattern Recognition, vol. 2, 2017.
[99] D. D. Johnson, “Learning graphical state transitions,” in Proceed- [122] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn
ings of the International Conference on Learning Representations, 2016. for scene graph generation,” in European Conference on Computer
[100] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, Vision. Springer, 2018, pp. 690–706.
and M. Welling, “Modeling relational data with graph convolu- [123] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Fac-
tional networks,” in European Semantic Web Conference. Springer, torizable net: an efficient subgraph-based framework for scene
2018, pp. 593–607. graph generation,” in European Conference on Computer Vision.
[101] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Springer, 2018, pp. 346–363.
Courville, “Improved training of wasserstein gans,” in Advances [124] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from
in Neural Information Processing Systems, 2017, pp. 5767–5777. scene graphs,” arXiv preprint, 2018.
[102] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv [125] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M.
preprint arXiv:1701.07875, 2017. Solomon, “Dynamic graph cnn for learning on point clouds,”
[103] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi- arXiv preprint arXiv:1801.07829, 2018.
Rad, “Collective classification in network data,” AI magazine, [126] L. Landrieu and M. Simonovsky, “Large-scale point cloud seman-
vol. 29, no. 3, p. 93, 2008. tic segmentation with superpoint graphs,” in Proceedings of the
[104] X. Zhang, Y. Li, D. Shen, and L. Carin, “Diffusion maps for textual IEEE Conference on Computer Vision and Pattern Recognition, 2018.
network embedding,” in Advances in Neural Information Processing [127] G. Te, W. Hu, Z. Guo, and A. Zheng, “Rgcnn: Regular-
Systems, 2018. ized graph cnn for point cloud segmentation,” arXiv preprint
[105] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: arXiv:1806.02952, 2018.
extraction and mining of academic social networks,” in Proceed- [128] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph
ings of the ACM SIGKDD International Conference on Knowledge neural networks,” in Proceedings of the International Conference on
Discovery and Data Mining. ACM, 2008, pp. 990–998. Learning Representations, 2018.
[106] Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang, “Multi- [129] M. Guo, E. Chou, D.-A. Huang, S. Song, S. Yeung, and L. Fei-
dimensional graph convolutional networks,” arXiv preprint Fei, “Neural graph matching networks for fewshot 3d action
arXiv:1808.06099, 2018. recognition,” in European Conference on Computer Vision. Springer,
[107] L. Tang and H. Liu, “Relational learning via latent social dimen- 2018, pp. 673–689.
sions,” in Proceedings of the ACM SIGKDD International Conference [130] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural
on Knowledge Ciscovery and Data Mining. ACM, 2009, pp. 817– networks for rgbd semantic segmentation,” in Proceedings of the
826. IEEE Conference on Computer Vision and Pattern Recognition, 2017,
[108] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, pp. 5199–5208.
and M. Guo, “Graphgan: Graph representation learning with [131] L. Yi, H. Su, X. Guo, and L. J. Guibas, “Syncspeccnn: Synchro-
generative adversarial nets,” in Proceedings of the AAAI Conference nized spectral cnn for 3d shape segmentation.” in Proceedings of
on Artificial Intelligence, 2017. the IEEE Conference on Computer Vision and Pattern Recognition,
[109] M. Zitnik and J. Leskovec, “Predicting multicellular function 2017, pp. 6584–6592.
through multi-layer tissue networks,” Bioinformatics, vol. 33, [132] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reason-
no. 14, pp. i190–i198, 2017. ing beyond convolutions,” in Proceedings of the IEEE Conference on
[110] N. Wale, I. A. Watson, and G. Karypis, “Comparison of descrip- Computer Vision and Pattern Recognition, 2018.
tor spaces for chemical compound retrieval and classification,” [133] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the
Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, box: Reasoning with graph convolution nets for factual visual
2008. question answering,” in Advances in Neural Information Processing
Systems, 2018, pp. 2655–2666.
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 22
[134] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “High-order graph [141] D. Zügner, A. Akbarnejad, and S. Günnemann, “Adversarial
convolutional recurrent neural network: a deep learning frame- attacks on neural networks for graph data,” in Proceedings of the
work for network-scale traffic learning and forecasting,” arXiv ACM SIGKDD International Conference on Knowledge Discovery and
preprint arXiv:1802.07007, 2018. Data Mining. ACM, 2018, pp. 2847–2856.
[135] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, [142] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun, “Gram:
and Z. Li, “Deep multi-view spatial-temporal network for taxi graph-based attention model for healthcare representation learn-
demand prediction,” in Proceedings of the AAAI Conference on ing,” in Proceedings of the ACM SIGKDD International Conference on
Artificial Intelligence, 2018, pp. 2588–2595. Knowledge Discovery and Data Mining. ACM, 2017, pp. 787–795.
[136] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: [143] E. Choi, C. Xiao, W. Stewart, and J. Sun, “Mime: Multilevel
Large-scale information network embedding,” in Proceedings of medical embedding of electronic health records for predictive
the International Conference on World Wide Web. International healthcare,” in Advances in Neural Information Processing Systems,
World Wide Web Conferences Steering Committee, 2015, pp. 2018, pp. 4548–4558.
1067–1077. [144] T. H. Nguyen and R. Grishman, “Graph convolutional networks
[137] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, “Protein interface with argument-aware pooling for event detection,” in Proceedings
prediction using graph convolutional networks,” in Advances in of the AAAI Conference on Artificial Intelligence, 2018, pp. 5900–
Neural Information Processing Systems, 2017, pp. 6530–6539. 5907.
[138] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec, “Graph [145] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization
convolutional policy network for goal-directed molecular graph with graph convolutional networks and guided tree search,” in
generation,” in Advances in Neural Information Processing Systems, Advances in Neural Information Processing Systems, 2018, pp. 536–
2018. 545.
[139] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to [146] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
represent programs with graphs,” in Proceedings of the Interna- for image recognition,” in Proceedings of the IEEE conference on
tional Conference on Learning Representations, 2017. computer vision and pattern recognition, 2016, pp. 770–778.
[140] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, “Deepinf: [147] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convo-
Social influence prediction with deep learning,” in Proceedings of lutional networks for semi-supervised learning,” in Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery the AAAI Conference on Artificial Intelligence, 2018.
& Data Mining. ACM, 2018, pp. 2110–2119.
Graph Neural Networks with convolutional ARMA filters
Contribution. In this paper, we address the limitations of existing graph convolutional layers in modeling a desired filter response and propose a GNN based on a novel ARMA layer. The ARMA layer implements a non-linear and trainable ARMA graph filter that generalizes the existing graph convolutional layers based on polynomial filters and provides the GNN with enhanced modeling capability, thanks to a flexible design of the filter transfer function. Unlike polynomial filters, ARMA filters are not localized in the node space, which makes their implementation inefficient within a GNN. To address this scalability issue, the proposed ARMA layer relies on a recursive formulation, which leads to a fast and distributed implementation that exploits efficient sparse operations on tensors. The resulting filters are not learned in the Fourier space induced by a given Laplacian, but are local in the node space and independent of the underlying graph structure. This allows our GNN to process graphs with different topologies.

We use a node pooling procedure based on node decimation, which builds on the multi-resolution framework adopted in graph signal processing (Shuman et al., 2016). This allows us to build deep architectures that yield more abstract representations at different network depths. Given an input graph, node decimation drops approximately half of the nodes, and a coarsened version of the graph on the remaining ones is obtained through graph reduction. Pooling with different strides is implemented in the GNN by means of multiplications with pre-computed matrices.

To assess the performance of our GNN, we apply it to semi-supervised node classification, graph signal classification, and graph classification. Results show that the proposed GNN with ARMA filters outperforms GNNs based on polynomial filters, setting the new state-of-the-art in several tasks.

2. Spectral filtering in GNNs

We assume a graph with $M$ nodes to be characterized by a symmetric adjacency matrix $A \in \mathbb{R}^{M \times M}$, and we refer to the graph signal $X \in \mathbb{R}^{M \times F}$ as the instance of all features (vectors in $\mathbb{R}^{F}$) associated with the graph nodes. Let $L = I_M - D^{-1/2} A D^{-1/2}$ be the symmetrically normalized Laplacian ($D$ is the degree matrix), with spectral decomposition $L = \sum_{m=1}^{M} \lambda_m u_m u_m^T$. A graph filter is a linear operator that modifies the components of $X$ on the eigenvector basis of $L$, according to a transfer function $h$ acting on each eigenvalue $\lambda$. The filtered graph signal reads

$$\bar{X} = \sum_{m=1}^{M} h(\lambda_m)\, u_m u_m^T X = U\, \mathrm{diag}[h(\lambda_1), \dots, h(\lambda_M)]\, U^T X. \tag{1}$$

This formulation inspired the seminal work of Bruna et al. (2013), which implemented spectral graph convolutions in a neural network. Their GNN learns end-to-end the parameters of each filter, implemented as $h = Bc$, where $B \in \mathbb{R}^{M \times K}$ is a cubic B-spline basis and $c \in \mathbb{R}^{K}$ is a vector of control parameters. Those filters are not localized, since the full projection of the eigenvectors yields paths of infinite length and the filter accounts for interactions of each node with the whole graph, rather than those limited to the node neighborhood. Since this contrasts with the local design of classic convolutional filters, Henaff et al. (2015) introduced a parametrization of the spectral filters with smooth coefficients to achieve spatial localization. However, the main issue with the spectral filtering in (1) is the computational complexity: not only is the eigendecomposition of $L$ expensive, but a double product with $U$ must be computed whenever the filter is applied. Notably, $U$ in (1) is full even when $L$ is sparse. Finally, the same filter cannot be applied to graphs with different structures, since it depends on a specific Laplacian spectrum.

2.1. Chebyshev polynomial filters

The desired transfer function $h(\lambda)$ can be approximated by a polynomial of order $K$,

$$h_{\mathrm{POLY}}(\lambda) = \sum_{k=0}^{K} w_k \lambda^k, \tag{2}$$

which performs a weighted moving average (MA) of the graph signal (Tremblay et al., 2018). Polynomial filters are localized in space, since the output at each node in the filtered signal is a linear combination of the nodes in its $K$-hop neighbourhood. A localized filter overcomes an important limitation of spectral formulations relying on a fixed Laplacian spectrum, making it suitable also for inference tasks on graphs with different structures (Zhang et al., 2018).

Compared to conventional polynomials, Chebyshev polynomials attenuate unwanted oscillations around the cut-off frequencies (Shuman et al., 2011). Chebyshev polynomials are exploited to implement fast localized filters in a GNN, avoiding the eigendecomposition of the Laplacian by approximating the filter convolution with the Chebyshev expansion $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ (Defferrard et al., 2016). It follows that the convolutional layers perform the filtering operation

$$\bar{X} = \sigma\left( \sum_{k=0}^{K-1} T_k(\tilde{L})\, X W_k \right), \tag{3}$$

where $\tilde{L} = 2L/\lambda_{\max} - I_M$, $\sigma$ is a non-linear activation (e.g., ReLU), and $W_k \in \mathbb{R}^{F_{in} \times F_{out}}$ are the $K$ trainable weight matrices that map the node features from an input space of dimension $F_{in}$ to a new space of dimension $F_{out}$.
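To make the link between (1) and (3) concrete, the following NumPy sketch (ours, not taken from the paper) evaluates the argument of $\sigma$ in (3) with the Chebyshev recursion and checks it against the same filter applied through the eigenvectors of $L$ as in (1). The function names, the toy ring graph, the random weights, and the default $\lambda_{\max} = 2$ (an upper bound on the spectrum of the normalized Laplacian) are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval


def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes a symmetric A with no isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def cheb_filter(L, X, weights, lam_max=2.0):
    """Argument of sigma in Eq. (3), computed with the Chebyshev recursion only."""
    L_tilde = 2.0 * L / lam_max - np.eye(len(L))
    Z_prev, Z = X, L_tilde @ X                  # T_0(L~) X and T_1(L~) X
    out = Z_prev @ weights[0]
    if len(weights) > 1:
        out = out + Z @ weights[1]
    for W in weights[2:]:
        # T_k(L~) X = 2 L~ T_{k-1}(L~) X - T_{k-2}(L~) X
        Z_prev, Z = Z, 2.0 * (L_tilde @ Z) - Z_prev
        out = out + Z @ W
    return out


def spectral_filter(L, X, weights, lam_max=2.0):
    """Same quantity computed as in Eq. (1), through the eigenvectors of L."""
    lam, U = np.linalg.eigh(L)
    lam_tilde = 2.0 * lam / lam_max - 1.0
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for k, W in enumerate(weights):
        h_k = chebval(lam_tilde, [0.0] * k + [1.0])   # T_k at every eigenvalue
        out += (U * h_k) @ U.T @ X @ W                 # U diag[h_k] U^T X W_k
    return out


# Toy example (hypothetical data): ring graph with 6 nodes, F_in = 3, F_out = 2, K = 4.
rng = np.random.default_rng(0)
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
L = normalized_laplacian(A)
X = rng.normal(size=(6, 3))
weights = [rng.normal(size=(3, 2)) for _ in range(4)]

same = np.allclose(cheb_filter(L, X, weights), spectral_filter(L, X, weights))
X_bar = np.maximum(cheb_filter(L, X, weights), 0.0)    # sigma = ReLU
```

Only `spectral_filter` needs the eigendecomposition. `cheb_filter` uses $K - 1$ products with $\tilde{L}$, which remain sparse products when $L$ is stored as a sparse matrix.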
2.2. First-order polynomial filters

A first-order polynomial filter is adopted by Kipf & Welling (2016a) to solve the task of semi-supervised node classification. They propose a GNN called Graph Convolutional Network (GCN), where the convolutional layer is a simplified version of Chebyshev filters

$$\bar{X} = \sigma\left( \hat{A} X W \right). \tag{4}$$

Their formulation is obtained from (3) by keeping only the terms up to first order ($k \le 1$) and setting $W = W_0 = -W_1$. Additionally, $\tilde{L}$ is replaced by $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_M$. Compared with $\tilde{L}$, $\hat{A}$ contains self-loops that compensate for the removal of the term of order 0 in the polynomial filter, ensuring that a node is part of its first-order neighbourhood and that its features are preserved after the convolution. The convolution with higher-order neighbourhoods can be obtained by stacking multiple layers. However, since each layer (4) performs a Laplacian smoothing, after a few convolutions the node features become too smooth over the graph (Li et al., 2018).

3. The ARMA graph convolutional layer

The polynomial filters discussed in the previous section are sensitive to changes in the graph signal or in the underlying graph structure, and their smoothness prevents them from modeling filter responses with sharp changes. Moreover, they have poor interpolation and extrapolation capability around the known graph frequencies (Isufi et al., 2016). On the other hand, an ARMA filter approximates the optimal $h$ better, thanks to a rational design that can model a larger variety of filter shapes (Tremblay et al., 2018). The filter response of an ARMA($P$, $Q$) filter reads

$$h_{\mathrm{ARMA}}(\lambda) = \frac{\sum_{q=0}^{Q} b_q \lambda^q}{1 + \sum_{p=1}^{P} a_p \lambda^p}, \tag{5}$$

which in the node domain translates to the filtering relation

$$\bar{X} = \frac{\sum_{q=0}^{Q} b_q L^q}{1 + \sum_{p=1}^{P} a_p L^p}\, X. \tag{6}$$

Note that the Laplacian appearing in the denominator implies a matrix inversion and a multiplication between dense matrices, which is inefficient to implement in a GNN. Hence, we consider the distributed formulation proposed by Loukas et al. (2015), which approximates the effect of an ARMA(1,0) filter with a first-order recursion

$$\bar{X}^{(t+1)} = a M \bar{X}^{(t)} + b X. \tag{7}$$

The eigenvalues of $M = \frac{\lambda_{\max} - \lambda_{\min}}{2} I - L$ are related to those of the Laplacian $L$ as follows: $\mu_m = (\lambda_{\max} - \lambda_{\min})/2 - \lambda_m$. The frequency response of the approximated ARMA(1,0) filter is

$$h_{\mathrm{ARMA}}(\mu) = \frac{r}{\mu - p} \quad \text{with } r = -\frac{b}{a} \text{ and } p = \frac{1}{a}. \tag{8}$$

The effect of an ARMA($K$, $K$) filter can be obtained by summing the outputs of $K$ ARMA(1,0) filters

$$\bar{X} = \sum_{k=1}^{K} \bar{X}_k = \sum_{k=1}^{K} \sum_{m=1}^{M} \frac{r_k}{\mu_m - p_k}\, u_m u_m^T X. \tag{9}$$

3.1. Recursive and distributed implementation of the ARMA layer

Here we propose a recursive implementation of the ARMA($K$, $K$) filter based on neural networks; see Fig. 1. Equation (7) must be applied many times before converging to a steady state. Instead, to obtain a more efficient implementation, we apply the recursive update only a few times and compensate by adding a non-linearity and trainable parameters.

We implement the recursive update in (7) with a Graph Convolutional Skip (GCS) layer, defined as

$$\bar{X}^{(t+1)} = \sigma\left( \tilde{L} \bar{X}^{(t)} W^{(t)} + X V^{(t)} \right), \tag{10}$$

where $W^{(t)} \in \mathbb{R}^{F_{out}^{t} \times F_{out}^{t+1}}$ and $V^{(t)} \in \mathbb{R}^{F_{in} \times F_{out}^{t+1}}$ are trainable parameters; we set $\bar{X}^{(0)} = X$. The modified Laplacian matrix $\tilde{L} = I - L$ is derived by setting $\lambda_{\min} = 0$ and $\lambda_{\max} = 2$ in $M$. This is a reasonable simplification, since the spectrum of $L$ lies in $[0, 2]$ and the trainable parameters in $W^{(t)}$ can adjust the small offset introduced. Each GCS layer extracts local substructure information by aggregating node information in local neighbourhoods and, through the skip connection, by combining it with the original node features. If $L$ and/or $X$ are represented by sparse tensors, the GCS layer can exploit efficient sparse operations.

We build $K$ parallel stacks, each one with $T$ GCS layers, and define the output of the ARMA convolutional layer as

$$\bar{X} = \mathrm{avgpool}\left( \sum_{k=1}^{K} \bar{X}_k^{(T)} \right), \tag{11}$$

where $\bar{X}_k^{(T)}$ is the last output of the $k$-th stack. We apply dropout to the skip connection of each GCS layer not only for regularization, but also to encourage diversity in the filters learned in each one of the $K$ parallel stacks. To provide further regularization and reduce the number of parameters in the ARMA layer, the GCS layers in each stack may share the same parameters, except for $W_k^{(1)} \in \mathbb{R}^{F_{in} \times F_{out}}$, which performs a different mapping in the first layer of the stack. Namely, $W_k^{(i)} = W_k^{(i+1)} = W_k \in \mathbb{R}^{F_{out} \times F_{out}}$, $\forall i >$
[Figure 1: block diagram. The input $X$ feeds $K$ parallel stacks of $T$ Graph Conv Skip layers (Graph Conv Skip $k$-1, ..., Graph Conv Skip $k$-$T$, for $k = 1, \dots, K$), each applying $\tilde{L}$, $W_k^{(t)}$, $V_k^{(t)}$ and $\sigma$; an Avg Pool block combines the stack outputs into $\bar{X}$.]
Figure 1. The ARMA convolutional layer. Same colour indicates shared weights.
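The path from the linear recursion (7) to the layer in Fig. 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy ring graph, and the random parameters are hypothetical, dropout on the skip connection is omitted, the weight sharing of Section 3.1 is assumed, and avgpool in (11) is read as the element-wise mean over the $K$ stack outputs. The first part checks that (7) converges to the rational response (8) when $|a|$ times the spectral radius of $M$ is below 1.

```python
import numpy as np


def relu(Z):
    return np.maximum(Z, 0.0)


def arma_recursion(M, X, a, b, n_iter):
    """Eq. (7): X_bar(t+1) = a M X_bar(t) + b X, started here from X_bar(0) = X."""
    X_bar = X
    for _ in range(n_iter):
        X_bar = a * (M @ X_bar) + b * X
    return X_bar


def gcs_stack(L_tilde, X, W_first, W, V, T, sigma=relu):
    """T stacked GCS layers, Eq. (10), with weights shared after the first layer."""
    X_bar = sigma(L_tilde @ X @ W_first + X @ V)       # layer 1: F_in -> F_out
    for _ in range(T - 1):
        X_bar = sigma(L_tilde @ X_bar @ W + X @ V)     # skip connection re-injects X
    return X_bar


def arma_layer(L_tilde, X, params, T, sigma=relu):
    """Eq. (11), reading avgpool as the mean over the K stack outputs (assumption)."""
    outs = [gcs_stack(L_tilde, X, W_first, W, V, T, sigma) for W_first, W, V in params]
    return np.mean(outs, axis=0)


# Toy example (hypothetical data): ring graph with 6 nodes.
rng = np.random.default_rng(0)
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(6) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
L_tilde = np.eye(6) - L                  # M with lambda_min = 0 and lambda_max = 2
X = rng.normal(size=(6, 3))

# Linear recursion (7) versus its steady state and the response (8).
a, b = 0.5, 1.0
steady = arma_recursion(L_tilde, X, a, b, n_iter=200)
closed = b * np.linalg.solve(np.eye(6) - a * L_tilde, X)   # fixed point of (7)
mu, U = np.linalg.eigh(L_tilde)
r, p = -b / a, 1.0 / a
spectral = (U * (r / (mu - p))) @ U.T @ X                   # Eq. (8) on each mu

# ARMA layer with K = 2 stacks of T = 3 GCS layers, F_in = 3, F_out = 4.
F_in, F_out, K, T = 3, 4, 2, 3
params = [
    (rng.normal(size=(F_in, F_out)),     # W_k^(1)
     rng.normal(size=(F_out, F_out)),    # W_k, shared by layers 2..T
     rng.normal(size=(F_in, F_out)))     # V_k, shared by all layers
    for _ in range(K)
]
Y = arma_layer(L_tilde, X, params, T=T)  # shape (6, F_out)
```

Because the $K$ stacks do not exchange information before the final average, the list comprehension in `arma_layer` is the part that could be distributed across devices.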
1 and V_k^{(i)} = V_k^{(i+1)} = V_k ∈ R^{F_in × F_out}, ∀i. Since each stack of GCS layers is executed independently from the others, it is possible to implement the ARMA layer in a distributed fashion using multiple GPUs.

We also notice that the ARMA layer can deal naturally with time-varying graph signals (Holme, 2015; Grattarola et al., 2018) by replacing the constant term X in (10) with a time-dependent input X^{(t)}.

3.2. Relationship to other approaches

The GCS layer has a similar formulation to the graph convolutional layer in (4). However, thanks to the skip connection, it is possible to stack multiple layers without incurring the risk of over-smoothing the node features of the graph (Li et al., 2018). The formulation of the ARMA layer with shared weights shares analogies with a recurrent neural network with residual connections (Wu et al., 2016). Finally, similarly to GNNs operating in the node domain (Scarselli et al., 2009; Gallicchio & Micheli, 2010), each GCS layer computes the filtered signal x̄_i^{(t+1)} at vertex i as a combination of signals x_j^{(t)} in its 1-hop neighborhood, j ∈ N(i). Such a commutative aggregation solves the problem of undefined vertex ordering and varying neighborhood sizes.

4. Node Pooling

Node pooling associates a single label to the node features and is particularly important in tasks such as graph (signal) classification. However, contrary to other neural networks, GNNs also need to coarsen the original graph to perform further convolutions on graph signals as the node dimensionality is reduced through the network layers.

A recent approach (Ying et al., 2018) proposes to learn differentiable soft assignments to cluster the nodes at each layer. The original adjacency matrix acts as a prior when learning the soft assignment and sparsity is enforced with an entropy-based regularization. However, the application of this method to medium and large graphs is not feasible, as it introduces a number of additional trainable parameters quadratic in the number of nodes. The other approach followed in most GNNs consists of pre-computing coarsened versions of the graph using hierarchical clustering (Bruna et al., 2013; Defferrard et al., 2016; Monti et al., 2017; Fey et al., 2018). At each level l, two vertices x_i^{(l)} and x_j^{(l)} are clustered together in a new vertex x_z^{(l+1)}. Then, a standard pooling operation is applied to halve the size of the graph signal. To make the pooling output consistent with the cluster assignment, the graph signal is rearranged so that elements i and j end up in consecutive positions. This approach has several drawbacks. First, the connectivity of the original graph is not preserved in the coarsened graphs and the spectrum of their associated Laplacians is usually not contained in the spectrum of the original Laplacian. Second, the procedure to rearrange vertices is cumbersome to implement; moreover, it requires adding fake vertices so that the number of nodes can be halved each time, hence injecting noisy information in the graph signal. Finally, clustering results depend on the initial ordering of the nodes, which hampers stability and reproducibility.

In this paper, we use a pooling procedure that builds on the multi-resolution framework adopted in graph signal processing (Shuman et al., 2016), which addresses the drawbacks of the aforementioned methods. A similar, yet preliminary approach was recently discussed by Simonovsky & Komodakis (2017). Here, we provide a more detailed formulation, framed within the GNN framework, of the pooling procedure based on node decimation, and of the graph reduction to generate a new coarsened graph, necessary to apply graph convolutions in the next GNN layer. In the experiments, we provide a systematic comparison with respect to pooling methods based on graph clustering.

4.1. Node decimation pooling and graph reduction

Pooling with node decimation. A simple way to decimate nodes V of an arbitrary graph consists of partitioning
them into two sets based on the Fiedler vector umax of the Laplacian, and then dropping one of the two sets of nodes (Shuman et al., 2016). In particular, the pooling operation keeps only the nodes in V+, defined as

    V^{+} = \{ n \in V : u_{\max}(n) \geq 0 \}. (12)

We note that it is equivalent to keep each time the nodes in V−, i.e., those associated with a negative value in umax. Despite its simplicity, this procedure offers important advantages: i) approximately half of the nodes are removed each time, i.e., |V+| ≈ |V−|; ii) the nodes in V+ and V−

implies that deeper layers will require more computation. A solution is to apply after each reduction spectral sparsification (Batson et al., 2013) on Lnew. However, we experienced numerical instability and poor convergence when applying the sparsification algorithm. Therefore, we opted for dropping connections with weights below a small threshold (1E-4), which keeps the desired level of sparsity in Lnew without altering too much its spectrum.
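The decimation rule in (12) and the weight thresholding can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: umax is taken to be the eigenvector associated with the largest Laplacian eigenvalue (as the subscript suggests), it is approximated here by power iteration, and the function names `largest_eigenvector`, `decimate` and `threshold` are hypothetical. Whether the diagonal is recomputed after thresholding is not stated in the text, so the sketch leaves it unchanged.

```python
import math

def largest_eigenvector(L, iters=500):
    """Approximate the eigenvector of the largest eigenvalue of a symmetric
    positive semi-definite matrix L (list of lists) by power iteration."""
    n = len(L)
    # Deterministic, non-constant start (the constant vector lies in the null space of L).
    u = [math.sin(i + 1.0) for i in range(n)]
    for _ in range(iters):
        v = [sum(L[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u

def decimate(L):
    """Indices kept by rule (12): nodes n with u_max(n) >= 0."""
    u = largest_eigenvector(L)
    return [n for n, value in enumerate(u) if value >= 0.0]

def threshold(L, eps=1e-4):
    """Zero out off-diagonal entries whose magnitude is below eps (diagonal untouched)."""
    n = len(L)
    return [[L[i][j] if i == j or abs(L[i][j]) >= eps else 0.0 for j in range(n)]
            for i in range(n)]

# Laplacian of the path graph 0 - 1 - 2 - 3.
L_path = [[1, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 1]]
kept = decimate(L_path)
print(kept)  # every other node of the path; which half depends on the eigenvector's sign
```

Because an eigenvector is only defined up to sign, the kept set may be either half of the partition, which matches the remark that keeping V+ or V− is equivalent.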
and small size, in all experiments we use Kron reduction (15). However, we advise the reduction in (14) when dealing with very large graphs to avoid memory issues.

to weight each link when applying the graph convolution.
Table 2. Mean node classification accuracy
Figure 3. How to perform graph classification with mini batches in the proposed framework.
[Table rows; column headers not recovered]
         Cheby         66.50  69.19  66.81  80.32
         ARMA (ours)   67.83  71.92  71.22  85.67
decim    GCN           67.33  72.15  70.63  86.20
decim    Cheby         66.50  70.79  68.09  90.39
decim    ARMA (ours)   69.66  75.12  74.86  93.25

GCN performs better than Cheby only on the Protein dataset, while the proposed ARMA layer always achieves the best performance showing, once again, a superior modeling capability compared to those layers based on polynomial filters. The adopted GNN architecture is particularly effective for the Enzymes dataset, as it surpasses the state-of-the-art with every convolutional layer and pooling method. The GNN configured with ARMA layers and decimation pooling attains top performance also in MUTAG, and competitive results in Protein. Finally,

sification tasks on graph data taken into account. To build a deep GNN, we used a pooling operation based on node decimation, which achieves superior performance on real-world graphs with irregular topology and faster training time compared to node pooling based on graph clustering.

The current formulation of the ARMA layer only considers node information, but can be extended to incorporate edge features to weight the contribution of each neighbor node using, for example, edge-conditioned convolutions (Simonovsky & Komodakis, 2017). Moreover, the results presented in (Velickovic et al., 2017) showed a notable increase in performance when applying multi-head soft attention to the Laplacian in a GNN. Given that the ARMA layer is already structured in a parallel fashion, a similar extension with the attention mechanism could provide comparable benefits, and further improve the performance.
Niepert, Mathias, Ahmed, Mohamed, and Kutzkov, Konstantin. Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.

Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.

Scarselli, Franco, Gori, Marco, Tsoi, Ah Chung, Hagenbuchner, Markus, and Monfardini, Gabriele. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Shervashidze, Nino, Schweitzer, Pascal, Leeuwen, Erik Jan van, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

Shuman, David I, Vandergheynst, Pierre, and Frossard, Pascal. Chebyshev polynomial approximation for distributed signal processing. In Distributed Computing in Sensor Systems and Workshops (DCOSS), 2011 International Conference on, pp. 1–8. IEEE, 2011.

Shuman, David I, Faraji, Mohammad Javad, and Vandergheynst, Pierre. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.

Simonovsky, Martin and Komodakis, Nikos. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Susnjara, Ana, Perraudin, Nathanael, Kressner, Daniel, and Vandergheynst, Pierre. Accelerated filtering on graphs using lanczos method. arXiv preprint arXiv:1509.04537, 2015.

Tremblay, Nicolas, Goncalves, Paulo, and Borgnat, Pierre. Design of graph filters and filterbanks. In Cooperative and Graph Signal Processing, pp. 299–324. Elsevier, 2018.

Velickovic, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Lio, Pietro, and Bengio, Yoshua. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V, Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Yang, Zhilin, Cohen, William W, and Salakhutdinov, Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pp. 40–48. JMLR.org, 2016.

Ying, Rex, You, Jiaxuan, Morris, Christopher, Ren, Xiang, Hamilton, William L, and Leskovec, Jure. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.

Zhang, Muhan, Cui, Zhicheng, Neumann, Marion, and Chen, Yixin. An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence, 2018.

Zhou, Denny, Bousquet, Olivier, Lal, Thomas N, Weston, Jason, and Schölkopf, Bernhard. Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328, 2004.
Published as a conference paper at ICLR 2019
ABSTRACT
1 INTRODUCTION
We present a unifying and novel geometry representation for utilizing Convolutional Neural Networks
(CNNs) on geometries represented on weighted simplex meshes (including textured point clouds, line
meshes, polygonal meshes, and tetrahedral meshes) which preserve maximal shape information based
on the Fourier transformation. Most methods that leverage CNNs for shape learning preprocess these
shapes into uniform-grid based 2D images (rendered multiview images) or 3D images (binary voxel
or Signed Distance Function (SDF)). However, rendered 2D images do not preserve the 3D topologies
of the original shapes due to occlusions and the loss of the third spatial dimension. Binary voxels
and SDF representations under low resolution suffer big aliasing errors and under high resolution
become memory inefficient. Loss of information in the input bottlenecks the effectiveness of the
downstream learning process. Moreover, it is not clear how a weighted mesh where each element is
weighted by a different scalar or vector (i.e., texture) can be represented by binary voxels and SDF.
Mesh and graph based CNNs perform learning on the manifold physical space or graph spectrum, but
generality across topologies remains challenging.
In contrast to methods that operate on uniform sampling based representations such as voxel-based and view-based models, which suffer significant representational errors, we use analytical integration to precisely sample in the spectral domain to avoid sample aliasing errors. Unlike graph spectrum based methods, our method naturally generalizes across input data structures of varied topologies. Using our representation, CNNs can be directly applied in the corresponding physical domain obtainable by inverse Fast Fourier Transform (FFT) due to the equivalence of the spectral and physical domains. This allows for the use of powerful uniform Cartesian grid based CNN backbone architectures (such as DLA (Yu et al., 2018), ResNet (He et al., 2016)) for the learning task on arbitrary geometrical signals. Although the signal is defined on a simplex mesh, it is treated as a signal in the Euclidean space instead of on a graph, differentiating our framework from graph-based spectral learning techniques which have significant difficulties generalizing across topologies and are unable to utilize state-of-the-art Cartesian CNNs.

Figure 1: Top: Schematic of the NUFT transformations of the Stanford Bunny model. Bottom: Schematic for shape retrieval and surface reconstruction experiments.
We evaluate the effectiveness of our shape representation for deep learning tasks with three experi-
ments: a controlled MNIST toy example, the 3D shape retrieval task, and a more challenging 3D
point cloud to surface reconstruction task. In a series of evaluations on different tasks, we show the
unique advantages of this representation, and good potential for its application in a wider range of
shape learning problems. We achieve state-of-the-art performance among non-pre-trained models for
the shape retrieval task, and beat state-of-the-art models for the surface reconstruction task.
The key contributions of our work are as follows:
• We analytically show that our approach computes the frequency domain representation
precisely, leading to much lower overall representational errors. (Sec. 3)
• We empirically show that our representation preserves maximal shape information compared
to commonly used binary voxel and SDF representations. (Sec. 4.1)
• We show that deep learning models using CNNs in conjunction with our shape representation
achieve state-of-the-art performance across a range of shape-learning tasks including shape
retrieval (Sec. 4.2) and point to surface reconstruction (Sec. 4.3).
Figure 2: Surface localization in (a) binary pixel/voxel representation, where the boundary can only be in one of 2^4 (2D) or 2^8 (3D) discrete locations; (b) Signed Distance Function representation, where the boundary is linear; (c) proposed representation, with nonlinear localization of the boundary, achieving subgrid accuracy.

Table 1: Notation list

Notation   Description
d          Dimension of Euclidean space R^d
j          Degree of simplex. Point j = 0, Line j = 1, Tri. j = 2, Tet. j = 3
n, N       Index of the n-th element among a total of N elements
Ω_n^j      Domain of n-th element of order j
x          Cartesian space coordinate vector. x = (x, y, z)
k          Spectral domain coordinate vector. k = (u, v, w)
i          Imaginary number unit
2 RELATED WORK
Shape learning involves the learning of a mapping from input geometrical signals to desired output
quantities. The representation of geometrical signals is key to the learning process, since on the one
hand the representation determines the learning architectures, and, on the other hand, the richness of
information preserved by the representation acts as a bottleneck to the downstream learning process.
While data representation has not been an open issue for 2D image learning, it is far from being
agreed upon in the existing literature for 3D shape learning. The varied shape representations used in
3D machine learning are generally classified as multiview images (Su et al., 2015a; Shi et al., 2015;
Kar et al., 2017), volumetric voxels (Wu et al., 2015; Maturana & Scherer, 2015; Wu et al., 2016;
Brock et al., 2016), point clouds (Qi et al., 2017a;b; Wang et al., 2018b), polygonal meshes (Kato
et al., 2018; Wang et al., 2018a; Monti et al., 2017; Maron et al., 2017), shape primitives (Zou et al.,
2017; Li et al., 2017), and hybrid representations (Dai & Nießner, 2018).
Our proposed representation is closest to volumetric voxel representation, since the inverse Fourier
Transform of the spectral signal in physical domain is a uniform grid implicit representation of
the shape. However, binary voxel representation suffers from significant aliasing errors during the
uniform sampling step in the Cartesian space (Pantaleoni, 2011). Using boolean values for de facto
floating point numbers during CNN training is a waste of information processing power. Also, the
primitive-in-cell test for binarization requires arbitrary grouping in cases such as having multiple
points or planes in the same cell (Thrun, 2003). Signed Distance Function (SDF) or Truncated
Signed Distance Function (TSDF) (Liu et al., 2017; Canelhas, 2017) provides localization for the
shape boundary, but is still constrained to linear surface localization due to the linear interpolation
process for recovering surfaces from grids. Our proposed representation under Fourier basis can find
nonlinear surface boundaries, achieving subgrid-scale accuracy (See Figure 2).
Cartesian CNNs are the most ubiquitous and mature type of learning architecture in Computer Vision.
They have been thoroughly studied in a range of problems, including image recognition (Krizhevsky
et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), object detection (Girshick, 2015; Ren
et al., 2015), and image segmentation (Long et al., 2015; He et al., 2017). In the spirit of 2D
image-based deep learning, Cartesian CNNs have been widely used in shape learning models that
adopted multiview shape representation (Su et al., 2015a; Shi et al., 2015; Kar et al., 2017; Su et al.,
2015b; Pavlakos et al., 2017; Tulsiani et al., 2018). Also, due to their straightforward and analogous
extension to 3D by swapping 2D convolutional kernels with 3D counterparts, Cartesian CNNs have
also been widely adopted in shape learning models using volumetric representations (Wu et al., 2015;
Maturana & Scherer, 2015; Wu et al., 2016; Brock et al., 2016). However, the dense nature of the
operations makes them inefficient for sparse 3D shape signals. To this end, improvements to Cartesian
CNNs have been made using space partitioning tree structures, such as Quadtree in 2D and Octree
in 3D (Wang et al., 2017; Häne et al., 2017; Tatarchenko et al., 2017). These Cartesian CNNs can
leverage backbone CNN architectures being developed in related computer vision problems and
thus achieve good performance. Since the physical domain representation in this study is based on
Cartesian uniform grids, we directly use Cartesian CNNs.
Graph CNNs utilize input graph structure for performing graph convolutions. They have been
developed to engage with general graph structured data (Bruna et al., 2013; Henaff et al., 2015;
Defferrard et al., 2016). Yi et al. (2017) used spectral CNNs with the eigenfunctions of the graph
Laplacian as a basis. However, generalizing this approach across topologies and geometries is still
challenging, since it assumes a consistent eigenfunction basis.
Specially Designed Neural Networks have been used to perform learning on unconventional data
structures. For example, Qi et al. (2017a) designed a Neural Network architecture for points that
achieves invariances using global pooling, with follow-up work (Qi et al., 2017b) using CNN-inspired
hierarchical structures for more efficient learning. Masci et al. (2015) performed convolution
directly on the shape manifold and Cohen et al. (2017) designed CNNs for the spherical domain and
used it for 3D shapes by projecting the shapes onto the bounding sphere.
The original work on analytical expressions for Fourier transforms of 2D polygonal shape functions
is given by Lee & Mittra (1983). Improved and simpler calculation methods have been suggested
in Chu & Huang (1989). A 3D formulation is proposed by Zhang & Chen (2001). Theoretical
analyses have been performed for the Fourier analysis of simplex domains (Sun, 2006; Li & Xu,
2009) and Sammis & Strain (2009) designed approximation methods for Fast Fourier Transform of
polynomial functions defined on simplices. Prisacariu & Reid (2011) describe shape with elliptic
Fourier descriptors for level set-based segmentation and tracking. There has also been a substantial
literature on fast non-uniform Fourier transform methods for discretely sampled signal (Greengard &
Lee, 2004). However, we are the first to provide a simple general expression for a j-simplex mesh,
an algorithm to perform the transformation, and illustrations of their applicability to deep learning
problems.
Almost all discrete geometric signals can be abstracted into weighted simplicial complexes. A simplicial complex is a set composed of points, line segments, triangles, and their d-dimensional counterparts. We call a simplicial complex consisting solely of j-simplices a homogeneous simplicial j-complex, or a j-simplex mesh. Most popular geometric representations that the research community is familiar with are simplex meshes. For example, the point cloud is a 0-simplex mesh, the triangular mesh is a 2-simplex mesh, and the tetrahedral mesh is a 3-simplex mesh. A j-simplex mesh consists of a set of individual elements, each being a j-simplex. If the signal is non-uniformly distributed over the simplex, we can define a piecewise constant j-simplex function over the j-simplex mesh. We call this a weighted simplex mesh. Each element has a distinct signal density.
J-simplex function: For the n-th j-simplex with domain Ω_n^j, we define a density function f_n^j(x). For example, for some Computer Vision and Graphics applications, a three-component density value can be defined on each element of a triangular mesh for its RGB color content. For scientific applications, signal density can be viewed as mass or charge or other physical quantity.

    f_n^j(\mathbf{x}) = \begin{cases} \rho_n, & \mathbf{x} \in \Omega_n^j \\ 0, & \mathbf{x} \notin \Omega_n^j \end{cases}, \qquad f^j(\mathbf{x}) = \sum_{n=1}^{N} f_n^j(\mathbf{x}) \quad (1)
We present a general formula for performing the Fourier transform of signal over a single j-simplex.
We provide detailed derivation and proof for j = 0, 1, 2, 3 in the supplemental material.
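
    F_n^j(\mathbf{k}) = i^j \gamma_n^j \sum_{t=1}^{j+1} \frac{e^{-i\sigma_t}}{\prod_{l=1, l \neq t}^{j+1} (\sigma_t - \sigma_l)}, \qquad \sigma_t := \mathbf{k} \cdot \mathbf{x}_t \quad (3)
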
We define γ_n^j to be the content distortion factor, which is the ratio of content between the simplex over the domain Ω_n^j and the unit orthogonal j-simplex. Content is the j-dimensional analogue of the 3-dimensional volume. The unit orthogonal j-simplex is defined as a j-simplex with one vertex at the Cartesian origin and all edges adjacent to the origin vertex pairwise orthogonal and of unit length. Therefore, from Equation (2), the final general expression for computing the Fourier transform of a signal defined on a weighted simplex mesh is:

    F^j(\mathbf{k}) = \sum_{n=1}^{N} \rho_n\, i^j \gamma_n^j \sum_{t=1}^{j+1} \frac{e^{-i\sigma_t}}{\prod_{l=1, l \neq t}^{j+1} (\sigma_t - \sigma_l)} \quad (4)

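As a minimal sketch of how Equations (3) and (4) can be evaluated, the following plain-Python functions compute the transform of a single j-simplex and sum it over a mesh. The names `simplex_ft` and `mesh_ft` are hypothetical, the content distortion factor γ is passed in by the caller rather than computed, and the code assumes the σ_t values of each simplex are pairwise distinct (it does not handle the removable division-by-zero cases).

```python
import cmath
import math

def simplex_ft(vertices, k, gamma, rho=1.0):
    """Equation (3) scaled by the density rho, for one j-simplex.

    vertices: list of j+1 points (tuples); k: frequency vector;
    gamma: content distortion factor of the simplex.
    Assumes the values sigma_t = k . x_t are pairwise distinct.
    """
    sigma = [sum(ki * xi for ki, xi in zip(k, x)) for x in vertices]
    j = len(vertices) - 1
    total = 0j
    for t, s_t in enumerate(sigma):
        denom = 1.0
        for l, s_l in enumerate(sigma):
            if l != t:
                denom *= s_t - s_l
        total += cmath.exp(-1j * s_t) / denom
    return rho * (1j ** j) * gamma * total

def mesh_ft(simplices, k):
    """Equation (4): sum over (vertices, gamma, rho) triples of a j-simplex mesh."""
    return sum(simplex_ft(v, k, g, r) for v, g, r in simplices)

# A line segment (1-simplex) of length 5; gamma = 5 because the unit 1-simplex has length 1.
k = (0.2, 0.1)
segment = [(0.0, 0.0), (3.0, 4.0)]
value = simplex_ft(segment, k, gamma=5.0)
# Closed form for a uniform segment: length * exp(-i * sigma_mid) * sin(d / 2) / (d / 2),
# with d the difference of the two sigma values (here d = 1, sigma_mid = 0.5).
closed_form = 5.0 * cmath.exp(-0.5j) * math.sin(0.5) / 0.5
print(abs(value - closed_form))
```

For a 0-simplex (a point) the product in the denominator is empty, so the expression reduces to ρ e^{−iσ}, the familiar transform of a weighted Dirac delta.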
For computing the simplex content, we use the Cayley-Menger Determinant for a general expression:
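
    C_n^j = \sqrt{\frac{(-1)^{j+1}}{2^j (j!)^2} \det(\hat{B}_n^j)}, \quad \text{where} \quad \hat{B}_n^j = \begin{bmatrix} 0 & 1 & 1 & 1 & \cdots \\ 1 & 0 & d_{12}^2 & d_{13}^2 & \cdots \\ 1 & d_{21}^2 & 0 & d_{23}^2 & \cdots \\ 1 & d_{31}^2 & d_{32}^2 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \quad (5)
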
For the matrix \hat{B}_n^j, each entry d_{mn}^2 represents the squared distance between nodes m and n. The matrix is of size (j + 2) × (j + 2) and is symmetrical. Since the unit orthogonal simplex has content of C_I^j = 1/j!, the content distortion factor γ_n^j can be calculated by:

    \gamma_n^j = \frac{C_n^j}{C_I^j} = j!\, C_n^j = j! \sqrt{\frac{(-1)^{j+1}}{2^j (j!)^2} \det(\hat{B}_n^j)} \quad (6)

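Equations (5) and (6) translate directly into code. The sketch below builds the Cayley-Menger matrix from the vertex coordinates and evaluates its determinant with a small Gaussian elimination routine, so that it runs with the standard library alone; the helper names `det`, `simplex_content` and `distortion_factor` are illustrative and not taken from the paper.

```python
import math

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for cc in range(c, n):
                a[r][cc] -= f * a[c][cc]
    return d

def simplex_content(vertices):
    """Unsigned content of a j-simplex via the Cayley-Menger determinant, Equation (5)."""
    j = len(vertices) - 1
    size = j + 2
    B = [[0.0] * size for _ in range(size)]
    for m in range(1, size):
        B[0][m] = B[m][0] = 1.0
        for n in range(1, size):
            if m != n:
                # Squared distance between nodes m and n.
                B[m][n] = sum((p - q) ** 2 for p, q in zip(vertices[m - 1], vertices[n - 1]))
    value = (-1) ** (j + 1) * det(B) / (2 ** j * math.factorial(j) ** 2)
    return math.sqrt(max(value, 0.0))  # clamp tiny negative round-off

def distortion_factor(vertices):
    """Content distortion factor gamma = j! * C, Equation (6)."""
    j = len(vertices) - 1
    return math.factorial(j) * simplex_content(vertices)

unit_triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(simplex_content(unit_triangle), distortion_factor(unit_triangle))
```

By construction the unit orthogonal simplex has a distortion factor of 1, which gives a quick sanity check for any implementation.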
Auxiliary Node Method: Equation (3) provides a means of computing the Fourier transform of a
simplex with uniform signal density. However, how do we compute the Fourier transform of polytopes
(i.e., polygons in 2D, polyhedra in 3D) with uniform signal density efficiently? Here, we introduce the
auxiliary node method (AuxNode) that utilizes signed content for efficient computing. We show that
for a solid j-polytope represented by a watertight (j − 1)-simplex mesh, we can compute the Fourier
transform of the entire polytope by traversing each of the elements in its boundary (j − 1)-simplex
mesh exactly once (Zhang & Chen, 2001).
The auxiliary node method performs the Fourier transform over the signed content bounded by an auxiliary node (a convenient choice being the origin of the Cartesian coordinate system) and each (j − 1)-simplex on the boundary mesh. This forms an auxiliary j-simplex: Ω_{n'}^j, n' ∈ [0, N'], where N' is the number of (j − 1)-simplices in the boundary mesh. However, due to the overlapping of these auxiliary j-simplices, we need a means of computing the sign of the transform for the overlapping regions to cancel out. Equation (3) provides a general expression for computing the unsigned transform for a single j-simplex. It is trivial to show that since the ordering of the nodes does not affect the determinant in Equation (5), it gives the unsigned content value.
Therefore, to compute the Fourier transform of uniform signals in j-polytopes represented by its
watertight (j − 1)-simplex mesh using the auxiliary node method, we modify Equation (3):
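
    F_n^j(\mathbf{k}) = i^j \sum_{n'=1}^{N'_n} s_{n'} \gamma_{n'}^j \left( \frac{(-1)^j}{\prod_{l=1}^{j} \sigma_l} + \sum_{t=1}^{j} \frac{e^{-i\sigma_t}}{\sigma_t \prod_{l=1, l \neq t}^{j} (\sigma_t - \sigma_l)} \right) \quad (7)
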
[Figure 3 panels: (a) Experiment setup; (b) MNIST, accuracy versus resolution (4 to 28), with curves including Raw and NUFT.]
Figure 3: MNIST experiment. (a) schematic for experiment setup. The original MNIST pixel image
is up-sampled using interpolation and contoured to get a polygonal representation of the digit. For
the polygon, it is transformed into binary pixels, distance functions, and the NUFT physical domain.
(b) classification accuracy versus input resolution under various representation schemes. NUFT
representation is more optimal, irrespective of resolution.
s_{n'} γ_{n'}^j is the signed content distortion factor for the n'-th auxiliary j-simplex, where s_{n'} ∈ {−1, 1}. For practical purposes, assume that the auxiliary j-simplex is in R^d where d = j. We can compute the signed content distortion factor using the determinant of the Jacobian matrix for parameterizing the auxiliary simplex to a unit orthogonal simplex:

    s_{n'} \gamma_{n'}^j = j! \det(J) = j! \det([\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_j]) \quad (8)

Since this method requires the boundary simplices to be oriented, the right-hand rule can be used to
infer the correct orientation of the boundary element. For 2D polygons, it requires that the watertight
boundary line mesh be oriented in a counter-clockwise fashion. For 3D polytopes, it requires that the
face normals of boundary triangles resulting from the right-hand rule be consistently outward-facing.
Algorithmic implementation: Several efficiencies can be exploited to achieve fast runtime and high robustness. First, since the general expression in Equation (4) involves division, it is vulnerable to division-by-zero errors (these are not true singularities, since they can be eliminated by taking the limit); for robustness, a minor random noise can be added to the vertex coordinates as well as to the k = 0 frequency mode. Second, to avoid repeated computation, values should be cached in memory and reused; however, caching all σ and e^{−iσ} values for all nodes and frequencies is infeasible for large meshes and/or high-resolution output, so the Breadth-First-Search (BFS) algorithm should be used to traverse the vertices for efficient memory management.
4 EXPERIMENTS
In this section, we will discuss the experiment setup, and we defer the details of our model architecture
and training process to the supplementary material since it is not the focus of this paper.
We use the MNIST experiment as a first example to show that shape information in the input significantly affects the efficacy of the downstream learning process. Since the scope of this research is on efficiently learning from nonuniform mesh-based representations, we compare our method with the state of the art in a slightly different scenario by treating MNIST characters as polygons. We choose this experiment as a first toy example since it is easy to control the learning architecture and input resolution to highlight the effects of shape representation on deep learning performance.

[Figure 4 plot: F1, mAP and NDCG accuracy (axis 0.0 to 0.8) for NUFT Surface and NUFT Volume at resolutions 128, 96 and 64.]
Figure 4: Comparison between NUFT Volume and NUFT Surface performance at different resolutions.

Table 2: Comparison of shape retrieval performance with state-of-the-art models. The best result among each representation category is highlighted in bold.

Rep          Method          F1     mAP    NDCG
No Pre-training
Volumetric   Ours (NUFT-V)   0.770  0.745  0.809
Volumetric   DeepVoxNet      0.253  0.192  0.277
With Pre-training
Multi-View   RotationNet     0.798  0.772  0.865
Multi-View   ImprovGIF       0.767  0.722  0.827
Multi-View   ReVGG           0.772  0.749  0.828
Multi-View   MVCNN           0.764  0.735  0.815

Experiment setup: We pre-process the original MNIST raw pixel images into polygons, which are
represented by watertight line meshes on their boundaries. The polygonized digits are converted into
(n × n) binary pixel images and into distance functions by uniformly sampling the polygon at the
(n × n) sample locations. For NUFT, we first compute the lowest (n × n) Fourier modes for the
polygonal shape function and then use an inverse Fourier transform to acquire the physical domain
image. We also compare the results with the raw pixel image downsampled to different resolutions,
which serves as an oracle for information ceiling. Then we perform the standard MNIST classification
experiment on these representations with varying resolution n and with the same network architecture.
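The step of computing the lowest Fourier modes and inverting them to a physical-domain image can be illustrated with a 1D analogue. This is a simplified sketch, not the paper's pipeline: it uses the indicator of an interval, whose spectrum has a closed form, instead of the NUFT of a polygon; it assumes the shape lives in the unit domain [0, 1] with integer frequencies 2πm; and it uses a naive inverse DFT in place of an FFT. The names `interval_modes` and `to_physical` are hypothetical.

```python
import cmath
import math

def interval_modes(a, b, n):
    """Fourier coefficients of the indicator of [a, b] inside [0, 1]
    for the n lowest integer frequencies m = -n//2, ..., n - n//2 - 1."""
    freqs = list(range(-(n // 2), n - n // 2))
    coeffs = []
    for m in freqs:
        if m == 0:
            coeffs.append(complex(b - a))  # zero mode is the length of the interval
        else:
            w = 2.0 * math.pi * m
            coeffs.append((cmath.exp(-1j * w * a) - cmath.exp(-1j * w * b)) / (1j * w))
    return freqs, coeffs

def to_physical(freqs, coeffs, n):
    """Evaluate the band-limited signal at x_p = p / n (naive inverse DFT, real part)."""
    return [sum(c * cmath.exp(2j * math.pi * m * p / n) for m, c in zip(freqs, coeffs)).real
            for p in range(n)]

n = 16
freqs, coeffs = interval_modes(0.25, 0.75, n)
samples = to_physical(freqs, coeffs, n)
print([round(s, 2) for s in samples])
```

The samples are smooth, real-valued approximations of the indicator rather than binary values, so a boundary that falls between grid points still influences the neighbouring samples; the mean of the samples equals the zero-frequency coefficient, here the interval length.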
Results: The experiment results are presented in Figure (3b). It is evident that binary pixel repre-
sentation suffers the most information loss, especially at low resolutions, which leads to rapidly
declining performance. Using the distance function representation preserves more information, but
underperforms our NUFT representation. Due to its efficient information compression in the Spectral
domain, NUFT even outperforms the downsampled raw pixel image at low resolutions.
Shape retrieval is a classic task in 3D shape learning. SHREC17 (Savva et al., 2016) which is based on
the ShapeNet55 Core dataset serves as a compelling benchmark for 3D shape retrieval performance.
We compare the retrieval performance of our model utilizing the NUFT-surface (NUFT-S) and
NUFT-volume (NUFT-V) at various resolutions against state-of-the-art shape retrieval algorithms
to illustrate its potential in a range of 3D learning problems. We performed the experiments on
the normalized dataset in SHREC17. Our model utilizes 3D DLA (Yu et al., 2018) as a backbone
architecture.
Results: Results from the experiment are tabulated in Table 2. For the shape retrieval task, most
state-of-the-art methods are based on multi-view representations that utilize a 2D CNN pretrained on
additional 2D datasets such as ImageNet. We have achieved results on par with, though not better
than, state-of-the-art pretrained 2D models. We outperform other models in this benchmark that
have not been pre-trained on additional data. We also compared NUFT-volume and NUFT-surface
representations in Figure 4. Interestingly, NUFT-volume and NUFT-surface representations lead to
similar performance under the same resolution.
We further illustrate the advantages of our representation with a unique yet important task in compu-
tational geometry that has been challenging to address with conventional deep learning techniques:
surface reconstruction from point cloud. The task is challenging for deep learning in two aspects:
First, it requires input and output signals of different topologies and structures (i.e., input being a
point cloud and output being a surface). Second, it requires precise localization of the signal in 3D
space. Using our NUFT-point representation as input and NUFT-surface representation as output,
we can frame the task as a 3D image-to-image translation problem, which is easy to address using
analogous 2D techniques in 3D. We use the U-Net (Ronneberger et al., 2015) architecture and train it
with a single L2 loss between output and ground truth NUFT-surface representation.
Experiment Setup: We train and test our model using shapes from three categories of the ShapeNet55
Core dataset (car, sofa, bottle), training a separate model for each category. As a
pre-processing step, we remove faces not visible from the exterior and simplify the mesh for faster
conversion to the NUFT representation. For the input point cloud, we uniformly sample
3000 points from each mesh and convert the points into the NUFT-point representation ($128^3$).
We then convert the triangular mesh into the NUFT-surface representation ($128^3$). At test time, we
post-process the output NUFT-surface implicit function by using the marching cubes algorithm to
extract the 0.5 contours. Since the extracted mesh has thickness, we further shrink the mesh by moving
vertices to positions with higher density values while preserving the rigidity of each face. Finally,
we compare performance qualitatively and quantitatively by showing the reconstructed
mesh against results from the traditional Poisson Surface Reconstruction (PSR) method (Kazhdan
& Hoppe, 2013) at tree depths 5 and 8 and from the Deep Marching Cubes (DMC) algorithm
(Liao et al., 2018). For quantitative comparison, we follow the literature (Seitz et al., 2006) and use
Chamfer distance, Accuracy and Completeness as the metrics. For comparison with
Liao et al. (2018), we also test the model with noisy inputs (Gaussian noise with sigma of 0.15 voxel-length
at resolution 32) and compute the distance metrics after normalizing the models to the range (0, 32).
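The three metrics named above can be illustrated with a small sketch on point sets sampled from the two surfaces. The text does not spell out the exact definitions, so the ones below are an assumption based on a common convention: Accuracy is the mean distance from each reconstructed point to its nearest ground-truth point, Completeness is the mean distance in the opposite direction, and Chamfer distance is the average of the two. All function names are hypothetical.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def directed_mean_distance(source, target):
    """Mean nearest-neighbour distance from every point in `source` to `target`."""
    return sum(nearest_distance(p, target) for p in source) / len(source)


def reconstruction_metrics(predicted, ground_truth):
    """Assumed definitions: accuracy (prediction -> GT), completeness (GT -> prediction)."""
    accuracy = directed_mean_distance(predicted, ground_truth)
    completeness = directed_mean_distance(ground_truth, predicted)
    chamfer = 0.5 * (accuracy + completeness)
    return {"accuracy": accuracy, "completeness": completeness, "chamfer": chamfer}


# Toy example: the prediction is close to two ground-truth points but misses the third.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
metrics = reconstruction_metrics(predicted, ground_truth)
```

In this toy case the accuracy term stays small because every predicted point lies near the ground truth, while the completeness term is larger because one ground-truth point has no nearby prediction. The brute-force nearest-neighbour search is quadratic in the number of points and is meant only for illustration.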
Results: Refer to Table 3 for quantitative comparisons with competing algorithms on the same
task, and Figures 5 and 6 for visual comparisons. GT stands for Ground Truth. We achieve a new
state of the art in the point-to-surface reconstruction task, owing to the good localization properties of
the NUFT representations and their flexibility across geometry topologies.
5 CONCLUSION
ACKNOWLEDGEMENTS
We would like to thank Yiyi Liao for helping with the DMC comparison, Jonathan Shewchuk for
valuable discussions, and Luna Huang for LaTeX magic. Chiyu “Max” Jiang is supported by the
Chang-Lin Tien Graduate Fellowship and the Graduate Division Block Grant Award of UC Berkeley.
This work is supported by a TUM-IAS Rudolf Mößbauer Fellowship and the ERC Starting Grant
Scan2CAD (804724).
REFERENCES
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative
voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally
connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
Daniel R Canelhas. Truncated Signed Distance Fields Applied To Robotics. PhD thesis, Örebro
University, 2017.
Fu-Lai Chu and Chi-Fang Huang. On the calculation of the Fourier transform of a polygonal shape
function. Journal of Physics A: Mathematical and General, 1989.
Taco Cohen, Mario Geiger, and Max Welling. Convolutional networks for spherical signals. arXiv
preprint arXiv:1709.04893, 2017.
Angela Dai and Matthias Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene
segmentation. arXiv, 2018.
Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on
graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems,
pp. 3844–3852, 2016.
Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform. SIAM
review, 46(3):443–454, 2004.
Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object
reconstruction. arXiv preprint arXiv:1704.00710, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 770–778, 2016.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data.
arXiv preprint arXiv:1506.05163, 2015.
Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In
Advances in Neural Information Processing Systems, pp. 364–375, 2017.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018.
Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions
on Graphics (ToG), 32(3):29, 2013.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
Shung-Wu Lee and Raj Mittra. Fourier transform of a polygonal shape function and its application in
electromagnetics. IEEE Transactions on Antennas and Propagation, 1983.
Huiyuan Li and Yuan Xu. Discrete Fourier analysis on a dodecahedron and a tetrahedron. Mathematics
of Computation, 78(266):999–1029, 2009.
Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. Grass:
Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36
(4):52, 2017.
Yiyi Liao, Simon Donne, and Andreas Geiger. Deep marching cubes: Learning explicit surface
representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE
Computer Society, 2018.
Hongsen Liu, Yang Cong, Shuai Wang, Huijie Fan, Dongying Tian, and Yandong Tang. Deep learning
of directional truncated signed distance function for robust 3d object recognition. In Intelligent
Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 5934–5940. IEEE,
2017.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, 2015.
Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G
Kim, and Yaron Lipman. Convolutional neural networks on surfaces via seamless toric covers.
ACM Trans. Graph, 36(4):71, 2017.
Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional
neural networks on riemannian manifolds. In Proceedings of the IEEE international
conference on computer vision workshops, pp. 37–45, 2015.
Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time
object recognition. In IROS, 2015.
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M
Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc.
CVPR, volume 1, pp. 3, 2017.
Jacopo Pantaleoni. Voxelpipe: a programmable pipeline for 3d voxelization. In Proceedings of the
ACM SIGGRAPH Symposium on High Performance Graphics, pp. 99–106. ACM, 2011.
Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis.
6-dof object pose from semantic keypoints. In Robotics and Automation (ICRA), 2017 IEEE
International Conference on, pp. 2011–2018. IEEE, 2017.
Victor Adrian Prisacariu and Ian Reid. Nonlinear shape manifolds as shape priors in level set
segmentation and tracking. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pp. 2185–2192. IEEE, 2011.
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. In CVPR, 2017a.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature
learning on point sets in a metric space. In NIPS, 2017b.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information processing systems,
pp. 91–99, 2015.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-
assisted intervention, pp. 234–241. Springer, 2015.
Ian Sammis and John Strain. A geometric nonuniform fast Fourier transform. Journal of Computational
Physics, 228(18):7086–7108, 2009.
Manolis Savva, Fisher Yu, Hao Su, M Aono, B Chen, D Cohen-Or, W Deng, Hang Su, Song Bai,
Xiang Bai, et al. Shrec’16 track large-scale 3d shape retrieval from shapenet core55. In Proceedings
of the eurographics workshop on 3D object retrieval, 2016.
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison
and evaluation of multi-view stereo reconstruction algorithms. In Computer vision and pattern
recognition, 2006 IEEE Computer Society Conference on, volume 1, pp. 519–528. IEEE, 2006.
Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation
for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional
neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on
computer vision, pp. 945–953, 2015a.
Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for cnn: Viewpoint estimation in
images using cnns trained with rendered 3d model views. In Proceedings of the IEEE International
Conference on Computer Vision, pp. 2686–2694, 2015b.
Jiachang Sun. Multivariate Fourier transform methods over simplex and super-simplex domains.
Journal of Computational Mathematics, pp. 305–322, 2006.
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient
convolutional architectures for high-resolution 3d outputs. In CVPR, 2017.
Sebastian Thrun. Learning occupancy grid maps with forward sensor models. Autonomous robots,
15(2):111–127, 2003.
Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Multi-view consistency as supervisory signal
for learning shape and pose prediction. arXiv preprint arXiv:1801.03910, 2018.
Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh:
Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018a.
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based
convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36
(4):72, 2017.
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon.
Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018b.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic
latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural
Information Processing Systems, pp. 82–90, 2016.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong
Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1912–1920, 2015.
Li Yi, Hao Su, Xingwen Guo, and Leonidas Guibas. Syncspeccnn: Synchronized spectral cnn for 3d
shape segmentation. In Computer Vision and Pattern Recognition (CVPR), 2017.
Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR,
2018.
Cha Zhang and Tsuhan Chen. Efficient feature extraction for 2d/3d objects in mesh representation.
In ICIP, 2001.
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape
primitives with recurrent neural networks. In The IEEE International Conference on Computer
Vision (ICCV), 2017.
APPENDIX
A MATHEMATICAL DERIVATION
Without loss of generality, assume that the j-simplex is defined in $\mathbb{R}^d$, where $d \geq j$, since it is
not possible to define a j-simplex in a space with fewer than j dimensions. For most cases below
(except j = 0) we will parameterize the original simplex domain to a unit orthogonal simplex in $\mathbb{R}^j$
(as shown in Figure 7). Denote the original coordinate system in $\mathbb{R}^d$ as $\mathbf{x}$ and the new coordinate
system in the parametric $\mathbb{R}^j$ space as $\mathbf{p}$. Choose the following parameterization scheme:
By performing the Fourier transform integral in the parametric space and rescaling the result by the
content distortion factor $\gamma_n^j$, we obtain results equivalent to the Fourier transform on the original
simplex domain. Content is the generalization of volume to arbitrary dimensions (i.e. unity for
points, length for lines, area for triangles, volume for tetrahedra). The content distortion factor $\gamma_n^j$ is the
ratio between the content of the original simplex and the content of the unit orthogonal simplex in
parametric space. The content is signed if switching any pair of nodes in the simplex changes the sign
of the content, and it is unsigned otherwise. See subsection 3.2 for means of computing the content
and the content distortion factor.
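As an illustration of the ratio just defined, the unsigned content distortion factor of a line or a triangle can be computed from the Gram determinant of the edge vectors. This is a standard construction and not necessarily the procedure of subsection 3.2: a j-simplex with edge Gram matrix G has content sqrt(det G) / j!, and the unit orthogonal j-simplex has content 1 / j!, so the ratio reduces to sqrt(det G). The function names below are hypothetical, and the sketch covers only j = 1 and j = 2.

```python
import math


def subtract(u, v):
    """Component-wise difference u - v of two coordinate tuples."""
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    """Euclidean inner product of two coordinate tuples."""
    return sum(a * b for a, b in zip(u, v))


def content_distortion_factor(vertices):
    """Unsigned distortion factor for a line (2 vertices) or triangle (3 vertices) in R^d."""
    origin = vertices[-1]
    edges = [subtract(v, origin) for v in vertices[:-1]]
    if len(edges) == 1:
        gram_determinant = dot(edges[0], edges[0])
    elif len(edges) == 2:
        e1, e2 = edges
        gram_determinant = dot(e1, e1) * dot(e2, e2) - dot(e1, e2) ** 2
    else:
        raise ValueError("this sketch only handles lines and triangles")
    # Clamp tiny negative values caused by rounding on degenerate simplices.
    return math.sqrt(max(gram_determinant, 0.0))


# A line of length 5 compared with the unit line (length 1).
line_gamma = content_distortion_factor([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)])
# A triangle of area 3 compared with the unit orthogonal triangle (area 1/2).
triangle_gamma = content_distortion_factor(
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
)
```

For the line the factor equals its length, and for the triangle it equals twice its area, which matches the ratio of contents described above.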
Points have a spatial position $\mathbf{x}$ but no size (length, area, volume); hence a point can be mathematically
modelled as a delta function. The delta function (or Dirac delta function) represents a point mass: it is
equal to zero everywhere except at zero, and its integral over the entire real space is one, i.e.:
\[
\iiint_{-\infty}^{\infty} \delta(\mathbf{x})\, d\mathbf{x} = \iiint_{0^-}^{0^+} \delta(\mathbf{x})\, d\mathbf{x} = 1 \tag{10}
\]
Indeed, for 0-simplex, we have recovered the definition of the Discrete Fourier Transform (DFT).
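For a line with vertices at locations $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^d$, by parameterizing it onto a unit line, we get:
\begin{align}
F_n^1(\boldsymbol{\omega}) &= \gamma_n^1 \int_0^1 dp\, e^{-i\boldsymbol{\omega} \cdot \mathbf{x}(p)} \tag{15} \\
&= i\gamma_n^1 \left( \frac{e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_1}}{\boldsymbol{\omega} \cdot (\mathbf{x}_1 - \mathbf{x}_2)} + \frac{e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_2}}{\boldsymbol{\omega} \cdot (\mathbf{x}_2 - \mathbf{x}_1)} \right) \tag{16}
\end{align}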
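The closed form for the line can be checked numerically against the defining integral. The sketch below is illustrative only: it assumes the parameterization x(p) = x2 + p (x1 - x2), takes the distortion factor of a line to be its length, and uses a simple midpoint rule. The function names and sample values are hypothetical.

```python
import cmath
import math


def dot(u, v):
    """Euclidean inner product of two coordinate tuples."""
    return sum(a * b for a, b in zip(u, v))


def line_transform_closed_form(omega, x1, x2):
    """Closed-form transform of a line segment; singular when omega . (x1 - x2) == 0."""
    gamma = math.dist(x1, x2)  # content distortion factor of a line: its length
    s1, s2 = dot(omega, x1), dot(omega, x2)
    return 1j * gamma * (
        cmath.exp(-1j * s1) / (s1 - s2) + cmath.exp(-1j * s2) / (s2 - s1)
    )


def line_transform_quadrature(omega, x1, x2, steps=2000):
    """Midpoint-rule approximation of gamma * integral_0^1 exp(-i omega . x(p)) dp."""
    gamma = math.dist(x1, x2)
    total = 0j
    for k in range(steps):
        p = (k + 0.5) / steps
        # Assumed parameterization: x(p) = x2 + p * (x1 - x2).
        x = tuple(b + p * (a - b) for a, b in zip(x1, x2))
        total += cmath.exp(-1j * dot(omega, x))
    return gamma * total / steps


omega = (1.0, 2.0, -0.5)
x1 = (0.2, 0.1, 0.7)
x2 = (0.9, 0.4, 0.3)
closed_form = line_transform_closed_form(omega, x1, x2)
numeric = line_transform_quadrature(omega, x1, x2)
line_error = abs(closed_form - numeric)
```

The two values agree up to quadrature error. Note that the closed form divides by ω · (x1 − x2), so frequencies orthogonal to the segment need separate handling (for example, by taking the limit).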
Figure 7: (a) Original j-simplex in $\mathbb{R}^d$ space, $d \geq j$. (b) Unit orthogonal j-simplex in $\mathbb{R}^j$.
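For a triangle with vertices $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3 \in \mathbb{R}^d$, parameterization onto a unit orthogonal triangle gives:
\begin{align}
F_n^2(\boldsymbol{\omega}) &= \gamma_n^2 \int_0^1 dp \int_0^{1-p} dq\, e^{-i\boldsymbol{\omega} \cdot \mathbf{x}(\mathbf{p})} \tag{17} \\
&= \gamma_n^2\, e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_3} \int_0^1 dp\, e^{-ip\,\boldsymbol{\omega} \cdot (\mathbf{x}_1 - \mathbf{x}_3)} \int_0^{1-p} dq\, e^{-iq\,\boldsymbol{\omega} \cdot (\mathbf{x}_2 - \mathbf{x}_3)} \tag{18} \\
&= -\gamma_n^2 \left( \frac{e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_1}}{\left(\boldsymbol{\omega} \cdot (\mathbf{x}_1 - \mathbf{x}_2)\right)\left(\boldsymbol{\omega} \cdot (\mathbf{x}_1 - \mathbf{x}_3)\right)} + \frac{e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_2}}{\left(\boldsymbol{\omega} \cdot (\mathbf{x}_2 - \mathbf{x}_1)\right)\left(\boldsymbol{\omega} \cdot (\mathbf{x}_2 - \mathbf{x}_3)\right)} + \frac{e^{-i\boldsymbol{\omega} \cdot \mathbf{x}_3}}{\left(\boldsymbol{\omega} \cdot (\mathbf{x}_3 - \mathbf{x}_1)\right)\left(\boldsymbol{\omega} \cdot (\mathbf{x}_3 - \mathbf{x}_2)\right)} \right) \tag{19}
\end{align}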
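The triangle result can be checked in the same way, by comparing the closed form with a nested midpoint-rule evaluation of the factored double integral. This is a sketch under stated assumptions: the distortion factor is taken to be twice the triangle area (the ratio to the unit orthogonal triangle of area 1/2), computed here from the Gram determinant of the edges, and all names and sample values are hypothetical.

```python
import cmath
import math


def dot(u, v):
    """Euclidean inner product of two coordinate tuples."""
    return sum(a * b for a, b in zip(u, v))


def subtract(u, v):
    """Component-wise difference u - v of two coordinate tuples."""
    return tuple(a - b for a, b in zip(u, v))


def triangle_gamma(x1, x2, x3):
    """Twice the triangle area, i.e. its content relative to the unit orthogonal triangle."""
    e1, e2 = subtract(x1, x3), subtract(x2, x3)
    return math.sqrt(dot(e1, e1) * dot(e2, e2) - dot(e1, e2) ** 2)


def triangle_closed_form(omega, x1, x2, x3):
    """Closed-form transform; singular if any two projections omega . x_t coincide."""
    s = [dot(omega, x) for x in (x1, x2, x3)]
    total = 0j
    for t in range(3):
        denominator = 1.0
        for l in range(3):
            if l != t:
                denominator *= s[t] - s[l]
        total += cmath.exp(-1j * s[t]) / denominator
    return -triangle_gamma(x1, x2, x3) * total


def triangle_quadrature(omega, x1, x2, x3, steps=200):
    """Nested midpoint rule for the factored double integral over p and q."""
    a = dot(omega, subtract(x1, x3))
    b = dot(omega, subtract(x2, x3))
    total = 0j
    for m in range(steps):
        p = (m + 0.5) / steps
        q_width = (1.0 - p) / steps  # inner integral runs over q in [0, 1 - p]
        inner = 0j
        for n in range(steps):
            q = (n + 0.5) * q_width
            inner += cmath.exp(-1j * q * b)
        total += cmath.exp(-1j * p * a) * inner * q_width
    prefactor = triangle_gamma(x1, x2, x3) * cmath.exp(-1j * dot(omega, x3))
    return prefactor * total / steps


omega = (2.0, 1.0, -3.0)
x1, x2, x3 = (0.1, 0.2, 0.0), (0.9, 0.3, 0.2), (0.4, 0.8, 0.5)
closed_form = triangle_closed_form(omega, x1, x2, x3)
numeric = triangle_quadrature(omega, x1, x2, x3)
triangle_error = abs(closed_form - numeric)
```

The agreement between the two values confirms the overall minus sign and the pairing of denominators in the closed form. As with the line, the expression is singular when two of the projections ω · x_t coincide.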
Model Architecture: We use the state-of-the-art Deep Layer Aggregation (DLA) backbone architecture
with [1, 1, 2, 1] levels and [32, 64, 256, 512] filter numbers. We keep the architecture constant
while varying the input resolution.
Training Details: We train the model with a batch size of 64, a learning rate of 0.01 and a learning rate
step size of 10. We use the SGD optimizer with momentum of 0.9 and weight decay of $1 \times 10^{-4}$ for 15
epochs.
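The stepped learning rate described above can be sketched as follows. The base rate, step size and epoch count come from the text. The decay factor applied at each step is not stated, so the value 0.1 used here is purely an assumption for illustration, as is the function name.

```python
def stepped_learning_rate(epoch, base_lr=0.01, step_size=10, decay=0.1):
    """Learning rate at a given epoch when `decay` is applied every `step_size` epochs.

    base_lr and step_size follow the text; decay=0.1 is an assumed value.
    """
    return base_lr * decay ** (epoch // step_size)


# Learning rate for each of the 15 training epochs.
schedule = [stepped_learning_rate(epoch) for epoch in range(15)]
```

Under this assumption the rate stays at the base value for the first ten epochs and is reduced once for the remaining five.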
Training Details: We train the model with a batch size of 64, a learning rate of $1 \times 10^{-3}$ and a learning rate
step size of 30. We use the Adam optimizer with momentum of 0.9 and weight decay of $1 \times 10^{-4}$
for 40 epochs.
Model Architecture: We use a modified 3D version of the U-Net architecture consisting of 4 down-convolutions
and 4 up-convolutions with skip layers. The numbers of filters for the down-convolutions are
[32, 64, 128, 256], and double that for the up-convolutions.
Training Details: We train the model using the Adam optimizer with a learning rate of $3 \times 10^{-4}$ for 200
epochs. We use the NUFT-point representation as input and a single L2 loss between the output and the ground
truth (NUFT-surface) to train the network. We train and evaluate the model at $128^3$ resolution.
Figure 8: Visualizing different representations. (a) Shows the original ground truth polygon, (b, c)
show reconstructed polygons from binary and NUFT representations.
Figure 9: Comparison between 3D shapes. (a) Original mesh, (b) Reconstructed mesh from Binary
Voxel (64 × 64 × 64), (c) Reconstructed mesh from NUFT (64 × 64 × 64)
[Plot: Relative Error axis with ticks from 0.2% to 10%; abscissa ticks from 20 to 120; annotation "Efficiency: 58x".]
Figure 10: Comparison of Representations in Mesh Recovery Accuracy (Example mesh: Stanford
Bunny 1K Mesh). Notes: (i) Relative error is defined by the proportion of volume of differenced
mesh to volume of the original mesh. (ii) Error estimates for NUFT-Volume over 50 on abscissa are
inaccurate due to inadequate quadrature resolution.