
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018

A Comprehensive Survey on Graph Neural Networks
Zonghan Wu, Shirui Pan, Member, IEEE, Fengwen Chen, Guodong Long,
Chengqi Zhang, Senior Member, IEEE, Philip S. Yu, Fellow, IEEE

arXiv:1901.00596v1 [cs.LG] 3 Jan 2019

Abstract—Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video
processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the
Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and
are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has
imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning
approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in
data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different
categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these
learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal
networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes
and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this
fast-growing field.

Index Terms—Deep Learning, graph neural networks, graph convolutional networks, graph representation learning, graph
autoencoder, network embedding

• Z. Wu, F. Chen, G. Long, and C. Zhang are with the Centre for Artificial Intelligence, FEIT, University of Technology Sydney, NSW 2007, Australia (E-mail: zonghan.wu-3@student.uts.edu.au; fengwen.chen@student.uts.edu.au; guodong.long@uts.edu.au; chengqi.zhang@uts.edu.au).
• S. Pan is with the Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia (E-mail: shirui.pan@monash.edu).
• P. S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607-7053, USA (E-mail: psyu@uic.edu).
• Corresponding author: Shirui Pan.
Manuscript received Dec xx, 2018; revised Dec xx, 201x.

1 INTRODUCTION

The recent success of neural networks has boosted research on pattern recognition and data mining. Many machine learning tasks such as object detection [1], [2], machine translation [3], [4], and speech recognition [5], which once heavily relied on handcrafted feature engineering to extract informative feature sets, have recently been revolutionized by various end-to-end deep learning paradigms, e.g., convolutional neural networks (CNNs) [6], long short-term memory (LSTM) [7], and autoencoders. The success of deep learning in many domains is partially attributed to the rapidly developing computational resources (e.g., GPU) and the availability of large training data, and is partially due to the effectiveness of deep learning in extracting latent representations from Euclidean data (e.g., images, text, and video). Taking image analysis as an example, an image can be represented as a regular grid in the Euclidean space. A convolutional neural network (CNN) is able to exploit the shift-invariance, local connectivity, and compositionality of image data [8], and as a result, a CNN can extract locally meaningful features that are shared across the entire dataset for various image analysis tasks.

While deep learning has achieved great success on Euclidean data, there is an increasing number of applications where data are generated from the non-Euclidean domain and need to be effectively analyzed. For instance, in e-commerce, a graph-based learning system is able to exploit the interactions between users and products [9], [10], [11] to make highly accurate recommendations. In chemistry, molecules are modeled as graphs and their bioactivity needs to be identified for drug discovery [12], [13]. In a citation network, papers are linked to each other via citationship and they need to be categorized into different groups [14], [15]. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. This is because graph data are irregular. Each graph has a variable number of unordered nodes and each node in a graph has a different number of neighbors, so some important operations (e.g., convolutions) that are easy to compute in the image domain are no longer directly applicable to the graph domain. Furthermore, a core assumption of existing machine learning algorithms is that instances are independent of each other. However, this is not the case for graph data, where each instance (node) is related to others (neighbors) via some complex linkage information, which is used to capture the interdependence among data, including citationship, friendship, and interactions.

Recently, there has been increasing interest in extending deep learning approaches to graph data. Driven by the success of deep learning, researchers have borrowed ideas from convolution networks, recurrent networks, and deep auto-encoders to design the architecture of graph neural networks. To handle the complexity of graph data, new generalizations and definitions for important operations have been rapidly developed over the past few years. For instance, Figure 1 illustrates how a kind of graph convolution is inspired by a standard 2D convolution. This survey aims to provide a comprehensive overview of these methods, for both interested researchers who want to enter this rapidly developing field and experts who would like to compare graph neural network algorithms.
A Brief History of Graph Neural Networks. The notion of graph neural networks was first outlined in Gori et al. (2005) [16], and further elaborated in Scarselli et al. (2009) [17]. These early studies learn a target node's representation by propagating neighbor information via recurrent neural architectures in an iterative manner until a stable fixed point is reached. This process is computationally expensive, and recently there have been increasing efforts to overcome these challenges [18], [19]. In our survey, we generalize the term graph neural networks to represent all deep learning approaches for graph data.

Fig. 1(a): 2D Convolution. Analogous to a graph, each pixel in an image is taken as a node where neighbors are determined by the filter size. The 2D convolution takes a weighted average of pixel values of the red node along with its neighbors. The neighbors of a node are ordered and have a fixed size.
Fig. 1(b): Graph Convolution. To get a hidden representation of the red node, one simple solution of graph convolution operation takes the average value of node features of the red node along with its neighbors. Different from image data, the neighbors of a node are unordered and variable in size.
Fig. 1: 2D Convolution vs. Graph Convolution.
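The simple graph convolution described in the caption of Fig. 1(b), averaging a node's features with those of its neighbors, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `mean_aggregate`, the toy graph, and the choice to add self-loops so that each node counts itself are assumptions made here, not notation from the survey.

```python
import numpy as np

def mean_aggregate(A, X):
    """Average each node's features with those of its neighbors.

    A: (N, N) binary adjacency matrix without self-loops (assumed).
    X: (N, D) node feature matrix.
    """
    A_hat = A + np.eye(A.shape[0])           # let every node include itself
    size = A_hat.sum(axis=1, keepdims=True)  # neighborhood size, varies per node
    return (A_hat @ X) / size

# Hypothetical toy graph: a path 0 - 1 - 2 plus an isolated node 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
X = np.array([[1.0], [2.0], [6.0], [5.0]])
H = mean_aggregate(A, X)

# Relabeling the nodes only relabels the output: the neighbors are unordered.
perm = [2, 0, 3, 1]
H_perm = mean_aggregate(A[perm][:, perm], X[perm])
```

Unlike a 2D convolution, no fixed-size, ordered window is involved: the divisor changes from node to node, and permuting the node labels permutes the result in the same way.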
Inspired by the huge success of convolutional networks in the computer vision domain, a large number of methods that re-define the notion of convolution for graph data have emerged recently. These approaches are under the umbrella of graph convolutional networks (GCNs). The first prominent research on GCNs is presented in Bruna et al. (2013), which develops a variant of graph convolution based on spectral graph theory [20]. Since that time, there have been increasing improvements, extensions, and approximations on spectral-based graph convolutional networks [12], [14], [21], [22], [23]. As spectral methods usually handle the whole graph simultaneously and are difficult to parallelize or scale to large graphs, spatial-based graph convolutional networks have rapidly developed recently [24], [25], [26], [27]. These methods directly perform the convolution in the graph domain by aggregating the neighbor nodes' information. Together with sampling strategies, the computation can be performed in a batch of nodes instead of the whole graph [24], [27], which has the potential to improve the efficiency.

In addition to graph convolutional networks, many alternative graph neural networks have been developed in the past few years. These approaches include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. Details on the categorization of these methods are given in Section 3.

Related surveys on graph neural networks. There are a limited number of existing reviews on the topic of graph neural networks. Using the term geometric deep learning, Bronstein et al. [8] give an overview of deep learning methods in the non-Euclidean domain, including graphs and manifolds. While being the first review on graph convolution networks, this survey misses several important spatial-based approaches, including [15], [19], [24], [26], [27], [28], which update state-of-the-art benchmarks. Furthermore, this survey does not cover many newly developed architectures which are equally important to graph convolutional networks. These learning paradigms, including graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks, are comprehensively reviewed in this article. Battaglia et al. [29] position graph networks as the building blocks for learning from relational data, reviewing part of graph neural networks under a unified framework. However, their generalized framework is highly abstract, losing insights on each method from its original paper. Lee et al. [30] conduct a partial survey on the graph attention model, which is one type of graph neural network. Most recently, Zhang et al. [31] present the most up-to-date survey on deep learning for graphs, but it misses the studies on graph generative and spatial-temporal networks. In summary, none of the existing surveys provides a comprehensive overview of graph neural networks: they cover only some of the graph convolution neural networks and examine a limited number of works, thereby missing the most recent development of alternative graph neural networks, such as graph generative networks and graph spatial-temporal networks.

Graph neural networks vs. network embedding. The research on graph neural networks is closely related to graph embedding or network embedding, another topic which attracts increasing attention from both the data mining and machine learning communities [32], [33], [34], [35], [36], [37]. Network embedding aims to represent network vertices in a low-dimensional vector space, by preserving both network topology structure and node content information, so that any subsequent graph analytics tasks such as classification, clustering, and recommendation can be easily performed by using simple off-the-shelf machine learning algorithms (e.g., support vector machines for classification). Many network embedding algorithms are typically unsupervised algorithms and they can be broadly classified into three groups [32], i.e., matrix factorization [38], [39], random walks [40], and deep learning approaches. The deep learning approaches for network embedding at the same time belong to graph neural networks, which include graph autoencoder-based algorithms (e.g., DNGR [41] and SDNE [42]) and graph convolution neural networks with unsupervised training (e.g., GraphSage [24]). Figure 2 describes the differences between network embedding and graph neural
networks in this paper.

Fig. 2: Network Embedding v.s. Graph Neural Networks.

Our Contributions. Our paper makes notable contributions summarized as follows:

• New taxonomy. In light of the increasing number of studies on deep learning for graph data, we propose a new taxonomy of graph neural networks (GNNs). In this taxonomy, GNNs are categorized into five groups: graph convolution networks, graph attention networks, graph auto-encoders, graph generative networks, and graph spatial-temporal networks. We pinpoint the differences between graph neural networks and network embedding, and draw the connections between different graph neural network architectures.
• Comprehensive review. This survey provides the most comprehensive overview on modern deep learning techniques for graph data. For each type of graph neural network, we provide detailed descriptions of representative algorithms, and make the necessary comparisons and summarize the corresponding algorithms.
• Abundant resources. This survey provides abundant resources on graph neural networks, which include state-of-the-art algorithms, benchmark datasets, open-source codes, and practical applications. This survey can be used as a hands-on guide for understanding, using, and developing different deep learning approaches for various real-life applications.
• Future directions. This survey also highlights the current limitations of the existing algorithms, and points out possible directions in this rapidly developing field.

Organization of Our Survey. The rest of this survey is organized as follows. Section 2 defines a list of graph-related concepts. Section 3 clarifies the categorization of graph neural networks. Section 4 and Section 5 provide an overview of graph neural network models. Section 6 presents a gallery of applications across various domains. Section 7 discusses the current challenges and suggests future directions. Section 8 summarizes the paper.

Fig. 3: Categorization of Graph Neural Networks.

Fig. 4: A Variant of Graph Convolution Networks with Multiple GCN layers [14]. A GCN layer encapsulates each node's hidden representation by aggregating feature information from its neighbors. After feature aggregation, a non-linear transformation is applied to the resultant outputs. By stacking multiple layers, the final hidden representation of each node receives messages from a further neighborhood.

Fig. 5(a): Graph Convolution Networks with Pooling Modules for Graph Classification [12]. A GCN layer [14] is followed by a pooling layer to coarsen a graph into sub-graphs so that node representations on coarsened graphs represent higher graph-level representations. To calculate the probability for each graph label, the output layer is a linear layer with the SoftMax function.

Fig. 5(b): Graph Auto-encoder with GCN [59]. The encoder uses GCN layers to get latent representations for each node. The decoder computes the pair-wise distance between node latent representations produced by the encoder. After applying a non-linear activation function, the decoder reconstructs the graph adjacency matrix.

2 DEFINITION

In this section, we provide definitions of basic graph concepts. For easy retrieval, we summarize the commonly used notations in Table 1.

TABLE 1: Commonly used notations.

Notations                         Descriptions
$|\cdot|$                         The length of a set.
$\odot$                           Element-wise product.
$A^T$                             Transpose of vector/matrix $A$.
$[A, B]$                          Concatenation of $A$ and $B$.
$G$                               A graph.
$V$                               The set of nodes in a graph.
$v_i$                             A node $v_i \in V$.
$\mathcal{N}(v)$                  The neighbors of node $v$.
$E$                               The set of edges in a graph.
$e_{ij}$                          An edge $e_{ij} \in E$.
$X \in \mathbb{R}^{N \times D}$   The feature matrix of a graph.
$x \in \mathbb{R}^{N}$            The feature vector of a graph in the case of $D = 1$.
$X_i \in \mathbb{R}^{D}$          The feature vector of the node $v_i$.
$N$                               The number of nodes, $N = |V|$.
$M$                               The number of edges, $M = |E|$.
$D$                               The dimension of a node vector.
$T$                               The total number of time steps in time series.

Definition 1 (Graph). A graph is $G = (V, E, A)$ where $V$ is the set of nodes, $E$ is the set of edges, and $A$ is the adjacency matrix. In a graph, let $v_i \in V$ denote a node and $e_{ij} = (v_i, v_j) \in E$ denote an edge. The adjacency matrix $A$ is an $N \times N$ matrix with $A_{ij} = w_{ij} > 0$ if $e_{ij} \in E$ and $A_{ij} = 0$ if $e_{ij} \notin E$. The degree of a node is the number of edges connected to it, formally defined as $\mathrm{degree}(v_i) = \sum_{j} A_{i,j}$.

A graph can be associated with node attributes $X$,¹ where $X \in \mathbb{R}^{N \times D}$ is a feature matrix with $X_i \in \mathbb{R}^{D}$ representing the feature vector of node $v_i$. In the case of $D = 1$, we use $x \in \mathbb{R}^{N}$ in place of $X$ to denote the feature vector of the graph.

1. Such a graph is referred to as an attributed graph in the literature.

Definition 2 (Directed Graph). A directed graph is a graph with all edges pointing from one node to another. For a directed graph, in general $A_{ij} \neq A_{ji}$. An undirected graph is a graph with all edges undirectional. For an undirected graph, $A_{ij} = A_{ji}$.

Definition 3 (Spatial-Temporal Graph). A spatial-temporal graph is an attributed graph where the feature matrix $X$ evolves over time. It is defined as $G = (V, E, A, X)$ with $X \in \mathbb{R}^{T \times N \times D}$ where $T$ is the number of time steps.
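The definitions of Section 2 map directly onto arrays. The following NumPy sketch is illustrative only: the helper name `adjacency_matrix`, the weighted edge-list format, and the toy sizes are assumptions introduced here, not part of the survey.

```python
import numpy as np

def adjacency_matrix(num_nodes, edges, directed=False):
    """Build A with A[i, j] = w_ij > 0 for every edge (i, j, w_ij), 0 elsewhere."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j, w in edges:
        A[i, j] = w
        if not directed:
            A[j, i] = w  # undirected graph: A_ij = A_ji
    return A

# Hypothetical edge list (source, target, weight) on N = 4 nodes.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (2, 3, 1.0)]
A_undirected = adjacency_matrix(4, edges)
A_directed = adjacency_matrix(4, edges, directed=True)

# degree(v_i) = sum_j A_ij; with unit weights this counts the incident edges.
degree = A_undirected.sum(axis=1)

undirected_is_symmetric = np.array_equal(A_undirected, A_undirected.T)
directed_is_symmetric = np.array_equal(A_directed, A_directed.T)

# Attributed graph: X has shape (N, D). Spatial-temporal graph: (T, N, D).
N, D, T = 4, 3, 5
X = np.zeros((N, D))
X_spatial_temporal = np.zeros((T, N, D))
```

With non-unit weights the row sum gives a weighted degree rather than an edge count, which is how the formula in Definition 1 is usually read for weighted graphs.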

3 CATEGORIZATION AND FRAMEWORKS

In this section, we present our taxonomy of graph neural networks. We consider any differentiable graph models which incorporate neural architectures as graph neural networks. We categorize graph neural networks into graph convolution networks, graph attention networks, graph auto-encoders, graph generative networks and graph spatial-temporal networks. Of these, graph convolution networks play a central role in capturing structural dependencies. As illustrated in Figure 3, methods under other categories partially utilize graph convolution networks as building blocks. We summarize the representative methods in each category in Table 2, and we give a brief introduction of each category in the following.

3.1 Taxonomy of GNNs

Graph Convolution Networks (GCNs) generalize the operation of convolution from traditional data (images or grids) to graph data. The key is to learn a function $f$ to generate a node $v_i$'s representation by aggregating its own features $X_i$ and neighbors' features $X_j$, where $j \in \mathcal{N}(v_i)$. Figure 4 shows the process of GCNs for node representation learning. Graph convolutional networks play a central role in building up many other complex graph neural network models, including auto-encoder-based models, generative models, and spatial-temporal networks. Figure 5 illustrates several graph neural network models building on GCNs.

Fig. 5(c): Graph Spatial-Temporal Networks with GCN [71]. A GCN layer is followed by a 1D-CNN layer. The GCN layer operates on $A_t$ and $X_t$ to capture spatial dependency, while the 1D-CNN layer slides over $X$ along the time axis to capture the temporal dependency. The output layer is a linear transformation, generating a prediction for each node.

Fig. 5: Different Graph Neural Network Models built with GCNs.

Graph Attention Networks are similar to GCNs and seek an aggregation function to fuse the neighboring nodes, random walks, and candidate models in graphs to learn a new representation. The key difference is that graph attention networks employ attention mechanisms which assign larger weights to the more important nodes, walks, or models. The attention weight is learned together with neural network parameters within an end-to-end framework. Figure 6 illustrates the difference between graph convolutional networks and graph attention networks in aggregating the neighbor node information.

Graph Auto-encoders are unsupervised learning frameworks which aim to learn low-dimensional node vectors via an encoder, and then reconstruct the graph data via a decoder. Graph autoencoders are a popular approach to learn the graph embedding, for both plain graphs without attributed information [41], [42] as well as attributed graphs [61], [62]. For plain graphs, many algorithms directly preprocess the adjacency matrix, by either constructing a new matrix (i.e., pointwise mutual information matrix) with rich information [41], or feeding the adjacency matrix to an autoencoder model and capturing both first order and second order information [42]. For attributed graphs, graph autoencoder models tend to employ GCN [14] as a building block for the encoder and reconstruct the structure information via a link prediction decoder [59], [61].
TABLE 2: Representative Publications of Graph Neural Networks

Category                                        Publications
Graph Convolution Networks: Spectral-based      [12], [14], [20], [21], [22], [23], [43]
Graph Convolution Networks: Spatial-based       [13], [17], [18], [19], [24], [25], [26], [27], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54]
Graph Convolution Networks: Pooling modules     [12], [21], [55], [56]
Graph Attention Networks                        [15], [28], [57], [58]
Graph Auto-encoder                              [41], [42], [59], [60], [61], [62], [63]
Graph Generative Networks                       [64], [65], [66], [67], [68]
Graph Spatial-Temporal Networks                 [69], [70], [71], [72], [73]

Fig. 6(a): Graph Convolution Networks [14] explicitly assign a non-parametric weight $a_{ij} = \frac{1}{\sqrt{\deg(v_i)\deg(v_j)}}$ to the neighbor $v_j$ of $v_i$ during the aggregation process.
Fig. 6(b): Graph Attention Networks [15] implicitly capture the weight $a_{ij}$ via an end-to-end neural network architecture, so that more important nodes receive larger weights.
Fig. 6: Differences between graph convolutional networks and graph attention networks.

Graph Generative Networks aim to generate plausible structures from data. Generating graphs given a graph empirical distribution is fundamentally challenging, mainly because graphs are complex data structures. To address this problem, researchers have explored factoring the generation process into alternately forming nodes and edges [64], [65], or employing generative adversarial training [66], [67]. One promising application domain of graph generative networks is chemical compound synthesis. In a chemical graph, atoms are treated as nodes and chemical bonds are treated as edges. The task is to discover new synthesizable molecules which possess certain chemical and physical properties.

Graph Spatial-temporal Networks aim to learn unseen patterns from spatial-temporal graphs, which are increasingly important in many applications such as traffic forecasting and human activity prediction. For instance, the underlying road traffic network is a natural graph where each key location is a node whose traffic data is continuously monitored. By developing effective graph spatial-temporal network models, we can accurately predict the traffic status over the whole traffic system [70], [71]. The key idea of graph spatial-temporal networks is to consider spatial dependency and temporal dependency at the same time. Many current approaches apply GCNs to capture the spatial dependency together with some RNN [70] or CNN [71] to model the temporal dependency.

3.2 Frameworks

Graph neural networks, graph convolution networks (GCNs) in particular, try to replicate the success of CNN in graph data by defining graph convolutions via graph spectral theory or spatial locality. With graph structure and node content information as inputs, the outputs of GCN can focus on different graph analytics tasks with one of the following mechanisms:

• Node-level outputs relate to node regression and classification tasks. As a graph convolution module directly gives nodes' latent representations, a multi-layer perceptron or softmax layer is used as the final layer of GCN. We review graph convolution modules in Section 4.1 and Section 4.2.
• Edge-level outputs relate to the edge classification and link prediction tasks. To predict the label/connection strength of an edge, an additional function will take two nodes' latent representations from the graph convolution module as inputs.
• Graph-level outputs relate to the graph classification task. To obtain a compact representation on graph level, a pooling module is used to coarsen a graph into sub-graphs or to sum/average over the node representations. We review graph pooling modules in Section 4.3.

In Table 3, we list the details of the inputs and outputs of the main GCN methods. In particular, we summarize output mechanisms in between each GCN layer and in the final layer of each method. The output mechanisms may involve several pooling operations, which are discussed in Section 4.3.

End-to-end Training Frameworks. Graph convolutional networks can be trained in a (semi-)supervised or purely unsupervised way within an end-to-end learning framework, depending on the learning tasks and label information available at hand.

• Semi-supervised learning for node-level classification. Given a single network with partial nodes being labeled and others remaining unlabeled, graph convolutional networks can learn a robust model that effectively identifies the class labels for the unlabeled nodes [14]. To this end, an end-to-end framework can be built by stacking a couple of graph convolutional layers followed by a softmax layer for multi-class classification.
• Supervised learning for graph-level classification. Given a graph dataset, graph-level classification aims to predict the class label(s) for an entire graph [55],
[56], [74], [75]. The end-to-end learning for this task can be done with a framework which combines both graph convolutional layers and the pooling procedure [55], [56]. Specifically, by applying graph convolutional layers, we obtain a representation with a fixed number of dimensions for each node in each single graph. Then, we can get the representation of an entire graph through pooling which summarizes the representation vectors of all nodes in a graph. Finally, by applying the MLP layers and a softmax layer which are commonly used in existing deep learning frameworks, we can build an end-to-end framework for graph classification. An example is given in Fig. 5a.
• Unsupervised learning for graph embedding. When no class labels are available in graphs, we can learn the graph embedding in a purely unsupervised way in an end-to-end framework. These algorithms exploit the edge-level information in two ways. One simple way is to adapt an autoencoder framework where the encoder employs graph convolutional layers to embed the graph into the latent representation upon which a decoder is used to reconstruct the graph structure [59], [61]. Another way is to utilize the negative sampling approach, which samples a portion of node pairs as negative pairs while existing node pairs with links in the graphs are positive pairs. Then a logistic regression layer is applied after the convolutional layers for end-to-end learning [24].

4 GRAPH CONVOLUTION NETWORKS

In this section, we review graph convolution networks (GCNs), the foundation of many complex graph neural network models. GCN approaches fall into two categories, spectral-based and spatial-based. Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing [76], where the graph convolution operation is interpreted as removing noise from graph signals. Spatial-based approaches formulate graph convolutions as aggregating feature information from neighbors. While GCNs operate on the node level, graph pooling modules can be interleaved with the GCN layer, to coarsen graphs into high-level sub-structures. As shown in Fig. 5a, such an architecture design can be used to extract graph-level representations and to perform graph classification tasks. In the following, we introduce spectral-based GCNs, spatial-based GCNs, and graph pooling modules separately.

TABLE 3: Summary of Graph Convolution Networks

Category        Approach                  Inputs (allow edge features?)  Outputs      Intermediate output mechanism  Final output mechanism
Spectral Based  Spectral CNN (2014) [20]  ✗                              Graph-level  cluster+max pooling            softmax function
                ChebNet (2016) [12]       ✗                              Graph-level  efficient pooling              mlp layer+softmax function
                1stChebNet (2017) [14]    ✗                              Node-level   activation function            softmax function
                AGCN (2018) [22]          ✗                              Graph-level  max pooling                    sum pooling
Spatial Based   GNN (2009) [17]           ✓                              Node-level   -                              mlp layer+softmax function
                                                                         Graph-level  -                              add a dummy super node
                GGNNs (2015) [18]         ✗                              Node-level   -                              mlp layer/softmax function
                                                                         Graph-level  -                              sum pooling
                SSE (2018) [19]           ✗                              Node-level   -                              softmax function
                MPNN (2017) [13]          ✓                              Node-level                                  softmax function
                                                                         Graph-level  -                              sum pooling
                GraphSage (2017) [24]     ✗                              Node-level   activation function            softmax function
                DCNN (2016) [44]          ✓                              Node-level   activation function            softmax function
                                                                         Graph-level  -                              mean pooling
                PATCHY-SAN (2016) [26]    ✓                              Graph-level  -                              mlp layer+softmax function
                LGCN (2018) [27]          ✗                              Node-level   skip connections               mlp layer+softmax function

4.1 Spectral-based Graph Convolutional Networks

Spectral-based methods have a solid foundation in graph signal processing [76]. We first give some basic background on graph signal processing, after which we review the representative research on the spectral-based GCNs.

4.1.1 Backgrounds

A robust mathematical representation of a graph is the normalized graph Laplacian matrix, defined as $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ is a diagonal matrix of node degrees, $D_{ii} = \sum_j A_{i,j}$. The normalized graph Laplacian matrix possesses the property of being real symmetric positive semidefinite. With this property, the normalized Laplacian matrix can be factored as $L = U \Lambda U^T$, where $U = [u_0, u_1, \cdots, u_{N-1}] \in \mathbb{R}^{N \times N}$ is the matrix of eigenvectors ordered by eigenvalues and $\Lambda$ is the diagonal matrix of eigenvalues, $\Lambda_{ii} = \lambda_i$. The eigenvectors of the normalized Laplacian matrix form an orthonormal basis, that is, $U^T U = I$. In graph signal processing, a graph signal $x \in \mathbb{R}^N$ is a feature vector of the nodes of the graph where $x_i$ is the value of the $i$-th node. The graph Fourier transform of a signal $x$ is defined as $\mathcal{F}(x) = U^T x$ and the inverse graph Fourier transform is defined as $\mathcal{F}^{-1}(\hat{x}) = U \hat{x}$, where $\hat{x}$ represents the resulting signal from the graph Fourier transform. From its definition, the graph Fourier transform projects the input graph signal onto the orthonormal space whose basis is formed by the eigenvectors of the normalized graph Laplacian. Elements of the transformed signal $\hat{x}$ are the coordinates of the graph signal in the new space, so that the input signal can be represented as $x = \sum_i \hat{x}_i u_i$, which is exactly the inverse graph Fourier transform. Now the graph convolution of the input signal $x$ with a filter $g \in \mathbb{R}^N$ is defined as

$$x *_G g = \mathcal{F}^{-1}(\mathcal{F}(x) \odot \mathcal{F}(g)) = U(U^T x \odot U^T g) \tag{1}$$

where $\odot$ denotes the Hadamard product. If we denote a filter as $g_\theta = \mathrm{diag}(U^T g)$, then the graph convolution is simplified as

$$x *_G g_\theta = U g_\theta U^T x \tag{2}$$

Spectral-based graph convolution networks all follow this definition. The key difference lies in the choice of the filter $g_\theta$.

4.1.2 Methods of Spectral based GCNs

Spectral CNN. Bruna et al. [20] propose the first spectral convolution neural network (Spectral CNN). Assuming the filter $g_\theta = \Theta^{k}_{i,j}$ is a set of learnable parameters and considering graph signals of multi-dimension, they define a graph convolution layer as

$$X^{k+1}_{:,j} = \sigma\left(\sum_{i=1}^{f_{k-1}} U \Theta^{k}_{i,j} U^T X^{k}_{:,i}\right) \quad (j = 1, 2, \cdots, f_k) \tag{3}$$

where $X^k \in \mathbb{R}^{N \times f_{k-1}}$ is the input graph signal, $N$ is the number of nodes, $f_{k-1}$ is the number of input channels and $f_k$ is the number of output channels, $\Theta^{k}_{i,j}$ is a diagonal matrix filled with learnable parameters, and $\sigma$ is a non-linear transformation.

Chebyshev Spectral CNN (ChebNet). Defferrard et al. [12] propose ChebNet, which defines a filter as Chebyshev polynomials of the diagonal matrix of eigenvalues, i.e., $g_\theta = \sum_{i=0}^{K} \theta_i T_i(\tilde{\Lambda})$, where $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_N$. The Chebyshev polynomials are defined recursively by $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. As a result, the convolution of a graph signal $x$ with the defined filter $g_\theta$ is

$$x *_G g_\theta = U\left(\sum_{i=0}^{K} \theta_i T_i(\tilde{\Lambda})\right) U^T x = \sum_{i=0}^{K} \theta_i T_i(\tilde{L}) x \tag{4}$$

where $\tilde{L} = 2L/\lambda_{max} - I_N$.

From Equation 4, ChebNet implicitly avoids the computation of the graph Fourier basis, reducing the computation complexity from $O(N^3)$ to $O(KM)$. Since $T_i(\tilde{L})$ is a polynomial of $\tilde{L}$ of $i$-th order, $T_i(\tilde{L})x$ operates locally on each node. Therefore, the filters of ChebNet are localized in space.

First order of ChebNet (1stChebNet²). Kipf et al. [14] introduce a first-order approximation of ChebNet. Assuming $K = 1$ and $\lambda_{max} = 2$, Equation 4 is simplified as

$$x *_G g_\theta = \theta_0 x - \theta_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x \tag{5}$$

To restrain the number of parameters and avoid over-fitting, 1stChebNet further assumes $\theta = \theta_0 = -\theta_1$, leading to the following definition of graph convolution,

$$x *_G g_\theta = \theta\left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x \tag{6}$$

of 1stChebNet is that the computation cost increases exponentially with the increase of the number of 1stChebNet layers during batch training. Each node in the last layer has to expand its neighborhood recursively across previous layers. Chen et al. [45] assume the rescaled adjacency matrix $\tilde{A}$ in Equation 7 comes from a sampling distribution. Under this assumption, Monte Carlo and variance reduction techniques are used to facilitate the training process. Chen et al. [46] reduce the receptive field size of the graph convolution to an arbitrarily small scale by sampling neighborhoods and using historical hidden representations. Huang et al. [54] propose an adaptive layer-wise sampling approach to accelerate the training of 1stChebNet, where sampling for the lower layer is conditioned on the top one. This method is also applicable for explicit variance reduction.

Adaptive Graph Convolution Network (AGCN). To explore hidden structural relations unspecified by the graph Laplacian matrix, Li et al. [22] propose the adaptive graph convolution network (AGCN). AGCN augments a graph with a so-called residual graph, which is constructed by computing a pairwise distance of nodes. Despite being able to capture complementary relational information, AGCN incurs expensive $O(N^2)$ computation.
4.1.3 Summary
In order to incorporate multi-dimensional graph input
Spectral CNN [20] relys on the eigen-decomposition of the
signals, 1stChebNet proposes a graph convolution layer
Laplacian matrix. It has three effects. First, any perturbation
which modifies Equation 6,
to a graph results in a change of eigen basis. Second, the
Xk+1 = ÃXk Θ (7) learned filters are domain dependent, meaning they cannot
be applied to a graph with a different structure. Third, eigen-
− 12 − 12
where à = IN + D AD . decomposition requires O(N 3 ) computation and O(N 2 )
The graph convolution defined by 1stChebNet is local- memory. Filters defined by ChebNet [12] and 1stChebNet
ized in space. It bridges the gap between spectral-based [14] are localized in space. The learned weights can be
methods and spatial-based methods. Each row of the output shared across different locations in a graph. However, a
represents the latent representation of each node obtained common drawback of spectral methods is they need to
by a linear transformation of aggregated information from load the whole graph into the memory to perform graph
the node itself and its neighboring nodes with weights convolution, which is not efficient in handling big graphs.
specified by the row of Ã. However, the main drawback
4.2 Spatial-based Graph Convolutional Networks
2. Due to its impressive performance in many node classification
tasks, 1stChebNet is simply termed as GCN and is considered as a Imitating the convolution operation of a conventional con-
strong baseline in the research community. volution neural network on an image, spatial-based meth-
JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018 8

ℎ() ℎ$( ℎ(% ℎ(&*$ ℎ(&


where lv denotes the label attributes of node v , lco [v] denotes

!"#$ !"#$ !"#$
the label attributes of corresponding edges of node v , htne [v]
denotes the hidden representations of node v ’s neighbors at
time step t, and lne [v] denotes the label attributes of node
(a) Recurrent-based v ’s neighbors.
To ensure convergence, the recurrent function f (·) must
be a contraction mapping, which shrinks the distance be-
ℎ() ℎ$( ℎ(% ℎ(&*$ ℎ(&
… tween two points after mapping. In case of f (·) is a neural
!"#$ !"#% !"#&
network, a penalty term has to be imposed on the Jacobian
matrix of parameters. GNNs used the Almeida-Pineda algo-
(b) Composition-based rithm [77], [78] to train its model. The core idea is to run the
propagation process to reach fixed points and then perform
Fig. 7: Recurrent-based v.s. Composition-based Spatial the backward procedure given the converged solution.
GCNs. Gated Graph Neural Networks (GGNNs) GGNNs employs
gated recurrent units(GRU) [79] as the recurrent function,
reducing the recurrence to a fixed number of steps. The
ods define graph convolution based on a node’s spatial re-
spatial graph convolution of GGNNs is defined as
lations. To relate images with graphs, images can be consid- X
ered as a special form of graph with each pixel representing htv = GRU (hvt−1 , Whtu ) (9)
a node. As illustrated in Figure 1a, each pixel is directly u∈N (v)
connected to its nearby pixels. With a 3 × 3 window, the
Different from GNNs, GGNNs use back-propagation
neighborhood of each node is its surrounding eight pixels.
through time (BPTT) to learn the parameters. The adavan-
The positions of these eight pixels indicate an ordering of a
tage is that it no longer needs to constrain parameters to
node’s neighbors. A filter is then applied to this 3 × 3 patch
ensure convergence. However, the downside of training by
by taking the weighted average of pixel values of the central
BPTT is that it sacrifices efficiency both in time and memory.
node and its neighbors across each channel. Due to the spe-
This is especially problematic for large graphs, as GGNNs
cific ordering of neighboring nodes, the trainable weights
need to run the recurrent function multiple times over all
are able to be shared across different locations. Similarly, for
nodes, requring intermediate states of all nodes to be stored
a general graph, the spatial-based graph convolution takes
in memory.
the aggregation of the central node representation and its
neighbors representation to get a new representation for this Stochastic Steady-state Embedding (SSE). To improve the
node, as depicted by Figure 1b. To explore the depth and learning efficiency, the SSE algorithm [19] updates the node
breadth of a node’s receptive field, a common practice is to latent representations stochastically in an asynchronous
stack multiple graph convolution layer together. According fashion. As shown in Algorithm 1, SSE recursively estimates
to the different approaches of stacking convolution layers, node latent representations and updates the parameters
spatial-based GCNs can be further divided into two cate- with sampled batch data. To ensure convergence to steady
gories, recurrent-based and composition-based spatial GCNs. states, the recurrent function of SSE is defined as a weighted
Recurrent-based methods apply a same graph convolution average of the historical states and new states,
layer to update hidden representations, while composition- X
hv t = (1 − α)hv t−1 + αW1 σ(W2 [xv , [hu t−1 , xu ]])
based methods apply a different graph convolution layer
u∈N (v)
to update hidden representations. Figure 7 illustrates this
(10)
difference. In the following, we give an overview of these
Though summing neighborhood information implicitly con-
two branches.
siders node degree, it remains questionable whether the
scale of this summation affects the stability of this algorithm.
4.2.1 Recurrent-based Spatial GCNs
The main idea of recurrent-based methods is to update a 4.2.2 Composition Based Spatial GCNs
node’s latent representation recursively until a stable fixed Composition-based methods update the nodes’ representa-
point is reached. This is done by imposing constraints on tions by stacking multiple graph convolution layers.
recurrent functions [17], employing gate recurrent unit ar-
chitectures [18], updating node latent representations asyn- Message Passing Neural Networks (MPNNs). Gilmer
chronously and stochastically [19]. In the following, we will et al. [13] generalizes several existing graph convolution
introduce these three methods. networks including [12], [14], [18], [20], [53], [80], [81]
into a unified framework named Message Passing Neural
Graph Neural Networks(GNNs) Being one of the earli- Networks (MPNNs). MPNNs consists of two phases, the
est works on graph neural networks, GNNs recursively message passing phase and the readout phase. The mes-
update node latent representations until convergence. In sage passing phase actually run T -step spatial-based graph
other words, from the perspective of the diffusion process, convolutions. The graph convolution operation is defined
each node exchanges information with its neighbors until through a message function Mt (·) and an updating function
equilibrium is reached. To handle heterogeneous graphs, the Ut (·) according to
spatial graph convolution of GNNs is defined as X
htv = Ut (hvt−1 , Mt (ht−1 t−1
v , hw , evw )) (11)
htv = f (lv , lco [v], ht−1
ne [v], lne [v]) (8) w∈N (v)
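As a rough illustration of the message passing phase in Equation 11, the following NumPy sketch runs T update steps on a small graph. The concrete forms chosen here are assumptions made for illustration only and are not taken from [13]: the message function is a ReLU of a linear map over the concatenation of $h_v$, $h_w$ and $e_{vw}$, the update function is a tanh of a linear map over the concatenation of $h_v$ and the summed messages, and the same weights are shared across all steps. The names `message_passing`, `W_msg` and `W_upd` are hypothetical.

```python
import numpy as np

def message_passing(h, edges, edge_feats, W_msg, W_upd, T=2):
    """Run T rounds of h_v <- U(h_v, sum_w M(h_v, h_w, e_vw)).

    h          : (N, d) initial node states
    edges      : list of directed pairs (v, w), meaning w is a neighbor of v
    edge_feats : (E, d_e) array, one feature row per entry of `edges`
    W_msg      : (2*d + d_e, d) weights of the assumed message function
    W_upd      : (2*d, d) weights of the assumed update function
    """
    h = np.asarray(h, dtype=float)
    for _ in range(T):
        m = np.zeros_like(h)
        for (v, w), e in zip(edges, edge_feats):
            # Assumed M(h_v, h_w, e_vw): ReLU of a linear map on the concatenation.
            msg_in = np.concatenate([h[v], h[w], e])
            m[v] += np.maximum(msg_in @ W_msg, 0.0)
        # Assumed U(h_v, m_v): tanh of a linear map on [h_v, m_v].
        # All messages above were computed from the previous step's states.
        h = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)
    return h

rng = np.random.default_rng(0)
N, d, d_e = 4, 3, 2
h0 = rng.normal(size=(N, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # node 3 has no neighbors
edge_feats = rng.normal(size=(len(edges), d_e))
W_msg = rng.normal(size=(2 * d + d_e, d))
W_upd = rng.normal(size=(2 * d, d))

h_T = message_passing(h0, edges, edge_feats, W_msg, W_upd, T=2)
print(h_T.shape)
```

The explicit loop over edges keeps the correspondence with the sum over $w \in N(v)$ visible; a practical implementation would more likely vectorize this step with sparse matrix operations.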

ALGORITHM 1: Learning with Stochastic Fixed Point Iteration [19]

Initialize parameters, $\{h_v^0\}_{v \in V}$
for k = 1 to K do
    for t = 1 to T do
        Sample n nodes from the whole node set V
        Use Equation 10 to update hidden representations of sampled n nodes
    end
    for p = 1 to P do
        Sample m nodes from the labeled node set V
        Forward model according to Equation 10
        Back-propagate gradients
    end
end

The readout phase is a pooling operation which produces a representation of the entire graph based on the hidden representations of each individual node. It is defined as

$$\hat{y} = R(h_v^T \mid v \in G) \tag{12}$$

Through the output function $R(\cdot)$, the final representation $\hat{y}$ is used to perform graph-level prediction tasks. The authors show that several other graph convolution networks fall into their framework by assuming different forms of $U_t(\cdot)$ and $M_t(\cdot)$.

GraphSage [24] introduces the notion of the aggregation function to define graph convolution. The aggregation function essentially assembles a node's neighborhood information. It must be invariant to permutations of node orderings, as the mean, sum and max functions are. The graph convolution operation is defined as,

$$h_v^t = \sigma\big(W^t \cdot aggregate_t(h_v^{t-1}, \{h_u^{t-1}, \forall u \in N(v)\})\big) \tag{13}$$

Instead of updating states over all nodes, GraphSage proposes a batch-training algorithm, which improves scalability for large graphs. The learning process of GraphSage consists of three steps. First, it samples a node's local k-hop neighborhood with fixed size. Second, it derives the central node's final state by aggregating its neighbors' feature information. Finally, it uses the central node's final state to make predictions and backpropagate errors. This process is illustrated in Figure 8.

Fig. 8: Learning Process of GraphSage [24]

Assuming the number of neighbors to be sampled at the $t$-th hop is $s_t$, the time complexity of GraphSage in one batch is $O(\prod_{t=1}^{T} s_t)$. Therefore the computation cost increases exponentially with the increase of t. This prevents GraphSage from having a deep architecture. However, in practice, the authors find that with t = 2 GraphSage already achieves high performance.

4.2.3 Miscellaneous Variants of Spatial GCNs
Diffusion Convolution Neural Networks (DCNN) [44] proposed a graph convolution network which encapsulates the graph diffusion process. A hidden node representation is obtained by independently convolving inputs with power series of the transition probability matrix. The diffusion convolution operation of DCNN is formulated as

$$Z^m_{i,j,:} = f(W_{j,:} P^m_{i,j,:} X^m_{i,:}) \tag{14}$$

In Equation 14, $Z^m_{i,j,:}$ denotes the hidden representation of node i for hop j in graph m, $P^m_{:,j,:}$ denotes the probability transition matrix of hop j in graph m, and $X^m_{i,:}$ denotes the input features of node i in graph m, where $Z^m \in R^{N_m \times H \times F}$, $W \in R^{H \times F}$, $P^m \in R^{N_m \times H \times N_m}$ and $X^m \in R^{N_m \times F}$.

Though covering a larger receptive field through higher orders of the transition matrix, the DCNN model needs $O(N_m^2 H)$ memory, causing severe problems when applying it to large graphs.

PATCHY-SAN [26] uses a standard convolution neural network (CNN) to solve graph classification tasks. To do this, it converts graph-structured data into grid-structured data. First, it selects a fixed number of nodes for each graph using a graph labelling procedure. A graph labelling procedure essentially assigns a ranking to each node in the graph, which can be based on node degree, centrality, Weisfeiler-Lehman color [82], [83], etc. Second, as each node in a graph can have a different number of neighbors, PATCHY-SAN selects and orders a fixed number of neighbors for each node according to their graph labellings. Finally, after the grid-structured data with fixed size is formed, PATCHY-SAN employs a standard CNN to learn the graph hidden representations. Utilizing a standard CNN in GCNs has the advantage of keeping shift-invariance, which relies on the sorting function. As a result, the ranking criteria in the node selection and ordering process are of paramount importance. In PATCHY-SAN, the ranking is based on graph labellings. However, graph labellings only take graph structures into consideration, ignoring node feature information.

Large-scale Graph Convolution Networks (LGCN). In a follow-up work, large-scale graph convolution networks (LGCN) [27] propose a ranking method based on node feature information. Unlike PATCHY-SAN, LGCN uses a standard CNN to generate node-level outputs. For each node, LGCN assembles a feature matrix of its neighborhood and sorts this feature matrix along each column. The first k rows of the sorted feature matrix are taken as the input grid-data for the target node. In the end LGCN applies a 1D CNN on the resultant inputs to get the target node's hidden representation. While deriving graph labellings in PATCHY-SAN requires complex pre-processing, sorting feature values in LGCN does not need a pre-processing step, making it more efficient. To suit the scenario of large-scale graphs,

LGCN proposes a subgraph training strategy, which puts the sampled subgraphs into a mini-batch.

Mixture Model Network (MoNet) [25] unifies standard CNN with convolutional architectures on non-Euclidean domains. While several spatial-based approaches ignore the relative positions between a node and its neighbors when aggregating neighborhood feature information, MoNet introduces pseudo-coordinates and weight functions to let the weight of a node's neighbor be determined by the relative position (pseudo-coordinates) between the node and its neighbor. Under such a framework, several approaches on manifolds such as Geodesic CNN (GCNN) [84], Anisotropic CNN (ACNN) [85], Spline CNN [86], and on graphs such as GCN [14], DCNN [44] can be generalized as special instances of MoNet. However, these approaches under the framework of MoNet have fixed weight functions. MoNet instead proposes a Gaussian kernel with learnable parameters to freely adjust the weight function.

4.2.4 Summary
Spatial-based methods define graph convolutions via aggregating feature information from neighbors. According to different ways of stacking graph convolution layers, spatial-based methods are split into two groups, recurrent-based and composition-based. While recurrent-based approaches try to obtain nodes' steady states, composition-based approaches try to incorporate higher orders of neighborhood information. In each layer, both groups have to update hidden states over all nodes during training. However, this is not efficient as all the intermediate states have to be stored in memory. To address this issue, several training strategies have been proposed, including sub-graph training for composition-based approaches such as GraphSage [24] and stochastically asynchronous training for recurrent-based approaches such as SSE [19].

4.3 Graph Pooling Modules
When generalizing convolutional neural networks to graph-structured data, another key component, the graph pooling module, is also of vital importance, particularly for graph-level classification tasks [55], [56], [87]. According to Xu et al. [88], pooling-assisted GCNs are as powerful as the Weisfeiler-Lehman test [82] in distinguishing graph structures. Similar to the original pooling layer which comes with CNNs, a graph pooling module can easily reduce the variance and computation complexity by down-sampling from the original feature data. Mean/max/sum pooling is the most primitive and most effective way of implementing this since calculating the mean/max/sum value in the pooling window is rapid.

$$h_G = mean/max/sum(h_1^T, h_2^T, \ldots, h_n^T) \tag{15}$$

Henaff et al. [21] prove that performing a simple max/mean pooling at the beginning of the network is especially important to reduce the dimensionality in the graph domain and mitigate the cost of the expensive graph Fourier transform operation.

Defferrard et al. optimize max/min pooling and devise an efficient pooling strategy in their approach ChebNet [12]. Input graphs are first processed by the coarsening process described in Fig 5a. After coarsening, the vertices of the input graph and its coarsened versions are rearranged in a balanced binary tree. Arbitrarily ordering the nodes at the coarsest level and then propagating this ordering to the lower levels in the balanced binary tree finally produces a regular ordering at the finest level. Pooling such a rearranged 1D signal is much more efficient than pooling the original.

Zhang et al. also propose a framework DGCNN [55] with a similar pooling strategy named SortPooling, which performs pooling by rearranging vertices to a meaningful order. Different from ChebNet [12], DGCNN sorts vertices according to their structural roles within the graph. The graph's unordered vertex features from spatial graph convolutions are treated as continuous WL colors [82], and they are then used to sort vertices. In addition to sorting the vertex features, it unifies the graph size to k by truncating/extending the graph's feature tensor. The last n − k rows are deleted if n > k, otherwise k − n zero rows are added. This method enhances the pooling network to improve the performance of GCNs by solving one challenge underlying graph structured tasks, which is referred to as permutation invariance. Verma and Zhang propose graph capsule networks [89], which further explore permutation invariance for graph data.

Recently a pooling module, DIFFPOOL [56], was proposed which can generate hierarchical representations of graphs and can be combined with not only CNNs, but also various graph neural network architectures in an end-to-end fashion. Compared to all previous coarsening methods, DIFFPOOL does not simply cluster the nodes in one graph, but provides a general solution to hierarchically pool nodes across a broad set of input graphs. This is done by learning a cluster assignment matrix S at layer l, referred to as $S^{(l)} \in R^{n_l \times n_{l+1}}$. Two separate GNNs, both taking the input cluster node features $X^{(l)}$ and the coarsened adjacency matrix $A^{(l)}$, are used to generate the assignment matrix $S^{(l)}$ and the embedding matrix $Z^{(l)}$ as follows:

$$Z^{(l)} = GNN_{l,embed}(A^{(l)}, X^{(l)}) \tag{16}$$

$$S^{(l)} = softmax(GNN_{l,pool}(A^{(l)}, X^{(l)})) \tag{17}$$

Equations 16 and 17 can be implemented with any standard GNN module. The two modules process the same input data but have distinct parametrizations since the roles they play in the framework are different. $GNN_{l,embed}$ produces new embeddings while $GNN_{l,pool}$ generates a probabilistic assignment of the input nodes to $n_{l+1}$ clusters. The softmax function is applied in a row-wise fashion in Equation 17. As a result, each row of $S^{(l)}$ corresponds to one of the $n_l$ nodes (or clusters) at layer l, and each column of $S^{(l)}$ corresponds to one of the $n_{l+1}$ clusters at the next layer. Once we have $Z^{(l)}$ and $S^{(l)}$, the pooling operation is as follows:

$$X^{(l+1)} = S^{(l)T} Z^{(l)} \in R^{n_{l+1} \times d} \tag{18}$$

$$A^{(l+1)} = S^{(l)T} A^{(l)} S^{(l)} \in R^{n_{l+1} \times n_{l+1}} \tag{19}$$

Equation 18 takes the cluster embeddings $Z^{(l)}$ and then aggregates these embeddings according to the cluster assignments $S^{(l)}$ to calculate an embedding for each of the $n_{l+1}$

clusters. The initial cluster embeddings are the node representations. Similarly, Equation 19 takes the adjacency matrix $A^{(l)}$ as input and generates a coarsened adjacency matrix denoting the connectivity strength between each pair of clusters.

Overall, DIFFPOOL [56] redefines the graph pooling module by using two GNNs to cluster the nodes. Any standard GCN module can be combined with DIFFPOOL, not only to achieve enhanced performance, but also to speed up the convolution operation.

4.4 Comparison Between Spectral and Spatial Models
As the earliest convolutional networks for graph data, spectral-based models have achieved impressive results in many graph related analytics tasks. These models are appealing in that they have a theoretical foundation in graph signal processing. By designing new graph signal filters [23], we can theoretically design new graph convolution networks. However, there are several drawbacks to spectral-based models. We illustrate this in the following from three aspects: efficiency, generality and flexibility.

In terms of efficiency, the computational cost of spectral-based models increases dramatically with the graph size because they either need to perform eigenvector computation [20] or handle the whole graph at the same time, which makes them difficult to parallelize or scale to large graphs. Spatial-based models have the potential to handle large graphs as they directly perform the convolution in the graph domain via aggregating the neighbor nodes. The computation can be performed in a batch of nodes instead of the whole graph. When the number of neighbor nodes increases, sampling techniques [24], [27] can be developed to improve efficiency.

In terms of generality, spectral-based models assume a fixed graph, making them generalize poorly to new or different graphs. Spatial-based models, on the other hand, perform graph convolution locally on each node, where weights can be easily shared across different locations and structures.

In terms of flexibility, spectral-based models are limited to working on undirected graphs. There is no clear definition of the Laplacian matrix on directed graphs, so the only way to apply spectral-based models to directed graphs is to transform directed graphs into undirected graphs. Spatial-based models are more flexible in dealing with multi-source inputs such as edge features and edge directions because these inputs can be incorporated into the aggregation function (e.g. [13], [17], [51], [52], [53]).

As a result, spatial models have attracted increasing attention in recent years [25].

5 BEYOND GRAPH CONVOLUTIONAL NETWORKS
In this section, we review other graph neural networks including graph attention neural networks, graph auto-encoders, graph generative networks, and graph spatial-temporal networks. In Table 4, we provide a summary of main approaches under each category.

5.1 Graph Attention Networks
Attention mechanisms have almost become a standard in sequence-based tasks [90]. The virtue of attention mechanisms is their ability to focus on the most important parts of an object. This specialty has been proven to be useful for many tasks, such as machine translation and natural language understanding. Thanks to the increased model capacity of attention mechanisms, graph neural networks also benefit from this by using attention during aggregation, integrating outputs from multiple models, and generating importance-oriented random walks. In this section, we will discuss how attention mechanisms are being used in graph structured data.

5.1.1 Methods of Graph Attention Networks
Graph Attention Network (GAT) [15] is a spatial-based graph convolution network where the attention mechanism is involved in determining the weights of a node's neighbors when aggregating feature information. The graph convolution operation of GAT is defined as,

$$h_i^t = \sigma\Big(\sum_{j \in N_i} \alpha(h_i^{t-1}, h_j^{t-1}) W^{t-1} h_j^{t-1}\Big) \tag{20}$$

where $\alpha(\cdot)$ is an attention function which adaptively controls the contribution of a neighbor j to the node i. In order to learn attention weights in different subspaces, GAT uses multi-head attention,

$$h_i^t = \Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_k(h_i^{t-1}, h_j^{t-1}) W_k^{t-1} h_j^{t-1}\Big) \tag{21}$$

where $\Vert$ denotes concatenation.

Gated Attention Network (GAAN) [28] also employs the multi-head attention mechanism in updating a node's hidden state. However, rather than assigning an equal weight to each head, GAAN introduces a self-attention mechanism which computes a different weight for each head. The updating rule is defined as,

$$h_i^t = \phi_o\Big(x_i \oplus \Vert_{k=1}^{K} g_i^k \sum_{j \in N_i} \alpha_k(h_i^{t-1}, h_j^{t-1}) \phi_v(h_j^{t-1})\Big) \tag{22}$$

where $\phi_o(\cdot)$ and $\phi_v(\cdot)$ denote feedforward neural networks and $g_i^k$ is the attention weight of the $k$-th attention head.

Graph Attention Model (GAM) [57] proposes a recurrent neural network model to solve graph classification problems, which processes informative parts of a graph by adaptively visiting a sequence of important nodes. The GAM model is defined as

$$h_t = f_h(f_s(r_{t-1}, v_{t-1}, g; \theta_s), h_{t-1}; \theta_h) \tag{23}$$

where $f_h(\cdot)$ is an LSTM network and $f_s$ is the step network, which takes a step from the current node $v_{t-1}$ to one of its neighbors $c_t$, prioritizing those whose type has a higher rank in $r_{t-1}$, which is generated by a policy network:

$$r_t = f_r(h_t; \theta_r) \tag{24}$$

where $r_t$ is a stochastic rank vector which indicates which node is more important and thus should be further explored

TABLE 4: Summary of Alternative Graph Neural Networks (Graph Convolutional Networks Excluded). We summarize
methods based on their inputs, outputs, targeted tasks, and whether a method is GCN-based. Inputs indicate whether a
method suits attributed graphs (A), directed graphs (D), and spatial-temporal graphs (S).
| Category | Approach | Inputs: A | Inputs: D | Inputs: S | Outputs | Tasks | GCN based |
| Graph Attention Networks | GAT (2017) [15] | yes | yes | no | node labels | node classification | yes |
| | GAAN (2018) [28] | yes | yes | no | node labels | node classification | yes |
| | GAM (2018) [57] | yes | yes | no | graph labels | graph classification | no |
| | Attention Walks (2018) [58] | no | no | no | node embedding | network embedding | no |
| Graph Auto-encoder | GAE (2016) [59] | yes | no | no | reconstructed adjacency matrix | network embedding | yes |
| | ARGA (2018) [61] | yes | no | no | reconstructed adjacency matrix | network embedding | yes |
| | NetRA (2018) [62] | no | no | no | reconstructed sequences of random walks | network embedding | no |
| | DNGR (2016) [41] | no | no | no | reconstructed PPMI matrix | network embedding | no |
| | SDNE (2016) [42] | no | yes | no | reconstructed adjacency matrix | network embedding | no |
| | DRNE (2018) [63] | yes | no | no | reconstructed node embedding | network embedding | no |
| Graph Generative Networks | MolGAN (2018) [66] | yes | no | no | new graphs | graph generation | yes |
| | DGMG (2018) [65] | no | no | no | new graphs | graph generation | yes |
| | GraphRNN (2018) [64] | no | no | no | new graphs | graph generation | no |
| | NetGAN (2018) [67] | no | no | no | new graphs | graph generation | no |
| Graph Spatial-Temporal Networks | DCRNN (2018) [70] | no | no | yes | node value vectors | spatial-temporal forecasting | yes |
| | CNN-GCN (2017) [71] | no | no | yes | node value vectors | spatial-temporal forecasting | yes |
| | ST-GCN (2018) [72] | no | no | yes | graph labels | spatial-temporal classification | yes |
| | Structural RNN (2016) [73] | no | no | yes | node labels/value vectors | spatial-temporal forecasting | no |

with high priority, and $h_t$ contains historical information that the agent has aggregated from exploration of the graph and is used to make a prediction for the graph label.

Attention Walks [58] learns node embeddings through random walks. Unlike DeepWalk [40], which uses fixed a priori weights, Attention Walks factorizes the co-occurrence matrix with differentiable attention weights.

$$E[D] = \tilde{P}^{(0)} \sum_{k=1}^{C} a_k (P)^k \tag{25}$$

where $D$ denotes the co-occurrence matrix, $\tilde{P}^{(0)}$ denotes the initial position matrix, and $P$ denotes the probability transition matrix.

5.1.2 Summary
Attention mechanisms contribute to graph neural networks in three different ways, namely assigning attention weights to different neighbors when aggregating feature information, ensembling multiple models according to attention weights, and using attention weights to guide random walks. Although GAT [15] and GAAN [28] are categorized under the umbrella of graph attention networks, they can also be considered as spatial-based graph convolution networks at the same time. The advantage of GAT [15] and GAAN [28] is that they can adaptively learn the importance weights of neighbors as illustrated in Fig 6. However, the computation cost and memory consumption increase rapidly as the attention weights between each pair of neighbors must be computed.

5.2 Graph Auto-encoders
Graph auto-encoders are one class of network embedding approaches which aim at representing network vertices in a low-dimensional vector space by using neural network architectures. A typical solution is to leverage multi-layer perceptrons as the encoder to obtain node embeddings, where a decoder reconstructs a node's neighborhood statistics such as positive pointwise mutual information (PPMI) [41] or the first and second order of proximities [42]. Recently, researchers have explored the use of GCN [14] as an encoder, combining GCN [14] with GAN [91], or combining LSTM [7] with GAN [91] in designing a graph auto-encoder. We will first review GCN based autoencoders and then summarize other variants in this category.

5.2.1 GCN Based Auto-encoders
Graph Auto-encoder (GAE) [59] first integrates GCN [14] into a graph auto-encoder framework. The encoder is defined as

$$Z = GCN(X, A) \tag{26}$$

while the decoder is defined as

$$\hat{A} = \sigma(ZZ^T) \tag{27}$$

The framework of GAE is also depicted in Fig 5b. The GAE can be trained in a variational manner, i.e., to optimize the variational lower bound L:

$$L = E_{q(Z|X,A)}[\log p(A|Z)] - KL[q(Z|X,A) \,\Vert\, p(Z)] \tag{28}$$

Adversarially Regularized Graph Autoencoder (ARGA) [61] employs the training scheme of generative adversarial networks (GANs) [91] to regularize a graph auto-encoder. In ARGA, an encoder encodes a node's structural information with its features into a hidden representation by GCN [14], and a decoder reconstructs the adjacency matrix from the outputs of the encoder. GANs play a min-max game between a generator and a discriminator in training generative models. A generator generates “faked samples” as real as possible while a discriminator does its best to distinguish

the “faked samples” from the real ones. GAN helps ARGA to regularize the learned hidden representations of nodes to follow a prior distribution. In detail, the encoder, working as a generator, tries to make the learned node hidden representations indistinguishable from a real prior distribution. A discriminator, on the other hand, tries to identify whether the learned node hidden representations are generated from the encoder or from a real prior distribution.

5.2.2 Miscellaneous Variants of Graph Auto-encoders
Network Representations with Adversarially Regularized Autoencoders (NetRA) [62] is a graph auto-encoder framework which shares a similar idea with ARGA. It also regularizes node hidden representations to comply with a prior distribution via adversarial training. Instead of reconstructing the adjacency matrix, it recovers node sequences sampled from random walks by a sequence-to-sequence architecture [92].

Deep Neural Networks for Graph Representations (DNGR) [41] uses the stacked denoising autoencoder [93] to reconstruct the positive pointwise mutual information (PPMI) matrix. The PPMI matrix intrinsically captures nodes' co-occurrence information when a graph is serialized as sequences by random walks. Formally, the PPMI matrix is defined as

$$PPMI_{v_1,v_2} = \max\Big(\log\Big(\frac{count(v_1, v_2) \cdot |D|}{count(v_1)\, count(v_2)}\Big), 0\Big) \tag{29}$$

where $|D| = \sum_{v_1,v_2} count(v_1, v_2)$ and $v_1, v_2 \in V$. The stacked denoising autoencoder is able to learn highly non-linear regularity behind data. Different from conventional neural autoencoders, it adds noise to inputs by randomly switching entries of inputs to zero. The learned latent representation is more robust, especially when there are missing values present.

detail, $b_{i,j} = 1$ if $A_{i,j} = 0$ and $b_{i,j} = \beta > 1$ if $A_{i,j} = 1$. Overall, the objective function is defined as

$$L = L_{2nd} + \alpha L_{1st} + \lambda L_{reg} \tag{32}$$

where $L_{reg}$ is the L2 regularization term.

Deep Recursive Network Embedding (DRNE) [63] directly reconstructs a node's hidden state instead of the whole graph statistics. Using an aggregation function as the encoder, DRNE designs the loss function as,

$$L = \sum_{v \in V} \Vert h_v - aggregate(h_u \mid u \in N(v)) \Vert^2 \tag{33}$$

One innovation of DRNE is that it chooses LSTM as the aggregation function, where the sequence of neighbors is ordered by their node degree.

5.2.3 Summary
DNGR and SDNE learn node embeddings only given the topological structures, while GAE, ARGA, NetRA, DRNE learn node embeddings when both topological information and node content features are available. One challenge of graph auto-encoders is the sparsity of the adjacency matrix A, causing the number of positive entries of the decoder to be far less than the negative ones. To tackle this issue, DNGR reconstructs a denser matrix, namely the PPMI matrix, SDNE imposes a larger penalty on the non-zero entries of the adjacency matrix, GAE reweights the terms in the adjacency matrix, and NetRA linearizes graphs into sequences.

5.3 Graph Generative Networks
The goal of graph generative networks is to generate graphs given an observed set of graphs. Many approaches to graph generative networks are domain specific. For instance, in molecular graph generation, some works model a string representation of molecular graphs called SMILES [94], [95],
Structural Deep Network Embedding (SDNE) [42] uses
[96], [97]. In natural language processing, generating a se-
stacked auto encoder to preserve nodes first-order proximity
mantic or a knowledge graph is often conditioned on a given
and second-order proximity jointly. The first-order proxim-
sentence [98], [99]. Recently, several general approaches
ity is defined as the distance between a node’s hidden rep-
have been proposed. Some works factor the generation
resentation and its neighbor’s hidden representation. The
process as forming nodes and edges alternatively [64], [65]
goal for the first-order proximity is to drive representations
while others employ generative adversarial training [66],
of adjacent nodes close to each other as much as possible.
[67]. The methods in this category either employ GCN as
Specifically, the loss function L1st is defined as
building blocks or use different architectures.
n
X (k) (k)
L1st = Ai,j ||hi − hj ||2 (30) 5.3.1 GCN Based Graph Generative Networks
i,j=1
Molecular Generative Adversarial Networks (MolGAN)
The second-order proximity is defined as the distance be- [66] integrates relational GCN [100], improved GAN [101]
tween a node’s input and its reconstructed inputs where and reinforcement lerarning (RL) objective to generate
the input is the corresponding row of the node in the graphs with desired properties. The GAN consists of a
adjacent matrix. The goal for the second-order proximity is generator and a discriminator, competing with each other
to preserve a node’s neighborhood information. Concretely, to improve the authenticity of the generator. In MolGAN,
the loss function L2nd is defined as the generator tries to propose a faked graph along with its
feature matrix while the discriminator aims to distinguish
n
X the faked sample from the empirical data. Additionally
L2nd = ||(x̂i − xi ) bi ||2 (31)
a reward network is introduced in parallel with the dis-
i=1
criminator to encourage the generated graphs to possess
The role of vector bi is to penalize non-zero elements more certain properties according to an external evaluator. The
than zero elements since the inputs are highly sparse. In framework of MolGAN is described in Fig 9.
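To make the PPMI definition in Eq. (29) of Section 5.2.2 concrete, the following is a minimal sketch in plain Python. It rests on assumptions that the survey does not spell out: co-occurrence is counted inside a sliding window over each random walk, and count(v) is taken as the row (or column) marginal of the co-occurrence counts. The function names and the `window` parameter are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(walks, window=1):
    """Count ordered pairs (v1, v2) that lie within `window` steps in a walk."""
    counts = Counter()
    for walk in walks:
        for i, v1 in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(v1, walk[j])] += 1
    return counts


def ppmi(counts):
    """Apply Eq. (29) to every observed pair; unobserved pairs are implicitly 0."""
    total = sum(counts.values())          # |D|
    row, col = Counter(), Counter()       # assumed marginals count(v1), count(v2)
    for (v1, v2), c in counts.items():
        row[v1] += c
        col[v2] += c
    return {
        (v1, v2): max(math.log(c * total / (row[v1] * col[v2])), 0.0)
        for (v1, v2), c in counts.items()
    }


if __name__ == "__main__":
    walks = [["a", "b", "c"], ["a", "b"]]
    for pair, value in sorted(ppmi(cooccurrence_counts(walks)).items()):
        print(pair, round(value, 3))
```

Only observed pairs are stored, so the result stays sparse; the clipping at zero is what makes the matrix "positive".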
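The SDNE objective of Eqs. (30) to (32) in Section 5.2.2 can be evaluated for a small graph as in the sketch below. This is an illustration, not the SDNE authors' implementation: it assumes a binary adjacency matrix given as nested lists, takes the hidden representations and reconstructions as already computed, and accepts the value of the regularization term as a plain number. All names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sdne_loss(A, H, X_hat, alpha, beta, lam=0.0, l_reg=0.0):
    """Return (L_1st, L_2nd, L) for a binary adjacency matrix A.

    A     : n x n adjacency matrix (lists of 0/1); row i is the input x_i
    H     : n hidden representations h_i (lists of floats)
    X_hat : n reconstructed rows
    l_reg : value of the L2 regularization term, computed elsewhere (assumption)
    """
    n = len(A)
    # Eq. (30): pull the representations of adjacent nodes together.
    l_1st = sum(A[i][j] * squared_distance(H[i], H[j])
                for i in range(n) for j in range(n))
    # Eq. (31): reconstruction error weighted by b_ij (beta where A_ij = 1, else 1).
    l_2nd = 0.0
    for i in range(n):
        for j in range(n):
            b_ij = beta if A[i][j] == 1 else 1.0
            l_2nd += ((X_hat[i][j] - A[i][j]) * b_ij) ** 2
    # Eq. (32): overall objective.
    return l_1st, l_2nd, l_2nd + alpha * l_1st + lam * l_reg


if __name__ == "__main__":
    A = [[0, 1], [1, 0]]
    H = [[0.0], [1.0]]
    X_hat = [[0.0, 0.5], [1.0, 0.0]]
    print(sdne_loss(A, H, X_hat, alpha=0.5, beta=2.0))
```

With beta greater than one, an error on an existing edge costs more than the same error on a missing edge, which is how the weighting counteracts the sparsity of the input rows.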

Fig. 9: Framework of MolGAN [66]. A generator first samples an initial vector from a standard normal distribution. Passing this initial vector through a neural network, the generator outputs a dense adjacency matrix A and a corresponding feature matrix X. Next, the generator produces a sampled discrete Ã and X̃ from categorical distributions based on A and X. Finally, GCN is used to derive a vector representation of the sampled graph. This graph representation is fed to two distinct neural networks, a discriminator and a reward network, each of which outputs a score between zero and one that is used as feedback to update the model parameters.
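The discretization step in the caption of Fig. 9 (drawing a discrete graph from the dense generator outputs) might look like the following sketch. It assumes that each node and each node pair carries its own categorical distribution, with edge type 0 meaning "no edge", and it uses ordinary categorical sampling; whatever differentiable relaxation MolGAN uses during training is not reproduced here. The function and argument names are hypothetical.

```python
import random


def sample_discrete_graph(edge_probs, node_probs, rng):
    """Sample integer node types and a symmetric matrix of integer edge types.

    edge_probs[i][j] : probabilities over edge types for the pair (i, j);
                       only entries with i < j are read (undirected graph)
    node_probs[i]    : probabilities over node types for node i
    rng              : a random.Random instance, for reproducibility
    """
    n = len(node_probs)
    node_types = [rng.choices(range(len(p)), weights=p)[0] for p in node_probs]
    edge_types = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probs[i][j]
            t = rng.choices(range(len(p)), weights=p)[0]
            edge_types[i][j] = edge_types[j][i] = t   # keep the matrix symmetric
    return edge_types, node_types


if __name__ == "__main__":
    rng = random.Random(0)
    node_probs = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
    pair = [0.6, 0.3, 0.1]                 # no edge, type-1 edge, type-2 edge
    edge_probs = [[pair] * 3 for _ in range(3)]
    print(sample_discrete_graph(edge_probs, node_probs, rng))
```

Sampling only the upper triangle and mirroring it guarantees an undirected result, which a dense per-entry sample would not.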

Deep Generative Models of Graphs (DGMG) [65] utilizes spatial-based graph convolution networks to obtain a hidden representation of an existing graph. The decision process of generating nodes and edges is conditioned on the resultant graph representation. Briefly, DGMG recursively proposes a node to a growing graph until a stopping criterion is met. In each step after adding a new node, DGMG repeatedly decides whether to add an edge to the added node until the decision turns to false. If the decision is true, it evaluates the probability distribution of connecting the newly added node to all existing nodes and samples one node from the probability distribution. After a new node and its connections are added to the existing graph, DGMG updates the graph representation again.

5.3.2 Miscellaneous Graph Generative Networks
GraphRNN [64] exploits deep graph generative models through two-level recurrent neural networks. The graph-level RNN adds a new node each time to a node sequence while the edge-level RNN produces a binary sequence indicating connections between the newly added node and previously generated nodes in the sequence. To linearize a graph into a sequence of nodes for training the graph-level RNN, GraphRNN adopts the breadth-first-search (BFS) strategy. To model the binary sequence for training the edge-level RNN, GraphRNN assumes a multivariate Bernoulli or conditional Bernoulli distribution.

NetGAN [67] combines LSTM [7] with Wasserstein GAN [102] to generate graphs from a random-walk-based approach. The GAN framework consists of two modules, a generator and a discriminator. The generator makes its best effort to generate plausible random walks through an LSTM network while the discriminator tries to distinguish faked random walks from the real ones. After training, a new graph is obtained by normalizing a co-occurrence matrix of nodes which occur in a set of random walks.

5.3.3 Summary
Evaluating generated graphs remains a difficult problem. Unlike synthesized images or audio, which can be directly assessed by human experts, the quality of generated graphs is difficult to inspect visually. MolGAN and DGMG make use of external knowledge to evaluate the validity of generated molecule graphs. GraphRNN and NetGAN evaluate generated graphs by graph statistics (e.g. node degrees). Whereas DGMG and GraphRNN generate nodes and edges sequentially, MolGAN and NetGAN generate nodes and edges jointly. According to [68], the disadvantage of the former approaches is that when graphs become large, modelling a long sequence is not realistic. The challenge of the latter approaches is that global properties of the graph are difficult to control. A recent approach [68] adopts a variational auto-encoder to generate a graph by proposing the adjacency matrix, imposing penalty terms to address validity constraints. However, as the output space of a graph with n nodes is $n^2$, none of these methods is scalable to large graphs.

5.4 Graph Spatial-Temporal Networks
Graph spatial-temporal networks capture spatial and temporal dependencies of a spatial-temporal graph simultaneously. Spatial-temporal graphs have a global graph structure with inputs to each node which change across time. For instance, in traffic networks, each sensor, taken as a node, continuously records the traffic speed of a certain road, and the edges of the traffic network are determined by the distance between pairs of sensors. The goal of graph spatial-temporal networks can be forecasting future node values or labels, or predicting spatial-temporal graph labels. Recent studies have explored the use of GCNs alone [72], a combination of GCNs with RNN [70] or CNN [71], and a recurrent architecture tailored to graph structures [73]. In the following, we introduce these methods.

5.4.1 GCN Based Graph Spatial-Temporal Networks
Diffusion Convolutional Recurrent Neural Network (DCRNN) [70] introduces diffusion convolution as graph convolution for capturing spatial dependency and uses a sequence-to-sequence architecture [92] with gated recurrent units (GRU) [79] to capture temporal dependency.
Diffusion convolution models a truncated diffusion process with forward and backward directions. Formally, the diffusion convolution is defined as

$$X_{:,p} \star_G f(\theta) = \sum_{k=0}^{K-1} \left(\theta_{k1} (D_O^{-1} A)^k + \theta_{k2} (D_I^{-1} A^T)^k\right) X_{:,p} \tag{34}$$

where $D_O$ is the out-degree matrix and $D_I$ is the in-degree matrix. To allow multiple input and output channels, DCRNN proposes a diffusion convolution layer, defined as

$$Z_{:,q} = \sigma\left(\sum_{p=1}^{P} X_{:,p} \star_G f(\Theta_{q,p,:,:})\right) \tag{35}$$

where $X \in R^{N \times P}$, $Z \in R^{N \times Q}$, $\Theta \in R^{Q \times P \times K \times 2}$, Q is the number of output channels and P is the number of input channels.

To capture temporal dependency, DCRNN processes the inputs of GRU using a diffusion convolution layer so that the recurrent unit simultaneously receives history information from the last time step and neighborhood information from graph convolution. The modified GRU in DCRNN is named the diffusion convolutional gated recurrent unit (DCGRU),

$$\begin{aligned}
r^{(t)} &= \mathrm{sigmoid}(\Theta_r \star_G [X^{(t)}, H^{(t-1)}] + b_r) \\
u^{(t)} &= \mathrm{sigmoid}(\Theta_u \star_G [X^{(t)}, H^{(t-1)}] + b_u) \\
C^{(t)} &= \tanh(\Theta_C \star_G [X^{(t)}, (r^{(t)} \odot H^{(t-1)})] + b_C) \\
H^{(t)} &= u^{(t)} \odot H^{(t-1)} + (1 - u^{(t)}) \odot C^{(t)}
\end{aligned} \tag{36}$$

To meet the demands of multi-step forecasting, DCRNN adopts a sequence-to-sequence architecture [92] where the recurrent unit is replaced by DCGRU.

CNN-GCN [71] interleaves 1D-CNN with GCN [14] to learn spatial-temporal graph data. For an input tensor $X \in R^{T \times N \times D}$, the 1D-CNN layer slides over $X_{[:,i,:]}$ along the time axis to aggregate temporal information for each node while the GCN layer operates on $X_{[i,:,:]}$ to aggregate spatial information at each time step. The output layer is a linear transformation, generating a prediction for each node. The framework of CNN-GCN is depicted in Fig. 5c.

Spatial Temporal GCN (ST-GCN) [72] adopts a different approach by extending the temporal flow as graph edges so that spatial and temporal information can be extracted using a unified GCN model at the same time. ST-GCN defines a labelling function to assign a label to each edge of the graph according to the distance of the two related nodes. In this way, the adjacency matrix can be represented as a summation of K adjacency matrices where K is the number of labels. Then ST-GCN applies GCN [14] with a different weight matrix to each of the K adjacency matrices and sums them.

$$f_{out} = \sum_{j} \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} f_{in} W_j \tag{37}$$

5.4.2 Miscellaneous Variants
Structural-RNN. Jain et al. [73] propose a recurrent structured framework named Structural-RNN. The aim of Structural-RNN is to predict node labels at each time step. Structural-RNN comprises two kinds of RNNs, namely nodeRNN and edgeRNN. The temporal information of each node and each edge is passed through a nodeRNN and an edgeRNN respectively. Since assuming different RNNs for different nodes and edges increases model complexity dramatically, they instead split nodes and edges into semantic groups. For example, a human-object interaction graph consists of two groups of nodes, human nodes and object nodes, and three groups of edges, human-human edges, object-object edges, and human-object edges. Nodes or edges in the same semantic group share the same RNN model. To incorporate the spatial information, a nodeRNN will take the outputs of edgeRNNs as inputs.

5.4.3 Summary
The advantage of DCRNN is that it is able to handle long-term dependencies because of the recurrent network architectures. Though simpler than DCRNN, CNN-GCN processes spatial-temporal graphs more efficiently owing to the fast implementation of 1D CNN. ST-GCN considers temporal flow as graph edges, resulting in the size of the adjacency matrix growing quadratically. On the one hand, it increases the computation cost of the graph convolution layer. On the other hand, to capture the long-term dependency, the graph convolution layer has to be stacked many times. Structural-RNN improves model efficiency by sharing the same RNN within the same semantic group. However, Structural-RNN demands human prior knowledge to split the semantic groups.

6 APPLICATIONS
Graph neural networks have a wide variety of applications. In this section, we first summarize the benchmark datasets frequently used in the literature. Then we report the benchmark performance on four commonly used datasets and list the available open source implementations of graph neural networks. Finally, we provide practical applications of graph neural networks in various domains.

6.1 Datasets
In our survey, we count the frequency of each dataset which occurs in the papers reviewed in this work, and report in Table 5 the datasets which occur at least twice.

Citation Networks consist of papers, authors and their relationships such as citation, authorship and co-authorship. Although citation networks are directed graphs, they are often treated as undirected graphs in evaluating model performance with respect to node classification, link prediction, and node clustering tasks. There are three popular datasets for paper-citation networks: Cora, Citeseer and Pubmed. The Cora dataset contains 2708 machine learning publications grouped into seven classes. The Citeseer dataset contains 3327 scientific papers grouped into six classes. Each paper in Cora and Citeseer is represented by a one-hot vector indicating the presence or absence of a word from a dictionary. The Pubmed dataset contains 19717 diabetes-related publications. Each paper in Pubmed is represented by a term frequency-inverse document frequency (TF-IDF) vector. Furthermore, DBLP is a large citation dataset with millions of papers and authors which have been collected from computer science bibliographies. The raw dataset of

DBLP can be found on https://dblp.uni-trier.de. A processed version of the DBLP paper-citation network is updated continuously by https://aminer.org/citation.

Social Networks are formed by user interactions from online services such as BlogCatalog, Reddit, and Epinions. The BlogCatalog dataset is a social network which consists of bloggers and their social relationships. The labels of bloggers represent their personal interests. The Reddit dataset is an undirected graph formed by posts collected from the Reddit discussion forum. Two posts are linked if they contain comments by the same user. Each post has a label indicating the community to which it belongs. The Epinions dataset is a multi-relation graph collected from an online product review website where commenters can have more than one type of relation, such as trust, distrust, co-review, and co-rating.

Chemical/Biological Graphs. Chemical molecules and compounds can be represented by chemical graphs with atoms as nodes and chemical bonds as edges. This category of graphs is often used to evaluate graph classification performance. The NCI-1 and NCI-109 datasets contain 4100 and 4127 chemical compounds respectively, labeled as to whether they are active in hindering the growth of human cancer cell lines. The MUTAG dataset contains 188 nitro compounds, labeled as to whether they are aromatic or heteroaromatic. The D&D dataset contains 1178 protein structures, labeled as to whether they are enzymes or non-enzymes. The QM9 dataset contains 133885 molecules labeled with 13 chemical properties. The Tox21 dataset contains 12707 chemical compounds labeled with 12 types of toxicity. Another important dataset is the Protein-Protein Interaction (PPI) network. It contains 24 biological graphs with nodes represented by proteins and edges represented by the interactions between proteins. In PPI, each graph is associated with a human tissue. Each node is labeled with its biological states.

Unstructured Graphs. To test the generalization of graph neural networks to unstructured data, the k-nearest-neighbor graph (k-NN graph) has been widely used. The MNIST dataset contains 70000 images of size 28×28 labeled with 10 digits. A typical way to convert an MNIST image to a graph is to construct an 8-NN graph based on its pixel locations. The Wikipedia dataset is a word co-occurrence network extracted from the first million bytes of the Wikipedia dump. Labels of words represent part-of-speech (POS) tags. The 20-NewsGroup dataset consists of around 20,000 News Group (NG) text documents categorized by 20 news types. The graph of the 20-NewsGroup is constructed by representing each document as a node and using the similarities between nodes as edge weights.

Others. There are several other datasets worth mentioning. METR-LA is a traffic dataset collected from the highways of Los Angeles County. The MovieLens-1M dataset from the MovieLens website contains 1 million item ratings given by 6k users. It is a benchmark dataset for recommender systems. The NELL dataset is a knowledge graph obtained from the Never-Ending Language Learning project. It consists of facts, each represented by a triplet which involves two entities and their relation.

6.2 Benchmarks & Open-source Implementations

Of the datasets listed in Table 5, Cora, Pubmed, Citeseer, and PPI are the most frequently used. They are often used to compare the performance of graph convolution networks in node classification tasks. In Table 6, we report the benchmark performance on these four datasets, all of which use standard data splits. Open-source implementations facilitate the work of baseline experiments in deep learning research. Due to the vast number of hyperparameters, it is difficult to achieve the same results as reported in the literature without using published code. In Table 7, we provide the hyperlinks of open-source implementations of the graph neural network models reviewed in Sections 4-5. Notably, Fey et al. [86] published a geometric learning library in PyTorch named PyTorch Geometric³, which implements several graph neural networks including ChebNet [12], 1stChebNet [14], GraphSage [24], MPNNs [13], GAT [15] and SplineCNN [86]. Most recently, the Deep Graph Library (DGL)⁴ was released, which provides a fast implementation of many graph neural networks with a set of functions on top of popular deep learning platforms such as PyTorch and MXNet.

6.3 Practical Applications
Graph neural networks have a wide range of applications across different tasks and domains. Besides the general tasks in which each category of GNNs specializes, including node classification, node representation learning, graph classification, graph generation, and spatial-temporal forecasting, GNNs can also be applied to node clustering, link prediction [119], and graph partition [120]. In this section, we mainly introduce practical applications according to the general domains to which they belong.

Computer Vision. One of the biggest application areas for graph neural networks is computer vision. Researchers have explored leveraging graph structures in scene graph generation, point cloud classification and segmentation, action recognition and many other directions.

In scene graph generation, semantic relationships between objects facilitate the understanding of the semantic meaning behind a visual scene. Given an image, scene graph generation models detect and recognize objects and predict semantic relationships between pairs of objects [121], [122], [123]. Another application inverts the process by generating realistic images given scene graphs [124]. As natural language can be parsed as semantic graphs where each word represents an object, it is a promising solution to synthesize images given textual descriptions.

In point cloud classification and segmentation, a point cloud is a set of 3D points recorded by LiDAR scans. Solutions for this task enable LiDAR devices to see the surrounding environment, which is typically beneficial for unmanned vehicles. To identify objects depicted by point clouds, [125], [126], [127] convert point clouds into k-nearest neighbor graphs or superpoint graphs, and use graph convolution networks to explore the topological structure.

3. https://github.com/rusty1s/pytorch_geometric
4. https://www.dgl.ai/
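The conversion of an image into an 8-NN graph over pixel locations, mentioned for MNIST in Section 6.1, can be sketched as follows. This is a brute-force illustration in plain Python, not the preprocessing used by any particular paper: pixels are indexed row by row, distances are Euclidean on the grid, and ties are broken by the lower pixel index (an arbitrary assumption). The function name is hypothetical.

```python
import heapq


def grid_knn_graph(height, width, k=8):
    """Map each pixel index to the indices of its k nearest pixels on the grid."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    neighbors = {}
    for idx, (r, c) in enumerate(coords):
        # (squared distance, pixel index) pairs; tuple order breaks ties by index.
        candidates = (
            ((r - r2) ** 2 + (c - c2) ** 2, j)
            for j, (r2, c2) in enumerate(coords)
            if j != idx
        )
        neighbors[idx] = [j for _, j in heapq.nsmallest(k, candidates)]
    return neighbors


if __name__ == "__main__":
    graph = grid_knn_graph(28, 28, k=8)   # one node per MNIST pixel
    center = 14 * 28 + 14
    print(len(graph), sorted(graph[center]))
```

For interior pixels the eight neighbors are simply the surrounding pixels, while border pixels reach further into the image. The resulting neighbor lists are directed, so a symmetrization step would be needed if an undirected graph is required.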

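TABLE 5: Summary of Commonly Used Datasets

| Category | Dataset | Source | # Graphs | # Nodes | # Edges | # Features | # Labels | Citation |
|---|---|---|---|---|---|---|---|---|
| Citation Networks | Cora | [103] | 1 | 2708 | 5429 | 1433 | 7 | [14], [15], [23], [27], [45], [44], [46], [49], [58], [59], [61], [104] |
| Citation Networks | Citeseer | [103] | 1 | 3327 | 4732 | 3703 | 6 | [14], [15], [27], [46], [49], [58], [59], [61] |
| Citation Networks | Pubmed | [103] | 1 | 19717 | 44338 | 500 | 3 | [14], [15], [27], [44], [45], [48], [49], [59], [61], [67] |
| Citation Networks | DBLP | dblp.uni-trier.de [105] (aminer.org/citation) | 1 | - | - | - | - | [62], [67], [104], [106] |
| Social Networks | BlogCatalog | [107] | 1 | 10312 | 333983 | - | 39 | [42], [48], [62], [108] |
| Social Networks | Reddit | [24] | 1 | 232965 | 11606919 | 602 | 41 | [24], [28], [45], [46] |
| Social Networks | Epinions | www.epinions.com | 1 | - | - | - | - | [50], [106] |
| Chemical/Biological Graphs | PPI | [109] | 24 | 56944 | 818716 | 50 | 121 | [15], [19], [24], [27], [28], [46], [48], [62] |
| Chemical/Biological Graphs | NCI-1 | [110] | 4100 | - | - | 37 | 2 | [26], [44], [47], [52], [57] |
| Chemical/Biological Graphs | NCI-109 | [110] | 4127 | - | - | 38 | 2 | [26], [44], [52] |
| Chemical/Biological Graphs | MUTAG | [111] | 188 | - | - | 7 | 2 | [26], [44], [52] |
| Chemical/Biological Graphs | D&D | [112] | 1178 | - | - | - | 2 | [26], [47], [52] |
| Chemical/Biological Graphs | QM9 | [113] | 133885 | - | - | - | 13 | [13], [66] |
| Chemical/Biological Graphs | tox21 | tripod.nih.gov/tox21/challenge/ | 12707 | - | - | - | 12 | [22], [53] |
| Unstructured Graphs | MNIST | yann.lecun.com/exdb/mnist/ | 70000 | - | - | - | 10 | [12], [20], [23], [52] |
| Unstructured Graphs | Wikipedia | www.mattmahoney.net/dc/textdata | 1 | 4777 | 184812 | - | 40 | [62], [108] |
| Unstructured Graphs | 20NEWS | [114] | 1 | 18846 | - | - | 20 | [12], [41] |
| Others | METR-LA | [115] | - | - | - | - | - | [28], [70] |
| Others | Movie-Lens1M | [116] grouplens.org/datasets/movielens/1m/ | 1 | 10000 | 1 Million | - | - | [23], [108] |
| Others | Nell | [117] | 1 | 65755 | 266144 | 61278 | 210 | [14], [46], [49] |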

TABLE 6: Benchmark performance of four most frequently used datasets. The listed methods use the same training, validation, and test data for evaluation.

| Method | Cora | Citeseer | Pubmed | PPI |
|---|---|---|---|---|
| 1stChebnet (2016) [14] | 81.5 | 70.3 | 79.0 | - |
| GraphSage (2017) [24] | - | - | - | 61.2 |
| GAT (2017) [15] | 83.0±0.7 | 72.5±0.7 | 79.0±0.3 | 97.3±0.2 |
| Cayleynets (2017) [23] | 81.9±0.7 | - | - | - |
| StoGCN (2018) [46] | 82.0±0.8 | 70.9±0.2 | 79±0.4 | 97.9±.04 |
| DualGCN (2018) [49] | 83.5 | 72.6 | 80.0 | - |
| GAAN (2018) [28] | - | - | - | 98.71±0.02 |
| GraphInfoMax (2018) [118] | 82.3±0.6 | 71.8±0.7 | 76.8±0.6 | 63.8±0.2 |
| GeniePath (2018) [48] | - | - | 78.5 | 97.9 |
| LGCN (2018) [27] | 83.3±0.5 | 73.0±0.6 | 79.5±0.2 | 77.2±0.2 |
| SSE (2018) [19] | - | - | - | 83.6 |

In action recognition, recognizing human actions contained in videos facilitates a better understanding of video content from a machine's perspective. One group of solutions detects the locations of human joints in video clips. Human joints which are linked by skeletons naturally form a graph. Given the time series of human joint locations, [72], [73] apply spatial-temporal neural networks to learn human action patterns.

In addition, the number of possible directions in which to apply graph neural networks in computer vision is still growing. This includes few-shot image classification [128], [129], semantic segmentation [130], [131], visual reasoning [132] and question answering [133].

Recommender Systems. Graph-based recommender systems take items and users as nodes. By leveraging the relations between items and items, users and users, users and items, as well as content information, graph-based recommender systems are able to produce high-quality recommendations. The key to a recommender system is to score the importance of an item to a user. As a result, it can be cast as a link prediction problem. The goal is to predict the missing links between users and items. To address this problem, Van et al. [9] and Ying et al. [11] propose a GCN-based graph auto-encoder. Monti et al. [10] combine GCN and RNN to learn the underlying process that generates the known ratings.

Traffic. Traffic congestion has become a hot social issue in modern cities. Accurately forecasting traffic speed, volume or the density of roads in traffic networks is fundamentally important in route planning and flow control. [28], [70], [71], [134] adopt a graph-based approach with spatial-temporal neural networks. The input to their models is a spatial-temporal graph. In this spatial-temporal graph, nodes are represented by sensors placed on roads, edges are defined by thresholding the pairwise distances between nodes, and each node contains a time series as features. The goal is to forecast the average speed of a road within a time interval. Another interesting application is taxi-demand prediction. This greatly helps intelligent transportation systems make use of resources and save energy effectively. Given historical taxi demands, location information, weather data, and event features, Yao et al. [135] incorporate LSTM, CNN and node embeddings trained by LINE [136] to form a joint representation for each location to predict the number of taxis demanded for a location within a time interval.

Chemistry. In chemistry, researchers apply graph neural

TABLE 7: A Summary of Open-source Implementations

| Model | Framework | Github Link |
|---|---|---|
| ChebNet (2016) [12] | tensorflow | https://github.com/mdeff/cnn_graph |
| 1stChebNet (2017) [14] | tensorflow | https://github.com/tkipf/gcn |
| GGNNs (2015) [18] | lua | https://github.com/yujiali/ggnn |
| SSE (2018) [19] | c | https://github.com/Hanjun-Dai/steady_state_embedding |
| GraphSage (2017) [24] | tensorflow | https://github.com/williamleif/GraphSAGE |
| LGCN (2018) [27] | tensorflow | https://github.com/divelab/lgcn/ |
| SplineCNN (2018) [86] | pytorch | https://github.com/rusty1s/pytorch_geometric |
| GAT (2017) [15] | tensorflow | https://github.com/PetarV-/GAT |
| GAE (2016) [59] | tensorflow | https://github.com/limaosen0/Variational-Graph-Auto-Encoders |
| ARGA (2018) [61] | tensorflow | https://github.com/Ruiqi-Hu/ARGA |
| DNGR (2016) [41] | matlab | https://github.com/ShelsonCao/DNGR |
| SDNE (2016) [42] | python | https://github.com/suanrong/SDNE |
| DRNE (2016) [63] | tensorflow | https://github.com/tadpole/DRNE |
| GraphRNN (2018) [64] | tensorflow | https://github.com/snap-stanford/GraphRNN |
| DCRNN (2018) [70] | tensorflow | https://github.com/liyaguang/DCRNN |
| CNN-GCN (2017) [71] | tensorflow | https://github.com/VeritasYin/STGCN_IJCAI-18 |
| ST-GCN (2018) [72] | pytorch | https://github.com/yysijie/st-gcn |
| Structural RNN (2016) [73] | theano | https://github.com/asheshjain399/RNNexp |

networks to study the graph structures of molecules. In a molecular graph, atoms function as nodes and chemical bonds function as edges. Node classification, graph classification and graph generation are three main tasks targeting molecular graphs in order to learn molecular fingerprints [53], [80], to predict molecular properties [13], to infer protein interfaces [137], and to synthesize chemical compounds [65], [66], [138].

Others. There have been initial explorations into applying GNNs to other problems such as program verification [18], program reasoning [139], social influence prediction [140], adversarial attack prevention [141], electronic health records modeling [142], [143], event detection [144] and combinatorial optimization [145].

7 FUTURE DIRECTIONS

Though graph neural networks have proven their power in learning graph data, challenges still exist due to the complexity of graphs. In this section, we provide four future directions of graph neural networks.

Go Deep. The success of deep learning lies in deep neural architectures. In image classification, for example, an outstanding model named ResNet [146] has 152 layers. However, when it comes to graphs, experimental studies have shown that with the increase in the number of layers, the model performance drops dramatically [147]. According to [147], this is due to the effect of graph convolutions, which essentially push representations of adjacent nodes closer to each other so that, in theory, with an infinite number of convolutions, all nodes' representations will converge to a single point. This raises the question of whether going deep is still a good strategy for learning graph-structured data.

Receptive Field. The receptive field of a node refers to a set of nodes including the central node and its neighbors. The number of neighbors of a node follows a power law distribution. Some nodes may only have one neighbor, while other nodes may have thousands of neighbors. Though sampling strategies have been adopted [24], [26], [27], how to select a representative receptive field of a node remains to be explored.

Scalability. Most graph neural networks do not scale well for large graphs. The main reason for this is that when stacking multiple layers of a graph convolution, a node's final state involves a large number of its neighbors' hidden states, leading to high complexity of backpropagation. While several approaches try to improve their model efficiency by fast sampling [45], [46] and sub-graph training [24], [27], they are still not scalable enough to handle deep architectures with large graphs.

Dynamics and Heterogeneity. The majority of current graph neural networks tackle static homogeneous graphs. On the one hand, graph structures are assumed to be fixed. On the other hand, nodes and edges from a graph are assumed to come from a single source. However, these two assumptions are not realistic in many scenarios. In a social network, a new person may enter the network at any time and an existing person may quit the network as well. In a recommender system, products may have different types where their inputs may have different forms such as texts or images. Therefore, new methods should be developed to handle dynamic and heterogeneous graph structures.

8 CONCLUSION

In this survey, we conduct a comprehensive overview of graph neural networks. We provide a taxonomy which groups graph neural networks into five categories: graph convolutional networks, graph attention networks, graph autoencoders, graph generative networks and graph spatial-temporal networks. We provide a thorough review, comparisons, and summarizations of the methods within or between categories. Then we introduce a wide range of applications of graph neural networks. Datasets, open source codes, and benchmarks for graph neural networks are summarized. Finally, we suggest four future directions for graph neural networks.

ACKNOWLEDGMENT
This research was funded by the Australian Government through the Australian Research Council (ARC) under grants 1) LP160100630 partnership with Australia Government Department of Health and 2) LP150100671 partnership

with Australia Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA). We acknowledge the support of NVIDIA Corporation and MakeMagic Australia with the donation of GPU used for this research.

REFERENCES
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[3] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[4] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[6] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[9] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” stat, vol. 1050, p. 7, 2017.
[10] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 3697–3707.
[11] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 974–983.
[12] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[13] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 1263–1272.
[14] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proceedings of the International Conference on Learning Representations, 2017.
[15] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in Proceedings of the International Conference on Learning Representations, 2017.
[16] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings of the International Joint Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 729–734.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[18] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in Proceedings of the International Conference on Learning Representations, 2015.
[19] H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song, “Learning steady-states of iterative algorithms over graphs,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 1114–1122.
[20] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proceedings of International Conference on Learning Representations, 2014.
[21] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[22] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolutional neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 3546–3553.
[23] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, “Cayleynets: Graph convolutional neural networks with complex rational spectral filters,” arXiv preprint arXiv:1705.07664, 2017.
[24] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[25] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[26] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 2014–2023.
[27] H. Gao, Z. Wang, and S. Ji, “Large-scale learnable graph convolutional networks,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1416–1424.
[28] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “Gaan: Gated attention networks for learning on large and spatiotemporal graphs,” in Proceedings of the Uncertainty in Artificial Intelligence, 2018.
[29] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
[30] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention models in graphs: A survey,” arXiv preprint arXiv:1807.07984, 2018.
[31] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” arXiv preprint arXiv:1812.04202, 2018.
[32] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” IEEE Transactions on Knowledge and Data Engineering, 2017.
[33] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[34] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” IEEE Transactions on Big Data, 2018.
[35] H. Cai, V. W. Zheng, and K. Chang, “A comprehensive survey of graph embedding: problems, techniques and applications,” IEEE Transactions on Knowledge and Data Engineering, 2018.
[36] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[37] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” in Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 2016, pp. 1895–1901.
[38] X. Shen, S. Pan, W. Liu, Y.-S. Ong, and Q.-S. Sun, “Discrete network embedding,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2018, pp. 3549–3555.
[39] H. Yang, S. Pan, P. Zhang, L. Chen, D. Lian, and C. Zhang, “Binarized attributed network embedding,” in IEEE International Conference on Data Mining. IEEE, 2018.
[40] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
[41] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 1145–1152.
[42] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1225–1234.
[43] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using lanczos method,” arXiv preprint arXiv:1509.04537, 2015.
[44] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
[45] J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph convolutional networks via importance sampling,” in Proceedings of the International Conference on Learning Representations, 2018.
[46] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 941–949.
[47] F. P. Such, S. Sah, M. A. Dominguez, S. Pillai, C. Zhang, A. Michael, N. D. Cahill, and R. Ptucha, “Robust spatial filtering with graph convolutional neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 884–896, 2017.
[48] Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, and L. Song, “Geniepath: Graph neural networks with adaptive receptive paths,” arXiv preprint arXiv:1802.00910, 2018.
[49] C. Zhuang and Q. Ma, “Dual graph convolutional networks for graph-based semi-supervised classification,” in Proceedings of the World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018, pp. 499–508.
[50] T. Derr, Y. Ma, and J. Tang, “Signed graph convolutional network,” arXiv preprint arXiv:1808.06354, 2018.
[51] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh, “Column networks for collective classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 2485–2491.
[52] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
[53] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, “Molecular graph convolutions: moving beyond fingerprints,” Journal of computer-aided molecular design, vol. 30, no. 8, pp. 595–608, 2016.
[54] W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling towards fast graph representation learning,” in Advances in Neural Information Processing Systems, 2018, pp. 4563–4572.
[55] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[56] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in Advances in Neural Information Processing Systems, 2018, pp. 4801–4811.
[57] J. B. Lee, R. Rossi, and X. Kong, “Graph classification using structural attention,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1666–1674.
[58] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. A. Alemi, “Watch your step: Learning node embeddings via graph attention,” in Advances in Neural Information Processing Systems, 2018, pp. 9197–9207.
[59] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
[60] C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang, “Mgae: Marginalized graph autoencoder for graph clustering,” in Proceedings of the ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 889–898.
[61] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang, “Adversarially regularized graph autoencoder for graph embedding,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2018, pp. 2609–2615.
[62] W. Yu, C. Zheng, W. Cheng, C. C. Aggarwal, D. Song, B. Zong, H. Chen, and W. Wang, “Learning deep network representations with adversarially regularized autoencoders,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2663–2671.
[63] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu, “Deep recursive network embedding with regular equivalence,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 2357–2366.
[64] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “Graphrnn: A deep generative model for graphs,” Proceedings of International Conference on Machine Learning, 2018.
[65] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning deep generative models of graphs,” in Proceedings of the International Conference on Machine Learning, 2018.
[66] N. De Cao and T. Kipf, “Molgan: An implicit generative model for small molecular graphs,” arXiv preprint arXiv:1805.11973, 2018.
[67] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann, “Netgan: Generating graphs via random walks,” in Proceedings of the International Conference on Machine Learning, 2018.
[68] T. Ma, J. Chen, and C. Xiao, “Constrained generation of semantically valid graphs via regularizing variational autoencoders,” in Advances in Neural Information Processing Systems, 2018, pp. 7110–7121.
[69] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” arXiv preprint arXiv:1612.07659, 2016.
[70] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in Proceedings of International Conference on Learning Representations, 2018.
[71] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2017, pp. 3634–3640.
[72] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[73] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5308–5317.
[74] S. Pan, J. Wu, X. Zhu, C. Zhang, and P. S. Yu, “Joint structure feature exploration and regularization for multi-task graph classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 715–728, 2016.
[75] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Task sensitive feature exploration and learning for multitask graph classification,” IEEE transactions on cybernetics, vol. 47, no. 3, pp. 744–758, 2017.
[76] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
[77] L. B. Almeida, “A learning rule for asynchronous perceptrons with feedback in a combinatorial environment,” in Proceedings of the International Conference on Neural Networks, vol. 2. IEEE, 1987, pp. 609–618.
[78] F. J. Pineda, “Generalization of back-propagation to recurrent neural networks,” Physical review letters, vol. 59, no. 19, p. 2229, 1987.
[79] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.
[80] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, 2015, pp. 2224–2232.
[81] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature communications, vol. 8, p. 13890, 2017.
[82] B. Weisfeiler and A. Lehman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,” Nauchno-Technicheskaya Informatsia, vol. 2, no. 9, pp. 12–16, 1968.
[83] B. L. Douglas, “The weisfeiler-lehman method and graph isomorphism testing,” arXiv preprint arXiv:1101.5211, 2011.
[84] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, “Geodesic convolutional neural networks on riemannian manifolds,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 37–45.
[85] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein, “Learning shape correspondence with anisotropic convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 3189–3197.
[86] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learning with continuous b-spline kernels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 869–877.
[87] S. Pan, J. Wu, and X. Zhu, “Cogboost: Boosting for fast cost-sensitive graph classification,” IEEE Transactions on Knowledge & Data Engineering, no. 1, pp. 1–1, 2015.
[88] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks,” arXiv preprint arXiv:1810.00826, 2018.
[89] S. Verma and Z.-L. Zhang, “Graph capsule convolutional neural networks,” arXiv preprint arXiv:1805.08090, 2018.
[90] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[91] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[92] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[93] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the international conference on Machine learning. ACM, 2008, pp. 1096–1103.
[94] G. L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. C. Farias, and A. Aspuru-Guzik, “Objective-reinforced generative adversarial networks (organ) for sequence generation models,” arXiv preprint arXiv:1705.10843, 2017.
[95] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, “Grammar variational autoencoder,” arXiv preprint arXiv:1703.01925, 2017.
[96] H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song, “Syntax-directed variational autoencoder for molecule generation,” in Proceedings of the International Conference on Learning Representations, 2018.
[97] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, “Automatic chemical design using a data-driven continuous representation of molecules,” ACS central science, vol. 4, no. 2, pp. 268–276, 2018.
[98] B. Chen, L. Sun, and X. Han, “Sequence-to-action: End-to-end semantic graph generation for semantic parsing,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2018, pp. 766–777.
[99] D. D. Johnson, “Learning graphical state transitions,” in Proceedings of the International Conference on Learning Representations, 2016.
[100] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference. Springer, 2018, pp. 593–607.
[101] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[102] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
[103] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, p. 93, 2008.
[104] X. Zhang, Y. Li, D. Shen, and L. Carin, “Diffusion maps for textual network embedding,” in Advances in Neural Information Processing Systems, 2018.
[105] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “Arnetminer: extraction and mining of academic social networks,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 990–998.
[106] Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang, “Multi-dimensional graph convolutional networks,” arXiv preprint arXiv:1808.06099, 2018.
[107] L. Tang and H. Liu, “Relational learning via latent social dimensions,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 817–826.
[108] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo, “Graphgan: Graph representation learning with generative adversarial nets,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
[109] M. Zitnik and J. Leskovec, “Predicting multicellular function through multi-layer tissue networks,” Bioinformatics, vol. 33, no. 14, pp. i190–i198, 2017.
[110] N. Wale, I. A. Watson, and G. Karypis, “Comparison of descriptor spaces for chemical compound retrieval and classification,” Knowledge and Information Systems, vol. 14, no. 3, pp. 347–375, 2008.
[111] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, “Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity,” Journal of medicinal chemistry, vol. 34, no. 2, pp. 786–797, 1991.
[112] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures from non-enzymes without alignments,” Journal of molecular biology, vol. 330, no. 4, pp. 771–783, 2003.
[113] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific data, vol. 1, p. 140022, 2014.
[114] T. Joachims, “A probabilistic analysis of the rocchio algorithm with tfidf for text categorization,” Carnegie-mellon univ pittsburgh pa dept of computer science, Tech. Rep., 1996.
[115] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Communications of the ACM, vol. 57, no. 7, pp. 86–94, 2014.
[116] B. N. Miller, I. Albert, S. K. Lam, J. A. Konstan, and J. Riedl, “Movielens unplugged: experiences with an occasionally connected recommender system,” in Proceedings of the international conference on Intelligent user interfaces. ACM, 2003, pp. 263–266.
[117] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell, “Toward an architecture for never-ending language learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2010, pp. 1306–1313.
[118] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” arXiv preprint arXiv:1809.10341, 2018.
[119] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in Advances in Neural Information Processing Systems, 2018.
[120] T. Kawamoto, M. Tsubaki, and T. Obuchi, “Mean-field theory of graph neural networks in graph partitioning,” in Advances in Neural Information Processing Systems, 2018, pp. 4362–4372.
[121] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2017.
[122] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph r-cnn for scene graph generation,” in European Conference on Computer Vision. Springer, 2018, pp. 690–706.
[123] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang, “Factorizable net: an efficient subgraph-based framework for scene graph generation,” in European Conference on Computer Vision. Springer, 2018, pp. 346–363.
[124] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” arXiv preprint, 2018.
[125] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” arXiv preprint arXiv:1801.07829, 2018.
[126] L. Landrieu and M. Simonovsky, “Large-scale point cloud semantic segmentation with superpoint graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[127] G. Te, W. Hu, Z. Guo, and A. Zheng, “Rgcnn: Regularized graph cnn for point cloud segmentation,” arXiv preprint arXiv:1806.02952, 2018.
[128] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph neural networks,” in Proceedings of the International Conference on Learning Representations, 2018.
[129] M. Guo, E. Chou, D.-A. Huang, S. Song, S. Yeung, and L. Fei-Fei, “Neural graph matching networks for fewshot 3d action recognition,” in European Conference on Computer Vision. Springer, 2018, pp. 673–689.
[130] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural networks for rgbd semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5199–5208.
[131] L. Yi, H. Su, X. Guo, and L. J. Guibas, “Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6584–6592.
[132] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual reasoning beyond convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[133] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the box: Reasoning with graph convolution nets for factual visual question answering,” in Advances in Neural Information Processing Systems, 2018, pp. 2655–2666.
[134] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “High-order graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting,” arXiv preprint arXiv:1802.07007, 2018.
[135] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li, “Deep multi-view spatial-temporal network for taxi demand prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 2588–2595.
[136] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
[137] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, “Protein interface prediction using graph convolutional networks,” in Advances in Neural Information Processing Systems, 2017, pp. 6530–6539.
[138] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec, “Graph convolutional policy network for goal-directed molecular graph generation,” in Advances in Neural Information Processing Systems, 2018.
[139] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” in Proceedings of the International Conference on Learning Representations, 2017.
[140] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, “Deepinf: Social influence prediction with deep learning,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2110–2119.
[141] D. Zügner, A. Akbarnejad, and S. Günnemann, “Adversarial attacks on neural networks for graph data,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 2847–2856.
[142] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun, “Gram: graph-based attention model for healthcare representation learning,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 787–795.
[143] E. Choi, C. Xiao, W. Stewart, and J. Sun, “Mime: Multilevel medical embedding of electronic health records for predictive healthcare,” in Advances in Neural Information Processing Systems, 2018, pp. 4548–4558.
[144] T. H. Nguyen and R. Grishman, “Graph convolutional networks with argument-aware pooling for event detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 5900–5907.
[145] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization with graph convolutional networks and guided tree search,” in Advances in Neural Information Processing Systems, 2018, pp. 536–545.
[146] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[147] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Graph Neural Networks with convolutional ARMA filters

Filippo Maria Bianchi (1), Daniele Grattarola (2), Cesare Alippi (2, 3), Lorenzo Livi (4, 5)

(1) Machine Learning Group, UiT the Arctic University of Tromsø, Norway. (2) Faculty of Informatics, Università della Svizzera Italiana, Switzerland. (3) Dept. of Electronics, Information, and Bioengineering, Politecnico di Milano, Italy. (4) Dept. of Computer Science and Mathematics, University of Manitoba, Canada. (5) Dept. of Computer Science, University of Exeter, United Kingdom. Correspondence to: Filippo Maria Bianchi <filippo.m.bianchi@uit.no>.

arXiv:1901.01343v2 [cs.LG] 14 Jan 2019

Abstract

Recent graph neural networks implement convolutional layers based on polynomial filters operating in the spectral domain. In this paper, we propose a novel graph convolutional layer based on auto-regressive moving average (ARMA) filters that, compared to the polynomial ones, provides a more flexible response thanks to a rich transfer function that accounts for the concept of state. We implement the ARMA filter with a recursive and distributed formulation, obtaining a convolutional layer that is efficient to train, is localized in the node space and can be applied to graphs with different topologies. In order to learn more abstract and compressed representations in deeper layers of the network, we alternate pooling operations based on node decimation with convolutions on coarsened versions of the original graph. We consider three major graph inference problems: semi-supervised node classification, graph classification, and graph signal classification. Results show that the proposed graph neural network with ARMA filters outperforms those based on polynomial filters and sets the new state of the art in several tasks.

1. Introduction

Several deep learning architectures have been proposed to process data represented as graphs. The well-established Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012) convolve an input tensor with a small trainable kernel of the same rank, applied to fixed-size volumes. Such a bias yields locality and translation invariance in space, which works well for regular grids, but prevents CNNs from capturing the variability of a graph structure. Therefore, to apply CNNs on graphs, different approaches have been proposed to modify the convolution operations (Atwood & Towsley, 2016; Monti et al., 2017; Fey et al., 2018) or to locally approximate a graph with a regular structure before applying the traditional spatial convolution (Niepert et al., 2016; Zhang et al., 2018).

Graph Neural Networks (GNNs) constitute a class of recently developed tools lying at the intersection between deep learning and methods for structured data, which perform inference on discrete objects (assigned to nodes) by accounting for arbitrary relationships (edges) among them (Battaglia et al., 2018). A GNN combines node features within local neighborhoods on the graph to learn node or graph embeddings (Perozzi et al., 2014; Duvenaud et al., 2015; Yang et al., 2016; Hamilton et al., 2017; Bacciu et al., 2018), or to directly perform inference tasks by mapping the node features into categorical labels or real values (Scarselli et al., 2009; Micheli, 2009).

Of particular interest for this work are those GNNs that implement a convolution operation in the spectral domain with a nonlinear trainable filter, which maps the node features into a new space (Bruna et al., 2013; Henaff et al., 2015). To avoid computing the expensive spectral decomposition and projection in the frequency domain, state-of-the-art GNNs approximate graph filters with finite order polynomials (Defferrard et al., 2016; Kipf & Welling, 2016a;b). Polynomial filters have a finite impulse response (FIR) and realize a weighted moving average (MA) filtering of graph signals (Tremblay et al., 2018). Since MA filters account only for a local neighbourhood of nodes, fast distributed implementations have been proposed based on Chebyshev polynomials and Lanczos iterations (Susnjara et al., 2015; Defferrard et al., 2016). Despite their attractive computational efficiency, FIR filters are sensitive to changes in the graph signal (an instance of the node features) or in the underlying graph structure (Isufi et al., 2016). Moreover, polynomial filters are very smooth and cannot model sharp changes in the frequency response (Tremblay et al., 2018). A more versatile class of filters is that of the Auto-Regressive Moving Average (ARMA) filters, which allow for a more accurate filter design and, in several cases, give exact rather than approximate solutions in modeling the desired response (Narang et al., 2013).
GNNs with convolutional ARMA filters

Contribution. In this paper, we address the limitations of existing graph convolutional layers in modeling a desired filter response and propose a GNN based on a novel ARMA layer. The ARMA layer implements a non-linear and trainable ARMA graph filter that generalizes the existing graph convolutional layers based on polynomial filters and provides the GNN with enhanced modeling capability, thanks to a flexible design of the filter transfer function. Contrary to polynomial filters, ARMA filters are not localized in the node space, making their implementation inefficient within a GNN. To address this scalability issue, the proposed ARMA layer relies on a recursive formulation, which leads to a fast and distributed implementation that exploits efficient sparse operations on tensors. The resulting filters are not learned in the Fourier space induced by a given Laplacian, but are local in the node space and independent of the underlying graph structure. This allows our GNN to process graphs with different topologies.

We use a node pooling procedure based on node decimation, which builds on the multi-resolution framework adopted in graph signal processing (Shuman et al., 2016). This allows us to build deep architectures that yield more abstract representations at different network depths. Given an input graph, node decimation drops approximately half of the nodes, and a coarsened version of the graph on the remaining ones is obtained through graph reduction. Pooling of different strides is implemented in the GNN by means of multiplications with pre-computed matrices.

To assess the performance of our GNN, we apply it to semi-supervised node classification, graph signal classification, and graph classification. Results show that the proposed GNN with ARMA filters outperforms GNNs based on polynomial filters, setting the new state of the art in several tasks.

2. Spectral filtering in GNNs
We assume a graph with $M$ nodes to be characterized by a symmetric adjacency matrix $A \in \mathbb{R}^{M \times M}$ and we refer to the graph signal $X \in \mathbb{R}^{M \times F}$ as the instance of all features (vectors in $\mathbb{R}^{F}$) associated with the graph nodes. Let $L = I_M - D^{-1/2} A D^{-1/2}$ be the symmetrically normalized Laplacian ($D$ is the degree matrix), with spectral decomposition $L = \sum_{m=1}^{M} \lambda_m u_m u_m^T$. A graph filter is a linear operator that modifies the components of $X$ on the eigenvector basis of $L$, according to a transfer function $h$ acting on each eigenvalue $\lambda$. The filtered graph signal reads

$$\bar{X} = \sum_{m=1}^{M} h(\lambda_m) u_m u_m^T X = U \, \mathrm{diag}[h(\lambda_1), \dots, h(\lambda_M)] \, U^T X. \qquad (1)$$

This formulation inspired the seminal work of Bruna et al. (2013), which implemented spectral graph convolutions in a neural network. Their GNN learns end-to-end the parameters of each filter, implemented as $h = Bc$, where $B \in \mathbb{R}^{M \times K}$ is a cubic B-spline basis and $c \in \mathbb{R}^{K}$ is a vector of control parameters. Those filters are not localized, since the full projection of the eigenvectors yields paths of infinite length and the filter accounts for interactions of each node with the whole graph, rather than those limited to the node neighborhood. Since this contrasts with the local design of classic convolutional filters, Henaff et al. (2015) introduced a parametrization of the spectral filters with smooth coefficients to achieve spatial localization. However, the main issue with the spectral filtering in (1) is the computational complexity: not only is the eigendecomposition of $L$ expensive, but a double product with $U$ must also be computed whenever the filter is applied. Notably, $U$ in (1) is full even when $L$ is sparse. Finally, the same filter cannot be applied to graphs with different structures, since it depends on a specific Laplacian spectrum.

2.1. Chebyshev polynomial filters
The desired transfer function $h(\lambda)$ can be approximated by a polynomial of order $K$,

$$h_{POLY}(\lambda) = \sum_{k=0}^{K} w_k \lambda^k, \qquad (2)$$

which performs a weighted MA of the graph signal (Tremblay et al., 2018). Polynomial filters are localized in space, since the output at each node in the filtered signal is a linear combination of the nodes in its $K$-hop neighbourhood. A localized filter overcomes an important limitation of spectral formulations relying on a fixed Laplacian spectrum, making it suitable also for inference tasks on graphs with different structures (Zhang et al., 2018).

Compared to conventional polynomials, Chebyshev polynomials attenuate unwanted oscillations around the cut-off frequencies (Shuman et al., 2011). Chebyshev polynomials are exploited to implement fast localized filters in a GNN: the eigendecomposition of the Laplacian is avoided by approximating the filter convolution with the Chebyshev expansion $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ (Defferrard et al., 2016). It follows that the convolutional layers perform the filtering operation

$$\bar{X} = \sigma\left( \sum_{k=0}^{K-1} T_k(\tilde{L}) X W_k \right), \qquad (3)$$

where $\tilde{L} = 2L/\lambda_{max} - I_M$, $\sigma$ is a non-linear activation (e.g., ReLU), and $W_k \in \mathbb{R}^{F_{in} \times F_{out}}$ are the $K$ trainable weight matrices that map the node features from an input space of dimension $F_{in}$ to a new space of dimension $F_{out}$.
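The Chebyshev filtering operation in (3) can be sketched in a few lines of NumPy. This is only an illustrative sketch with dense matrices, not the implementation used in the experiments: the function names `normalized_laplacian` and `cheby_conv` are hypothetical, ReLU is assumed for $\sigma$, and $\lambda_{max}$ is assumed to be bounded by 2 (the upper bound of the normalized Laplacian spectrum) instead of being computed.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # isolated nodes keep a zero scaling
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def cheby_conv(A, X, weights, lam_max=2.0):
    """Eq. (3): ReLU(sum_k T_k(L_tilde) X W_k), with K = len(weights).

    lam_max=2.0 is an assumption (upper bound of the spectrum of L).
    """
    L_tilde = 2.0 * normalized_laplacian(A) / lam_max - np.eye(A.shape[0])
    T_prev, T_curr = X, L_tilde @ X              # T_0(L~) X and T_1(L~) X
    out = T_prev @ weights[0]
    if len(weights) > 1:
        out = out + T_curr @ weights[1]
    for W_k in weights[2:]:
        # Chebyshev recurrence applied directly to the filtered signal
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        out = out + T_curr @ W_k
    return np.maximum(out, 0.0)
```

Applying the recurrence to the signal $T_k(\tilde{L})X$ rather than to the matrix $T_k(\tilde{L})$ keeps every step a product between the Laplacian and a tall matrix, which is what makes the filter cheap when $\tilde{L}$ is stored in sparse form.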

2.2. First-order polynomial filters
A first-order polynomial filter is adopted by Kipf & Welling (2016a) to solve the task of semi-supervised node classification. They propose a GNN called Graph Convolutional Network (GCN), where the convolutional layer is a simplified version of Chebyshev filters,

$$\bar{X} = \sigma\left( \hat{A} X W \right). \qquad (4)$$

Their formulation is obtained from (3) by considering only $K = 1$ and setting $W = W_0 = -W_1$. Additionally, $\tilde{L}$ is replaced by $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_M$. Compared to $\tilde{L}$, $\hat{A}$ contains self-loops that compensate for the removal of the term of order 0 in the polynomial filter, ensuring that a node is part of its first-order neighbourhood and that its features are preserved after the convolution. The convolution with higher-order neighbourhoods can be obtained by stacking multiple layers. However, since each layer (4) performs a Laplacian smoothing, after a few convolutions the node features become too smooth over the graph (Li et al., 2018).

3. The ARMA graph convolutional layer
The polynomial filters discussed in the previous section are sensitive to changes in the graph signal or in the underlying graph structure, and their smoothness prevents them from modeling filter responses with sharp changes. Moreover, they have poor interpolation and extrapolation capability around the known graph frequencies (Isufi et al., 2016). On the other hand, an ARMA filter approximates the optimal $h$ better, thanks to a rational design that makes it possible to model a larger variety of filter shapes (Tremblay et al., 2018). The filter response of an ARMA(P, Q) reads

$$h_{ARMA}(\lambda) = \frac{\sum_{q=0}^{Q} b_q \lambda^q}{1 + \sum_{p=1}^{P} a_p \lambda^p}, \qquad (5)$$

which in the node domain translates to the filtering relation

$$\bar{X} = \frac{\sum_{q=0}^{Q} b_q L^q}{1 + \sum_{p=1}^{P} a_p L^p} X. \qquad (6)$$

The Laplacian appearing in the denominator implies a matrix inversion and a multiplication between dense matrices, which is inefficient to implement in a GNN. Hence, we consider the distributed formulation proposed by Loukas et al. (2015), which approximates the effect of an ARMA(1,0) filter with a first-order recursion

$$\bar{X}^{(t+1)} = a \mathbf{M} \bar{X}^{(t)} + bX. \qquad (7)$$

The eigenvalues of $\mathbf{M} = \frac{1}{2}(\lambda_{max} - \lambda_{min}) I - L$ are related to those of the Laplacian $L$ as follows: $\mu_m = (\lambda_{max} - \lambda_{min})/2 - \lambda_m$. The frequency response of the approximated ARMA(1,0) filter is

$$h_{ARMA}(\mu) = \frac{r}{\mu - p} \quad \text{with } r = -\frac{b}{a} \text{ and } p = \frac{1}{a}. \qquad (8)$$

The effect of an ARMA(K,K) filter can be obtained by summing the outputs of $K$ ARMA(1,0) filters,

$$\bar{X} = \sum_{k=1}^{K} \bar{X}_k = \sum_{k=1}^{K} \sum_{m=1}^{M} \frac{r_k}{\mu_m - p_k} u_m u_m^T X. \qquad (9)$$

3.1. Recursive and distributed implementation of the ARMA layer
Here we propose a recursive implementation of the ARMA(K,K) filter based on neural networks; see Fig. 1. Equation (7) must be applied many times before converging to a steady state. Instead, to obtain a more efficient implementation, we apply the recursive update only a few times and compensate by adding a non-linearity and trainable parameters.

We implement the recursive update in (7) with a Graph Convolutional Skip (GCS) layer, defined as

$$\bar{X}^{(t+1)} = \sigma\left( \tilde{L} \bar{X}^{(t)} W^{(t)} + X V^{(t)} \right), \qquad (10)$$

where $W^{(t)} \in \mathbb{R}^{F_{out}^{t} \times F_{out}^{t+1}}$ and $V^{(t)} \in \mathbb{R}^{F_{in} \times F_{out}^{t+1}}$ are trainable parameters; we set $\bar{X}^{(0)} = X$. The modified Laplacian matrix $\tilde{L} = I - L$ is derived by setting $\lambda_{min} = 0$ and $\lambda_{max} = 2$ in $\mathbf{M}$. This is a reasonable simplification, since the spectrum of $L$ lies in $[0, 2]$ and the trainable parameters in $W^{(t)}$ can adjust the small offset introduced. Each GCS layer extracts local substructure information by aggregating node information in local neighbourhoods and, through the skip connection, by combining it with the original node features. If $L$ and/or $X$ are represented by sparse tensors, the GCS can exploit efficient sparse operations.

We build $K$ parallel stacks, each one with $T$ GCS layers, and define the output of the ARMA convolutional layer as

$$\bar{X} = \mathrm{avgpool}\left( \sum_{k=1}^{K} \bar{X}_k^{(T)} \right), \qquad (11)$$

where $\bar{X}_k^{(T)}$ is the last output of the $k$-th stack. We apply dropout to the skip connection of each GCS layer, not only for regularization but also to encourage diversity in the filters learned in each one of the $K$ parallel stacks. To provide further regularization and reduce the number of parameters in the ARMA layer, the GCS layers in each stack may share the same parameters, except for $W_k^{(1)} \in \mathbb{R}^{F_{in} \times F_{out}}$, which performs a different mapping in the first layer of the stack. Namely, $W_k^{(i)} = W_k^{(i+1)} = W_k \in \mathbb{R}^{F_{out} \times F_{out}}$, $\forall i >$

[Figure 1: K parallel stacks of T Graph Conv Skip (GCS) layers, labelled Graph Conv Skip k-1 to k-T for stack k. Each GCS block combines $\tilde{L}$, $W_k^{(t)}$ and $V_k^{(t)}$, followed by $\sigma$; an Avg Pool block merges the stack outputs and maps $X$ to $\bar{X}$.]
Figure 1. The ARMA convolutional layer. Same colour indicates shared weights.

1 and $V_k^{(i)} = V_k^{(i+1)} = V_k \in \mathbb{R}^{F_{in} \times F_{out}}$, $\forall i$. Since each stack of GCS layers is executed independently from the others, it is possible to implement the ARMA layer in a distributed fashion using multiple GPUs.

We also notice that the ARMA layer can deal naturally with time-varying graph signals (Holme, 2015; Grattarola et al., 2018) by replacing the constant term $X$ in (10) with a time-dependent input $X^{(t)}$.

3.2. Relationship to other approaches
The GCS layer has a similar formulation to the graph convolutional layer in (4). However, thanks to the skip connection, it is possible to stack multiple layers without incurring the risk of over-smoothing the node features of the graph (Li et al., 2018). The formulation of the ARMA layer with shared weights has analogies with a recurrent neural network with residual connections (Wu et al., 2016). Finally, similarly to GNNs operating in the node domain (Scarselli et al., 2009; Gallicchio & Micheli, 2010), each GCS layer computes the filtered signal $\bar{x}_i^{(t+1)}$ at vertex $i$ as a combination of signals $x_j^{(t)}$ in its 1-hop neighborhood, $j \in \mathcal{N}(i)$. Such a commutative aggregation solves the problem of undefined vertex ordering and varying neighborhood sizes.

4. Node Pooling
Node pooling associates a single label to the node features and is particularly important in tasks such as graph (signal) classification. However, contrary to other neural networks, GNNs also require coarsening the original graph to perform further convolutions on graph signals as the node dimensionality is reduced through the network layers.

A recent approach (Ying et al., 2018) proposes to learn differentiable soft assignments to cluster the nodes at each layer. The original adjacency matrix acts as a prior when learning the soft assignment, and sparsity is enforced with an entropy-based regularization. However, the application of this method to medium and large graphs is not feasible, as it introduces a number of additional trainable parameters quadratic in the number of nodes. The other approach, followed in most GNNs, consists of pre-computing coarsened versions of the graph using hierarchical clustering (Bruna et al., 2013; Defferrard et al., 2016; Monti et al., 2017; Fey et al., 2018). At each level $l$, two vertices $x_i^{(l)}$ and $x_j^{(l)}$ are clustered together in a new vertex $x_z^{(l+1)}$. Then, a standard pooling operation is applied to halve the size of the graph signal. To make the pooling output consistent with the cluster assignment, the graph signal is rearranged so that elements $i$ and $j$ end up in consecutive positions. This approach has several drawbacks. First, the connectivity of the original graph is not preserved in the coarsened graphs, and the spectrum of their associated Laplacians is usually not contained in the spectrum of the original Laplacian. Second, the procedure to rearrange vertices is cumbersome to implement; moreover, it requires adding fake vertices so that the number of nodes can be halved each time, hence injecting noisy information in the graph signal. Finally, clustering results depend on the initial order of the nodes, which hampers stability and reproducibility.

In this paper, we use a pooling procedure that builds on the multi-resolution framework adopted in graph signal processing (Shuman et al., 2016), which addresses the drawbacks of the aforementioned methods. A similar, yet preliminary, approach was recently discussed by Simonovsky & Komodakis (2017). Here, we provide a more detailed formulation, framed within the GNN framework, of the pooling procedure based on node decimation and of the graph reduction used to generate a new coarsened graph, which is necessary to apply graph convolutions in the next GNN layer. In the experiments, we provide a systematic comparison with pooling methods based on graph clustering.

4.1. Node decimation pooling and graph reduction
Pooling with node decimation. A simple way to decimate nodes $\mathcal{V}$ of an arbitrary graph consists of partitioning them in two sets based on the Fiedler vector $u_{max}$ of the Laplacian, and then dropping one of the two sets of nodes (Shuman et al., 2016). In particular, the pooling operation keeps only the nodes in $\mathcal{V}^+$, defined as

$$\mathcal{V}^+ = \{ n \in \mathcal{V} : u_{max}(n) \geq 0 \}. \qquad (12)$$

We note that it is equivalent to keep, each time, the nodes in $\mathcal{V}^-$, i.e., those associated with a negative value in $u_{max}$. Despite its simplicity, this procedure offers important advantages: i) approximately half of the nodes are removed each time, i.e., $|\mathcal{V}^+| \approx |\mathcal{V}^-|$; ii) the nodes in $\mathcal{V}^+$ and $\mathcal{V}^-$ are connected by edges with small weights; iii) the Fiedler vector can be quickly computed with the power method. Furthermore, compared to the pooling based on graph clustering, this approach avoids introducing fake nodes and reordering nodes according to their cluster indices.

The pooling operation is implemented by multiplying a graph signal $X$ with a decimation matrix $S$, which is obtained by keeping in the identity matrix $I_M$ only the rows corresponding to the vertices in $\mathcal{V}^+$,

$$X_{pool} = SX = [I_M]_{\mathcal{V}^+, \mathcal{V}} X. \qquad (13)$$

Graph reduction. A simple approach to reduce the original Laplacian to a new Laplacian $L_{new}$ defined on the subset $\mathcal{V}^+$ consists in computing

$$L_{new} = \left( [L]^2 \right)_{\mathcal{V}^+, \mathcal{V}^+}, \qquad (14)$$

which are the selected rows and columns of the 2-hop Laplacian (Narang & Ortega, 2010). Since the decimation operation ideally removes the first-closest neighbours of the nodes in $\mathcal{V}^+$ (i.e., the nodes in $\mathcal{V}^-$), it is intuitive that before being dropped the nodes should propagate their information in the first-order neighbourhood. While this graph reduction is very fast to compute, it does not always preserve connectivity, it introduces self-loops, and the spectra of $L$ and $L_{new}$ might not be interlaced (i.e., the spectrum of $L_{new}$ is not always contained in the spectrum of $L$).

The Kron reduction (Shuman et al., 2016) is a more advanced technique that defines the reduced Laplacian as

$$L_{new} = L_{\mathcal{V}^+, \mathcal{V}^+} - L_{\mathcal{V}^+, \mathcal{V}^-} L_{\mathcal{V}^-, \mathcal{V}^-}^{-1} L_{\mathcal{V}^-, \mathcal{V}^+}. \qquad (15)$$

The resulting $L_{new}$ is a well-defined Laplacian where two nodes are connected only if there is a path between them in the original $L$. Furthermore, $L_{new}$ does not introduce self-loops and guarantees spectral interlacing and resistance distance preservation (Shuman et al., 2016). The main drawback compared to (14) is the computation of the inverse, which can give memory issues in very large graphs. Due to the connectivity preservation property, $L_{new}$ becomes denser after each Kron reduction. Since graph convolutions are implemented by sparse operations, this implies that deeper layers will require more computation. A solution is to apply spectral sparsification (Batson et al., 2013) on $L_{new}$ after each reduction. However, we experienced numerical instability and poor convergence when applying the sparsification algorithm. Therefore, we opted for dropping connections with weights below a small threshold (1E-4), which keeps the desired level of sparsity in $L_{new}$ without altering its spectrum too much.

[Figure 2: pyramid of Laplacians and decimation matrices $(L^{(0)}, S^{(0)})$ to $(L^{(4)}, S^{(4)})$, with $\bar{x}^{(0)} = \mathrm{Conv}(L^{(0)}, x^{(0)})$, $x^{(3)} = \bar{x}^{(0)} S^{(2)} S^{(1)} S^{(0)}$, $\bar{x}^{(3)} = \mathrm{Conv}(L^{(3)}, x^{(3)})$, $x^{(4)} = \bar{x}^{(3)} S^{(3)}$, $\bar{x}^{(4)} = \mathrm{Conv}(L^{(4)}, x^{(4)})$, $x^{(5)} = \bar{x}^{(4)} S^{(4)}$.]
Figure 2. How to perform convolutions with selected Laplacians in the pyramid and higher order pooling.

Pooling with larger stride. It is possible to perform convolutions only with some Laplacians in the pyramid and apply pooling with larger stride to transit from level $i$ to level $i + k$, with $k > 1$. The application of a single decimation matrix $S^{(i)}$ corresponds to a classic pooling with stride 2, as it approximately halves the number of nodes. A pooling with stride $2^k$ is obtained by applying $k$ decimation matrices in cascade. Fig. 2 depicts an example of a pooling with approximately stride 8, which makes it possible to skip 2 levels in the pyramid and to apply a convolution with the Laplacian $L^{(3)}$ directly after the first convolution with $L^{(0)}$.

5. Experiments
To assess the performance of the proposed model, we consider three classification tasks on graph data: node classification, graph signal classification, and graph classification. In the following, we define each task and report the results obtained with our approach, comparing them with the state of the art. Since we process only graphs of medium

and small size, in all experiments we use Kron reduction (15). However, we advise using the reduction in (14) when dealing with very large graphs, to avoid memory issues.

5.1. Semi-supervised node classification
The input for this task is a single graph described by an adjacency matrix $A \in \mathbb{R}^{M \times M}$, a graph signal $X \in \mathbb{R}^{M \times F_{in}}$ and the labels $y_l \in \mathbb{R}^{M_l}$ of a subset of nodes $M_l \subset M$. The target outputs are the labels $y_u \in \mathbb{R}^{M_u}$ of the unlabelled nodes. For this task, pooling is not required, since the output is computed in the input node space by mapping node features into labels through graph convolutions.

We follow the same experimental setup of (Kipf & Welling, 2016a), applied to three citation network datasets: Citeseer, Cora and Pubmed. Each dataset is a graph whose nodes $x \in \mathbb{R}^{F_{in}}$ are documents represented by sparse bag-of-words feature vectors. The binary undirected edges in $A$ indicate citation links between documents. For training, 20 labels per document class are used ($y_l$) and the performance is evaluated as classification accuracy on $y_u$.

As in (Kipf & Welling, 2016a), we use a 2-layer GNN with 16 hidden units and we report in Tab. 2 the mean classification accuracy obtained for different graph convolutional layers: the ones based on Chebyshev polynomials (Cheby), their first-order approximation (GCN), and the proposed ARMA layers. Tab. 1 reports the hyperparameter configuration found with cross-validation: L2 regularization weight, dropout probability ($p_{drop}$), number of stacks ($K$) and depth ($T$) in the ARMA filter, and usage of shared weights in the GCS layer. As additional baselines, we include the results from the literature obtained by Label Propagation (LP) (Zhou et al., 2004), Deepwalk (DW) (Perozzi et al., 2014), Planetoid (PL) (Yang et al., 2016), and Graph Attention Networks (GAT) (Velickovic et al., 2017).

Table 1. Hyperparameters setting for node classification

Dataset    L2 reg.  pdrop  [K, T]  shared W
Cora       5e-4     0.25   [3,2]   yes
Citeseer   5e-4     0.75   [3,1]   yes
Pubmed     5e-4     0.0    [1,4]   no

Node classification is a semi-supervised task that requires strong regularization and a simple model to avoid overfitting on the few labels available. This is the key to the success of the GCN model compared to the more complex Chebyshev filters. However, despite its more powerful modelling capability, the proposed ARMA layer can, thanks to its flexible formulation, implement the right degree of complexity for each task and outperforms other approaches. Notably, our method surpasses even GAT, which exploits a sophisticated attention mechanism to learn how to weight each link when applying the graph convolution.

Table 2. Mean node classification accuracy

Method        Cora  Citeseer  Pubmed
LP            68.0  45.3      63.0
DW            67.2  43.2      65.3
PL            75.7  64.7      77.2
GAT           83.0  72.5      79.0
Cheby         81.2  69.8      74.4
GCN           81.5  70.3      79.0
ARMA (ours)   84.7  73.8      81.4

5.2. Graph signal classification
In this task, $N$ different graph signals $X \in \mathbb{R}^{M \times F_{in}}$, defined on the same adjacency matrix $A \in \mathbb{R}^{M \times M}$, must be classified with labels $y_1, \dots, y_N$. As in traditional CNNs, this task can be solved by a deep architecture composed of $L$ graph convolutional layers, each one followed by a pooling layer. In each layer $l$, the graph convolution modifies the vertex features by mapping the graph signal $x^{(l)} \in \mathbb{R}^{M_l \times F_l}$ into $\bar{x}^{(l)} \in \mathbb{R}^{M_l \times F_{l+1}}$, while the pooling operation maps $\bar{x}^{(l)}$ into a new node space $x^{(l+1)} \in \mathbb{R}^{M_{l+1} \times F_{l+1}}$. In the last layer, the features of the remaining nodes are aggregated by a global operation, $x \in \mathbb{R}^{M_L \times F_L} \rightarrow x \in \mathbb{R}^{F_L}$, and a Softmax layer is applied to compute the labels. We perform experiments following the same setting of (Defferrard et al., 2016) on the MNIST and 20news datasets and, unless specified otherwise, we use the same hyperparameters.

MNIST. To emulate a classic CNN operating on a regular 2D grid, an 8-NN graph is defined on the 784 pixel positions of the MNIST images. The elements in $A$ are

$$a_{ij} = \exp\left( -\frac{\| p_i - p_j \|^2}{\sigma^2} \right), \qquad (16)$$

where $p_i$ and $p_j$ are the 2D coordinates of pixels $i$ and $j$. Each graph signal is a vectorized image $x \in \mathbb{R}^{784 \times 1}$. As network architecture, we use GC16-P4-GC32-P4-FC512, where GC16 and GC32 indicate a graph convolutional layer with 16 and 32 hidden units respectively, P4 a pooling operation with stride 4, and FC512 a fully connected layer with 512 units. Compared to (Defferrard et al., 2016), we use fewer hidden units to better differentiate the results obtained with different filters and pooling methods. The ARMA filters are configured with $K = 5$, $T = 10$, and no shared weights. As discussed in Sect. 4.1, when using decimation pooling a stride 4 is approximated by two decimation matrices in cascade ($S^{(1)} S^{(0)}$ and $S^{(3)} S^{(2)}$ in this case).

The results, reported in Tab. 3, show that a GNN with ARMA filters achieves the best results. On the other hand, the GCN layers yield the worst performance, suggesting

that for more complex graph signal classification tasks their simple formulation is not sufficient. Also, the GNN performs better when using the hierarchical clustering pooling (Defferrard et al., 2016), rather than the node decimation pooling. This is expected, since the artificial 8-NN graph generated for this task, contrary to most real-world graphs, is extremely regular and the node pairs are easily matched by the clustering procedure.

Table 3. Graph signal classification results on MNIST.

GC layer      Pooling: clust  Pooling: decim(k)
GCN           97.57           95.91
Cheby         98.17           97.64
ARMA (ours)   98.54           98.11

20news. The dataset consists of 18,846 documents from 20 classes. Each graph signal is a document that is represented by a bag-of-words of the 10,000 most frequent words in the corpus. Each word is, in turn, represented by a word2vec embedding. The underlying graph of 10,000 nodes is defined by a 16-NN adjacency matrix built with (16), where $p_i$, $p_j$ are the embeddings of words $i$ and $j$.

Table 4. Graph signal classification results on 20news.

Method                    Accuracy
Linear SVM                65.90
Multinomial Naive Bayes   68.51
Softmax                   66.28
GCN                       65.57
Cheby                     68.26
ARMA (ours)               70.12

Tab. 4 reports the average classification accuracy obtained by a GNN with a single convolutional layer, followed by global average pooling and Softmax. We report all the results from (Defferrard et al., 2016), and we compare them with those obtained using a GCN and the proposed ARMA layer. The ARMA layer has 16 hidden units and is configured with $K=1$, $T=1$, 1E-3 as L2 regularization, and 0.75 dropout. For the GCN layer we used 32 hidden units, which is the same number of units used for the Chebyshev layer in (Defferrard et al., 2016). The GNN with the GCN layer performs worse than any other method. On the other hand, the proposed ARMA GNN outperforms the Chebyshev GNN and also every other model. Since we use only one GCS layer ($K=1$), the main difference between the GCN and our layer is the presence of the skip connection with high dropout, which turns out to be extremely important for the inference task.

5.3. Graph classification
In this task, the $i$-th datum is a graph represented by a pair $\{A_i, X_i\}$, $i = 1, \dots, N$, where $A_i \in \mathbb{R}^{M_i \times M_i}$ is an adjacency matrix with $M_i$ nodes and the graph signal $X_i \in \mathbb{R}^{M_i \times F}$ describes the node features. Each sample must be classified with a label $y_i$. To train the GNN on mini-batches of graphs with a variable number of nodes, we compute the disjoint union of the graphs in each mini-batch, and train the network on the obtained Laplacian and graph signal. In this way, we can apply the convolution and pooling operations seamlessly, performing batched computations on GPU. At the end, an average pooling matrix aggregates the features on the remaining nodes in each graph signal, and a Softmax layer yields the final output for each graph. Fig. 3 depicts an example of the procedure.

To test our model, we consider 4 datasets from the benchmark database for graph kernels (footnote 1): Enzymes, Proteins, D&D, and MUTAG. We used node degree, clustering coefficients, and node labels as additional node features. For each experiment we adopted a fixed architecture, which is GC64-P2-GC64-P2-GC64-P2-AvgPool-Softmax. Such a configuration might not be optimal for all datasets, but the main focus of this experiment is to compare on a common ground the different graph filters and the pooling procedures based on node decimation and hierarchical graph clustering. Tab. 5 reports the optimal configurations of ARMA and Cheby filters found with cross-validation on each dataset. To evaluate model performance we perform a 10-fold train/test split, using 10% of the training set in each fold as validation set, and in Tab. 6 we report the accuracy averaged over 10 folds. For comparison, we also add in Tab. 6 the results obtained by state-of-the-art graph kernels and other neural networks for graph classification: the Weisfeiler-Lehman kernel (WL) (Shervashidze et al., 2011); the Edge-Conditioned Convolution network (ECC) (Simonovsky & Komodakis, 2017); PATCHY-SAN (Niepert et al., 2016); GRAPHSAGE (Hamilton et al., 2017); the Diffusion-CNN (DCNN) (Atwood & Towsley, 2016); the network with differentiable pooling (DIFFPOOL) (Ying et al., 2018); the Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al., 2018).

Table 5. Hyperparameters setting for graph classification

Dataset    L2 reg.  pdrop  Cheby K  ARMA [K, T]  ARMA shared W
Enzymes    5e-4     0.5    3        [1,4]        yes
Proteins   5e-4     0.5    10       [3,2]        no
D&D        5e-4     0.0    5        [3,4]        yes
MUTAG      5e-4     0.25   10       [3,4]        no

1 https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets

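[Figure 3: three input graphs with Laplacians $L_1^{(0)}, L_2^{(0)}, L_3^{(0)}$ and signals $x_1^{(0)}, x_2^{(0)}, x_3^{(0)}$ are processed jointly through Conv1, Pool1, Conv2 and Pool2 (coarsened Laplacians $L_1^{(1)}, L_2^{(1)}, L_3^{(1)}$); an average pooling matrix with entries $1/n_1$, $1/n_2$, $1/n_3$ aggregates the remaining nodes of each graph, and a softmax yields the outputs $c_1, c_2, c_3$.]
Figure 3. How to perform graph classification with mini batches in the proposed framework.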
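The mini-batch construction of Fig. 3 can be illustrated with a short NumPy sketch. It assumes dense matrices for readability (a sparse block-diagonal format would be the natural choice in practice), and the function name `disjoint_union` is hypothetical rather than taken from the original implementation.

```python
import numpy as np

def disjoint_union(laplacians, signals):
    """Stack a mini-batch of graphs into one block-diagonal Laplacian.

    Returns the batch Laplacian, the stacked graph signal, and an average
    pooling matrix P whose g-th row holds 1/n_g on the nodes of graph g.
    """
    sizes = [L.shape[0] for L in laplacians]
    total = sum(sizes)
    L_batch = np.zeros((total, total))
    P = np.zeros((len(sizes), total))
    start = 0
    for g, (L, n) in enumerate(zip(laplacians, sizes)):
        L_batch[start:start + n, start:start + n] = L   # no edges across graphs
        P[g, start:start + n] = 1.0 / n
        start += n
    X_batch = np.vstack(signals)                         # node features, same order
    return L_batch, X_batch, P
```

Because the blocks do not interact, a product with the batch Laplacian filters every graph independently, and `P @ X_batch` returns one averaged feature vector per graph, which is what the final Softmax layer receives.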

[Figure 4: bar chart of the training time per epoch (s), on a scale from 0.0 to 4.0, for clust and decim pooling on Enzymes, Proteins, D&D and MUTAG.]
Figure 4. Mean training time per epoch when using ARMA filters and pooling based on hierarchical clustering or decimation.

Table 6. Graph classification results

Pooling  Method        Enzymes  Protein  D&D    MUTAG
         WL            53.53    72.92    74.02  80.72
         ECC           53.50    72.65    74.10  89.44
         PATCHY-SAN    –        75.00    76.27  92.63
         GRAPHSAGE     54.25    70.48    75.42  –
         DCNN          18.10    61.29    58.09  66.98
         DIFFPOOL      62.53    76.25    80.64  –
         DGCNN         –        75.54    79.73  85.83
clust    GCN           64.83    72.06    64.60  76.13
clust    Cheby         66.50    69.19    66.81  80.32
clust    ARMA (ours)   67.83    71.92    71.22  85.67
decim    GCN           67.33    72.15    70.63  86.20
decim    Cheby         66.50    70.79    68.09  90.39
decim    ARMA (ours)   69.66    75.12    74.86  93.25

GCN performs better than Cheby only on the Protein dataset, while the proposed ARMA layer always achieves the best performance, showing once again a superior modeling capability compared to the layers based on polynomial filters. The adopted GNN architecture is particularly effective for the Enzymes dataset, as it surpasses the state of the art with every convolutional layer and pooling method. The GNN configured with ARMA layers and decimation pooling attains top performance also in MUTAG, and competitive results in Protein. Finally, in D&D results are below the state of the art, suggesting that the adopted architecture (GC64-P2-GC64-P2-GC64-P2-AvgPool-Softmax) is not optimal for this task.

Contrary to the results obtained on the artificial grid for the MNIST graph signal classification, here the decimation pooling always outperforms the clustering pooling. This demonstrates that for irregular graph structures with a variable number of nodes, the node decimation pooling is much more effective. Moreover, Fig. 4 shows that, when using decimation pooling, training the GNN is faster. Indeed, in cluster pooling fake nodes must be added whenever the number of nodes is not divisible by $2^L$ (in our case $L = 3$, since we apply pooling 3 times), which implies larger graphs and slower convolutions.

6. Conclusions
We proposed a recursive formulation of the ARMA graph convolutional layer, which allows for a fast and distributed GNN implementation that exploits efficient sparse tensor operations to perform graph convolutions with the Laplacians. The proposed ARMA layer outperformed existing convolutional layers based on polynomial filters on all the classification tasks on graph data taken into account. To build a deep GNN, we used a pooling operation based on node decimation, which achieves superior performance on real-world graphs with irregular topology and faster training time compared to node pooling based on graph clustering.

The current formulation of the ARMA layer only considers node information, but it can be extended to incorporate edge features to weight the contribution of each neighbor node using, for example, edge-conditioned convolutions (Simonovsky & Komodakis, 2017). Moreover, the results presented in (Velickovic et al., 2017) showed a notable increase in performance when applying multi-head soft attention to the Laplacian in a GNN. Given that the ARMA layer is already structured in a parallel fashion, a similar extension with the attention mechanism could provide comparable benefits and further improve the performance.
References

Atwood, James and Towsley, Don. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001, 2016.

Bacciu, Davide, Errica, Federico, and Micheli, Alessio. Contextual graph markov model: A deep and generative approach to graph processing. In Proceedings of the 35th international conference on Machine learning. ACM, 2018.

Batson, Joshua, Spielman, Daniel A, Srivastava, Nikhil, and Teng, Shang-Hua. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.

Battaglia, Peter W, Hamrick, Jessica B, Bapst, Victor, Sanchez-Gonzalez, Alvaro, Zambaldi, Vinicius, Malinowski, Mateusz, Tacchetti, Andrea, Raposo, David, Santoro, Adam, Faulkner, Ryan, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

Duvenaud, David K, Maclaurin, Dougal, Iparraguirre, Jorge, Bombarell, Rafael, Hirzel, Timothy, Aspuru-Guzik, Alán, and Adams, Ryan P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232, 2015.

Fey, Matthias, Lenssen, Jan Eric, Weichert, Frank, and Müller, Heinrich. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877, 2018.

Gallicchio, Claudio and Micheli, Alessio. Graph echo state networks. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pp. 1–8. IEEE, 2010.

Grattarola, Daniele, Zambon, Daniele, Alippi, Cesare, and Livi, Lorenzo. Learning graph embeddings on constant-curvature manifolds for change detection in graph streams. arXiv preprint arXiv:1805.06299, 2018.

Hamilton, Will, Ying, Zhitao, and Leskovec, Jure. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

Henaff, Mikael, Bruna, Joan, and LeCun, Yann. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Holme, Petter. Modern temporal network theory: a colloquium. The European Physical Journal B, 88(9):234, 2015.

Isufi, Elvin, Loukas, Andreas, Simonetto, Andrea, and Leus, Geert. Autoregressive moving average graph filtering. arXiv preprint arXiv:1602.04436, 2016.

Kipf, Thomas N and Welling, Max. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2016a.

Kipf, Thomas N and Welling, Max. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016b.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Li, Qimai, Han, Zhichao, and Wu, Xiao-Ming. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of AAAI Conference on Artificial Intelligence, 2018.

Loukas, Andreas, Simonetto, Andrea, and Leus, Geert. Distributed autoregressive moving average graph filters. IEEE Signal Processing Letters, 22(11):1931–1935, 2015.

Micheli, Alessio. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 20(3):498–511, 2009.

Monti, Federico, Boscaini, Davide, Masci, Jonathan, Rodola, Emanuele, Svoboda, Jan, and Bronstein, Michael M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pp. 3, 2017.

Narang, Sunil K and Ortega, Antonio. Local two-channel critically sampled filter-banks on graphs. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 333–336. IEEE, 2010.

Narang, Sunil K, Gadde, Akshay, and Ortega, Antonio. Signal processing techniques for interpolation in graph structured data. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 5445–5449. IEEE, 2013.
Niepert, Mathias, Ahmed, Mohamed, and Kutzkov, Konstantin. Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023, 2016.

Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.

Scarselli, Franco, Gori, Marco, Tsoi, Ah Chung, Hagenbuchner, Markus, and Monfardini, Gabriele. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Shervashidze, Nino, Schweitzer, Pascal, Leeuwen, Erik Jan van, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

Shuman, David I, Vandergheynst, Pierre, and Frossard, Pascal. Chebyshev polynomial approximation for distributed signal processing. In Distributed Computing in Sensor Systems and Workshops (DCOSS), 2011 International Conference on, pp. 1–8. IEEE, 2011.

Shuman, David I, Faraji, Mohammad Javad, and Vandergheynst, Pierre. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.

Simonovsky, Martin and Komodakis, Nikos. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Susnjara, Ana, Perraudin, Nathanael, Kressner, Daniel, and Vandergheynst, Pierre. Accelerated filtering on graphs using lanczos method. arXiv preprint arXiv:1509.04537, 2015.

Tremblay, Nicolas, Goncalves, Paulo, and Borgnat, Pierre. Design of graph filters and filterbanks. In Cooperative and Graph Signal Processing, pp. 299–324. Elsevier, 2018.

Velickovic, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Lio, Pietro, and Bengio, Yoshua. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V, Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Yang, Zhilin, Cohen, William W, and Salakhutdinov, Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pp. 40–48. JMLR.org, 2016.

Ying, Rex, You, Jiaxuan, Morris, Christopher, Ren, Xiang, Hamilton, William L, and Leskovec, Jure. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.

Zhang, Muhan, Cui, Zhicheng, Neumann, Marion, and Chen, Yixin. An end-to-end deep learning architecture for graph classification. In Proceedings of AAAI Conference on Artificial Intelligence, 2018.

Zhou, Denny, Bousquet, Olivier, Lal, Thomas N, Weston, Jason, and Schölkopf, Bernhard. Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328, 2004.
Published as a conference paper at ICLR 2019

CONVOLUTIONAL NEURAL NETWORKS ON NON-UNIFORM GEOMETRICAL SIGNALS USING EUCLIDEAN
SPECTRAL TRANSFORMATION

Chiyu “Max” Jiang (UC Berkeley), Dequan Wang (UC Berkeley), Jingwei Huang (Stanford University),
Philip Marcus (UC Berkeley), Matthias Nießner (Technical University of Munich)
arXiv:1901.02070v1 [cs.CV] 7 Jan 2019

ABSTRACT

Convolutional Neural Networks (CNN) have been successful in processing data
signals that are uniformly sampled in the spatial domain (e.g., images). However,
most data signals do not natively exist on a grid, and in the process of being sampled
onto a uniform physical grid suffer significant aliasing error and information loss.
Moreover, signals can exist in different topological structures as, for example,
points, lines, surfaces and volumes. It has been challenging to analyze signals with
mixed topologies (for example, point cloud with surface mesh). To this end, we
develop mathematical formulations for Non-Uniform Fourier Transforms (NUFT)
to directly, and optimally, sample nonuniform data signals of different topologies
defined on a simplex mesh into the spectral domain with no spatial sampling
error. The spectral transform is performed in the Euclidean space, which removes
the translation ambiguity from works on the graph spectrum. Our representation
has four distinct advantages: (1) the process causes no spatial sampling error
during the initial sampling, (2) the generality of this approach provides a unified
framework for using CNNs to analyze signals of mixed topologies, (3) it allows
us to leverage state-of-the-art backbone CNN architectures for effective learning
without having to design a particular architecture for a particular data structure in
an ad-hoc fashion, and (4) the representation allows weighted meshes where each
element has a different weight (i.e., texture) indicating local properties. We achieve
results on par with the state-of-the-art for the 3D shape retrieval task, and a new
state-of-the-art for the point cloud to surface reconstruction task.

1 INTRODUCTION

We present a unifying and novel geometry representation for utilizing Convolutional Neural Networks
(CNNs) on geometries represented on weighted simplex meshes (including textured point clouds, line
meshes, polygonal meshes, and tetrahedral meshes) which preserve maximal shape information based
on the Fourier transformation. Most methods that leverage CNNs for shape learning preprocess these
shapes into uniform-grid based 2D images (rendered multiview images) or 3D images (binary voxel
or Signed Distance Function (SDF)). However, rendered 2D images do not preserve the 3D topologies
of the original shapes due to occlusions and the loss of the third spatial dimension. Binary voxels
and SDF representations under low resolution suffer big aliasing errors and under high resolution
become memory inefficient. Loss of information in the input bottlenecks the effectiveness of the
downstream learning process. Moreover, it is not clear how a weighted mesh where each element is
weighted by a different scalar or vector (i.e., texture) can be represented by binary voxels and SDF.
Mesh and graph based CNNs perform learning on the manifold physical space or graph spectrum, but
generality across topologies remains challenging.
In contrast to methods that operate on uniform sampling based representations such as voxel-based and
view-based models, which suffer significant representational errors, we use analytical integration to
precisely sample in the spectral domain to avoid sample aliasing errors. Unlike graph spectrum based
methods, our method naturally generalizes across input data structures of varied topologies. Using our
representation, CNNs can be directly applied in the corresponding physical domain obtainable by
inverse Fast Fourier Transform (FFT) due to the equivalence of the spectral and physical domains.
This allows for the use of powerful uniform Cartesian grid based CNN backbone architectures (such as
DLA (Yu et al., 2018), ResNet (He et al., 2016)) for the learning task on arbitrary geometrical signals.
Although the signal is defined on a simplex mesh, it is treated as a signal in the Euclidean space
instead of on a graph, differentiating our framework from graph-based spectral learning techniques
which have significant difficulties generalizing across topologies and are unable to utilize
state-of-the-art Cartesian CNNs.

Figure 1: Top: Schematic of the NUFT transformations of the Stanford Bunny model. Bottom:
Schematic for shape retrieval and surface reconstruction experiments.

We evaluate the effectiveness of our shape representation for deep learning tasks with three experi-
ments: a controlled MNIST toy example, the 3D shape retrieval task, and a more challenging 3D
point cloud to surface reconstruction task. In a series of evaluations on different tasks, we show the
unique advantages of this representation, and good potential for its application in a wider range of
shape learning problems. We achieve state-of-the-art performance among non-pre-trained models for
the shape retrieval task, and beat state-of-the-art models for the surface reconstruction task.
The key contributions of our work are as follows:

• We develop mathematical formulations for performing Fourier Transforms of signals defined
on a simplex mesh, which generalizes and extends to all geometries in all dimensions. (Sec. 3)

• We analytically show that our approach computes the frequency domain representation
precisely, leading to much lower overall representational errors. (Sec. 3)

• We empirically show that our representation preserves maximal shape information compared
to commonly used binary voxel and SDF representations. (Sec. 4.1)

• We show that deep learning models using CNNs in conjunction with our shape representation
achieve state-of-the-art performance across a range of shape-learning tasks including shape
retrieval (Sec. 4.2) and point to surface reconstruction (Sec. 4.3).


Figure 2: Surface localization in (a) binary pixel/voxel representation, where the boundary can
only be in one of 2^4 (2D) or 2^8 (3D) discrete locations (b) Signed Distance Function
representation, where boundary is linear (c) proposed representation, with nonlinear localization
of boundary, achieving subgrid accuracy

Table 1: Notation list

Notation        Description
d               Dimension of Euclidean space $\mathbb{R}^d$
j               Degree of simplex. Point j = 0, Line j = 1, Tri. j = 2, Tet. j = 3
n, N            Index of the n-th element among a total of N elements
$\Omega_n^j$    Domain of n-th element of order j
x               Cartesian space coordinate vector. x = (x, y, z)
k               Spectral domain coordinate vector. k = (u, v, w)
i               Imaginary number unit

2 RELATED WORK

2.1 SHAPE REPRESENTATION

Shape learning involves the learning of a mapping from input geometrical signals to desired output
quantities. The representation of geometrical signals is key to the learning process, since on the one
hand the representation determines the learning architectures, and, on the other hand, the richness of
information preserved by the representation acts as a bottleneck to the downstream learning process.
While data representation has not been an open issue for 2D image learning, it is far from being
agreed upon in the existing literature for 3D shape learning. The varied shape representations used in
3D machine learning are generally classified as multiview images (Su et al., 2015a; Shi et al., 2015;
Kar et al., 2017), volumetric voxels (Wu et al., 2015; Maturana & Scherer, 2015; Wu et al., 2016;
Brock et al., 2016), point clouds (Qi et al., 2017a;b; Wang et al., 2018b), polygonal meshes (Kato
et al., 2018; Wang et al., 2018a; Monti et al., 2017; Maron et al., 2017), shape primitives (Zou et al.,
2017; Li et al., 2017), and hybrid representations (Dai & Nießner, 2018).
Our proposed representation is closest to volumetric voxel representation, since the inverse Fourier
Transform of the spectral signal in physical domain is a uniform grid implicit representation of
the shape. However, binary voxel representation suffers from significant aliasing errors during the
uniform sampling step in the Cartesian space (Pantaleoni, 2011). Using boolean values for de facto
floating point numbers during CNN training is a waste of information processing power. Also, the
primitive-in-cell test for binarization requires arbitrary grouping in cases such as having multiple
points or planes in the same cell (Thrun, 2003). Signed Distance Function (SDF) or Truncated
Signed Distance Function (TSDF) (Liu et al., 2017; Canelhas, 2017) provides localization for the
shape boundary, but is still constrained to linear surface localization due to the linear interpolation
process for recovering surfaces from grids. Our proposed representation under Fourier basis can find
nonlinear surface boundaries, achieving subgrid-scale accuracy (See Figure 2).

2.2 LEARNING ARCHITECTURES

Cartesian CNNs are the most ubiquitous and mature type of learning architecture in Computer Vision.
They have been thoroughly studied in a range of problems, including image recognition (Krizhevsky
et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), object detection (Girshick, 2015; Ren
et al., 2015), and image segmentation (Long et al., 2015; He et al., 2017). In the spirit of 2D
image-based deep learning, Cartesian CNNs have been widely used in shape learning models that
adopted multiview shape representation (Su et al., 2015a; Shi et al., 2015; Kar et al., 2017; Su et al.,
2015b; Pavlakos et al., 2017; Tulsiani et al., 2018). Because they extend straightforwardly
to 3D by swapping 2D convolutional kernels for 3D counterparts, Cartesian CNNs have
also been widely adopted in shape learning models using volumetric representations (Wu et al., 2015;
Maturana & Scherer, 2015; Wu et al., 2016; Brock et al., 2016). However, the dense nature of the
operations makes them inefficient for sparse 3D shape signals. To this end, improvements to Cartesian
CNNs have been made using space partitioning tree structures, such as Quadtree in 2D and Octree
in 3D (Wang et al., 2017; Häne et al., 2017; Tatarchenko et al., 2017). These Cartesian CNNs can
leverage backbone CNN architectures being developed in related computer vision problems and
thus achieve good performance. Since the physical domain representation in this study is based on
Cartesian uniform grids, we directly use Cartesian CNNs.
Graph CNNs utilize input graph structure for performing graph convolutions. They have been
developed to engage with general graph structured data (Bruna et al., 2013; Henaff et al., 2015;
Defferrard et al., 2016). Yi et al. (2017) used spectral CNNs with the eigenfunctions of the graph
Laplacian as a basis. However, the generality of this approach across topologies and geometries is still
challenging since consistency in eigenfunction basis is implied.
Specially Designed Neural Networks have been used to perform learning on unconventional data
structures. For example, Qi et al. (2017a) designed a Neural Network architecture for points that
achieves invariances using global pooling, with follow-up work (Qi et al., 2017b) using
CNN-inspired hierarchical structures for more efficient learning. Masci et al. (2015) performed convolution
directly on the shape manifold and Cohen et al. (2017) designed CNNs for the spherical domain and
used it for 3D shapes by projecting the shapes onto the bounding sphere.

2.3 FOURIER TRANSFORM OF SHAPE FUNCTIONS

The original work on analytical expressions for Fourier transforms of 2D polygonal shape functions
is given by Lee & Mittra (1983). Improved and simpler calculation methods have been suggested
in Chu & Huang (1989). A 3D formulation is proposed by Zhang & Chen (2001). Theoretical
analyses have been performed for the Fourier analysis of simplex domains (Sun, 2006; Li & Xu,
2009) and Sammis & Strain (2009) designed approximation methods for Fast Fourier Transform of
polynomial functions defined on simplices. Prisacariu & Reid (2011) describe shape with elliptic
Fourier descriptors for level set-based segmentation and tracking. There has also been a substantial
literature on fast non-uniform Fourier transform methods for discretely sampled signals (Greengard &
Lee, 2004). However, we are the first to provide a simple general expression for a j-simplex mesh,
an algorithm to perform the transformation, and illustrations of their applicability to deep learning
problems.

3 REPRESENTATION OF SHAPE FUNCTIONS

3.1 MATHEMATICAL PRELIMINARIES

Almost all discrete geometric signals can be abstracted into weighted simplicial complexes. A
simplicial complex is a set composed of points, line segments, triangles, and their d-dimensional
counterparts. We call a simplicial complex consisting solely of j-simplices a homogeneous
simplicial j-complex, or a j-simplex mesh. Most popular geometric representations that the research
community is familiar with are simplex meshes. For example, the point cloud is a 0-simplex mesh, the
triangular mesh is a 2-simplex mesh, and the tetrahedral mesh is a 3-simplex mesh. A j-simplex mesh
consists of a set of individual elements, each being a j-simplex. If the signal is non-uniformly distributed
over the simplex, we can define a piecewise constant j-simplex function over the j-simplex mesh.
We call this a weighted simplex mesh. Each element has a distinct signal density.
J-simplex function: For the n-th j-simplex with domain $\Omega_n^j$, we define a density function $f_n^j(\mathbf{x})$. For
example, for some Computer Vision and Graphics applications, a three-component density value can
be defined on each element of a triangular mesh for its RGB color content. For scientific applications,
signal density can be viewed as mass or charge or other physical quantity.

$$f_n^j(\mathbf{x}) = \begin{cases} \rho_n, & \mathbf{x} \in \Omega_n^j \\ 0, & \mathbf{x} \notin \Omega_n^j \end{cases}, \qquad f^j(\mathbf{x}) = \sum_{n=1}^{N} f_n^j(\mathbf{x}) \tag{1}$$

The piecewise-constant j-simplex function consisting of N simplices is therefore defined as the
superposition of the element-wise simplex function. Using the linearity of the integral in the Fourier
transform, we can decompose the Fourier transform of the density function on the j-simplex mesh to
be a weighted sum of the Fourier transform on individual j-simplices.

$$F^j(\mathbf{k}) = \int \cdots \int_{-\infty}^{\infty} f^j(\mathbf{x})\, e^{-i\mathbf{k}\cdot\mathbf{x}}\, d\mathbf{x} = \sum_{n=1}^{N} \rho_n \underbrace{\int \cdots \int_{\Omega_n^j} e^{-i\mathbf{k}\cdot\mathbf{x}}\, d\mathbf{x}}_{:= F_n^j(\mathbf{k})} = \sum_{n=1}^{N} \rho_n F_n^j(\mathbf{k}) \tag{2}$$

3.2 SIMPLEX MESH FOURIER TRANSFORM

We present a general formula for performing the Fourier transform of signal over a single j-simplex.
We provide detailed derivation and proof for j = 0, 1, 2, 3 in the supplemental material.

$$F_n^j(\mathbf{k}) = i^j \gamma_n^j \sum_{t=1}^{j+1} \frac{e^{-i\sigma_t}}{\prod_{l=1,\, l \neq t}^{j+1} (\sigma_t - \sigma_l)}, \qquad \sigma_t := \mathbf{k} \cdot \mathbf{x}_t \tag{3}$$
We define $\gamma_n^j$ to be the content distortion factor, which is the ratio of content between the simplex
over the domain $\Omega_n^j$ and the unit orthogonal j-simplex. Content is the j-dimensional analogy of the
3-dimensional volume. The unit orthogonal j-simplex is defined as a j-simplex with one vertex at the
Cartesian origin and all edges adjacent to the origin vertex to be pairwise orthogonal and to have unit
length. Therefore from Equation (2) the final general expression for computing the Fourier transform
of a signal defined on a weighted simplex mesh is:

$$F^j(\mathbf{k}) = \sum_{n=1}^{N} \rho_n\, i^j \gamma_n^j \sum_{t=1}^{j+1} \frac{e^{-i\sigma_t}}{\prod_{l=1,\, l \neq t}^{j+1} (\sigma_t - \sigma_l)} \tag{4}$$
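
As an illustration of Equation (4), the following is a minimal NumPy sketch rather than the authors' implementation. The function names `simplex_content` and `simplex_mesh_ft` are hypothetical, the content is obtained from a Gram determinant instead of the Cayley-Menger determinant, and the code assumes that the $\sigma_t$ of every element are pairwise distinct (no noise is added to avoid division by zero).

```python
import numpy as np
from math import factorial


def simplex_content(verts):
    """Unsigned content (length, area, volume) of a j-simplex given (j+1, d) vertices."""
    edges = verts[1:] - verts[0]                  # (j, d) edge vectors from vertex 0
    j = edges.shape[0]
    if j == 0:
        return 1.0                                # a single point is given unit content
    gram = edges @ edges.T                        # sqrt(det(gram)) equals j! * content
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(j)


def simplex_mesh_ft(vertices, elements, rho, k):
    """Evaluate Eq. (4) at one frequency vector k for a weighted j-simplex mesh."""
    total = 0.0 + 0.0j
    for elem, rho_n in zip(elements, rho):
        verts = vertices[list(elem)]              # (j+1, d) vertices of this element
        j = len(elem) - 1
        gamma = factorial(j) * simplex_content(verts)   # content distortion factor
        sigma = verts @ k                         # sigma_t = k . x_t
        acc = 0.0 + 0.0j
        for t in range(j + 1):
            diff = sigma[t] - np.delete(sigma, t)       # (sigma_t - sigma_l) for l != t
            acc += np.exp(-1j * sigma[t]) / np.prod(diff)
        total += rho_n * (1j ** j) * gamma * acc
    return total
```

For a point cloud (j = 0) the empty product equals one, so the expression reduces to the familiar sum of $\rho_n e^{-i\mathbf{k}\cdot\mathbf{x}_n}$ over the points.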

For computing the simplex content, we use the Cayley-Menger Determinant for a general expression:

$$C_n^j = \sqrt{\frac{(-1)^{j+1}}{2^j (j!)^2} \det(\hat{B}_n^j)}, \quad \text{where } \hat{B}_n^j = \begin{bmatrix} 0 & 1 & 1 & 1 & \cdots \\ 1 & 0 & d_{12}^2 & d_{13}^2 & \cdots \\ 1 & d_{21}^2 & 0 & d_{23}^2 & \cdots \\ 1 & d_{31}^2 & d_{32}^2 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \tag{5}$$

For the matrix $\hat{B}_n^j$, each entry $d_{mn}^2$ represents the squared distance between nodes m and n. The
matrix is of size (j + 2) × (j + 2) and is symmetrical. Since the unit orthogonal simplex has content
of $C_I^j = 1/j!$, the content distortion factor $\gamma_n^j$ can be calculated by:

$$\gamma_n^j = \frac{C_n^j}{C_I^j} = j!\, C_n^j = j! \sqrt{\frac{(-1)^{j+1}}{2^j (j!)^2} \det(\hat{B}_n^j)} \tag{6}$$
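
Equations (5) and (6) can be checked numerically with a short sketch such as the one below. It is an illustrative NumPy version under our own naming (`cayley_menger_content` and `content_distortion_factor` are hypothetical), and it clamps tiny negative values caused by round-off before taking the square root.

```python
import numpy as np
from math import factorial


def cayley_menger_content(verts):
    """Content of a j-simplex from pairwise squared distances, following Eq. (5)."""
    verts = np.asarray(verts, dtype=float)        # (j+1, d) vertex coordinates
    j = verts.shape[0] - 1
    # Matrix of squared distances d_mn^2 between all pairs of vertices.
    d2 = np.sum((verts[:, None, :] - verts[None, :, :]) ** 2, axis=-1)
    B = np.ones((j + 2, j + 2))                   # bordered matrix of size (j+2) x (j+2)
    B[0, 0] = 0.0
    B[1:, 1:] = d2                                # zero diagonal, symmetric
    coeff = (-1) ** (j + 1) / (2 ** j * factorial(j) ** 2)
    return np.sqrt(max(coeff * np.linalg.det(B), 0.0))


def content_distortion_factor(verts):
    """gamma = content / (1 / j!) = j! * content, following Eq. (6)."""
    j = len(verts) - 1
    return factorial(j) * cayley_menger_content(verts)
```

Because only distances enter the determinant, the same function works for a triangle embedded in 3D, which is the usual case for surface meshes.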
Auxiliary Node Method: Equation (3) provides a means of computing the Fourier transform of a
simplex with uniform signal density. However, how do we compute the Fourier transform of polytopes
(i.e., polygons in 2D, polyhedra in 3D) with uniform signal density efficiently? Here, we introduce the
auxiliary node method (AuxNode) that utilizes signed content for efficient computing. We show that
for a solid j-polytope represented by a watertight (j − 1)-simplex mesh, we can compute the Fourier
transform of the entire polytope by traversing each of the elements in its boundary (j − 1)-simplex
mesh exactly once (Zhang & Chen, 2001).
The auxiliary node method performs the Fourier transform over the signed content bounded by an
auxiliary node (a convenient choice being the origin of the Cartesian coordinate system) and each
(j − 1)-simplex on the boundary mesh. This forms an auxiliary j-simplex: $\Omega_{n'}^j$, $n' \in [0, N']$, where
$N'$ is the number of (j − 1)-simplices in the boundary mesh. However, due to the overlapping
of these auxiliary j-simplices, we need a means of computing the sign of the transform for the
overlapping regions to cancel out. Equation (3) provides a general expression for computing the
unsigned transform for a single j-simplex. It is trivial to show that since the ordering of the nodes
does not affect the determinant in Equation (5), it gives the unsigned content value.
Therefore, to compute the Fourier transform of uniform signals in j-polytopes represented by its
watertight (j − 1)-simplex mesh using the auxiliary node method, we modify Equation (3):

$$F_n^j(\mathbf{k}) = i^j \sum_{n'=1}^{N'_n} s_{n'} \gamma_{n'}^j \left( \frac{(-1)^j}{\prod_{l=1}^{j} \sigma_l} + \sum_{t=1}^{j} \frac{e^{-i\sigma_t}}{\sigma_t \prod_{l=1,\, l \neq t}^{j} (\sigma_t - \sigma_l)} \right) \tag{7}$$

[Figure 3 panels: (a) Experiment setup; (b) MNIST, accuracy versus resolution (4 to 28) for the
Raw, NUFT, Binary Pixel and Distance Function representations.]

Figure 3: MNIST experiment. (a) schematic for experiment setup. The original MNIST pixel image
is up-sampled using interpolation and contoured to get a polygonal representation of the digit. For
the polygon, it is transformed into binary pixels, distance functions, and the NUFT physical domain.
(b) classification accuracy versus input resolution under various representation schemes. NUFT
representation is more optimal, irrespective of resolution.

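$s_{n'} \gamma_{n'}^j$ is the signed content distortion factor for the n'-th auxiliary j-simplex, where $s_{n'} \in \{-1, 1\}$.
For practical purposes, assume that the auxiliary j-simplex is in $\mathbb{R}^d$ where d = j. We can compute
the signed content distortion factor using the determinant of the Jacobian matrix for parameterizing
the auxiliary simplex to a unit orthogonal simplex. Because the determinant of this Jacobian equals the
ratio between the signed content of the auxiliary simplex and the content of the unit orthogonal simplex,
it gives the signed distortion factor directly:

$$s_{n'} \gamma_{n'}^j = \det(J) = \det([\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_j]) \tag{8}$$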
Since this method requires the boundary simplices to be oriented, the right-hand rule can be used to
infer the correct orientation of the boundary element. For 2D polygons, it requires that the watertight
boundary line mesh be oriented in a counter-clockwise fashion. For 3D polytopes, it requires that the
face normals of boundary triangles resulting from the right-hand rule be consistently outward-facing.
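
The auxiliary node method can be sketched for the 2D case (j = 2) as follows. This is an illustrative NumPy version, not the authors' code: `polygon_ft_auxnode` is a hypothetical name, the auxiliary node is placed at the origin, the signed distortion factor of each auxiliary triangle is taken as $\det([\mathbf{x}_1, \mathbf{x}_2])$ (that is, j! times its signed content, in line with Equation (6)), and every $\sigma$ is assumed to be nonzero and distinct within an edge.

```python
import numpy as np


def polygon_ft_auxnode(boundary, k):
    """Fourier transform of a uniform-density polygon from its oriented boundary edges.

    boundary: (M, 2) array of vertices in counter-clockwise order (last edge wraps around).
    k:        (2,) frequency vector.
    """
    j = 2
    total = 0.0 + 0.0j
    # Visit each boundary edge (x1 -> x2) exactly once.
    for x1, x2 in zip(boundary, np.roll(boundary, -1, axis=0)):
        signed_gamma = x1[0] * x2[1] - x1[1] * x2[0]    # det([x1, x2]); sign from orientation
        s1, s2 = x1 @ k, x2 @ k                         # sigma_1, sigma_2 (origin has sigma = 0)
        bracket = ((-1) ** j) / (s1 * s2)               # term contributed by the auxiliary node
        bracket += np.exp(-1j * s1) / (s1 * (s1 - s2))
        bracket += np.exp(-1j * s2) / (s2 * (s2 - s1))
        total += signed_gamma * bracket
    return (1j ** j) * total
```

Reversing the vertex order flips the sign of every determinant and therefore of the whole transform, which is why a consistent counter-clockwise orientation is required.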
Algorithmic implementation: Several efficiencies can be exploited to achieve fast runtime and high
robustness. First, since the general expression in Equation (4) involves division and is vulnerable to
division-by-zero errors (these are not true singularities, since they can be eliminated by taking the
limit), we add a minor random noise $\epsilon$ to vertex coordinates as well as to the k = 0 frequency
mode for robustness. Second, to avoid repeated computation, the values should be cached in memory
and reused, but caching all $\sigma$ and $e^{-i\sigma}$ values for all nodes and frequencies is infeasible for large
meshes and/or high resolution output, thus the Breadth-First-Search (BFS) algorithm should be used
to traverse the vertices for efficient memory management.
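
The effect of the noise can be illustrated with a small sketch for a single triangle. The names `triangle_ft` and `jittered_ft`, the noise scale `eps = 1e-4`, and the fixed seed are all assumptions made for this example; the text does not state the magnitude of the noise that was actually used.

```python
import numpy as np


def triangle_ft(verts, k):
    """Eq. (3) for one triangle (j = 2) in 2D; undefined when two sigma_t coincide."""
    e1, e2 = verts[1] - verts[0], verts[2] - verts[0]
    gamma = abs(e1[0] * e2[1] - e1[1] * e2[0])          # j! * area for j = 2
    sigma = verts @ k
    acc = 0.0 + 0.0j
    for t in range(3):
        acc += np.exp(-1j * sigma[t]) / np.prod(sigma[t] - np.delete(sigma, t))
    return (1j ** 2) * gamma * acc


def jittered_ft(verts, k, eps=1e-4, seed=0):
    """Perturb the vertices and the frequency by noise of scale eps, then apply Eq. (3)."""
    rng = np.random.default_rng(seed)
    verts = verts + eps * rng.standard_normal(verts.shape)
    k = k + eps * rng.standard_normal(k.shape)
    return triangle_ft(verts, k)
```

At k = 0 the unperturbed formula divides zero by zero, whereas the perturbed evaluation stays close to the limit value, which is the area of the triangle. A smaller `eps` moves the result closer to the limit but amplifies floating-point cancellation, so the scale is a trade-off.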

4 EXPERIMENTS
In this section, we will discuss the experiment setup, and we defer the details of our model architecture
and training process to the supplementary material since it is not the focus of this paper.

4.1 MNIST WITH POLYGONS

We use the MNIST experiment as a first example to show that shape information in the input
significantly affects the efficacy of the downstream learning process. Since the scope of this research
is on efficiently learning from nonuniform mesh-based representations, we compare our method with
the state of the art in a slightly different scenario by treating MNIST characters as polygons. We
choose this experiment as a first toy example since it is easy to control the learning architecture and
input resolution to highlight the effects of shape representation on deep learning performance.

[Figure 4 plot: F1, mAP and NDCG (accuracy axis from 0.0 to 0.8) for NUFT Surface and NUFT Volume
at resolutions 128, 96 and 64.]

Figure 4: Comparison between NUFT Volume and NUFT Surface performance at different resolutions.

Table 2: Comparison of shape retrieval performance with state-of-the-art models. The best
result among each representation category is highlighted in bold.

Rep          Method          F1     mAP    NDCG
No Pre-training
Volumetric   Ours(NUFT-V)    0.770  0.745  0.809
Volumetric   DeepVoxNet      0.253  0.192  0.277
With Pre-training
Multi-View   RotationNet     0.798  0.772  0.865
Multi-View   ImprovGIF       0.767  0.722  0.827
Multi-View   ReVGG           0.772  0.749  0.828
Multi-View   MVCNN           0.764  0.735  0.815

Experiment setup: We pre-process the original MNIST raw pixel images into polygons, which are
represented by watertight line meshes on their boundaries. The polygonized digits are converted into
(n × n) binary pixel images and into distance functions by uniformly sampling the polygon at the
(n × n) sample locations. For NUFT, we first compute the lowest (n × n) Fourier modes for the
polygonal shape function and then use an inverse Fourier transform to acquire the physical domain
image. We also compare the results with the raw pixel image downsampled to different resolutions,
which serves as an oracle for information ceiling. Then we perform the standard MNIST classification
experiment on these representations with varying resolution n and with the same network architecture.
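
The step from the lowest (n × n) Fourier modes to a physical domain image can be sketched as below. This is a simplified stand-in, not the experimental pipeline: instead of a polygonized digit it uses an axis-aligned rectangle on the periodic unit square, whose transform is known in closed form, and the names `box_ft_1d` and `nuft_image_of_rectangle` are hypothetical.

```python
import numpy as np


def box_ft_1d(w, a, b):
    """Integral of exp(-i w x) over [a, b], with the w = 0 limit handled explicitly."""
    w = np.asarray(w, dtype=float)
    safe = np.where(w == 0.0, 1.0, w)                   # placeholder to avoid 0/0
    val = (np.exp(-1j * safe * a) - np.exp(-1j * safe * b)) / (1j * safe)
    return np.where(w == 0.0, b - a, val)


def nuft_image_of_rectangle(n, x_range, y_range):
    """n x n physical-domain image of a rectangle indicator on the periodic unit square."""
    # The lowest n angular frequencies per axis, in numpy's FFT ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    spectrum = np.outer(box_ft_1d(k, *x_range), box_ft_1d(k, *y_range))
    # Fourier-series coefficients equal F(k) / (domain area = 1); ifft2 divides by n^2.
    return np.real(np.fft.ifft2(spectrum)) * n * n


image = nuft_image_of_rectangle(16, (0.25, 0.75), (0.25, 0.75))
```

The mean of the resulting image equals the area fraction covered by the shape, because that is exactly the zero-frequency coefficient. The pixel values are band-limited rather than binary, which is the extra information that a binary rasterization discards.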
Results: The experiment results are presented in Figure (3b). It is evident that binary pixel repre-
sentation suffers the most information loss, especially at low resolutions, which leads to rapidly
declining performance. Using the distance function representation preserves more information, but
underperforms our NUFT representation. Due to its efficient information compression in the Spectral
domain, NUFT even outperforms the downsampled raw pixel image at low resolutions.

4.2 3D SHAPE RETRIEVAL

Shape retrieval is a classic task in 3D shape learning. SHREC17 (Savva et al., 2016) which is based on
the ShapeNet55 Core dataset serves as a compelling benchmark for 3D shape retrieval performance.
We compare the retrieval performance of our model utilizing the NUFT-surface (NUFT-S) and
NUFT-volume (NUFT-V) at various resolutions against state-of-the-art shape retrieval algorithms
to illustrate its potential in a range of 3D learning problems. We performed the experiments on
the normalized dataset in SHREC17. Our model utilizes 3D DLA (Yu et al., 2018) as a backbone
architecture.
Results: Results from the experiment are tabulated in Table 2. For the shape retrieval task, most
state-of-the-art methods are based on multi-view representations that utilize a 2D CNN pretrained on
additional 2D datasets such as ImageNet. We have achieved results on par with, though not better
than, state-of-the-art pretrained 2D models. We outperform other models in this benchmark that
have not been pretrained on additional data. We also compared NUFT-volume and NUFT-surface
representations in Figure 4. Interestingly, NUFT-volume and NUFT-surface representations lead to
similar performance at the same resolution.

Published as a conference paper at ICLR 2019

Figure 5: Zoom-in comparison.

Figure 6: Qualitative side-by-side comparison of surface reconstruction results.

Method             Chamfer   Accuracy   Completeness
DMC                0.218     0.182      0.254
PSR-5              0.352     0.405      0.298
PSR-8              0.198     0.196      0.200
Ours (w/ Noise)    0.144     0.150      0.137
Ours (w/o Noise)   0.145     0.125      0.165

Table 3: Quantitative comparison of surface reconstruction methods. The metrics above are
distances, hence lower values represent better performance. Best result is highlighted in bold.
We achieve better results than the current state-of-the-art method by a sizable margin, and our
results are robust to noise in the input.

4.3 3D SURFACE RECONSTRUCTION FROM POINT CLOUDS

We further illustrate the advantages of our representation with a unique yet important task in compu-
tational geometry that has been challenging to address with conventional deep learning techniques:
surface reconstruction from point cloud. The task is challenging for deep learning in two aspects:
First, it requires input and output signals of different topologies and structures (i.e., input being a
point cloud and output being a surface). Second, it requires precise localization of the signal in 3D
space. Using our NUFT-point representation as input and NUFT-surface representation as output,
we can frame the task as a 3D image-to-image translation problem, which is easy to address using
analogous 2D techniques in 3D. We use the U-Net (Ronneberger et al., 2015) architecture and train it
with a single L2 loss between output and ground truth NUFT-surface representation.
Experiment Setup: We train and test our model using shapes from three categories of the ShapeNet55
Core dataset (car, sofa, bottle), training a separate model for each category. As a
pre-processing step, we removed faces not visible from the exterior and simplified the mesh for faster
conversion to the NUFT representation. For the input point cloud, we uniformly sampled
3000 points from each mesh and converted the points into the NUFT-point representation (128^3).
Then, we converted the triangular mesh into the NUFT-surface representation (128^3). At test time, we
post-process the output NUFT-surface implicit function by using the marching cubes algorithm to
extract 0.5 contours. Since the extracted mesh has thickness, we further shrink the mesh by moving
vertices to positions with higher density values while preserving the rigidity of each face. Finally,
we qualitatively and quantitatively compare the reconstructed meshes against results from the
traditional Poisson Surface Reconstruction (PSR) method (Kazhdan & Hoppe, 2013) at various tree
depths (5 and 8) and the Deep Marching Cubes (DMC) algorithm (Liao et al., 2018). For quantitative
comparison, we follow the literature (Seitz et al., 2006) and use Chamfer distance, Accuracy and
Completeness as the metrics. For comparison with Liao et al. (2018), we also test the model with
noisy inputs (Gaussian noise with sigma of 0.15 voxel-length at resolution 32) and compute the
distance metrics after normalizing the models to the range (0, 32).
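The three metrics can be sketched on sampled point sets as follows. This is a minimal illustration and not the evaluation code used for Table 3: it assumes Accuracy is the mean distance from each predicted point to its nearest ground-truth point, Completeness is the mean distance in the opposite direction, and Chamfer distance is the average of the two (the Chamfer column of Table 3 equals the mean of the other two columns, which is consistent with this reading). The function names and the toy point sets are hypothetical.

```python
import math


def mean_nearest_distance(src, dst):
    """Mean Euclidean distance from each point in src to its nearest point in dst."""
    total = 0.0
    for p in src:
        total += min(math.dist(p, q) for q in dst)
    return total / len(src)


def reconstruction_metrics(pred_pts, gt_pts):
    """Return (chamfer, accuracy, completeness) under the assumptions stated above."""
    accuracy = mean_nearest_distance(pred_pts, gt_pts)      # prediction -> ground truth
    completeness = mean_nearest_distance(gt_pts, pred_pts)  # ground truth -> prediction
    chamfer = 0.5 * (accuracy + completeness)               # assumed symmetric average
    return chamfer, accuracy, completeness


# Toy example: the prediction misses one ground-truth point entirely,
# so completeness is much worse than accuracy.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
chamfer, accuracy, completeness = reconstruction_metrics(pred, gt)
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index would be the natural replacement for point sets sampled from full meshes.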
Results: Refer to Table 3 for quantitative comparisons with competing algorithms on the same
task, and Figures 5 and 6 for visual comparisons. GT stands for Ground Truth. We achieve a new
state of the art in the point to surface reconstruction task, due to the good localization properties of
the NUFT representations and their flexibility across geometry topologies.


5 CONCLUSION

We present a general representation for multidimensional signals defined on simplicial complexes
that is versatile across geometrical deep learning tasks and maximizes the preservation of shape
information. We develop a set of mathematical formulations and algorithmic tools to perform
the transformations efficiently. Last but not least, we illustrate the effectiveness of the NUFT
representation with a well-controlled example (MNIST polygon), a classic 3D task (shape retrieval)
and a difficult and mostly unexplored task by deep learning (point to surface reconstruction), achieving
new state-of-the-art performance in the last task. In conclusion, we offer an alternative representation
for performing CNN based learning on geometrical signals that shows great potential in various 3D
tasks, especially tasks involving mixed-topology signals.

ACKNOWLEDGEMENTS

We would like to thank Yiyi Liao for helping with the DMC comparison, Jonathan Shewchuk for
valuable discussions, and Luna Huang for LaTeX magic. Chiyu “Max” Jiang is supported by the
Chang-Lin Tien Graduate Fellowship and the Graduate Division Block Grant Award of UC Berkeley.
This work is supported by a TUM-IAS Rudolf Mößbauer Fellowship and the ERC Starting Grant
Scan2CAD (804724).

REFERENCES
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative
voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally
connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

Daniel R Canelhas. Truncated Signed Distance Fields Applied To Robotics. PhD thesis, Örebro
University, 2017.

Fu-Lai Chu and Chi-Fang Huang. On the calculation of the fourier transform of a polygonal shape
function. Journal of Physics A: Mathematical and General, 1989.

Taco Cohen, Mario Geiger, and Max Welling. Convolutional networks for spherical signals. arXiv
preprint arXiv:1709.04893, 2017.

Angela Dai and Matthias Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene
segmentation. arXiv, 2018.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on
graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems,
pp. 3844–3852, 2016.

Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.

Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast fourier transform. SIAM
review, 46(3):443–454, 2004.

Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object
reconstruction. arXiv preprint arXiv:1704.00710, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 770–778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data.
arXiv preprint arXiv:1506.05163, 2015.


Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In
Advances in Neural Information Processing Systems, pp. 364–375, 2017.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018.
Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions
on Graphics (ToG), 32(3):29, 2013.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolu-
tional neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
Shung-Wu Lee and Raj Mittra. Fourier transform of a polygonal shape function and its application in
electromagnetics. IEEE Transactions on Antennas and Propagation, 1983.
Huiyuan Li and Yuan Xu. Discrete fourier analysis on a dodecahedron and a tetrahedron. Mathematics
of Computation, 78(266):999–1029, 2009.
Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas Guibas. Grass:
Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG), 36
(4):52, 2017.
Yiyi Liao, Simon Donne, and Andreas Geiger. Deep marching cubes: Learning explicit surface
representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE
Computer Society, 2018.
Hongsen Liu, Yang Cong, Shuai Wang, Huijie Fan, Dongying Tian, and Yandong Tang. Deep learning
of directional truncated signed distance function for robust 3d object recognition. In Intelligent
Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 5934–5940. IEEE,
2017.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, 2015.
Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G
Kim, and Yaron Lipman. Convolutional neural networks on surfaces via seamless toric covers.
ACM Trans. Graph, 36(4):71, 2017.
Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic con-
volutional neural networks on riemannian manifolds. In Proceedings of the IEEE international
conference on computer vision workshops, pp. 37–45, 2015.
Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time
object recognition. In IROS, 2015.
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M
Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc.
CVPR, volume 1, pp. 3, 2017.
Jacopo Pantaleoni. Voxelpipe: a programmable pipeline for 3d voxelization. In Proceedings of the
ACM SIGGRAPH Symposium on High Performance Graphics, pp. 99–106. ACM, 2011.
Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis.
6-dof object pose from semantic keypoints. In Robotics and Automation (ICRA), 2017 IEEE
International Conference on, pp. 2011–2018. IEEE, 2017.
Victor Adrian Prisacariu and Ian Reid. Nonlinear shape manifolds as shape priors in level set
segmentation and tracking. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pp. 2185–2192. IEEE, 2011.
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. In CVPR, 2017a.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature
learning on point sets in a metric space. In NIPS, 2017b.


Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information processing systems,
pp. 91–99, 2015.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In International Conference on Medical image computing and computer-
assisted intervention, pp. 234–241. Springer, 2015.
Ian Sammis and John Strain. A geometric nonuniform fast fourier transform. Journal of Computa-
tional Physics, 228(18):7086–7108, 2009.
Manolis Savva, Fisher Yu, Hao Su, M Aono, B Chen, D Cohen-Or, W Deng, Hang Su, Song Bai,
Xiang Bai, et al. Shrec’16 track large-scale 3d shape retrieval from shapenet core55. In Proceedings
of the eurographics workshop on 3D object retrieval, 2016.
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison
and evaluation of multi-view stereo reconstruction algorithms. In Computer vision and pattern
recognition, 2006 IEEE Computer Society Conference on, volume 1, pp. 519–528. IEEE, 2006.
Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation
for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional
neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on
computer vision, pp. 945–953, 2015a.
Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for cnn: Viewpoint estimation in
images using cnns trained with rendered 3d model views. In Proceedings of the IEEE International
Conference on Computer Vision, pp. 2686–2694, 2015b.
Jiachang Sun. Multivariate fourier transform methods over simplex and super-simplex domains.
Journal of Computational Mathematics, pp. 305–322, 2006.
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient
convolutional architectures for high-resolution 3d outputs. In CVPR, 2017.
Sebastian Thrun. Learning occupancy grid maps with forward sensor models. Autonomous robots,
15(2):111–127, 2003.
Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Multi-view consistency as supervisory signal
for learning shape and pose prediction. arXiv preprint arXiv:1801.03910, 2018.
Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh:
Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018a.
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based
convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36
(4):72, 2017.
Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon.
Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018b.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilis-
tic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural
Information Processing Systems, pp. 82–90, 2016.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong
Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1912–1920, 2015.
Li Yi, Hao Su, Xingwen Guo, and Leonidas Guibas. Syncspeccnn: Synchronized spectral cnn for 3d
shape segmentation. In Computer Vision and Pattern Recognition (CVPR), 2017.


Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR,
2018.
Cha Zhang and Tsuhan Chen. Efficient feature extraction for 2d/3d objects in mesh representation.
In ICIP, 2001.
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape
primitives with recurrent neural networks. In The IEEE International Conference on Computer
Vision (ICCV), 2017.


APPENDIX

A MATHEMATICAL DERIVATION

Without loss of generality, assume that the j-simplex is defined in $\mathbb{R}^d$ space where d ≥ j, since it is
not possible to define a j-simplex in a space with dimensions lower than j. For most cases below
(except j = 0) we will parameterize the original simplex domain to a unit orthogonal simplex in $\mathbb{R}^j$
(as shown in Figure 7). Denote the original coordinate system in $\mathbb{R}^d$ as x and the new coordinate
system in the parametric $\mathbb{R}^j$ space as p. Choose the following parameterization scheme:

$$x(p) = [x_1 - x_d,\; x_2 - x_d,\; \cdots,\; x_{d-1} - x_d]\, p + x_d \tag{9}$$

By performing the Fourier transform integral in the parametric space and restoring the results by the
content distortion factor $\gamma_n^j$ we can get equivalent results as the Fourier transform on the original
simplex domain. Content is the generalization of volume to arbitrary dimensions (i.e. unity for
points, length for lines, area for triangles, volume for tetrahedra). The content distortion factor $\gamma_n^j$ is the
ratio between the content of the original simplex and the content of the unit orthogonal simplex in
parametric space. The content is signed if switching any pair of nodes in the simplex changes the sign
of the content, and it is unsigned otherwise. See subsection 3.2 for means of computing the content
and the content distortion factor.
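One way to compute the unsigned content and the resulting distortion factor is through the Gram determinant of the edge vectors. The sketch below is an illustration under that assumption and may differ from the computation described in subsection 3.2; the function names are hypothetical. It uses the facts that a j-simplex with edge matrix E has unsigned content sqrt(det(E Eᵀ)) / j! and that the unit orthogonal j-simplex has content 1 / j!.

```python
import math


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-15:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return result


def unsigned_content(vertices):
    """Unsigned content of a j-simplex given its j + 1 vertices in R^d."""
    j = len(vertices) - 1
    if j == 0:
        return 1.0  # the content of a point is unity
    last = vertices[-1]
    # Edge vectors relative to the last vertex, as in the parameterization (9).
    edges = [[a - b for a, b in zip(v, last)] for v in vertices[:-1]]
    gram = [[sum(a * b for a, b in zip(e1, e2)) for e2 in edges] for e1 in edges]
    return math.sqrt(max(det(gram), 0.0)) / math.factorial(j)


def distortion_factor(vertices):
    """Ratio of the simplex content to the unit orthogonal simplex content (1 / j!)."""
    j = len(vertices) - 1
    return unsigned_content(vertices) * math.factorial(j)


# A right triangle with legs 2 and 3 embedded in R^3: area 3, distortion factor 6.
triangle = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
area = unsigned_content(triangle)
gamma = distortion_factor(triangle)
```

The Gram form works for any d ≥ j, which is why the triangle above can live in three dimensions. A signed content (for the case j = d) would instead take the determinant of the square edge matrix directly.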

A.1 POINT: 0-SIMPLEX

A point has a spatial position x but no size (length, area, volume), hence it can be mathematically
modelled as a delta function. The delta function (or Dirac delta function) represents a point mass: it is
equal to zero everywhere except at zero, and its integral over the entire real space is one, i.e.:

$$\iiint_{-\infty}^{\infty} \delta(x)\, dx = \iiint_{0^-}^{0^+} \delta(x)\, dx = 1 \tag{10}$$

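For a unit point mass at location $x_n$:

$$f_n^0(x) = \delta(x - x_n) \tag{11}$$

\begin{align}
F_n^0(\omega) &= \iiint_{-\infty}^{\infty} f_n^0(x)\, e^{-i\omega\cdot x}\, dx \tag{12}\\
&= \iiint_{-\infty}^{\infty} \delta(x - x_n)\, e^{-i\omega\cdot x}\, dx \tag{13}\\
&= e^{-i\omega\cdot x_n} \tag{14}
\end{align}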

Indeed, for 0-simplex, we have recovered the definition of the Discrete Fourier Transform (DFT).
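A short numerical check makes the connection to the DFT concrete. The sketch assumes a one-dimensional setting in which weighted point masses sit on the integer grid 0, ..., N − 1 and the spectrum is evaluated at ω_k = 2πk/N; the per-point weights and the function names are illustrative additions, since Equation (14) is stated for a single unit point mass.

```python
import cmath
import math


def point_nuft(positions, weights, omega):
    """Sum of w_n * exp(-i * omega * x_n): Equation (14) applied to each weighted point."""
    return sum(w * cmath.exp(-1j * omega * x) for x, w in zip(positions, weights))


def dft(samples):
    """Textbook discrete Fourier transform, used only as a reference."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * m / n) for m, s in enumerate(samples))
        for k in range(n)
    ]


samples = [1.0, 2.0, 0.0, -1.0]
n = len(samples)
positions = list(range(n))  # point masses on a uniform integer grid

# Evaluate the point-mass spectrum at the DFT frequencies 2*pi*k/N.
spectrum = [point_nuft(positions, samples, 2 * math.pi * k / n) for k in range(n)]
reference = dft(samples)
max_diff = max(abs(a - b) for a, b in zip(spectrum, reference))
```

With uniformly spaced points the two sums are term-by-term identical; moving the points off the grid changes only the `positions` list, which is what makes the transform non-uniform.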

A.2 LINE: 1-SIMPLEX

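For a line with vertices at location $x_1, x_2 \in \mathbb{R}^d$, by parameterizing it onto a unit line, we get:

\begin{align}
F_n^1(\omega) &= \gamma_n^1 \int_0^1 dp\, e^{-i\omega\cdot x(p)} \tag{15}\\
&= i\gamma_n^1 \left( \frac{e^{-i\omega\cdot x_1}}{\omega(x_1 - x_2)} + \frac{e^{-i\omega\cdot x_2}}{\omega(x_2 - x_1)} \right) \tag{16}
\end{align}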
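The closed form (16) can be checked against direct numerical integration of (15). In this sketch the shorthand ω(x_a − x_b) is read as the dot product ω · (x_a − x_b), the distortion factor is taken to be the segment length (the unit line has length one), and the midpoint rule stands in for the integral. The function names and the test segment are hypothetical.

```python
import cmath
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def line_ft_closed(x1, x2, omega):
    """Closed form of Equation (16); singular when omega . x1 == omega . x2."""
    gamma = math.dist(x1, x2)  # segment length divided by the unit length
    s1, s2 = dot(omega, x1), dot(omega, x2)
    return 1j * gamma * (
        cmath.exp(-1j * s1) / (s1 - s2) + cmath.exp(-1j * s2) / (s2 - s1)
    )


def line_ft_quadrature(x1, x2, omega, steps=2000):
    """Midpoint-rule evaluation of Equation (15) with x(p) = (x1 - x2) p + x2."""
    gamma = math.dist(x1, x2)
    total = 0j
    for k in range(steps):
        p = (k + 0.5) / steps
        x = [(a - b) * p + b for a, b in zip(x1, x2)]
        total += cmath.exp(-1j * dot(omega, x))
    return gamma * total / steps


x1, x2 = (0.0, 0.0), (3.0, 4.0)
omega = (0.7, -0.2)
closed = line_ft_closed(x1, x2, omega)
numeric = line_ft_quadrature(x1, x2, omega)
```

The closed form divides by ω · (x_1 − x_2), so frequencies perpendicular to the segment need separate handling.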


(a) Original j-simplex in $\mathbb{R}^d$ space, d ≥ j. (b) Unit orthogonal j-simplex in $\mathbb{R}^j$.

Figure 7: Schematic of example for 2-simplex. The original j-simplex is parameterized to a unit
orthogonal j-simplex in $\mathbb{R}^j$ space for performing the integration. Parameterization incurs a content
distortion factor $\gamma_n^j$, which is the ratio between the content of the original simplex and that of the
unit orthogonal simplex in parametric space.

A.3 TRIANGLE: 2-SIMPLEX

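For a triangle with vertices $x_1, x_2, x_3 \in \mathbb{R}^d$, parameterization onto a unit orthogonal triangle gives:

\begin{align}
F_n^2(\omega) &= \gamma_n^2 \int_0^1 dp \int_0^{1-p} dq\, e^{-i\omega\cdot x(p)} \tag{17}\\
&= \gamma_n^2\, e^{-i\omega\cdot x_3} \int_0^1 dp\, e^{-ip\omega\cdot(x_1 - x_3)} \int_0^{1-p} dq\, e^{-iq\omega\cdot(x_2 - x_3)} \tag{18}\\
&= -\gamma_n^2 \left( \frac{e^{-i\omega\cdot x_1}}{\omega^2 (x_1 - x_2)(x_1 - x_3)} + \frac{e^{-i\omega\cdot x_2}}{\omega^2 (x_2 - x_1)(x_2 - x_3)} + \frac{e^{-i\omega\cdot x_3}}{\omega^2 (x_3 - x_1)(x_3 - x_2)} \right) \tag{19}
\end{align}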
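Equation (19) can be verified numerically in the same way as the line case. The sketch below assumes a triangle in the plane, reads each factor ω(x_a − x_b) in the denominators as the dot product ω · (x_a − x_b), and takes the distortion factor as twice the unsigned triangle area (the unit orthogonal triangle has area 1/2). The quadrature maps the unit square onto the unit orthogonal triangle; all names and the test triangle are illustrative.

```python
import cmath


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def distortion_2d(x1, x2, x3):
    """Twice the unsigned area of a planar triangle."""
    return abs((x1[0] - x3[0]) * (x2[1] - x3[1]) - (x1[1] - x3[1]) * (x2[0] - x3[0]))


def triangle_ft_closed(x1, x2, x3, omega):
    """Closed form of Equation (19); singular when two projections coincide."""
    gamma = distortion_2d(x1, x2, x3)
    s1, s2, s3 = (dot(omega, v) for v in (x1, x2, x3))
    return -gamma * (
        cmath.exp(-1j * s1) / ((s1 - s2) * (s1 - s3))
        + cmath.exp(-1j * s2) / ((s2 - s1) * (s2 - s3))
        + cmath.exp(-1j * s3) / ((s3 - s1) * (s3 - s2))
    )


def triangle_ft_quadrature(x1, x2, x3, omega, steps=200):
    """Midpoint rule for Equation (17) using p = u, q = v (1 - u), Jacobian (1 - u)."""
    gamma = distortion_2d(x1, x2, x3)
    h = 1.0 / steps
    total = 0j
    for a in range(steps):
        u = (a + 0.5) * h
        for b in range(steps):
            v = (b + 0.5) * h
            p, q = u, v * (1.0 - u)
            x = [(c1 - c3) * p + (c2 - c3) * q + c3 for c1, c2, c3 in zip(x1, x2, x3)]
            total += (1.0 - u) * cmath.exp(-1j * dot(omega, x))
    return gamma * total * h * h


x1, x2, x3 = (0.0, 0.0), (2.0, 0.0), (0.5, 1.5)
omega = (1.0, 0.6)
closed = triangle_ft_closed(x1, x2, x3, omega)
numeric = triangle_ft_quadrature(x1, x2, x3, omega)
```

The closed form needs only three complex exponentials per frequency, whereas the quadrature cost grows with the square of the step count.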

A.4 TETRAHEDRON: 3-SIMPLEX

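For a tetrahedron with vertices $x_1, x_2, x_3, x_4 \in \mathbb{R}^d$, parameterization onto a unit orthogonal tetrahedron gives:

\begin{align}
F_n^3(\omega) &= \gamma_n^3 \int_0^1 dp \int_0^{1-p} dq \int_0^{1-p-q} dr\, e^{-i\omega\cdot x(p)} \tag{20}\\
&= \gamma_n^3\, e^{-i\omega\cdot x_4} \int_0^1 dp\, e^{-ip\omega\cdot(x_1 - x_4)} \int_0^{1-p} dq\, e^{-iq\omega\cdot(x_2 - x_4)} \int_0^{1-p-q} dr\, e^{-ir\omega\cdot(x_3 - x_4)} \tag{21}\\
&= -i\gamma_n^3 \left( \frac{e^{-i\omega\cdot x_1}}{\omega^3 (x_1 - x_2)(x_1 - x_3)(x_1 - x_4)} + \frac{e^{-i\omega\cdot x_2}}{\omega^3 (x_2 - x_1)(x_2 - x_3)(x_2 - x_4)} \right. \notag\\
&\qquad\qquad \left. + \frac{e^{-i\omega\cdot x_3}}{\omega^3 (x_3 - x_1)(x_3 - x_2)(x_3 - x_4)} + \frac{e^{-i\omega\cdot x_4}}{\omega^3 (x_4 - x_1)(x_4 - x_2)(x_4 - x_3)} \right) \tag{22}
\end{align}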
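Equations (14), (16), (19) and (22) share one pattern: a prefactor i^j γ multiplying a sum over vertices, with each term divided by the product of projected differences to the other vertices. The sketch below implements that pattern as a single function and checks the tetrahedron case against direct quadrature of (20). It assumes σ_t = ω · x_t, reads the denominators as products of (σ_t − σ_l), and uses the unit tetrahedron so that the distortion factor is one; the names are hypothetical and the generalization is inferred from the four cases above.

```python
import cmath


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simplex_ft(sigmas, gamma):
    """i^j * gamma * sum_t exp(-i sigma_t) / prod_{l != t} (sigma_t - sigma_l)."""
    j = len(sigmas) - 1
    total = 0j
    for t, st in enumerate(sigmas):
        denom = 1.0
        for l, sl in enumerate(sigmas):
            if l != t:
                denom *= st - sl
        total += cmath.exp(-1j * st) / denom
    return (1j ** j) * gamma * total


def tetra_ft_quadrature(verts, omega, gamma, steps=40):
    """Midpoint rule for Equation (20) on the unit cube mapped to the unit simplex."""
    v1, v2, v3, v4 = verts
    h = 1.0 / steps
    total = 0j
    for a in range(steps):
        u = (a + 0.5) * h
        for b in range(steps):
            v = (b + 0.5) * h
            for c in range(steps):
                w = (c + 0.5) * h
                # p = u, q = v (1 - u), r = w (1 - u)(1 - v); Jacobian (1 - u)^2 (1 - v).
                p, q, r = u, v * (1 - u), w * (1 - u) * (1 - v)
                jac = (1 - u) ** 2 * (1 - v)
                x = [
                    (c1 - c4) * p + (c2 - c4) * q + (c3 - c4) * r + c4
                    for c1, c2, c3, c4 in zip(v1, v2, v3, v4)
                ]
                total += jac * cmath.exp(-1j * dot(omega, x))
    return gamma * total * h ** 3


verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
omega = (0.5, 1.0, 1.7)
sigmas = [dot(omega, v) for v in verts]
closed = simplex_ft(sigmas, gamma=1.0)
numeric = tetra_ft_quadrature(verts, omega, gamma=1.0)
```

The closed form is singular whenever two vertices project to the same σ, and it loses precision when the projections are nearly equal, so a practical implementation would need to treat those frequencies separately.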


B COMPARISON OF GEOMETRICAL SHAPE INFORMATION


Besides evaluating and comparing the shape representation schemes in the context of machine
learning problems, we evaluate the different representation schemes in terms of their geometrical shape
information. To evaluate geometrical shape information, we first convert the original polytopal shapes
into the corresponding representations and then reconstruct the shape from these representations
using interpolation-based upsampling followed by contouring methods. Finally, we use the mesh
Intersection over Union (mesh-IoU) metric between the original and reconstructed mesh to quantify
geometrical shape information. Mesh boolean and volume computation for 2D polygons and 3D
triangular meshes can be efficiently performed with standard computational geometry methods. In
three dimensions, contouring and mesh extraction can be performed using the marching cubes algorithm
for mesh reconstruction. For the binarized representation, we perform bilinear upsampling (which does
not affect the final result) followed by 0.5-contouring. For our NUFT representation, we perform
spectral domain upsampling (which corresponds to zero-padding of higher modes in the spectral domain),
followed by 0.5-contouring. Qualitative side-by-side comparisons are presented for visual inspection,
and quantitative empirical evaluation is performed for the Bunny Mesh (1K faces) model. Refer to
Figures 8, 9, and 10.
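Spectral domain upsampling by zero-padding can be illustrated in one dimension. The sketch below is a simplified stand-in for the 2D and 3D procedure: it assumes an odd number of samples (so that no Nyquist mode has to be split), uses a plain O(N²) DFT for self-containment, and all function names are hypothetical. For a band-limited signal, the upsampled values coincide with the underlying function on the finer grid.

```python
import cmath
import math


def dft(values):
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def idft(spectrum):
    m = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / m) for k, c in enumerate(spectrum)) / m
        for t in range(m)
    ]


def spectral_upsample(samples, m):
    """Upsample a real periodic signal to m points by zero-padding its higher modes."""
    n = len(samples)
    if n % 2 == 0 or m < n:
        raise ValueError("this sketch assumes an odd input length and m >= n")
    spectrum = dft(samples)
    half = n // 2
    padded = [0j] * m
    for k in range(half + 1):          # non-negative frequencies 0 .. half
        padded[k] = spectrum[k]
    for k in range(1, half + 1):       # negative frequencies -1 .. -half
        padded[m - k] = spectrum[n - k]
    # Rescale so that amplitudes are preserved after the longer inverse transform.
    return [(m / n) * value.real for value in idft(padded)]


def signal(t):
    """Band-limited test signal with period one."""
    return 1.0 + math.cos(2 * math.pi * t) + 0.5 * math.sin(4 * math.pi * t)


coarse = [signal(t / 9) for t in range(9)]
fine = spectral_upsample(coarse, 27)
max_error = max(abs(fine[t] - signal(t / 27)) for t in range(27))
```

Because no new spectral content is introduced, the upsampled grid carries the same information as the low-resolution spectrum, and the subsequent 0.5-contouring operates on a smooth interpolant rather than on a blocky grid.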

C MODEL ARCHITECTURE AND TRAINING DETAILS

C.1 MNIST WITH POLYGONS

Model Architecture: We use the state-of-the-art Deep Layer Aggregation (DLA) backbone architecture
with [1, 1, 2, 1] levels and [32, 64, 256, 512] filter numbers. We keep the architecture constant
while varying the input resolution.

Training Details: We train the model with a batch size of 64, a learning rate of 0.01 and a learning rate
step size of 10. We use the SGD optimizer with a momentum of 0.9 and a weight decay of $1 \times 10^{-4}$ for 15
epochs.

C.2 3D SHAPE RETRIEVAL

Model Architecture: We use the DLA34 architecture with all 2D convolutions modified to be 3D
convolutions. It consists of [1, 1, 1, 2, 2, 1] levels, with [16, 32, 64, 128, 256, 512] filter numbers.

Training Details: We train the model with a batch size of 64, a learning rate of $1 \times 10^{-3}$ and a learning rate
step size of 30. We use the Adam optimizer with a momentum of 0.9 and a weight decay of $1 \times 10^{-4}$
for 40 epochs.

C.3 3D SURFACE RECONSTRUCTION FROM POINT CLOUD

Model Architecture: We use a modified 3D version of the U-Net architecture consisting of 4 down
convolutions and 4 up convolutions with skip layers. The numbers of filters for the down convolutions are
[32, 64, 128, 256], and double that for the up convolutions.

Training Details: We train the model using the Adam optimizer with a learning rate of $3 \times 10^{-4}$ for 200
epochs. We use the NUFT-point representation as input and a single L2 loss between the output and ground
truth (NUFT-surface) to train the network. We train and evaluate the model at 128^3 resolution.


(a) GT polygon (b) 32×32 Binary (c) 32×32 NUFT

Figure 8: Visualizing different representations. (a) Shows the original ground truth polygon, (b, c)
show reconstructed polygons from binary and NUFT representations.

(a) GT triangular mesh (b) 64^3 Binary (c) 64^3 NUFT

Figure 9: Comparison between 3D shapes. (a) Original mesh, (b) Reconstructed mesh from Binary
Voxel (64 × 64 × 64), (c) Reconstructed mesh from NUFT (64 × 64 × 64)

[Plot: Relative Error (axis ticks from 0.2% to 20%) versus Resolution in each dimension (n, from 20 to 120) for the Binary Voxel, NUFT-Volume and Signed Distance Function representations. Annotations: Accuracy: 4.2x; Efficiency: 58x.]

Figure 10: Comparison of Representations in Mesh Recovery Accuracy (Example mesh: Stanford
Bunny 1K Mesh). Notes: (i) Relative error is defined by the proportion of volume of differenced
mesh to volume of the original mesh. (ii) Error estimates for NUFT-Volume over 50 on abscissa are
inaccurate due to inadequate quadrature resolution.
