Graph Neural Networks:


A Review of Methods and Applications
Jie Zhou∗ , Ganqu Cui∗ , Zhengyan Zhang∗ , Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li,
Maosong Sun

Abstract—Many learning tasks require dealing with graph data, which contains rich relation information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interfaces, and classifying diseases all require a model that learns from graph inputs. In other domains, such as learning from non-structural data like texts and images, reasoning on extracted structures, like the dependency trees of sentences and the scene graphs of images, is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are connectionist models that capture the dependence of graphs via message passing between the nodes of graphs. Unlike standard neural networks, graph neural networks retain a state that can represent information from a node's neighborhood with arbitrary depth. Although the primitive GNNs have been found difficult to train to a fixed point, recent advances in network architectures, optimization techniques, and parallel computation have enabled successful learning with them. In recent years, systems based on graph convolutional networks (GCN) and gated graph neural networks (GGNN) have demonstrated ground-breaking performance on many of the tasks mentioned above. In this survey, we provide a detailed review of existing graph neural network models, systematically categorize the applications, and propose four open problems for future research.

Index Terms—Deep Learning, Graph Neural Network

* indicates equal contribution.
• Jie Zhou, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu (corresponding author) and Maosong Sun are with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
  E-mail: {zhoujie18, zhangzhengyan14, cheng-ya14}@mails.tsinghua.edu.cn, {sms, liuzy}@tsinghua.edu.cn
• Ganqu Cui is with the Department of Physics, Tsinghua University, Beijing 100084, China.
  E-mail: cgq15@mails.tsinghua.edu.cn
• Lifeng Wang and Changcheng Li are with Tencent Inc., Shenzhen, China.
  E-mail: {fandywang, harrychli}@tencent.com

1 INTRODUCTION

Graphs are a kind of data structure which models a set of objects (nodes) and their relationships (edges). Recently, research on analyzing graphs with machine learning has been receiving more and more attention because of the great expressive power of graphs: graphs can be used to denote a large number of systems across various areas, including social science (social networks) [1], [2], natural science (physical systems [3], [4] and protein-protein interaction networks [5]), knowledge graphs [6] and many other research areas [7]. As a unique non-Euclidean data structure for machine learning, graph analysis focuses on node classification, link prediction, and clustering. Graph neural networks (GNNs) are deep learning based methods that operate on the graph domain. Due to their convincing performance and high interpretability, GNNs have recently become a widely applied graph analysis method. In the following paragraphs, we illustrate the fundamental motivations of graph neural networks.

The first motivation of GNNs roots in convolutional neural networks (CNNs) [8]. CNNs have the ability to extract multi-scale localized spatial features and compose them to construct highly expressive representations, which led to breakthroughs in almost all machine learning areas and started the new era of deep learning [9]. However, CNNs can only operate on regular Euclidean data like images (2D grids) and text (1D sequences), even though these data structures can be regarded as instances of graphs. Looking more closely at CNNs and graphs, we find the keys to CNNs' success: local connections, shared weights and the use of multiple layers [9]. These are also of great importance for problems on the graph domain, because 1) graphs are the most typical locally connected structure, 2) shared weights reduce the computational cost compared with traditional spectral graph theory [10], and 3) the multi-layer structure is the key to dealing with hierarchical patterns, capturing features of various sizes. It is therefore straightforward to think of finding a generalization of CNNs to graphs. However, as shown in Fig. 1, it is hard to define localized convolutional filters and pooling operators on graphs, which hinders the transfer of CNNs from the Euclidean domain to the non-Euclidean domain.

Fig. 1. Left: image in Euclidean space. Right: graph in non-Euclidean space.

The other motivation comes from graph embedding, which learns to represent graph nodes, edges or subgraphs as low-dimensional vectors.

In the field of graph analysis, traditional machine learning approaches usually rely on hand-engineered features and are limited by their inflexibility and high cost. Following the idea of representation learning and the success of word embedding [11], DeepWalk [12], regarded as the first graph embedding method based on representation learning, applies the SkipGram model [11] to generated random walks. Similar approaches such as node2vec [13], LINE [14] and TADW [15] also achieved breakthroughs. However, these methods suffer from two severe drawbacks [16]. First, no parameters are shared between nodes in the encoder, which is computationally inefficient since the number of parameters grows linearly with the number of nodes. Second, direct embedding methods lack the ability to generalize, which means they cannot deal with dynamic graphs or generalize to new graphs.

Based on CNNs and graph embedding, graph neural networks (GNNs) are proposed to collectively aggregate information from the graph structure. Thus they can model input and/or output consisting of elements and their dependencies. Further, graph neural networks can simultaneously model the diffusion process on the graph with an RNN kernel.

In the following part, we explain the fundamental reasons why graph neural networks are worth investigating. Firstly, standard neural networks like CNNs and RNNs cannot handle graph input properly because they stack the features of nodes in a specific order, while there is no natural order of nodes in a graph. To present a graph completely, we would have to traverse all possible orders as input to models like CNNs and RNNs, which is very redundant. To solve this problem, GNNs propagate on each node respectively, ignoring the input order of nodes; in other words, the output of GNNs is invariant to the input order of nodes. Secondly, an edge in a graph represents the dependency between two nodes. In standard neural networks, this dependency information is just regarded as a feature of the nodes. However, GNNs can propagate guided by the graph structure instead of using it as part of the features. Generally, GNNs update the hidden states of nodes by a weighted sum of the states of their neighborhood. Thirdly, reasoning is a very important research topic for high-level artificial intelligence, and the reasoning process in the human brain is largely based on graphs extracted from daily experience. Standard neural networks have shown the ability to generate synthetic images and documents by learning the distribution of data, but they still cannot learn a reasoning graph from large experimental data. GNNs, however, explore generating graphs from non-structural data like scene pictures and story documents, which can be a powerful neural model for further high-level AI. Recently, it has been shown that even an untrained GNN with a simple architecture can perform well [17].

There exist several comprehensive reviews on graph neural networks. [18] gives a formal definition of early graph neural network approaches, and [19] demonstrates the approximation properties and computational capabilities of graph neural networks. [20] proposed a unified framework, MoNet, to generalize CNN architectures to non-Euclidean domains (graphs and manifolds), and the framework could generalize several spectral methods on graphs [2], [21] as well as some models on manifolds [22], [23]. [24] provides a thorough review of geometric deep learning, which presents its problems, difficulties, solutions, applications and future directions. [20] and [24] focus on generalizing convolutions to graphs or manifolds; in this paper, however, we only focus on problems defined on graphs, and we also investigate other mechanisms used in graph neural networks, such as the gate mechanism, the attention mechanism and skip connections. [25] proposed the message passing neural network (MPNN), which could generalize several graph neural network and graph convolutional network approaches. It presents the definition of the message passing neural network and demonstrates its application to quantum chemistry. [26] proposed the non-local neural network (NLNN), which unifies several "self-attention"-style methods. However, the model is not explicitly defined on graphs in the original paper. Focusing on specific application domains, [25] and [26] only give examples of how to generalize other models using their frameworks and do not provide a review of other graph neural network models. [27] proposed the graph network (GN) framework. The framework has a strong capability to generalize other models, and its relational inductive biases promote combinatorial generalization, which is thought to be a top priority for AI. However, [27] is part position paper, part review and part unification, and it only gives a rough classification of the applications. In this paper, we provide a thorough review of different graph neural network models as well as a systematic taxonomy of the applications.

To summarize, this paper presents an extensive survey of graph neural networks with the following contributions.

• We provide a detailed review of existing graph neural network models. We introduce the original model, its variants and several general frameworks. We examine various models in this area and provide a unified representation to present the different propagation steps in different models. One can easily distinguish between different models using our representation by recognizing the corresponding aggregators and updaters.
• We systematically categorize the applications and divide them into structural scenarios, non-structural scenarios and other scenarios. We present several major applications and their corresponding methods in different scenarios.
• We propose four open problems for future research. Graph neural networks suffer from over-smoothing and scaling problems. There are still no effective methods for dealing with dynamic graphs or modeling non-structural sensory data. We provide a thorough analysis of each problem and propose future research directions.

The rest of this survey is organized as follows. In Sec. 2, we introduce various models in the graph neural network family. We first introduce the original framework and its limitations.

We then present its variants, which try to release these limitations, and finally we introduce several general frameworks proposed recently. In Sec. 3, we introduce several major applications of graph neural networks in structural scenarios, non-structural scenarios and other scenarios. In Sec. 4, we propose four open problems of graph neural networks as well as several future research directions. Finally, we conclude the survey in Sec. 5.

2 MODELS

Graph neural networks are useful tools on non-Euclidean structures, and various methods have been proposed in the literature to improve the model's capability.

In Sec 2.1, we describe the original graph neural networks proposed in [18]. We also list the limitations of the original GNN in representation capability and training efficiency. In Sec 2.2 we introduce several variants of graph neural networks aiming to release these limitations. These variants operate on graphs of different types, utilize different propagation functions and use advanced training methods. In Sec 2.3 we present three general frameworks which could generalize and extend several lines of work. In detail, the message passing neural network (MPNN) [25] unifies various graph neural network and graph convolutional network approaches; the non-local neural network (NLNN) [26] unifies several "self-attention"-style methods; and the graph network (GN) [27] could generalize almost every graph neural network variant mentioned in this paper.

Before going further into the different sections, we give the notations that will be used throughout the paper. The detailed descriptions of the notations can be found in Table 1.

TABLE 1: Notations used in this paper.

  Notation            Description
  R^m                 m-dimensional Euclidean space
  a, a, A             Scalar, vector, matrix
  A^T                 Matrix transpose
  I_N                 Identity matrix of dimension N
  g_θ ⋆ x             Convolution of g_θ and x
  N                   Number of nodes in the graph
  N^v                 Number of nodes in the graph
  N^e                 Number of edges in the graph
  N_v                 Neighborhood set of node v
  a^t_v               Vector a of node v at time step t
  h_v                 Hidden state of node v
  h^t_v               Hidden state of node v at time step t
  e_vw                Features of edge from node v to w
  e_k                 Features of edge with label k
  o^t_v               Output of node v
  W^i, U^i, W^o, U^o, ...   Matrices for computing i, o, ...
  b^i, b^o, ...       Vectors for computing i, o, ...
  σ                   The logistic sigmoid function
  ρ                   An alternative non-linear function
  tanh                The hyperbolic tangent function
  LeakyReLU           The LeakyReLU function
  ⊙                   Element-wise multiplication operation
  ‖                   Vector concatenation

2.1 Graph Neural Networks

The concept of the graph neural network (GNN) was first proposed in [18], which extended existing neural networks for processing data represented in graph domains. In a graph, each node is naturally defined by its features and the related nodes. The target of GNN is to learn a state embedding h_v ∈ R^s which contains the information of the neighborhood for each node. The state embedding h_v is an s-dimensional vector of node v and can be used to produce an output o_v such as the node label. Let f be a parametric function, called the local transition function, that is shared among all nodes and updates the node state according to the input neighborhood. And let g be the local output function that describes how the output is produced. Then, h_v and o_v are defined as follows:

    h_v = f(x_v, x_co[v], h_ne[v], x_ne[v])    (1)
    o_v = g(h_v, x_v)                          (2)

where x_v, x_co[v], h_ne[v], x_ne[v] are the features of v, the features of its edges, the states, and the features of the nodes in the neighborhood of v, respectively.

Let H, O, X, and X_N be the vectors constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively. Then we have a compact form:

    H = F(H, X)       (3)
    O = G(H, X_N)     (4)

where F, the global transition function, and G, the global output function, are stacked versions of f and g for all nodes in a graph, respectively. The value of H is the fixed point of Eq. 3 and is uniquely defined under the assumption that F is a contraction map.

Following Banach's fixed point theorem [28], GNN uses the following classic iterative scheme for computing the state:

    H^{t+1} = F(H^t, X)    (5)

where H^t denotes the t-th iteration of H. The dynamical system of Eq. 5 converges exponentially fast to the solution of Eq. 3 for any initial value H(0). Note that the computations described by f and g can be interpreted as feedforward neural networks.

Given the framework of GNN, the next question is how to learn the parameters of f and g. With the target information (t_v for a specific node) as supervision, the loss can be written as follows:

    loss = Σ_{i=1}^{p} (t_i − o_i)    (6)

where p is the number of supervised nodes. The learning algorithm is based on a gradient-descent strategy and is composed of the following steps.

• The states h^t_v are iteratively updated by Eq. 1 until a time T. They approach the fixed point solution of Eq. 3: H(T) ≈ H.
• The gradient of the weights W is computed from the loss.
• The weights W are updated according to the gradient computed in the last step.
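To make the iterative scheme of Eqs. 3-5 concrete, below is a minimal NumPy sketch of the fixed-point propagation. It is an illustration rather than the authors' implementation: the transition function F is assumed here to be a single dense layer over each node's features and summed neighbor states, and the tanh plus the 0.5 scaling are only illustrative choices to keep F (approximately) contractive.

```python
import numpy as np

def propagate_to_fixed_point(A, X, W, T=50, tol=1e-6):
    """Iterate H^{t+1} = F(H^t, X) (Eq. 5) until (approximate) convergence.

    A: (N, N) adjacency matrix, X: (N, d_x) node features,
    W: (d_x + s, s) weights of the illustrative transition function F.
    """
    N, s = A.shape[0], W.shape[1]
    H = np.zeros((N, s))                                # initial state H(0)
    for _ in range(T):
        M = A @ H                                       # sum of neighbor states per node
        H_new = 0.5 * np.tanh(np.hstack([X, M]) @ W)    # shared local transition f
        if np.linalg.norm(H_new - H) < tol:             # fixed point reached: H ≈ F(H, X)
            return H_new
        H = H_new
    return H

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)            # toy graph
X = rng.normal(size=(5, 3))
W = rng.normal(scale=0.1, size=(3 + 4, 4))
H = propagate_to_fixed_point(A, X, W)
print(H.shape)  # (5, 4)
```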

Limitations. Though experimental results showed that GNN is a powerful architecture for modeling structural data, there are still several limitations of the original GNN. Firstly, it is inefficient to update the hidden states of nodes iteratively to reach the fixed point. If we relax the fixed-point assumption, we can design a multi-layer GNN to get a stable representation of a node and its neighborhood. Secondly, GNN uses the same parameters in every iteration, while most popular neural networks use different parameters in different layers, which serve as a hierarchical feature extraction method. Moreover, the update of node hidden states is a sequential process which can benefit from RNN kernels like the GRU and LSTM. Thirdly, there are also informative features on the edges which cannot be effectively modeled in the original GNN. For example, the edges in a knowledge graph have relation types, and the message propagation through different edges should differ according to their types. Besides, how to learn the hidden states of edges is also an important problem. Lastly, it is unsuitable to use the fixed points if we focus on the representation of nodes instead of graphs, because the distribution of representations at the fixed point will be much smoother in value and less informative for distinguishing nodes.

2.2 Variants of Graph Neural Networks

In this subsection, we present several variants of graph neural networks. Sec 2.2.1 focuses on variants operating on different graph types. These variants extend the representation capability of the original model. Sec 2.2.2 lists several modifications (convolution, gate mechanism, attention mechanism and skip connection) of the propagation step, and these models could learn representations of higher quality. Sec 2.2.3 describes variants using advanced training methods, which improve the training efficiency. An overview of the different variants of graph neural networks can be found in Fig. 2.

2.2.1 Graph Types

In the original GNN [18], the input graph consists of nodes with label information and undirected edges, which is the simplest graph format. However, there are many variants of graphs in the world. In this subsection, we introduce some methods designed to model different kinds of graphs.

Directed Graphs. The first variant is the directed graph. An undirected edge, which can be treated as two directed edges, shows that there is a relation between two nodes. However, directed edges can bring more information than undirected edges. For example, in a knowledge graph where an edge starts from the head entity and ends at the tail entity, the head entity is the parent class of the tail entity, which suggests we should treat the information propagation process from parent classes and from child classes differently. ADGPM [29] uses two kinds of weight matrices, W_p and W_c, to incorporate more precise structural information. The propagation rule is as follows:

    H^t = σ(D_p^{-1} A_p σ(D_c^{-1} A_c H^{t-1} W_c) W_p)    (7)

where D_p^{-1} A_p and D_c^{-1} A_c are the normalized adjacency matrices for parents and children, respectively.

Heterogeneous Graphs. The second variant is the heterogeneous graph, where there are several kinds of nodes. The simplest way to process a heterogeneous graph is to convert the type of each node to a one-hot feature vector which is concatenated with the original features. Going further, GraphInception [30] introduces the concept of metapaths into propagation on the heterogeneous graph. With metapaths, we can group the neighbors according to their node types and distances. For each neighbor group, GraphInception treats it as a sub-graph in a homogeneous graph to do propagation and concatenates the propagation results from different homogeneous graphs to obtain a collective node representation.

Graphs with Edge Information. In the final variant, each edge also carries information such as the weight or the type of the edge. There are two ways to handle this kind of graph. Firstly, we can convert the graph to a bipartite graph where the original edges also become nodes and one original edge is split into two new edges, which means there are two new edges between the edge node and the begin/end nodes. The encoder of G2S [31] uses the following aggregation function for neighbors:

    h^t_v = ρ( (1/|N_v|) Σ_{u∈N_v} W_r (r^t_v ⊙ h^{t-1}_u) + b_r )    (8)

where W_r and b_r are the propagation parameters for different types of edges (relations). Secondly, we can adopt different weight matrices for propagation on different kinds of edges. When the number of relations is very large, r-GCN [32] introduces two kinds of regularization to reduce the number of parameters for modeling large numbers of relations: the basis decomposition and the block-diagonal decomposition. With the basis decomposition, each W_r is defined as follows:

    W_r = Σ_{b=1}^{B} a_{rb} V_b    (9)

i.e. as a linear combination of basis transformations V_b ∈ R^{d_in × d_out} with coefficients a_{rb}, such that only the coefficients depend on r. In the block-diagonal decomposition, r-GCN defines each W_r through the direct sum over a set of low-dimensional matrices, which needs more parameters than the first one.
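The following is a minimal sketch of relation-specific propagation using the basis decomposition of Eq. 9. It is not the r-GCN reference implementation; the mean normalization over relation-wise neighbors, the ReLU nonlinearity and the extra self-loop weight matrix W_self are simplifying assumptions made for the example.

```python
import numpy as np

def rgcn_layer(A_by_rel, H, V, a, W_self):
    """One simplified r-GCN-style layer with basis-decomposed weights (Eq. 9).

    A_by_rel: list of (N, N) adjacency matrices, one per relation r.
    H: (N, d_in) node states; V: (B, d_in, d_out) basis matrices V_b;
    a: (R, B) coefficients a_rb; W_self: (d_in, d_out) self-loop weights.
    """
    out = H @ W_self
    for r, A_r in enumerate(A_by_rel):
        W_r = np.tensordot(a[r], V, axes=1)           # W_r = sum_b a_rb V_b
        deg = np.maximum(A_r.sum(axis=1, keepdims=True), 1.0)
        out += (A_r / deg) @ H @ W_r                  # mean over relation-r neighbors
    return np.maximum(out, 0.0)                       # ReLU

rng = np.random.default_rng(1)
N, R, B, d_in, d_out = 6, 3, 2, 4, 5
A_by_rel = [(rng.random((N, N)) < 0.3).astype(float) for _ in range(R)]
H = rng.normal(size=(N, d_in))
V = rng.normal(size=(B, d_in, d_out))
a = rng.normal(size=(R, B))
print(rgcn_layer(A_by_rel, H, V, a, rng.normal(size=(d_in, d_out))).shape)  # (6, 5)
```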

2.2.2 Propagation Types

The propagation step and the output step are of vital importance in the model to obtain the hidden states of nodes (or edges). As we list below, there are several major modifications of the propagation step compared with the original graph neural network model, while researchers usually follow a simple feed-forward neural network setting in the output step. A comparison of different variants of GNN can be found in Table 2. The variants utilize different aggregators to gather information from each node's neighbors and specific updaters to update the nodes' hidden states.

Fig. 2. An overview of variants of graph neural networks: (a) graph types, (b) training methods, (c) propagation steps.

Convolution. There is an increasing interest in generalizing convolutions to the graph domain. Advances in this direction are often categorized as spectral approaches and non-spectral approaches.

Spectral approaches work with a spectral representation of graphs. [35] proposed the spectral network. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. The operation can be defined as the multiplication of a signal x ∈ R^N (a scalar for each node) with a filter g_θ = diag(θ) parameterized by θ ∈ R^N:

    g_θ ⋆ x = U g_θ(Λ) U^T x    (10)

where U is the matrix of eigenvectors of the normalized graph Laplacian L = I_N − D^{-1/2} A D^{-1/2} = U Λ U^T (D is the degree matrix and A is the adjacency matrix of the graph), with a diagonal matrix of its eigenvalues Λ.

This operation can result in intense computations and non-spatially localized filters. [36] attempts to make the spectral filters spatially localized by introducing a parameterization with smooth coefficients. [37] suggests that g_θ(Λ) can be approximated by a truncated expansion in terms of Chebyshev polynomials T_k(x) up to K-th order. Thus the operation is:

    g_θ ⋆ x ≈ Σ_{k=0}^{K} θ_k T_k(L̃) x    (11)

with L̃ = (2/λ_max) L − I_N, where λ_max denotes the largest eigenvalue of L. θ ∈ R^K is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined by T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x. It can be observed that the operation is K-localized since it is a K-th-order polynomial in the Laplacian. [38] uses this K-localized convolution to define a convolutional neural network which removes the need to compute the eigenvectors of the Laplacian.
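As an illustration of Eq. 11, the sketch below applies a K-localized Chebyshev filter to a graph signal using the standard recurrence T_k = 2 L̃ T_{k−1} − T_{k−2}. It is only a sketch: λ_max is computed exactly here for simplicity, whereas in practice it is often approximated, and the coefficient values are arbitrary.

```python
import numpy as np

def chebyshev_filter(A, x, theta):
    """Apply g_theta ⋆ x ≈ sum_k theta_k T_k(L~) x (Eq. 11) to a graph signal x."""
    N = A.shape[0]
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    L = np.eye(N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # normalized Laplacian
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = (2.0 / lam_max) * L - np.eye(N)        # rescale spectrum to [-1, 1]
    Tk_prev, Tk = x, L_tilde @ x                      # T_0(L~)x = x, T_1(L~)x = L~ x
    out = theta[0] * Tk_prev + theta[1] * Tk
    for k in range(2, len(theta)):                    # T_k = 2 L~ T_{k-1} - T_{k-2}
        Tk_prev, Tk = Tk, 2 * L_tilde @ Tk - Tk_prev
        out += theta[k] * Tk
    return out

rng = np.random.default_rng(2)
A = (rng.random((8, 8)) < 0.3).astype(float)
A = np.maximum(A, A.T)                                # undirected toy graph
x = rng.normal(size=8)
print(chebyshev_filter(A, x, theta=np.array([0.5, 0.3, 0.1, 0.05])).shape)  # (8,)
```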

TABLE 2: Different variants of graph neural networks, summarized by their aggregator and updater.

Spectral Methods
  ChebNet
    Aggregator: N_k = T_k(L̃) X
    Updater:    H = Σ_{k=0}^{K} N_k Θ_k
  1st-order model
    Aggregator: N_0 = X,  N_1 = D^{-1/2} A D^{-1/2} X
    Updater:    H = N_0 Θ_0 + N_1 Θ_1
  Single parameter
    Aggregator: N = (I_N + D^{-1/2} A D^{-1/2}) X
    Updater:    H = N Θ
  GCN
    Aggregator: N = D̃^{-1/2} Ã D̃^{-1/2} X
    Updater:    H = N Θ

Non-spectral Methods
  Convolutional networks in [33]
    Aggregator: h^t_{N_v} = h^{t-1}_v + Σ_{k=1}^{N_v} h^{t-1}_k
    Updater:    h^t_v = σ(h^t_{N_v} W^{N_v}_L)
  DCNN
    Aggregator: node classification: N = P* X;  graph classification: N = 1^T_N P* X / N
    Updater:    H = f(W^c ⊙ N)
  GraphSAGE
    Aggregator: h^t_{N_v} = AGGREGATE_t({h^{t-1}_u, ∀u ∈ N_v})
    Updater:    h^t_v = σ(W^t · [h^{t-1}_v ‖ h^t_{N_v}])

Graph Attention Networks
  GAT
    Aggregator: α_{vk} = exp(LeakyReLU(a^T [W h_v ‖ W h_k])) / Σ_{j∈N_v} exp(LeakyReLU(a^T [W h_v ‖ W h_j]))
                h^t_{N_v} = σ(Σ_{k∈N_v} α_{vk} W h_k)
                multi-head concatenation: h^t_{N_v} = ‖_{m=1}^{M} σ(Σ_{k∈N_v} α^m_{vk} W^m h_k)
                multi-head average:       h^t_{N_v} = σ((1/M) Σ_{m=1}^{M} Σ_{k∈N_v} α^m_{vk} W^m h_k)
    Updater:    h^t_v = h^t_{N_v}

Gated Graph Neural Networks
  GGNN
    Aggregator: h^t_{N_v} = Σ_{k∈N_v} h^{t-1}_k + b
    Updater:    z^t_v = σ(W^z h^t_{N_v} + U^z h^{t-1}_v)
                r^t_v = σ(W^r h^t_{N_v} + U^r h^{t-1}_v)
                h̃^t_v = tanh(W h^t_{N_v} + U(r^t_v ⊙ h^{t-1}_v))
                h^t_v = (1 − z^t_v) ⊙ h^{t-1}_v + z^t_v ⊙ h̃^t_v

Graph LSTM
  Tree LSTM (Child sum)
    Aggregator: h^t_{N_v} = Σ_{k∈N_v} h^{t-1}_k
    Updater:    i^t_v = σ(W^i x^t_v + U^i h^t_{N_v} + b^i)
                f^t_{vk} = σ(W^f x^t_v + U^f h^{t-1}_k + b^f)
                o^t_v = σ(W^o x^t_v + U^o h^t_{N_v} + b^o)
                u^t_v = tanh(W^u x^t_v + U^u h^t_{N_v} + b^u)
                c^t_v = i^t_v ⊙ u^t_v + Σ_{k∈N_v} f^t_{vk} ⊙ c^{t-1}_k
                h^t_v = o^t_v ⊙ tanh(c^t_v)
  Tree LSTM (N-ary)
    Aggregator: h^{t,i}_{N_v} = Σ_{l=1}^{K} U^i_l h^{t-1}_{vl},  h^{t,f}_{N_v k} = Σ_{l=1}^{K} U^f_{kl} h^{t-1}_{vl},
                h^{t,o}_{N_v} = Σ_{l=1}^{K} U^o_l h^{t-1}_{vl},  h^{t,u}_{N_v} = Σ_{l=1}^{K} U^u_l h^{t-1}_{vl}
    Updater:    i^t_v = σ(W^i x^t_v + h^{t,i}_{N_v} + b^i)
                f^t_{vk} = σ(W^f x^t_v + h^{t,f}_{N_v k} + b^f)
                o^t_v = σ(W^o x^t_v + h^{t,o}_{N_v} + b^o)
                u^t_v = tanh(W^u x^t_v + h^{t,u}_{N_v} + b^u)
                c^t_v = i^t_v ⊙ u^t_v + Σ_{l=1}^{K} f^t_{vl} ⊙ c^{t-1}_{vl}
                h^t_v = o^t_v ⊙ tanh(c^t_v)
  Graph LSTM in [34]
    Aggregator: h^{t,i}_{N_v} = Σ_{k∈N_v} U^i_{m(v,k)} h^{t-1}_k,  h^{t,o}_{N_v} = Σ_{k∈N_v} U^o_{m(v,k)} h^{t-1}_k,
                h^{t,u}_{N_v} = Σ_{k∈N_v} U^u_{m(v,k)} h^{t-1}_k
    Updater:    i^t_v = σ(W^i x^t_v + h^{t,i}_{N_v} + b^i)
                f^t_{vk} = σ(W^f x^t_v + U^f_{m(v,k)} h^{t-1}_k + b^f)
                o^t_v = σ(W^o x^t_v + h^{t,o}_{N_v} + b^o)
                u^t_v = tanh(W^u x^t_v + h^{t,u}_{N_v} + b^u)
                c^t_v = i^t_v ⊙ u^t_v + Σ_{k∈N_v} f^t_{vk} ⊙ c^{t-1}_k
                h^t_v = o^t_v ⊙ tanh(c^t_v)
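As a concrete instance of the spectral rows in Table 2, here is a minimal sketch of a single GCN layer using the renormalized propagation rule (the GCN row, derived as Eq. 14 in the text below). The ReLU nonlinearity and the random toy inputs are illustrative assumptions, not part of the rule itself.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN layer: Z = D~^{-1/2} A~ D~^{-1/2} X Theta, followed by ReLU."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # add self-loops: A~ = A + I_N
    d = A_tilde.sum(axis=1)
    d_inv_sqrt = d ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X @ Theta, 0.0)

rng = np.random.default_rng(3)
A = (rng.random((10, 10)) < 0.2).astype(float)
A = np.maximum(A, A.T)                            # undirected toy graph
X = rng.normal(size=(10, 16))                     # N x C input signal
Theta = rng.normal(size=(16, 8)) * 0.1            # C x F filter parameters
print(gcn_layer(A, X, Theta).shape)               # (10, 8)
```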

[2] limits the layer-wise convolution operation to K = 1 to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. It further approximates λ_max ≈ 2, and the equation simplifies to:

    g_{θ'} ⋆ x ≈ θ'_0 x + θ'_1 (L − I_N) x = θ'_0 x − θ'_1 D^{-1/2} A D^{-1/2} x    (12)

with two free parameters θ'_0 and θ'_1. After constraining the number of parameters with θ = θ'_0 = −θ'_1, we obtain the following expression:

    g_θ ⋆ x ≈ θ (I_N + D^{-1/2} A D^{-1/2}) x    (13)

Since stacking this operator could lead to numerical instabilities and exploding/vanishing gradients, [2] introduces the renormalization trick: I_N + D^{-1/2} A D^{-1/2} → D̃^{-1/2} Ã D̃^{-1/2}, with Ã = A + I_N and D̃_ii = Σ_j Ã_ij. Finally, [2] generalizes the definition to a signal X ∈ R^{N×C} with C input channels and F filters for feature maps as follows:

    Z = D̃^{-1/2} Ã D̃^{-1/2} X Θ    (14)

where Θ ∈ R^{C×F} is a matrix of filter parameters and Z ∈ R^{N×F} is the convolved signal matrix.

[39] presents a Gaussian process-based Bayesian approach to solve the semi-supervised learning problem on graphs. It shows parallels between the model and the spectral filtering approaches, which could give us some insights from another perspective.

However, in all of the spectral approaches mentioned above, the learned filters depend on the Laplacian eigenbasis, which depends on the graph structure; that is, a model trained on a specific structure cannot be directly applied to a graph with a different structure.

Non-spectral approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge of non-spectral approaches is defining the convolution operation for differently sized neighborhoods while maintaining the local invariance of CNNs.

[33] uses different weight matrices for nodes with different degrees,

    x = h_v + Σ_{i=1}^{N_v} h_i
    h'_v = σ(x W^{N_v}_L)    (15)

where W^{N_v}_L is the weight matrix for nodes with degree N_v at layer L. The main drawback of the method is that it cannot be applied to large-scale graphs with larger node degrees.

[21] proposed the diffusion-convolutional neural networks (DCNNs). Transition matrices are used to define the neighborhood for nodes in a DCNN. For node classification, it has

    H = f(W^c ⊙ P* X)    (16)

where X is an N × F tensor of input features (N is the number of nodes and F is the number of features). P* is an N × K × N tensor which contains the power series {P, P^2, ..., P^K} of matrix P, and P is the degree-normalized transition matrix derived from the graph's adjacency matrix A. Each entity is transformed to a diffusion-convolutional representation, which is a K × F matrix defined by K hops of graph diffusion over F features; it is then combined with a K × F weight matrix and a non-linear activation function f. Finally H (which is N × K × F) denotes the diffusion representations of each node in the graph.

As for graph classification, DCNN simply takes the average of the nodes' representations,

    H = f(W^c ⊙ 1^T_N P* X / N)    (17)

where 1_N is an N × 1 vector of ones. DCNN can also be applied to edge classification tasks, which requires converting edges to nodes and augmenting the adjacency matrix.

[40] extracts and normalizes a neighborhood of exactly k nodes for each node. The normalized neighborhood then serves as the receptive field for the convolutional operation.

[20] proposed a spatial-domain model (MoNet) on non-Euclidean domains which could generalize several previous techniques. The Geodesic CNN (GCNN) [22] and Anisotropic CNN (ACNN) [23] on manifolds, or GCN [2] and DCNN [21] on graphs, could be formulated as particular instances of MoNet.

[1] proposed GraphSAGE, a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood.

    h^t_{N_v} = AGGREGATE_t({h^{t-1}_u, ∀u ∈ N_v})
    h^t_v = σ(W^t · [h^{t-1}_v ‖ h^t_{N_v}])    (18)

However, [1] does not utilize the full set of neighbors in Eq. 18 but a fixed-size set of neighbors obtained by uniform sampling. And [1] suggests three aggregator functions.

• Mean aggregator. It could be viewed as an approximation of the convolutional operation from the transductive GCN framework [2], so that the inductive version of the GCN variant could be derived by

    h^t_v = σ(W · MEAN({h^{t-1}_v} ∪ {h^{t-1}_u, ∀u ∈ N_v}))    (19)

The mean aggregator is different from the other aggregators because it does not perform the concatenation operation which concatenates h^{t-1}_v and h^t_{N_v} in Eq. 18. It could be viewed as a form of "skip connection" [41] and could achieve better performance.
• LSTM aggregator. [1] also uses an LSTM-based aggregator which has a larger expressive capability. However, LSTMs process inputs in a sequential manner, so they are not permutation invariant. [1] adapts LSTMs to operate on an unordered set by permutating the node's neighbors.
• Pooling aggregator. In the pooling aggregator, each neighbor's hidden state is fed through a fully-connected layer and then a max-pooling operation is applied to the set of the node's neighbors:

    h^t_{N_v} = max({σ(W_pool h^{t-1}_u + b), ∀u ∈ N_v})    (20)

Note that any symmetric function could be used in place of the max-pooling operation here.
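To make the GraphSAGE update of Eq. 18 concrete, here is a minimal sketch combining uniform neighbor sampling, a mean aggregator and the concatenation update. It is only an illustration: using ReLU for σ, the l2 normalization of the output and the dictionary-based neighbor list are assumptions made for this toy example.

```python
import numpy as np

def graphsage_mean_layer(neighbors, H, W, num_samples=5, seed=0):
    """One GraphSAGE-style layer with a mean aggregator (Eqs. 18-19).

    neighbors: dict {node: list of neighbor ids}; H: (N, d) current states;
    W: (2 * d, d_out) weights applied to [h_v || h_{N_v}].
    """
    rng = np.random.default_rng(seed)
    H_next = np.zeros((H.shape[0], W.shape[1]))
    for v, nbrs in neighbors.items():
        if len(nbrs) > num_samples:                          # fixed-size sampled neighborhood
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        h_nv = H[list(nbrs)].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        z = np.maximum(np.concatenate([H[v], h_nv]) @ W, 0.0)  # sigma([h_v || AGGREGATE])
        H_next[v] = z / max(np.linalg.norm(z), 1e-12)          # l2-normalize the output
    return H_next

rng = np.random.default_rng(4)
H = rng.normal(size=(6, 8))
neighbors = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1, 5], 5: [4]}
W = rng.normal(size=(16, 4)) * 0.1
print(graphsage_mean_layer(neighbors, H, W).shape)  # (6, 4)
```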
Recently, the structure-aware convolution and Structure-Aware Convolutional Neural Networks (SACNNs) have been proposed [42]. Univariate functions are used as filters, and they can deal with both Euclidean and non-Euclidean structured data.

Gate. Several works attempt to use gate mechanisms like the GRU [43] or LSTM [44] in the propagation step to diminish the restrictions of the former GNN models and improve the long-term propagation of information across the graph structure.

[45] proposed the gated graph neural network (GGNN), which uses Gated Recurrent Units (GRU) in the propagation step, unrolls the recurrence for a fixed number of steps T and uses backpropagation through time in order to compute gradients.

Specifically, the basic recurrence of the propagation model is

    a^t_v = A_v^T [h^{t-1}_1 ... h^{t-1}_N]^T + b
    z^t_v = σ(W^z a^t_v + U^z h^{t-1}_v)
    r^t_v = σ(W^r a^t_v + U^r h^{t-1}_v)
    h̃^t_v = tanh(W a^t_v + U(r^t_v ⊙ h^{t-1}_v))
    h^t_v = (1 − z^t_v) ⊙ h^{t-1}_v + z^t_v ⊙ h̃^t_v    (21)

The node v first aggregates messages from its neighbors, where A_v is the sub-matrix of the graph adjacency matrix A and denotes the connection of node v with its neighbors. The GRU-like update functions incorporate information from the other nodes and from the previous timestep to update each node's hidden state. a gathers the neighborhood information of node v, and z and r are the update and reset gates.

LSTMs are also used in a similar way to the GRU for the propagation process based on a tree or a graph.

[46] proposed two extensions of the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. As in standard LSTM units, each Tree-LSTM unit (indexed by v) contains input and output gates i_v and o_v, a memory cell c_v and a hidden state h_v. Instead of a single forget gate, the Tree-LSTM unit contains one forget gate f_vk for each child k, allowing the unit to selectively incorporate information from each child. The Child-Sum Tree-LSTM transition equations are the following:

    h̃^{t-1}_v = Σ_{k∈N_v} h^{t-1}_k
    i^t_v = σ(W^i x^t_v + U^i h̃^{t-1}_v + b^i)
    f^t_{vk} = σ(W^f x^t_v + U^f h^{t-1}_k + b^f)
    o^t_v = σ(W^o x^t_v + U^o h̃^{t-1}_v + b^o)
    u^t_v = tanh(W^u x^t_v + U^u h̃^{t-1}_v + b^u)
    c^t_v = i^t_v ⊙ u^t_v + Σ_{k∈N_v} f^t_{vk} ⊙ c^{t-1}_k
    h^t_v = o^t_v ⊙ tanh(c^t_v)    (22)

x^t_v is the input vector at time t in the standard LSTM setting.

If the branching factor of a tree is at most K and all children of a node are ordered, i.e., they can be indexed from 1 to K, then the N-ary Tree-LSTM can be used. For node v, h^t_{vk} and c^t_{vk} denote the hidden state and memory cell of its k-th child at time t, respectively. The transition equations are the following:

    i^t_v = σ(W^i x^t_v + Σ_{l=1}^{K} U^i_l h^{t-1}_{vl} + b^i)
    f^t_{vk} = σ(W^f x^t_v + Σ_{l=1}^{K} U^f_{kl} h^{t-1}_{vl} + b^f)
    o^t_v = σ(W^o x^t_v + Σ_{l=1}^{K} U^o_l h^{t-1}_{vl} + b^o)
    u^t_v = tanh(W^u x^t_v + Σ_{l=1}^{K} U^u_l h^{t-1}_{vl} + b^u)
    c^t_v = i^t_v ⊙ u^t_v + Σ_{l=1}^{K} f^t_{vl} ⊙ c^{t-1}_{vl}
    h^t_v = o^t_v ⊙ tanh(c^t_v)    (23)

The introduction of separate parameter matrices for each child k allows the model to learn more fine-grained representations conditioned on the states of a unit's children than the Child-Sum Tree-LSTM.

The two types of Tree-LSTMs can be easily adapted to graphs. The graph-structured LSTM in [47] is an example of the N-ary Tree-LSTM applied to a graph. However, it is a simplified version, since each node in the graph has at most 2 incoming edges (from its parent and sibling predecessor). [34] proposed another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have labels, and [34] utilizes different weight matrices to represent different labels:

    i^t_v = σ(W^i x^t_v + Σ_{k∈N_v} U^i_{m(v,k)} h^{t-1}_k + b^i)
    f^t_{vk} = σ(W^f x^t_v + U^f_{m(v,k)} h^{t-1}_k + b^f)
    o^t_v = σ(W^o x^t_v + Σ_{k∈N_v} U^o_{m(v,k)} h^{t-1}_k + b^o)
    u^t_v = tanh(W^u x^t_v + Σ_{k∈N_v} U^u_{m(v,k)} h^{t-1}_k + b^u)
    c^t_v = i^t_v ⊙ u^t_v + Σ_{k∈N_v} f^t_{vk} ⊙ c^{t-1}_k
    h^t_v = o^t_v ⊙ tanh(c^t_v)    (24)

where m(v,k) denotes the edge label between node v and k.

[48] proposed the Sentence LSTM (S-LSTM) for improving text encoding. It converts text into a graph and utilizes the Graph LSTM to learn the representation. The S-LSTM shows strong representation power in many NLP problems.
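To illustrate the gated propagation of this paragraph group, the following is a minimal NumPy sketch of one GGNN-style step (Eq. 21), applied to all nodes at once. It is not the reference GGNN code: a single shared adjacency matrix replaces the per-edge-type propagation matrices of A_v, and the small random weights exist only to make the toy example runnable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(A, H, params):
    """One GRU-style propagation step of GGNN (Eq. 21) for all nodes at once."""
    Wz, Uz, Wr, Ur, W, U, b = params
    Agg = A @ H + b                                  # a_v: aggregate neighbor states
    Z = sigmoid(Agg @ Wz + H @ Uz)                   # update gate
    R = sigmoid(Agg @ Wr + H @ Ur)                   # reset gate
    H_cand = np.tanh(Agg @ W + (R * H) @ U)          # candidate states
    return (1.0 - Z) * H + Z * H_cand                # gated interpolation

rng = np.random.default_rng(5)
N, d = 7, 6
A = (rng.random((N, N)) < 0.3).astype(float)
H = rng.normal(size=(N, d))
params = tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(6)) + (np.zeros(d),)
for _ in range(4):                                   # unroll the recurrence for T steps
    H = ggnn_step(A, H, params)
print(H.shape)  # (7, 6)
```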

[49] proposed a Graph LSTM network to address the semantic object parsing task. It uses a confidence-driven scheme to adaptively select the starting node and determine the node updating sequence. It follows the same idea of generalizing existing LSTMs to graph-structured data, but it has a specific updating sequence, while the methods mentioned above are agnostic to the order of nodes.

Attention. The attention mechanism has been successfully used in many sequence-based tasks such as machine translation [50]-[52], machine reading [53] and so on. [54] proposed the graph attention network (GAT), which incorporates the attention mechanism into the propagation step. It computes the hidden states of each node by attending over its neighbors, following a self-attention strategy.

[54] defines a single graph attentional layer and constructs arbitrary graph attention networks by stacking this layer. The layer computes the attention coefficients of the node pair (i, j) by:

    α_ij = exp(LeakyReLU(a^T [W h_i ‖ W h_j])) / Σ_{k∈N_i} exp(LeakyReLU(a^T [W h_i ‖ W h_k]))    (25)

where α_ij is the attention coefficient of node j to i, and N_i represents the neighborhood of node i in the graph. The input set of node features to the layer is h = {h_1, h_2, ..., h_N}, h_i ∈ R^F, where N is the number of nodes and F is the number of features of each node; the layer produces a new set of node features (of potentially different cardinality F'), h' = {h'_1, h'_2, ..., h'_N}, h'_i ∈ R^{F'}, as its output. W ∈ R^{F'×F} is the weight matrix of a shared linear transformation applied to every node, and a ∈ R^{2F'} is the weight vector of a single-layer feedforward neural network. The coefficients are normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope α = 0.2) is applied.

Then the final output features of each node can be obtained by (after applying a nonlinearity σ):

    h'_i = σ(Σ_{j∈N_i} α_ij W h_j)    (26)

Moreover, the layer utilizes multi-head attention, similarly to [52], to stabilize the learning process. It applies K independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:

    h'_i = ‖_{k=1}^{K} σ(Σ_{j∈N_i} α^k_ij W^k h_j)    (27)
    h'_i = σ((1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α^k_ij W^k h_j)    (28)

where α^k_ij is the normalized attention coefficient computed by the k-th attention mechanism.

The attention architecture in [54] has several properties: (1) the computation of the node-neighbor pairs is parallelizable, thus the operation is efficient; (2) it can be applied to graph nodes with different degrees by assigning arbitrary weights to neighbors; (3) it can be applied to inductive learning problems easily.

Skip connection. Many applications unroll or stack graph neural network layers aiming to achieve better results, as more layers (i.e., k layers) make each node aggregate more information from neighbors k hops away. However, it has been observed in many experiments that deeper models do not improve the performance and may even perform worse [2]. This is mainly because more layers also propagate noisy information from an exponentially increasing number of expanded neighborhood members.

A straightforward method to address the problem, the residual network [55], comes from the computer vision community. But even with residual connections, GCNs with more layers do not perform as well as the 2-layer GCN on many datasets [2].

[56] proposed a Highway GCN which uses layer-wise gates similar to highway networks [57]. The output of a layer is summed with its input with gating weights:

    T(h^t) = σ(W^t h^t + b^t)
    h^{t+1} = h^{t+1} ⊙ T(h^t) + h^t ⊙ (1 − T(h^t))    (29)

By adding the highway gates, the performance peaks at 4 layers in a specific problem discussed in [56].

[58] studies the properties and resulting limitations of neighborhood aggregation schemes. It proposed the Jump Knowledge Network, which could learn adaptive, structure-aware representations. The Jump Knowledge Network selects from all of the intermediate representations (which "jump" to the last layer) for each node at the last layer, which makes the model adapt the effective neighborhood size for each node as needed. [58] uses three approaches, concatenation, max-pooling and LSTM-attention, in the experiments to aggregate the information. The Jump Knowledge Network performs well in experiments on social, bioinformatics and citation networks. It could also be combined with models like Graph Convolutional Networks, GraphSAGE and Graph Attention Networks to improve their performance.
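Before moving on to training methods, here is a minimal single-head sketch of the attention-based propagation of Eqs. 25-26. It is an illustration rather than the reference GAT implementation; including each node in its own neighborhood and using ReLU for the output nonlinearity σ are assumptions made for the example.

```python
import numpy as np

def gat_layer(A, H, W, a, alpha=0.2):
    """Single-head graph attention layer following Eqs. 25-26.

    A: (N, N) adjacency (nonzero = edge); H: (N, F) inputs;
    W: (F, F') shared linear map; a: (2F',) attention vector.
    """
    Wh = H @ W                                               # (N, F')
    N, Fp = Wh.shape
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j)
    e = (Wh @ a[:Fp])[:, None] + (Wh @ a[Fp:])[None, :]
    e = np.where(e > 0, e, alpha * e)                        # LeakyReLU
    mask = (A + np.eye(N)) > 0                               # attend over neighbors (+ self)
    e = np.where(mask, e, -1e9)
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)               # softmax over j (Eq. 25)
    return np.maximum(att @ Wh, 0.0)                         # Eq. 26 with sigma = ReLU

rng = np.random.default_rng(6)
A = (rng.random((5, 5)) < 0.4).astype(float)
H = rng.normal(size=(5, 8))
print(gat_layer(A, H, rng.normal(size=(8, 4)) * 0.1, rng.normal(size=(8,))).shape)  # (5, 4)
```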

2.2.3 Training Methods

The original graph convolutional neural network has several drawbacks in its training and optimization methods. Specifically, GCN requires the full graph Laplacian, which is computationally expensive for large graphs. Furthermore, the embedding of a node at layer L is computed recursively from the embeddings of all its neighbors at layer L − 1. Therefore, the receptive field of a single node grows exponentially with the number of layers, so computing the gradient for a single node is costly. Finally, GCN is trained independently for a fixed graph, which lacks the ability for inductive learning.

GraphSAGE [1] is a comprehensive improvement of the original GCN. To solve the problems mentioned above, GraphSAGE replaced the full graph Laplacian with learnable aggregation functions, which are key to performing message passing and generalizing to unseen nodes. As shown in Eq. 18, it first aggregates the neighborhood embeddings, concatenates them with the target node's embedding, and then propagates to the next layer. With learned aggregation and propagation functions, GraphSAGE can generate embeddings for unseen nodes. Also, GraphSAGE uses neighbor sampling to alleviate the receptive field expansion.

FastGCN [59] further improves the sampling algorithm. Instead of sampling neighbors for each node, FastGCN directly samples the receptive field for each layer. FastGCN uses importance sampling, in which the importance factor is calculated as follows:

    q(v) ∝ (1/|N_v|) Σ_{u∈N_v} 1/|N_u|    (30)

In contrast to the fixed sampling methods above, [60] introduces a parameterized and trainable sampler to perform layer-wise sampling conditioned on the former layer. Furthermore, this adaptive sampler could find the optimal sampling importance and reduce variance simultaneously.

[61] proposed a control-variate based stochastic approximation algorithm for GCN by utilizing the historical activations of nodes as a control variate. This method limits the receptive field to the 1-hop neighborhood but uses the historical hidden states as an affordable approximation.

[62] focused on the limitations of GCN, which include the fact that GCN requires much additional labeled data for validation and also suffers from the localized nature of the convolutional filter. To address these limitations, the authors proposed Co-Training GCN and Self-Training GCN to enlarge the training dataset. The former method finds the nearest neighbors of the training data, while the latter follows a boosting-like approach.

2.3 General Frameworks

Apart from the different variants of graph neural networks, several general frameworks have been proposed aiming to integrate different models into one single framework. [25] proposed the message passing neural network (MPNN), which unified various graph neural network and graph convolutional network approaches. [26] proposed the non-local neural network (NLNN), which unifies several "self-attention"-style methods [52], [54], [63]. [27] proposed the graph network (GN), which unified the MPNN and NLNN methods as well as many other variants like Interaction Networks [4], [64], the Neural Physics Engine [65], CommNet [66], structure2vec [7], [67], GGNN [45], Relation Networks [68], [69], Deep Sets [70] and PointNet [71].

2.3.1 Message Passing Neural Networks

[25] proposed a general framework for supervised learning on graphs called Message Passing Neural Networks (MPNNs). The MPNN framework abstracts the commonalities between several of the most popular models for graph-structured data, such as spectral approaches [35], [2], [38] and non-spectral approaches [33] in graph convolution, gated graph neural networks [45], interaction networks [4], molecular graph convolutions [72], deep tensor neural networks [73] and so on.

The model contains two phases, a message passing phase and a readout phase. The message passing phase (namely, the propagation step) runs for T time steps and is defined in terms of the message function M_t and the vertex update function U_t. Using messages m^t_v, the updating functions for the hidden states h^t_v are as follows:

    m^{t+1}_v = Σ_{w∈N_v} M_t(h^t_v, h^t_w, e_vw)
    h^{t+1}_v = U_t(h^t_v, m^{t+1}_v)    (31)

where e_vw represents the features of the edge from node v to w. The readout phase computes a feature vector for the whole graph using the readout function R according to

    ŷ = R({h^T_v | v ∈ G})    (32)

where T denotes the total number of time steps. The message function M_t, vertex update function U_t and readout function R can have different settings. Hence the MPNN framework could generalize several different models via different function settings. Here we give an example of generalizing GGNN; other models' function settings can be found in [25]. The function settings for GGNNs are:

    M_t(h^t_v, h^t_w, e_vw) = A_{e_vw} h^t_w
    U_t = GRU(h^t_v, m^{t+1}_v)
    R = Σ_{v∈V} σ(i(h^T_v, h^0_v)) ⊙ (j(h^T_v))    (33)

where A_{e_vw} is the adjacency matrix, one for each edge label e. The GRU is the Gated Recurrent Unit introduced in [43]. i and j are neural networks in the function R.
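A minimal sketch of the two MPNN phases (Eqs. 31-32) is given below. The specific message, update and readout functions used in the example are arbitrary illustrative choices, not the settings of any particular MPNN instance from the literature.

```python
import numpy as np

def mpnn_forward(edges, X, T, message_fn, update_fn, readout_fn):
    """Generic MPNN phases (Eqs. 31-32): T message passing steps, then a readout.

    edges: list of (v, w, e_vw) tuples; X: (N, d) initial node states.
    message_fn, update_fn, readout_fn play the roles of M_t, U_t and R.
    """
    H = X.copy()
    for _ in range(T):
        M = np.zeros_like(H)
        for v, w, e_vw in edges:                      # m_v^{t+1} = sum_w M_t(h_v, h_w, e_vw)
            M[v] += message_fn(H[v], H[w], e_vw)
        H = update_fn(H, M)                           # h_v^{t+1} = U_t(h_v, m_v^{t+1})
    return readout_fn(H)                              # y = R({h_v^T})

# Illustrative function choices (assumptions for this sketch only).
message_fn = lambda hv, hw, e: e * hw                 # scale the sender state by edge weight
update_fn = lambda H, M: np.tanh(H + M)
readout_fn = lambda H: H.sum(axis=0)                  # permutation-invariant graph vector

rng = np.random.default_rng(7)
X = rng.normal(size=(4, 5))
edges = [(0, 1, 0.5), (1, 0, 0.5), (1, 2, 1.0), (2, 1, 1.0), (3, 2, 0.3), (2, 3, 0.3)]
print(mpnn_forward(edges, X, T=3, message_fn=message_fn,
                   update_fn=update_fn, readout_fn=readout_fn).shape)  # (5,)
```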

There are several instantiations with different f and g set of nodes (of cardinality N v ), where each hi is a node’s
settings. For simplicity, [26] uses the linear transformation attribute. E = {(ek , rk , sk )}k=1:N e is the set of edges (of
as the function g . That means g(hj ) = Wg hj , where Wg is a cardinality N e ), where each ek is the edge’s attribute, rk is
learned weight matrix. Next we list the choices for function the index of the receiver node and sk is the index of the
f in the following. sender node.
Gaussian. The Gaussian function is a natural choice GN block. A GN block contains three “update” func-
according to the non-local mean [74] and bilateral filters [75]. tions, φ, and three “aggregation” functions, ρ,
Thus: T e0k = φe (ek , hrk , hsk , u) ē0i = ρe→h (Ei0 )
f (hi , hj ) = ehi hj (35)
h0i = φh (ē0i , hi , u) ē0 = ρe→u (E 0 ) (40)
Here hTi hj is dot-product similarity and C(h) =
u0 = φu ē0 , h̄0 , u 0 h→u 0

P h̄ = ρ (H )
∀j f (hi , hj ).
Embedded Gaussian. It is straightforward to extend the where Ei0 =S {(e0k , rk , sk )}rk =i, k=1:N e , H 0 =
{h0i }i=1:N v ,
0 0 0
Gaussian function by computing similarity in the embed- and E = i Ei = {(ek , rk , sk )}k=1:N e . The
ρ functions
ding space, which means: must be invariant to permutations of their inputs and
T
should take variable numbers of arguments.
f (hi , hj ) = eθ(hi ) φ(hj )
(36)
Computation steps. The computation steps of a GN
where θ(hi ) = Wθ hi , φ(hj ) = Wφ hj and C(h) = block are as follows:
P
∀j f (hi , hj ).
1) φe is applied per edge, with arguments
It could be found that the self-attention proposed
(ek , hrk , hsk , u), and returns e0k . The set of
in [52] is a special case of the Embedded Gaussian
1 resulting per-edge outputs for each node
version. For a given i, C(h) f (hi , hj ) becomes the soft- 0
i is, E Si 0
= {(e0k , rk , sk )}rk =i, k=1:N e . And
max computation along the dimension j . So that h0 = 0 0
E = i Ei = {(ek , rk , sk )}k=1:N e is the set
softmax(hT WθT Wφ h)g(h), which matches the form of self-
of all per-edge outputs.
attention in [52].
2) ρe→h is applied to Ei0 , and aggregates the edge
Dot product. The function f can also be implemented as updates for edges that project to vertex i, into ē0i ,
a dot-product similarity: which will be used in the next step’s node update.
3) φh is applied to each node i, to compute an updated
f (hi , hj ) = θ(hi )T φ(hj ). (37)
node attribute, h0i . The set of resulting per-node
Here the factor C(h) = N , where N is the number of outputs is, H 0 = {h0i }i=1:N v .
positions in h. 4) ρe→u is applied to E 0 , and aggregates all edge
updates, into ē0 , which will then be used in the next
Concatenation. Here we have:
step’s global update.
f (hi , hj ) = ReLU(wfT [θ(hi )kφ(hj )]). (38) 5) ρh→u is applied to H 0 , and aggregates all node
updates, into h̄0 , which will then be used in the next
where wf is a weight vector projecting the vector to a scalar step’s global update.
and C(h) = N . 6) φu is applied once per graph and computes an
[26] wraps the non-local operation mentioned above update for the global attribute, u0 .
into a non-local block as:
Note here the order is not strictly enforced. For example,
zi = Wz h0i + hi , (39) it is possible to proceed from global, to per-node, to per-
edge updates. And the φ and ρ functions need not be neural
where h0i is given in Eq.34 and “+hi ” denotes the residual networks though in this paper we only focus on neural
connection [55]. Hence the non-local block could be insert network implementations.
into any pre-trained model, which makes the block more
applicable. Design Principles. The design of Graph Network based
on three basic principles: flexible representations, config-
urable within-block structure and composable multi-block
2.3.3 Graph Networks architectures.
[27] proposed the Graph Network (GN) framework which • Flexible representations. The GN framework sup-
generalizes and extends various graph neural network, ports flexible representations of the attributes as well
MPNN and NLNN approaches [18], [25], [26]. We first as different graph structures. The global, node and
introduce the graph definition in [27] and then we describe edge attributes can use arbitrary representational
the GN block, a core GN computation unit, and its compu- formats but real-valued vectors and tensors are most
tational steps, and finally we will introduce its basic design common. One can simply tailor the output of a
principles. GN block according to specific demands of tasks.
Graph definition. In [27], a graph is defined as a 3-tuple For example, [27] lists several edge-focused [76], [77],
G = (u, H, E) (here we use H instead of V for notation’s node-focused [3], [4], [65], [78] and graph-focused [4],
consistency). u is a global attribute, H = {hi }i=1:N v is the [25], [69] GNs. In terms of graph structures, the
12
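The following is a compact sketch of one GN block following the computation steps above. It is only an illustration under simplifying assumptions: the ρ aggregations are plain sums, the φ updates are arbitrary elementwise functions, and the edge, node and global attributes are assumed to share the same dimensionality.

```python
import numpy as np

def gn_block(u, H, E, phi_e, phi_h, phi_u):
    """One GN block (Eq. 40) with sum aggregations for all rho functions.

    u: global attribute; H: (N, d) node attributes; E: list of (e_k, r_k, s_k).
    phi_e, phi_h, phi_u are the three update functions (arbitrary callables here).
    """
    E_new = [(phi_e(e, H[r], H[s], u), r, s) for e, r, s in E]          # step 1: per-edge update
    H_new = np.zeros_like(H)
    for i in range(H.shape[0]):
        e_bar_i = sum((e for e, r, _ in E_new if r == i),
                      np.zeros(E_new[0][0].shape))                      # step 2: rho^{e->h}
        H_new[i] = phi_h(e_bar_i, H[i], u)                              # step 3: per-node update
    e_bar = sum((e for e, _, _ in E_new), np.zeros(E_new[0][0].shape))  # step 4: rho^{e->u}
    h_bar = H_new.sum(axis=0)                                           # step 5: rho^{h->u}
    return phi_u(e_bar, h_bar, u), H_new, E_new                         # step 6: global update

# Illustrative update functions (assumptions for this sketch only).
d = 4
phi_e = lambda e, hr, hs, u: np.tanh(e + hr + hs + u)
phi_h = lambda eb, h, u: np.tanh(eb + h + u)
phi_u = lambda eb, hb, u: np.tanh(eb + hb + u)

rng = np.random.default_rng(8)
u = rng.normal(size=d); H = rng.normal(size=(3, d))
E = [(rng.normal(size=d), 0, 1), (rng.normal(size=d), 2, 0)]
u2, H2, E2 = gn_block(u, H, E, phi_e, phi_h, phi_u)
print(u2.shape, H2.shape, len(E2))  # (4,) (3, 4) 2
```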

framework can be applied to both structural scenar- 3.1.1 Physics


ios where the graph structure is explicit and non-
structural scenarios where the relational structure Modeling real-world physical systems is one of the most
should be inferred or assumed. basic aspects of understanding human intelligence. By rep-
• Configurable within-block structure. The functions resenting objects as nodes and relations as edges, we can
and their inputs within a GN block can have different perform GNN-based reasoning about objects, relations, and
settings so that the GN framework provides flex- physics in a simplified but effective way.
ibility in within-block structure configuration. For [4] proposed Interaction Networks to make predictions
example, [77] and [3] use the full GN blocks. Their and inferences about various physical systems. The model
φ implementations use neural networks and their ρ takes objects and relations as input, reasons about their
functions use the elementwise summation. Based on interactions, and applies the effects and physical dynamics
different structure and functions settings, a variety of to predict new states. They separately model relation-centric
models (such as MPNN, NLNN and other variants) and object-centric models, making it easier to generalize
could be expressed by the GN framework. And more across different systems.
details could be found in [27]. Visual Interaction Networks [64] could make predictions
• Composable multi-block architectures. GN blocks from pixels. It learns a state code from two consecutive input
could be composed to construct complex architec- frames for each object. Then, after adding their interaction
tures. Arbitrary numbers of GN blocks could be com- effect by an Interaction Net block, the state decoder converts
posed in sequence with shared or unshared param- state codes to next step’s state.
eters. [27] utilizes GN blocks to construct an encode-
process-decode architecture and a recurrent GN-based [3] proposed a Graph Network based model which
architecture. These architectures are demonstrated could either perform state prediction or inductive inference.
in Fig. 3. Other techniques for building GN based The inference model takes partially observed information
architectures could also be useful, such as skip con- as input and constructs a hidden graph for implicit system
nections, LSTM- or GRU-style gating schemes and so classification.
on.
3.1.2 Chemistry and Biology
Molecular Fingerprints Calculating molecular fingerprints,
which means feature vectors that represent moleculars, is
a core step in computer-aided drug design. Conventional
3 A PPLICATIONS molecular fingerprints are hand-made and fixed. By ap-
plying GNN to molecular graphs, we can obtain better
Graph neural networks have been explored in a wide range fingerprints.
of problem domains across supervised, semi-supervised,
[33] proposed neural graph fingerprints which calculate sub-
unsupervised and reinforcement learning settings. In this
structure feature vectors via GCN and sum to get overall
section, we simply divide the applications in three scenarios:
representation. The aggregation function is
(1) Structural scenarios where the data has explicit relational X
structure, such as physical systems, molecular structure and htNv = CONCAT(htu , euv ) (41)
knowledge graphs; (2) Non-structural scenarios where the u∈N (v)
relational structure is not explicit include image, text, etc; (3)
Other application scenarios such as generative models and Where euv is the edge feature of edge (u, v). Then update
combinatorial optimization problems. Note that we only list node representation by
several representative applications instead of providing an ht+1 = σ(Wt
deg(v) t
hNv ) (42)
v
exhaustive list. The summary of the applications could be
found in Table 3. Where deg(v) is the degree of node v and WtN
is a learned
matrix for each time step t and node degree N .
[72] further explicitly models atom and atom pairs
3.1 Structural Scenarios independently to emphasize atom interactions. It introduces
edge representation etuv instead of aggregation function, i.e.
t t
P
In the following subsections, we will introduce GNN’s hNv = u∈N (v) euv . The node update function is
applications in structural scenarios, where the data are natu-
ht+1
v = ReLU(W1 (ReLU(W0 htu ), htNv )) (43)
rally performed in the graph structure. For example, GNNs
are widely being used in social network prediction [1], [2], while the edge update function is
traffic prediction [56], [118], recommender systems [119],
[120] and graph representation [121]. Specifically, we are et+1 t t t
uv = ReLU(W4 (ReLU(W2 euv ), ReLU(W3 (hv , hu ))))
discussing how to model real-world physical systems with (44)
object-relationship graphs, how to predict chemical prop- Protein Interface Prediction [5] focused on the task
erties of molecules and biological interaction properties of named protein interface prediction, which is a challenging
proteins and the methods of reasoning about the out-of- problem with important applications in drug discovery
knowledge-base(OOKB) entities in knowledge graphs. and design. The proposed GCN based method respectively
13
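To make the neural fingerprint propagation of Eqs. 41-42 concrete, here is a minimal sketch of one step. It is not the original implementation; the dictionary-based graph representation and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def neural_fingerprint_step(neighbors, edge_feat, H, W_by_degree):
    """One propagation step of neural graph fingerprints (Eqs. 41-42).

    neighbors: dict {v: list of u}; edge_feat: dict {(u, v): (d_e,) array};
    H: (N, d_h) atom states; W_by_degree: dict {deg: ((d_h + d_e), d_h) matrix}.
    """
    H_new = np.zeros_like(H)
    for v, nbrs in neighbors.items():
        # h_{N_v} = sum_u CONCAT(h_u, e_uv)   (Eq. 41)
        agg = sum(np.concatenate([H[u], edge_feat[(u, v)]]) for u in nbrs)
        W = W_by_degree[len(nbrs)]                      # degree-specific weights (Eq. 42)
        H_new[v] = 1.0 / (1.0 + np.exp(-(agg @ W)))     # sigma = logistic sigmoid
    return H_new

rng = np.random.default_rng(9)
d_h, d_e = 6, 2
H = rng.normal(size=(4, d_h))
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
edge_feat = {(u, v): rng.normal(size=d_e)
             for v, us in neighbors.items() for u in us}
W_by_degree = {k: rng.normal(scale=0.1, size=(d_h + d_e, d_h)) for k in (1, 2)}
print(neural_fingerprint_step(neighbors, edge_feat, H, W_by_degree).shape)  # (4, 6)
```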

TABLE 3
Applications of graph neural networks.

Area            | Application                       | Algorithm        | Deep Learning Model          | References
Text            | Text classification               | GCN              | Graph Convolutional Network  | [1], [2], [20], [21], [36], [38]
Text            | Text classification               | GAT              | Graph Attention Network      | [54]
Text            | Text classification               | DGCNN            | Graph Convolutional Network  | [79]
Text            | Text classification               | Text GCN         | Graph Convolutional Network  | [80]
Text            | Text classification               | Sentence LSTM    | Graph LSTM                   | [48]
Text            | Sequence Labeling (POS, NER)      | Sentence LSTM    | Graph LSTM                   | [48]
Text            | Sentiment classification          | Tree LSTM        | Graph LSTM                   | [46]
Text            | Semantic role labeling            | Syntactic GCN    | Graph Convolutional Network  | [81]
Text            | Neural machine translation        | Syntactic GCN    | Graph Convolutional Network  | [82], [83]
Text            | Neural machine translation        | GGNN             | Gated Graph Neural Network   | [31]
Text            | Relation extraction               | Tree LSTM        | Graph LSTM                   | [84]
Text            | Relation extraction               | Graph LSTM       | Graph LSTM                   | [34], [85]
Text            | Relation extraction               | GCN              | Graph Convolutional Network  | [86]
Text            | Event extraction                  | Syntactic GCN    | Graph Convolutional Network  | [87], [88]
Text            | AMR to text generation            | Sentence LSTM    | Graph LSTM                   | [89]
Text            | AMR to text generation            | GGNN             | Gated Graph Neural Network   | [31]
Text            | Multi-hop reading comprehension   | Sentence LSTM    | Graph LSTM                   | [90]
Text            | Relational reasoning              | RN               | MLP                          | [69]
Text            | Relational reasoning              | Recurrent RN     | Recurrent Neural Network     | [91]
Text            | Relational reasoning              | IN               | Graph Neural Network         | [4]
Image           | Social Relationship Understanding | GRM              | Gated Graph Neural Network   | [92]
Image           | Image classification              | GCN              | Graph Convolutional Network  | [93], [94]
Image           | Image classification              | GGNN             | Gated Graph Neural Network   | [95]
Image           | Image classification              | ADGPM            | Graph Convolutional Network  | [29]
Image           | Image classification              | GSNN             | Gated Graph Neural Network   | [96]
Image           | Visual Question Answering         | GGNN             | Gated Graph Neural Network   | [92], [97], [98]
Image           | Object Detection                  | RN               | Graph Attention Network      | [99], [100]
Image           | Interaction Detection             | GPNN             | Graph Neural Network         | [101]
Image           | Interaction Detection             | Structural-RNN   | Graph Neural Network         | [102]
Image           | Region Classification             | GCNN             | Graph CNN                    | [103]
Image           | Semantic Segmentation             | Graph LSTM       | Graph LSTM                   | [49], [104]
Image           | Semantic Segmentation             | GGNN             | Gated Graph Neural Network   | [105]
Image           | Semantic Segmentation             | DGCNN            | Graph CNN                    | [106]
Image           | Semantic Segmentation             | 3DGNN            | Graph Neural Network         | [107]
Science         | Physics Systems                   | IN               | Graph Neural Network         | [4]
Science         | Physics Systems                   | VIN              | Graph Neural Network         | [64]
Science         | Physics Systems                   | GN               | Graph Networks               | [3]
Science         | Molecular Fingerprints            | NGF              | Graph Convolutional Network  | [33]
Science         | Molecular Fingerprints            | GCN              | Graph Convolutional Network  | [72]
Science         | Protein Interface Prediction      | GCN              | Graph Convolutional Network  | [5]
Science         | Side Effects Prediction           | Decagon          | Graph Convolutional Network  | [108]
Science         | Disease Classification            | PPIN             | Graph Convolutional Network  | [109]
Knowledge Graph | KB Completion                     | GNN              | Graph Neural Network         | [6]
Knowledge Graph | KG Alignment                      | GCN              | Graph Convolutional Network  | [110]
Other           | Combinatorial Optimization        | structure2vec    | Graph Convolutional Network  | [7]
Other           | Combinatorial Optimization        | GNN              | Graph Neural Network         | [111]
Other           | Combinatorial Optimization        | GCN              | Graph Convolutional Network  | [112]
Other           | Combinatorial Optimization        | AM               | Graph Attention Network      | [113]
Other           | Graph Generation                  | NetGAN           | Long short-term memory       | [114]
Other           | Graph Generation                  | GraphRNN         | Recurrent Neural Network     | [125]
Other           | Graph Generation                  | Regularizing VAE | Variational Autoencoder      | [115]
Other           | Graph Generation                  | GCPN             | Graph Convolutional Network  | [116]
Other           | Graph Generation                  | MolGAN           | Relational-GCN               | [117]
Fig. 3. Examples of architectures composed by GN blocks: (a) the sequential processing architecture; (b) the encode-process-decode architecture; (c) the recurrent architecture.

Fig. 4. Application scenarios: (a) physics; (b) molecule; (c) image; (d) text; (e) social network; (f) generation.

The proposed GCN-based method respectively learns ligand and receptor protein residue representations and merges them for pairwise classification.

GNNs can also be used in biomedical engineering. Based on a protein-protein interaction network, [109] leverages graph convolution and relation networks for breast cancer subtype classification. [108] also suggests a GCN-based model for polypharmacy side effects prediction. Their work models the drug and protein interaction network and separately deals with edges of different types.

3.1.3 Knowledge graph

[6] utilizes GNNs to solve the out-of-knowledge-base (OOKB) entity problem in knowledge base completion (KBC). The OOKB entities in [6] are directly connected to existing entities, so the embeddings of OOKB entities can be aggregated from the existing entities. The method achieves satisfying performance both in the standard KBC setting and in the OOKB setting.
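As a concrete illustration of the OOKB aggregation described above, the snippet below shows one plausible way to embed a new entity by pooling transformed embeddings of its known neighbors. It is a hedged sketch, not the model of [6]; the mean pooling, the tanh nonlinearity and the weight shapes are assumptions.

import numpy as np

def embed_ookb_entity(neighbor_ids, relation_ids, entity_emb, relation_emb, W):
    """Estimate an embedding for an out-of-knowledge-base (OOKB) entity.

    neighbor_ids : indices of existing entities linked to the new entity
    relation_ids : indices of the relations on those links
    entity_emb   : (num_entities, d) embedding table of known entities
    relation_emb : (num_relations, d) embedding table of relations
    W            : (d, 2 * d) transition matrix applied to [neighbor; relation]
    """
    messages = []
    for n, r in zip(neighbor_ids, relation_ids):
        pair = np.concatenate([entity_emb[n], relation_emb[r]])
        messages.append(np.tanh(W @ pair))
    # Pool the transformed neighbor messages; mean pooling is one common choice.
    return np.mean(messages, axis=0)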
[110] utilizes GCNs to solve the cross-lingual knowledge graph alignment problem. The model embeds entities from different languages into a unified embedding space and aligns them based on the embedding similarity.
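The alignment step described above reduces to a nearest-neighbour search in the shared embedding space. The following sketch assumes GCN-encoded entity matrices for the two languages are already available; the cosine-similarity greedy matching is an illustrative choice, not necessarily the exact procedure of [110].

import numpy as np

def align_entities(emb_lang1, emb_lang2):
    """Match each entity of language 1 to its most similar entity of language 2.

    emb_lang1 : (n1, d) GCN-encoded entity embeddings for language 1
    emb_lang2 : (n2, d) GCN-encoded entity embeddings for language 2
    Returns an array of indices into emb_lang2, one per language-1 entity.
    """
    # Cosine similarity between every cross-lingual entity pair.
    a = emb_lang1 / np.linalg.norm(emb_lang1, axis=1, keepdims=True)
    b = emb_lang2 / np.linalg.norm(emb_lang2, axis=1, keepdims=True)
    similarity = a @ b.T              # (n1, n2)
    return similarity.argmax(axis=1)  # greedy one-way matching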
3.2 Non-structural Scenarios

In this section we discuss applications in non-structural scenarios such as images, text, programming source code [45], [122] and multi-agent systems [63], [66], [76]. We only give a detailed introduction to the first two scenarios due to the length limit. Roughly, there are two ways to apply graph neural networks to non-structural scenarios: (1) incorporate structural information from other domains to improve performance, for example using information from knowledge graphs to alleviate zero-shot problems in image tasks; (2) infer or assume a relational structure in the scenario and then apply the model to solve problems defined on graphs, such as the method in [48] which models text as graphs.

3.2.1 Image

Image Classification Image classification is a very basic and important task in the field of computer vision, which attracts much attention and has many famous datasets like ImageNet [123]. Recent progress in image classification benefits from big data and the strong power of GPU computation, which allows us to train a classifier without extracting structural information from images. However, zero-shot and few-shot learning are becoming more and more popular in the field of image classification, because most models can achieve similar performance given enough data. Several works leverage graph neural networks to incorporate structural information into image classification.

First, knowledge graphs can be used as extra information to guide zero-shot recognition [29], [94]. [94] builds a knowledge graph where each node corresponds to an object category and takes the word embeddings of nodes as input for predicting the classifiers of different categories. As the over-smoothing effect appears when the convolution architecture becomes deep, the 6-layer GCN used in [94] washes out much useful information in the representation. To mitigate the smoothing problem in the propagation of GCN, [29] uses a single-layer GCN with a larger neighborhood that includes both one-hop and multi-hop nodes in the graph, and this proved effective for building zero-shot classifiers from existing ones.

Besides the knowledge graph, the similarity between images in the dataset is also helpful for few-shot learning [93]. [93] proposed to build a weighted fully connected image network based on the similarity and to perform message passing in the graph for few-shot recognition. As most knowledge graphs are too large for reasoning, [96] selects some related entities to build a sub-graph based on the result of object detection and applies GGNN to the extracted graph for prediction. Besides, [95] proposed to construct a new knowledge graph where the entities are all the categories. They defined three types of label relations (super-subordinate, positive correlation, and negative correlation) and propagate the confidence of labels in the graph directly.

Visual Reasoning Computer-vision systems usually need to perform reasoning by incorporating both spatial and semantic information, so it is natural to generate graphs for reasoning tasks.

A typical visual reasoning task is visual question answering (VQA). [97] respectively constructs an image scene graph and a question syntactic graph, then applies GGNN to train the embeddings for predicting the final answer. Besides spatial connections among objects, [124] builds relational graphs conditioned on the questions. With knowledge graphs, [92], [98] can perform finer relation exploration and a more interpretable reasoning process.

Other applications of visual reasoning include object detection, interaction detection, and region classification. In object detection [99], [100], GNNs are used to calculate RoI features; in interaction detection [101], [102], GNNs act as message-passing tools between humans and objects; in region classification [103], GNNs perform reasoning on graphs that connect regions and classes.

Semantic Segmentation Semantic segmentation is a crucial step toward image understanding. The task here is to assign a unique label (or category) to every single pixel in the image, which can be considered as a dense classification problem. However, regions in images are often not grid-like and need non-local information, which leads to the failure of traditional CNNs. Several works utilize graph-structured data to handle this.

[49] proposed Graph-LSTM to model long-term dependency together with spatial connections by building graphs in the form of distance-based superpixel maps and applying LSTM to propagate neighborhood information globally. Subsequent work improved it from the perspective of encoding hierarchical information [104].

Furthermore, 3D semantic segmentation (RGBD semantic segmentation) and point cloud classification utilize more geometric information and are therefore hard to model with a 2D CNN. [107] constructs a K-nearest-neighbor (KNN) graph and uses a 3D GNN as the propagation model. After unrolling for several steps, the prediction model takes the hidden state of each node as input and predicts its semantic label.

As there are always too many points, [105] solves large-scale 3D point cloud segmentation by building superpoint graphs and generating embeddings for them. To classify supernodes, [105] leverages GGNN and graph convolution.

[106] proposed to model point interactions through edges. They calculate edge representation vectors by feeding in the coordinates of the terminal nodes, and node embeddings are then updated by edge aggregation.
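A minimal sketch of such edge-based point interaction is given below: edge features are computed from the coordinates of the two endpoints of each k-nearest-neighbor edge and then aggregated back onto nodes. It is an illustrative reading of the idea rather than the exact formulation of [106]; the weight shapes, the use of coordinate differences and the max aggregation are assumptions.

import numpy as np

def knn_edges(points, k):
    """Return (i, j) pairs connecting each point to its k nearest neighbours."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]
    return [(i, j) for i in range(len(points)) for j in nbrs[i]]

def edge_aggregation_step(points, h, edges, W_edge, W_node):
    """One propagation step driven by edge features computed from coordinates.

    points : (N, 3) point coordinates
    h      : (N, d) current node embeddings
    W_edge : (d, 6) weights mapping [x_i, x_j - x_i] to an edge vector
    W_node : (d, 2 * d) weights mapping [h_i, aggregated edges] to the new state
    """
    N, d = h.shape
    agg = np.zeros((N, d))
    for i, j in edges:
        e_ij = np.maximum(W_edge @ np.concatenate([points[i], points[j] - points[i]]), 0.0)
        agg[i] = np.maximum(agg[i], e_ij)  # max aggregation over incident edges
    return np.maximum(W_node @ np.concatenate([h, agg], axis=1).T, 0.0).T

In practice the k-NN graph may also be recomputed in feature space at every layer, which is one of the design choices explored in this line of work.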
3.2.2 Text

Graph neural networks can be applied to several text-based tasks, both at the sentence level (e.g. text classification) and at the word level (e.g. sequence labeling). We introduce several major applications on text in the following.

Text classification Text classification is an important and classical problem in natural language processing. The classical GCN models [1], [2], [20], [21], [36], [38] and the GAT model [54] are applied to solve the problem, but they only use the structural information between documents and do not use much text information. [79] proposed a graph-CNN-based deep learning model that first converts texts to graphs-of-words and then uses the graph convolution operations of [40] to convolve the word graph. [48] proposed the Sentence LSTM to encode text. It views the whole sentence as a single state, which consists of sub-states for individual words and an overall sentence-level state, and it uses the global sentence-level representation for classification tasks. These methods either view a document or a sentence as a graph of word nodes or rely on document citation relations to construct the graph. [80] regards the documents and words as nodes to construct a corpus graph (hence a heterogeneous graph) and uses the Text GCN to learn embeddings of words and documents. Sentiment classification can also be regarded as a text classification problem, and a Tree-LSTM approach is proposed by [46].

Sequence labeling As each node in a GNN has its own hidden state, we can utilize the hidden state to address the sequence labeling problem if we consider every word in the sentence as a node. [48] utilizes the Sentence LSTM to label the sequence; it conducted experiments on POS-tagging and NER tasks and achieves state-of-the-art performance.

Semantic role labeling is another sequence labeling task. [81] proposed a Syntactic GCN to solve the problem. The Syntactic GCN, which operates on a directed graph with labeled edges, is a special variant of the GCN [2]. It integrates edge-wise gates that let the model regulate the contributions of individual dependency edges. The Syntactic GCNs over syntactic dependency trees are used as sentence encoders to learn latent feature representations of words in the sentence. [81] also reveals that GCNs and LSTMs are functionally complementary in this task.

Neural machine translation The neural machine translation task is usually considered as a sequence-to-sequence task. [52] introduces attention mechanisms and replaces the most commonly used recurrent or convolutional layers. In fact, the Transformer assumes a fully connected graph structure between linguistic entities.

One popular application of GNNs is to incorporate syntactic or semantic information into the NMT task. [82] utilizes the Syntactic GCN for syntax-aware NMT. [83] incorporates information about the predicate-argument structure of source sentences (namely, semantic-role representations) using a Syntactic GCN and compares the results of incorporating only syntactic information, only semantic information, or both into the task. [31] utilizes the GGNN in syntax-aware NMT. It converts the syntactic dependency graph into a new structure called the Levi graph by turning the edges into additional nodes, so that edge labels can be represented as embeddings.

Relation extraction Extracting semantic relations between entities in texts is an important and well-studied task. Some systems treat this task as a pipeline of two separate tasks, named entity recognition and relation extraction. [84] proposed an end-to-end relation extraction model using bidirectional sequential and tree-structured LSTM-RNNs. [86] proposed an extension of graph convolutional networks that is tailored for relation extraction and applied a pruning strategy to the input trees.

Cross-sentence N-ary relation extraction detects relations among n entities across multiple sentences. [34] explores a general framework for cross-sentence n-ary relation extraction based on graph LSTMs. It splits the input graph into two DAGs, although important information can be lost in the splitting procedure. [85] proposed a graph-state LSTM model that keeps the original graph structure and speeds up computation by allowing more parallelization.

Event extraction Event extraction is an important information extraction task that recognizes instances of specified types of events in texts. [87] investigates a convolutional neural network (which is exactly the Syntactic GCN) based on dependency trees to perform event detection. [88] proposed a Jointly Multiple Events Extraction (JMEE) framework that jointly extracts multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information.

Other applications GNNs can also be applied to many other tasks. Several works focus on the AMR-to-text generation task: a Sentence-LSTM-based method [89] and a GGNN-based method [31] have been proposed in this area. [46] uses the Tree LSTM to model the semantic relatedness of two sentences, and [90] exploits the Sentence LSTM to solve the multi-hop reading comprehension problem. Another important direction is relational reasoning: relational networks [69], interaction networks [4] and recurrent relational networks [91] are proposed to solve relational reasoning tasks based on text. The works cited above are not an exhaustive list, and we encourage our readers to find more works and application domains of graph neural networks that they are interested in.

3.3 Other Scenarios

Besides structural and non-structural scenarios, there are some other scenarios where graph neural networks play an important role. In this subsection, we introduce generative graph models and combinatorial optimization with GNNs.

3.3.1 Generative Models

Generative models for real-world graphs have drawn significant attention for their important applications, including modeling social interactions, discovering new chemical structures, and constructing knowledge graphs. As deep learning methods have a powerful ability to learn the implicit distribution of graphs, there has recently been a surge in neural graph generative models.

NetGAN [114] is one of the first works to build a neural graph generative model; it generates graphs via random walks. It transforms the problem of graph generation into the problem of walk generation: it takes the random walks from a specific graph as input and trains a walk generative model using a GAN architecture. While the generated graph preserves important topological properties of the original graph, the number of nodes cannot change during the generation process and remains the same as in the original graph.
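Since the walk-based view above is central to this family of models, the sketch below shows the two bookends of such a pipeline under simplifying assumptions: sampling training walks from a given graph, and assembling a graph from (generated) walks by keeping the most frequently traversed edges. The generative model in between is omitted, and the count-based assembly rule is an illustrative simplification rather than the exact procedure of NetGAN.

import random
from collections import Counter

def sample_walks(adj_list, num_walks, walk_len, seed=0):
    """Sample fixed-length random walks from an adjacency-list graph.

    Assumes every node has at least one neighbour.
    """
    rng = random.Random(seed)
    nodes = list(adj_list)
    walks = []
    for _ in range(num_walks):
        v = rng.choice(nodes)
        walk = [v]
        for _ in range(walk_len - 1):
            v = rng.choice(adj_list[v])
            walk.append(v)
        walks.append(walk)
    return walks

def assemble_graph(walks, num_edges):
    """Keep the num_edges most frequently traversed (undirected) edges."""
    counts = Counter()
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            counts[tuple(sorted((u, v)))] += 1
    return [edge for edge, _ in counts.most_common(num_edges)]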
GraphRNN [125] manages to generate the adjacency matrix of a graph by generating the adjacency vector of each node step by step, so it can output networks with different numbers of nodes.

Instead of generating the adjacency matrix sequentially, MolGAN [117] predicts the discrete graph structure (the adjacency matrix) at once and utilizes a permutation-invariant discriminator to handle the node-permutation problem in the adjacency matrix. Besides, it applies a reward network for RL-based optimization towards desired chemical properties. Furthermore, [115] proposed constrained variational autoencoders to ensure the semantic validity of generated graphs, and GCPN [116] incorporates domain-specific rules through reinforcement learning.

[126] proposed a model that generates edges and nodes sequentially and utilizes a graph neural network to extract the hidden state of the current graph, which is used to decide the action in the next step of the sequential generative process.

3.3.2 Combinatorial Optimization

Combinatorial optimization problems over graphs are a set of NP-hard problems that attract much attention from scientists in many fields. Some specific problems, such as the traveling salesman problem (TSP) and minimum spanning trees (MST), have various heuristic solutions. Recently, using deep neural networks to solve such problems has become a hotspot, and some of the solutions further leverage graph neural networks because of the graph structure of the problems.

[127] first proposed a deep-learning approach to tackle TSP. Their method consists of two parts: a Pointer Network [128] for parameterizing rewards and a policy gradient [129] module for training. This work has been shown to be comparable with traditional approaches. However, Pointer Networks are designed for sequential data like texts, while order-invariant encoders are more appropriate for such problems.

[130] improved the approach by including graph neural networks. They first obtain the node embeddings from structure2vec [67], which is a variant of GCN, and then feed them into a Q-learning module for making decisions. This work achieved better performance than previous algorithms, which demonstrates the representation power of graph neural networks.

[113] proposed an attention-based encoder-decoder algorithm. Such a method can be regarded as a graph attention network whose readout phase is an attention-based decoder instead of a reinforcement learning module, which is theoretically efficient for training.

[111] focused on the Quadratic Assignment Problem, i.e. measuring the similarity of two graphs. The GNN-based model learns node embeddings for each graph independently and matches them using an attention mechanism. This method offers intriguingly good performance even in regimes where standard relaxation-based techniques appear to suffer.

[112] uses GNNs directly as classifiers that can perform dense prediction over a graph with pairwise edges. The rest of the model helps the GNNs make diverse choices and train efficiently.

4 OPEN PROBLEMS

Although GNNs have achieved great success in different fields, it is remarkable that GNN models are not good enough to offer satisfying solutions for any graph in any condition. In this section, we state some open problems for further research.

Shallow Structure Traditional deep neural networks can stack hundreds of layers to get better performance, because a deeper structure has more parameters, which improves the expressive power significantly. However, graph neural networks are always shallow, most of them having no more than three layers. As the experiments in [62] show, stacking multiple GCN layers results in over-smoothing, that is to say, all vertices converge to the same value. Although some researchers have managed to tackle this problem [45], [62], it remains the biggest limitation of GNNs. Designing truly deep GNNs is an exciting challenge for future research, and will be a considerable contribution to the understanding of GNNs.

Dynamic Graphs Another challenging problem is how to deal with graphs with dynamic structures. Static graphs are stable so they can be modeled feasibly, while dynamic graphs introduce changing structures. When edges and nodes appear or disappear, GNNs cannot adapt accordingly. Dynamic GNNs are being actively researched, and we believe they will be a big milestone for the stability and adaptability of general GNNs.

Non-Structural Scenarios Although we have discussed the applications of GNNs in non-structural scenarios, we found that there is no optimal method to generate graphs from raw data. In the image domain, some works utilize a CNN to obtain feature maps and then upsample them to form superpixels as nodes [49], while others directly leverage object detection algorithms to get object nodes. In the text domain [103], some works employ syntactic trees as syntactic graphs while others adopt fully connected graphs. Therefore, finding the best graph generation approach will offer a wider range of fields where GNNs can make contributions.

Scalability How to apply embedding methods in web-scale conditions like social networks or recommendation systems has been a fatal problem for almost all graph embedding algorithms, and GNNs are not an exception. Scaling up GNNs is difficult because many of the core steps are computationally consuming in a big-data environment. There are several examples of this phenomenon: first, graph data are not regular Euclidean, and each node has its own neighborhood structure, so mini-batches cannot be applied directly; second, calculating the graph Laplacian is also unfeasible when there are millions of nodes and edges. Moreover, we need to point out that scalability determines whether an algorithm can be applied in practice. Several works have proposed their solutions to this problem [120] and we are paying close attention to the progress.
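One widely used way to restore mini-batching is to train on sampled neighborhoods instead of the full graph, as done in sampling-based methods such as [1], [59], [60]. The sketch below illustrates the idea of fixed-size neighbor sampling for a batch of target nodes; it is a schematic illustration under stated assumptions, not the implementation of any particular system.

import random

def sample_neighborhood(adj_list, batch_nodes, fanouts, seed=0):
    """Sample a fixed-size multi-hop neighborhood for a batch of target nodes.

    adj_list    : dict mapping node -> list of neighbours
    batch_nodes : the nodes whose representations we want to update
    fanouts     : e.g. [10, 5] samples 10 neighbours per node at hop 1, 5 at hop 2
    Returns one list of sampled nodes per hop (hop 0 is the batch itself).
    """
    rng = random.Random(seed)
    layers = [list(batch_nodes)]
    frontier = list(batch_nodes)
    for fanout in fanouts:
        sampled = []
        for v in frontier:
            nbrs = adj_list[v]
            k = min(fanout, len(nbrs))
            sampled.extend(rng.sample(nbrs, k))
        layers.append(sampled)
        frontier = sampled
    return layers

Only the union of these sampled nodes needs to be touched when computing the batch, which bounds the per-step cost independently of the full graph size.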
5 CONCLUSION

Over the past few years, graph neural networks have become powerful and practical tools for machine learning tasks in the graph domain. This progress owes to advances in expressive power, model flexibility, and training algorithms. In this survey, we conduct a comprehensive review of graph neural networks. For GNN models, we introduce their variants categorized by graph type, propagation type, and training type. Moreover, we also summarize several general frameworks that uniformly represent different variants. In terms of application taxonomy, we divide the GNN applications into structural scenarios, non-structural scenarios, and other scenarios, and then give a detailed review of the applications in each scenario. Finally, we suggest four open problems indicating the major challenges and future research directions of graph neural networks, including model depth, scalability, and the ability to deal with dynamic graphs and non-structural scenarios.

REFERENCES

[1] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," NIPS 2017, pp. 1024-1034, 2017.
[2] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," ICLR 2017, 2017.
[3] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia, "Graph networks as learnable physics engines for inference and control," arXiv preprint arXiv:1806.01242, 2018.
[4] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., "Interaction networks for learning about objects, relations and physics," in NIPS 2016, 2016, pp. 4502-4510.
[5] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur, "Protein interface prediction using graph convolutional networks," in NIPS 2017, 2017, pp. 6530-6539.
[6] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto, "Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach," in IJCAI 2017, 2017, pp. 1802-1808.
[7] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," arXiv preprint arXiv:1704.01665, 2017.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature 2015, vol. 521, no. 7553, p. 436, 2015.
[10] F. R. Chung and F. C. Graham, Spectral graph theory. American Mathematical Soc., 1997, no. 92.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[12] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in SIGKDD 2014. ACM, 2014, pp. 701-710.
[13] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in SIGKDD. ACM, 2016, pp. 855-864.
[14] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," in WWW 2015, 2015, pp. 1067-1077.
[15] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, "Network representation learning with rich text information," in IJCAI 2015, 2015, pp. 2111-2117.
[16] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation learning on graphs: Methods and applications," arXiv preprint arXiv:1709.05584, 2017.
[17] T. Kawamoto, M. Tsubaki, and T. Obuchi, "Mean-field theory of graph neural networks in graph partitioning," in NeurIPS 2018, 2018, pp. 4366-4376.
[18] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE TNN 2009, vol. 20, no. 1, pp. 61-80, 2009.
[19] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "Computational capabilities of graph neural networks," IEEE TNN 2009, vol. 20, no. 1, pp. 81-102, 2009.
[20] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, "Geometric deep learning on graphs and manifolds using mixture model cnns," CVPR 2017, pp. 5425-5434, 2017.
[21] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in NIPS 2016, 2016, pp. 1993-2001.
[22] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, "Geodesic convolutional neural networks on riemannian manifolds," in ICCV workshops 2015, 2015, pp. 37-45.
[23] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein, "Learning shape correspondence with anisotropic convolutional neural networks," in NIPS 2016, 2016, pp. 3189-3197.
[24] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond euclidean data," IEEE SPM 2017, vol. 34, no. 4, pp. 18-42, 2017.
[25] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," arXiv preprint arXiv:1704.01212, 2017.
[26] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," arXiv preprint arXiv:1711.07971, vol. 10, 2017.
[27] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.
[28] M. A. Khamsi and W. A. Kirk, An introduction to metric spaces and fixed point theory. John Wiley & Sons, 2011, vol. 53.
[29] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing, "Rethinking knowledge graph propagation for zero-shot learning," arXiv preprint arXiv:1805.11724, 2018.
[30] Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu, "Deep collective classification in heterogeneous information networks," in WWW 2018, 2018, pp. 399-408.
[31] D. Beck, G. Haffari, and T. Cohn, "Graph-to-sequence learning using gated graph neural networks," in ACL 2018, 2018, pp. 273-283.
[32] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in ESWC 2018. Springer, 2018, pp. 593-607.
[33] D. K. Duvenaud, D. Maclaurin, J. Aguileraiparraguirre, R. Gomezbombarelli, T. D. Hirzel, A. Aspuruguzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," NIPS 2015, pp. 2224-2232, 2015.
[34] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W.-t. Yih, "Cross-sentence n-ary relation extraction with graph lstms," arXiv preprint arXiv:1708.03743, 2017.
[35] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, "Spectral networks and locally connected networks on graphs," ICLR 2014, 2014.
[36] M. Henaff, J. Bruna, and Y. Lecun, "Deep convolutional networks on graph-structured data," arXiv: Learning, 2015.
[37] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129-150, 2011.
[38] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," NIPS 2016, pp. 3844-3852, 2016.
[39] Y. C. Ng, N. Colombo, and R. Silva, "Bayesian semi-supervised learning with graph gaussian processes," in NeurIPS 2018, 2018, pp. 1690-1701.
[40] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in ICML 2016, 2016, pp. 2014-2023.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in ECCV 2016. Springer, 2016, pp. 630-645.
[42] J. Chang, J. Gu, L. Wang, G. Meng, S. Xiang, and C. Pan, "Structure-aware convolutional neural networks," in NeurIPS 2018, 2018, pp. 11-20.
[43] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," EMNLP 2014, pp. 1724-1734, 2014.
[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” [73] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and
Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. A. Tkatchenko, “Quantum-chemical insights from deep tensor
[45] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph neural networks,” Nature communications, vol. 8, p. 13890, 2017.
sequence neural networks,” arXiv: Learning, 2016. [74] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for
[46] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic image denoising,” in CVPR 2005, vol. 2. IEEE, 2005, pp. 60–65.
representations from tree-structured long short-term memory [75] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color
networks,” IJCNLP 2015, pp. 1556–1566, 2015. images,” in Computer Vision 1998. IEEE, 1998, pp. 839–846.
[47] V. Zayats and M. Ostendorf, “Conversation modeling on reddit [76] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neu-
using a graph-structured lstm,” TACL 2018, vol. 6, pp. 121–132, ral relational inference for interacting systems,” arXiv preprint
2018. arXiv:1802.04687, 2018.
[48] Y. Zhang, Q. Liu, and L. Song, “Sentence-state lstm for text [77] J. B. Hamrick, K. R. Allen, V. Bapst, T. Zhu, K. R. McKee, J. B.
representation,” ACL 2018, vol. 1, pp. 317–327, 2018. Tenenbaum, and P. W. Battaglia, “Relational inductive bias for
[49] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, “Semantic object physical construction in humans and machines,” arXiv preprint
parsing with graph lstm,” ECCV 2016, pp. 125–143, 2016. arXiv:1806.01203, 2018.
[50] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation [78] T. Wang, R. Liao, J. Ba, and S. Fidler, “Nervenet: Learning
by jointly learning to align and translate,” ICLR 2015, 2015. structured policy with graph neural networks,” 2018.
[51] J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin, “A convolu- [79] H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and
tional encoder model for neural machine translation,” ACL 2017, Q. Yang, “Large-scale hierarchical text classification with recur-
vol. 1, pp. 123–135, 2017. sively regularized deep graph-cnn,” in WWW 2018, 2018, pp.
[52] A. Vaswani, N. Shazeer, N. Parmar, L. Jones, J. Uszkoreit, A. N. 1063–1072.
Gomez, and L. Kaiser, “Attention is all you need,” NIPS 2017, pp. [80] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for
5998–6008, 2017. text classification,” arXiv preprint arXiv:1809.05679, 2018.
[53] J. Cheng, L. Dong, and M. Lapata, “Long short-term memory- [81] D. Marcheggiani and I. Titov, “Encoding sentences with graph
networks for machine reading,” EMNLP 2016, pp. 551–561, 2016. convolutional networks for semantic role labeling,” in EMNLP
[54] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and 2017, September 2017, pp. 1506–1515.
Y. Bengio, “Graph attention networks,” ICLR 2018, 2018. [82] J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan,
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for “Graph convolutional encoders for syntax-aware neural machine
image recognition,” CVPR 2016, pp. 770–778, 2016. translation,” EMNLP 2017, pp. 1957–1967, 2017.
[56] A. Rahimi, T. Cohn, and T. Baldwin, “Semi-supervised user [83] D. Marcheggiani, J. Bastings, and I. Titov, “Exploiting semantics
geolocation via graph convolutional networks,” ACL 2018, vol. 1, in neural machine translation with graph convolutional net-
pp. 2009–2019, 2018. works,” arXiv preprint arXiv:1804.08313, 2018.
[57] J. G. Zilly, R. K. Srivastava, J. Koutnik, and J. Schmidhuber, [84] M. Miwa and M. Bansal, “End-to-end relation extraction us-
“Recurrent highway networks.” ICML 2016, pp. 4189–4198, 2016. ing lstms on sequences and tree structures,” arXiv preprint
[58] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka, arXiv:1601.00770, 2016.
“Representation learning on graphs with jumping knowledge [85] L. Song, Y. Zhang, Z. Wang, and D. Gildea, “N-ary relation ex-
networks,” ICML 2018, pp. 5449–5458, 2018. traction using graph state lstm,” arXiv preprint arXiv:1808.09101,
[59] J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph 2018.
convolutional networks via importance sampling,” arXiv preprint
[86] Y. Zhang, P. Qi, and C. D. Manning, “Graph convolution over
arXiv:1801.10247, 2018.
pruned dependency trees improves relation extraction,” arXiv
[60] W. Huang, T. Zhang, Y. Rong, and J. Huang, “Adaptive sampling preprint arXiv:1809.10185, 2018.
towards fast graph representation learning,” in NeurIPS 2018,
[87] T. H. Nguyen and R. Grishman, “Graph convolutional networks
2018, pp. 4563–4572.
with argument-aware pooling for event detection,” 2018.
[61] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph
[88] X. Liu, Z. Luo, and H. Huang, “Jointly multiple events extrac-
convolutional networks with variance reduction.” in ICML 2018,
tion via attention-based graph information aggregation,” arXiv
2018, pp. 941–949.
preprint arXiv:1809.09078, 2018.
[62] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph con-
volutional networks for semi-supervised learning,” arXiv preprint [89] L. Song, Y. Zhang, Z. Wang, and D. Gildea, “A graph-
arXiv:1801.07606, 2018. to-sequence model for amr-to-text generation,” arXiv preprint
[63] Y. Hoshen, “Vain: Attentional multi-agent predictive modeling,” arXiv:1805.02473, 2018.
in NIPS 2017, 2017, pp. 2701–2711. [90] L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea,
[64] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and “Exploring graph-structured passage representation for multi-
A. Tacchetti, “Visual interaction networks: Learning a physics hop reading comprehension with graph neural networks,” arXiv
simulator from video,” in NIPS 2017, 2017, pp. 4539–4547. preprint arXiv:1809.02040, 2018.
[65] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum, [91] R. B. Palm, U. Paquet, and O. Winther, “Recurrent relational
“A compositional object-based approach to learning physical networks,” NeurIPS 2018, 2018.
dynamics,” arXiv preprint arXiv:1612.00341, 2016. [92] Z. Wang, T. Chen, J. Ren, W. Yu, H. Cheng, and L. Lin, “Deep
[66] S. Sukhbaatar, R. Fergus et al., “Learning multiagent communica- reasoning with knowledge graph for social relationship under-
tion with backpropagation,” in NIPS 2016, 2016, pp. 2244–2252. standing,” arXiv preprint arXiv:1807.00504, 2018.
[67] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent [93] V. Garcia and J. Bruna, “Few-shot learning with graph neural
variable models for structured data,” in ICML 2016, 2016, pp. networks,” arXiv preprint arXiv:1711.04043, 2017.
2702–2711. [94] X. Wang, Y. Ye, and A. Gupta, “Zero-shot recognition via seman-
[68] D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and tic embeddings and knowledge graphs,” in CVPR 2018, 2018, pp.
P. Battaglia, “Discovering objects and their relations from entan- 6857–6866.
gled scene representations,” arXiv preprint arXiv:1702.05068, 2017. [95] C.-W. Lee, W. Fang, C.-K. Yeh, and Y.-C. F. Wang, “Multi-label
[69] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, zero-shot learning with structured knowledge graphs,” arXiv
P. Battaglia, and T. Lillicrap, “A simple neural network module preprint arXiv:1711.06526, 2017.
for relational reasoning,” in NIPS 2017, 2017, pp. 4967–4976. [96] K. Marino, R. Salakhutdinov, and A. Gupta, “The more you
[70] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhut- know: Using knowledge graphs for image classification,” arXiv
dinov, and A. J. Smola, “Deep sets,” in NIPS 2017, 2017, pp. 3391– preprint arXiv:1612.04844, 2016.
3401. [97] D. Teney, L. Liu, and A. van den Hengel, “Graph-structured
[71] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning representations for visual question answering,” arXiv preprint,
on point sets for 3d classification and segmentation,” CVPR 2017, 2017.
vol. 1, no. 2, p. 4, 2017. [98] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the box:
[72] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley, Reasoning with graph convolution nets for factual visual ques-
“Molecular graph convolutions: moving beyond fingerprints,” tion answering,” in NeurIPS 2018, 2018, pp. 2659–2670.
Journal of computer-aided molecular design, vol. 30, no. 8, pp. 595– [99] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for
608, 2016. object detection,” in CVPR 2018, vol. 2, no. 3, 2018.
[100] J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai, “Learning region [126] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia,
features for object detection,” arXiv preprint arXiv:1803.07066, “Learning deep generative models of graphs,” arXiv preprint
2018. arXiv:1803.03324, 2018.
[101] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu, “Learning human- [127] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural
object interactions by graph parsing neural networks,” arXiv combinatorial optimization with reinforcement learning,” arXiv
preprint arXiv:1808.07962, 2018. preprint arXiv:1611.09940, 2016.
[102] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: [128] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in
Deep learning on spatio-temporal graphs,” in CVPR 2016, 2016, NIPS 2015, 2015, pp. 2692–2700.
pp. 5308–5317. [129] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduc-
[103] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta, “Iterative visual tion. MIT press, 2018.
reasoning beyond convolutions,” arXiv preprint arXiv:1803.11189, [130] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song, “Learning
2018. combinatorial optimization algorithms over graphs,” in NIPS
[104] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing, 2017, 2017, pp. 6348–6358.
“Interpretable structure-evolving lstm,” in CVPR 2017, 2017, pp.
2175–2184.
[105] L. Landrieu and M. Simonovsky, “Large-scale point cloud se-
mantic segmentation with superpoint graphs,” arXiv preprint
arXiv:1711.09869, 2017.
[106] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M.
Solomon, “Dynamic graph cnn for learning on point clouds,”
arXiv preprint arXiv:1801.07829, 2018.
[107] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun, “3d graph neural
networks for rgbd semantic segmentation,” in CVPR 2017, 2017,
pp. 5199–5208.
[108] M. Zitnik, M. Agrawal, and J. Leskovec, “Modeling polyphar-
macy side effects with graph convolutional networks,” arXiv
preprint arXiv:1802.00543, 2018.
[109] S. Rhee, S. Seo, and S. Kim, “Hybrid approach of relation network
and localized graph convolutional filtering for breast cancer
subtype classification,” arXiv preprint arXiv:1711.05859, 2017.
[110] Z. Wang, Q. Lv, X. Lan, and Y. Zhang, “Cross-lingual knowledge
graph alignment via graph convolutional networks,” in EMNLP
2018, 2018, pp. 349–357.
[111] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna, “Revised note
on learning quadratic assignment with graph neural networks,”
in IEEE DSW 2018. IEEE, 2018, pp. 1–5.
[112] Z. Li, Q. Chen, and V. Koltun, “Combinatorial optimization
with graph convolutional networks and guided tree search,” in
NeurIPS 2018, 2018, pp. 537–546.
[113] W. Kool and M. Welling, “Attention solves your tsp,” arXiv
preprint arXiv:1803.08475, 2018.
[114] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann,
“Netgan: Generating graphs via random walks,” arXiv preprint
arXiv:1803.00816, 2018.
[115] T. Ma, J. Chen, and C. Xiao, “Constrained generation of semanti-
cally valid graphs via regularizing variational autoencoders,” in
NeurIPS 2018, 2018, pp. 7113–7124.
[116] J. You, B. Liu, R. Ying, V. Pande, and J. Leskovec, “Graph
convolutional policy network for goal-directed molecular graph
generation,” arXiv preprint arXiv:1806.02473, 2018.
[117] N. De Cao and T. Kipf, “Molgan: An implicit generative model
for small molecular graphs,” arXiv preprint arXiv:1805.11973,
2018.
[118] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, “Traffic graph con-
volutional recurrent neural network: A deep learning framework
for network-scale traffic learning and forecasting,” 2018.
[119] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolu-
tional matrix completion,” arXiv preprint arXiv:1706.02263, 2017.
[120] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and
J. Leskovec, “Graph convolutional neural networks for web-scale
recommender systems,” arXiv preprint arXiv:1806.01973, 2018.
[121] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec,
“Hierarchical graph representation learning with differentiable
pooling,” in NeurIPS 2018, 2018, pp. 4805–4815.
[122] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to
represent programs with graphs,” arXiv preprint arXiv:1711.00740,
2017.
[123] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet
large scale visual recognition challenge,” IJCV 2015, vol. 115,
no. 3, pp. 211–252, 2015.
[124] W. Norcliffe-Brown, S. Vafeias, and S. Parisot, “Learning condi-
tioned graph structures for interpretable visual question answer-
ing,” in NeurIPS 2018, 2018, pp. 8344–8353.
[125] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec, “Graphrnn:
Generating realistic graphs with deep auto-regressive models,” in
ICML 2018, 2018, pp. 5694–5703.
