Diffusion Maps for Textual Network Embedding

Xinyuan Zhang, Yitong Li, Dinghan Shen, Lawrence Carin


Duke University, Durham, NC 27707
Email: xy.zhang@duke.edu

Abstract
Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices. Rather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.

Figure 1: An illustration of our framework for textual network embedding (a structure embedding table E_s produces structure embeddings v_i^s; a word embedding table E_w produces text inputs x_i, which are combined with the diffusion matrices P^0, ..., P^{H-1}, weighted by λ_0, ..., λ_{H-1}, to produce text embeddings v_i^t).
Problem Definition

Definition 1. A textual information network is G = (V, E, T), where V = {v_i}_{i=1,...,N} is the set of vertices, E = {e_{i,j}}_{i,j=1}^{N} is the set of edges, and T = {t_i}_{i=1,...,N} is the set of texts associated with the vertices. Each edge e_{i,j} has a weight s_{i,j} representing the relationship between vertices v_i and v_j. If v_i and v_j are not linked, s_{i,j} = 0. If there exists an edge between v_i and v_j, s_{i,j} = 1 for an unweighted graph, and s_{i,j} > 0 for a weighted graph. A path is a sequence of edges that connects two vertices. The text of vertex v_i, t_i, is comprised of a word sequence <w_1, ..., w_{|t_i|}>.
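Definition 1 specifies the data structure the method operates on. Purely as an illustrative aside (not part of the poster), a textual information network could be stored as a weight dictionary for E and a word list per vertex for T, as in the small Python sketch below.

```python
from dataclasses import dataclass, field

@dataclass
class TextualNetwork:
    """G = (V, E, T): vertices, weighted edges, and one text per vertex (Definition 1)."""
    num_vertices: int
    edges: dict = field(default_factory=dict)   # (i, j) -> weight s_ij
    texts: dict = field(default_factory=dict)   # i -> word sequence <w_1, ..., w_|t_i|>

    def add_edge(self, i, j, weight=1.0):
        self.edges[(i, j)] = weight              # an absent key means s_ij = 0

# toy example: 4 vertices, unweighted edges
g = TextualNetwork(num_vertices=4, texts={0: ["diffusion", "maps"], 1: ["network", "embedding"]})
g.add_edge(0, 1); g.add_edge(0, 2); g.add_edge(0, 3); g.add_edge(2, 3)
```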
Definition 2. Let S ∈ R^{N×N} be the adjacency matrix of a graph, whose entry s_{i,j} ≥ 0 is the weight of edge e_{i,j}. The transition matrix P ∈ R^{N×N} is obtained by normalizing the rows of S to sum to one, with p_{i,j} representing the transition probability from vertex v_i to vertex v_j within one step. An h-step transition matrix can then be computed as the h-th power of P, i.e., P^h; its entry p^h_{i,j} refers to the transition probability from vertex v_i to vertex v_j within exactly h steps. Figure 2 gives an example of the smoothing effect of the diffusion graph. The example contains only four vertices, and the original graph only has the edges e_{1,2}, e_{1,3}, e_{3,4} and e_{1,4}, so indirect relationships between other vertex pairs, such as (v_2, v_4), are not considered. The diffusion graph smooths the whole graph at higher orders, so those indirect relationships can be captured: as seen in Figure 2(b), the fourth-order diffusion graph becomes fully connected. As the order goes to infinity, it corresponds to the convergence point of a random walk.

Figure 2: A simple example of the diffusion process in a directed graph. (a) The original graph, plotted as a directed graph because the outgoing edge weights are normalized. (b) The fourth-power diffusion graph.

Definition 3. A network embedding aims to learn a low-dimensional vector v_i ∈ R^d for each vertex v_i ∈ V, where d ≪ |V| is the dimension of the embedding. The embedding matrix V for the complete graph is the concatenation of {v_1, v_2, ..., v_N}. The distance between vertices on the graph and their context similarity should be preserved in the representation space.

Definition 4. The diffusion map of vertex v_i is u_i, the i-th row of the diffusion embedding matrix U, which maps vertices and their embeddings to the results of a diffusion process that begins at vertex v_i. U is computed by U = Σ_{h=0}^{H-1} λ_h P^h V, where λ_h is an importance coefficient that typically decreases as the value of h increases. The high-order proximity in the network is preserved in diffusion maps.
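As a concrete illustration of Definitions 2 and 4 (a minimal sketch, not the authors' implementation), the NumPy code below builds the transition matrix P from an adjacency matrix S and computes the diffusion map matrix U = Σ_h λ_h P^h V. The geometric decay λ_h = λ^h is an assumption chosen for illustration; the poster only states that λ_h typically decreases with h.

```python
import numpy as np

def transition_matrix(S):
    """Row-normalize the adjacency matrix S so that each row sums to one (Definition 2)."""
    row_sums = S.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # guard for vertices with no outgoing edges
    return S / row_sums

def diffusion_map(S, V, H=4, lam=0.5):
    """U = sum_{h=0}^{H-1} lambda_h P^h V (Definition 4), with lambda_h = lam**h assumed."""
    P = transition_matrix(S)
    N = S.shape[0]
    U = np.zeros_like(V, dtype=float)
    P_h = np.eye(N)                          # P^0
    for h in range(H):
        U += (lam ** h) * (P_h @ V)
        P_h = P_h @ P                        # advance to P^{h+1}
    return U

# toy usage on the four-vertex graph of Figure 2 (edges e_{1,2}, e_{1,3}, e_{1,4}, e_{3,4})
S = np.array([[0, 1, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
V = np.random.randn(4, 8)                    # structure embeddings (placeholder values)
print(diffusion_map(S, V).shape)             # (4, 8)
```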
Method

We employ a diffusion process to build long-distance semantic relatedness into the text embeddings, and to capture global structural information in the objective function. To incorporate both the structural and textual information of the network, we adopt two types of embeddings, v_i^s and v_i^t, for each vertex v_i. In this work, v_i is learned by an unsupervised approach, and it can be used directly as a feature vector of vertex v_i for various tasks.
Diffusion Process

Initially the network has only a few active vertices, due to sparsity. Through the diffusion process, information is delivered from active vertices to inactive ones by filling the information gaps between vertices; vertices may be connected by indirect, multi-step paths. We introduce the transition matrix P and its power series for the diffusion process.

Text Embedding

A word sequence t = <w_1, ..., w_{|t|}> is mapped into a set of d_t-dimensional real-valued vectors <w_1, ..., w_{|t|}> by looking up the word embedding matrix E_w ∈ R^{|w|×d_t}, which is randomly initialized and further learned during training; |w| is the vocabulary size of the dataset. A simple text representation x_i ∈ R^{d_t} of vertex v_i is obtained by taking the average of the word vectors. Although word order is not preserved in such a representation, [5] has shown that word-averaging models can perform surprisingly well and avoid over-fitting efficiently in many NLP tasks:

    x = (1/|t|) Σ_{i=1}^{|t|} w_i,    X = x_1 ⊕ x_2 ⊕ ... ⊕ x_N.    (1)

Alternatively, we can use a bi-directional LSTM, where the text input is represented by the mean of all hidden states:

    →h_i = LSTM(w_i, h_{i-1}),    ←h_i = LSTM(w_i, h_{i+1}),    (2)

    x = (1/|t|) Σ_{i=1}^{|t|} (→h_i ⊕ ←h_i),    X = x_1 ⊕ x_2 ⊕ ... ⊕ x_N.    (3)

Given the fixed-length vector of each text, the input texts can be represented by the matrix X ∈ R^{N×d_t}. However, in this text-representation matrix each embedding is completely independent, without leveraging the semantic relatedness indicated by the graph. To address this issue, we employ the diffusion convolutional operator [1] to measure the level of connectivity between any two texts in the network.

Let P* ∈ R^{N×H×N} be a tensor containing H hops of the power series of P, i.e., the concatenation of {P^0, P^1, ..., P^{H-1}}, and let V_t* ∈ R^{N×H×d} be the tensor version of the text embedding representation after the diffusion-convolutional operation. The activation V_t^{*(i,j,k)} for vertex i, hop j, and feature k is

    V_t^{*(i,j,k)} = f( W^{(j,k)} Σ_{n=1}^{N} P^{*(i,j,n)} X^{(n,k)} ),

or, expressed equivalently in tensor notation,

    V_t* = f(W ⊙ P* X),    (4)

where W ∈ R^{H×d} is the weight matrix, f is a nonlinear differentiable function, and ⊙ represents element-wise multiplication. This tensor representation considers all paths between two texts in the network and thus includes long-distance semantic relationships. With longer paths discounted more than shorter paths, the text embedding matrix V_t is given by

    V_t = Σ_{h=0}^{H-1} λ_h V_t^{*(:,h,:)}.    (5)

Through the diffusion process, the text representations, i.e., the rows of V_t, are not embedded independently. With the whole graph being smoothed, indirect relationships between texts that are not on the same edge can be considered when learning the embeddings.
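To make the diffusion-convolution step concrete, the following is a minimal NumPy sketch of equations (4) and (5) (not the authors' code), using average-of-word-vectors inputs X. The nonlinearity f = tanh and the coefficients λ_h = λ^h are illustrative assumptions, and the weight matrix W is drawn at random here rather than learned.

```python
import numpy as np

def power_series(P, H):
    """Stack {P^0, P^1, ..., P^{H-1}} into the tensor P* of shape (N, H, N)."""
    hops = [np.eye(P.shape[0])]
    for _ in range(1, H):
        hops.append(hops[-1] @ P)
    return np.stack(hops, axis=1)

def diffusion_conv_text_embedding(P, X, W, lam=0.5):
    """Equations (4)-(5): V_t* = f(W ⊙ P* X), then V_t = sum_h lambda_h V_t*[:, h, :]."""
    H = W.shape[0]
    P_star = power_series(P, H)                                   # (N, H, N)
    V_star = np.tanh(np.einsum('ihn,nk->ihk', P_star, X) * W)     # (N, H, d_t)
    lambdas = lam ** np.arange(H)                                  # assumed decay lambda_h = lam**h
    return np.einsum('h,ihk->ik', lambdas, V_star)                 # (N, d_t)

# toy usage
N, d_t, H = 4, 8, 4
P = np.full((N, N), 1.0 / N)        # placeholder transition matrix
X = np.random.randn(N, d_t)         # averaged word vectors for each vertex text
W = np.random.randn(H, d_t)         # would be learned by backpropagation in practice
print(diffusion_conv_text_embedding(P, X, W).shape)   # (4, 8)
```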
Objective Function

Given the set of edges E, the goal of DMTE is to maximize the following overall objective function:

    L = Σ_{e∈E} L(e) = Σ_{e∈E} α_tt L_tt(e) + α_ss L_ss(e) + α_st L_st(e) + α_ts L_ts(e).    (6)

The objective function consists of four parts, which measure both the structure and text embeddings. Each part measures the log-likelihood of generating v_i conditioned on v_j, where v_i and v_j are on the same directed edge:

    L_tt(e) = s_{i,j} log p(v_i^t | v_j^t) = s_{i,j} log [ exp(v_i^t · v_j^t) / Σ_{v_k^t ∈ V_t} exp(v_k^t · v_j^t) ],    (7)

    L_ss(e) = s_{i,j} log p(v_i^s | u_j^s) = s_{i,j} log [ exp(v_i^s · u_j^s) / Σ_{v_k^s ∈ V_s} exp(v_k^s · u_j^s) ],    (8)

    L_st(e) = s_{i,j} log p(v_i^s | v_j^t) = s_{i,j} log [ exp(v_i^s · v_j^t) / Σ_{v_k^s ∈ V_s} exp(v_k^s · v_j^t) ],    (9)

    L_ts(e) = s_{i,j} log p(v_i^t | u_j^s) = s_{i,j} log [ exp(v_i^t · u_j^s) / Σ_{v_k^t ∈ V_t} exp(v_k^t · u_j^s) ].    (10)

Note that p(· | u_j^s) computes the probability conditioned on the diffusion map of vertex v_j, and p(· | v_j^t) computes the probability conditioned on the text embedding of vertex v_j.
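As a rough illustration of how a single term of the objective, e.g. L_tt in equation (7), can be evaluated, the sketch below computes the softmax log-likelihood over all text embeddings with NumPy. This is only an assumed reference computation; objectives of this form are typically trained with stochastic gradient methods and negative sampling to avoid the full partition sum, which is not shown here.

```python
import numpy as np

def l_tt(V_t, i, j, s_ij=1.0):
    """L_tt(e) = s_ij * log p(v_i^t | v_j^t) for a directed edge e = (v_i, v_j), equation (7)."""
    scores = V_t @ V_t[j]                    # v_k^t . v_j^t for every vertex k
    scores -= scores.max()                   # shift for numerical stability
    log_prob = scores[i] - np.log(np.exp(scores).sum())
    return s_ij * log_prob

# toy usage: one directed edge from vertex 0 to vertex 1
V_t = np.random.randn(5, 16)
print(l_tt(V_t, i=0, j=1))
```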
Experiments

We evaluate the proposed method on the multi-label classification and link-prediction tasks.

• Given a pair of vertices, link prediction seeks to predict the existence of an unobserved edge using the trained representations.
• Multi-label classification seeks to classify each vertex into a set of labels, using the learned vertex representation as features.

Dataset

• DBLP is a citation network that consists of 60744 papers in 4 research areas: database, data mining, artificial intelligence, and computer vision. The network has 52890 edges indicating citation relationships between papers.
• Cora is a citation network that consists of 2277 machine-learning papers in 7 classes, with 5214 edges indicating citation relationships between papers.
• Zhihu is a Q&A-based community social network in China. In our experiments, 10000 active users are collected as vertices, with 43894 edges. The descriptions of the users' topics of interest are used as text information.

Results

Table 1: AUC scores for link prediction on Cora (top) and Zhihu (bottom).

% of edges             15%    25%    35%    45%    55%    65%    75%    85%    95%
DeepWalk               56.0   63.0   70.2   75.5   80.1   85.2   85.3   87.8   90.3
LINE                   55.0   58.6   66.4   73.0   77.6   82.8   85.6   88.4   89.3
node2vec               55.9   62.4   66.1   75.0   78.7   81.6   85.9   87.3   88.2
TADW                   86.6   88.2   90.2   90.8   90.0   93.0   91.0   93.4   92.7
TriDNR                 85.9   88.6   90.5   91.2   91.3   92.4   93.0   93.6   93.7
CENE                   72.1   86.5   84.6   88.1   89.4   89.2   93.9   95.0   95.9
CANE                   86.8   91.5   92.2   93.9   94.6   94.9   95.6   96.6   97.7
DMTE (w/o diffusion)   87.4   91.2   92.0   93.2   93.9   94.6   95.5   95.9   96.7
DMTE (text only)       82.6   84.0   85.7   87.3   89.1   91.1   92.0   92.9   94.2
DMTE (Bi-LSTM)         86.3   88.2   90.7   92.7   94.1   94.8   96.0   97.3   98.1
DMTE (WAvg)            91.3   93.1   93.7   95.0   96.0   97.1   97.4   98.2   98.8

% of edges             15%    25%    35%    45%    55%    65%    75%    85%    95%
DeepWalk               56.6   58.1   60.1   60.0   61.8   61.9   63.3   63.7   67.8
LINE                   52.3   55.9   59.9   60.9   64.3   66.0   67.7   69.3   71.1
node2vec               54.2   57.1   57.3   58.3   58.7   62.5   66.2   67.6   68.5
TADW                   52.3   54.2   55.6   57.3   60.8   62.4   65.2   63.8   69.0
TriDNR                 53.8   55.7   57.9   59.5   63.0   64.6   66.0   67.5   70.3
CENE                   56.2   57.4   60.3   63.0   66.3   66.0   70.2   69.8   73.8
CANE                   56.8   59.3   62.9   64.5   68.9   70.4   71.4   73.6   75.4
DMTE (w/o diffusion)   56.2   58.4   61.3   64.0   68.5   69.7   71.5   73.3   75.1
DMTE (text only)       55.9   57.2   58.8   61.6   65.3   67.6   69.5   71.0   74.1
DMTE (Bi-LSTM)         56.3   60.3   64.9   69.8   73.2   76.4   78.7   80.3   82.2
DMTE (WAvg)            58.4   63.2   67.5   71.6   74.0   76.7   78.5   79.8   81.5

Figure 3: Left: link-prediction results (AUC) w.r.t. the number of hops H (H = 1, ..., 6) at different training ratios. Right: F1-Macro scores for multi-label classification on DBLP at different label percentages. Compared methods include DeepWalk, LINE, TADW, TriDNR, CANE, and DMTE.

Table 2: Top-5 similar vertex search based on embeddings learned by DMTE.

Query: The K-D-B-Tree: A Search Structure For Large Multidimensional Dynamic Indexes.
1. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.
2. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries.
3. Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data.
4. Generalized Search Trees for Database Systems.
5. High Performance Clustering Based on the Similarity Join.
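For context on the link-prediction evaluation above (Table 1), the snippet below is a minimal, assumed sketch of how AUC is commonly computed from learned embeddings: held-out true edges are scored against sampled non-edges by the inner product of the endpoint embeddings. The inner-product scoring rule and the use of scikit-learn's roc_auc_score are illustrative choices, not details taken from the poster.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(emb, pos_edges, neg_edges):
    """AUC for link prediction: rank held-out edges against negative samples."""
    def score(edges):
        return np.array([emb[i] @ emb[j] for i, j in edges])   # inner-product edge score

    y_true = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    y_score = np.concatenate([score(pos_edges), score(neg_edges)])
    return roc_auc_score(y_true, y_score)

# toy usage
emb = np.random.randn(10, 16)                 # learned vertex embeddings
pos = [(0, 1), (2, 3)]                        # held-out true edges
neg = [(0, 9), (4, 7)]                        # sampled non-edges
print(link_prediction_auc(emb, pos, neg))
```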
Conclusions

• We propose DMTE, which integrates global structural information of the graph to capture the level of connectivity between any two texts, by applying a diffusion-convolutional operation on the text inputs.
• We design a new objective that preserves high-order proximity, by including a diffusion map in the conditional probability.
• Experimental results on the vertex-classification and link-prediction tasks show the superiority of the proposed approach.