Doctoral specialty:
Discipline: Computer Science
SEBA Hamida, Maître de conférences, HDR, Université Claude Bernard Lyon 1, thesis supervisor
Contents
1.1 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . 7
Graphs are data structures composed of a set of vertices and a set of edges, where an edge connects two vertices. A graph is an effective way of formalizing problems and representing objects. Graphs are used to represent complex and heterogeneous linked data in various domains: vertices represent objects and edges represent relations between these objects.
Graphs are very flexible: they allow new kinds of relationships and new vertices to be added to an existing structure without disturbing existing queries and application functionalities. This is why graphs are widely used in data modeling, especially for massive data. This is not new, however: the first data modeling tool was graph-based. The hierarchical data model, which first appeared in 1966 as an improvement over general file-processing systems, was the first data modeling tool to be created. Its main improvement was the possibility of creating relationships between pieces of information in a data model, which is ensured by graphs. The main characteristic of a hierarchical data model is its treelike structure: it consists of a collection of records connected to each other through links with a parent-child relationship. Figure 1.1 shows an example of a hierarchical data model.
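The parent-child structure described above can be sketched in a few lines. This is a minimal illustration of a hierarchical data model; the record names are invented for the example and are not taken from Figure 1.1.

```python
# A minimal sketch of a hierarchical (tree-structured) data model:
# each record has at most one parent and a list of children.

class Record:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Root-to-record path, as hierarchical queries traverse it."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

root = Record("company")
dept = Record("engineering", parent=root)
emp = Record("alice", parent=dept)
# emp.path() -> "company/engineering/alice"
```

Queries against such a model always follow the parent-child links, which is precisely the restriction that more general graph models later removed.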
incrementally matching all query vertices to candidate data vertices. The best-known algorithms for subgraph isomorphism search are Ullmann's algorithm [49], VF2 [41], GADDI [55], SPath [57], QuickSI [45], and GraphQL [25]. These algorithms use different join orders, pruning rules, and auxiliary information to prune out false-positive candidates as early as possible. However, none of these algorithms is designed to handle graphs of all types and all sizes. For example, QuickSI [45] is designed for handling small graphs, while GraphQL [25] and SPath [57] are designed for handling large graphs. Some of these methods show exponential time behavior, mainly because of the complexity of the subgraph isomorphism problem.
The proposed algorithms are mainly built around two basic tasks: filter and search (or filtering and verification). The more powerful the filtering is, the more powerful the algorithm that searches for subgraph isomorphism. Filtering is a way to reduce the search space by eliminating irrelevant vertices. However, filtering can be costly, so it must be effective without being time consuming. Since reducing the search space is very important, a simplified representation of graphs is very useful. This is the main focus of our work.
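The filtering task described above can be sketched very simply: a data vertex survives as a candidate for a query vertex only if it carries the same label and has at least the same degree. This is a generic illustration (graphs as adjacency dicts), not the filter of any specific cited algorithm.

```python
# Minimal label-and-degree filter: for each query vertex u, keep only
# data vertices v with the same label and degree(v) >= degree(u).
# Graphs are dicts mapping a vertex to the set of its neighbors.

def filter_candidates(query, q_labels, data, d_labels):
    cands = {}
    for u in query:
        cands[u] = {v for v in data
                    if d_labels[v] == q_labels[u]
                    and len(data[v]) >= len(query[u])}
    return cands

query = {0: {1}, 1: {0}}
q_labels = {0: "a", 1: "b"}
data = {0: {1, 2}, 1: {0}, 2: {0}}
d_labels = {0: "a", 1: "b", 2: "b"}
# filter_candidates(...) -> {0: {0}, 1: {1, 2}}
```

The verification (search) phase then only has to explore combinations of the surviving candidates, which is why a stronger filter directly shrinks the search space.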
• A compression of the whole graph: in this case, our data graph is summarized and we perform subgraph isomorphism search on the compressed version of the graph. The main idea of graph summarizing approaches is to find a short representation of the input graph, in the form of a compressed graph. Summarizing a graph can be very useful:
In both cases, the aim is to be able to deal with massive data.
Contents
2.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 VF2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.4 GADDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.5 QuickSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.6 Turbo-iso . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.7 CFL-match . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
– V (G1) ⊆ V (G2)
Definition 4. Partial Subgraph. For two graphs G1 and G2, where G1 = (V (G1), E(G1), ℓ1, Σ) and G2 = (V (G2), E(G2), ℓ2, Σ), G1 is a partial subgraph of G2, denoted G1 ⊆p G2, if V (G1) ⊆ V (G2) and E(G1) ⊆ E(G2); G1 need not contain all the edges of G2 having both of their ends in V (G1).
As a data structure, graphs are increasingly used to model data and complex objects. They make it possible to convey as much information as possible, to ensure an efficient representation of complex objects, and to support a relevant comparison between two objects. Various real applications, such as social networks and protein interactions, therefore use graphs as a representation model. Graphs can also represent complex relationships, such as the organization of entities in images, which can be used to identify objects and scenes.
In many cases, the success of an application based on a graph representation of data depends directly on the efficiency of the underlying graph query processing. Graph query processing leads directly to one of the most popular problems in graph theory: graph and subgraph matching. Graph matching consists in finding the correspondence between the vertices of two graphs that provides the best alignment of their structures. Generally, graph matching methods can be divided into two broad categories according to their results: exact and inexact matching. In other words, exact graph matching returns graphs or subgraphs that match a given graph exactly, whereas inexact matching returns a ranked list of the most similar matches.
Exact graph matching approaches aim to find out whether an exact mapping between the vertices and the edges of the compared graphs is possible. This requires a strict correspondence between the two objects being matched, or at least between subparts of them.
Graph Isomorphism is a variant of exact graph matching defined as follows:
2. ∀(u, v) ∈ E(G1): (h(u), h(v)) ∈ E(G2) and ℓ1((u, v)) = ℓ2((h(u), h(v)))
3. ∀(h(u), h(v)) ∈ E(G2): (u, v) ∈ E(G1) and ℓ2((h(u), h(v))) = ℓ1((u, v))
Exact matching also includes other forms, such as maximum common subgraph, monomorphism, and homomorphism.
• The maximum common subgraph problem consists in finding the largest part of two graphs that is identical in terms of structure; this part is referred to as the maximum common subgraph.
Definition 6. Given two graphs G = (V (G), E(G), ℓG, Σ) and Q = (V (Q), E(Q), ℓQ, Σ), Q is subgraph isomorphic to G if there is an injective function f : V (Q) → V (G) such that:
2. ∀(u, v) ∈ E(Q), (f (u), f (v)) ∈ E(G) and ℓQ((u, v)) = ℓG((f (u), f (v))).
Figure 2.3 depicts an example where the data graph contains two occurrences
of the query graph.
The basic solution for enumerating the occurrences of Q in G is to directly compare the vertices of the query with the vertices of the data graph. This comparison constructs a search tree in which each internal node maps a vertex of the query to a vertex of the data graph. Each path from the root to a leaf of the search tree represents either an unsuccessful mapping between the query and a subgraph, which has been dropped by the algorithm, or a successful one that corresponds to a subgraph isomorphic to the query. Figure 2.4 presents a part of the recursion tree obtained for the query graph and the data graph of Figure 2.3.
Exploring this recursion tree is the main task of subgraph isomorphism search algorithms. Several existing algorithms use backtracking to explore the search tree, but they do not explore the whole tree: they use filtering methods that prune unpromising branches. Nevertheless, even with pruning functions, this method raises two main challenges:
• It is memory consuming: besides storing the data graph, which can have a significant size, exploring the search tree has a high memory consumption and involves complex data structures to support backtracking.
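The backtracking exploration of the recursion tree can be sketched as follows. This is a generic illustration of the search-tree traversal, not any of the cited algorithms: a partial mapping is extended one query vertex at a time and a branch is abandoned as soon as a label or edge constraint fails.

```python
# Generic backtracking subgraph isomorphism search.
# Graphs are dicts: vertex -> set of neighbors.

def subgraph_search(query, q_labels, data, d_labels):
    order = list(query)           # fixed matching order for simplicity
    results = []

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))   # complete embedding found
            return
        u = order[len(mapping)]
        for v in data:
            # injectivity and label check
            if v in mapping.values() or d_labels[v] != q_labels[u]:
                continue
            # every already-mapped neighbor of u must map to a neighbor of v
            if all(mapping[w] in data[v] for w in query[u] if w in mapping):
                mapping[u] = v
                extend(mapping)     # descend into the search tree
                del mapping[u]      # backtrack: prune this branch

    extend({})
    return results
```

Each recursive call corresponds to an internal node of the search tree; the `del mapping[u]` line is the backtracking step that abandons an unsuccessful path.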
algorithms that we found in the literature are TurboISO [23] and CFL-match [6]. In our contributions, we compare against these two algorithms to show the performance of our methods.
Several works also survey existing algorithms. For instance, in [33], the authors implement five algorithms, VF2 [41], QuickSI [45], GraphQL [25], GADDI [55], and SPath [57], in a common framework. They use a generic subgraph isomorphism algorithm: a backtracking algorithm that finds solutions by incrementally extending partial solutions or abandoning them. In this generic algorithm, the authors first select a group of candidate vertices. The algorithm then performs a recursive subroutine to find mapping pairs of vertices.
vertices, v ∈ C(u), that have a smaller degree than the query vertex, through all adjacent query vertices of u. If an adjacent vertex u′ is already in the list of embeddings Em(G), i.e., (u′, v′) ∈ Em(G), then it checks whether there is a corresponding edge (v, v′) in the data graph G.
The complexity of Ullmann's algorithm depends on the size of the graph. Suppose that the size of the data graph G is n and that of the query graph is m. The complexity of the algorithm in the best case is O(nm), but it can go up to O(m^n n^2) in the worst case. As a result, the processing time explodes exponentially, so the algorithm is very expensive.
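The heart of Ullmann's algorithm is the iterative refinement of a candidate matrix: candidate pair (u, v) survives only if every neighbor of query vertex u still has some candidate among the neighbors of data vertex v. A minimal sketch, using sets of candidates instead of a 0/1 matrix (an implementation simplification, not Ullmann's original data structure):

```python
# Ullmann-style refinement: M[u] is the set of data vertices still
# considered candidates for query vertex u. Repeat until fixed point.
# Graphs are dicts: vertex -> set of neighbors.

def refine(M, query, data):
    changed = True
    while changed:
        changed = False
        for u in query:
            for v in list(M[u]):
                # u's every neighbor must have a candidate adjacent to v
                if any(not (M[x] & data[v]) for x in query[u]):
                    M[u].discard(v)
                    changed = True
    return M
```

When a refinement pass empties some M[u], the current branch of the search can be abandoned immediately, which is what makes the refinement worth its cost.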
2.4.2 VF2
• Prune out any vertex v ∈ C(u) that is not connected to already matched data vertices.
• Prune out any vertex v ∈ C(u) such that |Cq ∩ adj(u)| > |Cg ∩ adj(v)|, where Cq is the set of adjacent and not-yet-matched query vertices connected to the set of matched query vertices Mq, and Cg is the set of adjacent and not-yet-matched data vertices connected to the set of matched data vertices Mg.
• Prune out any vertex v ∈ C(u) such that |adj(u) \ (Cq ∪ Mq)| > |adj(v) \ (Cg ∪ Mg)|.
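The second rule above is a lookahead: if the query vertex u touches more frontier vertices (Cq) than the data vertex v touches (Cg), the pair cannot lead to an embedding. A minimal sketch of that single check, with Cq and Cg recomputed from the matched sets for clarity (VF2 maintains them incrementally, which this sketch does not reproduce):

```python
# VF2-style lookahead check for a candidate pair (u, v).
# Mq / Mg are the sets of already matched query / data vertices.
# Graphs are dicts: vertex -> set of neighbors.

def lookahead_ok(u, v, query, data, Mq, Mg):
    # frontiers: unmatched vertices adjacent to the matched sets
    Cq = {w for m in Mq for w in query[m]} - Mq
    Cg = {w for m in Mg for w in data[m]} - Mg
    return len(query[u] & Cq) <= len(data[v] & Cg)
```

If the check fails, v is pruned from C(u) before any recursive call is made, cutting the branch at its root.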
In SPath [57], paths are used as comparison patterns: paths are matched in the matching phase instead of single vertices. SPath uses neighborhood signatures to minimize candidate sets. When processing the data graph, the algorithm computes a neighborhood signature for each vertex. The neighborhood signature NS(u) of a vertex u is computed as follows: NS(u) = {Sk(u) | k ≤ k0}, where Sk(u) is the k-distance set of u. Each element of Sk(u) is relative to a label l and denoted Sk^l(u); it contains the set of vertices vi such that d(u, vi) = k and ℓ(vi) = l.
On the other hand, neighborhood signatures are also computed for each vertex v of the query graph. A filtering mechanism is then used to minimize the matching candidate set C(v). To test whether a given vertex u in C(v) must be pruned, the authors compare NS(v) and NS(u) to see if there
2.4.4 GADDI
of two vertices v1 and v2 in the data graph. After finding a DS(G) for a given substructure P, a new NDS (Neighboring Discriminating Substructure) distance, denoted dNDS(G, v1, v2, P), is calculated for each pair of neighboring vertices as an index structure.
This distance is the number of matches of P in the intersecting subgraph Int(G, v1, v2). In the example of Figure 2.6, Length = 4 (Length is the upper bound of the shortest distance between a pair of vertices to be indexed), k = 3 (the k-neighboring), and P is a substructure. The distance between the two filled vertices is three because there are three matches in their intersecting subgraphs. Matches of the discriminative substructure in the intersecting subgraphs are marked by dashed lines.
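The NDS idea can be illustrated in a heavily simplified form: restrict the graph to the vertices shared by both k-neighborhoods, then count pattern occurrences there. In this sketch the substructure P is reduced to a single labeled edge, a stand-in for GADDI's actual discriminating substructures:

```python
# Simplified NDS-style distance: count labeled-edge matches inside the
# intersection of the k-neighborhoods of v1 and v2.
# Graphs are dicts: vertex -> set of neighbors.

def k_neighborhood(v, graph, k):
    frontier, seen = {v}, {v}
    for _ in range(k):
        frontier = {y for x in frontier for y in graph[x]} - seen
        seen |= frontier
    return seen

def nds_distance(v1, v2, graph, labels, k, edge_pattern):
    common = k_neighborhood(v1, graph, k) & k_neighborhood(v2, graph, k)
    la, lb = edge_pattern
    # count each undirected edge once (x < y) inside the intersection
    return sum(1 for x in common for y in graph[x]
               if y in common and x < y and {labels[x], labels[y]} == {la, lb})
```

During matching, two data vertices can only map to two query vertices if their NDS distances are compatible, which is what makes the index a pruning tool.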
The matching phase is processed as follows :
2.4.5 QuickSI
replace it with another one after using it against all features with the shared
prefix.
To control the index size, only frequent and discriminative trees are chosen. The frequency of a feature f is computed by frq(f) = |{g | f ⊆ g ∧ g ∈ D}| / |D|; f is frequent iff frq(f) ≥ δ, where δ ∈ [0, 1]. A discriminative measure dis(f) is also defined as dis(f) = |f.list| / |⋂{f′.list | f′ ⊂ f ∧ f′ ∈ I}|; f is discriminative iff dis(f) < 1 − γ, where γ ∈ [0, 1].
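The frequency measure above is straightforward to compute once a containment test is available. A minimal sketch, where graphs are modeled as sets of edge identifiers and containment is simple set inclusion (an assumption made for brevity; real feature containment is itself a subgraph test):

```python
# Feature frequency over a graph database D, per the formula above.
# `contains(g, f)` decides whether graph g contains feature f.

def frequency(f, D, contains):
    return sum(1 for g in D if contains(g, f)) / len(D)

def is_frequent(f, D, contains, delta):
    return frequency(f, D, contains) >= delta
```

Only features passing both the frequency and the discriminative thresholds enter the index, keeping it small while still pruning effectively.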
2.4.6 Turbo-iso
2.4.7 CFL-match
CFL-match [6] is the most recent method for subgraph matching search. First, the query graph is decomposed into three substructures; subgraph matching is then performed on each of them. The algorithm is based on a core-forest-leaf decomposition of the query graph. It generates a matching order that performs non-tree edge checks at earlier levels, with the aim of postponing Cartesian products. CFL-match [6] first uses a costly filtering method, the neighborhood label frequency filter, to ensure that a data vertex is a candidate. It then proposes another filter, the maximum neighbor-degree filter, to reduce the processing time of the first one; this filter can be verified in constant time for each candidate data vertex. In the core-forest decomposition, the authors use a spanning tree QT of the query Q in which the edges of Q that are not in the edge set of QT are called non-tree edges; the remaining edges are called tree edges. For each set of non-tree edges of a spanning tree of Q, the core-forest decomposition computes a small dense subgraph that contains the set of non-tree edges: the minimal connected subgraph. The subgraph composed of the non-tree edges of Q is called the core-structure of Q. For the remaining (tree) edges, the subgraph is called the forest-structure of Q and denoted T. After the core-forest decomposition comes the forest-leaf decomposition. Here, the
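The first step of the decomposition, building a spanning tree and separating tree edges from non-tree edges, can be sketched as follows. This is a simplified illustration (BFS spanning tree, undirected edges as frozensets), not the CFL-match implementation:

```python
# Build a BFS spanning tree QT of the query and split its edge set
# into tree edges and non-tree edges.
# Graphs are dicts: vertex -> set of neighbors.

from collections import deque

def tree_and_nontree_edges(query, root):
    parent = {root: None}
    queue = deque([root])
    tree = set()
    while queue:
        x = queue.popleft()
        for y in query[x]:
            if y not in parent:
                parent[y] = x
                tree.add(frozenset((x, y)))   # tree edge of QT
                queue.append(y)
    all_edges = {frozenset((x, y)) for x in query for y in query[x]}
    return tree, all_edges - tree
```

The non-tree edges returned here are exactly the edges whose endpoints end up in the core-structure of the query.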
Other methods, more general or using specific computing platforms, exist for subgraph isomorphism search. We survey some of them in the following. The Nauty algorithm of Brendan McKay [38] detects isomorphism between untyped graphs that may be directed or undirected. Nauty uses transformations to reduce graphs to a canonical form that can be checked relatively quickly for isomorphism. Specifically, the algorithm computes invariants for each vertex in a graph (e.g., degree and counts of adjacent vertices) that are used for candidate selection. Nauty partitions a graph into non-overlapping subsets such that the vertices in a particular subset share identical invariant values. Subsets having the same invariant values can then be compared across graphs. If all subsets are isomorphic between two graphs, then the two graphs must be isomorphic. Conversely, if two graphs contain subsets with differing invariants, there is no need to test isomorphism between the sets directly.
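The invariant-based partitioning described above can be sketched with a cheap invariant: a vertex's degree together with the sorted degrees of its neighbors. This is only an illustration of the idea, not Nauty's refinement procedure:

```python
# Partition vertices by a cheap invariant; graphs with differently
# sized partition cells cannot be isomorphic.
# Graphs are dicts: vertex -> set of neighbors.

def invariant_partition(graph):
    inv = {v: (len(graph[v]),
               tuple(sorted(len(graph[w]) for w in graph[v])))
           for v in graph}
    parts = {}
    for v, key in inv.items():
        parts.setdefault(key, set()).add(v)
    return parts

def may_be_isomorphic(g1, g2):
    p1, p2 = invariant_partition(g1), invariant_partition(g2)
    return ({k: len(s) for k, s in p1.items()}
            == {k: len(s) for k, s in p2.items()})
```

A negative answer is definitive (the graphs are not isomorphic); a positive answer only means the expensive canonical-form comparison is still required.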
The authors of [47] propose a subgraph matching algorithm for very large graphs deployed on a distributed memory store. They use the Trinity memory cloud, which provides a unified address space over a set of machines, as if the large graph were stored on a single machine.
The subgraph matching process of this method is shown in Figure 2.10. It requires three steps:
used, this function returns a set of STwigs that match the query.
TMODS searches for patterns bottom-up, finding sub-patterns first and then composing them into more complex higher-level patterns.
2.5 Analysis
We studied the above algorithms especially with respect to their pruning mechanisms. We distinguish three main pruning mechanisms that can be used separately or combined for more efficiency:
the data vertices that are not compatible with the query vertex using the k-neighborhood around u. VF2 [41] looks at the 2-hop neighborhood. SPath [57] uses the k-neighborhood by maintaining, for each vertex u, a structure that contains the labels of all vertices at distance at most k from u. SPath uses this encoding of the k-neighborhood to remove the data vertices whose k-neighborhood does not cover the k-neighborhood of any query vertex. By rewriting the query as a tree, QuickSI [45] and TurboISO [23] also use the k-neighborhood, with the particularity that the tree is rooted at a more pruning vertex. The tree representation of TurboISO is also more compact, as it aggregates similar vertices.
rank(u) = freq(G, ℓ(u)) / deg(u)    (2.1)
However, when the labels and the degrees are not discriminative, this ranking becomes ineffective, leading to visiting the whole search space.
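The ranking of Equation (2.1) can be sketched directly: a query vertex whose label is rare in the data graph relative to its degree gets a low rank and is matched first, so fewer candidate regions are opened. Names and the precomputed label-frequency table are illustrative:

```python
# Rank of Equation (2.1): frequency of the vertex's label in G
# divided by the vertex's degree. Lower rank = matched earlier.

def rank(u, query, q_labels, data_label_freq):
    return data_label_freq[q_labels[u]] / len(query[u])

def matching_order(query, q_labels, data_label_freq):
    return sorted(query, key=lambda u: rank(u, query, q_labels, data_label_freq))
```

With a rare label 'b' (frequency 2) on a degree-2 vertex, that vertex ranks far below high-frequency neighbors and becomes the start vertex.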
For the query graph, the matching order means that the query vertices are handled in an order that simplifies their matching. The most widely used order is equivalent to a tree traversal of the query vertices: a spanning tree of the query is constructed according to a ranking equivalent to the one given by Equation (2.1). In [26, 46, 56, 58], the root of the tree is the least popular vertex. This solution is also adopted by TurboISO [23]. However, TurboISO goes further by grouping the vertices that have the same labels and the same neighborhood. The resulting smaller tree is called a NEC tree.
TurboISO constructs candidate regions for the query Q in the data graph G by building, for each region, a BFS search tree TG from the root vertex us of the NEC tree of Q, so that each leaf is on a shortest path from us. Then, for the start vertex vs of each target candidate region, it identifies candidate data vertices for each query vertex by performing a depth-first search over TG starting from vs. TurboISO reduces the number of regions using the ranking function given by Equation 2.1. When exploring candidate regions, TurboISO also minimizes the number of enumerated partial solutions by ordering the NEC tree vertices by increasing size. Thus, paths involving fewer vertices are explored first, and the space is pruned if no isomorphism is possible. In [23], TurboISO is compared to the other approaches and its superiority in query processing is attested through extensive experiments.
2.6 Conclusion
In this chapter, we presented and discussed the state of the art in subgraph isomorphism search algorithms. We focused on the most recent and interesting methods, but we also reviewed general approaches as well as those dedicated to specific platforms. We then analysed these algorithms according to how they filter the search space.
In the next chapter, we introduce our first contribution, which solves the subgraph isomorphism search problem on compressed graphs in order to deal with large graphs on a simple commodity machine.
Contents
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1 Introduction
4. Noise elimination: real graph data is interspersed with many hidden or erroneous links and labels. Summarization is used to filter out noise and reveal patterns that exist in the data.
Definition 7. [12] Given a labeled graph G with V (G) partitioned into groups, i.e., V (G) = V1(G) ∪ V2(G) ∪ · · · ∪ Vk(G), such that:
1. Vi(G) ∩ Vj(G) = ∅, 1 ≤ i ≠ j ≤ k
• an edge (vi, vj) with label l exists in comp(G) if and only if there is an edge (u, u′) with label l between some vertex u ∈ Vi(G) and some other vertex u′ ∈ Vj(G).
Figure 3.1 illustrates the compression (graph (b)) of the graph given in (a).
Each vertex of the compressed graph is a group of vertices having the same
label. However, this compression does not retain all the structural information
available in the original graph. For example, the edge between the vertex labeled
b and the vertex labeled d in Figure 3.1(b) cannot inform us if this edge links
Figure 3.1: Graph compression with [12].
u3 to u11 or u3 to u2. This means that the algorithms that use the compressed graphs do not aim at exact solutions but approximate ones.
In [12], the authors propose an algorithm that finds all frequent subgraphs in a database of large graphs where the graphs are compressed according to Definition 7.
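The compression of Definition 7 can be sketched as follows: vertices are grouped by label, and a superedge between two groups records that at least one original edge crosses them. Grouping strictly by label and dropping intra-group loops are simplifications made for the example:

```python
# Label-grouping compression per Definition 7 (simplified sketch).
# Graphs are dicts: vertex -> set of neighbors; labels: vertex -> label.

def compress(graph, labels):
    groups = {}
    for v, l in labels.items():
        groups.setdefault(l, set()).add(v)
    # one superedge per pair of label groups joined by an original edge
    superedges = {frozenset((labels[u], labels[v]))
                  for u in graph for v in graph[u]
                  if labels[u] != labels[v]}
    return groups, superedges
```

As the text notes, this representation cannot say which original vertices a superedge actually connects, which is why algorithms on such compressed graphs yield approximate answers.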
There is also query-oriented compression, which preserves a kind or class of queries [3, 27, 10, 37, 50, 16]. Given a graph G and a class of queries Q, e.g., path queries, neighborhood queries, reachability queries, etc., the compression constructs a smaller graph G′ such that the results on G′ of all queries in the class Q are equivalent to their results on G. The compression function depends on the kind of queries. For example, the compression function is equivalent to the one given by Definition 7 for pattern queries, and it groups the vertices that have the same neighbors for reachability queries.
Grouping the vertices that have the same neighbors in a graph is a well-known concept in graph theory called modular decomposition. Modular decomposition was introduced by Gallai [17] to solve optimization problems. It is used to generate a tree representation of a graph that highlights groups of vertices having the same neighbors outside the group. These subsets of vertices are called modules.
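The defining property of a module, every outside vertex sees either all of it or none of it, translates into a direct membership test. A minimal sketch (a check of the property, not a modular decomposition algorithm):

```python
# Test whether `candidate` (a set of vertices) is a module of the graph:
# every vertex outside it must be adjacent to all of it or to none of it.
# Graphs are dicts: vertex -> set of neighbors.

def is_module(graph, candidate):
    for v in graph:
        if v in candidate:
            continue
        inside = graph[v] & candidate
        if inside and inside != candidate:
            return False        # v sees only part of the candidate set
    return True
```

Computing the full modular decomposition tree is more involved (linear-time algorithms exist [21]), but this predicate is the invariant every module in that tree satisfies.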
Modules are classified into three categories according to how the vertices are
connected inside the module:
Figure 3.2(a) presents a graph and its modules. This example, borrowed from [31] with a slight modification, will be used as a running example throughout this chapter.
In [31], authors use modular decomposition to compress large graphs. They
define a similarity measure between graphs using the obtained compressed
graphs. They compress graphs by recursively compacting modules as illustrated
in Figure 3.2. To obtain a unique representation of the graph only the modules
that do not overlap with other modules are compacted.
To retain all the properties of the original graph with the obtained compact
representation of the graph, adjacency information for neighborhood modules
must be stored. Series and parallel modules need no information about adjacency.
[Figure 3.2: (a) the running-example graph with vertices a to m; (b)-(f) its successive compressions, where the modules S(b, c), P(l, m), N(d, e, f, g), S(i, j), and P(S(i, j), k) are recursively compacted.]
For example, the compressed graph illustrated in Figure 3.2(f) is a neighborhood module that can be denoted:
N(a, S(b, c), N(d, e, f, g), h, a, P(S(i, j), k), P(l, m)).
For this module, we retain the edges between the supervertices to keep the adjacency information, which gives the final compressed graph. We also retain the edges that bind the vertices of the neighborhood module N(d, e, f, g).
This compression method can achieve high compression rates, as illustrated in Figure 3.3, which presents a protein interaction graph and its compression obtained by modular decomposition.
A triangle listing algorithm on graphs compressed by modular decomposition is also proposed in [30]. In [43], the authors use a compression unifying Definition 7 and the concept of modules. They compact the vertices that have the same labels by distinguishing two kinds of vertex groups: those that are completely connected and form a clique, i.e., a series module, and those that are not connected at all, i.e., a parallel module.
In our framework, we also rely on modular decomposition to compress graphs, mainly because it is a more general compression that encompasses the compressions used so far.
To compress graphs, we use modular decomposition, a well-studied concept in graph theory [21]. To our knowledge, we are the first to use modular decomposition to reduce the search space in subgraph isomorphism search algorithms. As we will demonstrate, the benefits are threefold:
Query processing takes as input the compressed versions C(Q) and C(G) of Q and G, respectively, and reports all the embeddings of Q in G. To avoid ambiguity, we will use the terms supervertex and module, denoted m, for a node of the compressed graph; the terms vertex and leaf will denote a node of the original graph. The algorithm operates in two phases: a candidate supervertex selection phase and a subgraph search phase. During the first phase, the compressed data graph is parsed to retain only the regions of the graph that are likely to contain the query; this selection uses only the labels of the modules. During the second phase, a backtracking-like algorithm is used in each region to verify the embedding. In the following, we detail both phases and show how we can find all the embeddings by parsing the compressed data graph.
The aim of this step is to determine the modules (supervertices) that are likely to match the query. This step minimizes the number of vertices of the data graph to be processed. To do so, we explore the modules of C(G) to get all those that contain at least one of the labels of the query. Let Cand denote the obtained result with:
After that, the set of candidate modules is partitioned into several subsets, each of which is a candidate for a single embedding. Each subset contains the minimum number of modules that satisfy all the labels of the query. Subgraph search is then invoked on each of these subsets. This step is illustrated in Figure 3.6 and its detailed actions are given by Algorithm 1.
Note that at this step we have an unordered set of candidates, selected solely on labels; no structural verification against the query has been done. So, at the end of this step, we do not know whether there is a subgraph of G that matches the query. The aim of the next step is to aggregate the candidate supervertices in order to verify whether the structure of the query is preserved within them.
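The label-only selection of the first phase can be sketched in one comprehension: a module survives if its label set intersects the query's labels. Modeling module labels as plain sets is a simplification for the example:

```python
# Candidate supervertex selection: keep the modules of C(G) whose
# label set shares at least one label with the query.

def candidate_modules(module_labels, query_labels):
    return {m for m, labs in module_labels.items()
            if labs & query_labels}

module_labels = {"m1": {"a", "b"}, "m2": {"c"}, "m3": {"d"}}
# candidate_modules(module_labels, {"a", "d"}) -> {"m1", "m3"}
```

Because only labels are consulted, this phase is cheap; the structural check is deferred entirely to the subgraph search phase.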
The subgraph search phase is illustrated in Figure 3.7. Its detailed actions are
given by Algorithm 2.
This step takes as input a query C(Q) and a set s = {m1, m2, · · · , mj} of modules that are likely to contain an embedding of the query. It returns all the embeddings of the query in these modules. An embedding is represented by a set P of pairs (u, v), where u is a query vertex and v is the data vertex that matches u. For each vertex u in C(Q), SubgraphSearch first finds the set of candidate vertices Cu from the vertices of the modules of the set s. A vertex v of the data graph matches u if it has the same label as u and all the neighbors of u are matched to neighbors of v. This is verified by a call to the function IsJoinable (detailed in Algorithm 3). Given two vertices u (from the query) and v (from the data graph) to be matched, IsJoinable returns TRUE if the neighbors of vertex u are matched to neighbors of vertex v in the match P. To have the list
Figure 3.9: Tree representation of Modules [31].
Algorithm 3: Verify that two vertices to be matched have the same adjacency (IsJoinable).
Data: Two vertices u and v to be matched.
Result: True if the vertices have the same adjacency.
begin
return (∀u′ ∈ Neighbors(u), if u′ is matched to v′ then v′ ∈ Neighbors(v));
end
of queries. We also compared it with the most efficient state-of-the-art algorithm, TurboISO, presented in [23]. We recall that TurboISO is itself compared to the other existing solutions in [23] and shown to be superior to them.
We first describe the datasets used in the experiments, then we present our results.
3.4.1 Datasets
• NASA database: This dataset contains 36,790 trees with an average size of
32, and 117,302 unique labels.
interaction network. This graph has 4,675 vertices and 86,282 edges. The
number of unique labels in the dataset is 90.
• Orkut: a free on-line social network with more than 3 million members
and more than 117 million friendship connections. This network is provided by
The Online Social Networks Research Project [1].
• WebGoogle: this is the Google web graph. Vertices represent web pages and
edges represent hyperlinks between them. It was released in 2002 by
Google as part of the Google Programming Contest.
Table 4.2 summarises the characteristics of the nine datasets. Besides the
average number of vertices and edges of the graphs in each dataset, we also give the
average compression rate of each dataset. Given a graph G and its compressed
graph C(G), the compression rate of G is given by: CR(G) = (|E(C(G))| / |E(G)|) · 100%.
It compares the number of edges in C(G) with respect to G. We also provide the time
necessary to compress each dataset.
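The compression rate above is a simple ratio; a tiny Python sketch (illustrative only, assuming the two edge counts are already known) makes the convention explicit:

```python
def compression_rate(edges_original, edges_compressed):
    """CR(G) = |E(C(G))| / |E(G)| * 100, in percent.
    A lower value means the compressed graph keeps fewer edges."""
    return 100.0 * edges_compressed / edges_original

# e.g. a graph with 1000 edges whose compressed form has 610 edges
# has a compression rate of 61%
```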
• AIDS and NASA query sets: For each of these datasets, the authors of [23]
constructed 6 query sets (Q4, Q8, Q12, Q16, Q20, Q24), each of which
contains 1,000 query graphs of the same size. Additionally, each query Qi
is contained in a query Qi+1 . Each query is a subgraph of a graph in the
dataset.
• Human query sets: For this dataset, the authors of [23] generated three
kinds of queries:
1. Subgraph queries, as for the AIDS and NASA datasets. In this case, we
have 10 query sets obtained by varying the query size
from 1 to 10.
3. Path queries, where the query subgraph is a path. A path query
corresponds to transcriptional or signaling pathways [26].
For the large datasets, i.e., WebGoogle, Wiki-Talk, Kopec, Patent Citation, Live-
Journal and Orkut, we constructed, for each graph, 5 query sets (Q100, Q200,
Q300, Q400, Q500). Each set Qi contains 100 query graphs of the same size i.
The time performance reported in the results is the average time computed
over the sets of queries of the same size.
3.4.2 Results
Figure 3.11 shows the experimental results for AIDS. We can clearly see that the
running time of TurboISO decreases when the query size increases. This is
explained in [23] by the containment relationship among the query sets in AIDS.
We can also observe the same behavior with SumISO, which performs better than
TurboISO. In our case, this can be explained by the high compression rate of
AIDS, which yields a small number of candidates to be considered.
Figure 3.12 shows the experimental results for NASA. For this dataset,
SumISO performs significantly better than TurboISO for all the queries.
Figure 3.13 shows the results of subgraph queries over the Human dataset.
The superiority of SumISO over TurboISO is clearly observable as soon as the
query size is greater than 8.
Figure 3.14 shows the results of subgraph isomorphism search for path
and clique queries over the Human dataset. For the clique queries, SumISO
significantly outperforms TurboISO. This is mainly due to the fact that a clique
is compressed to a single vertex in our approach. For path queries, we also
obtain better results than TurboISO, even if not significantly so. We explain this by the
3.4.3 Discussion
With such methods, which do not enforce an exact mapping between the query
graph and the data graph, we can achieve better time performance.
3.5 Conclusion
Search phase, and it may be possible to define some rules to prune the sets
of candidate supervertices selected for the matching step by relying on vertex
invariants, the matching order and/or the properties of the compression. It will
also be interesting to see whether it is feasible to run such an approach on a graph
database like Neo4j by designing and developing all the necessary database
operations, such as create, delete and insert, on the compressed dataset.
Contents
4.1 Motivation
4.4 Experiments
4.4.1 Datasets
4.1 Motivation
We recall that subgraph isomorphism search, also known as exact Subgraph
matching or Subgraph queries, is the problem of enumerating all the occurrences
of a query graph within a larger graph called the data graph. Figure 4.1 shows an
example of a query graph and a data graph. This example will be used throught
the chapter to illustrate the algorithms and concepts.
Most solutions to tackle this problem explore a search space
in the form of a recursion tree that maps the query vertices to the data graph
vertices. However, existing algorithms never construct the recursion tree entirely;
they use pruning methods to obtain a smaller search space. Filtering is fundamental
as it reduces the search space explored by the searching task. Existing algorithms
differ by the pruning power of the filtering mechanisms they implement, but
also by when these filters take place with respect to searching. Our analysis of
these two points of difference highlighted four weaknesses in the state-of-the-art
algorithms that we address within the proposed framework. These weaknesses
are as follows:
vertex u if mndG(v) < mndQ(u). As MND is not as powerful as NLF, the idea is to
apply it before applying NLF, as detailed in Algorithm 5 (see lines 2-3). However,
MND is not always effective, as we can see in the example depicted in Figure 4.2,
where only 3 vertices are pruned with the MND filter and consequently NLF
must be applied to each of the remaining vertices.
It is also worth noting that for some neighborhood configurations filtering
is useless and only the searching step is decisive. Let us consider the query and
data graphs depicted in Figure 4.3, where all the vertices have the same label
and the same degree, and let us assume that k = 1000. Clearly, in this case, the
1000 comparisons required by NLF for each query vertex and each data vertex
are needless. This does not mean that filtering is not necessary, but that its cost
must be reduced. Interestingly, using a less costly filtering with Ullmann's native
subgraph searching subroutine outperforms the state-of-the-art algorithms, as
shown by our experiments.
Weakness 2: Global filtering vs. local filtering. Depending on its scope,
filtering can be characterised as global or local. Local filtering designates the
Figure 4.2: MND Filter on the Running Example (pruned of the vertices that
do not match query labels).
filtering methods that reduce the number of candidate data vertices for a given
query vertex, i.e., reduce the size of C(ui), i = 1, ..., |VQ|, where C(ui) is the set of
vertices of the data graph that are candidates for the query vertex ui. Global
filtering designates the filtering methods that can be applied to the entire search
space, obtained by joining the above sets, i.e., C(u1) × C(u2) × · · · × C(u|VQ|). Our
study of existing algorithms shows that local pruning is predominant. Some
mechanisms allow global pruning, but they require extra passes over the data graph
to be effective. The matching order is such a mechanism. However, choosing
a robust matching order is a very difficult problem, mainly because the number
of all possible matching orders is exponential in the number of vertices, so it
is expensive to enumerate all of them. For example, TurboISO relies on vertex
ordering for pruning. However, to compute this order, it needs to compute for
each query vertex a selectivity criterion based on the frequency of its label in the
data graph.
To deal with this problem, we introduce the Iterative Local Global Filtering
mechanism (ILGF), a simple way to achieve global pruning relying on local
pruning filters.
Weakness 3: Late filtering. Our analysis of how filtering and searching are
undertaken with respect to each other in the state-of-the-art algorithms revealed
that most algorithms apply their filtering mechanisms during subgraph search.
In fact, little filtering, reduced mainly to label or degree filtering, is undertaken
prior to subgraph search. This means that the first cartesian products involved
in subgraph search are costly. To tackle this, CFL-match [6] applies the MND-
NLF filter prior to subgraph search. However, as we can see in Figure 4.4, the
amount of achieved pruning depends on the order in which vertices are
parsed. In our example, if v2 is processed before v16, the amount of pruning is
less than the one obtained with the reverse order. To compensate, existing
solutions rely on additional mechanisms and data structures during subgraph
search, such as the NEC tree in TurboISO [23] and the CPI in CFL-match [6], which both
use path-based ordering during subgraph search. However, the underlying
data structures are time and space exponential [6]. To avoid constructing and
maintaining such data structures, we propose to achieve filtering solely prior to
subgraph search. Our experiments show that this approach is as efficient as the
state-of-the-art algorithms.
Weakness 4: Lack of scalability. This drawback results directly from the three
above weaknesses. In fact, the lack of global filtering and the necessity to keep
the data graph in memory for several passes make these backtracking-based
solutions unsuitable for graphs that do not fit into main memory. We aim to
achieve a single parse of the data graph and reduce the search space as early as
possible.
So, our contributions are:
Figure 4.4: NLF filtering with two different vertex parsing orders.
• Our encoding mechanism has the advantage of adapting to all graph access
models: main memory, external memory and streams, by performing one
sequential pass over the disk file (or the stream of edges) of the input graph.
This avoids expensive random disk accesses if the graph does not fit into
main memory.
Symbol            Description
G = (V, E, ℓ, Σ)  undirected vertex- and edge-labeled graph
ℓ                 the labeling function
Σ                 the set of labels
V(G)              vertex set of the graph G
E(G)              edge set of the graph G
deg(v)            degree of vertex v in G
degS(v)           number of neighbors of v that have a label in S
G[X]              the subgraph of G induced by the set of vertices X
L(Q)              the set of unique labels in the query Q
cni(v)            compact neighborhood index of v
In our method, the high-level idea is to encode into a single integer the neighborhood
information that characterises a vertex. Matching two vertices then reduces to
a simple comparison between integers. Given a vertex u, the compact neighborhood
index of u, denoted cni(u), distills the whole structure that surrounds
the vertex into a single integer. It is the result of a bijective function applied
to the vertex's neighborhood information. This function ensures that
two given vertices u and v that have the same number of neighbors and the same
label will never have the same compact neighborhood index unless they are
isomorphic at one hop. Let x1, x2, x3, · · · , xk be the list of u's neighbors' labels.
The compact neighborhood index of u in the graph G is given by:
cni(u) = f(1, x1) + f(2, x1 + x2) + · · · + f(k, x1 + x2 + x3 + · · · + xk). So,

cni(u) = Σ_{j=1}^{k} f(j, x1 + ... + xj)

where the pairing function f is the binomial coefficient

f(q, p) = C(q + p − 1, q) = (q + p − 1)! / (q! (p − 1)!)

More generally, we define

gk(x1, x2, x3, · · · , xk) = Σ_{j=1}^{k} f(j, x1 + ... + xj)
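To make the encoding concrete, here is a small Python sketch of the computation (an illustrative version, not the thesis implementation); it assumes neighbor labels have already been mapped to positive integers, and takes them in sorted order so that a vertex's unordered neighborhood gets a canonical value:

```python
from math import comb

def f(q, p):
    """f(q, p) = C(q+p-1, q) = (q+p-1)! / (q! (p-1)!)."""
    return comb(q + p - 1, q)

def cni(neighbor_labels):
    """Compact neighborhood index from the numeric labels x1..xk of a
    vertex's neighbors: the sum of f(j, x1+...+xj) for j = 1..k."""
    xs = sorted(neighbor_labels)  # canonical order for an unordered neighborhood
    prefix, total = 0, 0
    for j, x in enumerate(xs, start=1):
        prefix += x
        total += f(j, prefix)
    return total
```

For example, a vertex with neighbor labels {1, 2} gets cni = f(1, 1) + f(2, 3) = 1 + 6 = 7, regardless of the order in which the neighbors are listed.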
To use this bijection on vertices' labels, we need to assign a unique integer to
each vertex label. This assignment can simply be achieved by numbering labels
starting from 1, or by using an associative array to store the query labels. We
use ord(ℓ(u)) to retrieve the integer associated with the label of vertex u. ord(ℓ(u))
returns 0 if vertex u has a label that does not belong to L(Q). This
systematically prunes the neighbors that do not verify the label filter and avoids
considering them in the computation of the CNI of a vertex. Figure 4.5 illustrates
the CNIs for our pruning example. In the computation of cni(v) and degL(Q)(v),
we do not consider the neighbors of the data vertex v that do not have a label in
L(Q). These vertices are illustrated in the figure with dotted lines. For example,
degL(Q)(v13) = 1 because v17 has a label that does not belong to the query.
For filtering, we rely on three filters: the label filter, the degree filter and
the CNI filter. The label and degree filters are the basis of all pruning methods.
The CNI filter is based on the above bijection. So, we verify candidates for query
vertices using the lemmas below.
Lemma 1 (Label filter). Given a query Q and a data graph G, a data vertex v ∈ V(G)
is not a candidate of u ∈ V(Q) if ℓ(v) ≠ ℓ(u).
Lemma 2 (Degree filter). Given a query Q and a data graph G, a data vertex
v ∈ V(G) is not a candidate of u ∈ V(Q) if degL(Q)(v) < degL(Q)(u).
Lemma 3 (CNI filter). Given a query Q and a data graph G, a data vertex v ∈ V(G)
that verifies the label and degree filters is not a candidate of u ∈ V(Q) if cni(v) <
cni(u).
Proof. By deduction from the following property of the binomial coefficient:
C(n, k) = C(n−1, k) + C(n−1, k−1) (Pascal's formula).
Lemma 6. ∀k > 0, if gk(x1, ..., xk) = gk(x1′, ..., xk′) then x1 + ... + xk = x1′ + ... + xk′.
Proof. We have f(k, x1 + ... + xk) ≤ gk(x1, ..., xk) = gk(x1′, ..., xk′) < f(k, x1′ + ... + xk′ + 1);
we then obtain f(k, x1 + ... + xk) < f(k, x1′ + ... + xk′ + 1). According to Lemma
4, p ↦ f(k, p) is strictly increasing, so the inequality x1 + ... + xk ≤ x1′ + ... + xk′ holds.
Similarly, we prove the inverse inequality. This proves that x1 + ... + xk = x1′ + ... + xk′.
For k ≥ 2, we assume that gk−1 is injective and we prove that gk is also injective.
Let (x1, ..., xk) and (x1′, ..., xk′) be such that gk(x1, ..., xk) = gk(x1′, ..., xk′). According to
Lemma 6, x1 + ... + xk = x1′ + ... + xk′. We also have, by definition of gk:

gk(x1, ..., xk) = gk−1(x1, ..., xk−1) + f(k, x1 + ... + xk)
gk(x1′, ..., xk′) = gk−1(x1′, ..., xk−1′) + f(k, x1′ + ... + xk′)

By subtracting side by side, we obtain gk−1(x1, ..., xk−1) = gk−1(x1′, ..., xk−1′), which,
by our induction hypothesis, gives (x1, ..., xk−1) = (x1′, ..., xk−1′). This implies that
xk = xk′.
Conclusion: gk is injective.
To show that gk is also surjective, we recall that f(k, x1 + ... + xk) ≤ gk(x1, ..., xk) <
f(k, x1 + ... + xk + 1). As p ↦ f(k, p) is a strictly increasing sequence, we deduce
that each n ∈ N has a preimage in N^k.
So, gk is a bijection from N^k to N, which proves Theorem 1.
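The injectivity of gk can also be checked exhaustively on small inputs; the following sketch (an illustrative test, not part of the thesis) enumerates all ordered tuples of positive labels up to a bound and verifies that no two distinct tuples receive the same value:

```python
from itertools import product
from math import comb

def g(xs):
    """g_k(x1,...,xk) = sum_j f(j, x1+...+xj) with f(q,p) = C(q+p-1, q)."""
    prefix, total = 0, 0
    for j, x in enumerate(xs, start=1):
        prefix += x
        total += comb(j + prefix - 1, j)
    return total

def is_injective(k, max_label):
    """Exhaustively check that g_k has no collisions on ordered tuples
    of labels drawn from 1..max_label."""
    seen = set()
    for xs in product(range(1, max_label + 1), repeat=k):
        v = g(xs)
        if v in seen:
            return False
        seen.add(v)
    return True
```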
The aim of the Iterative Local Global Filtering algorithm (ILGF) is to globally
reduce the search space using CNIs. It relies on the fact that cni(v) can be
easily updated after a local filtering, giving rise to new filtering opportunities.
Algorithm 6 details this iterative filtering process. To verify the CNI filter
on a candidate data vertex, the algorithm uses the cniMatch() subroutine, which
implements Lemma 3 and consequently allows verifying that a data vertex is
a candidate for a given query vertex according to the CNI filter. The ILGF
algorithm iteratively removes from G the vertices that do not match any query
vertex using the label, degree and CNI filters (see lines 5-7 of the
algorithm). Each time a vertex is removed by the filtering process, the degree
and CNI of its neighbors are updated (lines 8-10), giving rise to new filtering
Figure 4.5: CNIs of the Query graph and the Data graph.
Vertices (and the corresponding edges) in dotted lines are not considered in the
computation of degL(Q) (u) and cni(u).
Algorithm 6: ILGF.
Data: A data graph G and a query Q.
Result: A filtered version of G.
begin
    stopFilter ← FALSE;
    repeat
        cpt ← |V(G)|;
        foreach vertex v ∈ V(G) do
            if ∀u ∈ V(Q), ¬cniMatch(v, u) then
                remove v from V(G) and the corresponding edges from E(G);
                foreach x ∈ N(v) do
                    update cni(x);
                end
            else
                cpt ← cpt − 1;
            end
        end
        if cpt = 0 then
            stopFilter ← TRUE;
        end
    until stopFilter;
    foreach vertex u ∈ V(Q) do
        C(u) ← {v ∈ V(G) such that cniMatch(v, u)};
        if C(u) = ∅ then
            return (∅);
        end
    end
    M ← ∅;
    SubgraphSearch(M);
end
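The iterative fixpoint behind ILGF can be sketched as follows (a simplified Python illustration, not the thesis implementation; the filter test and the CNI update are passed in as callbacks `cni_match` and `update_cni`, which are hypothetical names):

```python
def ilgf(G, Q, cni_match, update_cni):
    """Iteratively delete data vertices that match no query vertex.
    G, Q: adjacency dicts {vertex: set of neighbors}. Each deletion
    updates the neighbors' CNIs, which may enable further pruning.
    Returns the candidate sets C(u), or None if some C(u) is empty."""
    removed_something = True
    while removed_something:
        removed_something = False
        for v in list(G):
            if not any(cni_match(v, u) for u in Q):
                for x in G[v]:          # detach v and refresh its neighbors
                    G[x].discard(v)
                    update_cni(x)
                del G[v]
                removed_something = True
    C = {u: {v for v in G if cni_match(v, u)} for u in Q}
    return None if any(not s for s in C.values()) else C
```

With a triangle query and a data graph consisting of a triangle plus a pendant path, a simple degree-based stand-in for cniMatch prunes the path vertices over two iterations: removing the end vertex lowers its neighbor's degree, which removes that neighbor in the next pass.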
Figure 4.6 illustrates the ILGF algorithm on our example. We can see in these
figures that, using our three filters, we have the following possible mappings
between data vertices and query vertices:
In fact, the first iteration of the ILGF algorithm finds out that vertices v1, v3, v5,
v7, v13, v14, v15, v16, v17, v19, v20 and v21 cannot be mapped to any query vertex
because:
• v1, v13, v15, v16, v19, v20 and v21 do not pass the degree filter.
After removing these vertices and updating the degree and CNI of their neighbors,
a new filtering iteration is triggered (see Figure 4.6 (b)). This second filtering
iteration reveals that vertices v2, v4, v8 and v18 can also be pruned. In fact, v2
and v4 do not pass the CNI filter, and v8 and v18 do not pass the degree filter.
The final filtered graph is illustrated in Figure 4.6 (b).
After filtering, the data graph contains only the vertices that are candidates
for query vertices, i.e., the vertices that map at one hop according to the CNI filter.
Subgraph search then verifies the mapping at k hops. Algorithm 8 implements
this step. It is the depth-first search subroutine of Ullmann's algorithm. It
lists the subgraphs of the filtered data graph that are isomorphic to the query
by verifying the adjacency relationships. This step also handles edge
labels, by discarding those that do not match the query labels. The subroutine
neighborCheck() ensures that a mapping (u, v) is added to the current partial
embedding M only if the already matched neighbors of u are mapped to neighbors of v.
Algorithm 8: SubgraphSearch.
Data: A partial embedding M.
Result: All embeddings of Q in G.
begin
    if |M| = |V(Q)| then
        Report M;
    else
        Choose a non-matched vertex u from V(Q);
        C(u) ← {non-matched v ∈ V(G) such that cniMatch(v, u)};
        foreach v ∈ C(u) do
            if neighborCheck(u, v, M) then
                M ← M ∪ {(u, v)};
                SubgraphSearch(M);
                Remove (u, v) from M;
            end
        end
    end
end
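Algorithm 8 is a classical backtracking enumeration. A compact Python sketch (illustrative only; the candidate sets C are assumed to have been precomputed by the filtering step) could look like:

```python
def subgraph_search(M, Q, G, C, out):
    """Enumerate all embeddings of the query Q in the filtered data
    graph G. M maps matched query vertices to data vertices; C maps
    each query vertex to its candidate data vertices; complete
    embeddings are appended to out."""
    if len(M) == len(Q):
        out.append(dict(M))
        return
    u = next(q for q in Q if q not in M)       # next unmatched query vertex
    used = set(M.values())
    for v in C[u]:
        if v in used:
            continue
        # neighborCheck: matched neighbors of u must map to neighbors of v
        if all(M[w] in G[v] for w in Q[u] if w in M):
            M[u] = v
            subgraph_search(M, Q, G, C, out)   # recurse, then backtrack
            del M[u]
```

On a triangle query against a triangle data graph, this enumerates all 6 embeddings (one per permutation of the three vertices).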
For large data graphs, we aim to keep in memory only as few vertices and edges as the
three filters allow. So, filtering begins while reading the data graph. For
this, we compute vertex degrees and CNIs incrementally during graph parsing.
Only a single pass over the graph is needed. This is important if we deal with a
graph stream or a sequential read of a graph from disk, i.e., a graph that does
not fit into main memory and that is loaded part by part. We keep in memory
only the vertices (and the corresponding edges) that verify the label, degree and
CNI filters. These are the vertices and edges that will be used during subgraph
search. As we parse the data graph, the label filter is straightforward. However,
the degree and the CNI can be used only when their values, computed incrementally,
are sufficient for pruning, and this depends on how the stream of edges
arrives. If edges are sorted, i.e., we access all the edges involving vertex i, then
all the edges involving vertex i + 1 and so on, the amount of pruning during the
parse will be larger than in the case where edges arrive randomly.
Algorithm 10 presents the filtering actions performed while reading the data
graph in the case where edges are sorted. In this case, the three filters can
be applied as the edges of a vertex are accessed, avoiding storing them. When
all the edges incident to the current vertex are available (see lines 14-20), we
can compute the CNI of the current data vertex and compare it with the CNIs
of the query vertices (see lines 21-25); if it matches none of them, the vertex and
all its edges are pruned. The filtered data graph, denoted GQ, obtained at the
end of the reading-filtering process, contains only the data vertices that are
candidates for query vertices.
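When the edge stream is sorted by source vertex, the per-vertex decision can be made the moment the stream moves on to the next vertex, so pruned vertices are never stored. A minimal sketch of this idea (illustrative; the hypothetical callback `keep_vertex` stands for the combined label/degree/CNI test):

```python
from itertools import groupby

def stream_filter(sorted_edges, keep_vertex):
    """One-pass filtering of an edge stream sorted by source vertex.
    sorted_edges: iterable of (v, w) pairs with all edges of a vertex
    contiguous. Once a vertex's edges are exhausted, its degree and
    CNI are final, so keep_vertex decides immediately whether to keep
    the vertex and its adjacency list."""
    kept = {}
    for v, grp in groupby(sorted_edges, key=lambda e: e[0]):
        neighbors = [w for _, w in grp]
        if keep_vertex(v, neighbors):
            kept[v] = neighbors
    return kept
```

For instance, with a degree-2 threshold as the filter, a star center survives while its leaves are dropped as soon as their (single) edge group has been read.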
4.4 Experiments
4.4.1 Datasets
1. Small graphs: these graphs are well-known datasets used by almost all existing
methods in their evaluation process. So, we mainly use them as comparative
datasets. The underlying graphs represent protein interaction networks
coming from three main organisms: human (HUMAN and HPRD datasets),
yeast (YEAST dataset) and fish (DANIO-RERIO dataset). The HUMAN and
DANIO-RERIO datasets are available in the RI database of biochemical
data1 [9]. The HPRD and YEAST datasets come from the work of [33] and [6].
• HPRD: This graph contains 37,081 edges and 9,460 vertices.
The number of unique labels in the dataset is 307.
• YEAST: This graph contains 12,519 edges, 3,112 vertices, and 71
distinct labels.
• DANIO-RERIO: This graph contains 51,464 edges and 5,720 vertices.
We used it with different numbers of labels (32, 64, 128 and 512) and
different label distributions.
To query the HUMAN, HPRD and YEAST datasets, we use the sets of
queries generated in [6]. Each query is a connected subgraph of the data
graph obtained using a random walk on the data graph. For HPRD and
YEAST, the authors of [6] provide 8 query sets, each containing 100 query
graphs of the same size. The 8 query sets are denoted 25s, 25n, 50s, 50n,
100s, 100n, 200s, and 200n, where is and in denote query sets with i
vertices and, respectively, average degree ≤ 3 (i.e., sparse) and > 3 (i.e., non-
sparse). For HUMAN, which is the smallest graph among the considered
datasets, the authors constructed smaller queries denoted 10s, 10n, 15s,
15n, 20s, 20n, 25s, and 25n.
2. Large graphs: In this category, we considered a real graph from the Stanford
Large Network Dataset Collection2, called LiveJournal. It is a graph
representing an on-line social network with almost 5 million members
and over 68 million friendship relations, i.e., edges. We used 200 distinct
labels and 4 sets of queries with 100k, 200k, 400k and 500k (with k = 10^3)
vertices. Each set contains 10 query graphs of the same size.
These graphs do not fit in the main memory of the computer used for the experiments.
2 http://snap.stanford.edu/
Table 4.2 summarises the characteristics of the datasets. For each graph, we
report the number of vertices, the number of edges, the number of unique labels
and the compression rate, which is the ratio between the number of edges of the
compressed graph and the number of edges of the original graph, using modular
decomposition of graphs as the compression tool [18, 21, 31]. Modular
decomposition compresses graphs by aggregating vertices that have the same
neighbors into one single vertex. The compression rate is used to show how
compressible the datasets are, and consequently how suitable they are
for a subgraph isomorphism search algorithm such as TurboISO, its boosted version
developed in [43], or SumISO [40]. For instance, we can see that the HUMAN
dataset is highly compressible, i.e., it has a compression rate of 61%.
4.4.2 Results
and 200n. In fact, the experiments undertaken in [6] on the same data graphs
and the same sets of queries report that TurboISO exceeds 5 hours of running time
on these large queries on almost all three datasets, on an Intel i5 3.20 GHz
CPU with 8GB of memory. We recall that we used the binaries provided by the
authors and consequently no modifications have been made to these algorithms.
According to our results, plotted in Figure 4.7, there is practically no difference
between the four algorithms on the HUMAN data graph, whatever the size of
the query and its sparsity. We note that this dataset is highly compressible and
is thus suitable for algorithms such as TurboISO and SumISO.
For YEAST and HPRD, we clearly see that CNI outperforms CFL-match and
SumISO, which behave almost similarly on all queries, and both perform better
than TurboISO, which obtains the worst time performance. This is due to our new
neighborhood encoding, which allows an easy global pruning step.
Against Existing Algorithms by Varying |Σ| within the small datasets: Figure
4.8 shows the average total processing time for each query graph on
DANIO-RERIO for the four algorithms, with various numbers of query labels
and two distributions (uniform and Gaussian) of these labels on vertices.
These graphs are queried by 2 sets of queries: sparse and non-sparse queries.
Each set contains 100 query graphs of the same size (128 vertices). We can see in
this figure that the worst results are obtained by TurboISO. This can be explained
by the complexity of its data structures when we list all the embeddings [6].
CFL-match and SumISO have very close results on sparse queries for all the considered
label numbers and with the two distributions. However, SumISO behaves
better with non-sparse queries, mainly because the corresponding graphs are more
likely to be compressible. CNI clearly outperforms the three other algorithms, which
confirms the importance of reducing the filtering cost.
Against Existing Algorithms by Varying |V(Q)| within the large dataset:
Figure 4.10 shows the average total processing time for each query graph on our
large dataset LIVEJOURNAL. The obtained results have the same pattern as
the results obtained on small graphs. However, the difference between the four
algorithms is less pronounced than with small graphs. This can be explained
by the fact that the small graphs, being denser, are more difficult instances for
subgraph isomorphism search.
Against Existing Algorithms by Varying |V(Q)| within the big datasets: It
was not possible to use CFL-match, TurboISO, and SumISO with big graphs. So, the
results concern only CNI. Figure 4.9 shows the total processing time of CNI on
the two big graphs. We can see that even with a query graph of 500,000
vertices we cannot perceive any exponential shape, which confirms the scalability
of the approach. This tendency is also confirmed when we vary the number
of vertices of the data graph in Figure 4.11. These results definitely confirm the
scalability of the proposed approach.
The compact neighborhood index can also be computed for the k-neighborhood
with k > 1, and can be extended to cover edge labels. The CNI of vertex v
featuring its neighborhood at k hops can be computed with the same formula:
cni_k(v) = Σ_{j=1}^{s} f(j, x1 + ... + xj), where f(q, p) = C(q+p−1, q) = (q+p−1)! / (q!(p−1)!),
s is the number of k-hop neighbors of v, i.e., the number of vertices of G that are
reachable from v in exactly k hops along a shortest path from v, and x1, · · · , xs are
the numeric labels of these vertices. For instance, the 2-hop neighborhood of the
query vertex u1 of our running example (see Figure 4.1) comprises vertices u4
and u5, and its CNI can be computed as: cni2(u1) = f(1, 3) + f(2, 4) = 7. The CNI at
k hops can be used to prune the data vertices that are not candidates for a query
vertex but that pass
Figure 4.7: Time performance on small datasets (varying |V (Q)|). Results are in
logscale.
104 Liris laboratory Chemseddine Nabti
4.5 cni(v) at (k > 1)-hops Neighborhood
Figure 4.8: Time performance on the small dataset DANIO-RERIO (varying |Σ|
and the label distribution).
(a) Twitter
(b) Friendster
Lemma 7 (k-hop Degree filter). Given a query Q and a data graph G, a data vertex
v ∈ V(G) is not a candidate of u ∈ V(Q) if deg^k_L(Q)(v) < deg^k_L(Q)(u), where
deg^k_L(Q)(u) is the number of vertices reachable from u in exactly k hops along a
shortest path from u that have a label in L(Q).
Lemma 8 (CNI_k filter). Given a query Q and a data graph G, a data vertex v ∈ V(G)
that verifies the CNI_k filter and the (k+1)-hop degree filter is not a candidate of
u ∈ V(Q) if cni_{k+1}(v) < cni_{k+1}(u).
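The quantities used by these k-hop filters can be obtained with a truncated breadth-first search; a short sketch (illustrative, not the thesis code) that returns the vertices at shortest-path distance exactly k, whose count gives the k-hop degree and whose labels feed cni_k:

```python
from collections import deque

def k_hop_neighbors(G, v, k):
    """Vertices of G at shortest-path distance exactly k from v.
    G is an adjacency dict {vertex: set of neighbors}."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == k:          # no need to expand beyond k hops
            continue
        for y in G[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return {x for x, d in dist.items() if d == k}
```

On a path 1–2–3–4, the set of vertices exactly 2 hops from vertex 1 is {3}, and exactly 3 hops is {4}.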
4.6 Conclusion
Subgraph isomorphism search is an NP-complete problem. This means a pro-
cessing time that grows with the size of the involved graphs. Pruning the search
space is the pilar of a scalable subgraph isomorphism search algorithm and
has been the main focus of proposed approaches since Ullmann’s first solution.
In our contribution, we proposed CNI, a simple subgraph isomorphism search
algorithm that relies on a compact representation of the neighborhood, called
Compact Neighborhood Index (CNI), to perform an early global pruning of the
search space. CNI distills topological information of each vertex into an integer.
This vertex encoding is easily updatable and can be used to prune globally the
search space using an iterative algorithm. Furthermore CNI does not require
that the entire data graph is loaded into main memory and can be used with a
graph stream. Our extensive experiments validate the efficiency of our approach.
As part of future work related to this second contribution, it will be interesting
to extend CNI to construct a graph index that allows handling a graph
database. For this issue, we plan to compute a vertex CNI that includes the
vertex label: cni(u) = Σ_{j=1}^{k} f(j, x1 + ... + xj), where the label of u is among the xi,
and then to compute a compact neighborhood index for the whole graph using the
same formula: cni(G) = Σ_{j=1}^{k} f(j, x1 + ... + xj), where each xi is the CNI of a
vertex of G. The resulting graph CNI can be used to index a graph in a database
of graphs defined on the same set of labels.
Contents
5.1 Conclusion
5.2 Perspectives
5.1 Conclusion
To conclude this manuscript, we present, in the following, a summary of the
work that we achieved during our thesis. The research perspectives that could
be considered following this work are also discussed.
In this thesis, we studied the problem of subgraph isomorphism search in
massive data. Subgraph isomorphism search is the main tool used for graph
querying. It is an NP-complete problem. Basically, this problem consists to
determine an equality between two graphs in terms of structure and labels. It
also finds a mapping between all the vertices and/or edges of the query graph
and the target graph while respecting the labeling functions. Graph querying
can be very usefull. For example, in chemistry, scientists usually aim to find
a small complex molecule in a big one during their tests. Such a problem can
be solved using subgraph isomorphism seach with a graph representation of
molecules.
As presented in the state of the art (see Chapter 2), many algorithms and
solutions have been proposed to solve subgraph isomorphism search efficiently. The
main problem is how to reduce the search space, to save memory space and
processing time.
A search space is generally a tree that the algorithm has to parse to search for
the query. The first and most used technique to browse the search space is
backtracking, first proposed by Ullmann [49]. The algorithms that followed rely on
Ullmann's solution and try to outperform it by further reducing the size of the
search space. This is done by filtering out unpromising vertices, those that cannot
answer the query, as soon as possible.
Many techniques were proposed to reduce the search space; some of them use
paths as comparison patterns. Instead of checking for isomorphism against all
vertices, the search is performed on a shorter list of candidates. A candidate
is a pattern that is more likely to answer the query. Some techniques
use score functions to determine whether a candidate is relevant or not. After
reducing the search space by returning a list of relevant candidates, a second
phase, called the verification phase, is performed on the final list of relevant
candidates to check for subgraph isomorphism.
single large data graph. This problem is more difficult than the first one, because
here we aim to find all the occurrences of the query in the data graph, instead of
checking for the existence of the query in each graph of a database.
After analyzing the state of the art, we presented our two contributions, which
globally aim to reduce the search space by compression.
In the first contribution, we compress the whole data graph. Indeed, a smaller
representation of the graph leads to a smaller search space, and hence to lower
time and memory complexity. Graph compression (or summarization) is a well-known
technique that is effective when dealing with massive data.
The best compression scheme is one that retains all the properties of the
original graph. We surveyed several approaches to graph compression, and the one
that meets this criterion is a concept from graph theory called modular
decomposition, which dates back to the work of Gallai in 1967 [17].
Modular decomposition of a graph consists in identifying sets of vertices that
have the same neighbors outside the set and are therefore indistinguishable from
the outside. These sets of vertices are called modules. Each module is
compressed into a single vertex whose type depends on how the vertices are
connected within the module. We showed that we can query these compressed graphs
without decompressing them. Our experiments show that the proposed approach
achieves good performance in both query processing time and storage space for
data graphs.
In our second contribution, we compress the neighborhood of each vertex. Here,
we focused on the best way to filter the search space and proposed a new
constant-time pruning mechanism. The main idea of this contribution is to avoid
comparing all of a vertex's neighbors when checking whether two vertices can
match. To do so, we gather all the information surrounding a vertex into one
simple integer.
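The idea of folding neighborhood information into one integer can be sketched as follows. This is not the exact CNI encoding of this thesis; the label alphabet and the 8-bit field width are illustrative assumptions. Each label gets one byte of the integer holding the number of neighbors carrying that label, and pruning reduces to a field-wise comparison:

```python
LABELS = ["C", "H", "N", "O"]  # hypothetical label alphabet

def cni(adj, labels, v):
    """Pack, for vertex v, the number of neighbors of each label into
    one integer, one 8-bit field per label (saturating at 255)."""
    code = 0
    for i, lab in enumerate(LABELS):
        count = sum(1 for u in adj[v] if labels[u] == lab)
        code |= min(count, 255) << (8 * i)
    return code

def may_match(query_code, data_code):
    """Necessary condition for a match: the data vertex must have at
    least as many neighbors of every label as the query vertex."""
    return all(((query_code >> (8 * i)) & 255) <= ((data_code >> (8 * i)) & 255)
               for i in range(len(LABELS)))

# Vertex 0 has one "H" neighbor and one "O" neighbor.
adj = {0: [1, 2], 1: [0], 2: [0]}
labels = {0: "C", 1: "H", 2: "O"}
code0 = cni(adj, labels, 0)
```

A query vertex requiring one "H" neighbor passes the filter against vertex 0; one requiring an "N" neighbor is pruned immediately, without ever enumerating vertex 0's neighbor list again.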
5.2 Perspectives
Our work can be extended along several axes:
• Another future direction, which would help in handling large graphs, is
to evaluate these representations on a parallel programming model such
as MapReduce.
• It would also be interesting to see whether such an approach can run on a
graph database like Neo4j, by designing and developing all the necessary
database operations, such as create, delete, and insert, on the compressed
dataset.
• In our second contribution, the idea was to propose an effective and very
tight filtering technique to filter out unpromising vertices as early and
as effectively as possible. Our filtering was based on the vertex's
neighborhood information.
Even if our method shows good performance, the issue is that we have to
compute the CNI for each vertex. It would be interesting to extend this
technique by constructing a subgraph neighborhood index instead of a
per-vertex one. The idea is to divide the target graph into candidate
subgraphs, where a candidate subgraph is a subgraph that is more likely to
match the query. Unlike our contribution, the idea is to compact all the
neighborhood information of a subgraph into one integer.
For example, if we have a target graph of 100 vertices divided into 10
candidate subgraphs, the index will contain 10 integers storing all the
necessary information, instead of 100 computed integers. The final comparison
will then be between the query, which will also have an integer regrouping all
its neighborhood information, and the 10 integers representing the target
graph. This would greatly reduce both the processing time and the search space.
If the query and the target graph are both large, the query will also be
divided into subgraphs, each one with its own CNI, and the target graph's
subgraphs will in this case be candidates for the query subgraphs instead of
being candidates for the whole query.
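Such a subgraph-level index could be built by aggregating the per-vertex integers of each candidate subgraph. The sketch below assumes the same hypothetical bit-field encoding as before (a handful of 8-bit fields per integer); both the field layout and the saturating addition are illustrative choices, not the thesis's definitive design:

```python
NUM_FIELDS = 4   # hypothetical: one 8-bit field per vertex label
FIELD_BITS = 8

def merge_codes(codes):
    """Combine per-vertex neighborhood integers into one subgraph-level
    integer by saturating field-wise addition, so a single integer
    summarizes a whole candidate subgraph."""
    total = [0] * NUM_FIELDS
    for c in codes:
        for i in range(NUM_FIELDS):
            total[i] = min(total[i] + ((c >> (FIELD_BITS * i)) & 0xFF), 0xFF)
    out = 0
    for i, t in enumerate(total):
        out |= t << (FIELD_BITS * i)
    return out

def dominates(data_code, query_code):
    """Index test: a candidate subgraph can possibly contain the query
    only if every field of its code is >= the query's field."""
    return all(((data_code >> (FIELD_BITS * i)) & 0xFF) >=
               ((query_code >> (FIELD_BITS * i)) & 0xFF)
               for i in range(NUM_FIELDS))
```

With 10 candidate subgraphs, matching the query against the index is 10 calls to `dominates`, each a constant number of shifts and comparisons, before any expensive verification is attempted.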
Another direction is to extend the CNI to build a graph index that can handle
a graph database. For this, we plan to compute a vertex CNI that includes the
vertex label. The resulting graph CNI can then be used to index a graph in a
database of graphs defined on the same set of labels.
Journal paper
[1] Chemseddine Nabti and Hamida Seba. Querying massive graph data: A
compress and search approach. Future Generation Computer Systems (FGCS),
74:63–75, September 2017. Elsevier.
Bibliography
[1] http://socialnetworks.mpi-sws.org/data-imc2007.html.
[2] Micah Adler and Michael Mitzenmacher. Towards compressing web graphs.
In Proceedings of the Data Compression Conference, DCC ’01, Washington,
DC, USA, 2001. IEEE Computer Society.
[6] Fei Bi, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. Efficient
subgraph matching by postponing cartesian products. In Proceedings of the
2016 International Conference on Management of Data, SIGMOD ’16, pages
1199–1214, New York, NY, USA, 2016. ACM.
[9] Vincenzo Bonnici, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha, and
Alfredo Ferro. A subgraph isomorphism algorithm and its application to
biochemical data. BMC Bioinformatics, 14(Suppl 7):S13, 2013.
[10] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on com-
pressed XML. In Proceedings of the 29th International Conference on Very
Large Data Bases - Volume 29, VLDB ’03, pages 141–152. VLDB Endowment,
2003.
[11] Christian Capelle, Michel Habib, and Fabien De Montgolfier. Graph decom-
positions and factorizing permutations. Discrete Mathematics & Theoretical
Computer Science - DMTCS, 5(1):55–70, 2002.
[12] Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng
Yan, and Jiawei Han. Mining graph patterns efficiently via randomized
summaries. Proc. VLDB Endow., 2(1):742–753, August 2009.
[13] Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. Thirty
Years of Graph Matching in Pattern Recognition. International Journal of
Pattern Recognition and Artificial Intelligence, 18:265–298, 2004.
[14] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. Per-
formance evaluation of the VF graph matching algorithm. In Image Analysis
and Processing, 1999. Proceedings. International Conference on, pages 1172–
1177. IEEE, 1999.
[15] Wenfei Fan and Jin-Peng Huai. Querying big data: Bridging theory and
practice. Journal of Computer Science and Technology, 29(5):849–869, 2014.
[16] Wenfei Fan, Jianzhong Li, Xin Wang, and Yinghui Wu. Query preserving
graph compression. In Proceedings of the 2012 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’12, pages 157–168, New York,
NY, USA, 2012. ACM.
[19] Michael R. Garey and David S. Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[22] Michel Habib, Fabien De Montgolfier, and Christophe Paul. A simple linear-
time modular decomposition algorithm for graphs. Scandinavian Workshop
on Algorithm Theory - SWAT, pages 187–198, 2004.
[23] Wook-Shin Han, Jinsoo Lee, and Jeong-Hoon Lee. Turboiso: Towards Ultra-
fast and Robust Subgraph Isomorphism Search in Large Graph Databases.
In Proceedings of the 2013 ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD ’13, pages 337–348, New York, NY, USA, 2013.
ACM.
[30] Sofiane Lagraa and Hamida Seba. An efficient exact algorithm for triangle
listing in large graphs. Data Mining and Knowledge Discovery, pages 1–20,
2016.
[31] Sofiane Lagraa, Hamida Seba, Riadh Khennoufa, Abir M’Baya, and Hama-
mache Kheddouci. A distance measure for large graphs based on prime
graphs. Pattern Recognition, 47(9):2993 – 3005, 2014.
[32] Jinsoo Lee, Wook-Shin Han, Romans Kasperovics, and Jeong-Hoon Lee.
An in-depth comparison of subgraph isomorphism algorithms in graph
databases. In Proceedings of the VLDB Endowment, volume 6, pages 133–144.
VLDB Endowment, 2012.
[35] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Predicting positive
and negative links in online social networks. In Proceedings of the 19th
International Conference on World Wide Web, WWW ’10, pages 641–650.
ACM, 2010.
[36] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time:
Densification laws, shrinking diameters and possible explanations. In Pro-
ceedings of the Eleventh ACM SIGKDD International Conference on Knowledge
Discovery in Data Mining, KDD ’05, pages 177–187, New York, NY, USA,
2005. ACM.
[37] Hossein Maserrat and Jian Pei. Neighbor query friendly compression of
social networks. In Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 533–
542, New York, NY, USA, 2010. ACM.
[38] Brendan D. McKay and Adolfo Piperno. Practical graph isomorphism, II.
CoRR, abs/1301.1493, 2013.
[39] Michael Hunger, Ryan Boyd, and William Lyon. RDBMS & graphs: SQL vs.
Cypher query languages. https://neo4j.com/blog/sql-vs-cypher-query-languages/,
2015.
[40] Chemseddine Nabti and Hamida Seba. Querying massive graph data: A
compress and search approach. Future Generation Computer Systems, 74:63
– 75, 2017.
[41] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A
(sub)graph isomorphism algorithm for matching large graphs. IEEE Trans.
Pattern Anal. Mach. Intell., 26(10):1367–1372, October 2004.
[43] Xuguang Ren and Junhu Wang. Exploiting vertex relationships in speeding
up subgraph isomorphism over large graphs. Proc. VLDB Endow., 8(5):617–
628, January 2015.
[44] Hamida Seba, Sofiane Lagraa, and Elsen Ronando. Comparison issues in
large graphs: State of the art and future directions. CoRR, abs/1502.07576,
2015.
[45] Haichuan Shang, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. Taming veri-
fication hardness: An efficient algorithm for testing subgraph isomorphism.
Proc. VLDB Endow., 1(1):364–375, August 2008.
[47] Zhao Sun, Hongzhi Wang, Haixun Wang, Bin Shao, and Jianzhong Li.
Efficient subgraph matching on billion node graphs. Proc. VLDB Endow.,
5(9):788–799, May 2012.
[50] Sebastiaan J. van Schaik and Oege de Moor. A memory efficient reachability
data structure through bit vector compression. In Proceedings of the 2011
ACM SIGMOD International Conference on Management of Data, SIGMOD
’11, pages 913–924, New York, NY, USA, 2011. ACM.
[51] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent
structure-based approach. In Proceedings of the 2004 ACM SIGMOD Inter-
national Conference on Management of Data, SIGMOD ’04, pages 335–346,
New York, NY, USA, 2004. ACM.
[52] Jaewon Yang and Jure Leskovec. Defining and evaluating network com-
munities based on ground-truth. Knowledge and Information Systems,
42(1):181–213, 2015.
[53] Yike Liu, Tara Safavi, Abhilash Dighe, and Danai Koutra. Graph
summarization: A survey. ACM Computing Surveys, 2017.
[55] Shijie Zhang, Shirong Li, and Jiong Yang. Gaddi: Distance index based
subgraph matching in biological networks. In Proceedings of the 12th Inter-
national Conference on Extending Database Technology: Advances in Database
Technology, EDBT ’09, pages 192–203, New York, NY, USA, 2009. ACM.
[57] Peixiang Zhao and Jiawei Han. On graph query optimization in large
networks. Proc. VLDB Endow., 3(1-2):340–351, September 2010.
[59] Peixiang Zhao, Jeffrey Xu Yu, and Philip S. Yu. Graph indexing: Tree + delta
graph. In Proceedings of the 33rd International Conference on Very Large Data
Bases, VLDB ’07, pages 938–949. VLDB Endowment, 2007.
[60] Xiang Zhao, Chuan Xiao, Xuemin Lin, and Wei Wang. Efficient graph
similarity joins with edit distance constraints. In IEEE 28th International
Conference on Data Engineering (ICDE), 2012.
[61] Gaoping Zhu, Xuemin Lin, Ke Zhu, Wenjie Zhang, and Jeffrey Xu Yu.
TreeSpan: Efficiently computing similarity all-matching. In Proceedings of
the 2012 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’12, pages 529–540, New York, NY, USA, 2012. ACM.
[62] Lei Zou, Lei Chen, Jeffrey Xu Yu, and Yansheng Lu. A novel spectral coding
in a large graph database. In Proceedings of the 11th International Conference
on Extending Database Technology: Advances in Database Technology, EDBT
’08, pages 181–192, New York, NY, USA, 2008. ACM.