Seminar 2009
Frequent Subgraph/Substructure Mining
Lei Shi
Department of Computer Science and Engineering
State University of New York at Buffalo
University at Buffalo, The State University of New York
Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary
Graphs are everywhere
Graph Mining Problems

Graph Pattern Mining
Frequent subgraph pattern mining

Pattern summarization
Optimal graph patterns
Graph patterns with constraints
Approximate graph patterns
Graph Classification
Graph clustering
Important node identification
Bridge and hub identification
Other Important Topics

Graph compression
Graph model
Social network analysis.
Subgraph pattern Mining

Frequent subgraph
A (sub)graph is frequent if its support (occurrence frequency) in a
given dataset is no less than a minimum support threshold
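The definition above can be made concrete with a small sketch. It counts support by brute force, trying every injective mapping of pattern vertices onto graph vertices, which is only workable for tiny graphs. The graph encoding (`make_graph`), the helper names, and the three-graph dataset are assumptions made for illustration and are not from the slides.

```python
from itertools import permutations

def make_graph(vertex_labels, edges):
    """Labeled undirected graph; labels are assumed to be non-None strings."""
    return {
        "labels": dict(vertex_labels),
        "edges": {frozenset((u, v)): lab for u, v, lab in edges},
    }

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a labeled (non-induced) subgraph."""
    p_nodes = list(pattern["labels"])
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(graph["labels"], len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(pattern["labels"][p] != graph["labels"][mapping[p]] for p in p_nodes):
            continue
        if all(
            graph["edges"].get(frozenset(mapping[x] for x in edge)) == lab
            for edge, lab in pattern["edges"].items()
        ):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_sup):
    """Frequent means support is no less than the minimum support threshold."""
    return support(dataset, pattern) >= min_sup

# Hypothetical toy dataset: three small labeled graphs.
dataset = [
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1, "x"), (1, 2, "x")]),
    make_graph({0: "A", 1: "B"}, [(0, 1, "x")]),
    make_graph({0: "A", 1: "C"}, [(0, 1, "x")]),
]
pattern = make_graph({0: "A", 1: "B"}, [(0, 1, "x")])

print(support(dataset, pattern))         # 2
print(is_frequent(dataset, pattern, 2))  # True
```

Here support is an absolute count; dividing by `len(dataset)` would give the relative frequency instead.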
Application of subgraph pattern mining
Mining biochemical structures

Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering, compression,
comparison and correlation analysis.
Frequent Subgraph Example
[Figure: three example graphs (1), (2), (3) with vertex labels A, B, C, shown alongside a subgraph and its support.]
Key Challenges in Subgraph Mining
Graph isomorphism
to detect if two graphs are identical in structure
Graph representation (Canonical Labeling)
A canonical label is a unique code of a given graph.

Canonical label should be the same no matter how graphs are
represented, as long as graphs have the same topological
structure and the same labeling of edges and vertices.
Subgraph candidate generation
generate candidate frequent subgraphs from datasets
Subgraph Mining Approaches

Apriori-based
AGM/AcGM: Inokuchi, et al. (PKDD00)

FSG: Kuramochi and Karypis (ICDM01)
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In
ICDM01, pages 313-320, Nov. 2001
PATH#: Vanetik and Gudes (ICDM02, ICDM04)
FFSM: Huan, et al. (ICDM03) and SPIN: Huan et al. (KDD04)
FTOSM: Horvath et al. (KDD06)
Pattern growth based

Subdue: Holder et al. (KDD94)
MoFa: Borgelt and Berthold (ICDM02)
gSpan: Yan and Han (ICDM02)
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In
ICDM02, page 721, Dec. 2002
Gaston: Nijssen and Kok (KDD04)
CMTreeMiner: Chi et al. (TKDE05)
LEAP: Yan et al. (SIGMOD08)
Outline
Introduction and Background
Summary
Apriori-based Approach
FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM01, Nov. 2001
Flattened Representation as Canonical Labeling
Apriori-based method to generate subgraph candidates
Graph Representation in FSG

Flattened Representation
[Figure: adjacency matrix of an example graph and its flattened code]
Graph Representation in FSG

Flattened Representation
Lexicographic order or dictionary order
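A minimal sketch of the idea: flatten the adjacency matrix under every vertex ordering and keep the lexicographically smallest string as the canonical label. The exact layout (vertex labels first, then the upper triangle column by column, "0" for a missing edge) and the choice of the minimum rather than the maximum are assumptions made here, not details confirmed by the slides. The brute-force search over all orderings is exponential and only meant to show the principle.

```python
from itertools import permutations

def flatten(vertex_labels, edge_labels, order):
    """Flatten the labeled adjacency matrix under one vertex ordering."""
    parts = [vertex_labels[v] for v in order]
    # Upper triangle, column by column; "0" marks a missing edge.
    for col in range(1, len(order)):
        for row in range(col):
            key = frozenset((order[row], order[col]))
            parts.append(edge_labels.get(key, "0"))
    return " ".join(parts)

def canonical_label(vertex_labels, edge_labels):
    """Smallest flattened code over all vertex orderings (dictionary order)."""
    return min(
        flatten(vertex_labels, edge_labels, order)
        for order in permutations(vertex_labels)
    )

# Path A - B - A, written down with two different vertex numberings.
g1 = ({0: "A", 1: "B", 2: "A"},
      {frozenset((0, 1)): "e", frozenset((1, 2)): "e"})
g2 = ({"x": "B", "y": "A", "z": "A"},
      {frozenset(("x", "y")): "e", frozenset(("x", "z")): "e"})
# Path A - A - B: same labels, different structure.
g3 = ({0: "A", 1: "B", 2: "A"},
      {frozenset((0, 2)): "e", frozenset((1, 2)): "e"})

print(canonical_label(*g1) == canonical_label(*g2))  # True
print(canonical_label(*g1) == canonical_label(*g3))  # False
```

Because the label is the same for every numbering of the same graph, comparing two labels replaces an explicit isomorphism test.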
Apriori-based method
Apriori Property
If a graph is frequent, all of its subgraphs are frequent.
Candidate Generation
Create a set of candidates of size k+1
- from two given frequent k-subgraphs
- that contain the same (k-1)-subgraph
- the join can result in several candidates of size k+1
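The join and the Apriori pruning can be sketched under a strong simplifying assumption: every vertex label is unique, so a subgraph is just a set of labeled edges and no isomorphism test is needed. With repeated labels, as in FSG, one pair of k-subgraphs can yield several candidates and canonical labels are required to compare them; this sketch does not cover that case. The function names and the toy input are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """Check that a non-empty set of (u, v) edges forms one connected graph."""
    edges = list(edges)
    if not edges:
        return False
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if (u in seen) != (v in seen):
                seen.update((u, v))
                changed = True
    return all(u in seen and v in seen for u, v in edges)

def generate_candidates(frequent_k):
    """Join frequent k-edge subgraphs that share k-1 edges, then prune."""
    frequent_k = set(frequent_k)
    k = len(next(iter(frequent_k)))
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        if len(a & b) != k - 1:
            continue
        cand = a | b
        if not is_connected(cand):
            continue
        # Apriori property: every connected k-edge subgraph must be frequent.
        subgraphs = (cand - {e} for e in cand)
        if all(s in frequent_k for s in subgraphs if is_connected(s)):
            candidates.add(cand)
    return candidates

# Hypothetical frequent 2-edge subgraphs over unique vertex labels A..D.
AB, BC, BD, CD = ("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")
frequent_2 = {
    frozenset({AB, BC}), frozenset({AB, BD}),
    frozenset({BC, BD}), frozenset({BC, CD}),
}
candidates_3 = generate_candidates(frequent_2)
print(len(candidates_3))  # 2
```

In this toy input the triangle B-C-D is pruned because its 2-edge subgraph {BD, CD} is not frequent.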
Candidate Generation Example
FlowChart
Experiment Result
- Chemical Compound Dataset, which contains 340 compounds and 24 different atoms (vertices)
Outline
Introduction
Summary
Motivation of gSpan
Weakness of Apriori-based approach
Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
Pruning false positives requires subgraph isomorphism tests; subgraph isomorphism is an NP-complete problem, so each test is costly.
gSpan: Graph-Based Substructure Pattern Mining
Change the way a graph is represented (DFS: Depth First Search)
Use pattern growth to generate new subgraph candidates.
gSpan: Graph-Based Substructure Pattern Mining

DFS (Depth First Search) Code
First Step: DFS the graph and use edges on the path to
represent the graph.
Second Step: DFS Lexicographic Order
Pattern Growth subgraph generation
DFS code
An edge is represented by a 5-tuple:
(i, j, l_i, l_(i,j), l_j)
Example: (0, 1, X, a, Y)
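A sketch of how one DFS code can be produced: traverse the graph depth-first, number vertices in discovery order, and emit one 5-tuple per edge. Backward edges (to an already discovered vertex) are emitted as soon as their source vertex is discovered, before any forward edge from it. The adjacency-list encoding and the example graph are assumptions for illustration. The result is one valid DFS code for the chosen start vertex and neighbor order, not necessarily the minimum one.

```python
def dfs_code(vertex_labels, adjacency, start):
    """Return one DFS code of a connected labeled graph, starting at `start`.

    adjacency maps a vertex to a list of (neighbor, edge_label) pairs;
    neighbors are explored in list order.
    """
    index = {start: 0}  # discovery index of each visited vertex
    code = []

    def visit(v, parent):
        # Backward edges: v to already discovered vertices other than its parent.
        back = [(index[w], lab, w) for w, lab in adjacency[v]
                if w in index and w != parent]
        for j, lab, w in sorted(back, key=lambda t: t[0]):
            code.append((index[v], j, vertex_labels[v], lab, vertex_labels[w]))
        # Forward edges: each one discovers a new vertex.
        for w, lab in adjacency[v]:
            if w not in index:
                index[w] = len(index)
                code.append((index[v], index[w],
                             vertex_labels[v], lab, vertex_labels[w]))
                visit(w, v)

    visit(start, None)
    return code

# Hypothetical graph: a triangle p-q-r with an extra vertex s attached to q.
labels = {"p": "X", "q": "Y", "r": "X", "s": "Z"}
adjacency = {
    "p": [("q", "a"), ("r", "a")],
    "q": [("p", "a"), ("r", "b"), ("s", "c")],
    "r": [("q", "b"), ("p", "a")],
    "s": [("q", "c")],
}
code = dfs_code(labels, adjacency, "p")
for edge in code:
    print(edge)
```

Starting from a different vertex or visiting neighbors in a different order gives a different code for the same graph, which is why a lexicographic order over codes is needed to pick a single canonical one.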
DFS code
Second Step: DFS Lexicographic Order
Pattern Growth Approach

Pattern Growth (free extension)

Duplicate Graphs

Free extension

Right most extension
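Rightmost extension restricts where a pattern may grow, which cuts down the duplicate graphs that free extension produces. A minimal sketch, assuming the pattern is given as a DFS code of 5-tuples: new backward edges may only leave the rightmost (last discovered) vertex and end on the rightmost path, and new forward edges may only start from a vertex on the rightmost path. The function names are hypothetical, labels for the new edges are left open, and the minimum-DFS-code check used to discard remaining duplicates is not shown.

```python
def rightmost_path(code):
    """Vertex indices on the path from the root (0) to the rightmost vertex."""
    parent = {}
    rightmost = 0
    for i, j, *_ in code:
        if i < j:  # forward edge: vertex j is discovered from vertex i
            parent[j] = i
            rightmost = j
    path = [rightmost]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def rightmost_extensions(code):
    """Return (backward, forward) lists of (i, j) index pairs that may be added."""
    path = rightmost_path(code)
    rm = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: from the rightmost vertex to another vertex on the rightmost path.
    backward = [(rm, v) for v in path[:-1]
                if frozenset((rm, v)) not in existing]
    # Forward: from any vertex on the rightmost path to a brand-new vertex.
    new_vertex = rm + 1
    forward = [(v, new_vertex) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code: triangle 0-1-2 plus vertex 3 attached to vertex 1.
example_code = [
    (0, 1, "X", "a", "Y"),
    (1, 2, "Y", "b", "X"),
    (2, 0, "X", "a", "X"),
    (1, 3, "Y", "c", "Z"),
]
print(rightmost_path(example_code))        # [0, 1, 3]
print(rightmost_extensions(example_code))  # ([(3, 0)], [(3, 4), (1, 4), (0, 4)])
```

Vertex 2 is not on the rightmost path, so no extension starts from it.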

Examples (cont.)
gSpan
gSpan

Experimental result using Chemical data
340 molecules
66 atom types and 4 bond types as labels
On average only 27 vertices and 28 edges per graph
Summary
Graph representation
Flattened representation vs. DFS code
Generation of Candidate Patterns

apriori vs. pattern growth
Pattern-Growth Approach
Frequent Graph Pattern

Given a graph dataset D, find every subgraph g such that
freq(g) ≥ min_sup
where freq(g) is the percentage of graphs in D that contain g
and min_sup is the minimum support threshold.
Problem 1 : Exponential Pattern Set

Problem 2 : Threshold Setting
Difference between frequent itemset and frequent subgraph discovery
Frequent itemset discovery
Subgraph Mining Algorithms
Apriori-based approach
Pattern growth approach

Framework of Subgraph Mining Algorithms
Search Order
breadth vs. depth
complete vs. incomplete
Generation of Candidate Patterns
apriori vs. pattern growth
Discovery Order of Patterns
DFS order
path → tree → graph
Elimination of Duplicate Subgraphs
passive vs. active
Support Calculation
embedding store or not
Frequent Subgraph
Examples:
Example (cont.)
