
Seminar 2009

Frequent Subgraph/Substructure Mining

Lei Shi
Department of Computer Science and
Engineering
State University of New York at Buffalo

University at Buffalo, The State University of New York

Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary


Graphs are everywhere


Graph Mining Problems


Graph Pattern Mining

Frequent subgraph pattern mining


Pattern summarization
Optimal graph patterns
Graph patterns with constraints
Approximate graph patterns

Graph Classification
Graph clustering
Important node identification
Bridge and hub identification

Other Important Topics


Graph compression
Graph model
Social network analysis.

Subgraph Pattern Mining


Frequent subgraph
A (sub)graph is frequent if its support (occurrence frequency) in a
given dataset is no less than a minimum support threshold
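
The definition above can be illustrated with a minimal sketch. It assumes that support is counted as the number of dataset graphs that contain the pattern, that graphs are simple and undirected, and that labels are non-None strings. The names `contains`, `support` and `is_frequent` are hypothetical helpers, and the containment test is a brute-force search that is only practical for tiny graphs.

```python
from itertools import permutations

def contains(graph, pattern):
    """Brute-force test: does `graph` contain `pattern` as a subgraph?

    A graph is a pair (vertex_labels, edges): vertex_labels maps a vertex id
    to its label, edges maps frozenset({u, v}) to an edge label.
    Assumes simple undirected graphs without self-loops.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[m[v]] for v in p_nodes):
            continue  # vertex labels must agree
        if all(g_edges.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_edges.items()):
            return True  # every pattern edge exists with the same label
    return False

def support(pattern, dataset):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(pattern, dataset, min_sup):
    """Frequent means: support is no less than the minimum support threshold."""
    return support(pattern, dataset) >= min_sup

# Toy dataset (illustrative only, not taken from the slides).
dataset = [
    ({1: "A", 2: "B", 3: "C"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "x"}),
    ({1: "A", 2: "B"}, {frozenset((1, 2)): "x"}),
    ({1: "A", 2: "C"}, {frozenset((1, 2)): "x"}),
]
pattern = ({"u": "A", "v": "B"}, {frozenset(("u", "v")): "x"})

print(support(pattern, dataset))         # 2
print(is_frequent(pattern, dataset, 2))  # True
```

The pattern A-x-B occurs in the first two graphs only, so it is frequent for a threshold of 2 but not for a threshold of 3.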

Application of subgraph pattern mining

Mining biochemical structures


Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering, compression,
comparison and correlation analysis.


Frequent Subgraph Example

[Figure: example graphs (1), (2) and (3) with vertex labels A, B and C, shown with a subgraph and its support (3)]

Key Challenges in Subgraph Mining

Graph isomorphism

to detect if two graphs are identical in structure

Graph representation (Canonical Labeling)

A canonical label is a unique code of a given graph.


Canonical label should be the same no matter how graphs are
represented, as long as graphs have the same topological
structure and the same labeling of edges and vertices.

Subgraph candidate generation

generate candidate frequent subgraphs from datasets


Subgraph Mining Approaches


Apriori-based

AGM/AcGM: Inokuchi, et al. (PKDD00)


FSG: Kuramochi and Karypis (ICDM01)
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In
ICDM01, pages 313-320, Nov. 2001
PATH#: Vanetik and Gudes (ICDM02, ICDM04)
FFSM: Huan, et al. (ICDM03) and SPIN: Huan et al. (KDD04)
FTOSM: Horvath et al. (KDD06)

Pattern growth based


Subdue: Holder et al. (KDD94)
MoFa: Borgelt and Berthold (ICDM02)
gSpan: Yan and Han (ICDM02)
Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure
Pattern Mining. In Proceedings of the 2002 IEEE International
Conference on Data Mining (ICDM02) (December 09-12, 2002).
ICDM. IEEE Computer Society, Washington, DC, 721
Gaston: Nijssen and Kok (KDD04)
CMTreeMiner: Chi et al. (TKDE05)
LEAP: Yan et al. (SIGMOD08)


Outline
Introduction and Background
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary


Apriori-based Approach

FSG: M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM01, Nov. 2001

Flattened representation as canonical labeling

Apriori-based method to generate subgraph candidates


Graph Representation in FSG


Flattened Representation

0e0 00e1 00


Graph Representation in FSG


Flattened Representation

Lexicographic order or dictionary order
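
One way to picture a flattened representation combined with lexicographic order is the sketch below. It is an illustration under stated assumptions, not FSG's actual procedure: the graph is flattened once per vertex ordering (vertex labels followed by the upper-triangle adjacency entries), and the lexicographically smallest string is taken as the canonical label. The helper names `flatten` and `canonical_label`, the choice of "smallest", and the exhaustive search over all orderings are assumptions made for clarity; labels are assumed to contain no spaces and "0" is reserved for "no edge".

```python
from itertools import permutations

def flatten(order, vertex_labels, edges):
    """Flatten the graph under one vertex ordering into a single string."""
    labels = [vertex_labels[v] for v in order]
    # Upper-triangle adjacency entries, column by column; "0" means no edge.
    entries = [edges.get(frozenset((order[i], order[j])), "0")
               for j in range(len(order)) for i in range(j)]
    return " ".join(labels) + " | " + " ".join(entries)

def canonical_label(vertex_labels, edges):
    """Smallest flattened string over all vertex orderings (exponential cost)."""
    return min(flatten(p, vertex_labels, edges)
               for p in permutations(vertex_labels))

# The same graph written down with two different vertex numberings.
g1 = ({1: "A", 2: "B", 3: "A"},
      {frozenset((1, 2)): "e0", frozenset((2, 3)): "e1"})
g2 = ({"x": "B", "y": "A", "z": "A"},
      {frozenset(("x", "z")): "e0", frozenset(("x", "y")): "e1"})
# A structurally identical graph with a different edge labeling.
g3 = ({1: "A", 2: "B", 3: "A"},
      {frozenset((1, 2)): "e0", frozenset((2, 3)): "e0"})

print(canonical_label(*g1) == canonical_label(*g2))  # True
print(canonical_label(*g1) == canonical_label(*g3))  # False
```

Because the minimum is taken over every ordering, the label does not depend on how the vertices happened to be numbered, which is the property a canonical label needs.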


Apriori-based method
Apriori Property

If a graph is frequent, all of its subgraphs are frequent.

Candidate Generation
Create a set of candidates of size k+1
- from two given frequent k-subgraphs
- containing the same (k-1)-subgraph
- which may result in several candidates of size k+1
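
The join and the Apriori pruning can be sketched under a strong simplifying assumption: every vertex label is unique, so a graph is just a set of edges and subgraph containment reduces to a subset test. Under that assumption each join yields a single candidate, whereas for general labeled graphs one pair can yield several. The function names are hypothetical and this is not the FSG implementation.

```python
from itertools import combinations

def is_connected(graph):
    """graph: frozenset of edges, each edge a frozenset of two vertex labels."""
    edges = list(graph)
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for e in edges:
            if e & seen and not e <= seen:
                seen |= e
                changed = True
    return all(e <= seen for e in edges)

def generate_candidates(frequent_k):
    """Join frequent k-edge graphs that share k-1 edges into (k+1)-edge candidates."""
    frequent_k = set(frequent_k)
    candidates = set()
    for g1, g2 in combinations(frequent_k, 2):
        cand = g1 | g2
        # Exactly one extra edge means the two graphs share k-1 edges.
        if len(cand) != len(g1) + 1 or not is_connected(cand):
            continue
        # Apriori pruning: every connected k-edge subgraph must be frequent.
        if all(cand - {e} in frequent_k
               for e in cand if is_connected(cand - {e})):
            candidates.add(cand)
    return candidates

AB, BC, CD, CA = (frozenset(p) for p in ("AB", "BC", "CD", "CA"))

frequent_1 = {frozenset([AB]), frozenset([BC]), frozenset([CD])}
candidates_2 = generate_candidates(frequent_1)
print(len(candidates_2))  # 2: {AB, BC} and {BC, CD}; {AB, CD} is disconnected

# The triangle A-B-C is pruned because its subgraph {AB, CA} is not frequent.
frequent_2 = {frozenset([AB, BC]), frozenset([BC, CA])}
print(generate_candidates(frequent_2))  # set()
```

Each surviving candidate would still need its support counted against the dataset before it is accepted as frequent.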


Apriori-based method
Candidate Generation Example


Apriori-based method
Flowchart


Apriori-based method
Experimental Result
- Chemical compound dataset, which contains 340 compounds and 24 different atoms (vertices)


Outline
Introduction
Apriori-based Subgraph Mining
Pattern Growth Subgraph Mining
Summary


Motivation of gSpan

Weakness of Apriori-based approach

Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
Pruning false positives is costly: subgraph isomorphism is an NP-complete problem.

gSpan: Graph-Based Substructure Pattern Mining

Change the way a graph is represented (DFS code; DFS: Depth-First Search)
Use pattern growth to generate new subgraph candidates.


gSpan: Graph-Based Substructure Pattern Mining


DFS (Depth First Search) Code
First Step: traverse the graph depth-first and use the edges, in traversal
order, to represent the graph.
Second Step: DFS Lexicographic Order

Pattern Growth subgraph generation


DFS code

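An edge is represented by a 5-tuple:

(i, j, l_i, l_(i,j), l_j)
i, j: DFS indices of the two endpoints; l_i, l_j: their vertex labels; l_(i,j): the edge label
Example: (0, 1, X, a, Y)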
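
As a rough sketch of how such 5-tuples arise, the code below walks a small connected graph depth-first and writes one tuple per edge. It assumes a simple undirected graph with integer vertex ids, and it visits neighbours in id order, so it produces one valid DFS code for that traversal rather than the minimum DFS code that gSpan uses as the canonical form. The function name `dfs_code` is hypothetical.

```python
def dfs_code(vertex_labels, edges, start):
    """Return one DFS code (list of 5-tuples) for a connected, simple graph.

    vertex_labels: dict vertex id -> label
    edges: dict (u, v) -> edge label, each undirected edge listed once
    """
    adj = {v: {} for v in vertex_labels}
    for (u, v), lab in edges.items():
        adj[u][v] = lab
        adj[v][u] = lab

    index = {start: 0}  # DFS discovery index of each vertex
    emitted = set()     # edges already written to the code
    code = []

    def emit(u, v):
        emitted.add(frozenset((u, v)))
        code.append((index[u], index[v],
                     vertex_labels[u], adj[u][v], vertex_labels[v]))

    def visit(u):
        # Backward edges from u to earlier-discovered vertices come first.
        back = [w for w in adj[u]
                if w in index and frozenset((u, w)) not in emitted]
        for w in sorted(back, key=index.get):
            emit(u, w)
        # Then grow forward edges to vertices not discovered yet.
        for w in sorted(adj[u]):
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(start)
    return code

# Toy triangle (illustrative only): labels X, Y, X and edge labels a, b, a.
labels = {0: "X", 1: "Y", 2: "X"}
edges = {(0, 1): "a", (1, 2): "b", (2, 0): "a"}
for t in dfs_code(labels, edges, start=0):
    print(t)
# (0, 1, 'X', 'a', 'Y')
# (1, 2, 'Y', 'b', 'X')
# (2, 0, 'X', 'a', 'X')
```

The first two tuples are forward edges (i < j) and the last is a backward edge (i > j) that closes the cycle.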


DFS code
Second Step: DFS Lexicographic Order


Pattern Growth Approach


Pattern Growth (free extension)


Pattern Growth Approach


Duplicate Graphs


Pattern Growth Approach


Free extension


Pattern Growth Approach


Rightmost extension


Pattern Growth Approach


Examples (cont.)


gSpan


gSpan


Pattern Growth Approach


Experimental result using chemical data
340 molecules
66 atom types and 4 bond types as labels
On average only 27 vertices with 28 edges


Summary
Graph representation
Flattened representation vs. DFS code

Generation of Candidate Patterns


apriori vs. pattern growth


Pattern-Growth Approach


Frequent Graph Pattern


Given a graph dataset D, find subgraph g, s.t.

freq(g) ≥ min_sup

where freq(g) is the percentage of graphs in D that contain g,
and min_sup is the minimum support threshold.

Problem 1 : Exponential Pattern Set


Problem 2 : Threshold Setting


Difference between frequent itemset and frequent subgraph discovery


Frequent itemset discovery


Subgraph Mining Algorithms

Apriori-based approach
AGM/AcGM: Inokuchi, et al. (PKDD00)
FSG: Kuramochi and Karypis (ICDM01)
PATH#: Vanetik and Gudes (ICDM02, ICDM04)
FFSM: Huan, et al. (ICDM03) and SPIN: Huan et al. (KDD04)
FTOSM: Horvath et al. (KDD06)

Pattern growth approach


Subdue: Holder et al. (KDD94)
MoFa: Borgelt and Berthold (ICDM02)
gSpan: Yan and Han (ICDM02)
Gaston: Nijssen and Kok (KDD04)
CMTreeMiner: Chi et al. (TKDE05)
LEAP: Yan et al. (SIGMOD08)


Framework of Subgraph Mining Algorithms

Search Order
breadth vs. depth
complete vs. incomplete
Generation of Candidate Patterns
apriori vs. pattern growth
Discovery Order of Patterns
DFS order
path → tree → graph
Elimination of Duplicate Subgraphs
passive vs. active
Support Calculation
store embeddings or not


Frequent Subgraph
Examples:


Example (cont.)



Outline
Introduction and Background

Apriori-based Subgraph Mining


Pattern Growth Subgraph Mining
Summary

DFS code
Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure
Pattern Mining. In Proceedings of the 2002 IEEE International
Conference on Data Mining (ICDM02) (December 09-12, 2002).
ICDM. IEEE Computer Society, Washington, DC, 721


Pattern Growth Approach

