
(IJCNS) International Journal of Computer and Network Security, 143

Vol. 2, No. 5, May 2010

Graph Matching Measure for Document Retrieval


Amina Ghardallou Lasmar^1, Anis Kricha^2 and Najoua Essoukri Ben Amara^3

^1 Higher Institute of Technological Studies of Sousse,
BP 135, Sousse Erriadh, 4023 Sousse, Tunisia
Aminaghardallou@yahoo.fr

^2 National Engineering School of Monastir,
Avenue Ibn ElJazzar, 5019 Monastir, Tunisia
Anis.Kricha@topnet.tn

^3 National Engineering School of Sousse,
Avenue 18 janvier Babjdid, Sousse 4000, Tunisia
Najoua.Benamara@eniso.rnu.tn

Abstract: In this paper, we propose an algorithm for graph matching in a pattern recognition context. The graphs to be matched are Attributed Relational Graphs (ARGs) that represent document structure: ARG nodes represent the regions composing the structure, and edges define the relations between them. The graph matching algorithm compares document structures by finding cliques in an association graph. The originality of the algorithm lies in the way the association graph is defined, which allows us to define a similarity coefficient. This coefficient measures the degree of resemblance between the two structures being compared under the best matching.

Keywords: graph matching, structural representation, graph retrieval.

1. Introduction

In document retrieval applications, it is essential to define some description of the document based on a set of features. These descriptions are then used to search and determine which documents satisfy the query selection criteria. The effectiveness of a document retrieval system ultimately depends on the type of representation used to describe a document. In pattern recognition, document representations can be broadly divided into statistical and structural methods [6]. In the former, the document is represented by a feature vector; in the latter, a data structure (e.g. a graph or a tree) is used to describe objects and their relationships in the document. In recent decades, many structural approaches have been developed that deal especially with graph-based representations of graphic documents. These approaches serve different purposes, such as segmentation [5], [7], document retrieval by content [10] and pattern recognition [2], [6].

ARGs have become a common representation for documents owing to their representational power. In this representation, the nodes represent the entities to be matched and the edges define the relations between them. If the structural descriptions of document images are represented by graphs, different documents can be matched by performing some kind of graph matching. Graph matching is the process of finding a correspondence between the nodes and edges of two graphs that satisfies some constraints, ensuring that similar substructures of one graph are mapped to substructures of the other. Graph matching has two drawbacks in computer vision. The first is its computational complexity, which can be mitigated by using approximate algorithms. The second is its sensitivity to noise and distortion. Encoding an entity from a document as an ARG may not be perfect, due to noise and errors introduced in low-level stages. In such situations, graphs are distorted by different attribute values, missing or added vertices or edges, etc. This makes exact graph matching useless in many applications. The matching must therefore incorporate an error model that accounts for the distortions in the graphs. A matching between two graphs that involves an error model is referred to as inexact graph matching. Several techniques have been put forward to solve the graph matching problem. The main approaches are summarized in Table 1, with the most representative references in the field of computer vision.

Table 1: Comparison between classical graph matching approaches

Approach                Exact matching   Inexact matching   Optimal   Key references
Backtrack tree search   Yes              No                 Yes       [17], [1]
Decision tree           Yes              No                 Yes       [11], [15]
Graph edition           Yes              Yes                Yes       [10], [3], [12], [14]
Association graph       Yes              No                 Yes       [13]

In this paper, we address the problem of indexing historical Arabic documents, which aims to represent the structure of a document by a simple model and allows further processing such as document or object retrieval. We therefore propose an ARG-based representation of the physical structure of the document. A graph matching method based on clique finding in an association graph is then developed to compare document structures. Results are ranked according to coefficients that convey the degree of similarity between a query document and the database documents.

The remainder of the paper is organized as follows. In the next section, we propose the graph matching algorithm and introduce a similarity coefficient between two graphs. Experiments are discussed in Section 3. The final part presents the conclusion and future work.

2. Proposed approach

Even though we are not interested in the segmentation step, we should consider the errors introduced by this step, since we work with ancient documents, which are heavily degraded. Thus, we adapt an exact graph matching algorithm to our application to overcome the distortion and errors introduced in preceding stages. The approach is based on finding a maximal clique in a modified association graph (AG). First, we define a graph-based structure representation: the document structure is organized in an ARG whose nodes represent the regions composing the document and whose edges define the spatial relations between these regions. The second step consists of constructing the association graph, which is modified with respect to the definition given in the literature [8], [15] and adapted to our application. Finally, we define a similarity coefficient by examining the characteristics of the maximal clique in the AG.

2.1 Definition of the ARG

An ARG is a powerful means of giving structural descriptions of visual patterns. Within an ARG, we can distinguish a syntactic part, made of the layout of unlabeled nodes and edges respectively identifying the structural primitive components of the pattern and their relations, and a semantic part consisting of the attributes associated with nodes and edges.

Formally, an ARG is defined as a quadruple $G = (V, E, L_V, L_E)$, where $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, and $L_V$ and $L_E$ are two labeling functions associating attributes with vertices and edges respectively.

A document and its corresponding ARG structure are illustrated in Figure 1. In this figure, we denote the node labels (text or graphic), and near the connecting lines we denote the edge labels (Upper, U; on the Right, R; Included, I; and Upper Right, UR). We have excluded other spatial relations to avoid redundancy, since the existence of a relation in one direction implies the complementary relation in the other.

Figure 1. A document image and the corresponding ARG representation. [Nodes: I1 (type: graphic), I2, I3 (type: text); edge labels: R = on the right, I = included.]

2.2 A modified association graph for structure retrieval of the ARG

The proposed algorithm for graph matching is based on clique finding in an association graph, which is one of the most popular algorithms that can achieve an exact matching.

For two graphs $G_1 = (V_1, E_1, L_{V_1}, L_{E_1})$ and $G_2 = (V_2, E_2, L_{V_2}, L_{E_2})$, the association graph is a vertex-labeled, undirected graph which can be created by a two-step process [8], [15]:

1. For each correct vertex mapping from graph G1 to G2, insert a vertex in the association graph. This vertex is labeled with the vertex mapping between the vertices of G1 and G2.
2. For each pair of vertices va and wa in the association graph, insert the edge (va, wa) if the vertices mapped from G1 have the same edge characteristics as the vertices mapped from G2.

Our approach is based on the construction of an association graph, but with a modification of the second step. We insert edges not only when all the characteristics of the vertices mapped from G1 and G2 agree, but also when only some of them do. In this way, the edges of the association graph are weighted.

We denote by (I1, I2) two vertices of G1 and by (J1, J2) two vertices of G2, where (I1, J1) and (I2, J2) are two correct mappings from G1 to G2. Weights are assigned to the AG edges as follows:

\[
(v_a, w_a) = ((I_1, J_1), (I_2, J_2)), \qquad
\mathrm{weight}(v_a, w_a) =
\begin{cases}
0 & \text{if } I_1 = I_2 \text{ or } J_1 = J_2 \\
1 & \text{if } L_{E_1}(I_1, I_2) = L_{E_2}(J_1, J_2) \\
0.2 & \text{if } L_{E_1}(I_1, I_2) \neq L_{E_2}(J_1, J_2) \\
\frac{2}{3} & \text{if } L_{E_1}(I_1, I_2) \subset \{U, R\} \text{ and } L_{E_2}(J_1, J_2) = \{UR\}, \text{ or inversely}
\end{cases}
\tag{1}
\]

In Figure 2, we illustrate the way we create an association graph from two graphs G1 (Figure 2.a) and G2 (Figure 2.b) that are candidates for comparison. Each vertex of the AG is created from two corresponding vertices of G1 and G2 (two vertices correspond when they have similar attributes). AG edges are then introduced between pairs of vertices with different weights (Figure 2.c).

Recall that a clique is a complete sub-graph, so that every pair of its vertices is joined by an edge. Each clique within the association graph represents a set of vertices which have the same mutual relationships in G1 and G2; i.e., the cliques represent sub-graphs common to G1 and G2. In order to determine the number of possible common sub-graphs between two graphs, it is sufficient to examine the characteristics of cliques in an unlabeled, undirected graph of the appropriate size. In the literature, we find several algorithms for clique finding, such as the Messmer method [11] and the Belaid algorithm [4]. For this purpose, we have implemented an optimal algorithm based on the Messmer method.

Figure 2.d illustrates the maximal clique of a given association graph. If the AG contains more than one clique, the maximal one is the one with the maximal total weight.
Figure 2. Illustration of the graph matching: (a) Graph G1, (b) Graph G2, (c) Association Graph, (d) Maximal clique
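
The construction of Section 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the `ARG` class, the function names and the example graphs are hypothetical, two vertices are taken to correspond when their labels are equal, two absent relations are treated as equal, the partial case of Eq. (1) is tested before the general "different" case, and the maximal clique is found by exhaustive enumeration rather than by the Messmer method.

```python
from itertools import combinations


class ARG:
    """Attributed relational graph G = (V, E, LV, LE) stored as two dicts."""

    def __init__(self, node_labels, edge_labels):
        self.node_labels = node_labels  # LV: vertex -> attribute
        self.edge_labels = edge_labels  # LE: ordered pair (a, b) -> relation

    def relation(self, a, b):
        """Return (label, is_forward) for the pair, or None if unrelated."""
        if (a, b) in self.edge_labels:
            return self.edge_labels[(a, b)], True
        if (b, a) in self.edge_labels:
            return self.edge_labels[(b, a)], False
        return None


def edge_weight(g1, g2, va, wa):
    """Weight of the AG edge (va, wa), following the cases of Eq. (1)."""
    (i1, j1), (i2, j2) = va, wa
    if i1 == i2 or j1 == j2:
        return 0.0
    r1, r2 = g1.relation(i1, i2), g2.relation(j1, j2)
    if r1 == r2:  # assumption: two absent relations also count as equal
        return 1.0
    if r1 is not None and r2 is not None and r1[1] == r2[1]:
        a, b = r1[0], r2[0]
        # Assumption: partial agreement is tested before the "different" case.
        if (a in {"U", "R"} and b == "UR") or (b in {"U", "R"} and a == "UR"):
            return 2 / 3
    return 0.2


def association_graph(g1, g2):
    """Step 1: one AG vertex per label-compatible pair. Step 2: weighted edges."""
    vertices = [(i, j) for i in g1.node_labels for j in g2.node_labels
                if g1.node_labels[i] == g2.node_labels[j]]
    edges = {}
    for va, wa in combinations(vertices, 2):
        w = edge_weight(g1, g2, va, wa)
        if w > 0:  # weight 0 means the two mappings are incompatible
            edges[frozenset((va, wa))] = w
    return vertices, edges


def max_weight_clique(vertices, edges):
    """Exhaustive search for the clique with the largest total edge weight."""
    best, best_weight = (), 0.0
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            pairs = [frozenset(p) for p in combinations(subset, 2)]
            if all(p in edges for p in pairs):
                weight = sum(edges[p] for p in pairs)
                if weight > best_weight or (weight == best_weight
                                            and len(subset) > len(best)):
                    best, best_weight = subset, weight
    return best, best_weight


# Hypothetical example: two graphic regions and one text region against
# one graphic region containing a text region.
g1 = ARG({"I1": "graphic", "I2": "graphic", "I3": "text"},
         {("I1", "I2"): "R", ("I2", "I3"): "I"})
g2 = ARG({"J1": "graphic", "J2": "text"}, {("J1", "J2"): "I"})

vertices, edges = association_graph(g1, g2)
clique, weight = max_weight_clique(vertices, edges)
print(clique, weight)
```

The exhaustive search is exponential in the number of AG vertices, so it is only suitable for small structures such as this example; it is used here because it makes the "maximal total weight" criterion easy to read.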

2.3 Similarity coefficient

Intuitively, the degree of resemblance between two objects is generally computed as the ratio of their common features to all the characteristics of the two objects [9], [16]. We therefore propose a similarity measure based on the clique weight and the characteristics of the two candidates G1 and G2.

We define the description of a graph as its number of vertices and edges, i.e. the number of regions and relations involved in the structure of a document. The same definition applies to a clique, with the number of edges replaced by the sum of the edge weights.

\[
\mathrm{sim}(G_1, G_2) = \frac{\mathrm{descr}(C_{\max}) - \mathrm{splits}}{\mathrm{descr}(G_1) + \mathrm{descr}(G_2) - \mathrm{descr}(G_1 \cap G_2)}
\tag{2}
\]

where splits represents the penalty for the over-matching of vertices belonging to the maximal clique. In our case, splits is the sum of the weights of all relations between the vertices of the clique and the remaining vertices of the association graph.

3. Results and experiments

To evaluate our approach, we used a synthetic database that we generated from 100 different document structures. We then tested the proposed algorithm on a set of 100 ancient documents. We assumed that these pages are pre-segmented and generated the corresponding graphs to be introduced as candidates for the matching process. Table 2 shows the results of applying our approach to query documents from the synthetic database (query 1) and from the ancient document database (query 2). We conclude that the similarity scores illustrate the effectiveness of our approach, mainly for the synthetic query.

4. Conclusion

In this paper, we have presented a new approach to indexing and retrieving ancient document structures. Indexing is performed with an ARG structure. Document search is then achieved by a graph matching algorithm based on clique finding in an association graph. Our contribution lies in the creation of the AG, whose edges are weighted according to an analysis of the relations in the graphs that are candidates for matching. Finally, the degree of similarity is determined from the characteristics of the maximal clique relative to those of the candidate graphs. We can conclude that this approach offers good results and leaves room for further improvement, mainly by introducing other content features for characterizing the structure.

Table 2: Similarity scores of the top four document matches for each query document

Query     Match 1   Match 2      Match 3      Match 4
Query 1   Sim=1     Sim=0.9136   Sim=0.8965   Sim=0.8586
Query 2   Sim=1     Sim=0.7097   Sim=0.7097   Sim=0.5818
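
The scores in Table 2 are values of the similarity coefficient of Eq. (2), sorted in decreasing order. The following minimal sketch shows how such a ranking might be computed. It rests on assumptions that are not stated in the paper: the function names and the candidate figures are hypothetical, and descr(G1 ∩ G2), which the text does not spell out, is passed in explicitly and set in the example to the description of the maximal clique.

```python
def descr_graph(num_vertices, num_edges):
    """Description of a graph: number of vertices plus number of edges."""
    return num_vertices + num_edges


def descr_clique(num_vertices, edge_weights):
    """Description of a clique: number of vertices plus sum of edge weights."""
    return num_vertices + sum(edge_weights)


def similarity(descr_cmax, splits, descr_g1, descr_g2, descr_common):
    """Eq. (2): (descr(Cmax) - splits) / (descr(G1) + descr(G2) - descr(G1 n G2))."""
    return (descr_cmax - splits) / (descr_g1 + descr_g2 - descr_common)


# Hypothetical query structure: 3 regions, 3 relations.
query = descr_graph(3, 3)

# Hypothetical candidates: graph description, maximal clique, and splits penalty.
candidates = {
    "doc_a": {"descr": descr_graph(3, 3),
              "cmax": descr_clique(3, [1, 1, 1]), "splits": 0.0},
    "doc_b": {"descr": descr_graph(3, 2),
              "cmax": descr_clique(2, [1]), "splits": 0.2},
}

# Assumption: descr(G1 n G2) is taken equal to descr(Cmax).
scores = {
    name: similarity(c["cmax"], c["splits"], query, c["descr"], c["cmax"])
    for name, c in candidates.items()
}

# Rank candidates from most to least similar, as in Table 2.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 4))
```

With this choice for descr(G1 ∩ G2), a candidate identical to the query whose maximal clique covers the whole graph with no splits obtains a score of 1, which is consistent with the first match of each query in Table 2.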



References

[1] C. Ah-Soon, “Analyse de plans architecturaux,” Ph.D. Thesis, Institut National Polytechnique de Lorraine, 1998.
[2] D. Arivault, “Apport des graphes dans la reconnaissance non-contrainte de caractères manuscrits anciens,” Ph.D. Thesis, Université de Poitiers, 2006.
[3] R. Ambauen, S. Fisher, H. Bunke, “Graph edit distance with node splitting and merging, and its application to diatom identification,” IAPR-TC15 Workshop on GbRPR, LNCS 2726, pp. 95-106, 2003.
[6] A. Belaïd, Y. Belaïd, “Reconnaissance des formes: Méthodes et applications,” InterEditions, 1992.
[7] F.P.G. Bergo, A.X. Falcão, P.A. Miranda, L.M. Rocha, “Automatic image segmentation by tree pruning,” Journal of Mathematical Imaging and Vision, 29(2-3), pp. 141-162, 2007.
[8] H. Bunke, S. Gunter, X. Jiang, “Towards bridging the gap between statistical and structural pattern recognition: Two new concepts in graph matching,” Advances in Pattern Recognition: ICAPR, Brazil, pp. 1-11, March 2001.
[9] P.F. Felzenszwalb, D.P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, 59(2), pp. 167-181, 2004.
[10] P. Heroux, “Contribution au problème de la rétro-conversion des documents structurés,” Ph.D. Thesis, Université de Rouen, 2001.
[11] D. Lin, “An information-theoretic definition of similarity,” International Conference on Machine Learning, pp. 296-304, Morgan Kaufmann, 1998.
[12] J. Lladós, “Symbol recognition by error-tolerant subgraph matching between region adjacency graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 10, 2001.
[13] B.T. Messmer, “Efficient graph matching algorithms for pre-processed model graphs,” Ph.D. Thesis, Institut für Informatik und angewandte Mathematik, University of Bern, Switzerland, 1995.
[14] M. Neuhaus, H. Bunke, “Automatic learning of cost functions for graph edit distance,” Inform. Sci., 177(1), pp. 239-247, 2007.
[15] M. Pelillo, K. Siddiqi, S.W. Zucker, “Matching hierarchical structures using association graphs,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1105-1120, November 1999.
[16] K. Riesen, H. Bunke, “Approximate graph edit distance computation by means of bipartite graph matching,” Image and Vision Computing, 27(7), pp. 950-959, 2009.
[17] K. Shearer, H. Bunke, S. Venkatesh, “Video indexing and similarity retrieval by largest common subgraph detection using decision trees,” Pattern Recognition 34, pp. 1075-1091, 2001.
[18] S. Sorlin, “Mesurer la similarité de graphes,” Ph.D. Thesis, Université Claude Bernard Lyon I, 24 November 2006.
[19] J.R. Ullmann, “An algorithm for subgraph isomorphism,” Journal of the Association for Computing Machinery, 23, pp. 31-42, 1976.
