Algorithmica

© 1996 Springer-Verlag New York Inc.

A Constrained Edit Distance Between Unordered Labeled Trees¹

Kaizhong Zhang²

Abstract. This paper considers the problem of computing a constrained edit distance between unordered labeled trees. The problem of approximate unordered tree matching is also considered. We present dynamic programming algorithms solving these problems in sequential time $O(|T_1| \times |T_2| \times (\deg(T_1) + \deg(T_2)) \times \log_2(\deg(T_1) + \deg(T_2)))$. Our previous result shows that computing the edit distance between unordered labeled trees is NP-complete.

Key Words. Unordered trees, Constrained edit distance, Approximate tree matching.

1. Introduction. Unordered rooted labeled trees are trees whose nodes are labeled

and in which only ancestor relationships are significant (the left-to-right order among

siblings is not significant). Such trees arise naturally in many fields: genealogical studies,

e.g., determining genetic diseases based on ancestry tree patterns; computer vision, e.g.,

object representation and recognition [7], [8]; chemistry, e.g., studying the relationship between the structure of chemical compounds and their properties [10], [4]; file systems, e.g., information retrieval in a Unix file system. For many such applications, it would be

useful to compare unordered labeled trees by some meaningful distance metric.

The edit distance metric, used with success [6] for ordered labeled trees [17], [18], is

one such natural metric. A previous result has shown that computing the edit distance

for unordered labeled trees is NP-complete [19] (in fact MAX SNP-hard [16]). Since

unordered trees are important in many applications, one would like to find distances that

can be efficiently computed.

Motivated by these results, this paper considers a constrained edit distance metric between unordered labeled trees. The constraint we add is a natural one, namely, that disjoint subtrees be mapped to disjoint subtrees. In fact, some applications require this constraint (e.g., classification tree comparison) [11]. We present a polynomial-time algorithm to compute this constrained edit distance metric.

In Section 2 we review the previous NP-completeness and MAX SNP-hard results

on the edit distance metric between unordered trees. In Section 3 we introduce the new

distance metric between trees, which we call the constrained edit distance. In Section 4

we investigate the properties of the constrained edit distance metric. In Section 5 we

present a dynamic programming algorithm to compute the constrained edit distance metric and analyse the complexity. In Section 6 we consider the approximate unordered tree-matching problem.

¹ This research was supported by the Natural Sciences and Engineering Research Council of Canada under Grant No. OGP0046373.

² Department of Computer Science, University of Western Ontario, London, Ontario, Canada, N6A 5B7. kzhang@csd.uwo.ca.

Received October 16, 1993; revised May 18, 1994. Communicated by H. N. Gabow.

In this section we review the previous results on unordered trees.

2.1. Trees. A free tree is a connected, acyclic, undirected graph. A rooted tree is a

free tree in which one of the vertices is distinguished from the others. The distinguished

vertex is called the root of the tree. We refer to a vertex of a rooted tree as a node of the

tree. An unordered tree is just a rooted tree. We use the term unordered tree to distinguish

it from the rooted, ordered tree defined below. An ordered tree is a rooted tree in which the children of each node are ordered. That is, if a node has k children, then we can designate them as the first child, the second child, ..., and the kth child.

Unless otherwise stated, all trees we consider in the paper are unordered labeled trees.

We consider three kinds of edit operations on a tree. Changing a node n means changing the label on n. Deleting a node n means making the children of n

become the children of the parent of n and then removing n. Inserting is the complement

of deleting. This means that inserting n as the child of m will make n the parent of a

subset (as opposed to a consecutive subsequence [17]) of the current children of m.
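The three operations can be sketched in Python as follows. This is only an illustrative sketch: the `Node` class and the helper names `relabel`, `delete`, `insert`, and `shape` are assumptions introduced here and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by content
class Node:
    label: str
    children: list = field(default_factory=list)


def relabel(node, new_label):
    """Change operation: replace the label on `node`."""
    node.label = new_label


def delete(parent, node):
    """Delete `node`: its children become children of `parent`."""
    parent.children.remove(node)
    parent.children.extend(node.children)
    node.children = []


def insert(parent, label, adopted):
    """Insert a new child of `parent` that adopts `adopted`, an arbitrary
    subset of the current children of `parent` (sibling order is irrelevant)."""
    new = Node(label)
    for child in list(adopted):
        parent.children.remove(child)
        new.children.append(child)
    parent.children.append(new)
    return new


def shape(node):
    """Order-independent canonical form, used to compare unordered trees."""
    return (node.label, tuple(sorted(shape(c) for c in node.children)))


# a(b(d, e), c): deleting b lifts d and e up to a; re-inserting b undoes it.
d, e, c = Node("d"), Node("e"), Node("c")
b = Node("b", [d, e])
a = Node("a", [b, c])
delete(a, b)
flat = shape(a)
insert(a, "b", [d, e])
restored = shape(a)
```

Because `insert` accepts any subset of the children rather than a consecutive run, it matches the unordered setting described above.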

Suppose each node label is a symbol chosen from an alphabet $\Sigma$. Let $\lambda$, a unique symbol not in $\Sigma$, denote the null symbol. Following [9], [17], and [19], we represent an edit operation as $a \to b$, where $a$ is either $\lambda$ or a label of a node in tree $T_1$ and $b$ is either $\lambda$ or a label of a node in tree $T_2$. We call $a \to b$ a change operation if $a \neq \lambda$ and $b \neq \lambda$; a delete operation if $b = \lambda$; and an insert operation if $a = \lambda$. Let $T_2$ be the tree that results from the application of an edit operation $a \to b$ to tree $T_1$; this is written $T_1 \to T_2$ via $a \to b$.

Let $S$ be a sequence $s_1, \ldots, s_k$ of edit operations. An $S$-derivation from tree $A$ to tree $B$ is a sequence of trees $A_0, \ldots, A_k$ such that $A = A_0$, $B = A_k$, and $A_{i-1} \to A_i$ via $s_i$ for $1 \le i \le k$. Let $\gamma$ be a cost function which assigns to each edit operation $a \to b$ a nonnegative real number $\gamma(a \to b)$.

We constrain $\gamma$ to be a distance metric. That is:

(i) $\gamma(a \to b) \ge 0$, $\gamma(a \to a) = 0$.

(ii) $\gamma(a \to b) = \gamma(b \to a)$.

(iii) $\gamma(a \to c) \le \gamma(a \to b) + \gamma(b \to c)$.

We extend $\gamma$ to the sequence of edit operations $S$ by letting $\gamma(S) = \sum_{i=1}^{|S|} \gamma(s_i)$.

2.3. Editing Distance and Edit Distance Mapping. The results in this subsection are

from [19]. We omit the proofs.

2.3.1. Editing Distance. The edit distance between two trees is defined [19] by considering the minimum cost edit operations sequence that transforms one tree to the other. Formally, the edit distance between $T_1$ and $T_2$ is defined as

$$D_e(T_1, T_2) = \min_{S} \{\gamma(S) \mid S \text{ is a sequence of edit operations taking } T_1 \text{ to } T_2\}.$$

2.3.2. Editing Distance Mappings. The edit operations give rise to a mapping which is

a graphical specification of what edit operations apply to each node in the two unordered

labeled trees.

Given a tree, it is usually convenient to use a numbering to refer to the nodes of the tree. For an ordered tree $T$, the left-to-right postorder numbering or left-to-right preorder numbering is often used to number the nodes of $T$ from 1 to $|T|$, the size of tree $T$. Since we are concerned with unordered trees, we can fix an arbitrary order for the children of each node in the tree and then use left-to-right postorder numbering or left-to-right preorder numbering. We refer to the above as a numbering of the tree.

Suppose that we have a numbering for each tree. Let $t[i]$ be the $i$th node of tree $T$ in the given numbering. Formally we define a triple $(M_e, T_1, T_2)$ to be an edit distance mapping from $T_1$ to $T_2$, where $M_e$ is any set of ordered pairs of integers $(i, j)$ satisfying:

(0) $1 \le i \le |T_1|$, $1 \le j \le |T_2|$, where $|T_k|$ is the size of tree $T_k$, $k = 1, 2$.

(1) For any pair of $(i_1, j_1)$ and $(i_2, j_2)$ in $M_e$:

(a) $i_1 = i_2$ iff $j_1 = j_2$ (one-to-one).

(b) $t_1[i_1]$ is an ancestor of $t_1[i_2]$ iff $t_2[j_1]$ is an ancestor of $t_2[j_2]$ (ancestor order preserved).

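We use $M_e$ instead of $(M_e, T_1, T_2)$ if there is no confusion. Let $M_e$ be an edit distance mapping from $T_1$ to $T_2$. Then we can define the cost of $M_e$:

$$\gamma(M_e) = \sum_{(i,j) \in M_e} \gamma(t_1[i] \to t_2[j]) + \sum_{\{i \mid \nexists j \text{ s.t. } (i,j) \in M_e\}} \gamma(t_1[i] \to \lambda) + \sum_{\{j \mid \nexists i \text{ s.t. } (i,j) \in M_e\}} \gamma(\lambda \to t_2[j]).$$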
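The two mapping conditions and the cost of a mapping can be checked mechanically. The sketch below assumes a unit-cost metric $\gamma$ and represents each tree by a hypothetical `parent` dictionary (node number to parent number, `None` for the root) and a `label` dictionary; none of these names come from the paper.

```python
LAMBDA = None  # stands for the null symbol


def gamma(a, b):
    """Unit-cost metric on labels (an assumption; any metric would do)."""
    return 0 if a == b else 1


def is_ancestor(parent, a, d):
    """True if node a is a proper ancestor of node d."""
    d = parent[d]
    while d is not None:
        if d == a:
            return True
        d = parent[d]
    return False


def is_edit_mapping(M, parent1, parent2):
    """Check conditions (1a) and (1b) for every pair of pairs in M."""
    for (i1, j1) in M:
        for (i2, j2) in M:
            if (i1 == i2) != (j1 == j2):  # one-to-one
                return False
            if is_ancestor(parent1, i1, i2) != is_ancestor(parent2, j1, j2):
                return False  # ancestor order not preserved
    return True


def mapping_cost(M, label1, label2):
    """Cost of M: changes, plus deletions of unmapped T1 nodes,
    plus insertions of unmapped T2 nodes."""
    mapped1 = {i for i, _ in M}
    mapped2 = {j for _, j in M}
    cost = sum(gamma(label1[i], label2[j]) for i, j in M)
    cost += sum(gamma(label1[i], LAMBDA) for i in label1 if i not in mapped1)
    cost += sum(gamma(LAMBDA, label2[j]) for j in label2 if j not in mapped2)
    return cost


# T1 = a(b, c) and T2 = a(b), both in postorder numbering.
parent1, label1 = {1: 3, 2: 3, 3: None}, {1: "b", 2: "c", 3: "a"}
parent2, label2 = {1: 2, 2: None}, {1: "b", 2: "a"}
good = {(1, 1), (3, 2)}   # b -> b, a -> a, c deleted
bad = {(1, 2), (3, 1)}    # maps a child onto the root: order violated
```

Here `good` is a valid mapping of cost 1 (the deletion of `c`), while `bad` fails the ancestor condition.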

The relation between an edit distance mapping and a sequence of edit operations is as follows:

LEMMA 1. Given a sequence $S$ of edit operations from $T_1$ to $T_2$, an edit distance mapping $M_e$ from $T_1$ to $T_2$ exists such that $\gamma(M_e) \le \gamma(S)$. Conversely, for any edit distance mapping $M_e$, a sequence $S$ of edit operations exists such that $\gamma(S) = \gamma(M_e)$.

Based on the lemma, the following theorem states the relation between the edit distance and the edit distance mappings. This is why we call this kind of mapping an edit distance mapping.

THEOREM 1. $D_e(T_1, T_2) = \min_{M_e} \{\gamma(M_e) \mid M_e \text{ is an edit distance mapping from } T_1 \text{ to } T_2\}$.

One of the results in [19] is that finding $D_e(T_1, T_2)$ is NP-complete even if the trees are binary trees with a label alphabet of size two. Kilpeläinen and Mannila [3] showed that even the inclusion problem for unordered trees is NP-complete. In fact we recently proved a stronger result [16]: that the problem of finding the minimum cost edit distance mapping (edit distance) between two unordered trees and the problem of finding the largest common subtree of two unordered trees are both MAX SNP-hard. This means that there is no polynomial-time approximation scheme (PTAS) for these problems unless P = NP [1]. Since unordered trees are important in many applications, it is desirable to find distances that can be efficiently computed.

3. A Constrained Edit Distance Metric Between Unordered Trees. Our new distance metric is based on a constraint on the mappings allowed between two trees. The intuitive idea is that two separate subtrees of $T_1$ should be mapped to two separate subtrees of $T_2$. This idea was proposed by Tanaka and Tanaka [11] in their definition of a structure-preserving mapping between two ordered labeled trees, although the definition in [11] does not capture the idea precisely. Tanaka and Tanaka [11] also showed that in some applications (e.g., classification tree comparison) this kind of mapping is more meaningful than edit distance mapping. In a separate result [15], we developed an algorithm for the constrained edit distance between ordered labeled trees with time complexity $O(|T_1||T_2|)$. This paper extends the definition from ordered trees to unordered trees.

3.1. Constrained Edit Distance Mappings. Suppose again that we have a numbering for each tree. Let $t[i]$ be the $i$th node of tree $T$ in the given numbering. Let $T[i]$ be the subtree rooted at $t[i]$ and let $F[i]$ be the unordered forest obtained by deleting $t[i]$ from $T[i]$.

Formally we define a triple $(M, T_1, T_2)$ to be a constrained edit distance mapping from $T_1$ to $T_2$, where $M$ is any set of ordered pairs of integers $(i, j)$ satisfying:

(1) $M$ is an edit distance mapping.

(2) For any triple $(i_1, j_1)$, $(i_2, j_2)$, and $(i_3, j_3)$ in $M$, let $t_1[I]$ be $\mathrm{lca}(t_1[i_1], t_1[i_2])$ and let $t_2[J]$ be $\mathrm{lca}(t_2[j_1], t_2[j_2])$, where lca represents least common ancestor; then $t_1[I]$ is a proper ancestor of $t_1[i_3]$ iff $t_2[J]$ is a proper ancestor of $t_2[j_3]$.

This definition adds condition (2) to the definition of the edit distance mapping. We examine the meaning of the constrained edit distance mappings. Suppose $t_1[I]$ is a descendant of $t_1[i_3]$; then $t_1[i_1]$ and $t_1[i_2]$ must be descendants of $t_1[i_3]$. By condition (1), $t_2[j_1]$ and $t_2[j_2]$ must be descendants of $t_2[j_3]$. This in turn means that $t_2[J]$ is a descendant of $t_2[j_3]$. Therefore condition (1) implies that $t_1[I]$ is a descendant of $t_1[i_3]$ if and only if $t_2[J]$ is a descendant of $t_2[j_3]$. Condition (2) adds the constraint that $t_1[I]$ is a proper ancestor of $t_1[i_3]$ if and only if $t_2[J]$ is a proper ancestor of $t_2[j_3]$. Suppose now that neither $t_1[I]$ nor $t_1[i_3]$ is an ancestor of the other; then by (1) and (2) neither $t_2[J]$ nor $t_2[j_3]$ is an ancestor of the other.
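Condition (2) can be tested directly by enumerating all triples of pairs. The following sketch does so in cubic time, which is meant only to illustrate the definition; the `parent` dictionaries and all function names are assumptions introduced for this example.

```python
def ancestors(parent, v):
    """Path from v up to the root, v included."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, a, b):
    """Least common ancestor of nodes a and b."""
    on_path = set(ancestors(parent, a))
    for v in ancestors(parent, b):
        if v in on_path:
            return v


def proper_ancestor(parent, a, d):
    return a != d and a in ancestors(parent, d)


def satisfies_condition_2(M, parent1, parent2):
    """Check condition (2) for every triple of pairs in M."""
    for (i1, j1) in M:
        for (i2, j2) in M:
            I = lca(parent1, i1, i2)
            J = lca(parent2, j1, j2)
            for (i3, j3) in M:
                if proper_ancestor(parent1, I, i3) != proper_ancestor(parent2, J, j3):
                    return False
    return True


# T1: root 5 with children 3 and 4; node 3 has children 1 and 2.
# T2: root 4 with three leaf children 1, 2, 3.
parent1 = {1: 3, 2: 3, 3: 5, 4: 5, 5: None}
parent2 = {1: 4, 2: 4, 3: 4, 4: None}
merged = {(1, 1), (2, 2), (4, 3)}   # subtrees rooted at 3 and 4 are merged
separate = {(1, 1), (2, 2)}
```

The mapping `merged` is a legal edit distance mapping (it corresponds to deleting node 3), but it sends the two separate subtrees rooted at nodes 3 and 4 of $T_1$ into one sibling group, so it violates condition (2); `separate` satisfies it.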

Consider two separate subtrees $T_1[i_1]$ and $T_1[i_2]$ such that neither $t_1[i_1]$ nor $t_1[i_2]$ is an ancestor of the other. Given a mapping $M$, let

$$M_k = \{n \mid (m, n) \in M \text{ and } t_1[m] \text{ is a node of } T_1[i_k]\}, \quad k = 1, 2,$$

and let $t_2[j_k]$ be the least common ancestor of the nodes $t_2[n]$, $n \in M_k$, so that $T_2[j_k]$ is the smallest subtree of $T_2$ containing the image of $T_1[i_k]$ under mapping $M$. With our definition of the constrained edit distance mapping, neither $t_2[j_1]$ nor $t_2[j_2]$ is an ancestor of the other. This coincides with the intuitive idea that separate subtrees of $T_1$ should be mapped to separate subtrees of $T_2$.

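We use $M$ instead of $(M, T_1, T_2)$ if there is no confusion. Let $M$ be a constrained edit distance mapping from $T_1$ to $T_2$; the cost of $M$, $\gamma(M)$, is well defined since $M$ is also an edit distance mapping.

Constrained edit distance mappings can be composed. Let $M_1$ be a constrained edit distance mapping from $T_1$ to $T_2$, and let $M_2$ be a constrained edit distance mapping from $T_2$ to $T_3$. Define

$$M_1 \circ M_2 = \{(i, j) \mid \exists k \text{ s.t. } (i, k) \in M_1 \text{ and } (k, j) \in M_2\}.$$

LEMMA 2.

(1) $M_1 \circ M_2$ is a constrained edit distance mapping from $T_1$ to $T_3$.

(2) $\gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2)$.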

PROOF. (1) follows from the definition of constrained edit distance mappings. We check condition (2) only. Let $(i_1, j_1)$, $(i_2, j_2)$, and $(i_3, j_3)$ be in $M_1 \circ M_2$. By the definition of $M_1 \circ M_2$, there are $k_1$, $k_2$, and $k_3$ such that $(i_1, k_1)$, $(i_2, k_2)$, and $(i_3, k_3)$ are in $M_1$ and $(k_1, j_1)$, $(k_2, j_2)$, and $(k_3, j_3)$ are in $M_2$. Let $I$ be $\mathrm{lca}(i_1, i_2)$, let $K$ be $\mathrm{lca}(k_1, k_2)$, and let $J$ be $\mathrm{lca}(j_1, j_2)$. By the definition of $M_1$ and $M_2$, $I$ is a proper ancestor of $i_3$ iff $K$ is a proper ancestor of $k_3$. Moreover, $K$ is a proper ancestor of $k_3$ iff $J$ is a proper ancestor of $j_3$. Therefore $I$ is a proper ancestor of $i_3$ iff $J$ is a proper ancestor of $j_3$.

(2) Let $M_1$ be the constrained edit distance mapping from $T_1$ to $T_2$, let $M_2$ be the constrained edit distance mapping from $T_2$ to $T_3$, and let $M_1 \circ M_2$ be the composed constrained edit distance mapping from $T_1$ to $T_3$. Three general situations occur: $(i, j) \in M_1 \circ M_2$, $i \notin M_1$, or $j \notin M_2$. In each case this corresponds to an edit operation $\gamma(x \to y)$ where $x$ and $y$ may be nodes or may be $\lambda$. In all such cases, the triangle inequality on the distance metric $\gamma$ ensures that $\gamma(x \to y) \le \gamma(x \to z) + \gamma(z \to y)$. □
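As a small illustration of the composition used in Lemma 2, the sketch below composes two mappings stored as sets of integer pairs. The function name `compose` is an assumption made for this example.

```python
def compose(M1, M2):
    """Return {(i, j) | there is k with (i, k) in M1 and (k, j) in M2}.

    M2 is one-to-one, so each k has at most one partner j."""
    partner = {k: j for k, j in M2}
    return {(i, partner[k]) for i, k in M1 if k in partner}


M1 = {(1, 1), (2, 3)}   # T1 -> T2
M2 = {(1, 2), (3, 4)}   # T2 -> T3
M13 = compose(M1, M2)   # T1 -> T3
```

A node of $T_1$ whose image in $T_2$ is unmapped by $M_2$ simply drops out of the composition, which corresponds to the case $j \notin M_2$ in the proof.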

3.2. A Constrained Edit Distance Between Trees. We can now define a dissimilarity measure between $T_1$ and $T_2$ as:

$$D(T_1, T_2) = \min_{M} \{\gamma(M) \mid M \text{ is a constrained edit distance mapping from } T_1 \text{ to } T_2\}.$$

THEOREM 2.

(1) $D(T, T) = 0$.

(2) $D(T_1, T_2) = D(T_2, T_1)$.

(3) $D(T_1, T_3) \le D(T_1, T_2) + D(T_2, T_3)$.

PROOF. (1) and (2) follow directly from the definition of the constrained edit distance mapping. For (3), consider the minimum cost constrained edit distance mappings $M_1$ from $T_1$ to $T_2$ and $M_2$ from $T_2$ to $T_3$. By Lemma 2,

$$D(T_1, T_3) \le \gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2) = D(T_1, T_2) + D(T_2, T_3).$$ □

We call this distance metric the constrained edit distance. The relation between the constrained edit distance metric $D$ and the edit distance metric $D_e$ is $D_e(T_1, T_2) \le D(T_1, T_2)$. The reason is that any constrained edit distance mapping is always an edit distance mapping.

In this section we present several lemmas which are the basis for the algorithm in the next section. We use $\theta$ to represent the empty tree.

LEMMA 3. Let $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ be the children of $t_1[i]$ and let $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ be the children of $t_2[j]$. Then

$$D(\theta, \theta) = 0,$$

$$D(F_1[i], \theta) = \sum_{k=1}^{n_i} D(T_1[i_k], \theta), \qquad D(T_1[i], \theta) = D(F_1[i], \theta) + \gamma(t_1[i] \to \lambda),$$

$$D(\theta, F_2[j]) = \sum_{k=1}^{n_j} D(\theta, T_2[j_k]), \qquad D(\theta, T_2[j]) = D(\theta, F_2[j]) + \gamma(\lambda \to t_2[j]).$$

PROOF. Trivial. □

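LEMMA 4. Let $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ be the children of $t_1[i]$ and let $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ be the children of $t_2[j]$. Then

$$D(T_1[i], T_2[j]) = \min \begin{cases} D(\theta, T_2[j]) + \min_{1 \le t \le n_j} \{D(T_1[i], T_2[j_t]) - D(\theta, T_2[j_t])\}, \\ D(T_1[i], \theta) + \min_{1 \le s \le n_i} \{D(T_1[i_s], T_2[j]) - D(T_1[i_s], \theta)\}, \\ D(F_1[i], F_2[j]) + \gamma(t_1[i] \to t_2[j]). \end{cases}$$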

PROOF. Consider the minimum cost constrained edit distance mapping $M$ between $T_1[i]$ and $T_2[j]$. There are four cases:

(1) $i \in M$ and $j \notin M$.

(2) $i \notin M$ and $j \in M$.

(3) $i \in M$ and $j \in M$.

(4) $i \notin M$ and $j \notin M$.

Case 1. Let $(i, t)$ be in $M$. Since $j \notin M$, $t$ must be a node in $F_2[j]$. Let $t_2[j_t]$ be the child of $t_2[j]$ on the path from $t_2[t]$ to $t_2[j]$. Thus $D(T_1[i], T_2[j]) = D(T_1[i], T_2[j_t]) + \gamma(\lambda \to t_2[j]) + \sum_{k \ne t} D(\theta, T_2[j_k])$. Since $D(\theta, T_2[j]) = \gamma(\lambda \to t_2[j]) + \sum_{k=1}^{n_j} D(\theta, T_2[j_k])$, we can rewrite the right-hand side of the formula as $D(\theta, T_2[j]) + D(T_1[i], T_2[j_t]) - D(\theta, T_2[j_t])$. Since $t$ ranges from 1 to $n_j$, we take the minimum of these corresponding costs.

Case 2. Similar to case 1.

Case 3. Since $i \in M$ and $j \in M$, by the condition of constrained edit distance mapping, $(i, j)$ must be in $M$. Since $M - \{(i, j)\}$ is a constrained edit distance mapping between $F_1[i]$ and $F_2[j]$, and for any constrained edit distance mapping $M'$ between $F_1[i]$ and $F_2[j]$, $\{(i, j)\} \cup M'$ is a constrained edit distance mapping between $T_1[i]$ and $T_2[j]$, we know $D(T_1[i], T_2[j]) = D(F_1[i], F_2[j]) + \gamma(t_1[i] \to t_2[j])$.

Case 4. Similar to case 3. The formula would be $D(T_1[i], T_2[j]) = D(F_1[i], F_2[j]) + \gamma(t_1[i] \to \lambda) + \gamma(\lambda \to t_2[j])$. Since $\gamma(t_1[i] \to t_2[j]) \le \gamma(t_1[i] \to \lambda) + \gamma(\lambda \to t_2[j])$, we do not have to include this case in our final formula. □

A restricted mapping $RM(i, j)$ between $F_1[i]$ and $F_2[j]$ is defined as follows:

(1) $RM(i, j)$ is a constrained edit distance mapping between $F_1[i]$ and $F_2[j]$.

(2) If $(l, k)$ is in $RM(i, j)$, $t_1[l]$ is in $T_1[i_s]$, and $t_2[k]$ is in $T_2[j_t]$, then, for any $(l_1, k_1)$ in $RM(i, j)$, $t_1[l_1]$ is in $T_1[i_s]$ if and only if $t_2[k_1]$ is in $T_2[j_t]$.

The cost of a restricted mapping is well defined. The meaning of the restriction is that nodes from a subtree $T_1[i_s]$ can only map to nodes of one subtree $T_2[j_t]$ and vice versa. It turns out that, although there are exceptions, this is the general case when we consider the constrained edit distance mapping between $F_1[i]$ and $F_2[j]$.

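LEMMA 5. Let $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ be the children of $t_1[i]$ and let $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ be the children of $t_2[j]$. Then

$$D(F_1[i], F_2[j]) = \min \begin{cases} D(\theta, F_2[j]) + \min_{1 \le t \le n_j} \{D(F_1[i], F_2[j_t]) - D(\theta, F_2[j_t])\}, \\ D(F_1[i], \theta) + \min_{1 \le s \le n_i} \{D(F_1[i_s], F_2[j]) - D(F_1[i_s], \theta)\}, \\ \min_{RM(i,j)} \gamma(RM(i, j)). \end{cases}$$

PROOF. Consider the minimum cost constrained edit distance mapping $M$ between $F_1[i]$ and $F_2[j]$. There are four cases.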

Case 1. There is a $1 \le t \le n_j$ such that if $(k, l) \in M$, then $t_2[l]$ is a node in subtree $T_2[j_t]$; and there are $(k_1, l_1)$ and $(k_2, l_2)$ in $M$ such that $t_1[k_1]$ is a node in $T_1[i_{s_1}]$ and $t_1[k_2]$ is a node in $T_1[i_{s_2}]$, where $1 \le s_1 \ne s_2 \le n_i$. Note that in this case $j_t$ cannot be in $M$. This is similar to case 1 in Lemma 4, and hence we have the following formula: $D(F_1[i], F_2[j]) = D(\theta, F_2[j]) + \min_{1 \le t \le n_j} \{D(F_1[i], F_2[j_t]) - D(\theta, F_2[j_t])\}$.

Case 2. There is a $1 \le s \le n_i$ such that if $(k, l) \in M$, then $t_1[k]$ is a node in subtree $T_1[i_s]$; and there are $(k_1, l_1)$ and $(k_2, l_2)$ in $M$ such that $t_2[l_1]$ is a node in $T_2[j_{t_1}]$ and $t_2[l_2]$ is a node in $T_2[j_{t_2}]$, where $1 \le t_1 \ne t_2 \le n_j$. This is similar to case 1.

Case 3. There are $s$ and $t$ such that if $(k, l) \in M$, then $t_1[k]$ is a node in $T_1[i_s]$ and $t_2[l]$ is a node in $T_2[j_t]$. In this case $M$ is a restricted mapping and therefore $D(F_1[i], F_2[j]) = \min_{RM(i,j)} \gamma(RM(i, j))$.

Case 4. There are $(k_1, l_1)$, $(k_2, l_2)$, $(x_1, y_1)$, $(x_2, y_2)$ in $M$ such that $t_1[k_1]$ is a node in $T_1[i_{s_1}]$, $t_1[k_2]$ is a node in $T_1[i_{s_2}]$, $t_2[y_1]$ is a node in $T_2[j_{t_1}]$, and $t_2[y_2]$ is a node in $T_2[j_{t_2}]$, where $1 \le s_1 \ne s_2 \le n_i$ and $1 \le t_1 \ne t_2 \le n_j$. We show that in this case the constrained edit distance mapping $M$ is a restricted mapping between $F_1[i]$ and $F_2[j]$.

Suppose this is not true. Then, without loss of generality, we assume that there are $(a_1, b_1)$ and $(a_2, b_2)$ in $M$ such that $t_1[a_1]$ and $t_1[a_2]$ belong to the same subtree and $t_2[b_1]$ and $t_2[b_2]$ belong to different subtrees. Let $(a_3, b_3) \in M$ be such that $t_1[a_3]$ and $t_1[a_1]$ belong to different subtrees of $F_1[i]$. Now consider $\mathrm{lca}(t_1[a_1], t_1[a_2])$ and $t_1[a_3]$. Since $t_1[a_1]$ and $t_1[a_2]$ belong to the same subtree, which is different from the subtree $t_1[a_3]$ belongs to, we know that $\mathrm{lca}(t_1[a_1], t_1[a_2])$ and $t_1[a_3]$ do not have the ancestor or descendant relationship. However, if we consider $\mathrm{lca}(t_2[b_1], t_2[b_2])$ and $t_2[b_3]$, it is easy to see that $\mathrm{lca}(t_2[b_1], t_2[b_2]) = t_2[j]$, which is an ancestor of $t_2[b_3]$. This means that $M$ is not a valid constrained edit distance mapping. Contradiction. □

We now consider how to compute $\min_{RM(i,j)} \gamma(RM(i, j))$. From the definition of the restricted mapping, we know that nodes of one subtree of $F_1[i]$ can only map to nodes of one subtree of $F_2[j]$. This means that if $T_1[i_s]$ is involved in the mapping, then it will map to $T_2[j_t]$ for some $1 \le t \le n_j$; if it is not involved in the mapping, then it will map to an empty tree. This gives a kind of matching between subtrees of $F_1[i]$ and subtrees of $F_2[j]$. The next two lemmas establish the relationship between $\min_{RM(i,j)} \gamma(RM(i, j))$ and $D(T_1[i_s], T_2[j_t])$, where $1 \le s \le n_i$ and $1 \le t \le n_j$. We need the following definitions in the next two lemmas.

Let $I = \{i_1, i_2, \ldots, i_{n_i}\}$, $J = \{j_1, j_2, \ldots, j_{n_j}\}$, $A = \{a_1, a_2, \ldots, a_{n_j}\}$, and $B = \{b_1, b_2, \ldots, b_{n_i}\}$. We use $A$ and $B$ to represent empty trees. Let $L = I \cup A$ and $R = J \cup B$; we define a weighted bipartite graph $G(i, j) = (V, E)$ as follows:

(1) $V = L \cup R$.

(2) $E = L \times R$, where $\times$ stands for the Cartesian product.

(3) $w(u, v) = D(T_1[u], T_2[v])$ if $u \in I$ and $v \in J$,

$w(u, v) = D(T_1[u], \theta)$ if $u \in I$ and $v \in B$,

$w(u, v) = D(\theta, T_2[v])$ if $u \in A$ and $v \in J$,

$w(u, v) = 0$ if $u \in A$ and $v \in B$.

Given a graph $G = (V, E)$, a matching is a subset of edges $M \subseteq E$ such that for every vertex $v \in V$ at most one edge of $M$ is incident on $v$. The size of $M$, $|M|$, is the number of edges in $M$. A maximum matching is a matching with maximum cardinality, that is, a matching $M$ such that, for any matching $M'$, we have $|M| \ge |M'|$. Let $MM(i, j)$ be a maximum matching on $G(i, j)$; then $|MM(i, j)| = n_i + n_j$.

The cost of $MM(i, j)$, the sum of the weights of the edges in $MM(i, j)$, is defined as follows:

$$\gamma(MM(i, j)) = \sum_{(u,v) \in MM(i,j)} w(u, v) = \sum_{\substack{(u,v) \in MM \\ u \in I,\ v \in J}} D(T_1[u], T_2[v]) + \sum_{\substack{(u,v) \in MM \\ u \in I,\ v \in B}} D(T_1[u], \theta) + \sum_{\substack{(u,v) \in MM \\ u \in A,\ v \in J}} D(\theta, T_2[v]).$$

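LEMMA 6. $\min_{RM(i,j)} \gamma(RM(i, j)) = \min_{MM(i,j)} \gamma(MM(i, j))$.

PROOF. Given a restricted mapping $RM(i, j)$, we can construct a maximum matching $MM(i, j)$ on $G(i, j)$ as follows:

$$MM(i, j) = H_{MM} \cup I_{MM} \cup J_{MM} \cup K_{MM},$$

where

$H_{MM} = \{(i_s, j_t) \mid$ there is $(k, l) \in RM(i, j)$ s.t. $t_1[k]$ ($t_2[l]$) is a node in $T_1[i_s]$ ($T_2[j_t]$)$\}$,

$I_{MM} = \{(i_s, b_s) \mid$ there is no $j_t$ s.t. $(i_s, j_t) \in H_{MM}\}$,

$J_{MM} = \{(a_t, j_t) \mid$ there is no $i_s$ s.t. $(i_s, j_t) \in H_{MM}\}$,

$K_{MM} = \{(a_s, b_t) \mid (i_t, j_s) \in H_{MM}\}$.

It is easy to check that $MM(i, j)$ is a maximum matching. Furthermore, it is easy to see that $\gamma(MM(i, j)) \le \gamma(RM(i, j))$.

On the other hand, given a maximum matching $MM(i, j)$, we can construct a restricted mapping $RM(i, j)$ as follows:

$$RM(i, j) = \bigcup_{(i_s, j_t) \in MM(i,j)} \{(k, l) \mid (k, l) \text{ is in the minimum cost constrained edit distance mapping between } T_1[i_s] \text{ and } T_2[j_t]\}.$$

It is clear that this is indeed a restricted mapping. Therefore, $\gamma(RM(i, j)) \le \gamma(MM(i, j))$.

Hence $\min_{RM(i,j)} \gamma(RM(i, j)) = \min_{MM(i,j)} \gamma(MM(i, j))$. □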

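LEMMA 7. Let $t_1[i_1], t_1[i_2], \ldots, t_1[i_{n_i}]$ be the children of $t_1[i]$ and let $t_2[j_1], t_2[j_2], \ldots, t_2[j_{n_j}]$ be the children of $t_2[j]$. Then

$$D(F_1[i], F_2[j]) = \min \begin{cases} D(\theta, F_2[j]) + \min_{1 \le t \le n_j} \{D(F_1[i], F_2[j_t]) - D(\theta, F_2[j_t])\}, \\ D(F_1[i], \theta) + \min_{1 \le s \le n_i} \{D(F_1[i_s], F_2[j]) - D(F_1[i_s], \theta)\}, \\ \min_{MM(i,j)} \gamma(MM(i, j)). \end{cases}$$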

5.1. Algorithm. From the definition of $MM(i, j)$ and $\gamma(MM(i, j))$, it is clear that this problem is the minimum cost maximum bipartite matching problem. Therefore we can use the weighted maximum matching algorithm to solve this problem. However, since we have many empty trees in graph $G(i, j)$, this will result in redundant computation. Instead, we reduce this problem directly to the minimum cost maximum flow problem by adding only two empty trees, one to $F_1[i]$ and the other to $F_2[j]$.

Given $F_1[i]$ and $F_2[j]$, let $I = \{i_1, i_2, \ldots, i_{n_i}\}$ and $J = \{j_1, j_2, \ldots, j_{n_j}\}$, where $i_k$, $1 \le k \le n_i$, represents tree $T_1[i_k]$, and $j_l$, $1 \le l \le n_j$, represents tree $T_2[j_l]$.

We construct a graph $G = (V, E)$ as follows:

vertex set: $V = \{s, t, e_i, e_j\} \cup I \cup J$, where $s$ is the source, $t$ is the sink, and $e_i$ and $e_j$ represent two empty trees;

edge set: $[s, i_k]$, $[s, e_i]$, $[j_l, t]$, $[e_j, t]$ with cost zero, $[i_k, j_l]$ with cost $D(T_1[i_k], T_2[j_l])$, $[e_i, j_l]$ with cost $D(\theta, T_2[j_l])$, $[i_k, e_j]$ with cost $D(T_1[i_k], \theta)$, and $[e_i, e_j]$ with cost zero. All the edges have capacity one except $[s, e_i]$, $[e_i, e_j]$, and $[e_j, t]$ whose capacities are $n_j$, $\max\{n_i, n_j\} - \min\{n_i, n_j\}$, and $n_i$, respectively.
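The construction can be transcribed into a few lines of Python. The sketch below only builds the vertex list and an edge table mapping each edge to its `(capacity, cost)` pair, exactly as listed above; it does not solve the flow problem. The function name `build_network`, the tuple-based vertex names, and the three input tables are assumptions made for this example.

```python
def build_network(D, del_cost, ins_cost):
    """Build the network G for one pair (i, j).

    D[k][l]     stands for D(T1[i_k], T2[j_l]),
    del_cost[k] stands for D(T1[i_k], theta),
    ins_cost[l] stands for D(theta, T2[j_l]).
    Returns (vertices, edges) with edges[(u, v)] = (capacity, cost)."""
    ni, nj = len(del_cost), len(ins_cost)
    I = [("i", k) for k in range(ni)]
    J = [("j", l) for l in range(nj)]
    vertices = ["s", "t", "ei", "ej"] + I + J
    edges = {}
    for k, u in enumerate(I):
        edges[("s", u)] = (1, 0)
        edges[(u, "ej")] = (1, del_cost[k])      # T1[i_k] goes to an empty tree
        for l, v in enumerate(J):
            edges[(u, v)] = (1, D[k][l])         # T1[i_k] is matched to T2[j_l]
    for l, v in enumerate(J):
        edges[(v, "t")] = (1, 0)
        edges[("ei", v)] = (1, ins_cost[l])      # T2[j_l] comes from an empty tree
    # Capacities of the three special edges, copied from the text.
    edges[("s", "ei")] = (nj, 0)
    edges[("ei", "ej")] = (max(ni, nj) - min(ni, nj), 0)
    edges[("ej", "t")] = (ni, 0)
    return vertices, edges


vertices, edges = build_network(
    D=[[1, 2, 3], [2, 0, 1]], del_cost=[2, 1], ins_cost=[1, 1, 2])
```

For $n_i = 2$ and $n_j = 3$ this yields $n_i + n_j + 4 = 9$ vertices and $n_i n_j + 2n_i + 2n_j + 3 = 19$ edges.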

$G$ is a network with integer capacities, nonnegative costs, and the maximum flow $f^* = n_i + n_j$ (see Figure 1). Let $\mathrm{cost}(G)$ be the cost of the minimum cost maximum flow of $G$, where only the costs on positive flows are considered. The following lemma states the relation between $\mathrm{cost}(G)$ and $\min_{MM(i,j)} \gamma(MM(i, j))$, which enables us to use the minimum cost flow algorithm to compute $\min_{MM(i,j)} \gamma(MM(i, j))$.

[Figure 1: the network $G$; each edge is labeled with its capacity and cost, e.g., $[i_k, j_l]$ carries $1, D(T_1[i_k], T_2[j_l])$.]

LEMMA 8. $\mathrm{cost}(G) = \min_{MM(i,j)} \gamma(MM(i, j))$.

PROOF. Given a maximum matching $MM(i, j)$, we can construct a maximum flow on $G$: for any $(i_s, j_t) \in MM(i, j)$ the flow on edge $[i_s, j_t]$ is one; for any $(i_s, b_t) \in MM(i, j)$ the flow on edge $[i_s, e_j]$ is one; for any $(a_s, j_t) \in MM(i, j)$ the flow on edge $[e_i, j_t]$ is one; the flow on edge $[e_i, e_j]$ is [...]; the flows on edge $[s, i_k]$ and edge $[j_l, t]$ are one; the flow on edge $[s, e_i]$ is $n_j$; the flow on edge $[e_j, t]$ is $n_i$; and all the flows on the other edges are zero.

On the other hand, given a maximum flow on $G$, $\mathrm{cost}(G)$ represents the cost of the following maximum matching on $G(i, j)$:

$$MM(i, j) = \{(i_s, j_t), (a_t, b_s) \mid \text{the flow on edge } [i_s, j_t] \text{ is one}\} \cup \{(i_s, b_s) \mid \text{the flow on edge } [i_s, e_j] \text{ is one}\} \cup \{(a_t, j_t) \mid \text{the flow on edge } [e_i, j_t] \text{ is one}\}.$$

Therefore $\mathrm{cost}(G) = \min_{MM(i,j)} \gamma(MM(i, j))$. □

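Input: $T_1$ and $T_2$.

Output: $D(T_1[i], T_2[j])$, where $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$.

$D(\theta, \theta) = 0$;

for $i = 1$ to $|T_1|$

    $D(F_1[i], \theta) = \sum_{k=1}^{n_i} D(T_1[i_k], \theta)$

    $D(T_1[i], \theta) = D(F_1[i], \theta) + \gamma(t_1[i] \to \lambda)$

for $j = 1$ to $|T_2|$

    $D(\theta, F_2[j]) = \sum_{k=1}^{n_j} D(\theta, T_2[j_k])$

    $D(\theta, T_2[j]) = D(\theta, F_2[j]) + \gamma(\lambda \to t_2[j])$

for $i = 1$ to $|T_1|$

    for $j = 1$ to $|T_2|$

$$D(F_1[i], F_2[j]) = \min \begin{cases} D(\theta, F_2[j]) + \min_{1 \le t \le n_j} \{D(F_1[i], F_2[j_t]) - D(\theta, F_2[j_t])\}, \\ D(F_1[i], \theta) + \min_{1 \le s \le n_i} \{D(F_1[i_s], F_2[j]) - D(F_1[i_s], \theta)\}, \\ \min_{MM(i,j)} \gamma(MM(i, j)). \end{cases}$$

$$D(T_1[i], T_2[j]) = \min \begin{cases} D(\theta, T_2[j]) + \min_{1 \le t \le n_j} \{D(T_1[i], T_2[j_t]) - D(\theta, T_2[j_t])\}, \\ D(T_1[i], \theta) + \min_{1 \le s \le n_i} \{D(T_1[i_s], T_2[j]) - D(T_1[i_s], \theta)\}, \\ D(F_1[i], F_2[j]) + \gamma(t_1[i] \to t_2[j]). \end{cases}$$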
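The recurrences can be turned into a short memoized Python program. The sketch below is a hedged illustration, not the paper's implementation: it assumes a unit-cost metric $\gamma$, represents a tree as a nested tuple `(label, (child, ...))`, and replaces the minimum cost maximum flow computation with a brute-force search over all perfect matchings of the padded bipartite graph $G(i, j)$, which is only practical for very small node degrees. All function names are assumptions.

```python
from functools import lru_cache
from itertools import permutations

LAMBDA = None  # the null symbol


def gamma(a, b):
    """Unit-cost metric on labels (an assumption; any metric would do)."""
    return 0 if a == b else 1


def delete_cost(tree):
    """D(T, theta): delete every node of the tree."""
    label, kids = tree
    return gamma(label, LAMBDA) + sum(delete_cost(k) for k in kids)


def insert_cost(tree):
    """D(theta, T): insert every node of the tree."""
    label, kids = tree
    return gamma(LAMBDA, label) + sum(insert_cost(k) for k in kids)


def min_cost_matching(w):
    """Brute-force minimum cost perfect matching on a square weight matrix."""
    n = len(w)
    return min(sum(w[u][p[u]] for u in range(n)) for p in permutations(range(n)))


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """D(F1[i], F2[j]); f1 and f2 are tuples of child subtrees."""
    del_all = sum(delete_cost(c) for c in f1)
    ins_all = sum(insert_cost(c) for c in f2)
    if not f1 or not f2:
        return del_all + ins_all
    # F1[i] maps into the forest below a single child of t2[j].
    best = ins_all + min(
        forest_dist(f1, c[1]) - sum(insert_cost(g) for g in c[1]) for c in f2)
    # F2[j] maps into the forest below a single child of t1[i].
    best = min(best, del_all + min(
        forest_dist(c[1], f2) - sum(delete_cost(g) for g in c[1]) for c in f1))
    # Restricted mappings: perfect matching on the padded graph G(i, j).
    ni, nj = len(f1), len(f2)
    n = ni + nj
    w = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u < ni and v < nj:
                w[u][v] = tree_dist(f1[u], f2[v])
            elif u < ni:
                w[u][v] = delete_cost(f1[u])
            elif v < nj:
                w[u][v] = insert_cost(f2[v])
    return min(best, min_cost_matching(w))


@lru_cache(maxsize=None)
def tree_dist(t1, t2):
    """D(T1[i], T2[j]) for trees given as (label, (child, ...)) tuples."""
    (a, kids1), (b, kids2) = t1, t2
    best = forest_dist(kids1, kids2) + gamma(a, b)
    if kids2:
        best = min(best, insert_cost(t2) + min(
            tree_dist(t1, c) - insert_cost(c) for c in kids2))
    if kids1:
        best = min(best, delete_cost(t1) + min(
            tree_dist(c, t2) - delete_cost(c) for c in kids1))
    return best


def leaf(x):
    return (x, ())


t1 = ("a", (("x", (leaf("b"), leaf("c"))), leaf("d")))
t2 = ("a", (leaf("b"), leaf("c"), leaf("d")))
```

For these two trees the unconstrained edit distance is 1 (delete `x`), but that mapping merges the subtree rooted at `x` with its sibling `d`, so the constrained distance computed by `tree_dist(t1, t2)` is 3.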

The complexity of computing $D(T_1[i], T_2[j])$ is bounded by $O(n_i + n_j)$. The complexity of computing $D(F_1[i], F_2[j])$ is bounded by $O(n_i + n_j)$ plus the complexity of the minimum cost maximum flow computation. Our graph $G$ is a graph with integer capacities, nonnegative edge costs, and maximum flow $f^* = n_i + n_j$. The complexity of finding a minimum cost maximum flow for such a graph with $n$ vertices and $m$ edges is $O(m|f^*|\log_{(2+m/n)} n) \le O(m|f^*|\log_2 n)$ [12]. For our graph, $n = n_i + n_j + 4$ and $m = n_i \times n_j + 2n_i + 2n_j + 3$; therefore the complexity is bounded by $O(n_i \times n_j \times (n_i + n_j) \times \log_2(n_i + n_j))$.

Hence for any pair $i$ and $j$, the complexity of computing $D(T_1[i], T_2[j])$ and $D(F_1[i], F_2[j])$ is bounded by $O(n_i \times n_j \times (n_i + n_j) \times \log_2(n_i + n_j))$. Therefore the complexity of our algorithm is

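$$\begin{aligned} &\sum_{i=1}^{|T_1|} \sum_{j=1}^{|T_2|} O(n_i \times n_j \times (n_i + n_j) \times \log_2(n_i + n_j)) \\ &\quad\le \sum_{i=1}^{|T_1|} \sum_{j=1}^{|T_2|} O(n_i \times n_j \times (\deg(T_1) + \deg(T_2)) \times \log_2(\deg(T_1) + \deg(T_2))) \\ &\quad\le O\Big((\deg(T_1) + \deg(T_2)) \times \log_2(\deg(T_1) + \deg(T_2)) \times \sum_{i=1}^{|T_1|} n_i \times \sum_{j=1}^{|T_2|} n_j\Big) \\ &\quad\le O(|T_1| \times |T_2| \times (\deg(T_1) + \deg(T_2)) \times \log_2(\deg(T_1) + \deg(T_2))). \end{aligned}$$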

The approximate unordered tree-matching problem is a natural extension of approximate string matching [5], [13], [2] and approximate ordered tree matching [17].

The approximate-string-matching problem is as follows. Given two strings TEXT and PAT, the problem is to compute, for each $i$, $SD[i, PAT] = \min_j \{D(TEXT[j..i], PAT)\}$, where $1 \le j \le i + 1$ and $D$ is the string distance metric. With the edit distance metric, the problem is to compute, for each $i$, the minimum number of edit operations between "pattern" string $PAT[1..|PAT|]$ and the "text" substring $TEXT[1..i]$, where any prefix can be removed from $TEXT[1..i]$. Intuitively the problem is to find the "occurrence" in TEXT that most closely matches PAT.
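For strings with unit-cost edit operations, $SD[i, PAT]$ can be computed for all $i$ with the classical dynamic program in which removing a text prefix is free. The following sketch shows one way to do this; the function name `approx_string_match` is an assumption made for the example.

```python
def approx_string_match(text, pat):
    """Return [SD[0, pat], SD[1, pat], ..., SD[len(text), pat]], where
    SD[i, pat] is the least number of edit operations between pat and
    some suffix text[j..i] of the prefix text[1..i]."""
    m = len(pat)
    prev = list(range(m + 1))      # i = 0: only the empty substring is available
    result = [prev[m]]
    for ch in text:
        cur = [0] * (m + 1)        # cur[0] = 0: any prefix of the text is removed for free
        for k in range(1, m + 1):
            cur[k] = min(prev[k - 1] + (ch != pat[k - 1]),  # match or change
                         prev[k] + 1,                        # delete ch from the text
                         cur[k - 1] + 1)                     # insert pat[k-1]
        result.append(cur[m])
        prev = cur
    return result


distances = approx_string_match("xxabcxx", "abc")
```

In this example the minimum, 0, is reached at $i = 5$, the position where the exact occurrence of `abc` ends.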

Given pattern tree P and data tree T, in order to allow P to match only a part of T,

we must generalize the notion of a substring and the notion of removing a prefix. For us,

a substring means a subtree and a prefix means a collection of subtrees.

Assume a numbering of tree T, we use t[i] to represent the ith node of tree T, T[i] to

represent the subtree rooted at t[i], and Tf[i] to represent the forest obtained by deleting

t[i] from T[i]. Similarly, for pattern tree P we can define p[i], P[i], and Pf[i].

We first define the operation of removing at a node [17].

Removing at node $t[i]$ means removing the subtree rooted at $t[i]$. Given tree $T$, we define $S(T)$ as a set of nodes of $T$ satisfying

$t[i], t[j] \in S(T)$ implies that neither is an ancestor of the other.

Define $R(T, S(T))$ to be the tree obtained by removing at all the nodes in $S(T)$ from tree $T$.

Now we can give the definition of approximate unordered tree matching. Given pattern tree $P$ and data tree $T$, for each subtree $T[i]$, we want to compute

$$D_r(P, T[i]) = \min_{S(T[i])} \{D(P, R(T[i], S(T[i])))\}.$$

The intuitive meaning of the above definition is as follows. Given a pattern tree $P$ and a data tree $T$, with approximate tree matching we want to find part of $T$, say $T_p$, that most closely matches $P$. We assume that $T_p$ is connected, therefore $T_p$ must be a tree. Since we consider rooted trees, $T_p$ has a root. Let $t[i]$ be the root of $T_p$; it is clear that we can obtain $T_p$ by removing some subtrees, say $S(T[i])$, from $T[i]$ since $T_p$ is connected. Hence, $\min_i D_r(P, T[i])$ will give us the distance between $P$ and $T_p$.

In the following we use $n_i$ to represent the number of children of the node $p[i]$ and $n_j$ to represent the number of children of the node $t[j]$.

LEMMA 9. Let $p[i_1], p[i_2], \ldots, p[i_{n_i}]$ be the children of $p[i]$ and let $t[j_1], t[j_2], \ldots, t[j_{n_j}]$ be the children of $t[j]$. Then

$$D_r(\theta, \theta) = 0,$$

$$D_r(\theta, T_f[j]) = 0, \qquad D_r(\theta, T[j]) = 0,$$

$$D_r(P_f[i], \theta) = \sum_{k=1}^{n_i} D_r(P[i_k], \theta), \qquad D_r(P[i], \theta) = D_r(P_f[i], \theta) + \gamma(p[i] \to \lambda).$$

PROOF. Trivial. □

LEMMA 10. Let $p[i_1], p[i_2], \ldots, p[i_{n_i}]$ be the children of $p[i]$ and let $t[j_1], t[j_2], \ldots, t[j_{n_j}]$ be the children of $t[j]$. Then

$$D_r(P[i], T[j]) = \min \begin{cases} D_r(P[i], \theta), \\ \gamma(\lambda \to t[j]) + \min_{1 \le t \le n_j} D_r(P[i], T[j_t]), \\ D_r(P[i], \theta) + \min_{1 \le s \le n_i} \{D_r(P[i_s], T[j]) - D_r(P[i_s], \theta)\}, \\ D_r(P_f[i], T_f[j]) + \gamma(p[i] \to t[j]). \end{cases}$$

PROOF. We consider the best mapping between $P[i]$ and $T[j]$ after performing an optimal removal of subtrees of $T[j]$. If the whole tree $T[j]$ is removed, then the distance should be $D_r(P[i], \theta)$. Otherwise we have the same four cases as in Lemma 4. In case 1 $p[i]$ is in the best mapping but $t[j]$ is not. Suppose that $p[i]$ is mapped to $t[k]$ which is a node in $T[j_t]$. Then subtrees $T[j_1], \ldots, T[j_{t-1}], T[j_{t+1}], \ldots, T[j_{n_j}]$ should have been removed in the optimal removal of subtrees of $T[j]$. Therefore the distance is $\gamma(\lambda \to t[j]) + D_r(P[i], T[j_t])$. The other three cases are the same as in Lemma 4. □

The computation of $D_r(P_f[i], T_f[j])$ again uses a weighted bipartite graph. However, we have to modify the definition of the bipartite graph: Let $I = \{i_1, i_2, \ldots, i_{n_i}\}$, $J = \{j_1, j_2, \ldots, j_{n_j}\}$, and $B = \{b_1, b_2, \ldots, b_{n_i}\}$. We use $B$ to represent empty trees. Let $L = I$ and $R = J \cup B$; we define a weighted bipartite graph $G_r(i, j) = (V, E)$ as follows:

(1) $V = L \cup R$.

(2) $E = L \times R$, where $\times$ stands for the Cartesian product.

(3) $w(u, v) = D_r(P[u], T[v])$ if $u \in I$ and $v \in J$,

$w(u, v) = D_r(P[u], \theta)$ if $u \in I$ and $v \in B$.

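LEMMA 11. Let $p[i_1], p[i_2], \ldots, p[i_{n_i}]$ be the children of $p[i]$ and let $t[j_1], t[j_2], \ldots, t[j_{n_j}]$ be the children of $t[j]$. Then

$$D_r(P_f[i], T_f[j]) = \min \begin{cases} \min_{1 \le t \le n_j} \{D_r(P_f[i], T_f[j_t]) + \gamma(\lambda \to t[j_t])\}, \\ D_r(P_f[i], \theta) + \min_{1 \le s \le n_i} \{D_r(P_f[i_s], T_f[j]) - D_r(P_f[i_s], \theta)\}, \\ \min_{MM_r(i,j)} \gamma(MM_r(i, j)). \end{cases}$$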

PROOF. Again we consider the best mapping after performing the optimal removal. If $T_f[j]$ is removed, then the distance should be $D_r(P_f[i], \theta)$. However, this is handled by $\min_{MM_r(i,j)} \gamma(MM_r(i, j))$ since $\gamma(MM_r(i, j)) = D_r(P_f[i], \theta)$ when $MM_r(i, j) = \{(i_s, b_s) \mid 1 \le s \le n_i\}$.

Suppose that $T_f[j]$ is not removed; then we have four cases as in Lemma 5. Case 1 is similar to case 1 in Lemma 5. The only difference is that we do not need to consider the costs of inserting subtrees of $T_f[j]$ which are not in the mapping since they should have been removed. Case 2 is the same as case 2 in Lemma 5. Case 3 is a special case for $\min_{MM_r(i,j)} \gamma(MM_r(i, j))$, where $MM_r(i, j)$ contains only one element from $I \times J$. Case 4 is similar to case 4 of Lemma 5. $G_r(i, j)$ and $MM_r(i, j)$ are defined to deal with this case where we do not need to consider those subtrees of $T_f[j]$ that are not in the best mapping. □

We again reduce the computation of $\min_{MM_r(i,j)} \gamma(MM_r(i, j))$ to the minimum cost flow problem.

Given $P[i]$ and $T[j]$, let $I = \{i_1, i_2, \ldots, i_{n_i}\}$ and $J = \{j_1, j_2, \ldots, j_{n_j}\}$, where $i_k$, $1 \le k \le n_i$, represents tree $P[i_k]$, and $j_l$, $1 \le l \le n_j$, represents tree $T[j_l]$.

We construct a graph $G_r = (V, E)$ as follows:

vertex set: $V = \{s, t, e, f\} \cup I \cup J$, where $s$ is the source, $t$ is the sink, and $e$ represents an empty tree;

edge set: $[s, i_k]$, $[j_l, f]$, $[e, f]$, $[f, t]$ with cost zero, $[i_k, j_l]$ with cost $D_r(P[i_k], T[j_l])$, and $[i_k, e]$ with cost $D_r(P[i_k], \theta)$. All the edges have capacity one except $[e, f]$ and $[f, t]$, whose capacities are $n_i$.

$G_r$ is a network with integer capacities, nonnegative costs, and the maximum flow $f^* = n_i$ (see Figure 2).

We can prove that cost(Gr) = minMMr(i,j) y (MMr (i, j)). The proof is similar to the

proof of L e m m a 8.

We are now ready to give an algorithm for approximate unordered tree matching. The time complexity of the algorithm is $O(|P| \times |T| \times \deg(P) \times \log_2(\deg(P) + \deg(T)))$, since $f^*$ for $G_r$ is $n_i$ instead of $n_i + n_j$.
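The bound can be checked with a short calculation. This is only a sketch. It assumes that one minimum cost flow computation on a network with $|E|$ edges, $|V|$ vertices, and maximum flow $f^*$ takes $O(|E| \cdot f^* \cdot \log_2 |V|)$ time, and it takes $\deg(P)$ and $\deg(T)$ to be the maximum number of children of a node in $P$ and $T$; neither is restated in this section. For the pair $(i, j)$, $G_r$ has $O((n_i + 1)(n_j + 1))$ edges, $n_i + n_j + 4$ vertices, and $f^* = n_i$, so the total time is

\[
\sum_{i=1}^{|P|} \sum_{j=1}^{|T|} O\bigl((n_i + 1)(n_j + 1) \cdot n_i \cdot \log_2(n_i + n_j + 4)\bigr)
\le O\Bigl(\deg(P) \, \log_2(\deg(P) + \deg(T)) \sum_{i=1}^{|P|} (n_i + 1) \sum_{j=1}^{|T|} (n_j + 1)\Bigr).
\]

Since $\sum_{i} (n_i + 1) < 2|P|$ and $\sum_{j} (n_j + 1) < 2|T|$, the right-hand side is $O(|P| \times |T| \times \deg(P) \times \log_2(\deg(P) + \deg(T)))$. The factor $\deg(P)$ comes from $f^* = n_i \le \deg(P)$; with $f^* = n_i + n_j$ the same calculation would give $\deg(P) + \deg(T)$ instead.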


[Figure 2: the network $G_r$, with source $s$ and sink $t$.]

Input: $P$ and $T$.

Output: $D_r(P, T[i])$, where $1 \le i \le |T|$.

$D_r(\theta, \theta) = 0$;
for $j = 1$ to $|T|$
    $D_r(\theta, T_f[j]) = 0$
    $D_r(\theta, T[j]) = 0$
for $i = 1$ to $|P|$
    $D_r(P_f[i], \theta) = \sum_{k=1}^{n_i} D_r(P[i_k], \theta)$
    $D_r(P[i], \theta) = D_r(P_f[i], \theta) + \gamma(p[i] \to \lambda)$
for $i = 1$ to $|P|$
    for $j = 1$ to $|T|$

\[
D_r(P_f[i], T_f[j]) = \min
\begin{cases}
\min_{1 \le t \le n_j} \{ D_r(P_f[i], T_f[j_t]) + \gamma(\lambda \to t[j_t]) \}, \\
D_r(P_f[i], \theta) + \min_{1 \le s \le n_i} \{ D_r(P_f[i_s], T_f[j]) - D_r(P_f[i_s], \theta) \}, \\
\min_{MM_r(i,j)} \gamma(MM_r(i,j)),
\end{cases}
\]

\[
D_r(P[i], T[j]) = \min
\begin{cases}
D_r(P[i], \theta), \\
\min_{1 \le t \le n_j} \{ D_r(P[i], T[j_t]) \} + \gamma(\lambda \to t[j]), \\
D_r(P[i], \theta) + \min_{1 \le s \le n_i} \{ D_r(P[i_s], T[j]) - D_r(P[i_s], \theta) \}, \\
D_r(P_f[i], T_f[j]) + \gamma(p[i] \to t[j]).
\end{cases}
\]
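The algorithm can be illustrated with a small memoized sketch. It is a hedged reading of the recurrences, not the author's implementation, and it makes several assumptions that are not in the paper: trees are nested tuples `(label, children)`, all edit costs are unit costs, the forest recurrence is read as a minimum over three terms and the tree recurrence as a minimum over four terms, and the term $\min_{MM_r(i,j)} \gamma(MM_r(i,j))$ is found by exhaustive search in the hypothetical helper `best_matching` rather than by minimum cost flow. The exhaustive search is exponential in the number of children and is only meant for tiny inputs.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
INS = DEL = 1                        # assumed unit insert and delete costs

def rel(a, b):                       # assumed unit relabel cost gamma(a -> b)
    return 0 if a == b else 1

@lru_cache(maxsize=None)
def del_forest(p):                   # D_r(P_f[i], theta)
    return sum(del_tree(c) for c in p[1])

@lru_cache(maxsize=None)
def del_tree(p):                     # D_r(P[i], theta)
    return DEL + del_forest(p)

def best_matching(pcs, tcs):
    """Exhaustive stand-in for the min-cost-flow step on G_r."""
    def go(k, used):
        if k == len(pcs):
            return 0                 # unmatched text subtrees are removed for free
        best = del_tree(pcs[k]) + go(k + 1, used)   # P[i_k] paired with the empty tree
        for l, tc in enumerate(tcs):
            if not (used >> l) & 1:
                best = min(best, tree_dist(pcs[k], tc) + go(k + 1, used | (1 << l)))
        return best
    return go(0, 0)

@lru_cache(maxsize=None)
def forest_dist(p, t):               # D_r(P_f[i], T_f[j])
    cands = [best_matching(p[1], t[1])]
    cands += [forest_dist(p, tc) + INS for tc in t[1]]
    cands += [del_forest(p) + forest_dist(pc, t) - del_forest(pc) for pc in p[1]]
    return min(cands)

@lru_cache(maxsize=None)
def tree_dist(p, t):                 # D_r(P[i], T[j])
    cands = [del_tree(p), forest_dist(p, t) + rel(p[0], t[0])]
    cands += [tree_dist(p, tc) + INS for tc in t[1]]
    cands += [del_tree(p) + tree_dist(pc, t) - del_tree(pc) for pc in p[1]]
    return min(cands)

def subtrees(t):
    """Yield every subtree of t in preorder."""
    yield t
    for c in t[1]:
        yield from subtrees(c)

def approximate_match(p, t):
    """D_r(P, T[i]) for every subtree T[i] of T, listed in preorder."""
    return [tree_dist(p, s) for s in subtrees(t)]
```

With `P = ('a', (('b', ()), ('c', ())))` and `T = ('a', (('c', ()), ('x', (('y', ()),)), ('b', ())))`, `tree_dist(P, T)` is 0: the subtree rooted at `x` is removed at no cost and the remaining children match in a different sibling order.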

[16], we have defined a constrained edit distance metric between unordered labeled trees. We present an algorithm for computing this distance metric based on a reduction to the minimum cost maximum flow problem. Our algorithm is generalizable with the same complexity to the approximate unordered-tree-matching problem.

The work presented is part of a project to develop a comprehensive tool for approximate tree pattern matching [14]. The proposed algorithms and their implementation are currently integrated into this tool.

Acknowledgments. The author would like to thank the anonymous referees for their many helpful suggestions.

References

[1] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy, Proof verification and hardness of approximation problems, Proc. 33rd IEEE Symp. on the Foundations of Computer Science, 1992, pp. 14-23.

[2] G.M. Landau and U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms, 10 (1989), 157-169.

[3] P. Kilpeläinen and H. Mannila, The tree inclusion problem, Proc. Internat. Joint Conf. on the Theory

and Practice of Software Development (CAAP '91), 1991, Vol. 1, pp. 202-214.

[4] S. Masuyama, Y. Takahashi, T. Okuyama, and S. Sasaki, On the largest common subgraph problem,

Algorithms and Computing Theory, RIMS, Kokyuroku (Kyoto University), 1990, pp. 195-201.

[5] P.H. Sellers, The theory and computation of evolutionary distances, J. Algorithms, 1 (1980), 359-373.

[6] B. Shapiro and K. Zhang, Comparing multiple RNA secondary structures using tree comparisons,

Comput. Appl. Biosci., 6(4) (1990), 309-318.

[7] F. Y. Shih, Object representation and recognition using mathematical morphology model, J. System Integration, 1 (1991), 235-256.

[8] F. Y. Shih and O. R. Mitchell, Threshold decomposition of grayscale morphology into binary morphology, IEEE Trans. Pattern Anal. Mach. Intell., 11 (1989), 31-42.

[9] K.C. Tai, The tree-to-tree correction problem, J. Assoc. Comput. Mach., 26 (1979), 422-433.

[10] Y. Takahashi, Y. Satoh, H. Suzuki, and S. Sasaki, Recognition of largest common structural fragment

among a variety of chemical structures, Anal. Sci., 3 (1987), 23-28.

[11] E. Tanaka and K. Tanaka, The tree-to-tree editing problem, Internat. J. Pattern Recog. Artificial Intell., 2(2) (1988), 221-240.

[12] R.E. Tarjan, Data Structures and Network Algorithms, CBMS-NSF Regional Conference Series in Applied Mathematics, CBMS, Washington, DC, 1983.

[13] E. Ukkonen, Finding approximate patterns in strings, J. Algorithms, 6 (1985), 132-137.

[14] J.T.L. Wang, Kaizhong Zhang, Karpjoo Jeong, and D. Shasha, ATBE: a system for approximate tree

matching, IEEE Trans. Knowledge Data Engrg, 6(4) (1994), 559-571.


[15] Kaizhong Zhang, Algorithms for the Constrained Editing Distance Between Ordered Labeled Trees and

Related Problems, Technical Report No. 361, Department of Computer Science, University of Western

Ontario, 1993.

[16] Kaizhong Zhang and Tao Jiang, Some MAX SNP-hard results concerning unordered labeled trees,

Inform. Process. Lett., 49 (1994), 249-254.

[17] Kaizhong Zhang and D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput., 18(6) (1989), 1245-1262.

[18] Kaizhong Zhang, D. Shasha, and J. Wang, Approximate tree matching in the presence of variable length

don't cares, J. Algorithms, 16 (1994), 33-66.

[19] Kaizhong Zhang, R. Statman and D. Shasha, On the editing distance between unordered labeled trees,

Inform. Process. Lett., 42 (1992), 133-139.
