
Two-dimensional Prefix String Matching

and Covering on Square Matrices


Maxime Crochemore*, Costas S. Iliopoulos†, Maureen Korda‡

August 1995

Abstract
Two linear algorithms are presented: one for computing the longest prefix
of a square pattern that occurs at every position of a given 2-dimensional
array, and one for computing all square covers of a given 2-dimensional
array.

* Institut Gaspard Monge, Université de Marne-la-Vallée, 2, rue de la Butte Verte, F-93160 Noisy-le-Grand, France. Email: mac@litp.ibp.fr.
† Department of Computer Science, King's College London, Strand, London, England and School of Computing, Curtin University, Perth, WA, Australia. Email: csi@dcs.kcl.ac.uk. Partially supported by SERC grants GR/F 00898 and GR/J 17844, NATO grant CRG 900293, ESPRIT BRA grant 7131 for ALCOM II, and MRC grant G 9115730.
‡ Department of Computer Science, King's College London, Strand, London, U.K. Email: mo@dcs.kcl.ac.uk. Supported by a Medical Research Council Studentship.
1. Introduction
In recent studies of repetitive structures of strings, generalized notions of periods
have been introduced. A typical regularity, the period u of a given string x, grasps
the repetitiveness of x, since x is a prefix of a string constructed by concatenations of
u. A substring w of x is called a cover of x if x can be constructed by concatenations
and superpositions of w. A substring w of x is called a seed of x if there exists a
superstring of x which is constructed by concatenations and superpositions of w. For
example, abc is a period of abcabcabca, abca is a cover of abcabcaabca, and abca is a
seed of abcabcaabc. The notions "cover" and "seed" are generalizations of periods in
the sense that superpositions as well as concatenations are considered to define them,
whereas only concatenations are considered for periods.
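The cover relation above can be checked mechanically from the occurrence list of w in x; a minimal sketch (ours, for illustration, not an algorithm from this paper):

```python
def is_cover(w, x):
    """True iff x can be built from concatenations and superpositions of w,
    i.e. every position of x lies within some occurrence of w."""
    m, n = len(w), len(x)
    if m == 0 or m > n:
        return False
    covered_up_to = 0                    # x[0:covered_up_to] is covered so far
    for i in range(n - m + 1):
        if x[i:i + m] == w:
            if i > covered_up_to:        # an uncovered gap precedes this occurrence
                return False
            covered_up_to = i + m
    return covered_up_to == n
```

On the examples above, abca covers abcabcaabca but only seeds abcabcaabc.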
Given a string x of length n and a pattern p of length m, the string prefix-matching
problem is that of computing the longest prefix of p which occurs at each
position of x. Main and Lorentz introduced the notion of string prefix-matching in
[ML84] and presented a linear algorithm for it. In two dimensions, the prefix-matching
problem is to compute the largest prefix of an m × m pattern matrix P which occurs at
each position of an n × n text matrix T. The 2-dimensional prefix-matching problem
can be solved using the Lsuffix tree construction of Giancarlo [G93]. The Lsuffix tree
for T, defined over an alphabet Σ, takes O(n^2 (log |Σ| + log n^2)) time to build. All
occurrences of a pattern P in T can be found in O(m^2 log |Σ| + t) time, where t is the
total number of occurrences. We present an optimal linear-time algorithm for the 2-dimensional
prefix-matching problem which uses the powerful notion of a 2-dimensional
failure function to reduce the number of substring comparisons.
In the computation of covers, two problems have been considered in the literature.
The shortest-cover problem is that of computing the shortest cover of a given string
of length n, and the all-covers problem is that of computing all the covers of a given
string. Apostolico, Farach and Iliopoulos [AFI91] introduced the notion of covers and
gave a linear-time algorithm for the shortest-cover problem. Breslauer [Br92] presented a
linear-time on-line algorithm for the same problem. Moore and Smyth [MS94] presented
a linear-time algorithm for the all-covers problem. In parallel computation, Breslauer
[Br95] gave an optimal O(α(n) log log n)-time algorithm for the shortest cover, where
α(n) is the inverse Ackermann function. Iliopoulos and Park [IP94] gave an optimal
O(log log n)-time (thus work-time optimal) algorithm for the shortest-cover problem.
Iliopoulos, Moore and Park [IMP93] introduced the notion of seeds and gave
an O(n log n)-time algorithm for computing all the seeds of a given string of length n.
For the same problem Ben-Amram, Berkman, Iliopoulos and Park [BBIP94] presented
a parallel algorithm that requires O(log n) time and O(n log n) work. Apostolico and
Ehrenfeucht [AE93] considered yet another problem related to covers.
In this paper we generalize the all-covers problem to two dimensions and present
an optimal linear algorithm for the problem. Let S be a square submatrix of a square
matrix A; we say that S covers A (or equivalently, S is a cover of A) if every point
of A is within an occurrence of S. The 2-dimensional all-covers problem is as follows:
given a 2-dimensional square matrix A, compute all square submatrices S that cover
A. While the algorithms for the shortest-cover problem [AFI91, Br92, Br95, IP94] rely
mostly on string properties, our algorithm for the 2-dimensional all-covers problem is
based on the Aho-Corasick automaton and "gap" monitoring techniques.
The paper is organized as follows: in the next section we present some definitions
and results used in the sequel. In Section 3 we present a linear algorithm for computing
all the borders of a square matrix. In Section 4 we present a linear algorithm for
computing the diagonal failure function of a square matrix (an algorithm similar to
the one in [ABF92]). In Section 5 we present a linear algorithm for the 2-dimensional
prefix string matching problem. Finally, in Section 6, we present a linear algorithm
for computing all the covers of square matrices.

2. Preliminaries
A string is a sequence of zero or more symbols from an alphabet Σ. The set of
all strings over the alphabet Σ is denoted by Σ*. A string x of length n is represented
by x_1 ... x_n, where x_i ∈ Σ for 1 ≤ i ≤ n. A string w is a substring of x if x = uwv for
u, v ∈ Σ*. A string w is a prefix of x if x = wu for u ∈ Σ*. Similarly, w is a suffix of
x if x = uw for u ∈ Σ*. The string xy is a concatenation of two strings x and y. The
concatenation of k copies of x is denoted by x^k. A string u is a period of x if x is a
prefix of u^k for some k, or equivalently if x is a prefix of ux. The period of a string x is
the shortest period of x.
A two-dimensional string is an n × m matrix of nm symbols drawn from an
alphabet Σ. The n × n square matrix A can be represented by A[1..n, 1..n]. An m × m
matrix M is a submatrix of A if the upper left corner of M can be aligned with an element
A(i, j), 1 ≤ i, j ≤ n − m + 1, and M[1..m, 1..m] = A[i..i+m−1, j..j+m−1].
In this case, the submatrix M is said to occur at A(i, j). A submatrix M is said to be
a prefix of A if M occurs at A(1, 1). Similarly, a submatrix M is a suffix of A if M
occurs at A(n−m+1, n−m+1).
A string b is a border of x if b is both a prefix and a suffix of x. The empty string and
x itself are trivial borders of x. An m × m submatrix M is a border of A if M occurs at
positions A(1, 1), A(n−m+1, 1), A(1, n−m+1) and A(n−m+1, n−m+1). The
empty matrix and the matrix A itself are trivial borders of A.
Fact 1. A string u is a period of x = ub if and only if b is a non-trivial border of x.
Proof. It follows immediately from the definitions of period and border.
A substring w of x is called a cover of x if x can be constructed by concatenations
and superpositions of w. A submatrix M is called a cover of A if every element A(i, j)
is contained within some occurrence of M. Here we consider the two-dimensional all-covers
problem, i.e., that of computing all the covers of a given n × n square matrix A.
Fact 2. A cover of a string x is also its border. A cover of a matrix A is also its border.
Proof. A cover of a string (matrix) occurs as both a prefix and a suffix and is therefore
a border.
The Aho-Corasick automaton [AC75] was designed to solve the multi-keyword
pattern-matching problem: given a set of keywords {r_1, ..., r_k} and an input string t,
test whether or not a keyword r_i occurs as a substring of t. The Aho-Corasick pattern
matching automaton is a six-tuple (Q, Σ, g, h, q_0, F), where Q is a finite set of states, Σ
is a finite input alphabet, g : Q × Σ → Q ∪ {fail} is the forward transition, h : Q → Q
is the failure function (link), q_0 is the initial state and F is the set of final states (for
details see [AC75]).
Informally, the automaton can be represented as a rooted labeled tree augmented
with the failure links. The label of the path from the root (initial state)
to a state s is a prefix of one of the given keywords; we denote this label by l_s. If s
is a final state, then l_s is a keyword. No two sibling edges have the
same label. The failure link of a node s points to a node h(s) such that the string l_{h(s)}
is a prefix of a keyword and also the longest suffix of the string l_s (see Figure 1 for an
example).
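A compact sketch of this construction — the goto trie plus a breadth-first computation of the failure links h — in the standard style of [AC75] (the function and variable names below are ours):

```python
from collections import deque

def build_aho_corasick(keywords):
    """Build the goto trie, failure links and output sets for a keyword set."""
    goto = [{}]          # goto[s][c] = forward transition g(s, c)
    fail = [0]           # failure link h(s)
    out = [set()]        # keywords recognized at state s
    for word in keywords:
        s = 0
        for c in word:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(word)                 # s is a final state for this keyword
    # BFS: a node's failure link is the longest proper suffix that is a prefix
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def search(text, goto, fail, out):
    """Report (end_position, keyword) for every keyword occurrence in text."""
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for w in out[s]:
            hits.append((i, w))
    return hits
```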

Theorem 2.1. The Aho-Corasick automaton solves the multi-keyword pattern-matching
problem in O(m + n) time, with m = |r_1| + ... + |r_k|.

In the sequel we shall need to perform string comparisons using the Aho-Corasick
automaton, and the following results by Schieber and Vishkin [SV88] and Berkman and
Vishkin [BV94] will be used to guide us within the automaton:
[Figure 1: The Aho-Corasick automaton for {abca, aabc, acba, aaca}. Non-trivial failure links are shown as dotted lines; final states are shown as patterned squares.]
Theorem 2.2. Let T be a rooted tree with n nodes and let u, v be nodes of the tree.
One can preprocess T in linear time, so that the following queries can be answered in
constant time:
(i) Find the lowest common ancestor of u and v.
(ii) Find the k-th level ancestor on the path from node u to the root, where the
first ancestor of u is the parent of u.
Using the above theorem one can derive the following two useful corollaries:

Corollary 2.3. Given the Aho-Corasick automaton for keywords r_1, ..., r_k and allowing
linear time for preprocessing, the query of testing whether prefix_k(r_i) = r_m
requires constant time.

Proof. We preprocess the automaton as required by the Berkman-Vishkin algorithm.
We can identify the ancestor node s of the leaf r_i which corresponds to the k-th prefix
of r_i in constant time: this node is the (|r_i| − k)-th ancestor on the path from the leaf
r_i (final state) to the root (initial state). The equality holds only when s is also the leaf
(final state) r_m.
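The [BV94] structure answers such level-ancestor queries in O(1) after linear preprocessing; as a simpler stand-in, a binary-lifting table answers the same queries in O(log n) after O(n log n) preprocessing. A sketch (ours, not the constant-time structure of the paper):

```python
def build_ancestor_table(parent):
    """up[j][v] = 2^j-th ancestor of v, or -1 above the root.
    O(n log n) preprocessing, O(log n) per query -- a simple substitute for
    the constant-time level-ancestor structure of [BV94]."""
    n = len(parent)
    levels = max(1, n.bit_length())
    up = [parent[:]]
    for j in range(1, levels):
        prev = up[-1]
        up.append([prev[prev[v]] if prev[v] != -1 else -1 for v in range(n)])
    return up

def kth_ancestor(up, v, k):
    """The k-th ancestor of v, where the 1st ancestor is v's parent."""
    for j in range(len(up)):
        if v == -1:
            break
        if (k >> j) & 1:                 # climb by powers of two
            v = up[j][v]
    return v
```

On a path tree this reproduces the query of Corollary 2.3: the (|r_i| − k)-th ancestor of the leaf r_i is the state for prefix_k(r_i).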
Corollary 2.4. Given the Aho-Corasick automaton for keywords r_1, ..., r_k and allowing
linear time for preprocessing, the query of testing whether prefix_k(r_{i,d}) = r_m
requires constant time, where r_{i,d} is the suffix of r_i which starts at the d-th position of r_i.
[Figure 2: The failure-link path v, u_1, ..., u_t, r from the node v to the root r; the leaf r_i hangs below v.]
Proof. Let r_i = a_1 ... a_q and r_{i,d} = a_d ... a_q. Then prefix_k(r_{i,d}) = a_d ... a_{d+k}. Let v
be the (d + k)-th node of the path from the root to the leaf r_i. Using the failure links it is
possible to construct a path v, u_1, ..., u_t, r from the node v to the root r. It is not difficult
to see that the label of the path from the root r to u_j is a_{d+k−l} ... a_{d+k}, where l
is the distance of u_j from the root r (see Figure 2). One can see that prefix_k(r_{i,d}) is the
label of a path from the root to a node u if and only if u is one of the u_j's. Hence the
equality of the query holds if u_j, for some j, is the leaf r_m.
Given the automaton one can answer the query in constant time as follows.
Consider the following tree T: its root is the initial state of the Aho-Corasick automaton,
its nodes are the nodes of the automaton and its edges are the failure links of the automaton.
Let prefix_k(r_{i,d}) be associated with a node v as above. It is not difficult to see that
prefix_k(r_{i,d}) = r_m if and only if the lowest common ancestor of r_m and v in T is r_m,
a condition that can be checked in constant time by the [BV94] algorithm.
By Fact 2, all borders of a string x or matrix A are candidates for covers. Our
algorithm for the all-covers problem starts by computing all the borders of A and finds
covers among the borders.
The algorithm is subdivided into the following three steps:
1. Compute all the borders of the array and derive the candidates for covers. Let
B_k be the largest candidate border.
2. Compute the longest prefix of B_k which occurs at every position A(i, j) of the
array. Derive the largest border to occur at each position A(i, j).
3. Compute the gaps, if any, between occurrences of each border.
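In one dimension the three steps above collapse to very little code; the following sketch (ours, for intuition only) mirrors the strategy: the candidates are the borders of x, and a border is a cover exactly when consecutive occurrences leave no gap:

```python
def all_covers_1d(x):
    """All cover lengths of x: candidates are the border lengths of x (Fact 2);
    a border of length b is a cover iff consecutive occurrences of it are
    never more than b apart (no 'gap')."""
    n = len(x)
    # borders via the KMP failure (prefix) function
    f = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k and x[i] != x[k]:
            k = f[k]
        if x[i] == x[k]:
            k += 1
        f[i + 1] = k
    borders, b = [], f[n]
    while b:
        borders.append(b)
        b = f[b]
    covers = []
    for b in borders:
        occ = [i for i in range(n - b + 1) if x[i:i + b] == x[:b]]
        if all(q - p <= b for p, q in zip(occ, occ[1:])):   # gap check
            covers.append(b)
    return sorted(covers) + [n]          # x trivially covers itself
```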

3. Computing All Borders and Candidates

Here we describe a linear algorithm for computing all the square borders of an
n × n matrix A. The algorithm makes use of two auxiliary n × n matrices C and R such
that:

C(i, j) = 1, if A[1..i, j] is a border of A[1..n, j]; 0, otherwise.

R(i, j) = 1, if A[j, 1..i] is a border of A[j, 1..n]; 0, otherwise.

We compute all the borders of every row and column of A using the Knuth-Morris-Pratt
([KMP77]) algorithm. If the j-th column of A has a border of length i, then
C(i, j) := 1; similarly, if the j-th row of A has a border of length i, then R(i, j) := 1.
Furthermore, we also use the array

M(i) = 1, if R(i, j) = C(i, j) = 1 for all 1 ≤ j ≤ i AND R(i, l) = 1 for all n − i + 1 ≤ l ≤ n; 0, otherwise.

Lemma 3.1. The square submatrix A[1..i, 1..i] is a border of A if and only if M(i) = 1.
Proof. Assume that M(i) = 1. From the definition of M(i) it follows that R(i, j) =
1 for all 1 ≤ j ≤ i, which in turn implies that rows r_1, r_2, ..., r_i all have borders
of length i. Therefore A[1..i, 1..i] occurs at A(1, n − i + 1). Similarly, from the fact
that C(i, j) = 1 for all 1 ≤ j ≤ i, the columns c_1, c_2, ..., c_i all have borders of length i.
Therefore A[1..i, 1..i] occurs at A(n − i + 1, 1). From the fact that R(i, j) = 1 for all
n − i + 1 ≤ j ≤ n it follows that rows r_{n−i+1}, ..., r_n all have borders of length i, which in
turn implies that A[1..i, 1..i] occurs at A(n − i + 1, n − i + 1).
The reverse follows similarly.

Theorem 3.2. The algorithm above computes all the borders of an n × n matrix A in
O(n^2) time.
begin
    Compute C(i, j), 1 ≤ i, j ≤ n;
    Compute R(i, j), 1 ≤ i, j ≤ n;
    comment Use the KMP algorithm.
    for 1 ≤ i ≤ n do
        if M(i) = 1 then
            return A[1..i, 1..i] as a border of A;
            comment It follows from Lemma 3.1.
    od
end
Algorithm 3.1
Proof. The computation of the borders of each row and column takes O(n) time using
the KMP algorithm, O(n^2) in total. Border verification of each diagonal candidate takes
O(i) time, O(n^2) in total.
Although here we make use of the borders as candidates for covers, one can
obtain a (perhaps) smaller set of candidates by modifying the matrices C and R as
follows:

C(i, j) = 1, if A[1..i, j] is a cover of A[1..n, j]; 0, otherwise.

R(i, j) = 1, if A[j, 1..i] is a cover of A[j, 1..n]; 0, otherwise.

One can compute all the covers of every row and column of A using the [MS94]
algorithm. The verification of the above set of candidates can similarly be done in linear
time. These candidates may lead to a more efficient algorithm (as they may be fewer
than the borders), but this does not change the asymptotic complexity.
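Algorithm 3.1 can be sketched directly: KMP border sets per row and column, then the test of Lemma 3.1 for each i. A small illustrative version (ours), assuming A is given as a list of n equal-length strings:

```python
def border_lengths(s):
    """All non-trivial border lengths of s, via the KMP failure function."""
    n = len(s)
    f = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k and s[i] != s[k]:
            k = f[k]
        if s[i] == s[k]:
            k += 1
        f[i + 1] = k
    out, b = set(), f[n]
    while b:                             # borders of borders are borders
        out.add(b)
        b = f[b]
    return out

def square_borders(A):
    """Sizes i for which the i x i prefix block of A is a border of A,
    using the test M(i) of Lemma 3.1."""
    n = len(A)
    R = [border_lengths(A[j]) for j in range(n)]                                 # rows
    C = [border_lengths(''.join(A[r][j] for r in range(n))) for j in range(n)]   # columns
    return [i for i in range(1, n)
            if all(i in R[j] and i in C[j] for j in range(i))
            and all(i in R[l] for l in range(n - i, n))]
```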

4. Computing the Diagonal Failure Function of a Square Matrix

Let A be an n × n matrix and let A_k denote the prefix A[1..k, 1..k] of A. We define
the diagonal failure function f(i) of A, for 1 ≤ i ≤ n, to be equal to k, where k is the
largest integer such that A[i−k+1..i, i−k+1..i] is a proper prefix of A; if there is
no such k, then f(i) = 0 (see Figure 3(i)). Below we present a linear procedure for
computing the diagonal failure function; a similar algorithm was presented in [ABF92].

The algorithm works by constructing the Aho-Corasick multi-keyword automaton
for the set of words r_1, ..., r_n, c_1, ..., c_n, where r_i is the string A[i, i]A[i, i−1]...A[i, 1] and c_i
is the string A[i, i]A[i−1, i]...A[1, i] (see Figure 3(ii)). The computation is similar to
the computation of the failure function by the KMP algorithm, with the exception that
character comparisons are now substring comparisons. Informally, at iteration i we have
computed f(i) = k and we proceed to compute f(i+1) by comparing r_{k+1} and c_{k+1}
with prefix_{k+1}(r_{i+1}) and prefix_{k+1}(c_{i+1}) respectively. (Recall that prefix_k(x) = x_1 ... x_k
and suffix_k(x) = x_k ... x_n with |x| = n.) Clearly, if both match then f(i+1) = k+1.
Otherwise, as in KMP, we recursively (on k) compare r_{k+1} and c_{k+1} with prefix_{k+1}(r_{i+1})
and prefix_{k+1}(c_{i+1}), with k := f(k), until both strings match.

[Figure 3: (i) the prefix block A_k ending at the diagonal position (i, i); (ii) the reversed row r_i and reversed column c_i ending at (i, i).]
begin
    r_i ← A[i, i]A[i, i−1]...A[i, 1], 1 ≤ i ≤ n;
    c_i ← A[i, i]A[i−1, i]...A[1, i], 1 ≤ i ≤ n;
    R ← {r_1, ..., r_n};
    C ← {c_1, ..., c_n};
    Construct Aho-Corasick automata for R and C;
    Preprocess the automata for LCA queries;
    f(1) ← 0; k ← f(1);
    for i = 2 to n do
        while prefix_{k+1}(r_i) ≠ r_{k+1} or prefix_{k+1}(c_i) ≠ c_{k+1} do
            comment The condition is tested as in Corollary 2.3.
            k ← f(k);
        od
        k ← k + 1;
        f(i) ← k;
    od
end
Algorithm 4.1
Theorem 4.1. The algorithm above computes the diagonal failure function of an n × n
array in O(n^2) time.

Proof. The computation of the Aho-Corasick automaton and its preprocessing requires
O(n^2) time. The condition of the while loop can be tested in constant time as in
Corollary 2.3.
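A direct-comparison version of Algorithm 4.1 (ours, 0-indexed): the structure is exactly the KMP-style loop above, but each substring comparison is done naively rather than in O(1) via the automaton and LCA queries, so this sketch is not linear-time:

```python
def diagonal_failure(A):
    """f[i] = largest k such that the k x k block ending at (i, i) equals the
    k x k prefix block of A -- a 0-indexed diagonal failure function."""
    n = len(A)

    def extends(i, k):
        # can a matching prefix block of size k be extended to size k+1 at (i, i)?
        row_ok = A[i][i - k:i + 1] == A[k][:k + 1]                       # new row
        col_ok = all(A[i - k + r][i] == A[r][k] for r in range(k + 1))   # new column
        return row_ok and col_ok

    f = [0] * n
    k = 0
    for i in range(1, n):
        while k and not extends(i, k):
            k = f[k - 1]                 # fall back along the failure links
        if extends(i, k):
            k += 1
        f[i] = k
    return f
```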

5. Prefix String Matching in Two Dimensions

Let P (the "pattern") and T (the "text") be m × m and n × n matrices respectively.
The two-dimensional prefix string matching problem is to compute the longest prefix
of P that occurs at every position of T.
The algorithm below makes use of the diagonals of the text matrix T (see Figure
4). Starting from the top of each d-diagonal, and sliding downwards (on the d-diagonal),
we iteratively compute the maximum prefix of P at each point of the d-diagonal. At
the next iteration step, we attempt to augment that occurrence by extending it by a
row and a column (in a manner similar to the L-character used by [G93]); this is only
possible when the relevant row and column of the text match the corresponding ones
of the pattern. If such an extension of the occurrence of the prefix of P is not possible,
then we make use of the diagonal failure link, and attempt to extend the prefix pointed
to by the link. Analytically, the pseudo-code below computes the longest prefix of P
that occurs at every position of T; in order to simplify the exposition we only compute
the maximum prefix of the pattern occurring at points below the main diagonal of the
text.

(d,i-d+1)
(d,1)

ci,d
0-diagonal
(i,i-d+1)
ri,d

d-diagonal

Figure 4
9
begin
    r_i ← T[i, n]T[i, n−1]...T[i, 1], 1 ≤ i ≤ n;
    c_i ← T[n, i]T[n−1, i]...T[1, i], 1 ≤ i ≤ n;
    comment The strings r_i, c_i are the rows and columns of the text T, reversed.
    r'_i ← P[i, i]P[i, i−1]...P[i, 1], 1 ≤ i ≤ m;
    c'_i ← P[i, i]P[i−1, i]...P[1, i], 1 ≤ i ≤ m;
    comment The strings r'_i, c'_i are similar to the ones in Figure 3(ii).
    R ← {r_1, ..., r_n, r'_1, ..., r'_m};
    C ← {c_1, ..., c_n, c'_1, ..., c'_m};
    Construct the Aho-Corasick automaton for R and C;
    Compute the diagonal failure function f of P;
    for d = 0 to n−1 do
        Let l_d = n − d be the length of the diagonal;
        r_{i,d} ← T[i, i−d+1] ... T[i, 1];
        c_{i,d} ← T[i, i−d+1] ... T[d, i−d+1];
        comment See Figure 4 for an illustration of r_{i,d} and c_{i,d}.
        k ← 0;
        for j = 1 to l_d do
            while prefix_{k+1}(r_{j,d}) ≠ r'_{k+1} or prefix_{k+1}(c_{j,d}) ≠ c'_{k+1} do
                comment The condition is tested as in Corollary 2.4.
                p(j − k, d) ← k;
                comment The integer p(i, d) is the dimension of the largest prefix
                of the pattern occurring at position (i, i−d+1).
                k ← f(k);
            od
            k ← k + 1;
        od
    od
end
Algorithm 5.1
Theorem 5.1. The algorithm above computes the longest prefix of the m × m array P
occurring at every position of an n × n array T in O(n^2 + m^2) time.

Proof. The computation of the diagonal failure function of P requires O(m^2) time. The
computation of the Aho-Corasick automaton requires O(n^2 + m^2) time. The theorem
follows from Corollaries 2.3 and 2.4.
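For reference, the table p(i, j) that Algorithm 5.1 fills can be pinned down by a brute-force computation; the sketch below (ours) is quadratic per position and serves only to specify what the linear algorithm computes:

```python
def prefix_match_2d(P, T):
    """p[i][j] = side of the largest square prefix of P occurring at T(i, j),
    0-indexed.  Brute force; Algorithm 5.1 fills the same table in
    O(n^2 + m^2) time using the diagonal failure function of P."""
    m, n = len(P), len(T)
    p = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k = 0
            # grow the matching block one row and one column at a time
            while (k < m and i + k < n and j + k < n
                   and T[i + k][j:j + k + 1] == P[k][:k + 1]                     # new row
                   and all(T[i + r][j + k] == P[r][k] for r in range(k + 1))):   # new column
                k += 1
            p[i][j] = k
    return p
```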

6. Computing the Gaps

In this section we focus on the covering problem: given a square matrix A, compute
all submatrices S that cover A. Recall that a submatrix S covers A if every
position of A is within an occurrence of S. The linear algorithm below is based on gap
monitoring techniques.
The algorithm makes use of the fact that a cover of A is also a border. We first
compute all the borders of A; let B_k be the largest one. Next we compute the largest
prefix of B_k that occurs at every position of A. We "round up" these occurrences to
the nearest border size. Then we begin a gap monitoring program starting with B_k. We
consider all occurrences of B_k in A and we check whether there are any positions that
are not covered by an occurrence of B_k; these are called gaps. If there is a gap, then
B_k is not a cover. We proceed by considering whether B_{k−1} covers A. Note that B_{k−1}
occurs in all positions of B_k, thus we have to "add" some positions to the occurrences
of B_k to obtain the set of occurrences of B_{k−1} in A. In fact every insertion of a new
position may "reduce" previous gaps. Also, due to the fact that we now consider a
smaller border B_{k−1} in positions where B_k was, the gap size between previous gaps may
"increase". We monitor all these changes and if all gaps are closed then the border is a
cover. Analytically, the steps of the algorithm are as follows:

6.1 The Gap Monitoring Algorithm

Step 1. Compute all the square b_t × b_t borders B_t, 1 ≤ t ≤ k, of the input array A.
Let B_k be the largest border and, without loss of generality, b_t < b_{t+1}, 1 ≤ t < k. For
1 ≤ i ≤ n, let B(i) = {t : b_t ≤ i < b_{t+1}} and let B'(i) = {t : b_t < i ≤ b_{t+1}}. These
two functions will be used for rounding up the maximum prefixes computed in the next
step.

Step 2. For every position (i, j) of A compute the length d(i, j) of the maximum
prefix of B_k that occurs at that position by using Algorithm 5.1. Let P(i, j) =
B(d(i, j)). We round up the occurrence to the nearest border size because only borders
are candidates for covers (see Fact 2).

Step 3. Let D(t) denote the ordered list of positions (e, j) such that, for some i,
(i) P(i, j) = t,
(ii) i ≤ e ≤ i + b_t.
The list D(t) contains all positions of A which belong to the first column of
an occurrence of B_t. Furthermore, each position (p, q) of A is associated with a range
[p, q, l, r] if and only if (p, q) ∈ D(t) for all l ≤ t ≤ r. Note that a position may be
associated with more than one range. The computation of these ranges is described in
detail in Section 6.2.
Step 4. Here we define the Left and Right functions that allow constant-time
deletion and insertion when updating the list D(t), which will be kept as a doubly
linked list. Consider the range [p, q, l, r]. We compute the
nearest position (p', q') to the left of (p, q) in A with Left([p, q, l, r]) := [p', q', l', r'] and
l' ≤ l ≤ r'. We also compute the nearest position (p'', q'') to the right of (p, q) in A with
Right([p, q, l, r]) := [p'', q'', l'', r''] and l'' ≤ l ≤ r''.
Step 5. We consider the borders from largest to smallest. Let B_t be the current
border. For 1 ≤ s ≤ k, GAP(s) contains the positions of D(t) whose gap d to the next
position satisfies

b_s < d ≤ b_{s+1}.

Also we define

GAP*(t) := GAP(k) ∪ GAP(k−1) ∪ ... ∪ GAP(t).

The list GAP*(t) contains the positions of B_t occurrences whose gap to the next B_t
occurrence is larger than b_t.
Step 6. We process the borders in descending order. Let B_t, t < k, be the current
border. We update the list D(t) and construct the new GAP's as follows. First we
consider all positions (p, q) with range [p, q, l, t]. Note that (p, q) is in D(t) but not in
D(t+1). For each of these (p, q)'s we do the following book-keeping operations:

6.1 Delete (p', q') from its GAP. The position (p', q') is to the left of position
(p, q) in A, and the introduction of the position (p, q) into D(t) will narrow
the gap at position (p', q').

6.2 Put (p', q') into GAP(B'(q − q')). This is the new (perhaps smaller) gap size
at position (p', q'). Note that we round up the gap sizes to the nearest
border size, since only borders are candidates for covers.

6.3 Put (p, q) into GAP(B'(q'' − q)). This gives the gap size at (p, q), using
the position (p'', q''), the nearest position to the right of (p, q). Again we
round up the gap size to the nearest border size.

Now we consider all positions (p, q) with range [p, q, t+1, r]. Note that (p, q) belongs
to D(t+1) but it is not in D(t). Thus we update the gaps as follows:

6.4 Delete (p, q) from its GAP. This is done since (p, q) is no longer in D(t)
and thus no longer part of the cover.

6.5 Put (p', q') into GAP(B'(q'' − q')). The gap at (p', q') has become larger
with the withdrawal of the position (p, q). The new gap is defined by the
position (p'', q''), the position to the right of (p, q).

After all elements of D(t) have been processed (adding the new ones and deleting
old ones as above), the border B_t is a cover if and only if the union GAP(k) ∪ ... ∪ GAP(t)
is empty. Checking whether or not this union is empty can be done in constant time, by
keeping GAP(k), ..., GAP(t) as a doubly linked list.
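The bookkeeping of steps 6.1-6.5 can be illustrated on a single column: a doubly linked list of occurrence positions with an explicit gap value per position, so that one insertion or deletion costs O(1) given its left neighbour (the paper's Left link). A simplified 1D sketch (ours, not the paper's full 2D structure):

```python
class GapMonitor:
    """Occurrence positions of a border of length border_len in a text of
    length n, kept in a doubly linked list with the gap from each position
    to its successor tracked explicitly."""

    def __init__(self, n, border_len):
        self.b = border_len
        last = n - border_len            # a border occurs at both ends
        self.nxt, self.prv = {0: last}, {last: 0}
        self.gaps = {0: last}            # gap from each occurrence to the next

    def insert(self, p, left):
        """Add occurrence p; `left` is its predecessor in the list."""
        right = self.nxt[left]
        self.nxt[left], self.prv[p] = p, left
        self.nxt[p], self.prv[right] = right, p
        self.gaps[left] = p - left       # as in step 6.2: the left gap shrinks
        self.gaps[p] = right - p         # as in step 6.3: new gap at p

    def delete(self, p):
        """Remove occurrence p, as in steps 6.4-6.5."""
        left, right = self.prv[p], self.nxt[p]
        self.nxt[left], self.prv[right] = right, left
        del self.gaps[p], self.nxt[p], self.prv[p]
        self.gaps[left] = right - left   # as in step 6.5: the left gap widens

    def is_cover(self):
        return max(self.gaps.values()) <= self.b   # no gap exceeds the border
```

For example, with text length 11 and border length 4 (occurrences at 0 and 7), inserting an occurrence at 3 closes the only oversized gap.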
Theorem 6.1. The algorithm above checks whether B_t, for all 1 ≤ t ≤ k, covers the
matrix A in O(n^2 + R) time, where R is the number of ranges.

Proof. Step 1 and Step 2 require O(n^2) operations by Theorems 4.1 and 5.1. The
computation of Step 3 is given below (Theorem 6.2) and requires O(R) time. Step 4
also has a one-to-one relationship with the number of ranges, also requiring O(R) time.
One can easily deduce that steps 6.1-6.5 require O(1) operations each, for adding and
deleting items in doubly linked lists; the total number of operations of Step 6 is also
bounded by O(R).

6.2 Computing the Ranges

Consider the j-th column of A. Let e_1 < e_2 < ... < e_l ≤ n be all the positions
of the j-th column at which a border occurs, i.e., P(e_i, j) = t_i ≠ 0, 1 ≤ i ≤ l. Let
p be a position between e_v and e_{v+1}. If e_c + b_{t'_c − 1} < p ≤ e_c + b_{t'_c} ≤ e_c + b_{t_c}, then
(p, j) ∈ D(t) for all t'_c ≤ t ≤ t_c (see Figure 5). We say that (p, j) is covered by a range
of borders b_t such that t'_c ≤ t ≤ t_c. If a position (p, j) is covered by two or more ranges,
say {[p, j, t'_v, t_v], [p, j, t'_c, t_c], ...}, and t'_v ≤ t'_c ≤ t_v, then we merge the ranges, obtaining
[p, j, t'_v, t_c].
[Figure 5: Occurrences e_1, ..., e_l of borders on the j-th column; B_{t_c} is the large internal square matrix and B_{t'_c} is the shaded one.]
The pseudo-code below provides a detailed account of the computation of the ranges
and of D(t).

begin
    for j = 1 to n do
        if B_{t_j} occurs at position (1, j) then
            Create the range [1, j, 1, t_j];
            Link([1, j, 1, t_j]) ← 1;
            comment The pointer Link([p, j, l, r]) points to the starting position
            of the border B_l that covers position p.
        for p = 1 to n do
            if B_t occurs at position (p + 1, j) then
                if there is a range [p, j, l, r] with l ≤ t ≤ r then
                    Create the range [p + 1, j, 1, r];
                    comment In this case we merge the two ranges [p + 1, j, 1, t]
                    and [p + 1, j, l, r], since they overlap.
                else
                    Create the range [p + 1, j, 1, t];
            for each range [p, j, l, r] with r > t and l ≠ r do
                if Link([p, j, l, r]) + b_l > p then
                    Link([p + 1, j, l, r]) ← Link([p, j, l, r]);
                    comment The occurrence of B_l at Link([p, j, l, r]) covers position p + 1,
                    so the left bound of the range is still l. We also create
                    a Link for the new range of (p + 1, j).
                if Right(Link([p, j, l, r])) + b_l > p then
                    Link([p + 1, j, l, r]) ← Right(Link([p, j, l, r]));
                    comment Note that here B_l does not cover position p + 1. The function
                    Right points to the nearest B_l to the right of the Link; if
                    that occurrence covers p + 1, then the left bound of the
                    range remains the same l; we also create the link for the
                    range of (p + 1, j).
                if Link([p, j, l, r]) + b_l = p AND Right(Link([p, j, l, r])) > p + 1 then
                    Link([p + 1, j, l, r]) ← Link([p, j, l, r]);
                    l ← l + 1;
                    comment In this case B_l terminates at position p, and since r > l,
                    the left bound of the range of (p + 1, j) becomes
                    l + 1; we also create the links for the new range.
                Create the range [p + 1, j, l, r];
                comment Note that if r > l, then r is always the right bound of the
                range of (p + 1, j).
            od
        od
    od
end
Theorem 6.2. The algorithm above computes all ranges in O(n^2 + R) time.

Proof. The Right function can be preprocessed as in Step 4 of Section 6.1; this requires
O(n^2) time.
The inner for loop goes through all of the ranges in a column and the outer
loop goes through all columns; since each of the if statements requires constant time,
the loops require O(R) time.

6.3 Counting the Number of Ranges

Let e_1, ..., e_l be the occurrences of a prefix of a border on a column of length n.
We say that the occurrence of B_{t_c} at e_c causes a range at a position p, with
e_{v+1} > p ≥ e_v ≥ e_{c+1}, if p ∈ D_i for all λ ≤ i ≤ t_c, for some integer λ > t_v.

Lemma 6.3. Let e_1, ..., e_l be the occurrences of a prefix of a border on a column of
length n. The occurrence of B_{t_c} at e_c causes a range at a position p, with
e_{v+1} > p ≥ e_v ≥ e_{c+1}, if and only if

b_{t_c} > p − e_c and b_{t_v} < p − e_c.
Proof. The border B_{t_c} causes a range if and only if it covers position p (see Figure 5),
that is, p is a position within the occurrence of B_{t_c} at e_c, or equivalently b_{t_c} > p − e_c.
The occurrence of B_{t_c} at e_c implies p ∈ D_i for B'(p − e_c) ≤ i ≤ t_c. Similarly,
the occurrence of B_{t_v} at e_v implies that p ∈ D_i for B'(p − e_v) ≤ i ≤ t_v. Then B_{t_c} at e_c
causes a range at p if and only if B'(p − e_c) > t_v.

Lemma 6.4. Let e_1, ..., e_l be the occurrences of a prefix of a border on a column of
length n. Let d_j = e_{j+1} − e_j, j = 1, ..., l−1. If e_j causes a range at e_{j+1} for every
j = 1, ..., l−1, then

d_j ≥ (3/2) d_{j+1}    (6.1)

and the total number of ranges caused by these occurrences is O(n).

Proof. By induction on the number of occurrences. One can show that it holds for
l = 5. Assume that it holds for l = k − 1 and that the positions e_2, ..., e_l satisfy (6.1). Let x_j be the
string that starts at position e_j and has length d_j. Using the facts that x_1 is a prefix of
x_2, x_2 is a prefix of x_3 and x_3 is a prefix of x_4, one can find strings c, f, g, h such that

x_1 = cfgh,  x_2 = cfg,  x_3 = cf,  x_4 = c.

Also let

z_1 = cfghcfgcfc,  z_2 = cfgcfc,  z_3 = cfc,  z_4 = c

be the suffixes of the column starting at positions e_1, e_2, e_3 and e_4 respectively.

Case |g| < |c|: From the fact that z_3 is a prefix of z_2 it follows that c = gq for some
string q. From z_1 and z_2 above one can see that x_2 q is a prefix of h x_2 x_3. If |h| ≤ |x_2|/2
then we have that

x_2 = h^k h',  for some prefix h' of h,

which implies c = h^l h'', which in turn implies at least another occurrence of B_{t_1}
between e_2 and e_3, a contradiction. Thus we have

|h| > |x_2|/2  ⟹  d_1 = |x_1| > (3/2) d_2.

Case |g| ≥ |c|: This is similar to the case above.
One can observe that e_1, ..., e_j cause j ranges at positions between e_j and e_{j+1}, i.e.,
j·d_j in total for that region. The total number of ranges is

Σ_{j=1}^{l} j·d_j ≤ Σ_{j=1}^{l} j·(2/3)^{j−1} d_1 = O(n).

Theorem 6.5. The cardinality of the list of ranges created in each column is O(n).
Thus the total number of ranges is R = O(n^2).

Proof. Let e_1, ..., e_l be the occurrences of a prefix of a border on a column of length n.
Let {e_{i_1}, ..., e_{i_l}} be the e_{i_j}'s that cause a range at e_{i_j + 1}, such that no e_v causes a range
at e_u for any e_{i_j} ≤ e_v < e_u ≤ e_{i_{j+1}}. One can note that positions between any e_{i_j} and e_{i_{j+1}} belong
to at most one range caused by one of the borders occurring at e_{i_j}, e_{i_j}+1, e_{i_j}+2, ..., e_{i_{j+1}}.
Additional ranges may only be caused by border occurrences at positions {e_{i_1}, ..., e_{i_l}},
but the number of those ranges is linearly bounded by Lemma 6.4.

7. Conclusion and Open Problems

Theorem 6.5 together with Theorem 6.1 implies that the computation of all covers of
a square matrix A can be done in linear time.
The Aho-Corasick automaton depends on the alphabet; it is still an open question
whether all the covers of a square matrix can be computed in linear time independently
of the alphabet. Also of interest is the PRAM complexity of the same problem.
Another extension of the above problem is that of computing approximate covers and
seeds.

8. References
[ABF92] A. Amir, G. Benson and M. Farach, Alphabet independent two dimensional matching, Proc. 24th ACM Symposium on Theory of Computing, 59-68, 1992.
[AC75] A.V. Aho and M.J. Corasick, Efficient string matching, Comm. ACM, Vol 18, No 6, 333-340, 1975.
[AE93] A. Apostolico and A. Ehrenfeucht, Efficient detection of quasiperiodicities in strings, Theoret. Comput. Sci., 119, 247-265, 1993.
[AFI91] A. Apostolico, M. Farach and C.S. Iliopoulos, Optimal superprimitivity testing for strings, Inform. Process. Lett., 39, 17-20, 1991.
[BBIP94] A.M. Ben-Amram, O. Berkman, C.S. Iliopoulos and K. Park, The subtree max gap problem with application to parallel string covering, Proc. 5th ACM-SIAM Symp. Discrete Algorithms, 501-510, 1994.
[Br92] D. Breslauer, An on-line string superprimitivity test, Inform. Process. Lett., 44, 345-347, 1992.
[Br95] D. Breslauer, Testing string superprimitivity in parallel, Inform. Process. Lett., to appear.
[BV94] O. Berkman and U. Vishkin, Finding level-ancestors in trees, Journal of Computer and System Sciences, Vol 48, 214-230, 1994.
[G93] R. Giancarlo, The suffix tree of a square matrix, with applications, Proc. 4th ACM-SIAM Symposium on Discrete Algorithms, 402-411, 1993.
[IMP93] C.S. Iliopoulos, D.W.G. Moore and K. Park, Covering a string, Proc. 4th Symp. Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol 684, 54-62, 1993.
[IP94] C.S. Iliopoulos and K. Park, An optimal O(log log n) time algorithm for parallel superprimitivity testing, Journal of the Korea Information Science Society, Vol 21, No 8, 1400-1404, 1994.
[KMP77] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings, SIAM Journal of Computing, Vol 6, 322-350, 1977.
[ML84] M.G. Main and R.J. Lorentz, An O(n log n) algorithm for finding all repetitions in a string, Journal of Algorithms, Vol 5, 422-432, 1984.
[MS94] D.W.G. Moore and W.F. Smyth, Computing the covers of a string in linear time, Proc. 5th ACM-SIAM Symp. Discrete Algorithms, 511-515, 1994.
[SV88] B. Schieber and U. Vishkin, On finding lowest common ancestors: simplification and parallelization, SIAM Journal of Computing, Vol 17, No 6, 1253-1262, 1988.

