August 1995
Abstract
Two linear algorithms are presented: one for computing the longest prefix
of a square pattern that occurs at every position of a given 2-dimensional
array, and one for computing all square covers of a given 2-dimensional
array.
Institut Gaspard Monge, Université de Marne-la-Vallée, 2, rue de la Butte Verte, F-93160 Noisy-le-Grand, France. Email: mac@litp.ibp.fr.
† Department of Computer Science, King's College London, Strand, London, England and School of Computing, Curtin University, Perth, WA, Australia. Email: csi@dcs.kcl.ac.uk. Partially supported by SERC grants GR/F 00898 and GR/J 17844, NATO grant CRG 900293, ESPRIT BRA grant 7131 for ALCOM II, and MRC grant G 9115730.
‡ Department of Computer Science, King's College London, Strand, London, U.K. Email: mo@dcs.kcl.ac.uk. Supported by a Medical Research Council Studentship.
1. Introduction
In recent studies of repetitive structures of strings, generalized notions of periods
have been introduced. A typical regularity, the period u of a given string x, captures
the repetitiveness of x, since x is a prefix of a string constructed by concatenations of
u. A substring w of x is called a cover of x if x can be constructed by concatenations
and superpositions of w. A substring w of x is called a seed of x if there exists a
superstring of x which is constructed by concatenations and superpositions of w. For
example, abc is a period of abcabcabca, abca is a cover of abcabcaabca, and abca is a
seed of abcabcaabc. The notions "cover" and "seed" are generalizations of periods in
the sense that superpositions as well as concatenations are considered to define them,
whereas only concatenations are considered for periods.
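The cover condition above can be checked directly from the definition. The sketch below is illustrative (a naive quadratic check, not from the paper; the function names are ours): w covers x exactly when the occurrences of w start at position 0, end at position |x| − |w|, and consecutive starts are never more than |w| apart, so every position of x lies inside some occurrence.

```python
def occurrences(w, x):
    """All start positions (0-based) of w in x, by a naive scan."""
    return [i for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w]

def is_cover(w, x):
    """True if x can be built from concatenations and superpositions of w,
    i.e. every position of x falls inside some occurrence of w."""
    occ = occurrences(w, x)
    if not occ or occ[0] != 0 or occ[-1] != len(x) - len(w):
        return False
    # consecutive occurrences must overlap or at least touch: gap <= |w|
    return all(b - a <= len(w) for a, b in zip(occ, occ[1:]))
```

On the paper's example, `is_cover("abca", "abcabcaabca")` holds, while abc is a period but not a cover of abcabcabca (its last occurrence ends two positions short).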
Given a string x of length n and a pattern p of length m, the string prefix-
matching problem is that of computing the longest prefix of p which occurs at each
position of x. Main and Lorentz introduced the notion of string prefix-matching in
[ML84] and presented a linear algorithm for it. In two dimensions, the prefix-matching
problem is to compute the largest prefix of an m × m pattern matrix P which occurs at
each position of an n × n text matrix T. The 2-dimensional prefix-matching problem
can be solved using the Lsuffix tree construction of Giancarlo [G93]. The Lsuffix tree
for T, defined over an alphabet Σ, takes O(n²(log |Σ| + log n²)) time to build. All
occurrences of a pattern P in T can be found in O(m² log |Σ| + t) time, where t is the
total number of occurrences. We present an optimal linear-time algorithm for the 2-
dimensional prefix-matching problem, which uses the powerful notion of a 2-dimensional
failure function to reduce the number of substring comparisons.
In the computation of covers, two problems have been considered in the literature.
The shortest-cover problem is that of computing the shortest cover of a given string
of length n, and the all-covers problem is that of computing all the covers of a given
string. Apostolico, Farach and Iliopoulos [AFI91] introduced the notion of covers and
gave a linear-time algorithm for the shortest-cover problem. Breslauer [Br92] presented a
linear-time on-line algorithm for the same problem. Moore and Smyth [MS94] presented
a linear-time algorithm for the all-covers problem. In parallel computation, Breslauer
[Br95] gave an optimal O(α(n) log log n)-time algorithm for the shortest cover, where
α(n) is the inverse Ackermann function. Iliopoulos and Park [IP94] gave an optimal
O(log log n)-time (thus work-time optimal) algorithm for the shortest-cover problem.
Iliopoulos, Moore and Park [IMP93] introduced the notion of seeds and gave
an O(n log n)-time algorithm for computing all the seeds of a given string of length n.
For the same problem, Ben-Amram, Berkman, Iliopoulos and Park [BBIP94] presented
a parallel algorithm that requires O(log n) time and O(n log n) work. Apostolico and
Ehrenfeucht [AE93] considered yet another problem related to covers.
In this paper we generalize the all-covers problem to two dimensions and we present
an optimal linear algorithm for the problem. Let S be a square submatrix of a square
matrix A; we say that S covers A (or equivalently, S is a cover of A) if every point
of A is within an occurrence of S. The 2-dimensional all-covers problem is as follows:
given a 2-dimensional square matrix A, compute all square submatrices S that cover
A. While the algorithms for the shortest-cover problem [AFI91, Br92, Br95, IP94] rely
mostly on string properties, our algorithm for the 2-dimensional all-covers problem is
based on the Aho-Corasick automaton and "gap" monitoring techniques.
The paper is organized as follows: in the next section we present some definitions
and results used in the sequel. In Section 3 we present a linear algorithm for computing
all the borders of a square matrix. In Section 4 we present a linear algorithm for
computing the diagonal failure function of a square matrix (an algorithm similar to
the one in [ABF92]). In Section 5 we present a linear algorithm for the 2-dimensional
prefix-matching problem. Finally, in Section 6, we present a linear algorithm
for computing all the covers of square matrices.
2. Preliminaries
A string is a sequence of zero or more symbols from an alphabet Σ. The set of
all strings over the alphabet Σ is denoted by Σ*. A string x of length n is represented
by x_1 … x_n, where x_i ∈ Σ for 1 ≤ i ≤ n. A string w is a substring of x if x = uwv for
u, v ∈ Σ*. A string w is a prefix of x if x = wu for u ∈ Σ*. Similarly, w is a suffix of
x if x = uw for u ∈ Σ*. The string xy is a concatenation of two strings x and y. The
concatenation of k copies of x is denoted by x^k. A string u is a period of x if x is a
prefix of u^k for some k, or equivalently if x is a prefix of ux. The period of a string x is
the shortest period of x.
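A standard consequence of these definitions is that the period of x has length |x| minus the length of the longest proper border of x. As a brief illustrative sketch (function names are ours), the longest border, and hence the period, can be read off the KMP failure function:

```python
def border_array(x):
    """b[i] = length of the longest proper border of x[:i+1]
    (the KMP failure function)."""
    b = [0] * len(x)
    k = 0
    for i in range(1, len(x)):
        while k > 0 and x[i] != x[k]:
            k = b[k - 1]
        if x[i] == x[k]:
            k += 1
        b[i] = k
    return b

def shortest_period(x):
    """Length of the period of x: |x| minus the longest proper border."""
    return len(x) - border_array(x)[-1]
```

For example, abcabcabca has longest border abcabca (length 7), so its period has length 10 − 7 = 3, namely abc.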
A two-dimensional string is an n × m matrix of nm symbols drawn from an
alphabet Σ. The n × n square matrix A can be represented by A[1…n, 1…n]. An m × m
matrix M is a submatrix of A if the upper left corner of M can be aligned with an element
A(i, j), 1 ≤ i, j ≤ n − m + 1, and M[1…m, 1…m] = A[i…i+m−1, j…j+m−1].
In this case, the submatrix M is said to occur at A(i, j). A submatrix M is said to be
a prefix of A if M occurs at A(1, 1). Similarly, a submatrix M is a suffix of A if M
occurs at A(n−m+1, n−m+1).
A string b is a border of x if b is both a prefix and a suffix of x. The empty string and
x itself are trivial borders of x. An m × m submatrix M is a border of A if M occurs at
positions A(1, 1), A(n−m+1, 1), A(1, n−m+1) and A(n−m+1, n−m+1). The
empty matrix and the matrix A itself are trivial borders of A.
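As an illustrative sketch (naive, not the paper's algorithm; 0-based indices), the 2-dimensional border condition can be tested directly from the definition by checking the four corner occurrences:

```python
def occurs_at(M, A, i, j):
    """True if the m x m submatrix M occurs with its upper-left
    corner at A[i][j] (0-based)."""
    m = len(M)
    return all(A[i + r][j + c] == M[r][c]
               for r in range(m) for c in range(m))

def is_border(M, A):
    """M is a border of A if it occurs at all four corners of A."""
    n, m = len(A), len(M)
    return all(occurs_at(M, A, i, j)
               for i in (0, n - m) for j in (0, n - m))
```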
Fact 1. A string u is a period of x = ub if and only if b is a non-trivial border of x.
Proof. It follows immediately from the definitions of period and border.
In the sequel we shall need to perform string comparisons using the Aho-Corasick
automaton, and the following results by Schieber and Vishkin [SV88] and Berkman and
Vishkin [BV94] will be used in guiding us within the automaton:
Figure 1
The Aho-Corasick automaton for {abca, aabc, acba, aaca}. Non-trivial failure links
are shown as dotted lines. Final states are shown as patterned squares.
Theorem 2.2 Let T be a rooted tree with n nodes and let u, v be nodes of the tree.
One can preprocess T in linear time, so that the following queries can be answered in
constant time:
(i) Find the lowest common ancestor of u and v.
(ii) Find the k-th level ancestor node on the path from node u to the root, where the
first ancestor of u is the parent of u.
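For intuition, here is a naive sketch of the two queries of Theorem 2.2 (walking up parent pointers, so each query costs O(depth) rather than the O(1) achieved by the cited schemes after linear preprocessing; the function names and the parent-array representation are ours):

```python
def node_depths(parent):
    """Depths of all nodes from a parent array; parent[root] is None."""
    depth = {}
    def d(v):
        if v not in depth:
            depth[v] = 0 if parent[v] is None else d(parent[v]) + 1
        return depth[v]
    for v in range(len(parent)):
        d(v)
    return depth

def lca(parent, depth, u, v):
    """Lowest common ancestor of u and v, by walking up."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

def level_ancestor(parent, u, k):
    """k-th level ancestor of u; the 1st ancestor is the parent of u."""
    for _ in range(k):
        u = parent[u]
    return u
```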
Using the above theorem one can derive the following two useful corollaries:
Corollary 2.3 Given the Aho-Corasick automaton for keywords r_1, …, r_k, and al-
lowing linear time for preprocessing, the query of testing whether prefix_k(r_i) = r_m
requires constant time.
Figure 2
(The path from the root to the leaf r_i, the node v, and the nodes u_1, …, u_{j−1}, u_j.)
Proof. Let r_i = σ_1…σ_q and r_{i,d} = σ_d…σ_q. Then we have prefix_k(r_{i,d}) = σ_d…σ_{d+k}. Let v
be the (d+k)-th node of the path from the root to the leaf r_i. Using the failure links it is
possible to construct a path v, u_1, …, u_t, r from the node v to the root r. It is not difficult
to see that the label of the path from the root r to u_j is s := σ_{d+k−l}…σ_{d+k}, where l
is the distance of u_j from the root r (see Figure 2). One can see that prefix_k(r_{i,d}) is a
label of a path from the root to a node u if and only if u is one of the u_j's. One can
say that equality of the query holds if u_j, for some j, is the leaf r_m.
Given the automaton one can answer the query in constant time as follows.
Consider the following tree T: its root is the initial state of the Aho-Corasick automaton,
its nodes are the nodes of the automaton, and its edges are the failure links of the automaton.
Let prefix_k(r_{i,d}) be associated with a node v as above. It is not difficult to see that
prefix_k(r_{i,d}) = r_m if and only if the lowest common ancestor of r_m and v in T is r_m,
a condition that can be checked in constant time by the [BV94] algorithm.
By Fact 2, all borders of a string x or matrix A are candidates for covers. Our
algorithm for the all-covers problem starts by computing all the borders of A and finds
covers among the borders.
The algorithm is subdivided into the following three steps:
1. Compute all the borders of the array and derive the candidates for covers. Let
B_k be the largest candidate border.
2. Compute the longest prefix of B_k which occurs at every position A(i, j) of the
array. Derive the largest border to occur at each position A(i, j).
3. Compute the gaps, if any, between occurrences of each border.
C(i, j) = 1 if A[1…i, j] is a border of A[1…n, j], and 0 otherwise;
R(i, j) = 1 if A[j, 1…i] is a border of A[j, 1…n], and 0 otherwise.

We compute all the borders of every row and column of A using the Knuth, Morris
and Pratt ([KMP77]) algorithm. If the j-th column of A has a border of length i, then
C(i, j) := 1; similarly, if the j-th row of A has a border of length i, then R(i, j) := 1.
Furthermore, we also use the array

M(i) = 1 if R(i, j) = C(i, j) = 1 for all 1 ≤ j ≤ i and R(i, l) = 1 for all
n − i + 1 ≤ l ≤ n, and 0 otherwise.
Lemma 3.1. The square submatrix A[1…i, 1…i] is a border of A if and only if M(i) = 1.
Proof. Assume that M(i) = 1. From the definition of M(i) it follows that R(i, j) =
1 for all 1 ≤ j ≤ i, which in turn implies that rows r_1, r_2, …, r_i all have borders
of length i. Therefore A[1…i, 1…i] occurs at A(1, n − i + 1). Similarly, from the fact
that C(i, j) = 1 for all 1 ≤ j ≤ i, the columns c_1, c_2, …, c_i all have borders of length i.
Therefore A[1…i, 1…i] occurs at A(n − i + 1, 1). From the fact that R(i, j) = 1 for all
n − i + 1 ≤ j ≤ n it follows that rows r_{n−i+1}, …, r_n all have borders of length i, which in
turn implies that A[1…i, 1…i] occurs at A(n − i + 1, n − i + 1).
The reverse direction follows similarly.
Theorem 3.2 The algorithm above computes all the borders of an n × n matrix A in
O(n²) time.
begin
    Compute C(i, j), 1 ≤ i, j ≤ n;
    Compute R(i, j), 1 ≤ i, j ≤ n;
    comment Use the KMP algorithm.
    for i = 1 to n do
        if M(i) = 1 then
            return A[1…i, 1…i] as a border of A;
            comment It follows from Lemma 3.1.
    od
end
Algorithm 3.1
Proof. The computation of the borders of each row and column takes O(n) time using
the KMP algorithm, O(n²) in total. Border verification of each diagonal candidate takes
O(i) time, O(n²) in total.
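The border computation of Algorithm 3.1 can be sketched as follows (an illustrative implementation with 0-based indices and our own function names; as in the paper, the border lengths of each row and column are read off the KMP failure function, and the M(i) condition of Lemma 3.1 selects the square borders):

```python
def border_array(s):
    """b[i] = length of the longest proper border of s[:i+1] (KMP)."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b

def all_border_lengths(s):
    """All non-trivial border lengths of s, via the failure-function chain."""
    out = set()
    b = border_array(s)
    k = b[-1] if s else 0
    while k > 0:
        out.add(k)
        k = b[k - 1]
    return out

def square_borders(A):
    """Sizes i < n such that A[0:i] x A[0:i] is a border of the n x n array A:
    the top i rows and left i columns have borders of length i, and so do
    the bottom i rows (the M(i) condition of Lemma 3.1)."""
    n = len(A)
    row_b = [all_border_lengths(''.join(r)) for r in A]
    col_b = [all_border_lengths(''.join(A[i][j] for i in range(n)))
             for j in range(n)]
    return [i for i in range(1, n)
            if all(i in row_b[j] and i in col_b[j] for j in range(i))
            and all(i in row_b[l] for l in range(n - i, n))]
```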
Although here we make use of the borders as candidates for covers, one can
obtain a (perhaps) smaller set of candidates by modifying the matrices C and R as
follows:
C(i, j) = 1 if A[1…i, j] is a cover of A[1…n, j], and 0 otherwise;
R(i, j) = 1 if A[j, 1…i] is a cover of A[j, 1…n], and 0 otherwise.
One can compute all the covers of every row and column of A using the [MS94]
algorithm. The verification of the above set of candidates can similarly be done in linear
time. These candidates may lead to a more efficient algorithm (as they may be fewer
than the borders), but this does not change the asymptotic complexity.
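For a single row or column, the covers can also be checked directly from the definition. The sketch below is a naive O(n²) stand-in for the linear-time [MS94] algorithm: a prefix of length c covers a string of length n exactly when its occurrences start at 0, end at n − c, and leave no gap longer than c.

```python
def all_covers(x):
    """All lengths c such that x[:c] covers x (naive quadratic check,
    not the linear-time method of [MS94])."""
    n = len(x)
    res = []
    for c in range(1, n + 1):
        w = x[:c]
        occ = [i for i in range(n - c + 1) if x[i:i + c] == w]
        # occ always contains 0; check the last occurrence and the gaps
        if occ[-1] == n - c and all(b - a <= c for a, b in zip(occ, occ[1:])):
            res.append(c)
    return res
```

On the paper's example string abcabcaabca this yields the cover abca and the trivial cover, the string itself.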
Figure 3
((i) the prefix A_k of the array; (ii) the reversed row r_i and the reversed
column c_i meeting at position (i, i).)
is the string A[i, i]A[i−1, i]…A[1, i] (see Figure 3-(ii)). The computation is similar to
the computation of the failure function by the KMP algorithm, with the exception that
character comparisons are now substring comparisons. Informally, at iteration i we have
computed f(i) = k and we proceed to compute f(i+1) by comparing r_{k+1} and c_{k+1}
with prefix_{k+1}(r_{i+1}) and prefix_{k+1}(c_{i+1}) respectively. (Recall that prefix_k(x) = x_1…x_k
and suffix_k(x) = x_k…x_n with |x| = n.) Clearly, if both match then f(i+1) = k+1.
Otherwise, as in KMP, we recursively (on k) compare r_{k+1} and c_{k+1} with prefix_{k+1}(r_{i+1})
and prefix_{k+1}(c_{i+1}), with k := f(k), until both strings match.
begin
    r_i ← A[i, i]A[i, i−1]…A[i, 1], 1 ≤ i ≤ n;
    c_i ← A[i, i]A[i−1, i]…A[1, i], 1 ≤ i ≤ n;
    R ← {r_1, …, r_n};
    C ← {c_1, …, c_n};
    Construct Aho-Corasick automata for R and C;
    Preprocess the automata for LCA queries;
    f(1) ← 0; k ← f(1);
    for i = 2 to n do
        while prefix_{k+1}(r_i) ≠ r_{k+1} or prefix_{k+1}(c_i) ≠ c_{k+1} do
            comment The condition is tested as in Corollary 2.3.
            k ← f(k);
        od
        k ← k + 1;
        f(i) ← k;
    od
end
Algorithm 4.1
Theorem 4.1 The algorithm above computes the failure function of an n × n array in
O(n²) time.
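A naive sketch of the diagonal failure function may help fix the definition (direct substring comparisons in place of the Aho-Corasick/LCA machinery of Algorithm 4.1, hence O(n³) overall rather than O(n²); 0-based array indices, with f kept 1-based as in the paper):

```python
def diag_failure(A):
    """Diagonal failure function of the n x n array A: f[i] is the largest
    k < i such that the k x k prefix of A also occurs with its lower-right
    corner at position (i, i) (1-based)."""
    n = len(A)

    def occurs_as_suffix(i, k):
        # does the k x k prefix of A occur with lower-right corner (i, i)?
        return all(A[i - k + r][i - k + c] == A[r][c]
                   for r in range(k) for c in range(k))

    f = [0] * (n + 1)          # f[1] = 0
    k = 0
    for i in range(2, n + 1):
        while k > 0 and not occurs_as_suffix(i, k + 1):
            k = f[k]
        if occurs_as_suffix(i, k + 1):
            k += 1
        f[i] = k
    return f
```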
Figure 4
(The 0-diagonal and the d-diagonal of T; r_{i,d} is the reversed row starting at
the diagonal position (i, i−d+1), and c_{i,d} is the reversed column from
(i, i−d+1) up to (d, i−d+1).)
begin
    r_i ← T[i, n]T[i, n−1]…T[i, 1], 1 ≤ i ≤ n;
    c_i ← T[n, i]T[n−1, i]…T[1, i], 1 ≤ i ≤ n;
    comment The strings r_i, c_i are the rows and columns of the text T reversed.
    r′_i ← P[i, i]P[i, i−1]…P[i, 1], 1 ≤ i ≤ m;
    c′_i ← P[i, i]P[i−1, i]…P[1, i], 1 ≤ i ≤ m;
    comment The strings r′_i, c′_i are similar to the ones in Figure 3-(ii).
    R ← {r_1, …, r_n, r′_1, …, r′_m};
    C ← {c_1, …, c_n, c′_1, …, c′_m};
    Construct the Aho-Corasick automaton for R and C;
    Compute the failure function f of P;
    for d = 0 to n−1 do
        Let l_d = n − d be the length of the diagonal;
        r_{i,d} ← T[i, i−d+1] … T[i, 1];
        c_{i,d} ← T[i, i−d+1] … T[d, i−d+1];
        comment See Figure 4 for an illustration of r_{i,d} and c_{i,d}.
        k ← 0;
        for j = 1 to l_d do
            while prefix_{k+1}(r_{j,d}) ≠ r′_{k+1} or prefix_{k+1}(c_{j,d}) ≠ c′_{k+1} do
                comment The condition is tested as in Corollary 2.4.
                p(j − k, d) ← k;
                comment The integer p(i, d) is the dimension of the largest prefix
                of the pattern occurring at position (i, i−d+1).
                k ← f(k);
            od
            k ← k + 1;
        od
    od
end
Algorithm 5.1
Theorem 5.1 The algorithm above computes the longest prefix of the m × m array P
occurring at every position of an n × n array T in O(n² + m²) time.
Proof. The computation of the failure function of P requires O(m²) time. The compu-
tation of the Aho-Corasick automaton requires O(n² + m²) time. The theorem follows
from Corollaries 2.3 and 2.4.
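As a reference for the output of Algorithm 5.1, the quantity p(i, j), the size of the largest square prefix of P occurring at position (i, j) of T, can be computed naively as follows (an illustrative sketch with 0-based indices, not the paper's linear-time method; each extension from size k to k + 1 only re-checks the new row and column of the candidate square):

```python
def prefix_match_2d(P, T):
    """p[i][j] = size of the largest square prefix of the m x m pattern P
    occurring at position (i, j) of the n x n text T (0-based).
    Naive reference implementation."""
    n, m = len(T), len(P)
    p = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k = 0
            while (k < m and i + k < n and j + k < n
                   and all(T[i + r][j + k] == P[r][k] for r in range(k + 1))
                   and all(T[i + k][j + c] == P[k][c] for c in range(k + 1))):
                k += 1
            p[i][j] = k
    return p
```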
6. Computing the Gaps
In this section we focus on the covering problem: given a square matrix A, com-
pute all submatrices S that cover A. Recall that a submatrix S covers A if every
position of A is within an occurrence of S. The linear algorithm below is based on gap
monitoring techniques.
The algorithm makes use of the fact that a cover of A is also a border. We first
compute all the borders of A; let B_k be the largest one. Next we compute the largest
prefix of B_k that occurs at every position of A. We "round up" these occurrences to
the nearest border size. Then we begin a gap monitoring program starting with B_k. We
consider all occurrences of B_k in A and we check whether there are any positions that
are not covered by an occurrence of B_k; these are called gaps. If there is a gap, then
B_k is not a cover. We proceed by considering whether B_{k−1} covers A. Note that B_{k−1}
occurs in all positions where B_k does, so we have to "add" some positions to the occurrences
of B_k to obtain the set of occurrences of B_{k−1} in A. In fact, every insertion of a new
position may "reduce" previous gaps. Also, because we now consider a
smaller border B_{k−1} in positions where B_k was, the gap size between previous occurrences may
"increase". We monitor all these changes, and if all gaps are closed then the border is a
cover. Analytically, the steps of the algorithm are as follows:
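The gap-monitoring idea is easiest to see in one dimension. The sketch below is illustrative only (quadratic, not the paper's linear scheme): it examines the non-trivial borders of a string from longest to shortest, accumulating occurrence positions as the candidate shrinks (an occurrence of B_k is also the start of an occurrence of B_{k−1}), and reports a border as a cover when no gap remains.

```python
def covers_among_borders(x):
    """Lengths of the non-trivial borders of x that cover x, examined
    longest first in the spirit of the gap-monitoring scheme."""
    n = len(x)
    # KMP failure function gives the border chain
    b = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and x[i] != x[k]:
            k = b[k - 1]
        if x[i] == x[k]:
            k += 1
        b[i] = k
    borders = []
    c = b[-1]
    while c > 0:
        borders.append(c)
        c = b[c - 1]
    occ = set()
    covers = []
    for c in sorted(borders, reverse=True):   # longest candidate first
        w = x[:c]
        # add the positions where the (shorter) candidate now occurs
        occ |= {i for i in range(n - c + 1) if x[i:i + c] == w}
        pos = sorted(occ)
        # a gap is an uncovered position: border c covers x iff none remains
        if pos[-1] == n - c and all(q - p <= c for p, q in zip(pos, pos[1:])):
            covers.append(c)
    return covers
```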
Proof. Step 1 and Step 2 require O(n²) operations by Theorems 4.1 and 5.1. The
computation of Step 3 is given below (Theorem 6.2) and it requires O(R) time. Step 4
also has a one-to-one relationship with the number of ranges, also requiring O(R) time.
One can easily deduce that steps 6.1-6.5 require O(1) operations each, for adding and
deleting items in doubly linked lists; the total number of operations of Step 6 is also
bounded by O(R).
Figure 5
(The occurrences e_1, e_2, …, e_c, e_{c+1}, …, e_v, e_{v+1}, …, e_l and a position p
on the j-th column.) B_{t_c} is the large internal square matrix and B_{t̂_c} is the
shaded one.
The pseudo-code below provides a detailed account of the range computation
and D(t).
begin
    for j = 1 to n do
        if B_t occurs at position (1, j) then
            Create the range [1, j, 1, t];
            Link([1, j, 1, t]) ← 1;
            comment The pointer Link([p, j, l, r]) points to the starting position
            of the border B_l that covers position p.
        for p = 1 to n do
            if B_t occurs at position (p + 1, j) then
                if there is a range [p, j, l, r] with l ≤ t ≤ r then
                    Create the range [p + 1, j, 1, r];
                    comment In this case we merge the two ranges [p + 1, j, 1, t]
                    and [p + 1, j, l, r], since they overlap.
                else
                    Create the range [p + 1, j, 1, t];
                    comment In this case there is no overlapping range to merge with.
            for each range [p, j, l, r] with r > t and l ≠ r do
                if Link([p, j, l, r]) + b_l > p then
                    Link([p + 1, j, l, r]) ← Link([p, j, l, r]);
                    comment The occurrence of B_l at Link([p, j, l, r]) covers position
                    p + 1, so the left bound of the range is still l. We also create
                    a Link for the new range of (p + 1, j).
                if Right(Link([p, j, l, r])) + b_l > p then
                    Link([p + 1, j, l, r]) ← Right(Link([p, j, l, r]));
                    comment Note that B_l does not cover position p + 1. The function
                    Right points to the nearest B_l to the right of the Link; if
                    that occurrence covers p + 1, then the left bound of the
                    range remains the same l; we also create the link for the
                    range of (p + 1, j).
                if Link([p, j, l, r]) + b_l = p AND Right(Link([p, j, l, r])) > p + 1 then
                    Link([p + 1, j, l, r]) ← Link([p, j, l, r]);
                    l ← l + 1;
                    comment In this case B_l terminates at position p, and since r > l,
                    the left bound of the range of (p + 1, j) becomes l + 1;
                    we also create the links for the new range.
                Create the range [p + 1, j, l, r];
                comment Note that if r > l, then r is always the right bound of the
                range of (p + 1, j).
            od
        od
end.
Theorem 6.2 The algorithm above computes all ranges in O(n² + R) time.
Proof. The Right function can be preprocessed as in Step 4 of Section 6.1; this requires
O(n²) time.
The internal for loop goes through all of the ranges in a column and the outer
loop goes through all columns; since each of the if statements requires constant time,
the loops require O(R) time.
Proof. By induction on the number of occurrences. One can show that it holds for
l = 5. Assume that it holds for l = k − 1 and that positions e_2, …, e_l satisfy (6.1). Let x_j be the
string that starts at position e_j and has length d_j. Using the facts that x_2 is a prefix of
x_1, x_3 is a prefix of x_2, and x_4 is a prefix of x_3, one can find strings c, f, g, h such that

x_1 = cfgh
x_2 = cfg
x_3 = cf
x_4 = c

also let

z_1 = cfghcfgcfc
z_2 = cfgcfc
z_3 = cfc
z_4 = c

be the suffixes of the column starting at positions e_1, e_2, e_3 and e_4 respectively.

Case of |g| < |c|: From the fact that z_3 is a prefix of z_2 it follows that c = gq for some
string q. From z_1 and z_2 above one can see that x_2 q is a prefix of h x_2 x_3. If |h| ≤ |x_2|/2
then we have that

x_2 = h^k h′, for some prefix h′ of h,

which implies c = h^l h″, which in turn implies at least another occurrence of B_{t_1}
between e_2 and e_3, a contradiction. Thus we have

|h| > |x_2|/2  ⟹  d_1 = |x_1| > (3/2) d_2.

Case of |g| ≥ |c|: This is similar to the case above.

One can observe that e_1, …, e_j cause j ranges at positions between e_j and e_{j+1}, or j d_j
in total for that region. The total number of ranges is

Σ_{j=1}^{l} j d_j ≤ Σ_{j=1}^{l} j (2/3)^{j−1} d_1 = O(n).
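The last bound rests on the convergence of Σ_{j≥1} j (2/3)^{j−1} to the constant 1/(1 − 2/3)² = 9; a quick numeric check:

```python
# sum_{j>=1} j * r**(j-1) = 1 / (1 - r)**2; for r = 2/3 this is 9,
# so the total range count is at most a constant times d_1 <= n.
s = sum(j * (2 / 3) ** (j - 1) for j in range(1, 200))
assert abs(s - 9) < 1e-6
```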
Theorem 6.5 The cardinality of the list of ranges created in each column is O(n).
Thus the total number of ranges is R = O(n²).
Proof. Let e_1, …, e_l be the occurrences of a prefix of a border on a column of length n.
Let {e_{i_1}, …, e_{i_l}} be the e_{i_j}'s that cause a range to e_{i_{j+1}}, but such that no e_v causes
a range to e_u for any e_{i_j} ≤ e_v < e_u ≤ e_{i_{j+1}}. One can note that positions between any
e_{i_j} and e_{i_{j+1}} belong to at most one range caused by one of the borders occurring at
e_{i_j}, e_{i_j + 1}, e_{i_j + 2}, …, e_{i_{j+1}}.
Additional ranges may only be caused by border occurrences at positions {e_{i_1}, …, e_{i_l}}.
But the number of those ranges is linearly bounded by Lemma (7.2).
9. References
[ABF92] A. Amir, G. Benson and M. Farach, Alphabet independent two dimensional match-
ing, Proc. 24th ACM Symposium on Theory of Computing, 59-68, 1992.
[AC75] A.V. Aho and M.J. Corasick, Efficient string matching, Comm. ACM, Vol 18, No
6, 333-340, 1975.
[AE93] A. Apostolico and A. Ehrenfeucht, Efficient detection of quasiperiodicities in
strings, Theoret. Comput. Sci. 119, 247-265, 1993.
[AFI91] A. Apostolico, M. Farach and C.S. Iliopoulos, Optimal superprimitivity testing for
strings, Inform. Process. Lett. 39, 17-20, 1991.
[BBIP94] A.M. Ben-Amram, O. Berkman, C.S. Iliopoulos and K. Park, The subtree max gap
problem with application to parallel string covering, Proc. 5th ACM-SIAM Symp.
Discrete Algorithms, 501-510, 1994.
[Br92] D. Breslauer, An on-line string superprimitivity test, Inform. Process. Lett. 44,
345-347, 1992.
[Br95] D. Breslauer, Testing string superprimitivity in parallel, Inform. Process. Lett.,
to appear.
[BV94] O. Berkman and U. Vishkin, Finding level-ancestors in trees, Journal of Computer
and System Sciences, Vol 48, 214-230, 1994.
[G93] R. Giancarlo, The suffix tree of a square matrix, with applications, Proc. 4th
ACM-SIAM Symposium on Discrete Algorithms, 402-411, 1993.
[IMP93] C.S. Iliopoulos, D.W.G. Moore and K. Park, Covering a string, Proc. 4th Symp.
Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol 684,
54-62, 1993.
[IP94] C.S. Iliopoulos and K. Park, An optimal O(log log n) time algorithm for parallel
superprimitivity testing, Journal of the Korea Information Science Society, Vol
21, No 8, 1400-1404, 1994.
[KMP77] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern matching in strings, SIAM
Journal on Computing, Vol 6, 322-350, 1977.
[ML84] M.G. Main and R.J. Lorentz, An O(n log n) algorithm for finding all
repetitions in a string, Journal of Algorithms, Vol 5, 422-432, 1984.
[MS94] D.W.G. Moore and W.F. Smyth, Computing the covers of a string in linear time,
Proc. 5th ACM-SIAM Symp. Discrete Algorithms, 511-515, 1994.
[SV88] B. Schieber and U. Vishkin, On finding lowest common ancestors: simplification
and parallelization, SIAM Journal on Computing, Vol 17, No 6, 1253-1262, 1988.