TUCS Laboratory
Discrete Mathematics for Information Technology
1 Introduction
Numerical quantities such as Parikh vectors, [10], associated with words often
make the considerations easier because operations become commutative. The
numerical quantity investigated in this paper is $|w|_u$, the number of occurrences
of a word u as a (scattered) subword of a word w. Parikh matrices introduced
in [7] and investigated further, for instance, in [8, 9, 4, 2, 15, 16, 17, 19, 20, 21]
have these quantities as their entries. Thus, Parikh matrices constitute a central
tool in investigations dealing with subword occurrences.
This paper deals with some very fundamental properties of words. Because
of the noncommutativity of words, most problems are difficult to handle mathe-
matically. By arithmetization, that is, by expressing words in terms of numbers,
one is sometimes able to reach a situation where the products are commutative.
The theory of formal power series, [5], contains numerous such constructions.
The Parikh vector, [10, 12], $\Psi(w) = (i_1, \ldots, i_k)$ indicates the number of
occurrences of the letter $a_j$, $1 \le j \le k$, in w, provided w is over the alphabet
$\Sigma = \{a_1, \ldots, a_k\}$. To get more information about a word, one has to focus
attention on subwords and factors. In this paper, these notions are understood
as follows. A word u is a subword of a word w if there exist words $x_1, \ldots, x_n$
and $y_0, \ldots, y_n$, some of them possibly empty, such that
$$u = x_1 \cdots x_n \quad\text{and}\quad w = y_0 x_1 y_1 \cdots x_n y_n.$$
The word u is a factor of w if there are words x and y such that w = xuy. If
the word x (resp. y) is empty, then u is also called a prefix (resp. suffix) of w.
(2, 4), (2, 6), (2, 7), (3, 4), (3, 6), (3, 7), (5, 6), (5, 7).
After a little practice, one is able to determine the numbers |w|u quickly, es-
pecially for words over the binary alphabet {a, b}. For instance, each of the
following six words w
satisfies |w|ab = 8. There are infinitely many binary words w with |w|ab = 8
but the above six are the only ones with the Parikh vector (5, 3).
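The claim about the Parikh vector (5, 3) can be checked by exhaustive search. The following is a minimal sketch, not part of the original text; the helper names `count_subword` and `words_with_parikh_vector` are introduced here purely for illustration. It counts scattered-subword occurrences with a standard dynamic-programming pass and lists every binary word with five a's and three b's that contains exactly eight occurrences of ab.

```python
from itertools import combinations


def count_subword(w: str, u: str) -> int:
    """Return |w|_u, the number of occurrences of u as a scattered subword of w."""
    # ways[j] = number of ways to embed the first j letters of u in the part of w read so far
    ways = [1] + [0] * len(u)  # ways[0] = 1 encodes the convention |w|_lambda = 1
    for letter in w:
        # Scan u right to left so that one letter of w extends each embedding at most once.
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(u)]


def words_with_parikh_vector(num_a: int, num_b: int):
    """Yield every binary word with num_a letters a and num_b letters b."""
    n = num_a + num_b
    for b_positions in combinations(range(n), num_b):
        yield "".join("b" if i in b_positions else "a" for i in range(n))


solutions = [w for w in words_with_parikh_vector(5, 3)
             if count_subword(w, "ab") == 8]
print(len(solutions), solutions)
```

The search space is small (56 words), so brute force is adequate here; the same `count_subword` routine works for any word u and runs in time proportional to |w| · |u|.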
Clearly, |w|u = 0 if |w| < |u|. We also make the convention that, for any w
and the empty word λ,
|w|λ = 1.
A brief description of the contents of this paper follows. This Introduction
is concluded with the definition of a Parikh matrix. The notion is then
generalized in Section 2, where also the independence of individual matrix en-
tries is investigated. In the rest of the paper, considerations are mostly restricted
to the binary alphabet {a, b}. In Section 3, we introduce a difference function
D and the associated language Lsym , and investigate their basic properties.
Section 4 studies the behavior of D and Lsym on prefixes of infinite words, and
Section 5 is concerned, in this respect, with a specific infinite word, the so-called
Fibonacci word.
We assume that the reader is familiar with the basics of formal languages.
Whenever necessary, [12] may be consulted. As customary, we use small letters
from the beginning of the English alphabet a, b, c, d, possibly with indices, to
denote letters of our formal alphabet Σ. Words are usually denoted by small
letters from the end of the English alphabet. If an ordering of letters is needed,
as in connection with Parikh vectors, we use the natural alphabetic ordering.
Thus, the word $a^2 d c b^7 a c^2$ has the Parikh vector (3, 7, 3, 1).
The Parikh matrix is a powerful generalization of a Parikh vector. While
a Parikh vector only indicates the number of occurrences of each letter in a
word, the Parikh matrix gives also information about the mutual positions of the
occurrences. The Parikh matrix mapping uses upper triangular square matrices,
with nonnegative integer entries, 1’s on the main diagonal and 0’s below it. The
set of all such triangular matrices is denoted by M, and the subset of all matrices
of dimension k ≥ 1 is denoted by Mk .
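We are now ready to give the formal definition of a Parikh matrix. Let
$\Sigma_k = \{a_1, \ldots, a_k\}$ be an alphabet. The Parikh matrix mapping, denoted $\Psi_k$,
is the morphism
$$\Psi_k : \Sigma_k^* \to M_{k+1},$$
defined by the following condition. Let $1 \le q \le k$ and $\Psi_k(a_q) = (m_{i,j})_{1 \le i,j \le k+1}$.
Then for each $1 \le i \le k+1$, $m_{i,i} = 1$, $m_{q,q+1} = 1$, all other elements of the
matrix $\Psi_k(a_q)$ being 0. Matrices of the form $\Psi_k(w)$, $w \in \Sigma_k^*$, are referred to as
Parikh matrices.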
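The definition translates directly into a product of elementary matrices, one per letter. The sketch below is an illustration only, assuming the mapping is extended to words multiplicatively (as a morphism into the matrix monoid); the function names are hypothetical, and plain nested lists are used instead of a matrix library to keep the example self-contained.

```python
def identity(n):
    """Return the n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    """Multiply two square integer matrices of equal dimension."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def parikh_matrix(w, alphabet):
    """Return the Parikh matrix of w with respect to the ordered alphabet."""
    k = len(alphabet)
    result = identity(k + 1)
    for letter in w:
        q = alphabet.index(letter)   # 0-based position of the letter in the ordering
        generator = identity(k + 1)
        generator[q][q + 1] = 1      # the single extra 1 just above the main diagonal
        result = mat_mul(result, generator)
    return result


matrix = parikh_matrix("abcab", "abc")
for row in matrix:
    print(row)
```

For w = abcab the second diagonal of the result reads (2, 2, 1), which is the Parikh vector of w, and the entry in the top right corner is 1, the number of occurrences of abc as a subword.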
Observe that when defining the Parikh matrix mapping we have, similarly
as when defining the Parikh vector, in mind a specific ordering of the alphabet.
The ordering will be clear from the context. If we consider letters without
numerical indices, we assume the alphabetic ordering when numbering the rows
and columns of the matrix.
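The following theorem, [7], characterizes the entries of a Parikh matrix in
terms of some subword occurrences $|w|_u$. For the alphabet $\Sigma_k = \{a_1, \ldots, a_k\}$,
we denote by $a_{i,j}$ the word $a_i a_{i+1} \cdots a_j$, where $1 \le i \le j \le k$.
Theorem 1 Consider a word $w \in \Sigma_k^*$. The matrix $\Psi_k(w) = (m_{i,j})_{1 \le i,j \le k+1}$
satisfies $m_{i,j} = 0$ for $i > j$, $m_{i,i} = 1$ for all i, and $m_{i,j+1} = |w|_{a_{i,j}}$ for
$1 \le i \le j \le k$.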
By the second diagonal (and similarly the third diagonal, etc.) of a matrix in
Mk+1 , we mean the diagonal of length k immediately above the main diagonal.
(The diagonals from the third on are shorter than k.) Theorem 1 states that the
second diagonal of the Parikh matrix of w gives the Parikh vector of w. The
next diagonals give information about the order of letters in w by indicating
the numbers |w|u for certain specific words u. Indeed, all factors of the word
a1 a2 . . . ak appear among the words u.
We mention finally the following three significant problem areas, not dis-
cussed in this paper, concerning Parikh matrices. The injectivity problem con-
cerns the ambiguity of words associated to a Parikh matrix. When is the map-
ping injective, that is, only one word corresponds to the matrix? The problem
has been settled for binary alphabets, [4, 9, 15], as well as for ternary alphabets,
[21]. The reference [21] mentions specific open problems in this area. For in-
stance, can we always reduce the exponent of a letter in an unambiguous word,
preserving unambiguity? Specifically, assume that the word w = ua2 v, where a
is a letter, is unambiguous, that is, no other word has the same Parikh matrix.
Does it follow that also uav is unambiguous? This is true for binary and ternary
alphabets.
The injectivity problem is a special case of the inference problem: which
specific values of |w|u determine w uniquely? More information is contained
in [3, 6, 16, 18]. Language-theoretic problems deal with languages associated
to (sets of) Parikh matrices, [2, 7, 8], or languages defined by (combinations
of) various values of |w|u , [8, 15, 17, 19], or equalities and inequalities between
various values |w|u , [7, 8, 14].
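Let $u = b_1 \cdots b_k$ be a word, where each $b_i$, $1 \le i \le k$, is a letter of the
alphabet Σ. The Parikh matrix mapping with respect to u, denoted $\Psi_u$, is the
morphism
$$\Psi_u : \Sigma^* \to M_{k+1},$$
defined, for $a \in \Sigma$, by the condition: if $\Psi_u(a) = M_u(a) = (m_{i,j})_{1 \le i,j \le k+1}$, then
for each $1 \le i \le k+1$, $m_{i,i} = 1$, and for each $1 \le i \le k$, $m_{i,i+1} = \delta_{a,b_i}$
(the Kronecker delta), all other elements of the matrix $M_u(a)$ being 0. Matrices
of the form $\Psi_u(w)$, $w \in \Sigma^*$, are referred to as generalized Parikh matrices.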
which the letters appear in w. The above definition implies that if a letter a
does not occur in u, then the matrix Mu (a) is the identity matrix.
For instance, if u = baab, then
$$M_u(a) = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
In our definition of a Parikh matrix in the Introduction, the word u was
chosen to be $u = a_1 \cdots a_k$, for the alphabet $\Sigma = \{a_1, \ldots, a_k\}$. In the general
setup, the essential contents of Theorem 1 can be formulated as follows. For
$1 \le i \le j \le k$, denote $u_{i,j} = b_i \cdots b_j$. Denote the entries of the matrix $M_u(w)$
by $m_{i,j}$.
Theorem 2 For all i and j, $1 \le i \le j \le k$, we have $m_{i,1+j} = |w|_{u_{i,j}}$.
The following example of a generalized Parikh matrix might at this stage
seem a bit complicated and strange. However, it is significant for our
considerations in Sections 3 and 5, where also the notation $w_8$ becomes clear. Consider
the binary alphabet {a, b}, as well as the words u = aba and
$$w_8 = abaababaabaababaababaabaababaaba.$$
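The generalized Parikh matrix for this pair of words can be computed mechanically. The sketch below is illustrative and not taken from the paper; all function names are hypothetical. It builds $M_u(a)$ from the Kronecker-delta rule, multiplies the letter matrices along w, and compares every entry above the main diagonal with a direct subword count in the sense of Theorem 2.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def letter_matrix(u, letter):
    """M_u(letter): identity plus a 1 at (i, i+1) wherever the i-th letter of u equals letter."""
    m = identity(len(u) + 1)
    for i, b in enumerate(u):
        if b == letter:
            m[i][i + 1] = 1
    return m


def generalized_parikh_matrix(u, w):
    """Product of the letter matrices M_u(x) over the letters x of w, in order."""
    result = identity(len(u) + 1)
    for letter in w:
        result = mat_mul(result, letter_matrix(u, letter))
    return result


def count_subword(w, u):
    """|w|_u by dynamic programming over the prefixes of u."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(u)]


U = "aba"
W8 = "abaababaabaababaababaabaababaaba"
M = generalized_parikh_matrix(U, W8)
for row in M:
    print(row)

# Theorem 2 with 0-based indices: M[i][j + 1] equals |w|_{b_i ... b_j}.
theorem2_holds = all(M[i][j + 1] == count_subword(W8, U[i:j + 1])
                     for i in range(len(U)) for j in range(i, len(U)))
print(theorem2_holds)
```

With u = aba, the entries in positions (1, 3) and (2, 4) of the matrix are $|w|_{ab}$ and $|w|_{ba}$, respectively, which is why this particular u is convenient for comparing the two counts.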
k(k + 1)/2
The next question is whether the information in some entry in a Parikh
matrix is superfluous, that is, the entry can always be computed from the other
entries. It turns out that none of the entries is superfluous in this sense. We
begin with the following formal definition.
but $|w|_{abcd} \neq |w'|_{abcd}$.
Indeed, the following general result, [18], holds true.
So far very little is known about the independence of the entries in a gen-
eralized Parikh matrix. For which words u do the matrices Mu consist of inde-
pendent entries?
3 The difference function D and the language Lsym
In this section we will investigate words w over the binary alphabet {a, b}, with
respect to the “balance” between occurrences of subwords ab and ba in w. The
function D and the language Lsym introduced in this section will be investigated
further in Sections 4 and 5.
We begin with the following obvious relations, valid for any word w:
Analogous relations hold for |wb|x . Using these relations and matrix multipli-
cation (or directly Theorem 2) we obtain the following result.
Lemma 1 Given $w \in \{a, b\}^*$, consider the generalized Parikh matrix $M_{aba}(w)$.
Then $|w|_{ab} = |w|_{ba}$ if and only if the entries (1, 3) and (2, 4) in the matrix are
equal.
Thus, Lemma 1 characterizes the situation where the word w is “balanced”.
We come now to the central definition.
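Definition 5 The difference function D is defined by
$$D(w) = |w|_{ab} - |w|_{ba}, \quad w \in \{a, b\}^*.$$
The language $L_{sym}$ consists of all words w such that D(w) = 0.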
The next two lemmas list some basic properties of the difference function.
Lemma 2 If a word w is of odd length, then D(w) is even.
Proof. Consider first words w, where every a precedes every b, hence
$w = a^i b^j$. Since w is of odd length, one of the numbers i and j is even. (This includes
the case where only one of the letters occurs in w.) Consequently, D(w) = ij
is even. Clearly, every word with the same Parikh vector as w can be obtained
from w by applications of the following transformation: change an occurrence
of the factor ab to ba. Each application of the transformation decreases the
D-value by 2. (Indeed, exactly one occurrence of the subword ab is lost and
one occurrence of ba gained.) Thus, for an arbitrary w of odd length, D(w) is
obtained by subtracting an even number from an even number. This establishes
the claim. △
Apart from the exception stated in Lemma 2, all numbers (within bounds
depending on |w|) are in the range of D.
Lemma 3 The values of D(w) lie in the closed interval $[-|w|^2/4, |w|^2/4]$.
Moreover, for all integers $n \ge 0$ and m, $-n^2/4 \le m \le n^2/4$, where m and
n are not both odd, there is a word w such that |w| = n and D(w) = m.
Proof. For n = 2, the required words are ba, aa, ab, for n = 3, they are
baa, aba, aab, and for n = 4, they are
In general, for n = 2t, we begin with the words $a^t b^t$ and $a^{t+1} b^{t-1}$ and apply the
transformation described in the proof of Lemma 2. For n = 2t + 1, it suffices to
begin with the word $a^{t+1} b^t$. △
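Lemmas 2 and 3 together describe the exact range of D on words of a given length, and this can be confirmed by brute force for small lengths. The sketch below is an illustration under the assumption that $D(w) = |w|_{ab} - |w|_{ba}$, which is what the proof of Lemma 2 indicates ($D(a^i b^j) = ij$); the function names are hypothetical.

```python
from itertools import product


def difference(w):
    """D(w) = |w|_ab - |w|_ba, counted directly over all pairs of positions."""
    n = len(w)
    ab = sum(1 for i in range(n) for j in range(i + 1, n)
             if w[i] == "a" and w[j] == "b")
    ba = sum(1 for i in range(n) for j in range(i + 1, n)
             if w[i] == "b" and w[j] == "a")
    return ab - ba


def attained_values(n):
    """Set of D-values over all binary words of length n."""
    return {difference("".join(letters)) for letters in product("ab", repeat=n)}


def predicted_values(n):
    """Integers m with |m| <= n^2/4, excluding odd m when n is odd (Lemmas 2 and 3)."""
    bound = n * n // 4
    return {m for m in range(-bound, bound + 1)
            if not (m % 2 == 1 and n % 2 == 1)}


for n in range(7):
    print(n, sorted(attained_values(n)), attained_values(n) == predicted_values(n))
```

The enumeration grows as $2^n$, so it is only a sanity check for short words, not a substitute for the proof.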
The next result tells how the D-value changes if a word is replaced by one
of its conjugates.
Lemma 4 For any w, the transition from aw to wa (resp. from bw to wb)
decreases (resp. increases) the D-value by $2|w|_b$ (resp. $2|w|_a$).
Proof. Consider the claim $D(wa) = D(aw) - 2|w|_b$. In the transition we lose
all occurrences of ab, where a is the letter to be transferred. There are $|w|_b$
of them. On the other hand, we gain $|w|_b$ occurrences of ba. Thus, the claim
follows. The second claim is established similarly. △
Theorem 5 For any word w and natural number n,
$$D(w^n) = nD(w).$$
Proof. The claim holds for n = 1. Assume that it holds for a fixed value
$n \ge 1$, and consider $D(w^{n+1})$. By the inductive hypothesis,
$$D(w^{n+1}) = D(ww^n) = (n + 1)D(w) + X,$$
where X is obtained as follows. One first counts the number of all occurrences
of ab, where the a comes from the first w, and the b from the remaining part.
Call these occurrences “positive”. From the number of positive occurrences
one then subtracts the number of “negative” occurrences, that is, occurrences
of ba, where the b comes from the first w, and the a from the remaining part.
But, clearly, there is a one-to-one correspondence between positive and negative
occurrences and, hence, X = 0. The Theorem follows. △
The logarithmic property presented in Theorem 5 does not hold for general
products. For instance, D(a) = 0 and D(ab) = 1 but D(aab) = 2 and D(aba) = 0.
However, the following result is established in the same way as Theorem 5.
Corollary 1 For all words w and w′ such that $|w|_a / |w|_b = |w'|_a / |w'|_b$,
$$D(ww') = D(w) + D(w').$$
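The quantity X in the proof of Theorem 5 has a closed form for an arbitrary product: the positive occurrences number $|w|_a |w'|_b$ and the negative ones $|w|_b |w'|_a$, so $D(ww') = D(w) + D(w') + |w|_a |w'|_b - |w|_b |w'|_a$. The cross term vanishes exactly when the letter ratios agree, which is the hypothesis of Corollary 1. The sketch below checks this identity exhaustively on short words; it assumes $D(w) = |w|_{ab} - |w|_{ba}$, and the function names are introduced here for illustration only.

```python
from itertools import product


def difference(w):
    """D(w) = |w|_ab - |w|_ba, accumulated letter by letter."""
    d = num_a = num_b = 0
    for letter in w:
        if letter == "a":
            d -= num_b      # the new a closes one occurrence of ba per earlier b
            num_a += 1
        else:
            d += num_a      # the new b closes one occurrence of ab per earlier a
            num_b += 1
    return d


def binary_words(max_len):
    """Yield all words over {a, b} of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


def cross_term(w, v):
    """Positive minus negative occurrences straddling the boundary between w and v."""
    return w.count("a") * v.count("b") - w.count("b") * v.count("a")


def product_rule_holds(max_len):
    words = list(binary_words(max_len))
    return all(difference(w + v) == difference(w) + difference(v) + cross_term(w, v)
               for w in words for v in words)


print(product_rule_holds(4))
print([difference("aab" * n) for n in range(1, 5)])
```

Taking w′ to be a power of w makes the cross term zero, which recovers Theorem 5 as a special case.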
4 Prefixes of infinite words
We will now investigate the values of the function D on prefixes of infinite words
and, at the same time, subsets of the language Lsym , resulting as prefixes of an
infinite word. For instance, all prefixes of an odd length ≥ 3 of the infinite word
$$(ab)^\omega = ababab \ldots$$
valid for any w (see also Lemma 4), it is easy to compute the D-value for any
prefix of this particular infinite word.
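A prefix-by-prefix computation of this kind can be sketched as follows. The code is an illustration, not the paper's own procedure: it assumes $D(w) = |w|_{ab} - |w|_{ba}$ and uses the two update rules $D(wa) = D(w) - |w|_b$ and $D(wb) = D(w) + |w|_a$, which follow directly from that definition. The generator name `prefix_differences` is hypothetical.

```python
from itertools import cycle, islice


def prefix_differences(letters):
    """Yield D(p) for each nonempty prefix p of a (possibly infinite) letter stream."""
    d = num_a = num_b = 0
    for letter in letters:
        if letter == "a":
            d -= num_b      # D(wa) = D(w) - |w|_b
            num_a += 1
        else:
            d += num_a      # D(wb) = D(w) + |w|_a
            num_b += 1
        yield d


# D-values of the first twelve prefixes of (ab)^omega = ababab...
values = list(islice(prefix_differences(cycle("ab")), 12))
print(values)
```

Because the generator consumes its input lazily, it can be applied to an infinite stream and truncated with `islice`; each prefix costs constant time.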
The Thue-Morse word, [1, 12], is the infinite word obtained by iterating the
morphism
a → ab, b → ba
on the starting word a. Thus, the Thue-Morse word is the limit of the sequence
$$a, \; ab, \; abba, \; abbabaab, \; abbabaabbaababba, \; \ldots$$
This infinite word is obtained also as follows. First write down the letter a.
Whenever you have written down the word w, write down the word $ww^c$, where
the “complement” $w^c$ is the word obtained from w by interchanging a and b.
We denote by Pref(TM) the set of prefixes of the Thue-Morse word.
Theorem 7 The intersection $\mathrm{Pref}(TM) \cap L_{sym}$ consists of all (nonempty)
prefixes whose length is divisible by 4. The range of the difference function D
on the set Pref(TM) contains no odd numbers, apart from 1 and −1. For every
$i \ge 0$, at least one of the numbers 2i and −2i is in the range.
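The three assertions of Theorem 7 can be examined numerically on a long prefix. The sketch below is illustrative only: it builds the Thue-Morse word by the doubling rule (append the complement of what has been written so far), assumes $D(w) = |w|_{ab} - |w|_{ba}$, and uses hypothetical function names. Note that the one-letter prefix a has D-value 0 trivially, so the divisibility check is reported for prefixes of length at least 2.

```python
def thue_morse_prefix(length):
    """Prefix of the Thue-Morse word, built by repeatedly appending the complement."""
    swap = str.maketrans("ab", "ba")
    w = "a"
    while len(w) < length:
        w += w.translate(swap)
    return w[:length]


def prefix_differences(word):
    """List whose entry n-1 is D of the prefix of length n."""
    values = []
    d = num_a = num_b = 0
    for letter in word:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
        values.append(d)
    return values


N = 256
tm = thue_morse_prefix(N)
d_values = prefix_differences(tm)

balanced_lengths = [n for n in range(2, N + 1) if d_values[n - 1] == 0]
odd_values = sorted({d for d in d_values if d % 2 != 0})

print(balanced_lengths[:8])
print(odd_values)
```

A finite prefix can only support the theorem, not prove it, but the pattern is already visible within the first few blocks of four letters.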
Proof. Observe that the Thue-Morse word is a catenation of the blocks abba
and baab. Thus, by Corollary 1, all prefixes of length 4i, i ≥ 1, belong to Lsym .
That no further prefixes belong to Lsym , follows by the subsequent discussion.
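If w is the prefix of length 4i, $i \ge 1$, then one of the words
$$wabba \quad\text{and}\quad wbaab$$
is the prefix of length 4(i + 1). We assume, inductively, that as regards prefixes
of w, no odd number different from 1 and −1 appears as a D-value and, for all
$i_1 \le i$, at least one of the numbers $2i_1$ and $-2i_1$ appears as a D-value. Now,
because $|w|_a = |w|_b = 2i$ and D(w) = 0, we have
$$D(wa) = -2i, \quad D(wab) = 1, \quad D(wabb) = 2(i + 1), \quad D(wabba) = 0$$
and
$$D(wb) = 2i, \quad D(wba) = -1, \quad D(wbaa) = -2(i + 1), \quad D(wbaab) = 0.$$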
The above considerations are sufficient for the computation of the D-values
for the set of prefixes Pref(W) of an infinite ultimately periodic (binary) word
W, as well as for determining the intersection $\mathrm{Pref}(W) \cap L_{sym}$. Write W in the
form
$$W = xy^\omega = xyyy \ldots,$$
where x and $y \neq \lambda$ are binary words.
$$F_1 = a, \quad F_5 = abaababa, \quad F_7 = abaababaabaababaababa.$$
$$\phi_0 = \phi_1 = 1, \quad \phi_{i+2} = \phi_{i+1} + \phi_i, \quad i \ge 0.$$
We now introduce the following notations, for $i \ge 1$:
$$A_i = |F_i|_{ab}, \quad B_i = |F_i|_{ba}, \quad C_i = A_i - B_i = D(F_i).$$
and, for $i \ge 3$,
$$B_i = B_{i-1} + B_{i-2} + \phi_{i-3}^2.$$
This yields, by the two last equations in Lemma 5, the claim concerning Ai . As
regards Bi , we obtain first
  i      A_i      B_i    C_i   ϕ_{i−3}
  4        4        2      2         1
  5        8        7      1         2
  6       22       18      4         3
  7       54       50      4         5
  8      141      132      9         8
  9      363      351     12        13
 10      946      924     22        21
 11     2464     2431     33        34
 12     6436     6380     56        55
 13    16820    16732     88        89
 14    43993    43848    145       144
 15   115101   114869    232       233
 16   301224   300846    378       377
Using the formula
$$\phi_{i-2} \cdot \phi_{i-4} = \phi_{i-3}^2 + (-1)^i, \quad i \ge 4,$$
which is easily established by induction, we obtain further
$$C_i = C_{i-1} + C_{i-2} + (-1)^i, \quad i \ge 4.$$
We denote $\phi'_i = \phi_{i-3} + (-1)^i$, $i \ge 3$. By a direct computation,
$$\phi'_3 = 0 = C_3, \quad \phi'_4 = 2 = C_4.$$
For $i \ge 5$, we obtain
$$\phi'_i = \phi_{i-3} + (-1)^i = \phi_{i-4} + \phi_{i-5} + (-1)^i$$
which yields further
$$\phi'_i = \phi'_{i-1} + (-1)^i + \phi'_{i-2} + (-1)^{i-1} + (-1)^i = \phi'_{i-1} + \phi'_{i-2} + (-1)^i.$$
Consequently, by induction, $C_i = \phi'_i$, for $i \ge 3$, which establishes the first
sentence of Theorem 9. To prove the second sentence, assume first that $i \ge 4$ is
even. Then $F_i$ has the suffix ab. We write $F_i = w_i ab$. Since $|F_i| = \phi_i$, the claim
is that $w_i \in L_{sym}$. Denote $D(w_i) = x$. We obtain, by the already established
first sentence of Theorem 9 and because i is even,
$$C_i = \phi_{i-3} + 1 = x - |w_i|_b + |w_i|_a + 1.$$
Lemma 6 yields $|w_i|_b = \phi_{i-2} - 1$ and $|w_i|_a = \phi_{i-1} - 1$. Since $\phi_{i-3} + \phi_{i-2} = \phi_{i-1}$,
this implies x = 0 and, consequently, $w_i \in L_{sym}$.
If $i \ge 4$ is odd, we know that $C_i = \phi_{i-3} - 1$. In this case $F_i$ has the suffix ba.
Writing again $F_i = w_i ba$ and denoting $D(w_i) = x$, we obtain similarly as before
$$x + \phi_{i-1} - 1 - \phi_{i-2} = \phi_{i-3} - 1,$$
whence the results x = 0 and $w_i \in L_{sym}$ follow. △
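Both conclusions of the proof can be recomputed for the range of indices covered by the table. The sketch below is illustrative; it assumes the recurrence $F_2 = ab$, $F_i = F_{i-1} F_{i-2}$ for $i \ge 3$, which is consistent with the listed words $F_1$, $F_5$ and $F_7$ but is not spelled out in the surviving text, and it takes $A_i = |F_i|_{ab}$, $B_i = |F_i|_{ba}$, $C_i = A_i - B_i$. All function names are hypothetical.

```python
def fibonacci_words(max_index):
    """Return {i: F_i} with F_1 = a, F_2 = ab and F_i = F_{i-1} F_{i-2} (assumed recurrence)."""
    words = {1: "a", 2: "ab"}
    for i in range(3, max_index + 1):
        words[i] = words[i - 1] + words[i - 2]
    return words


def count_ab_ba(w):
    """Return the pair (|w|_ab, |w|_ba) in one left-to-right pass."""
    ab = ba = num_a = num_b = 0
    for letter in w:
        if letter == "a":
            ba += num_b
            num_a += 1
        else:
            ab += num_a
            num_b += 1
    return ab, ba


def phi(n):
    """Fibonacci numbers with phi(0) = phi(1) = 1."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


F = fibonacci_words(16)
rows = {}
for i in range(3, 17):
    a_i, b_i = count_ab_ba(F[i])
    rows[i] = (a_i, b_i, a_i - b_i)
    print(i, a_i, b_i, a_i - b_i, phi(i - 3) + (-1) ** i)

# Second claim: F_i without its last two letters has equally many ab and ba.
balanced = all(count_ab_ba(F[i][:-2])[0] == count_ab_ba(F[i][:-2])[1]
               for i in range(4, 17))
print(balanced)
```

The longest word involved, $F_{16}$, has 1597 letters, so the single-pass count is more than fast enough here.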
Observe that the word $w_8$ considered in Section 2 is the same as the $w_8$
discussed in the proof above. The matrix $M_{aba}(w_8)$ was presented in Section
2. The entries (1, 3) and (2, 4) are the same, as they should be according to
Theorems 6 and 9.
It is very likely that no other prefixes of the Fibonacci word are in Lsym
than those given in Theorem 9. We have no proof of this assertion.
It can be shown that the words wi considered in the proof of Theorem 9 are
actually palindromes. By Theorem 6, this gives another proof for the second
sentence of Theorem 9.
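The palindrome property is easy to test for small indices. The sketch below is an illustration only, again assuming $F_1 = a$, $F_2 = ab$, $F_i = F_{i-1} F_{i-2}$ and $D(w) = |w|_{ab} - |w|_{ba}$; the function names are hypothetical. It also checks, over all short binary palindromes, that reversal symmetry forces D = 0, since reading a word backwards exchanges the occurrences of ab and ba.

```python
from itertools import product


def fibonacci_word(i):
    """F_i under the assumed recurrence F_1 = a, F_2 = ab, F_i = F_{i-1} F_{i-2}."""
    previous, current = "a", "ab"
    if i == 1:
        return previous
    for _ in range(i - 2):
        previous, current = current, current + previous
    return current


def difference(w):
    """D(w) = |w|_ab - |w|_ba."""
    d = num_a = num_b = 0
    for letter in w:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
    return d


def is_palindrome(w):
    return w == w[::-1]


trimmed = {i: fibonacci_word(i)[:-2] for i in range(4, 17)}
for i, w in trimmed.items():
    print(i, len(w), is_palindrome(w), difference(w))

short_palindromes = ["".join(t) for n in range(11)
                     for t in product("ab", repeat=n)
                     if is_palindrome("".join(t))]
print(all(difference(p) == 0 for p in short_palindromes))
```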
For each $i \ge 3$, the function D assumes both positive and negative values
on the prefixes of the Fibonacci word between $F_i$ and $F_{i+1}$. Bounds on these
values, depending on i, are easily computed. For instance, for all $i \ge 5$,
$$D(F_i abaa) \le 2(\phi_{i-3} - \phi_{i-2}).$$
6 Conclusion
Considerations dealing with the difference function D and the language Lsym
can be extended in many ways. For instance, we may consider the infinite
words generated by D0L systems and the “D-equivalence” of D0L sequences.
Two quite different D0L sequences, even two sequences having different growth
functions, can still be D-equivalent: they assume the same D-value at each
level. How can one decide D-equivalence? Considerations can also be extended
to bigger than binary alphabets.
References
[1] Berstel, J. and Karhumäki, J., Combinatorics on words - a tutorial. In Păun, G.,
Rozenberg, G. and Salomaa, A. (eds.): Current Trends in Theoretical Computer
Science. The Challenge of a New Century, Vol. 2, World Scientific Publishing
Company, Singapore (2004) 415-475.
[2] Ding, C. and Salomaa, A., On some problems of Mateescu concerning subword
occurrences. Fundamenta Informaticae (2006), to appear.
[4] Fossé, S. and Richomme, G., Some characterizations of Parikh matrix equivalent
binary words. Inform. Proc. Lett. 92 (2004) 77–82.
[6] Manvel, B., Meyerowitz, A., Schwenk, A., Smith, K., Stockmeyer, P., Recon-
struction of sequences. Discrete Math. 94 (1991) 209–219.
[7] Mateescu, A., Salomaa, A., Salomaa, K. and Yu, S., A sharpening of the Parikh
mapping. Theoret. Informatics Appl. 35 (2001) 551–564.
[8] Mateescu, A., Salomaa, A. and Yu, S., Subword histories and Parikh matrices.
J. Comput. Syst. Sci. 68 (2004) 1–21.
[9] Mateescu, A. and Salomaa, A., Matrix indicators for subword occurrences and
ambiguity. Int. J. Found. Comput. Sci. 15 (2004) 277–292.
[11] Rozenberg, G. and Salomaa, A., The Mathematical Theory of L Systems. Aca-
demic Press, New York (1980).
[14] Salomaa, A., Counting (scattered) subwords. EATCS Bulletin 81 (2003) 165–
179.
[15] Salomaa, A., On the injectivity of Parikh matrix mappings. Fundamenta Infor-
maticae 64 (2005) 391–404.
[16] Salomaa, A., Connections between subwords and certain matrix mappings.
Theoretical Computer Science 340 (2005) 188–203.
[19] Salomaa, A. and Yu, S., Subword conditions and subword histories. TUCS Tech-
nical Report 633 (2004), Information and Computation, to appear.
[20] Şerbănuţă, T.-F., Extending Parikh matrices. Theoretical Computer Science 310
(2004) 233–246.
[21] Şerbănuţă, V.G. and Şerbănuţă, T.F., Injectivity of the Parikh matrix mappings
revisited. Fundamenta Informaticae (2006), to appear.
Lemminkäisenkatu 14 A, 20520 Turku, Finland | www.tucs.fi
University of Turku
• Department of Information Technology
• Department of Mathematics