Arto Salomaa
Subword balance in binary words, languages and sequences
TUCS Technical Report

No 774, June 2006
Turku Centre for Computer Science
Lemminkäisenkatu 14, 20520 Turku, Finland
asalomaa@utu.fi
Abstract
We investigate binary words and languages having a balanced structure of
(scattered) subwords. We introduce a “difference function” D for binary words.
For D = 0, the resulting language is properly context-sensitive. Parikh matrices
constitute a useful technical tool in the study; we also investigate the
independence of their entries. The investigation is extended to ω-words and
periodicity. For the Fibonacci word, the D-values are in many ways connected
with the Fibonacci numbers.
Keywords: subword, scattered subword, counting subwords, Parikh matrix,
periodicity, Fibonacci word
TUCS Laboratory
Discrete Mathematics for Information Technology
1 Introduction
Numerical quantities such as Parikh vectors, [10], associated to words often
make the considerations easier because operations become commutative. The
numerical quantity investigated in this paper is $|w|_u$, the number of occurrences
of a word u as a (scattered) subword of a word w. Parikh matrices introduced
in [7] and investigated further, for instance, in [8, 9, 4, 2, 15, 16, 17, 19, 20, 21]
have these quantities as their entries. Thus, Parikh matrices constitute a central
tool in investigations dealing with subword occurrences.
This paper deals with some very fundamental properties of words. Because
of the noncommutativity of words, most problems are difficult to handle mathe-
matically. By arithmetization, that is, by expressing words in terms of numbers,
one is sometimes able to reach a situation where the products are commutative.
The theory of formal power series, [5], contains numerous such constructions.
The Parikh vector, [10, 12], $\Psi(w) = (i_1, \dots, i_k)$ indicates the number of
occurrences of the letter $a_j$, $1 \le j \le k$, in w, provided w is over the alphabet
$\Sigma = \{a_1, \dots, a_k\}$. To get more information about a word, one has to focus
on subwords and factors. In this paper, these notions are understood
as follows.
Definition 1 A word u is a subword of a word w if there exist words $x_1, \dots, x_n$
and $y_0, \dots, y_n$, some of them possibly empty, such that
$$u = x_1 \cdots x_n \quad\text{and}\quad w = y_0 x_1 y_1 \cdots x_n y_n.$$
The word u is a factor of w if there are words x and y such that w = xuy. If
the word x (resp. y) is empty, then u is also called a prefix (resp. suffix) of w.
Throughout this article, we understand subwords and factors in this way.

In classical language theory, [12], our subwords are usually called “scattered
subwords”, whereas our factors are called “subwords”. A subword in our sense
is mathematically a subsequence. The notation used throughout the article is
$|w|_u$, the number of occurrences of the word u as a subword of the word w.
In [13], $|w|_u$ is denoted as a binomial coefficient and, indeed, it reduces to the
ordinary binomial coefficient if the alphabet consists of one letter.
Let us first consider this notion more explicitly, following [15]. Occurrences
of u as a subword of w can be viewed as vectors. If $|u| = t$, each occurrence of
u in w can be identified as the t-tuple $(i_1, \dots, i_t)$ of increasing positive integers,
where for $1 \le j \le t$, the jth letter of u is the $i_j$th letter of w. For instance, the
8 occurrences of u = ab in w = baababba are
(2, 4), (2, 6), (2, 7), (3, 4), (3, 6), (3, 7), (5, 6), (5, 7).
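The occurrence vectors just described can be enumerated mechanically. The following Python sketch is only an illustration: the function names `occurrences` and `count_subword` are ours, not notation from the paper, and positions are 1-based as in the text.

```python
from itertools import combinations


def occurrences(w, u):
    """List every occurrence of u as a (scattered) subword of w.

    Each occurrence is a tuple of increasing 1-based positions.
    """
    return [
        tuple(i + 1 for i in idx)
        for idx in combinations(range(len(w)), len(u))
        if all(w[i] == c for i, c in zip(idx, u))
    ]


def count_subword(w, u):
    """Return |w|_u by dynamic programming, without listing occurrences."""
    # ways[j] = number of occurrences of the prefix u[:j] in the part of w read so far
    ways = [1] + [0] * len(u)
    for letter in w:
        # go right to left so that one letter of w is used at most once per occurrence
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[-1]


print(occurrences("baababba", "ab"))
print(count_subword("baababba", "ab"))  # 8
```

The brute-force enumeration is exponential in |u| and is meant for small examples only; the dynamic-programming count runs in time proportional to |w| · |u|.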
After a little practice, one is able to determine the numbers $|w|_u$ quickly,
especially for words over the binary alphabet {a, b}. For instance, each of the
following six words w
aababbaa, aabbaaba, abaababa, baaaabba, ababaaab, baaabaab
satisfies $|w|_{ab} = 8$. There are infinitely many binary words w with $|w|_{ab} = 8$
but the above six are the only ones with the Parikh vector (5, 3).
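The claim about the Parikh vector (5, 3) can be confirmed by exhaustive search. The sketch below is an illustration under our own naming (`count_ab`, `words_with` are hypothetical helpers); it tries every placement of the three letters b among eight positions.

```python
from itertools import combinations


def count_ab(w):
    """Return |w|_ab for a word w over {a, b}."""
    seen_a = total = 0
    for letter in w:
        if letter == "a":
            seen_a += 1
        else:
            total += seen_a  # every earlier a pairs with this b
    return total


def words_with(num_a, num_b, target):
    """All words with Parikh vector (num_a, num_b) and |w|_ab == target."""
    n = num_a + num_b
    found = []
    for b_positions in combinations(range(n), num_b):
        w = "".join("b" if i in b_positions else "a" for i in range(n))
        if count_ab(w) == target:
            found.append(w)
    return sorted(found)


print(words_with(5, 3, 8))
```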
Clearly, $|w|_u = 0$ if $|w| < |u|$. We also make the convention that, for any w
and the empty word λ,
$$|w|_\lambda = 1.$$
A brief description of the contents of this paper follows. This Introduction
is concluded with the definition of a Parikh matrix. The notion is then
generalized in Section 2, where the independence of individual matrix entries
is also investigated. In the rest of the paper, considerations are mostly restricted
to the binary alphabet {a, b}. In Section 3, we introduce a difference function
D and the associated language $L_{sym}$, and investigate their basic properties.
Section 4 studies the behavior of D and $L_{sym}$ on prefixes of infinite words, and
Section 5 is concerned, in this respect, with a specific infinite word, the so-called
Fibonacci word.
We assume that the reader is familiar with the basics of formal languages.
Whenever necessary, [12] may be consulted. As customary, we use small letters
from the beginning of the English alphabet a, b, c, d, possibly with indices, to
denote letters of our formal alphabet Σ. Words are usually denoted by small
letters from the end of the English alphabet. If an ordering of letters is needed,
as in connection with Parikh vectors, we use the natural alphabetic ordering.
Thus, the word $a^2dcb^7ac^2$ has the Parikh vector (3, 7, 3, 1).
The Parikh matrix is a powerful generalization of a Parikh vector. While
a Parikh vector only indicates the number of occurrences of each letter in a
word, the Parikh matrix gives also information about the mutual positions of the
occurrences. The Parikh matrix mapping uses upper triangular square matrices,
with nonnegative integer entries, 1's on the main diagonal and 0's below it. The
set of all such triangular matrices is denoted by $\mathcal{M}$, and the subset of all matrices
of dimension $k \ge 1$ is denoted by $\mathcal{M}_k$.
We are now ready to give the formal definition of a Parikh matrix.
Definition 2 Let $\Sigma_k = \{a_1, \dots, a_k\}$ be an alphabet. The Parikh matrix mapping,
denoted $\Psi_k$, is the morphism
$$\Psi_k : \Sigma_k^* \to \mathcal{M}_{k+1},$$
defined by the following condition. Let $1 \le q \le k$ and $\Psi_k(a_q) = (m_{i,j})_{1 \le i,j \le k+1}$.
Then $m_{i,i} = 1$ for each $1 \le i \le k+1$, $m_{q,q+1} = 1$, and all other elements of the
matrix $\Psi_k(a_q)$ are 0. Matrices of the form $\Psi_k(w)$, $w \in \Sigma_k^*$, are referred to as
Parikh matrices.
Observe that when defining the Parikh matrix mapping we have, similarly
as when defining the Parikh vector, in mind a specific ordering of the alphabet.
The ordering will be clear from the context. If we consider letters without
numerical indices, we assume the alphabetic ordering when numbering the rows
and columns of the matrix.
The following theorem, [7], characterizes the entries of a Parikh matrix in
terms of some subword occurrences $|w|_u$. For the alphabet $\Sigma_k = \{a_1, \dots, a_k\}$,
we denote by $a_{i,j}$ the word $a_i a_{i+1} \cdots a_j$, where $1 \le i \le j \le k$.
Theorem 1 Consider $\Sigma_k = \{a_1, \dots, a_k\}$ and $w \in \Sigma_k^*$. The matrix
$\Psi_k(w) = (m_{i,j})_{1 \le i,j \le k+1}$ has the following properties:
• $m_{i,j} = 0$, for all $1 \le j < i \le k+1$,
• $m_{i,i} = 1$, for all $1 \le i \le k+1$,
• $m_{i,j+1} = |w|_{a_{i,j}}$, for all $1 \le i \le j \le k$.
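Definition 2 and Theorem 1 can be tried out on small words. The sketch below is a minimal illustration, assuming the ordered alphabet is passed as a string such as "abc"; all function names are ours. It builds the Parikh matrix as a product of letter matrices and compares each entry with the corresponding subword count.

```python
def count_subword(w, u):
    """Return |w|_u, the number of occurrences of u as a scattered subword of w."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[-1]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def parikh_matrix(w, alphabet):
    """Psi_k(w) for an ordered alphabet given as a string."""
    k = len(alphabet)
    result = identity(k + 1)
    for letter in w:
        q = alphabet.index(letter)   # 0-based index of the letter a_q
        m = identity(k + 1)
        m[q][q + 1] = 1              # the only nonzero entry off the diagonal
        result = mat_mul(result, m)
    return result


def satisfies_theorem_1(w, alphabet):
    """Check m_{i,j+1} = |w|_{a_i ... a_j}, using 0-based rows and columns."""
    m = parikh_matrix(w, alphabet)
    k = len(alphabet)
    return all(m[r][c] == count_subword(w, alphabet[r:c])
               for r in range(k + 1) for c in range(r + 1, k + 1))


print(parikh_matrix("ab", "ab"))
print(satisfies_theorem_1("abcabcca", "abc"))
```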
By the second diagonal (and similarly the third diagonal, etc.) of a matrix in
$\mathcal{M}_{k+1}$, we mean the diagonal of length k immediately above the main diagonal.
(The diagonals from the third on are shorter than k.) Theorem 1 tells that the
second diagonal of the Parikh matrix of w gives the Parikh vector of w. The
next diagonals give information about the order of letters in w by indicating
the numbers $|w|_u$ for certain specific words u. Indeed, all factors of the word
$a_1 a_2 \cdots a_k$ appear among the words u.
We mention finally the following three significant problem areas, not dis-
cussed in this paper, concerning Parikh matrices. The injectivity problem con-
cerns the ambiguity of words associated to a Parikh matrix. When is the map-
ping injective, that is, only one word corresponds to the matrix? The problem
has been settled for binary alphabets, [4, 9, 15], as well as for ternary alphabets,
[21]. The reference [21] mentions specific open problems in this area. For
instance, can we always reduce the exponent of a letter in an unambiguous word,
preserving unambiguity? Specifically, assume that the word $w = ua^2v$, where a
is a letter, is unambiguous, that is, no other word has the same Parikh matrix.
Does it follow that also uav is unambiguous? This is true for binary and ternary
alphabets.
The injectivity problem is a special case of the inference problem: which
specific values of $|w|_u$ determine w uniquely? More information is contained
in [3, 6, 16, 18]. Language-theoretic problems deal with languages associated
to (sets of) Parikh matrices, [2, 7, 8], or languages defined by (combinations
of) various values of $|w|_u$, [8, 15, 17, 19], or equalities and inequalities between
various values $|w|_u$, [7, 8, 14].
2 Generalized Parikh matrices. Independence of the entries
A Parikh matrix tells us the values $|w|_u$, where u is a factor of the ordered
product $a_1 \cdots a_k$ of the letters of the alphabet. When considering generalized
Parikh matrices introduced first in [20], arbitrary values $|w|_u$ can be obtained
as entries. The price one pays is in the dimension of the matrix. The dimension
can be very high if many values $|w|_u$ are wanted as entries.
Before the formal definition, we recall the definition of the “Kronecker delta”.
For letters a and b,
$$\delta_{a,b} = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{if } a \neq b. \end{cases}$$
Definition 3 Let $u = b_1 \cdots b_k$ be a word, where each $b_i$, $1 \le i \le k$, is a letter
of the alphabet Σ. The Parikh matrix mapping with respect to u, denoted $\Psi_u$,
is the morphism
$$\Psi_u : \Sigma^* \to \mathcal{M}_{k+1},$$
defined, for $a \in \Sigma$, by the condition: if $\Psi_u(a) = M_u(a) = (m_{i,j})_{1 \le i,j \le k+1}$, then
$m_{i,i} = 1$ for each $1 \le i \le k+1$, $m_{i,i+1} = \delta_{a,b_i}$ for each $1 \le i \le k$, and all
other elements of the matrix $M_u(a)$ are 0. Matrices of the form $\Psi_u(w)$, $w \in \Sigma^*$,
are referred to as generalized Parikh matrices.
Thus, the Parikh matrix $M_u(w)$ associated to a word w is obtained by multiplying
the matrices $M_u(a)$ associated to the letters a of w, in the order in
which the letters appear in w. The above definition implies that if a letter a
does not occur in u, then the matrix $M_u(a)$ is the identity matrix.
For instance, if u = baab, then
$$M_u(a) = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
In our definition of a Parikh matrix in the Introduction, the word u was
chosen to be $u = a_1 \cdots a_k$, for the alphabet $\Sigma = \{a_1, \dots, a_k\}$. In the general
setup, the essential contents of Theorem 1 can be formulated as follows. For
$1 \le i \le j \le k$, denote $u_{i,j} = b_i \cdots b_j$. Denote the entries of the matrix $M_u(w)$
by $m_{i,j}$.
Theorem 2 For all i and j, $1 \le i \le j \le k$, we have $m_{i,1+j} = |w|_{u_{i,j}}$.
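Definition 3 translates directly into a product of letter matrices. The following sketch is illustrative only (the names `letter_matrix` and `generalized_parikh_matrix` are ours); it reproduces the matrix $M_{baab}(a)$ shown above and multiplies letter matrices in the order in which the letters appear in w.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def letter_matrix(u, a):
    """M_u(a): the identity plus the Kronecker delta of (a, b_i) at entry (i, i+1)."""
    m = identity(len(u) + 1)
    for i, b in enumerate(u):
        m[i][i + 1] = int(a == b)
    return m


def generalized_parikh_matrix(u, w):
    """M_u(w): the product of the letter matrices of w, taken left to right."""
    result = identity(len(u) + 1)
    for letter in w:
        result = mat_mul(result, letter_matrix(u, letter))
    return result


for row in letter_matrix("baab", "a"):
    print(row)
```

A letter that does not occur in u yields the identity matrix, as noted in the text; for example `letter_matrix("baab", "c")` is the 5 × 5 identity.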
The following example of a generalized Parikh matrix might at this stage
seem a bit complicated and strange. However, it is significant for our
considerations in Sections 3 and 5, where also the notation $w_8$ becomes clear. Consider
the binary alphabet {a, b}, as well as the words u = aba and
$$w_8 = abaababaabaababaababaabaababaaba.$$
By Theorem 2, the generalized Parikh matrix $M_{aba}(w)$ satisfies
$$M_{aba}(w) = \begin{pmatrix}
1 & |w|_a & |w|_{ab} & |w|_{aba} \\
0 & 1 & |w|_b & |w|_{ba} \\
0 & 0 & 1 & |w|_a \\
0 & 0 & 0 & 1
\end{pmatrix}.$$
Thus,
$$M_{aba}(w_8) = \begin{pmatrix}
1 & 20 & 120 & 826 \\
0 & 1 & 12 & 120 \\
0 & 0 & 1 & 20 \\
0 & 0 & 0 & 1
\end{pmatrix}.$$
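The entries of $M_{aba}(w_8)$ can be recomputed from the subword counts given by Theorem 2. The sketch below is an illustration with our own helper names; it fills the matrix entry by entry rather than by matrix multiplication.

```python
def count_subword(w, u):
    """Return |w|_u, the number of occurrences of u as a scattered subword of w."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[-1]


def matrix_from_counts(u, w):
    """Build M_u(w) from Theorem 2: entry (i, 1+j) is |w|_{u_{i,j}} (0-based here)."""
    k = len(u)
    m = [[int(r == c) for c in range(k + 1)] for r in range(k + 1)]
    for r in range(k):
        for c in range(r + 1, k + 1):
            m[r][c] = count_subword(w, u[r:c])
    return m


W8 = "abaababaabaababaababaabaababaaba"

for row in matrix_from_counts("aba", W8):
    print(row)
```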
Consider the total information content in a Parikh matrix or generalized
Parikh matrix. While a Parikh vector consists of k numbers (clearly independent
of each other), a Parikh matrix contains
$$k(k+1)/2$$
informative numbers. What about their mutual independence? Clearly, there
are dependencies between individual entries. As an example we mention the
following result from [9, 15].
Theorem 3 Arbitrary nonnegative integers may appear on the second diagonal
of a Parikh matrix. Arbitrary integers $m_{i,i+2}$, $1 \le i \le k-1$, satisfying the
condition
$$0 \le m_{i,i+2} \le m_{i,i+1}\, m_{i+1,i+2}$$
(but no others) may appear on the third diagonal of a (k+1)-dimensional Parikh
matrix.
The next question is whether the information in some entry in a Parikh
matrix is superfluous, that is, the entry can always be computed from the other
entries. It turns out that none of the entries is superfluous in this sense. We
begin with the following formal definition.
Definition 4 Let $\Sigma_k$, $k \ge 2$, and $\Psi_k$ be as in Definition 2. Consider integers
i and j, $1 \le i < j \le k+1$. The entry (i, j) is said to be independent in
the mapping $\Psi_k$ if there are words $w, w' \in \Sigma_k^*$ such that the matrices $\Psi_k(w)$
and $\Psi_k(w')$ coincide elsewhere, but the (i, j)th entries in the two matrices are
different.
It is easy to establish the independence in the case of a three-letter alphabet.
As an example of the independence in the case of a four-letter alphabet
{a, b, c, d}, consider words
$$w = adcbdcba^3dcb \quad\text{and}\quad w' = dcba^3dcbdcba.$$
Then it can be verified that
$$|w|_x = |w'|_x \quad\text{for } x \in \{a, b, c, d, ab, bc, cd, abc, bcd\}$$
but $|w|_{abcd} \neq |w'|_{abcd}$.
Indeed, the following general result, [18], holds true.
Theorem 4 Every entry is independent in the mapping $\Psi_k$.
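The four-letter example preceding Theorem 4 is easy to verify by direct counting. The following sketch is an illustration only; `count_subword` is our helper, and the two words are written out with the cube $a^3$ expanded.

```python
def count_subword(w, u):
    """Return |w|_u, the number of occurrences of u as a scattered subword of w."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[-1]


W = "adcbdcbaaadcb"        # a dcb dcb a^3 dcb
W_PRIME = "dcbaaadcbdcba"  # dcb a^3 dcb dcb a

SHARED = ["a", "b", "c", "d", "ab", "bc", "cd", "abc", "bcd"]

for x in SHARED + ["abcd"]:
    print(x, count_subword(W, x), count_subword(W_PRIME, x))
```

Only the last printed row, for x = abcd, shows two different counts, which is exactly the independence of the top-right entry of the Parikh matrix for k = 4.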
Theorem 4 is not valid for generalized Parikh matrices. (The extension of
Definition 4 to the case of generalized Parikh matrices is obvious.) Indeed,
whenever the word u has the factor $a^2$, where a is a letter, then the entry in the
matrix $M_u$ corresponding to $a^2$ can be computed from the entry corresponding
to a, because $|w|_{a^2} = \binom{|w|_a}{2}$.
An analogous result concerning general squares $x^2$ appearing as factors in
u is not valid. The word u = abab contains a square but every entry in the
generalized Parikh matrix $M_{abab}$ is independent. For instance,
$$M_{abab}(ab^3a^2b) = \begin{pmatrix}
1 & 3 & 6 & 6 & 6 \\
0 & 1 & 4 & 6 & 6 \\
0 & 0 & 1 & 3 & 6 \\
0 & 0 & 0 & 1 & 4 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix},$$
whereas
$$M_{abab}(ba^2b^3a) = \begin{pmatrix}
1 & 3 & 6 & 6 & 0 \\
0 & 1 & 4 & 6 & 6 \\
0 & 0 & 1 & 3 & 6 \\
0 & 0 & 0 & 1 & 4 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}.$$
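Since every entry of $M_{abab}(w)$ above the diagonal is a count $|w|_x$ for a factor x of abab, the two matrices can be compared factor by factor. The sketch below is illustrative; the helper `factor_counts` is our own name.

```python
def count_subword(w, u):
    """Return |w|_u, the number of occurrences of u as a scattered subword of w."""
    ways = [1] + [0] * len(u)
    for letter in w:
        for j in range(len(u), 0, -1):
            if u[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[-1]


def factor_counts(u, w):
    """Map each nonempty factor x of u to |w|_x."""
    factors = {u[i:j] for i in range(len(u)) for j in range(i + 1, len(u) + 1)}
    return {x: count_subword(w, x) for x in sorted(factors)}


first = factor_counts("abab", "abbbaab")   # w = a b^3 a^2 b
second = factor_counts("abab", "baabbba")  # w = b a^2 b^3 a
differing = [x for x in first if first[x] != second[x]]

print(first)
print(second)
print(differing)  # ['abab']
```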
So far very little is known about the independence of the entries in a
generalized Parikh matrix. For which words u do the matrices $M_u$ consist of
independent entries?
3 The difference function D and the language $L_{sym}$
In this section we will investigate words w over the binary alphabet {a, b}, with
respect to the “balance” between occurrences of subwords ab and ba in w. The
function D and the language $L_{sym}$ introduced in this section will be investigated
further in Sections 4 and 5.
We begin with the following obvious relations, valid for any word w:
$$|wa|_a = |w|_a + 1, \quad |wa|_b = |w|_b, \quad |wa|_{ab} = |w|_{ab}, \quad |wa|_{ba} = |w|_{ba} + |w|_b.$$
Analogous relations hold for $|wb|_x$. Using these relations and matrix
multiplication (or directly Theorem 2) we obtain the following result.
Lemma 1 Given $w \in \{a, b\}^*$, consider the generalized Parikh matrix $M_{aba}(w)$.
Then $|w|_{ab} = |w|_{ba}$ if and only if the entries (1, 3) and (2, 4) in the matrix are
equal.
Thus, Lemma 1 characterizes the situation where the word w is “balanced”.
We come now to the central definition.
Definition 5 The difference function is defined by
$$D(w) = |w|_{ab} - |w|_{ba}, \quad\text{for } w \in \{a, b\}^*.$$
The balanced or symmetric language is defined by
$$L_{sym} = \{w \in \{a, b\}^* - (a^* \cup b^*) \mid D(w) = 0\}.$$
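Definition 5 can be evaluated in a single left-to-right pass, using the relations listed before Lemma 1. The sketch below is a minimal illustration; the names `difference` and `in_lsym` are ours.

```python
def difference(w):
    """D(w) = |w|_ab - |w|_ba for a word w over {a, b}."""
    num_a = num_b = d = 0
    for letter in w:
        if letter == "a":
            d -= num_b   # appending a creates |w|_b new occurrences of ba
            num_a += 1
        elif letter == "b":
            d += num_a   # appending b creates |w|_a new occurrences of ab
            num_b += 1
        else:
            raise ValueError("word must be over the alphabet {a, b}")
    return d


def in_lsym(w):
    """Membership in L_sym: both letters occur and D(w) = 0."""
    return "a" in w and "b" in w and difference(w) == 0


print(difference("aab"), difference("aba"), in_lsym("abba"), in_lsym("aa"))
```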
The next two lemmas list some basic properties of the difference function.
Lemma 2 If a word w is of odd length, then D(w) is even.
Proof. Consider first words w, where every a precedes every b, hence
$w = a^ib^j$. Since w is of odd length, one of the numbers i and j is even. (This includes
the case where only one of the letters occurs in w.) Consequently, D(w) = ij
is even. Clearly, every word with the same Parikh vector as w can be obtained
from w by applications of the following transformation: change an occurrence
of the factor ab to ba. Each application of the transformation decreases the
D-value by 2. (Indeed, exactly one occurrence of the subword ab is lost and
one occurrence of ba gained.) Thus, for an arbitrary w of odd length, D(w) is
obtained by subtracting an even number from an even number. This establishes
the claim. △
Apart from the exception stated in Lemma 2, all numbers (within bounds
depending on |w|) are in the range of D.
Lemma 3 The values of D(w) lie in the closed interval $[-|w|^2/4, |w|^2/4]$.
Moreover, for all integers $n \ge 0$ and m, $-n^2/4 \le m \le n^2/4$, where m and
n are not both odd, there is a word w such that |w| = n and D(w) = m.
Proof. For n = 2, the required words are ba, aa, ab, for n = 3, they are
baa, aba, aab, and for n = 4, they are
bbaa, baaa, baba, abaa, abba, aaba, abab, aaab, aabb.
In general, for n = 2t, we begin with the words $a^tb^t$ and $a^{t+1}b^{t-1}$ and apply the
transformation described in the proof of Lemma 2. For n = 2t + 1, it suffices to
begin with the word $a^{t+1}b^t$. △
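For short words, Lemmas 2 and 3 can be confirmed by brute force. The following sketch is only a sanity check under our own naming: `d_values` collects all D-values of words of length n, and `predicted` lists the integers permitted by the two lemmas.

```python
from itertools import product


def difference(w):
    """D(w) = |w|_ab - |w|_ba for a word w over {a, b}."""
    num_a = num_b = d = 0
    for letter in w:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
    return d


def d_values(n):
    """The set of all values D(w) over binary words w of length n."""
    return {difference("".join(t)) for t in product("ab", repeat=n)}


def predicted(n):
    """Integers m with |m| <= n^2/4 such that m and n are not both odd."""
    bound = n * n // 4
    return {m for m in range(-bound, bound + 1) if not (m % 2 and n % 2)}


for n in range(6):
    print(n, sorted(d_values(n)))
```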
The next result tells how the D-value changes if a word is replaced by one
of its conjugates.
Lemma 4 For any w, the transition from aw to wa (resp. from bw to wb)
decreases (resp. increases) the D-value by $2|w|_b$ (resp. $2|w|_a$).
Proof. Consider the claim $D(wa) = D(aw) - 2|w|_b$. In the transition we lose
all occurrences of ab, where a is the letter to be transferred. There are $|w|_b$
of them. On the other hand, we gain $|w|_b$ occurrences of ba. Thus, the claim
follows. The second claim is established similarly. △
Theorem 5 For any word w and natural number n,
$$D(w^n) = nD(w).$$
Proof. The claim holds for n = 1. Assume that it holds for a fixed value
$n \ge 1$, and consider $D(w^{n+1})$. By the inductive hypothesis,
$$D(w^{n+1}) = D(ww^n) = (n+1)D(w) + X,$$
where X is obtained as follows. One first counts the number of all occurrences
of ab, where the a comes from the first w, and the b from the remaining part.
Call these occurrences “positive”. From the number of positive occurrences
one then subtracts the number of “negative” occurrences, that is, occurrences
of ba, where the b comes from the first w, and the a from the remaining part.
But, clearly, there is a one-to-one correspondence between positive and negative
occurrences (both numbers equal $n\,|w|_a\,|w|_b$) and, hence, X = 0. The Theorem
follows. △
The logarithmic property presented in Theorem 5 does not hold for general
products. For instance, D(a) = 0 and D(ab) = 1 but D(aab) = 2 and D(aba) =
0. However, the following result follows similarly as Theorem 5.
Corollary 1 For all words w and w′ such that $|w|_a / |w|_b = |w'|_a / |w'|_b$,
$$D(ww') = D(w) + D(w').$$
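Lemma 4, Theorem 5 and Corollary 1 can all be checked exhaustively on short words. The sketch below is an illustration with hypothetical helper names. One assumption is ours: the ratio condition of Corollary 1 is tested in the cross-multiplied form $|w|_a\,|w'|_b = |w|_b\,|w'|_a$, which avoids division by zero.

```python
from itertools import product


def difference(w):
    """D(w) = |w|_ab - |w|_ba for a word w over {a, b}."""
    num_a = num_b = d = 0
    for letter in w:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
    return d


def words(max_len):
    """All binary words of length at most max_len."""
    for n in range(max_len + 1):
        for t in product("ab", repeat=n):
            yield "".join(t)


def check(max_len):
    """Return True if the three statements hold for all words up to max_len."""
    short = list(words(max_len))
    for w in short:
        a, b, d = w.count("a"), w.count("b"), difference(w)
        # Lemma 4: moving a leading letter to the end of the word
        if difference(w + "a") != difference("a" + w) - 2 * b:
            return False
        if difference(w + "b") != difference("b" + w) + 2 * a:
            return False
        # Theorem 5: D(w^n) = n D(w)
        if any(difference(w * n) != n * d for n in range(1, 5)):
            return False
        # Corollary 1, with the ratio condition cross-multiplied
        for v in short:
            if a * v.count("b") == b * v.count("a"):
                if difference(w + v) != d + difference(v):
                    return False
    return True


print(check(6))
```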
To end this section, we list some fundamental properties of the language
$L_{sym}$.
Theorem 6 The language $L_{sym}$ is context-sensitive but not context-free. A
word w is in $L_{sym}$ if and only if the entries (1, 3) and (2, 4) in the generalized
Parikh matrix $M_{aba}(w)$ are equal and nonzero. If $w \in L_{sym}$, then $w^i \in L_{sym}$,
for all $i \ge 1$ and, conversely, if $w^i \in L_{sym}$ with $i \ge 1$, then $w \in L_{sym}$. Every
palindrome containing both letters belongs to $L_{sym}$.
Proof. The first sentence follows by standard language theory. It is easy to
construct a linear-bounded automaton accepting the language $L_{sym}$. On the
other hand, the intersection
$$L_{sym} \cap a^+b^+a^+b^+$$
consists of words $a^mb^na^pb^q$ with
$$mn + mq + pq = np,$$
and is not context-free. Consequently, $L_{sym}$ is not context-free. The second and
third sentence of the Theorem follow by Lemma 1 and Theorem 5, respectively.
The last sentence holds because reversing a word interchanges the occurrences
of ab and ba. △
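The arithmetic condition in the proof of Theorem 6 and the statement about palindromes can both be tested on small cases. The sketch below is illustrative; `block_word` is a hypothetical helper that builds $a^mb^na^pb^q$.

```python
from itertools import product


def difference(w):
    """D(w) = |w|_ab - |w|_ba for a word w over {a, b}."""
    num_a = num_b = d = 0
    for letter in w:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
    return d


def block_word(m, n, p, q):
    """The word a^m b^n a^p b^q."""
    return "a" * m + "b" * n + "a" * p + "b" * q


# D(a^m b^n a^p b^q) = (mn + mq + pq) - np, so D = 0 iff mn + mq + pq = np.
formula_ok = all(
    difference(block_word(m, n, p, q)) == m * n + m * q + p * q - n * p
    for m, n, p, q in product(range(1, 6), repeat=4)
)

# Every palindrome has D-value 0 (words of length up to 9 are tested).
palindromes_ok = all(
    difference(w) == 0
    for k in range(10)
    for w in map("".join, product("ab", repeat=k))
    if w == w[::-1]
)

print(formula_ok, palindromes_ok, difference(block_word(1, 2, 3, 1)))
```

For instance, (m, n, p, q) = (1, 2, 3, 1) satisfies the condition, so the word abbaaab belongs to $L_{sym}$.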
4 Prefixes of infinite words
We will now investigate the values of the function D on prefixes of infinite words
and, at the same time, subsets of the language $L_{sym}$, resulting as prefixes of an
infinite word. For instance, all prefixes of an odd length ≥ 3 of the infinite word
$$(ab)^\omega = ababab\cdots$$
constitute a subset of $L_{sym}$. Using the relations
$$D(wa) = D(w) - |w|_b \quad\text{and}\quad D(wb) = D(w) + |w|_a,$$
valid for any w (see also Lemma 4), it is easy to compute the D-value for any
prefix of this particular infinite word.
The Thue-Morse word, [1, 12], is the infinite word obtained by iterating the
morphism
$$a \to ab, \quad b \to ba$$
on the starting word a. Thus, the Thue-Morse word is the limit of the sequence
a, ab, abba, abbabaab, abbabaabbaababba, . . .
This infinite word is obtained also as follows. First write down the letter a.
Whenever you have written down the word w, write down the word $ww^c$, where
the “complement” $w^c$ is the word obtained from w by interchanging a and b.
We denote by Pref(TM) the set of prefixes of the Thue-Morse word.
Theorem 7 The intersection $\mathrm{Pref}(TM) \cap L_{sym}$ consists of all (nonempty)
prefixes whose length is divisible by 4. The range of the difference function D
on the set Pref(TM) contains no odd numbers, apart from 1 and −1. For every
$i \ge 0$, at least one of the numbers 2i and −2i is in the range.
Proof. Observe that the Thue-Morse word is a catenation of the blocks abba
and baab. Thus, by Corollary 1, all prefixes of length 4i, $i \ge 1$, belong to $L_{sym}$.
That no further prefixes belong to $L_{sym}$, follows by the subsequent discussion.
If w is the prefix of length 4i, $i \ge 1$, then one of the words
$$wabba \quad\text{and}\quad wbaab$$
is the prefix of length 4(i + 1). We assume, inductively, that as regards prefixes
of w, no odd number different from 1 and −1 appears as a D-value and, for all
$i_1 \le i$, at least one of the numbers $2i_1$ and $-2i_1$ appears as a D-value. Now,
because $|w|_a = |w|_b = 2i$, we have
$$D(wa) = -2i, \quad D(wab) = 1, \quad D(wabb) = 2(i+1), \quad D(wabba) = 0$$
and
$$D(wb) = 2i, \quad D(wba) = -1, \quad D(wbaa) = -2(i+1), \quad D(wbaab) = 0.$$
This extends the induction to i + 1, and the Theorem follows. △

We leave it as an open problem for which even numbers 2i actually both 2i
and −2i appear in the range. They correspond to the occurrences of the factors
abbaabba and baabbaab, the smallest such numbers being 4, 12, 16, 20, 28, 36, 44.
Observe that the D-value 0 appears periodically in the Thue-Morse word
but the language $\mathrm{Pref}(TM) \cap L_{sym}$ is not context-free.
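The statements of Theorem 7 and the list of even numbers in the remark after it can be explored numerically on a finite prefix. The sketch below is an illustration with our own function names; it generates the Thue-Morse word by the doubling rule $w \mapsto ww^c$ and records the D-value of every prefix.

```python
def thue_morse(n):
    """Prefix of length n of the Thue-Morse word, built by w -> w w^c."""
    swap = str.maketrans("ab", "ba")
    w = "a"
    while len(w) < n:
        w += w.translate(swap)
    return w[:n]


def prefix_d_values(word):
    """D-values of the prefixes of lengths 1, 2, ..., len(word)."""
    values, num_a, num_b, d = [], 0, 0, 0
    for letter in word:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
        values.append(d)
    return values


values = prefix_d_values(thue_morse(1024))
zero_lengths = [n for n, v in enumerate(values, start=1) if v == 0 and n >= 2]
seen = set(values)
both_signs = [2 * i for i in range(1, 100) if 2 * i in seen and -2 * i in seen]

print(zero_lengths[:5])
print(both_signs[:7])
```

A finite prefix can of course only illustrate, not settle, the open problem mentioned in the text.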
The above considerations are sufficient for the computation of the D-values
for the set of prefixes Pref(W) of an infinite ultimately periodic (binary) word
W, as well as for determining the intersection $\mathrm{Pref}(W) \cap L_{sym}$. Write W in the
form
$$W = xy^\omega = xyyy\cdots,$$
where x and $y \neq \lambda$ are binary words.
Definition 6 The balance indicator β of an infinite ultimately periodic word
$W = xy^\omega$ is defined by
$$\beta(W) = |x|_a \cdot |y|_b - |x|_b \cdot |y|_a.$$
Theorem 8 For W, x, y as above and for any $n \ge 0$,
$$D(xy^n) = D(x) + n(D(y) + \beta(W)).$$
Proof. By Theorem 5, $D(y^n) = nD(y)$. To get $D(xy^n)$, we have to add n
times the balance indicator β(W), as well as the value D(x). Indeed, each copy
of y forms $|x|_a \cdot |y|_b$ occurrences of ab and $|x|_b \cdot |y|_a$ occurrences of ba with
the letters of x. △
By considering conjugates of y, the D-value of every prefix of W can be
computed by Theorem 8. If we want to compute $D(xy^ny_1)$, where $y_1$ is a prefix
of y, then we write $y = y_1y_2$ and $W = (xy_1)(y_2y_1)^\omega$. In this way, also the
positions with the D-value 0 and the words belonging to the language $L_{sym}$ can
be found out. For instance,
$$\mathrm{Pref}((ab^3a^2b)^\omega) \cap L_{sym} = (ab^3a^2b)^+ \cup \{ab^3a,\ ab^3a^2bab\}.$$
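Theorem 8 and the example with the period $ab^3a^2b$ can be checked numerically. The sketch below is an illustration with hypothetical helper names; it compares the closed formula with a direct computation and lists the lengths of those prefixes of $(ab^3a^2b)^\omega$, up to length 70, that lie in $L_{sym}$.

```python
def difference(w):
    """D(w) = |w|_ab - |w|_ba for a word w over {a, b}."""
    num_a = num_b = d = 0
    for letter in w:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
    return d


def balance_indicator(x, y):
    """beta(W) for W = x y^omega."""
    return x.count("a") * y.count("b") - x.count("b") * y.count("a")


def d_by_theorem_8(x, y, n):
    """D(x y^n) computed from the formula of Theorem 8."""
    return difference(x) + n * (difference(y) + balance_indicator(x, y))


def lsym_prefix_lengths(word):
    """Lengths of the prefixes of a finite word that belong to L_sym."""
    return [
        n for n in range(1, len(word) + 1)
        if "a" in word[:n] and "b" in word[:n] and difference(word[:n]) == 0
    ]


Y = "abbbaab"  # a b^3 a^2 b
print(d_by_theorem_8("ab", "aab", 5), difference("ab" + "aab" * 5))
print(lsym_prefix_lengths(Y * 10))
```

The recomputation of D for every prefix is quadratic in the length and is acceptable only for such short experiments.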
5 The Fibonacci word: a case study

We now investigate the D-values and subsets of the language $L_{sym}$ on the
prefixes of the Fibonacci word, [1, 12]. It is the binary infinite word obtained
from the starting word a by iterating the morphism $a \to ab$, $b \to a$. Thus, the
Fibonacci word is the limit of the sequence
a, ab, aba, abaab, abaababa, abaababaabaab, abaababaabaababaababa, . . .
It is not ultimately periodic and has, for each $n \ge 1$, exactly n + 1 factors of
length n, [1, 12]. It is also very significant in the development of L systems, [11].
We denote by $F_i$, $i \ge 1$, prefixes of the Fibonacci word listed above, obtained
by successive iterations of the morphism. Thus,
$$F_1 = a, \quad F_5 = abaababa, \quad F_7 = abaababaabaababaababa.$$
We define the Fibonacci numbers $\varphi_i$, $i \ge 0$, as follows:
$$\varphi_0 = \varphi_1 = 1, \quad \varphi_{i+2} = \varphi_{i+1} + \varphi_i, \quad i \ge 0.$$
The Fibonacci word is connected in many ways with Fibonacci numbers. As
will be seen below, this holds true also in our setup concerning the language
$L_{sym}$ and the difference function D. We begin with the following result, [11].
Lemma 5 The prefixes $F_i$, $i \ge 2$, satisfy the following relations:
$$F_{i+1} = F_iF_{i-1}, \quad |F_i| = \varphi_i, \quad |F_i|_a = \varphi_{i-1}, \quad |F_i|_b = \varphi_{i-2}.$$
We now introduce the following notations, for $i \ge 1$:
$$A_i = |F_i|_{ab}, \quad B_i = |F_i|_{ba}, \quad C_i = A_i - B_i = D(F_i),$$
where D is our difference function.
Lemma 6 For $i \ge 4$, we have
$$A_i = A_{i-1} + A_{i-2} + \varphi_{i-2} \cdot \varphi_{i-4}$$
and, for $i \ge 3$,
$$B_i = B_{i-1} + B_{i-2} + \varphi_{i-3}^2.$$
Proof. We obtain, by the first equation in Lemma 5,
$$A_i = A_{i-1} + A_{i-2} + |F_{i-1}|_a \cdot |F_{i-2}|_b.$$
This yields, by the two last equations in Lemma 5, the claim concerning $A_i$. As
regards $B_i$, we obtain first
$$B_i = B_{i-1} + B_{i-2} + |F_{i-1}|_b \cdot |F_{i-2}|_a,$$
and hence the claim concerning $B_i$. △
As an example, we list some of the values in the subsequent table. The
interconnection between $C_i$ and $\varphi_{i-3}$ will become apparent below.

   i      $A_i$     $B_i$   $C_i$   $\varphi_{i-3}$
   4          4         2       2       1
   5          8         7       1       2
   6         22        18       4       3
   7         54        50       4       5
   8        141       132       9       8
   9        363       351      12      13
  10        946       924      22      21
  11       2464      2431      33      34
  12       6436      6380      56      55
  13      16820     16732      88      89
  14      43993     43848     145     144
  15     115101    114869     232     233
  16     301224    300846     378     377
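The table and the recurrences of Lemma 6 can be regenerated from the words $F_i$ themselves. The sketch below is an illustration with our own helper names; it builds $F_1, \dots, F_{16}$ by iterating the morphism and counts the subwords ab and ba directly.

```python
def fibonacci_words(count):
    """[F_1, ..., F_count], with F_1 = a and the morphism a -> ab, b -> a."""
    result = ["a"]
    while len(result) < count:
        result.append("".join("ab" if c == "a" else "a" for c in result[-1]))
    return result


def ab_ba(w):
    """Return the pair (|w|_ab, |w|_ba)."""
    num_a = num_b = ab = ba = 0
    for letter in w:
        if letter == "a":
            ba += num_b
            num_a += 1
        else:
            ab += num_a
            num_b += 1
    return ab, ba


# PHI[i] is the Fibonacci number with PHI[0] = PHI[1] = 1, as in the text.
PHI = [1, 1]
while len(PHI) < 17:
    PHI.append(PHI[-1] + PHI[-2])

# A[i], B[i] for i = 1..16; index 0 is unused padding.
A, B = [None], [None]
for f in fibonacci_words(16):
    ab, ba = ab_ba(f)
    A.append(ab)
    B.append(ba)

for i in range(4, 17):
    print(i, A[i], B[i], A[i] - B[i], PHI[i - 3])
```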
We are now ready to establish the main result in this section.
Theorem 9 The difference function D satisfies the equation
$$D(F_i) = C_i = \varphi_{i-3} + (-1)^i, \quad\text{for } i \ge 3.$$
Moreover, all prefixes of lengths $\varphi_i - 2$, for $i \ge 4$, of the Fibonacci word belong
to the language $L_{sym}$.
Proof. We compute first, using Lemma 6, for $i \ge 4$,
$$C_i = A_i - B_i = (A_{i-1} + A_{i-2} + \varphi_{i-2} \cdot \varphi_{i-4}) - (B_{i-1} + B_{i-2} + \varphi_{i-3}^2).$$
Using the formula
$$\varphi_{i-2} \cdot \varphi_{i-4} = \varphi_{i-3}^2 + (-1)^i, \quad i \ge 4,$$
which is easily established by induction, we obtain further
$$C_i = C_{i-1} + C_{i-2} + (-1)^i, \quad i \ge 4.$$
We denote $\varphi'_i = \varphi_{i-3} + (-1)^i$, $i \ge 3$. By a direct computation,
$$\varphi'_3 = 0 = C_3, \quad \varphi'_4 = 2 = C_4.$$
For $i \ge 5$, we obtain
$$\varphi'_i = \varphi_{i-3} + (-1)^i = \varphi_{i-4} + \varphi_{i-5} + (-1)^i$$
which yields further
$$\varphi'_i = \varphi'_{i-1} + (-1)^i + \varphi'_{i-2} + (-1)^{i-1} + (-1)^i = \varphi'_{i-1} + \varphi'_{i-2} + (-1)^i.$$
Consequently, by induction, $C_i = \varphi'_i$, for $i \ge 3$, which establishes the first
sentence of Theorem 9. To prove the second sentence, assume first that $i \ge 4$ is
even. Then $F_i$ has the suffix ab. We write $F_i = w_i ab$. Since $|F_i| = \varphi_i$, the claim
is that $w_i \in L_{sym}$. Denote $D(w_i) = x$. We obtain, by the already established
first sentence of Theorem 9 and because i is even,
$$C_i = \varphi_{i-3} + 1 = x - |w_i|_b + |w_i|_a + 1.$$
Lemma 5 yields $|w_i|_b = \varphi_{i-2} - 1$ and $|w_i|_a = \varphi_{i-1} - 1$. Since $\varphi_{i-3} + \varphi_{i-2} = \varphi_{i-1}$,
this implies x = 0 and, consequently, $w_i \in L_{sym}$.
If $i \ge 4$ is odd, we know that $C_i = \varphi_{i-3} - 1$. In this case $F_i$ has the suffix ba.
Writing again $F_i = w_i ba$ and denoting $D(w_i) = x$, we obtain similarly as before
$$x + \varphi_{i-1} - 1 - \varphi_{i-2} = \varphi_{i-3} - 1,$$
whence the results x = 0 and $w_i \in L_{sym}$ follow. △
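Both sentences of Theorem 9 can be checked on the first few words $F_i$. The sketch below is an illustration with our own helper names; it also lists all prefix lengths up to 200 that lie in $L_{sym}$, purely as experimental data, since the text states no proof that the prefixes of Theorem 9 are the only ones.

```python
def fibonacci_words(count):
    """[F_1, ..., F_count], with F_1 = a and the morphism a -> ab, b -> a."""
    result = ["a"]
    while len(result) < count:
        result.append("".join("ab" if c == "a" else "a" for c in result[-1]))
    return result


def prefix_d_values(word):
    """D-values of the prefixes of lengths 1, 2, ..., len(word)."""
    values, num_a, num_b, d = [], 0, 0, 0
    for letter in word:
        if letter == "a":
            d -= num_b
            num_a += 1
        else:
            d += num_a
            num_b += 1
        values.append(d)
    return values


# PHI[i] is the Fibonacci number with PHI[0] = PHI[1] = 1, as in the text.
PHI = [1, 1]
while len(PHI) < 17:
    PHI.append(PHI[-1] + PHI[-2])

F = [None] + fibonacci_words(16)      # F[i] is the word F_i
VALUES = prefix_d_values(F[16])       # VALUES[n - 1] = D(prefix of length n)

first_sentence = all(
    VALUES[PHI[i] - 1] == PHI[i - 3] + (-1) ** i for i in range(3, 17)
)
second_sentence = all(VALUES[PHI[i] - 3] == 0 for i in range(4, 17))
palindromes = all(F[i][:-2] == F[i][:-2][::-1] for i in range(4, 17))

print(first_sentence, second_sentence, palindromes)
print([n for n in range(2, 201) if VALUES[n - 1] == 0])
```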
Observe that the word $w_8$ considered in Section 2 is the same as the $w_8$
discussed in the proof above. The matrix $M_{aba}(w_8)$ was presented in Section
2. The entries (1, 3) and (2, 4) are the same, as they should be according to
Theorems 6 and 9.
It is very likely that no other prefixes of the Fibonacci word are in $L_{sym}$
than those given in Theorem 9. We have no proof of this assertion.
It can be shown that the words $w_i$ considered in the proof of Theorem 9 are
actually palindromes. By Theorem 6, this gives another proof for the second
sentence of Theorem 9.
For each $i \ge 3$, the function D assumes both positive and negative values
on the prefixes of the Fibonacci word between $F_i$ and $F_{i+1}$. Bounds on these
values, depending on i, are easily computed. For instance, for all $i \ge 5$,
$$D(F_i\,abaa) \le 2(\varphi_{i-3} - \varphi_{i-2}).$$
6 Conclusion
Considerations dealing with the difference function D and the language Lsym
can be extended in many ways. For instance, we may consider the infinite
words generated by D0L systems and the “D-equivalence” of D0L sequences.
Two quite different D0L sequences, even two sequences having different growth
functions, can still be D-equivalent: they assume the same D-value at each
level. How can one decide D-equivalence? Considerations can also be extended
to alphabets larger than binary.
References
[1] Berstel, J. and Karhumäki, J., Combinatorics on words - a tutorial. In Păun, G.,
Rozenberg, G. and Salomaa, A. (eds.): Current Trends in Theoretical Computer
Science. The Challenge of a New Century, Vol. 2, World Scientific Publishing
Company, Singapore (2004) 415–475.
[2] Ding, C. and Salomaa, A., On some problems of Mateescu concerning subword
occurrences. Fundamenta Informaticae (2006), to appear.
[3] Dudik, M. and Schulman, L.J., Reconstruction from subsequences. J. Combin.
Th. A 103 (2002) 337–348.
[4] Fossé, S. and Richomme, G., Some characterizations of Parikh matrix equivalent
binary words. Inform. Proc. Lett. 92 (2004) 77–82.
[5] Kuich, W. and Salomaa, A., Semirings, Automata, Languages. Springer-Verlag,
Berlin, Heidelberg, New York (1986).
[6] Manvel, B., Meyerowitz, A., Schwenk, A., Smith, K. and Stockmeyer, P.,
Reconstruction of sequences. Discrete Math. 94 (1991) 209–219.
[7] Mateescu, A., Salomaa, A., Salomaa, K. and Yu, S., A sharpening of the Parikh
mapping. Theoret. Informatics Appl. 35 (2001) 551–564.
[8] Mateescu, A., Salomaa, A. and Yu, S., Subword histories and Parikh matrices.
J. Comput. Syst. Sci. 68 (2004) 1–21.
[9] Mateescu, A. and Salomaa, A., Matrix indicators for subword occurrences and
ambiguity. Int. J. Found. Comput. Sci. 15 (2004) 277–292.
[10] Parikh, R.J., On context-free languages. J. Assoc. Comput. Mach. 13 (1966)
570–581.
[11] Rozenberg, G. and Salomaa, A., The Mathematical Theory of L Systems.
Academic Press, New York (1980).
[12] Rozenberg, G. and Salomaa, A. (eds.), Handbook of Formal Languages 1–3.
Springer-Verlag, Berlin, Heidelberg, New York (1997).
[13] Sakarovitch, J. and Simon, I., Subwords. In M. Lothaire: Combinatorics on
Words, Addison-Wesley, Reading, Mass. (1983) 105–142.
[14] Salomaa, A., Counting (scattered) subwords. EATCS Bulletin 81 (2003)
165–179.
[15] Salomaa, A., On the injectivity of Parikh matrix mappings. Fundamenta
Informaticae 64 (2005) 391–404.
[16] Salomaa, A., Connections between subwords and certain matrix mappings.
Theoretical Computer Science 340 (2005) 188–203.
[17] Salomaa, A., On languages defined by numerical parameters. TUCS Technical
Report 663 (2005), submitted for publication.
[18] Salomaa, A., Independence of certain quantities indicating subword occurrences.
TUCS Technical Report 748 (2006), submitted for publication.
[19] Salomaa, A. and Yu, S., Subword conditions and subword histories. TUCS
Technical Report 633 (2004), Information and Computation, to appear.
[20] Şerbănuţă, T.-F., Extending Parikh matrices. Theoretical Computer Science 310
(2004) 233–246.
[21] Şerbănuţă, V.G. and Şerbănuţă, T.F., Injectivity of the Parikh matrix mappings
revisited. Fundamenta Informaticae (2006), to appear.
Lemminkäisenkatu 14 A, 20520 Turku, Finland | www.tucs.fi
University of Turku
• Department of Information Technology
• Department of Mathematics
Åbo Akademi University

• Department of Computer Science
• Institute for Advanced Management Systems Research
Turku School of Economics and Business Administration

• Institute of Information Systems Sciences
ISBN 952-12-1739-1
ISSN 1239-1891