
Interword Distance Changes Represented by Sine Waves for Watermarking Text Images




Ding Huang and Hong Yan
School of Electrical and Information Engineering
University of Sydney, NSW 2006
Email: hding@ee.usyd.edu.au, yan@ee.usyd.edu.au


Abstract

Digital watermarking is widely believed to be a
valid means to discourage illicit distribution of
information content. Digital watermarking
methods for text documents are limited because
of the binary nature of text documents. A distinct
feature of a text document is its space patterning.
In this paper we propose a new approach in text
watermarking where interword spaces of
different text lines are slightly modified. After the
modification, the average spaces of various lines
have the characteristics of a sine wave and the
wave constitutes a mark. Both private and public
watermarking algorithms are discussed in this
paper. Preliminary experiments have shown
promising results. Our experiments suggest
space patterning of text documents can be a
useful tool in digital watermarking.


Index Terms: Copyright protection,
correlation detection, image processing, space
patterning, text watermarking, wave coding.


I. Introduction

With the widespread use of the Internet in our
society, the distribution of and access to
information are greatly facilitated. However,
without methods which prevent or discourage
illicit redistribution and reproduction of
information content, copyright can be easily
infringed. Digital watermarking is widely
believed to be a valid solution to the problem
and currently there is intensive research in this
area, in both academic and industrial
communities.
Compared to the large number of previously
proposed methods in digital watermarking for
picture and video images, digital watermarking
methods for text documents are very limited.
One reason for this difference is that text is a
binary image and lacks rich grayscale
information.
Digital watermarking for text documents is
discussed in [3], [4], and [6]-[9]. There are
primarily three types of text watermarking
methods which have been developed previously:
(a) Line-Shift Coding: vertically shifts the
locations of text lines to encode the
document.
(b) Word-Shift Coding: horizontally shifts the
locations of words within text lines to
encode the document.
(c) Feature Coding: chooses certain text
features and alters those features.
These three methods require the original
unmarked text for decoding. Furthermore, since
there are already variations in the original
unmarked text, these methods need to establish
some measurement basis to differentiate
watermarks from the original variation.
In this paper, we propose a new text
watermarking method. It makes use of a
distinct characteristic of a text document, its space,
and more specifically the interword spaces of text lines,
to watermark a text document. Our encoding
technique adjusts interword spaces in a text
document so that mean spaces across different
lines show characteristics of a sine wave, and
information can be encoded in the sine wave(s).
Since the watermark is embedded in both
horizontal and vertical directions, the method can
be inherently robust against interference.
Furthermore, information can be recovered either
with or without the original image, and control
lines or control blocks are not needed for
decoding. We propose to call our new
watermarking method Space Coding.
In Section II, we analyze the characteristics of
space in a text document and show that average
spaces of text lines are random. We define a
random variable for average space in Section III
and propose to use a sine wave to modify
average spaces for encoding information. In
Section IV, we describe watermarking detection
methods and present some test results. Finally,
we present concluding remarks in Section V and
suggest future work.

II. Space Profile and Statistics

A text page in digital form can be represented
by the following function

$$f(x, y) \in [0, 1], \quad x = 0, 1, \ldots, W, \quad y = 0, 1, \ldots, L$$

that represents black and white pixels. Here, $W$
and $L$ are the width and length of the page in
pixels respectively.
In digital image processing, interword space is
detected with the vertical projection profile

$$v(x) = \sum_{y=t}^{b} f(x, y)$$

which is a summation of the "on" pixels along a
vertical column from $t$ (top) to $b$ (bottom) of a
text line. If there is no "on" pixel for a
consecutive run of $x$,

$$v(x) = 0, \quad x = k, k+1, \ldots, k+c$$

then interword space is detected. Figure 1 shows
a typical vertical profile of five words.
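
The profile-based detection described above can be sketched in Python. This is a minimal sketch and not the authors' implementation: it assumes the text line is given as a list of pixel rows, `page[y][x]`, with 1 for an "on" (black) pixel, and the function names and the `min_run` threshold (a hypothetical minimum run length for an empty stretch to count as an interword space) are assumptions made for illustration.

```python
def vertical_profile(page, top, bottom):
    """v(x): number of 'on' pixels in column x between rows top and bottom (inclusive)."""
    rows = page[top:bottom + 1]
    width = len(page[0])
    return [sum(row[x] for row in rows) for x in range(width)]


def zero_runs(profile, min_run=1):
    """Return (start, length) for every run of empty columns of length >= min_run."""
    runs = []
    start = None
    for x, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = x              # a new empty run begins here
        elif start is not None:
            if x - start >= min_run:
                runs.append((start, x - start))
            start = None
    # Close a run that reaches the right edge of the line.
    if start is not None and len(profile) - start >= min_run:
        runs.append((start, len(profile) - start))
    return runs


def interword_gaps(profile, min_run=1):
    """Keep only the empty runs that lie between words, dropping the two margins."""
    width = len(profile)
    return [(s, n) for (s, n) in zero_runs(profile, min_run)
            if s > 0 and s + n < width]
```

In practice `min_run` would have to be chosen larger than the gaps between characters inside a word, which the sketch does not try to estimate.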
Intuitively, the average space of a text line can be
used as a parameter to study the characteristics
of space patterning in a text document. For a line
with $d$ words, we have an average space

$$S_a = \frac{S_t}{d - 1}, \quad d > 1 \tag{1}$$

where $S_t$ is the total interword space in a text
line calculated in pixels.
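
As a small illustration of formula (1), the average space can be computed directly from the measured gap widths of a line. The function name and the list-of-gaps input are assumptions for this sketch; a line with $d$ words is taken to have $d - 1$ interword gaps.

```python
def average_space(gap_widths):
    """S_a = S_t / (d - 1), where gap_widths holds the d - 1 interword gaps in pixels."""
    if not gap_widths:
        raise ValueError("a line needs at least two words to have an interword space")
    total_space = sum(gap_widths)            # S_t
    return total_space / len(gap_widths)     # len(gap_widths) equals d - 1
```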
There are generally two types of text. One is
aligned only at the left; the other is justified at
both the left and the right. Our experiments show
that space marking on text which is justified at
both sides is more noticeable than the marking
on text aligned only at the left. For this reason,
the experiments in this paper focus on text which
is justified at both sides.
Figure 2 shows a typical profile of $S_a$ with
respect to text lines in a paragraph. It can be seen
that $S_a$ varies randomly across different lines.

III. Space Marking

Due to the random nature of average spaces of
text lines across a text document, we can define a
discrete random variable (rv) $X(n)$ as follows:

$$X(n) = S_{an}, \quad n = 0, 1, \ldots, N - 1 \tag{2}$$

where $n$ represents the index of a text line in a
text document with $N$ lines, and $S_{an}$ represents
$S_a$ of line $n$. Our space marking method can thus
be viewed as marking on the rv $X(n)$.
A space-varying sine wave, or, more
specifically, a sine wave which varies over a
number of text lines, has some attractive
characteristics for watermarking interword space,
as follows:
1) A sine wave varies gradually, so local
variation may go unnoticed.
2) A sine wave's amplitude, frequency and
phase can be used to carry coding
information.
3) A sine wave's periodic symmetry may make
decoding easy and reliable.
Thus, we can use different text lines across a
text document to encode a sine wave. More
specifically, the values of $S_a$ for different text
lines, or their variations, can act as sampling
values of a sine wave.
For space watermarking to be unnoticeable,
changes in interword spaces have to be kept to a
minimum. On the other hand, there should be
enough space modification so that marking can
be correctly detected. These contradicting
requirements suggest a narrow range of marking
amplitude.
The Sampling Theorem suggests that for a
sine wave to be reconstructed correctly, the
sampling frequency has to be at least twice that
of the sine wave. From the viewpoint of human
visual perception [10], there exist certain
frequencies at which variations are most
noticeable. It is reasonable to avoid marking in
the neighborhood of these frequencies. So the
capacity of a sine wave's frequency to carry
watermarking information is limited.
Thus, the phase of a sine wave has been
chosen primarily to carry information in space
watermarking.
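
As a short worked version of the sampling constraint, assume each text line contributes exactly one sample of the wave, so that the sampling frequency is one sample per line, or $2\pi$ radians per line. For a marking wave of radian frequency $\omega$ (radians per line), the requirement that the sampling frequency be at least twice the wave frequency then reads

$$2\pi \ge 2\omega \quad \Longleftrightarrow \quad \omega \le \pi \quad \Longleftrightarrow \quad \frac{2\pi}{\omega} \ge 2 \text{ lines per period.}$$

Under this assumption, a wave must span at least two text lines per period, and a wave of frequency $\omega = \pi / 5$, for example, completes one period every 10 lines.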
In comparison to merely shifting words
horizontally in word-shift coding, the
modification of the interword space of a text line
involves the words in this line being horizontally
expanded or shrunk so that the required total
interword space, and thus $S_a$ of the line, is
obtained.
Suppose a new average space $S_a'$ is to be
used after the modification of the interword
space of a text line. Then, the change of the total
interword space of this text line in pixels is

$$S_{tc} = (S_a' - S_a)(d - 1) \tag{3}$$

where $d$ is the number of words, and $S_a$ is the
original average space of the text line as in (1).
If $S_{tc} > 0$, then the total interword space of
this text line will expand and the words in this
text line will shrink. If $S_{tc} < 0$, the total
interword space of this text line will shrink and
the words in this text line will expand.
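
As a worked example with hypothetical values (not taken from the experiments), consider a line of $d = 11$ words whose average space is to be raised from $S_a = 20$ pixels to $S_a' = 22$ pixels:

$$S_{tc} = (S_a' - S_a)(d - 1) = (22 - 20)(11 - 1) = 20 \text{ pixels.}$$

Since the two sides of the line stay fixed, the ten interword spaces gain 20 pixels in total and the eleven words together must give up the same 20 pixels.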
For any word in this text line with an index $i$,
suppose its width before the modification is
$Pxl_i$ in pixels, then the expansion or shrinkage
of width distributed to this word is

$$ES_i = \left[ \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j} \, Pxl_i \right], \quad \text{if } S_{tc} > 0 \tag{4.a}$$

or

$$ES_i = \left[ \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j} \, Pxl_i \right], \quad \text{if } S_{tc} < 0 \tag{4.b}$$

$ES_i$ is rounded to an integer (the square brackets), since it represents
a number of pixels. Therefore, a difference may
exist between $S_{tc}$ and the sum of $ES_i$, which is

$$S_d = S_{tc} - \sum_{i=1}^{d} ES_i \tag{5}$$

In our implementation, the difference $S_d$ is
added to the largest $ES_i$.
To expand or shrink a word, vertical lines of
equal intervals in the word are duplicated or
removed respectively. The interval is calculated
as

$$Iv_i = \left[ \frac{Pxl_i}{ES_i} \right] \tag{6}$$

Again, the interval $Iv_i$ is rounded to an
integer. After the expansion or shrinkage of this
word, it will have a new width in pixels, which is

$$Pxl_i' = Pxl_i - ES_i \tag{7}$$

The two sides of a text line are not changed
while the line is being expanded or shrunk. To
shrink or expand words, at the left half of this
text line, the left side of each word is kept fixed
and the word is shrunk or expanded, i.e. vertical
lines of equal intervals are removed or
duplicated; at the right half of this text line, the
right side of each word is kept fixed and the
word is shrunk or expanded.
A single text page or several pages of a text
document are considered to be a workplace.
Relevant text lines in this workplace are
considered as sampling points of the sine wave
for watermarking. Phase information can be
either the absolute phase in this workplace or
the relative phase if several waves are involved.
For the proposed space marking method, we
have implemented both private and public
watermarking algorithms.

A. Private Watermarking

1. First, the mean of $S_a$ is calculated

$$a_1 = \frac{\sum_{n=p}^{q} S_{an}}{q - p + 1}, \quad 0 \le p < q < N \tag{8}$$

where $p$ and $q$ are the indices of the first
and last text lines in the workplace between
which a watermarking sine wave will reside.
2. Then for each line, a watermark component
is determined by the sine wave

$$W_n = C_1 a_1 \sin(\omega_1 (n - p) + \phi_1) \tag{9}$$

where $W_n$ represents the desired watermark
component of a text line with an index of $n$;
$\omega_1$ and $\phi_1$ are the radian frequency and
initial phase angle of the sine wave
respectively. $C_1$ is a constant determining
the amplitude of the sine wave.
3. Next, $W_n$ is added to $S_a$ for line $n$,
thus generating a new average space which is

$$S_{an}' = S_{an} + W_n \tag{10}$$

4. Finally, the words in each text line are
modified accordingly by applying formulas
(3) to (7).
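
The word-level adjustment of formulas (4) to (7) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, Python's built-in `round` stands in for the unspecified rounding rule, "the largest $ES_i$" is read as largest in magnitude, and the absolute value in the interval computation is likewise an assumption.

```python
def distribute_change(word_widths, s_tc):
    """Split a total space change s_tc (pixels) over the words of one line.

    Returns (es, new_widths). es[i] is the width taken from word i;
    a negative value means the word grows.
    """
    total_width = sum(word_widths)
    # Formula (4): share proportional to word width, rounded to whole pixels.
    es = [round(s_tc * w / total_width) for w in word_widths]
    # Formula (5): pixels left over because of rounding.
    s_d = s_tc - sum(es)
    # The leftover goes to the largest ES_i (largest in magnitude: an assumption).
    largest = max(range(len(es)), key=lambda i: abs(es[i]))
    es[largest] += s_d
    # Formula (7): new word widths.
    new_widths = [w - e for w, e in zip(word_widths, es)]
    return es, new_widths


def column_interval(width, es_i):
    """Formula (6): spacing between the pixel columns to duplicate or remove."""
    if es_i == 0:
        return None                      # this word is left unchanged
    return round(width / abs(es_i))      # abs() is an assumption for growing words
```

After the leftover is reassigned, the per-word changes always sum to `s_tc`, so the overall line width is preserved.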
The private method can be considered as
adding a constant (non-random) part to the original rv $X(n)$,
thus generating a new rv $Y(n)$

$$Y(n) = X(n) + W_n \tag{11}$$
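
Steps 1 to 3 of the private method operate only on the per-line average spaces and can be sketched as below. The function name and argument names are hypothetical, and the sketch assumes the average spaces have already been measured; the pixel-level modification of the words is not included.

```python
import math


def private_mark(avg_spaces, p, q, c1, omega1, phi1):
    """Return the target average spaces S'_an for lines p..q (other lines unchanged)."""
    # Formula (8): mean average space over the marked lines.
    a1 = sum(avg_spaces[p:q + 1]) / (q - p + 1)
    marked = list(avg_spaces)
    for n in range(p, q + 1):
        # Formula (9): watermark component for line n.
        w_n = c1 * a1 * math.sin(omega1 * (n - p) + phi1)
        # Formula (10): new average space for line n.
        marked[n] = avg_spaces[n] + w_n
    return marked
```

With the original average spaces at hand, subtracting them line by line from the marked values recovers the components $W_n$, which is the sense in which the method is private.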

B. Public Watermarking

Unlike private watermarking, after which
neighboring text lines still have random values
of $S_a$, the values of $S_a$ of the lines used in
public watermarking should have a certain
relationship so that they can directly act as the
sampling values of a sine wave.
Our experiments show that it is inappropriate
to take all lines in a text document for public
watermarking because of the degree of variation
in $S_a$ of the original text lines. Observation of
the profiles of $S_a$ suggests text lines with a
larger number of words have closer values of
$S_a$. This can be attributed to two reasons.
First, in a text line with a larger number of
words, an average word and its associated space
is allocated a smaller number of pixels. Thus, the
difference between $S_a$ of similar lines is
smaller. Second, a text line with a larger number
of words is less likely to be justified, or is
justified to a smaller degree.
1. Based on the above observation, first a key
is chosen in a text so that all text lines whose
number of words is larger than, or equal to,
the key are watermarked.
2. Then a set $S_w$ of text lines is selected from
the text so that the number of words in
each line in this set is not less than the
selected key.
3. Then the mean of $S_a$ of text lines in set $S_w$
is calculated

$$a_2 = \frac{\sum_{m=u}^{v} S_{am}}{v - u + 1}, \quad 0 \le u < v < N \tag{12}$$

where $u$ and $v$ are similar to $p$ and $q$ in
(8), but $u$ and $v$ are the indices of the text
lines within $S_w$, rather than the indices in the
original text in our implementation; $m$ is the
index of a text line within $S_w$, and $S_{am}$
represents $S_a$ of line $m$.
4. For each text line in $S_w$, a watermark
component is determined by a sine wave

$$W_m = C_2 a_2 \sin(\omega_2 (m - u) + \phi_2) \tag{13}$$

where $W_m$ represents the desired watermark
component of text line $m$; $\omega_2$ and $\phi_2$ are
the radian frequency and initial phase angle
of the sine wave respectively.
5. Next, for each line in $S_w$, $S_a$ is replaced
with the sum of $a_2$ and $W_m$, thus generating
a new average space which is

$$S_{am}' = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{14}$$

6. Finally, the words in each text line are
modified accordingly by applying formulas
(3) to (7).
Thus, for text lines belonging to $S_w$, we have
a new rv $Y(m)$

$$Y(m) = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{15}$$
IV. Detection and Performance

When a text has been watermarked using the
private method, we can obtain rv $Y(n)$ by a
reconstruction of $S_a$ as in formula (1). With the
original unmarked text, we have the watermark
component $W_n$ from (11)

$$W_n = Y(n) - X(n)$$

Figure 3 shows a reconstructed profile of $S_a$
and detected watermarking information of the
text of Figure 2 after it has been subjected to
private watermarking. Interword spaces were
only expanded for this example.
When a text has been watermarked using the
public method, and supposing the watermarking
key is known, the set $S_w$ of the text lines is
reconstructed and the mean of the average spaces
$a_2$ is recalculated as in (12). We then also have the
watermark component $W_m$ from (15)

$$W_m = Y(m) - a_2, \quad \text{for text lines in } S_w$$

Next, the originally marked phase information
can be detected by calculating the cross-
correlation of a detecting sine wave with $W_n$ or
$W_m$

$$r(j) = \frac{1}{T} \sum_{n=0}^{T-1} W(n) \, A_d \sin(\omega_d (n + j)) \tag{16}$$

where $W$ represents $W_n$ or $W_m$; $\omega_d$ is the
radian frequency of the detecting sine wave; and
$j$ represents a lag in the number of text lines
and varies so as to detect the marked phase
information. Through the $j$ that produces an
extreme value of $r(j)$, the original marked
phase information can be recovered. $A_d$ is the
amplitude of the detecting sine wave. $T$ is the
summation number, which depends on the
number of items in $W_n$ or $W_m$ as well as $\omega_d$
[11].
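
The correlation detector of formula (16) can be sketched as below. The names are hypothetical, the sketch takes $T$ to be the number of available watermark components, and it searches for the maximum of $r(j)$ as one reading of "extreme value".

```python
import math


def cross_correlation(w, omega_d, j, a_d=1.0):
    """Formula (16): r(j) = (1/T) * sum over n of W(n) * A_d * sin(omega_d * (n + j))."""
    t = len(w)   # T is taken as the number of components (an assumption)
    return sum(w[n] * a_d * math.sin(omega_d * (n + j)) for n in range(t)) / t


def best_lag(w, omega_d, lags):
    """Return the lag j (in text lines) at which r(j) is largest."""
    return max(lags, key=lambda j: cross_correlation(w, omega_d, j))
```

For instance, for components generated as `2 * sin(omega * (n + 3))` over two full periods with `omega = pi / 4`, the search over lags 0 to 7 returns 3, which corresponds to an initial phase of `3 * omega`. When the components do not cover whole periods, the correlation peak is less clean, which is presumably why $T$ depends on $\omega_d$.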
One parameter that has been used in our tests
is "half wave sampling points", being a number
of sampling points $N$ in the encoding sine wave
of (9) or (13) such that

$$0 \le N\omega < \pi, \quad \text{where } \omega \text{ is } \omega_1 \text{ or } \omega_2$$

Our test results are shown in Tables 1 to 3.


Table 1. Detection results of private
watermarking.

Half Wave Sampling Points   10      7       5       3
Correct Rate                20/20   20/20   20/20   20/20


Table 2. Detection results of public
watermarking.

Half Wave Sampling Points   7       6       5       3
Correct Rate                14/15   15/15   14/15   21/23


Table 3. Detection results after watermarked
texts have been edited.

Half Wave Sampling Points   6       3
Down Sampling               15/15   15/15
Skewing                     15/15   15/15

From our experiments, it can be concluded
that space in text documents can be watermarked
unnoticeably and the watermarks can be
detected correctly in nearly all test cases, even
after the watermarked texts have been subjected
to editing operations.

V. Conclusion

A distinct and unique characteristic of a text
document is its space pattern. We have
developed new algorithms to digitally watermark
text documents by utilizing their space. Our
method slightly modifies interword spaces so
that different lines across a text act as sampling
points of a sine wave. Preliminary experiments
have shown promising results. Compared to
previously proposed text watermarking methods,
our method can be implemented for both private
and public watermarking. Furthermore, by
embedding information in both horizontal and
vertical directions, combined with utilizing
averaging operations, it can be inherently robust
against interference. Overall, our method is
better than previously proposed methods in
digital watermarking for text documents.
Currently, we are exploring if space patterning
of a text document can be used in other ways for
digital watermarking. The principle of sine wave
coding, which is utilized in the experiments for
this paper, may also be useful for digital
watermarking in the frequency domain for
general grayscale picture and video images.

References

[1] R. Anderson, Ed., Proceedings of the First
International Information Hiding Workshop,
Cambridge, U.K., May/June 1996, vol. 1174 of
Lecture Notes in Computer Science, Berlin,
Springer-Verlag, 1996.
[2] D. Aucsmith, Ed., Proceedings of the Second
International Information Hiding Workshop,
Portland, Oregon, USA, April 1998, vol. 1525 of
Lecture Notes in Computer Science, Berlin,
Springer-Verlag, 1998.
[3] J. Brassil, S. Low, N. Maxemchuk, and L.
O'Gorman, "Electronic Marking and
Identification Techniques to Discourage
Document Copying," IEEE Journal on Selected
Areas in Communications, vol. 13, no. 8, pp.
1495-1504, October 1995.
[4] J. Brassil and L. O'Gorman, "Watermarking
Document Images with Bounding Box
Expansion," in Anderson [1], pp. 227-235.
[5] Special Issue on Copyright and Privacy
Protection, IEEE Journal on Selected Areas in
Communications, vol. 16, no. 4, May 1998.
[6] S. Katzenbeisser and F. A. P. Petitcolas, Eds.,
Information Hiding Techniques for
Steganography and Digital Watermarking,
Boston, Artech House, 2000.
[7] S. H. Low and N. F. Maxemchuk, "Performance
Comparison of Two Text Marking Methods," in
special issue [5], pp. 561-572.
[8] S. H. Low, N. F. Maxemchuk, J. T. Brassil, and
L. O'Gorman, "Document Marking and
Identification Using Both Line and Word
Shifting," Proc. Infocom'95, Boston, MA, April
1995, pp. 853-860.
[9] S. H. Low, N. F. Maxemchuk, and A. M. Lapone,
"Document Identification for Copyright
Protection Using Centroid Detection," IEEE
Transactions on Communications, vol. 46, no. 3,
pp. 372-383, March 1998.
[10] T. N. Cornsweet, Visual Perception, New York,
Academic Press, 1970, pp. 330-342.
[11] E. C. Ifeachor and B. W. Jervis, Digital Signal
Processing: A Practical Approach, Reading,
Mass., Addison-Wesley, 1993.
[12] S. Craver, N. Memon, B. Yeo, and M. Yeung,
"Resolving Rightful Ownerships with Invisible
Watermarking Techniques: Limitations, Attacks,
and Implications," in special issue [5], pp. 573-586.

Figure 1. Vertical profile of 5 words (original resolution is 300 pixels/inch, re-sampled at an interval of 3 pixels
horizontally). [Plot not reproduced. Axis label: pixel; vertical scale 0 to 30; horizontal ticks 1 to 271.]

Figure 2. Profile of average space for text lines in a paragraph (resolution: 300 pixels/inch).
[Plot not reproduced. Horizontal axis: text line, 1 to 23. Top curve: average space (pixel). Bottom curve: word number. Vertical scale 0 to 50.]


Figure 3. Reconstructed profile of average space and detected watermark information after the text of
Figure 2 has been subjected to private watermarking (resolution: 300 pixels/inch).
[Plot not reproduced. Horizontal axis: text line, 1 to 23. Top curve: reconstructed average space. Bottom curve: detected watermark information (pixel). Vertical scale 0 to 55.]
