Vous êtes sur la page 1sur 5

2015 13th International Conference on Document Analysis and Recognition (ICDAR)

SRIF: Scale and Rotation Invariant Features for


Camera-Based Document Image Retrieval

Q.B. Dang*, M.M. Luqman*, M. Coustaty*, C.D. Trant and 1.M. Ogier*
* L3i Laboratory, University of La Rochelle, France
t College of Information and Conununication Technology, Can Tho University, Vietnam
e-mail: quoc_bao.dang@univ-lrJr

Abstract-In this paper, we propose a new feature vector, of establishing a way of efficiently matching document images.
named Scale and Rotation Invariant Features (SRIF), for real­
time camera-based document image retrieval. SRIF is based
on Locally Likely Arrangement Hashing (LLAH), which has A document image retrieval system with camera phones
been widely used and accepted as an efficient real-time camera­ was proposed by Liu and Doermann in 2007 [11]. They
based document image retrieval method based on text. SRIF
proposed the features called "Layout Context" descriptor. The
is computed based on geometrical constraints between pairs of
"Layout Context" features are extracted from the geometrical
nearest points around a keypoint. It can deal with feature point
location of the words' bounding boxes in a document image.
extraction errors which are introduced as a result of the camera
capturing of documents. The experimental results show that SRIF
Beginning at the center of the word and looking for the most
outperforms LLAH in terms of retrieval accuracy and processing visible n neighbors, the "Layout Context" of a word w is
time. proposed. From the angle of the view, the visibility is defined.
According to the authors, the top n visible neighbors are
I. INT ROD U C T ION AND R EL AT E D WORK
rotation invariant and the percentage of view angles that a
neighbor words occupies are also subjected to rotation. From
Recently, the search and retrieval of document images the center of w, the coordinate system origin is established
has been used in a wide range of applications [1] such as: with X-axis parallel to the baseline of w and the width of w is
word spotting [2], [3], document similarity measurement [2], used to define the unit metric. Under this coordinate system,
document image retrieval using queries by examples [4], logo the coordinates of n most visible neighbors are invariant to
& symbol spotting [5], [6], [7], [8] and retrieving scanned similarity transformation.
documents in digital libraries [9]. In the digital age, the The "Layout Context" descriptor is robust against perspective
explosion of the number of portable digital imaging devices distortion, occlusion, uneven lighting and even crinkled pages.
has created a tremendous opportunity for camera-based The experiment results showed that the system is able to
document image retrieval applications. Users can access a identify even a small patch of the document image, captured
huge amount of content on the Internet and a big challenge is by a camera phone, in a known set of documents. A drawback
to propose some tools to link real documents to those captured of this system is that it is quite slow to find every candidate
with digital devices. For instance, some augmented reality page [11].
tools appear to propose similar contents (e.g. newspapers and
magazine articles) to the users by simply capturing an image
with their smartphones or cameras. In camera-based textual document image retrieval, the
Camera-based document image retrieval can be summarized method called Locally Likely Arrangement Hashing (LLAH)
as searching for the most relevant document images regarding is known as an efficient method with regard to accuracy, time
the user's query that is captured by a digital camera. This and scalability [12], [l 3], [14], [15], [16]. To compute this
task requires us to tackle many different problems [10]: feature, the centroid of words or letters connected components
(i) Images captured with cameras usually have a low are considered as keypoints, and feature vectors are built
resolution. from the local arrangement of invariants computed from
(ii) A camera has far less control of lighting conditions on an the keypoints. Some local invariants are computed (affine
object compared to a flatbed scanner, so uneven lighting is invariant or perspective transformation invariant) from k
due to both the physical environment and the response from coplanar points (e.g. k=4 or k=5) [12], [13]. What is more
the device. important is that the authors proposed an efficient hashing
(iii) Perspective distortion problems can occur as the capture technique, and LLAH has been shown superior to Geometric
device is not parallel to the imaging plane (and to the text). Hashing method concerning computational complexity [12],
(iv) Since digital devices are designed to operate over a [17].
variety of distances, focus becomes a significant factor. At LLAH was evaluated on a dataset of 10000 scientific paper
short distances and large apertures, even slight perspective documents. Query images were captured covering entire pages
changes can cause uneven focus. with a 6.3 megapixels digital camera (CANON EOS 300D).
(v) Lastly, the acquisition of images from a camera generally The results were impressive in terms of accuracy, time and
results in acquiring a subpart of the original image. The scalability [l 3].
retrieval process can be seen as a matching process between However, all the methods presented previously are not able
the digitized image and the original one. This then consists to deal with very small portions of documents captured

978-1-4799-1805-8/15/$31.00 ©2015 IEEE 601


2015 13th International Conference on Document Analysis and Recognition (ICDAR)

by camera, or with pages with insufficient text. In order When k 5, the invariant of perspective transformation
=

to improve these features, Takeda et al. [14] proposed an called cross-ratio from 5 points A,B, C,D,E was defined as
extension of the LLAH feature by adding some additional follows:
features which are based on the rank of k area ratios of the
extracted word regions. In another work, in order to consider S(A,B,C).S(A, D, E)
(2)
the case of the capture of a portion of a document, they also S(A,B,D).S(A,C, E)
proposed to improve the LLAH features by adding additional where S(A,B,C) is the area of a triangle with apexes A, B, C.
features based on rank of areas of words regions [16].
Similarly, Kise et al. [15] improved the LLAH feature by In order to reduce the sensibility of the system to keypoint
using the rank of k areas of letter regions and the query extraction errors, multiple LLAH vectors are computed for
expansion method in order to cope with small document each keypoint. As all the possible combinations of m points
portions captured by camera-pen system [15]. An example among n are examined, (';:)
LLAH vectors have to be built
of the rank of areas of k letter or word regions is an order from each keypoint. As a consequence, the more LLAH vectors
of k values which belong to 1 to k, this order being affine are built, the more processing time and memory consumption
invariant. For example, the order of ranking result with k = 6 the system is required. Thus, nand m need to be suitably
e.g. ( 6, 1, 2, 3, 5, 4 ) means that area of region 6 is the biggest set depending on each system. Experiments presented in [12]
one, area of region 1 is the second biggest one, similarly area showed that the highest accuracy was obtained with affine
of region 4 is the smallest one. invariant (n 8, m = 7), while cross-ratio was the second
=

highest (n 8, m
= 7) and affine invariant (n
= 7, m 6) = =

was the third.


In this paper, we propose a new feature called SRIF, which
has low dimension and that can succeed with less constraint
B. Indexing phase
points compared to LLAH. Moreover, our system works on
small portions of documents. We validated it on a real-time The most impressive contribution of LLAH is that feature
document retrieval system with a textual document dataset. vectors (called r) can be indexed and retrieved very quickly us­
Our main contributions are this efficient SRIF features for ing a hash table even if they are not stored in the hash table for
camera-based document retrieval and the dataset which was checking distances of nearest neighbors [13]. Furthermore, this
used for validation. We also propose a method that validates indexing scheme allows adding new documents into database
the correct region spotting by using RANSAC and clipping without rebuilding all the database structure of indexes (Fig. 1
the overlap region between the spotting result and the ground presents this system).
truth region.
The rest of this paper is organized as follows. In Section
o
II, we present in detail how LLAH works. Our method is
described in Section III. Section IV presents the experimental Document ID
results and the discussion. Finally, the conclusion and future
Index
work are given in Section V. Document ID

II. LLAH
List
Hash
In this section, we give an overview of LLAH. LLAH in­ Table
cludes three main steps which are feature extraction, indexing
phase and retrieval phase.
Fig. I. The hash table structure.

A. Feature extraction
These performances rely on the use of integer feature
LLAH feature extraction can be summarized as follows vectors r, that are discretized and normalized as follows [18]:
[13], [18]. LLAH considers centroid of each word connected
r(i) = trunc(r(i)) * 2 + round(r(i) - trunc(r(i))) (3)
component as keypoints, which can be obtained even under
perspective distortion, noise, and low resolution. A deep
And the Hash function is defined as follows [18]:
description on the method to obtain centroid of each word
connected component can be found in [13]. From each d-l
keypoint P, the n nearest neighbor points around keypoint Hindex =
(2: ri q i) mod Hsize (4)
P are selected and organized clockwise. Then, all possible i=O
combination of m points among n are examined ( m < n). where d is the number of dimensions of vector r, q is the
From one arrangement combination of m points, the LLAH level of quantization constant (e.g. q 17), Hsize is the size
=

vector r is calculated based on a sequence of affine invariants of hash table.


calculated from all possible combinations of k points among
m (k = 4 or k 5; k < m ) .
=
In order to add a new document into database, the system
When k 4, the affine invariant from 4 points A,B, C,D was
=
firstly extracts keypoints from centroids of word connected
defined as follows: components. Then for each keypoint, all LLAH vectors are
computed and indexed. As shown in Fig. 2, both indexing and
S(A,C,D) retrieval share the LLAH feature extraction and use the same
(1)
S(A,B,C) hash function (4).

602
2015 13th International Conference on Document Analysis and Recognition (ICDAR)

C. Retrieval phase o
o
o
o o
Starting from a query image captured with a camera, o

keypoints are firstly extracted like in the indexing phase. Then


for each keypoint, all LLAH vectors are computed and looked o o
o o o
up in the hash table (using hash function in 3) in order to get
One arrangement of m A combination of A combmatlon of A combination of
the list of document ID related to each keypoint (Fig. 1). For points (01=5) a round P 2 points around P 2 points around P 2 points around P
each document in the retrieval result list, the number of votes
for it in the voting table is incremented. Finally, the document
with the majority of votes is returned as the retrieval result
[13]. Fig. 4. The arrangement of m points (m=5) and the sequence of new invariants
(SRIF) calculated from all possible combinations of 2 points among m points.

Document image from database

D
Captured query image Similarly to LLAH, keypoints for SRIF are firstly extracted
from centroids of word connected components (we can defi­
D nitely employ centroids of letters as keypoints if needed). Next,
n
from each keypoint P, nearest neighbor points around P are
selected and organized clockwise (e.g. 6). After this, alln
m n
=

possible combination of points among are examined with


m n< (e.g. m
5 in Fig. 4). Then, from one combination of
m
=

Indexing I Retrieval
points, the SRIF vector r is calculated based on a sequence
Computation of indices ¢ of scale and rotation invariants calculated from all possible
combinations of 2 points (constrained to among points. P) m
i),
" "
Finally, each value of the SRIF vector, r ( is computed using

! t:�:=
Hash table either one of two invariant values: eij.Lmaxii or eij.Lminij
as presented in equation (5, 6)


IV. EXPERIMENTATION

A. Dataset and the ground truth generation


Fig. 2. The camera-based document retrieval system.

III. O UR MET HOD

Based on LLAH, we propose SRIF, which is a new method


for keypoints description based on their spatial organization.
Similarly to LLAH, feature vectors are extracted from each
keypoint. The indexing phase and the retrieval phase are also
similar to the LLAH method, as shown in Fig. 2. SRIF relies
on the idea of using pairs of nearest constraint points around
a keypoint (see Fig. 3). Let P
be a keypoint, i and P
two Pj
points coplanar with P. I pAl
and IPP� I denote the length
of the two vectors pA
and P�,respectively, and eij is
the angle between these two vectors. It is obvious that the
three values eij, Lminii min(ltAl/lt�l,IP�I/IPAI)
=

Fig. 5. Captured video from a document at four regions, the overlap between
and LmaXij =
max(IPAI/IPPjl,IPPjI/IPAI) are scale spotting region results and captured region from a query image.

invariant and rotation invariant [8].


To evaluate the performance of our feature and in order
to compare it to LLAH, we built a public dataset chosen
from Wikibooks, titled LaTex. This book contains 700 pages
which were converted into 300 dpi JPEG files for the indexing
process. In order to assess the spotting capacities of our system,
a ground truth was generated. To build the ground truth,
the JPEG images were printed on A4 papers. Each printed
Fig. 3. Constraint between two point around one keypoint P. document was divided into 4 regions - top left, top right,
bottom left and bottom right (see Fig. 5 for details) - and
Based on these scale and rotation invariant constraints one video was recorded at each region except blank regions.
between three points (as shown in Fig. 4), we propose two Document were captured without rotations. The IPEVO VZ-l
scale and rotation invariant ratios used for SRIF: HD document camera was used for recording the videos. It
was fixed at 8 cm above surface of the captured document.
eij.LmaXii (5)
The resolution of the captured images was 1024x768.
eij.Lminij (6) For each video, we selected the first 15 frames. To validate the

603
2015 13th International Conference on Document Analysis and Recognition (ICDAR)

rotation invariance, we also rotated each frame by an angle of C. Experimental results


0, 90, and 180 degrees. We choose two specific angle because
Let SRIFmax and SRIFmin denote the SRIF features using
it does not affect too much the keypoints which were extracted
by a connected component(CC) extraction algorithm and it can eij.LmaXij and eij.Lminij respectively; both of them were
tested. For LLAH we chose the best one that was correspond­
be saved the testing time. There were 1630 captured videos,
ing to affine invariant combined from 4 constraint points. Each
and the total number of queries in the ground truth is 73350.
method was tested with both case i.e. with adding additional
This dataset is made publicly available for academic research
features (using the rank of k areas of letter regions) and without
purposesl.
adding additional features. The experimental results are shown
in Tables I and II.
B. Experimental protocol and the evaluation measure
The Table I shows the result of frames testing. It can be seen
For each query image, in order to retrieve the correct
TABLE !. T HE EXPERIMENTAL RESULTS EVALUATED BY FRAMES
document from the database, it needs to be put throughout
the retrieval phase. After getting the voting result, the top­ Method Method with ad- n m Frames Avg.
Id ditional Retrieval Retrieval
t documents with largest number of votes are selected features Accuracy Time (s)
as candidate results. In order to check the correct I SRIFmax No 6 5 61% 0.5

matched results, RANSAC is applied to find the best 2 SRIFmax Yes 6 5 68% 0.4
3 SRIFmax No 7 6 84% 0.6
homography transformation between query's keypoints and 4 SRIFmax Yes 7 6 48% 0.4
the corresponding keypoints of each document in top-t result 5 SRIFmin No 6 5 - 0.6
documents. If no best transformation can be found, the 6 SRIFmin Yes 6 5 74% 0.37
7 SRlFmin No 7 6 44% 0.6
number of votes is set to zero. Lastly, the document with
8 SRIFmin Yes 7 6 83% 0.4
majority of votes in top-t result documents is returned as the 9 LLAH No 7 6 21% 0.8
result. A correct retrieval result is validated if it has a correct 10 LLAH Yes 7 6 20% 0.6

document Id in one hand, and if it corresponds to the correct


region of the document on the other hand. that SRIFmax (n 7, m= 6) without adding features gave
=

To validate the correct region, firstly RANSAC is applied so the best result in accuracy of retrieval (with 84%), the second
that we can obtain the spotting region of query image in the highest accuracy of retrieval was SRIFmin adding features
returned document through perspective transformation. Next, (n = 7, m 6). SRIFmin (n
= 6, m 5) did not work in the
= =

the overlap between the ground truth region (where query case without adding features, and SRIFmax (n 6, m 5) = =

image was captured) and the spotting region is calculated. with only 10 dimensions gave 61% of accuracy result. As we
The frame is considered as a correct retrieval result if the can see, LLAH got very low accuracy retrieval result with our
area of the overlap is more than 60 percent of the area of the dataset. The processing time of all methods was less than a
spotting region otherwise it is considered as an incorrect result. second per query. SRIF was a little faster than LLAH, and the
An example of the overlap region validation is shown in Fig. 5. processing time of SRIFmax (n 6, m 5) was the fastest
= =

one.
The Table II shows the result of videos testing. As we can
In order to compare SRIF and LLAH, we measured the
retrieval accuracy and the average retrieval time. Both methods TABLE I!. T HE EXPERIMENTAL RESULTS EVALUATED BY VIDEOS
were tested with two protocols respectively corresponding to
Videos Retrieval Accuracy
frames evaluation and video evaluation. With frames evalua­ Method Id Avg. Retrieval Time (s)
0 90 180 Avg
tion, the frames retrieval accuracy corresponds to the precision 1 62% 62% 61% 61.6% 0.54
2 71% 70% 67% 69.3% 0.60
measured by the ratio between the number of correct retrieval
3 86% 85% 84% 85.0% 0.42
frames in the total dataset (73350 images). With videos, we 4 50% 48% 45% 47.7% 0.43
evaluated the retrieval accuracy for each video called videos re­ 5 0.62

trieval accuracy. For this evaluation, 15 frames were extracted 6 76% 75% 73% 74.7% 0.48
7 44% 44% 43% 43.7% 0.75
from each video, and each frame was rotated by an angle of 0, 8 86% 84% 83% 84.0% 0.54
90, or 180 degrees before going to the retrieval phase. If there 9 20% 19% 17% 18.7% 0.85
are more than half of correct frames, video was considered as 10 19% 17% 16% 17.3% 0.67

successful. Otherwise video was considered as failed. Videos


retrieval accuracy is the ratio between the number of correct see that all SRIF methods got approximately the same frames
retrieval videos and the total of 1630 videos. testing results with retrieval accuracy and average retrieval
All tested methods shared the same keypoint extracting ap­ time. By contrast, LLAH got slightly lower results compared to
proach, i.e. the one based on the extraction of centroids of frames testing results. By applying a rotation transformation,
letter connected components because of sparse text of the videos rotated by angle of 90, or 180 degrees got a slightly
dataset. The parameters were set to Hsize 1017, t
= 5 = lower retrieval accuracy.
for selecting top-t of best candidate retrieval results, k 51 =

for lO-dimension vectors, k 17 for IS-dimension vectors,


=
D. Discussion
k = 7 for 21-dimension vectors in order to avoid collisions
in the hash table. We implemented our method and LLAH The retrieval accuracy shown in Tables I and II are not so
on a core i7 - 8 GB PC system running in C extended C++ high. The main reason is that texts in all document belonging
environment with only a single thread. the dataset are sparse, and there are some pages which contain
similar texts (e.g. the text demonstrating latex syntax). In
I It can be downloaded from http://navidomass. univ-lr.fr/ SRIFDataseti addition, the area of captured query is less than 1/8 of A4

604
2015 13th International Conference on Document Analysis and Recognition (ICDAR)

page. REF E R EN C E S
Another reason is that the dataset contains many pages with [1] M . B . Kokare and M . Shirdhonkar, "Document image retrieval: an
overview," Int. f. Computer Applications, vol. 1, no. 7, pp. 114-119,
2010.

[2] D. Niyogi and S. N. Srihari, "Use of document structure analysis to


retrieve information from documents in digital libraries," in Electronic
Imaging·97. International Society for Optics and Photonics, 1997, pp.
207-218.

[3] M. Rusiiiol and J. LJad6s, "Word and symbol spotting using spatial
organization of local descriptors," in Document Analysis Systems, 2008.
DAS'08. T he Eighth IAPR International Workshop on. IEEE, 2008,
pp. 489-496.
Fig. 6. Insufficient number of text and uneven focus query examples
[4] K. Kunze, K. Tanaka, M. Iwamura, and K. Kise, "Annotate me:
supporting active reading using real-time document image retrieval
on mobile devices," in Proceedings of the 2013 ACM conference on
Pervasive and ubiquitous computing adjunct publication. ACM, 2013,
pp. 231-234.

[5] G. Zhu and D. Doermann, "Automatic document logo detection,"


in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth
International Coriference on, vol. 2. IEEE, 2007, pp. 864-868.

[6] M. Rusinol, D. Karatzas, and J. Llad6s, "Spotting graphical symbols


Fig. 7. Matched keypoints of SRIF (on left) and LL AH (on right) in camera-acquired documents in real time," Proceedings of the Tenth
IAPR International Workshop on Graphics Recognition, GREC13, 2013.

insufficient number of text areas (Fig.6). Another reason is [7] V. P. Le, M. Visani, C. D. Tran, and J.-M. Ogier, "Improving logo
spotting and matching for document categorization by a post-filter based
uneven lighting and uneven focus problems caused by the
on homography," in Document Analysis and Recognition (ICDAR), 2013
camera. Lastly, by applying a rotation transformation, the 12th International Coriference on. IEEE, 2013, pp. 270-274.
retrieval accuracy slightly decreased approximately from 1% [8] S. Yang, "Symbol recognition via statistical integration of pixel-level
to 3%. This is mainly due to the fact that texts were changed constraint histograms: A new descriptor," IEEE transactions on pattern
by rotation transformation, which caused some keypoint ex­ analysis and machine intelligence, vol. 27, no. 2, pp. 278-281, 2005.

traction errors. [9] S. N. Srihari, S. Shetty, S. Chen, H. Srinivasan, C. Huang, G. Agam,


Without adding additional features, SRIFmax was better than and O. Frieder, "Document image retrieval using signatures as queries,"
in Document Image Analysis for Libraries, 2006. DIAL'06. Second
SRIFmin, and SRIFmin was better when adding additional fea­
International Conference on. IEEE, 2006, pp. 6-pp.
tures. This is because the SRIFmax vectors are more discrete
[10] J. Liang, D. Doermann, and H. Li, "Camera-based analysis of text and
than the SRIFmin vectors since SRIFmin was computed using documents: a survey," International Journal of Document Analysis and
Bij.Lminij (Lminij E [0,1 D. The reason why the retrieval Recognition (IJDAR), vol. 7, no. 2-3, pp. 84-104, 2005.
accuracy of SRIF is higher than LLAH is that for most of [II] X. Liu and D. Doermann, "Mobile retriever-finding document with a
the experimented cases, the number of matched keypoints of snapshot," in Int. Workshop on Camera-Based Document Analysis and
SRIF is more than those of LLAH (Fig.7). Recognition, 2007, pp. 29-34.

[12] T. Nakai, K. Kise, and M. Iwamura, "Use of affine invariants in locally


likely arrangement hashing for camera-based document image retrieval,"
in Document Analysis Systems VII. Springer, 2006, pp. 541-552.
V. CONCL USION
[13] --, "Camera based document image retrieval with more time and
memory efficient lIah," Proc. CBDAR, pp. 21-28, 2007.
We have presented the new features namely SRIF which is [14] K. Takeda, K. Kise, and M. Iwamura, "Real-time document image
based on LLAH. In addition, we built a dataset and the ground retrieval for a 10 million pages database with a memory efficient and
truth. The experimental results show that SRIF can correctly stability improved lIah," in 2011 International Conference on Document
deal with the context of documents containing small num­ Analysis and Recognition, Sep. 2011, pp. 1054-1058.

bers of texts; furthermore SRIF outperformed LLAH without [15] K. Kise, M. Chikano, K. Iwata, M. Iwamura, S. Uchida, and S. Omachi,
"Expansion of queries and databases for improving the retrieval accu­
adding any additional features from both the retrieval accuracy
racy of document portions: an application to a camera-pen system,"
point of view, and processing time point of view. in Proceedings of the 9th IAPR International Workshop on Document
In the future, we are going to evaluate our features on other Analysis Systems. ACM, 2010, pp. 309-316.
datasets comprising of multi-lingual document images. We [16] T. Nakai, K. Kise, and M. Iwamura, "Real-time retrieval for images
will also compare our features to other features such as SIFT, of documents in various languages using a web camera," in Document
SURF, ORB, and Shape Context in textual document images. Analysis and Recognition, 2009. ICDAR'09. 10th International Confer­
ence on. IEEE, 2009, pp. 146-150.
We are planning to apply SRIF for other systems that use
[17] H. J. Wolfson and I. Rigoutsos, "Geometric hashing: An overview,"
smartphones and/or wearable cameras.
Computing in Science and Engineering, vol. 4, no. 4, pp. 10-21, 1997.

[18] T. Nakai, K. Kise, and M. Iwamura, "Hashing with local combinations


of feature points and its application to camera-based document image
ACKNOWLED GMENT retrieval," Proc. CBDAR05, pp. 87-94, 2005.

This work has been partially supported by the LabEx


PERSYVAL-Lab (ANR-11-LABX-0025), by the CNRS PEPS
Project CartoDialect, and by the Program 165 of Vietnamese
government. The authors would like to thank Miss Marwa
Mansri who helped us to build the dataset and ground truth.

605

Vous aimerez peut-être aussi