
Text-Tracking Wearable Camera System for Visually-Impaired People

Makoto Tanaka
Graduate School of Information Sciences, Tohoku University, Japan
mc41229n @ sc.isc.tohoku.ac.jp

Hideaki Goto
Cyberscience Center, Tohoku University, Japan
hgot @ isc.tohoku.ac.jp

Abstract

Disability of visual text reading has a huge impact on the quality of life of visually impaired people. One of the most anticipated devices is a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille. In order to develop such a device, text tracking in video sequences is required as well as text detection. Homogeneous text regions need to be grouped together to avoid multiple, redundant speech syntheses or braille conversions.

We have developed a prototype system equipped with a head-mounted video camera. Text regions are extracted from the video frames using a revised DCT feature. Particle filtering is employed for fast and robust text tracking. We have tested the performance of our system using 1,000 video frames of a hallway containing eight signboards. The number of text candidate images is reduced to 0.98%.

1. Introduction

We human beings make the most of the text information in surrounding scenes in our daily lives. Disability of visual text reading has a huge impact on the quality of life of visually impaired people. Although several devices have been designed to help visually impaired people "see" objects using an alternative sense such as sound or touch, the development of text reading devices is still at an early stage. Character recognition for the visually impaired is one of the most difficult tasks, since characters have complex shapes and are very small compared with physical obstacles. One of the most anticipated devices is probably a wearable camera capable of finding text regions in natural scenes and translating the text into another representation such as sound or braille.

An alternative device is a helper robot that can recognize characters in human living space and read the text information out for the user. Some robots with character recognition capability have been proposed so far [2, 3, 4, 6]. Iwatsuka et al. proposed a guide dog system for blind people [2]. We presented a text capturing robot equipped with an active camera [8]; the robot can find and track multiple text regions in the surrounding scene. In these robot applications, camera movement is constrained by the simple and steady movements of the robot. A wearable camera, on the other hand, can be moved freely. Thus, a robust text tracking method is needed to develop a wearable text capturing device.

Duplicate text strings may appear in consecutive video frames. Recognizing all the text strings in every frame is a waste of time. More importantly, the camera user would not want to repeatedly hear a synthesized voice originating from the same text. Merino and Mirmehdi presented a framework for realtime text detection and tracking and demonstrated a system [5]. However, its text tracking performance was not fully satisfactory, and many improvements are still needed.

In this paper, we present a wearable camera system that can automatically find and track text regions in the surrounding scene. The system is equipped with a head-mounted video camera. Text strings are extracted using the revised DCT-based method [1]. The text regions are then grouped into image chains by a text tracking method based on particle filtering.

In Section 2, we present an overview of the wearable camera system and the algorithms used. In Section 3, the text tracking algorithm is given. Section 4 describes experimental results and performance evaluations.



2. Wearable Camera System

2.1. Overview of the system

[Figure 1. Wearable Camera System.]

Figure 1 shows the prototype of the wearable camera system which we have constructed. The system consists of a head-mounted camera (380k-pixel color CCD), an NTSC-DV converter, and a laptop PC running Linux. The NTSC video signal is converted to a DV stream, and the video frames are captured at 352 × 224-pixel resolution.

The text capture and tracking method proposed in this paper consists of the following stages (a minimal code sketch of the overall loop follows the list).
1. Partition each frame into 16 × 16-pixel blocks.
2. Extract text blocks using the revised DCT feature.
3. Generate text candidate regions by merging adjoining text blocks.
4. Create chains (groups) of text regions by finding the correspondences of text regions between the (n−1)th frame and the nth frame using particle filtering.
5. Filter out non-text regions in the chains.
6. Select the best text image in each chain.
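To make the data flow concrete, the following minimal Python sketch shows one way the six stages could be organized as a per-frame loop. Every helper passed in (detect_blocks, merge_blocks, match_regions, is_text, best_image) is a hypothetical placeholder for the components described in Sections 2.2 to 3.2, not the actual implementation; the chain-validity rule follows Section 2.3.

    def process_video(frames, detect_blocks, merge_blocks, match_regions,
                      is_text, best_image):
        """Hypothetical per-frame loop tying the six stages together."""
        chains = {}           # chain label -> list of region images
        results = []          # one representative image per valid chain
        prev_regions = []     # (label, region) pairs from the previous frame
        next_label = 0

        def close(label):
            chain = chains.pop(label)
            if len(chain) >= 3:                   # chains shorter than 3 frames are noise
                chain = [r for r in chain if is_text(r)]       # stage 5
                if chain:
                    results.append(best_image(chain))          # stage 6

        for frame in frames:
            text_blocks = detect_blocks(frame)    # stages 1-2: 16x16 DCT blocks
            regions = merge_blocks(text_blocks)   # stage 3: candidate regions
            labeled = match_regions(prev_regions, regions)     # stage 4: tracking
            current = []
            for label, region in labeled:
                if label is None:                 # no correspondence: start a new chain
                    label, next_label = next_label, next_label + 1
                    chains[label] = []
                chains[label].append(region)
                current.append((label, region))
            alive = {l for l, _ in current}
            for label in [l for l in chains if l not in alive]:
                close(label)                      # chain broken in this frame
            prev_regions = current
        for label in list(chains):                # flush chains alive at stream end
            close(label)
        return results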
2.2. Text region detection

We have employed the revised DCT-based feature proposed in [1] for text extraction from scene images. The high-wide (HWide) frequency band of the DCT coefficient matrix is used, together with discriminant analysis-based thresholding. The blocks in the image are classified into two classes: "text blocks," whose feature values are greater than the automatically found threshold, and "non-text blocks."

After extracting the text blocks, connected components of the blocks are generated. The bounding box of each connected component is regarded as a text region. Text regions whose area is smaller than four blocks are discarded, because they are considered too small to hold legible text strings.
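For illustration, here is a minimal sketch of this block classification, assuming a grayscale input frame. The block size follows the stage list above, but the exact HWide band is defined in [1]; the coefficient slice below is only an approximation of it.

    import numpy as np
    from scipy.fftpack import dct

    def block_features(gray, bs=16, band=6):
        """Per-block feature: energy in a high-horizontal-frequency slice
        of the 2D DCT coefficient matrix (approximating the HWide band of [1])."""
        h, w = gray.shape
        feats = np.zeros((h // bs, w // bs))
        for by in range(h // bs):
            for bx in range(w // bs):
                block = gray[by*bs:(by+1)*bs, bx*bs:(bx+1)*bs].astype(float)
                coef = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
                feats[by, bx] = np.abs(coef[:band, band:]).sum()
        return feats

    def otsu_threshold(values, bins=256):
        """Discriminant analysis (Otsu) threshold over the feature values."""
        hist, edges = np.histogram(values, bins=bins)
        p = hist / hist.sum()
        centers = (edges[:-1] + edges[1:]) / 2
        w0 = np.cumsum(p)                 # class-0 probability up to each split
        m0 = np.cumsum(p * centers)       # class-0 cumulative mean
        mt = m0[-1]                       # global mean
        with np.errstate(divide='ignore', invalid='ignore'):
            sigma_b = (mt * w0 - m0) ** 2 / (w0 * (1 - w0))
        return centers[np.nanargmax(sigma_b)]

    # Usage: feats = block_features(gray_frame)
    #        text_mask = feats > otsu_threshold(feats.ravel())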
2.3. Text tracking

The text tracking process corresponds to the fourth stage; the details are given in Section 3.

A new chain is created when a text candidate region without any correspondence appears in the current frame, and a new label is assigned to the region/chain. When a chain of text candidate regions is broken in a frame, the chain is extracted and its validity is examined. If the chain is shorter than 3 frames, it is regarded as noise and discarded.

2.4. Selecting the most appropriate text images

All the text candidate regions belonging to the chains are examined, and the images that seem to be non-text are removed using the edge counts method in [7].

It is necessary to pick the most appropriate text images for character recognition. One text image is selected per chain and fed to the next stage, which would most likely be character recognition. As the most appropriate image, the region whose horizontal length is the longest is chosen, as shown in Figure 2. This selection scheme makes sense in many cases, since the user should approach a signboard closely and as fronto-parallel as possible.

[Figure 2. Selecting the most appropriate image.]

Our current system is not equipped with a character recognition process, since we are concentrating only on text detection and tracking.
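A minimal sketch of this selection rule, assuming each chain is stored as a list of image arrays (height × width × channels, a hypothetical data layout):

    def best_text_image(chain):
        """Return the horizontally longest region image in the chain."""
        return max(chain, key=lambda img: img.shape[1])   # shape[1] = width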
3. Text Tracking Using Particle Filtering

3.1. Region-based matching

We have employed particle filtering for the tracking of text candidate regions. Particle filtering has been used successfully in various object tracking applications. We have tested two matching schemes for the text tracking.

The first method is "region-based matching." The particles are scattered from the center of each text candidate region in the previous frame around the predicted center point of the text region in the current frame, as shown in Figure 3. Particles that fall outside all of the detected text regions have their weight set to zero. If a particle falls into a detected text region, its weight is set to the similarity value between the previous and the current text regions. All the particles carry the same label as that of the source text region.

[Figure 3. Region-based text tracking.]

The similarity s_{1,2} between regions 1 and 2 is defined as

    s_{1,2} = \frac{1}{d_{1,2} + \varepsilon},    (1)

where d_{1,2} is the distance defined below and \varepsilon is a small value (\varepsilon = 1) that prevents the similarity from diverging as d_{1,2} approaches zero.

The cumulative histogram is used as the measure of dissimilarity between text regions, since it can represent color distribution at a small computational cost. The cumulative histogram H(z) is given as

    H(z) = \sum_{i=0}^{z} h(i),    (2)

where h(i) denotes the normal histogram and i denotes the intensity. For comparing two cumulative histograms H_{1,c}(z) and H_{2,c}(z), where c denotes one of the RGB color channels, the following city block distance is used:

    d_{1,2} = \sum_{c} \sum_{z=0}^{255} |H_{1,c}(z) - H_{2,c}(z)|.    (3)

The weighted centers of the particles are then found in the current frame. The label of a current text region is determined by finding the weighted center nearest to the center of that region.
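As a concrete reading of Eqs. (1) to (3), the following sketch computes the similarity between two 8-bit RGB region images. Normalizing h(i) so that each histogram sums to one is our assumption; the equations above do not fix the scale.

    import numpy as np

    def cumulative_hist(region):
        """Per-channel cumulative color histograms, Eq. (2). region: HxWx3 uint8."""
        hists = []
        for c in range(3):
            h, _ = np.histogram(region[..., c], bins=256, range=(0, 256))
            h = h / h.sum()                    # normalized histogram h(i)
            hists.append(np.cumsum(h))         # H(z) = sum of h(i) for i <= z
        return np.stack(hists)                 # shape (3, 256)

    def similarity(r1, r2, eps=1.0):
        """s = 1/(d + eps), with the city block distance d of Eq. (3)."""
        d = np.abs(cumulative_hist(r1) - cumulative_hist(r2)).sum()
        return 1.0 / (d + eps)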
3.2. Block-based matching

The second method is "block-based matching," which is similar to the one used in our previous work [8]. The same particle filtering algorithm is applied to text blocks instead of text candidate regions. A chain label is assigned to each text block in the current frame. The region label is then determined by voting, as shown in Figure 4: the most popular label among the region's blocks is used.

[Figure 4. Label selection by voting in block-based text tracking.]

When more than one region in the current frame corresponds to a text region in the previous frame, the distance between every pair of the current regions is measured. If the side-to-side distance is lower than a pre-determined threshold, or the regions overlap each other, they are merged together. Otherwise, a new label is assigned to the region that is farther from the corresponding region in the previous frame. This region merging process has been introduced to make the text tracking more robust to temporary breakups of multiple text lines.
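The voting step can be expressed compactly. A minimal sketch, assuming each region carries the list of chain labels assigned to its blocks (a hypothetical data layout):

    from collections import Counter

    def vote_region_label(block_labels):
        """Return the most popular chain label among a region's blocks,
        or None if no block carries a label."""
        votes = Counter(label for label in block_labels if label is not None)
        return votes.most_common(1)[0][0] if votes else None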
4. Experimental Results and Discussions

4.1. Experimental setup

Experimental images were taken in the hallway of our building, where eight signboards containing text exist on the walls. A sighted person wore the camera, walked along the hallway, and approached every signboard; 1,000 video frames were obtained in total. All the signboards in the scene were approached closely enough for text detection.

The variance of the particles' random walk is 13.0, and the number of particles per region (or block) is 500. Only the velocity is taken into account in the particle filtering.

4.2. Evaluations and discussions

5,192 text candidate regions were found using the DCT feature. After the text region tracking, the total number of text images to be passed to the character recognition stage was significantly reduced, to 51. Thus, the number has been cut down to 0.98%. Table 1 shows the numbers of extracted text images and the average processing times per frame. We have also tested our former text tracking method [8] for comparison purposes.

The performance of the region-based tracking is almost the same as that of the conventional method. The particle filtering-based method with block-based matching outperforms the others. Our text tracking program (single-threaded) runs on a laptop PC with a Core2Duo 1.06 GHz processor. The processing speed without DV decoding is 10.0 fps (100 msec per frame), which is near-realtime.

Some text tracking results are shown in Figure 5. The text regions are successfully detected and tracked.
[Figure 5. Results of text region tracking.]

Table 1. Numbers of extracted text images and average processing times per frame.

                        Filtered                 No filter
                        images   time (msec)     images   time (msec)
    Conventional [8]      113        56            185        56
    Region-based          111        56            176        56
    Block-based            51       100             80       100
The text region filter based on the edge counts works quite effectively; the images of the upper door edge are discarded by the filtering. Some examples of automatically selected text images are shown in Figure 6. All the signboards have been successfully captured.

[Figure 6. Extracted text images.]

Although the proposed method performs the best so far, a number of duplicate and non-text images can still be seen. The value 51 is about 6.4 times 8, the ideal number of text images. Many splits of text region chains occur where multiple text lines are close to each other. Further improvements will be needed. In addition, combining our method with a super-resolution technique would be an interesting topic.
5. Conclusions

We have presented a wearable camera system that automatically finds and tracks text regions in surrounding scenes in near-realtime. The proposed text tracking method is based on particle filtering and can effectively reduce the number of text images to be recognized. In the experiment in an indoor scene, our system reduced the number of text candidate images to 0.98%.

Making the text detection and tracking more robust to quick camera movements, and developing a filtering method to deal with blurred or corrupted frames, are part of our future work.

References

[1] H. Goto. Redefining the DCT-based feature for scene text detection — Analysis and comparison of spatial frequency-based features. International Journal on Document Analysis and Recognition (IJDAR), 2008. (Online First article).
[2] K. Iwatsuka, K. Yamamoto, and K. Kato. Development of a guide dog system for the blind people with character recognition ability. Proceedings 17th International Conference on Pattern Recognition, pages 683-686, 2004.
[3] D. Létourneau, F. Michaud, and J.-M. Valin. Autonomous Mobile Robot That Can Read. EURASIP Journal on Applied Signal Processing, 17:2650-2662, 2004.
[4] D. Létourneau, F. Michaud, J.-M. Valin, and C. Proulx. Textual message read by a mobile robot. Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2724-2729, 2003.
[5] C. Merino and M. Mirmehdi. A Framework Towards Realtime Detection and Tracking of Text. Second International Workshop on Camera-Based Document Analysis and Recognition (CBDAR2007), pages 10-17, 2007.
[6] J. Samarabandu and X. Liu. An Edge-based Text Region Extraction Algorithm for Indoor Mobile Robot Navigation. International Journal of Signal Processing, 3(4):273-280, 2006.
[7] H. Shiratori, H. Goto, and H. Kobayashi. An Efficient Text Capture Method for Moving Robots Using DCT Feature and Text Tracking. Proceedings 18th International Conference on Pattern Recognition (ICPR2006), 2:1050-1053, 2006.
[8] M. Tanaka and H. Goto. Autonomous Text Capturing Robot Using Improved DCT Feature and Text Tracking. 9th International Conference on Document Analysis and Recognition (ICDAR2007), Volume II, pages 1178-1182, 2007.
