
Recognition of Printed Urdu Script

U. Pal and Anirban Sarkar


Computer Vision and Pattern Recognition Unit
Indian Statistical Institute, 203 B. T. Road, Kolkata-108, India
Email: umapada@isical.ac.in

Abstract

This paper deals with an Optical Character Recognition (OCR) system for printed Urdu, a popular Indian script. Developing OCR for this script is difficult because (i) a large number of characters have to be recognized and (ii) there are many similar-shaped characters. In the proposed system, individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust. A prototype of the system has been tested on printed Urdu characters and currently achieves 97.8% character-level accuracy on average.

1. Introduction

The subject of optical character recognition has received considerable attention in recent years. Several methods for recognition of Latin, Chinese, and Arabic scripts have been proposed [1,4,7,8]. Among Indian scripts, some pioneering work has been done on Bangla [2], Devnagari [10,12] and Oriya [3] scripts, and OCR systems for these scripts are ready for commercialization. Some studies have also been reported on Tamil [4], Telugu [9] and Gurmukhi [5] scripts. However, to the best of our knowledge, no work has been done on Urdu script. In this paper, we are concerned with the recognition of printed Urdu script.

In the proposed system, the document image is captured using a flatbed scanner and passed through skew correction, line segmentation and character segmentation modules. These modules have been developed by combining conventional and newly proposed techniques. Next, individual characters are recognized by a combination of topological features, contour based features and features obtained from the concept of a water reservoir.

2. Properties of Urdu script

In India there are twelve scripts, and Urdu is one of the popular ones. Here we describe some properties of the Urdu script that are useful for building the OCR system. The modern Urdu alphabet consists of 39 basic characters. These characters are shown in Fig.1(a). Urdu has 10 numerals, which are shown in Fig.1(b). As in other Indian scripts, two or more characters in Urdu may combine to create a complex shape called a compound character. Examples of some compound characters are shown in Fig.2. Also, the basic shape of a character may change depending on its position (first, middle or last) in a word. For example, Fig.3 shows an Urdu basic character in its isolated form and its shapes in the first, middle and last positions of a word. As a result, the total number of characters to be recognized is very large. Thus, OCR development for Urdu is more difficult than for any European language script, which has a smaller number of characters.

[Figure 1 not reproduced; in panel (a) the basic characters are numbered 1 to 39.]
Fig.1. Examples of Urdu alphabet and numerals: (a) basic characters of the Urdu alphabet, (b) Urdu numerals.

Urdu script has some characteristics that differ from other Indian scripts. Urdu is written from right to left, whereas other Indian scripts are written from left to right. It can be noted that an Urdu basic character may have four components (see character numbers 6, 8, 17, 19, etc. of Fig.1(a)), while in other Indian scripts this property is rare. There is a structural similarity between Urdu and Arabic script. There are different types of Urdu script, such as Naskh, Nastaliq, Aswad, Batool, Jaben, etc. We consider the Naskh and Nastaliq types here.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03)
0-7695-1960-1/03 $17.00 © 2003 IEEE
Fig.2. Some examples of Urdu compound characters.

[Figure 3 not reproduced; its panels are labelled Isolated, First, Middle and Last.]
Fig.3. An isolated basic character and its shapes in the first, middle and last positions in a word.

3. Water Reservoir Principle

The water reservoir principle is as follows. If water is poured from one side of a component, the cavity regions of the component where water is stored are considered as reservoirs [11]. By top (bottom) reservoirs we mean the reservoirs obtained when water is poured from the top (bottom) of the component. (A bottom reservoir of a component can be visualized as a top reservoir when water is poured from the top after rotating the component by 180°.) Similarly, if water is poured from the left (right) side of the component, the cavity regions where water is stored are considered as left (right) reservoirs. For an illustration see Fig.4, which shows the top, bottom, left and right reservoirs of some Urdu characters. The water flow direction from a full reservoir is also shown in this figure.

[Figure 4 not reproduced; its panels are labelled Reservoir from top, Reservoir from left, Reservoir from bottom and Reservoir from right, with the water flow direction marked.]
Fig.4. Different reservoirs and their water flow directions in four characters. Water flow directions are shown by dotted arrows.

Not all reservoirs obtained from a given direction of a component are considered for further processing. Only reservoirs whose height is greater than a threshold T1 are considered. The value of T1 is chosen as 2/5 of the corresponding component height. This threshold value was obtained experimentally.

4. Skew detection and correction

The digitised text images are first converted into two-tone images using a histogram based thresholding approach. Here we represent object pixels by 1 and background pixels by 0. The two-tone image generally shows protrusions and dents in the characters as well as isolated object pixels over the background, which are cleaned by a logical smoothing approach [2].

Casual use of the scanner may lead to skew in the document image. The skew angle is the angle that the text lines of the document image make with the horizontal direction. Skew correction can be achieved by (i) estimating the skew angle, and (ii) rotating the image by the skew angle in the opposite direction.

Fig.5. (a) Example of a skewed Urdu text. (b) Candidate points for the Hough transform.

In this work, we use a Hough transform based technique for skew angle estimation. To reduce the amount of data to be processed by the Hough transform, we compute candidate points from selected components of the image. For component selection, the mean width b_m of the bounding boxes of the connected components is computed, and components whose bounding box width is greater than 0.5 × b_m are selected. Thus, small and irrelevant components like dots, punctuation marks, small modifiers, etc. are mostly filtered out of the skew estimation process.

Let I be the image containing only the selected components; B the set of lowermost points of the top reservoirs obtained from the selected components of I; I′ the image I rotated anti-clockwise by 90°; and B′ the set of lowermost points of the top reservoirs of those components of I′ for which no top reservoir was obtained before rotation. Let L = B ∪ B′. Then L is the set of candidate points for the Hough transform. Candidate points are chosen in such a way that they lie, more or less, on parallel straight lines, and hence they are good representatives for the Hough transform. A skewed Urdu text is shown in Fig.5(a) and its candidate points for the Hough transform are shown in Fig.5(b). From Fig.5(b) it can be seen that most of the candidate points of a

text line lie on a straight line. For skew angle detection, the usual Hough transform is applied to these candidate points. After skew angle detection, the image is rotated according to the detected skew angle. It has been noted that font style and size variations do not affect this skew estimation method. Also, the proposed method can handle documents with skew angles between +45° and -45°.

From our experiments it is observed that in about 97.4% of the cases our method computes the skew angle within a tolerance of ±0.5 degree.

5. Line and character segmentation

The proposed OCR system automatically detects individual text lines and then segments the characters in each line. We do not segment words from a line for the purpose of recognition. The lines of a text block are segmented by finding the valleys of the projection profile computed by counting the number of black pixels in each row. The trough between two consecutive peaks in this profile denotes the boundary between two text lines. A text line can be found between two consecutive boundary lines. An Urdu text with its projection profile is shown in Fig.6. Line segmentations are shown by dotted lines in this figure.

Fig.6. Horizontal projection profile of an Urdu text and its line segmentations.

Character segmentation is done by a combination of component labeling and vertical projection profile methods. A text line is scanned vertically. If two or fewer object pixels are encountered in one vertical scan, the scan is denoted by 0; otherwise the scan is denoted by the number of object pixels in that column. In this way a vertical scanning histogram is constructed. If the histogram contains a run of at least K1 consecutive 0's, the midpoint of that run is considered as the boundary of a character. The value of K1 is determined experimentally. Sometimes, because of the kerned behavior [6] of Urdu script (kerned characters are characters that overlap with neighbouring characters), some characters of a line may not be segmented properly by the projection profile method. To take care of such cases we apply a component labeling approach along with the projection profile method. If the distance between two consecutive character boundaries is large, we suspect a mis-segmentation at this position and use component labeling for further segmentation. During component labeling we check the vertical overlapping of the components. If two or more components are fully vertically overlapped, we assume that they are different parts of one character. If the horizontal overlapping between the bounding boxes of two consecutive components is less than 35%, we treat these two components as parts of two different characters.

6. Feature selection

For the initial classification of characters, we consider topological features, contour features, and features obtained from the concept of water reservoirs. The topological features used include the existence of holes (loops) and their number, the position of holes with respect to the character bounding box, the ratio of hole height to character height, the number of different components in a character, etc. Contour features include characteristics of different profiles obtained from a portion of the character's contour. The main water reservoir based features used in the recognition scheme are (a) the number of reservoirs from different sides of a component, (b) the position of a reservoir with respect to its character bounding box, (c) the height of a reservoir, (d) the water flow level of a reservoir, (e) the direction of water overflow, (f) the ratio of reservoir height to component height, etc. These features are used to design a tree classifier in which the decision at each node is taken on the basis of the presence or absence of a particular feature.

The features considered here are simple and easy to detect. They are fairly robust to noise and stable with respect to font and style variations. For the detection of holes a background component labelling technique is used. The width and height of these holes are also measured in units of text line height.

7. Character recognition

In the present paper we recognize the basic characters and numerals of Urdu script. We do not consider the recognition of compound characters here.

The characters are recognized in two stages. In the first stage, the characters are grouped into a few subsets by a feature based tree classifier. In the second stage, we use more sophisticated features to distinguish the similar characters that belong to the same leaf node of the classification tree. The design of a tree classifier has three components: (1) a tree skeleton or hierarchical ordering of the class labels, (2) the choice of features at each non-terminal node, and (3) the decision rule at each non-terminal node. Our tree is a binary tree, in which each non-terminal node has two descendants. While traversing the tree, only one feature is tested at each non-terminal node. To choose a feature at a particular non-terminal node, we consider two points: (a) the feature should be robust and (b) the feature should maximally divide the characters in the node into two groups, so as to obtain an optimal tree. For a given non-terminal node, we

select a feature that best separates the group of patterns in the above sense. A part of the tree classifier used for the recognition of basic characters and numerals is shown in Fig.7. The features used in the tree classifier are mainly based on the concept of water reservoirs. For example, in some nodes we check whether any reservoir exists from the top of a character. Similarly, reservoirs from the bottom and left are checked in some nodes to classify some characters. The use of other reservoir based features can be seen from the classification tree. From the tree it may be noted that some characters belong to both sub-groups of a node. This is done to reduce the errors that may occur because of poor document quality.

Fig.7. A portion of the tree classifier for Urdu character recognition. The numbers in the tree refer to the character numbering of Fig.1.

Most leaf nodes of the tree point to a single character. Some leaf nodes of the classification tree contain two similar-shaped (confusion) characters. For example, Fig.8 shows the character pairs of four different leaf nodes. Here we discuss the techniques used to classify the characters of some of these leaf nodes.

Fig.8. Examples of similar-shaped characters in four leaf nodes of the classification tree, panels (a) to (d).

Fig.9. Recognition techniques for similar-shaped characters, panels (a) to (d).

Consider, for example, the two characters of Fig.8(a), which belong to the same node of the tree. These two characters reach the same node because the water flow level of the reservoir of both characters coincides with the top of the character's bounding box. To separate them, we divide each character into two parts at the lowermost point of its reservoir. The segmented parts are shown in Fig.9(a). For the character marked as N2 in Fig.8(a), the right segmented part is smaller than the left segmented part, whereas for the character marked as N6 in Fig.8(a) the right segmented part is bigger than the left segmented part.

To separate the two characters shown in Fig.8(b), we note whether a component lies inside the bounding box of the other component in the character. For example, see Fig.9(b). For the character marked as 8 in Fig.8(b), a smaller component lies fully inside the bounding box of the biggest component, while this is not true for the character marked as 3 in Fig.8(b).

To separate the two characters of Fig.8(c), we use the water flow direction of the reservoir. The two characters have different water flow directions: it is left for the character marked as 18 in Fig.8(c), whereas it is right for the other character. The water flow directions of these two characters are shown in Fig.9(c).

The characters shown in Fig.8(d) are very similar. The only difference is a small extra cavity in the character marked as 38 in Fig.8(d). To separate them we trace some of their border pixels. For an illustration see Fig.9(d). Starting from the topmost right pixel of the character, border pixels are traced in the clockwise direction until the lowermost point of the character is reached. During border tracing we calculate the distance of each traced pixel from the right side of the character's bounding box. The characters are separated on the basis of this distance sequence. For example, for the character marked as 38 we get at least one transition point in the distance sequence, while for the character marked as 33 we do not get any transition point. By transition we mean a change of the distance values from increasing to decreasing or vice versa. This technique has also been used for the recognition of some other similar characters.

Because of the page limitation of this conference, we do not discuss the recognition strategies for the characters of other nodes. For that purpose we used features like the number of holes and their positions, the number of horizontal crossings in a particular portion of a character, reservoir heights, the position of a reservoir with respect to its character bounding box, the ratio of reservoir height to component height, etc.
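The transition test used for the character pair of Fig.8(d) can be illustrated with a minimal sketch. It assumes the border tracing has already produced the sequence of distances from the right side of the bounding box; the function names and the handling of equal consecutive distances (they are skipped) are assumptions made for illustration, not details given in the paper.

```python
from typing import Sequence


def count_transitions(distances: Sequence[int]) -> int:
    """Count changes between increasing and decreasing runs of distances."""
    transitions = 0
    previous_trend = 0  # +1 increasing, -1 decreasing, 0 not yet known
    for prev, cur in zip(distances, distances[1:]):
        if cur == prev:
            continue  # assumption: a plateau does not change the trend
        trend = 1 if cur > prev else -1
        if previous_trend != 0 and trend != previous_trend:
            transitions += 1
        previous_trend = trend
    return transitions


def has_transition(distances: Sequence[int]) -> bool:
    """True when the sequence has at least one transition point."""
    return count_transitions(distances) >= 1
```

Under these assumptions, a border whose distance only decreases gives no transition point (the behaviour described for the character marked as 33), while a border with a small cavity gives at least one (the behaviour described for the character marked as 38).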

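The first recognition stage described earlier in this section, a binary tree in which one feature is tested at each non-terminal node, can be sketched as follows. This is only an illustrative structure: the feature names (`has_top_reservoir`, `has_hole`), the labels A to D and the shape of the example tree are hypothetical and are not taken from Fig.7.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Features = Dict[str, bool]


@dataclass
class Node:
    feature: Optional[str] = None       # feature tested at a non-terminal node
    present: Optional["Node"] = None    # branch taken when the feature is present
    absent: Optional["Node"] = None     # branch taken when the feature is absent
    labels: Optional[List[str]] = None  # set only at leaf nodes


def classify(node: Node, features: Features) -> List[str]:
    """Walk the tree, testing one feature per non-terminal node."""
    while node.labels is None:
        node = node.present if features.get(node.feature, False) else node.absent
    return node.labels


# Hypothetical tree: label "C" is placed in both sub-groups of the root,
# mirroring the paper's remark that some characters belong to both
# sub-groups of a node to tolerate poor document quality.
EXAMPLE_TREE = Node(
    feature="has_top_reservoir",
    present=Node(
        feature="has_hole",
        present=Node(labels=["A"]),
        absent=Node(labels=["B", "C"]),
    ),
    absent=Node(labels=["C", "D"]),
)
```

A leaf that returns more than one label corresponds to a group of confusion characters, which would then be passed to a second-stage test such as those described for Fig.8.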
8. Results and discussion

The proposed OCR system was tested on a variety of printed Urdu documents. Some of the documents contain good-quality printing on clean paper, while others are of inferior printing and paper quality (e.g. a cheap alphabet book for children). In this section, we summarize the results of our experiments.

Line and character segmentation: Our system identifies individual text lines with an accuracy of 98.3%. Occasionally, when two adjacent text lines are close to each other, the lower part of the upper line has some overlap with the upper part of the lower line. In such situations, there is no clear valley between the two lines in the projection profile, and our system fails to detect the boundary between them. All such errors were confined to the inferior-quality documents.

The character segmentation accuracy of the system is 96.9%. Most segmentation errors were caused by touching and compound characters. Some errors were also caused by overlapping characters.

Character recognition: We tested our recognition system on 3050 characters and found that the system recognizes basic characters and numerals with an accuracy of about 97.8%. The recognition errors can be grouped into two classes:
(a) Segmentation errors: Incorrectly segmented characters are obviously recognized incorrectly. These errors contribute 0.7% to the total error rate.
(b) Tree classification errors: Though water reservoir based features are powerful and insensitive to font size and style variations, topological features may not always correctly distinguish two similar-shaped characters. Thus, a character may sometimes be mis-recognized as another character contained in the same leaf node of the classification tree. This class of errors contributes 1.1% to the overall error rate.
The remaining errors mainly occur during the recognition of confusion characters in leaf nodes.

Conclusion:
This paper describes a system for OCR of printed Urdu script. The recognition accuracy of our prototype is promising, but more work is needed. Our character segmentation method should be improved to handle a larger variety of characters that occur often in images obtained from inferior-quality documents. We also need to recognize compound characters to make it a complete OCR system. In general, the system needs to be tested and fine-tuned on a wider variety of images containing characters in diverse fonts and sizes.

References:
[1] A. Amin, "Off-line Arabic character recognition: The state of the art", Pattern Recognition, vol.31, pp.517-530, 1998.
[2] B. B. Chaudhuri and U. Pal, "A complete printed Bangla OCR system", Pattern Recognition, vol.31, pp.531-549, 1998.
[3] B. B. Chaudhuri, U. Pal and M. Mitra, "Automatic Recognition of Printed Oriya Script", Sadhana, vol.27, part 1, pp.23-34, 2002.
[4] V. K. Govindan and A. P. Shivaprasad, "Character recognition - a survey", Pattern Recognition, vol.23, pp.671-683, 1990.
[5] G. Lehal and C. Singh, "A Gurumukhi script recognition system", In Proc. 15th ICPR, vol.2, pp.557-560, 2000.
[6] Y. Lu, "Machine printed character segmentation - an overview", Pattern Recognition, vol.28, pp.67-80, 1995.
[7] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development", Proceedings of the IEEE, vol.80, pp.1029-1058, 1992.
[8] G. Nagy, "Chinese character recognition - A twenty five years retrospective", In Proc. ICPR, pp.109-114, 1988.
[9] A. Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu", In Proc. 6th ICDAR, pp.1110-1114, 2001.
[10] U. Pal and B. B. Chaudhuri, "Printed Devnagari script OCR system", Vivek, vol.10, pp.12-24, 1997.
[11] U. Pal, A. Belaïd and Ch. Choisy, "Touching numeral segmentation using water reservoir concept", Pattern Recognition Letters, vol.24, pp.261-272, 2003.
[12] R. M. K. Sinha, "Rule based contextual post processing for Devnagari text recognition", Pattern Recognition, vol.20, pp.475-485, 1995.

