Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03)
0-7695-1960-1/03 $17.00 © 2003 IEEE
4. Skew detection and correction
The digitised text images are first converted into two-
tone images using a histogram based thresholding
approach. Here we represent object pixels by 1 and
background pixels by 0. The two-tone image generally
shows protrusions and dents in the characters as well as
isolated object pixels over the background, which are
cleaned by a logical smoothing approach [2].
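The thresholding method is not specified beyond being histogram based. As a minimal sketch of that step, the code below assumes Otsu's method (a stand-in chosen here, not necessarily the method used in the paper) and an 8-bit grey image held as a list of rows. The names otsu_threshold and binarize are hypothetical. Dark pixels are mapped to 1 (object) and light pixels to 0 (background), following the convention stated above.

```python
def otsu_threshold(gray):
    """Return the grey level that maximises the between-class variance.

    Assumption: Otsu's method stands in for the unspecified
    histogram-based thresholding; `gray` holds integers in 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t
    sum0 = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray, t):
    """Object (dark) pixels become 1, background pixels become 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]


if __name__ == "__main__":
    page = [[250, 250, 10],
            [240, 20, 245]]
    print(binarize(page, otsu_threshold(page)))
```

The smoothing of protrusions, dents and isolated pixels described above is a separate step and is not covered by this sketch.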
Fig.2. Some examples of Urdu compound characters.
Casual use of the scanner may lead to skew in the document image.
The skew angle is the angle that the text lines of the document image
make with the horizontal direction. Skew correction can be achieved by
(i) estimating the skew angle, and (ii) rotating the image by the skew
angle in the opposite direction.
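The two steps can be sketched as follows, assuming a set of candidate points (one per character, roughly along each text line) has already been extracted. The sketch uses a plain Hough vote over (angle, offset) pairs, in line with the Hough transform the paper applies to its candidate points. The function names, the 0.5 degree step and the 1 pixel offset bin are assumptions made for illustration. For brevity it rotates point coordinates rather than a full image.

```python
import math
from collections import Counter


def estimate_skew(points, max_angle=45.0, step=0.5):
    """Estimate the skew angle in degrees by Hough voting.

    For each trial angle, every point votes for the offset
    rho = y*cos(angle) - x*sin(angle) of the line through it.
    Points on one straight line of that angle share one offset bin,
    so the angle with the fullest bin is returned.
    Assumption: `points` are (x, y) pairs; the angle is measured in
    the same coordinate system as the points.
    """
    best_angle, best_votes = 0.0, 0
    n_steps = int(round(2 * max_angle / step))
    for k in range(n_steps + 1):
        angle = -max_angle + k * step
        t = math.radians(angle)
        c, s = math.cos(t), math.sin(t)
        votes = Counter(round(y * c - x * s) for x, y in points)
        top = max(votes.values())
        if top > best_votes:
            best_angle, best_votes = angle, top
    return best_angle


def deskew_points(points, angle):
    """Rotate points by -angle about the origin (the correction step)."""
    t = math.radians(angle)
    c, s = math.cos(t), math.sin(t)
    return [(x * c + y * s, -x * s + y * c) for x, y in points]


if __name__ == "__main__":
    # Synthetic candidate points on a line skewed by 10 degrees.
    slope = math.tan(math.radians(10.0))
    pts = [(float(x), x * slope + 50.0) for x in range(200)]
    angle = estimate_skew(pts)
    print(angle, deskew_points(pts, angle)[:2])
```

A coarser angle step trades accuracy for speed. The search range of -45 to +45 degrees mirrors the range the paper reports for its method.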
text line lie on a straight line. For skew angle detection, the usual
Hough transform is applied to these candidate points. After skew angle
detection the image is rotated according to the detected skew angle.
It has been noted that font style and size variations do not affect
this skew estimation method. Also, the proposed method can handle
documents with skew angles between -45° and +45°. From our experiments
it is observed that in about 97.4% of the cases our method computes
the skew angle within a tolerance of ±0.5 degree.
5. Line and character segmentation
The proposed OCR system automatically detects individual text lines
and then segments the characters in each line. We do not segment words
from a line for recognition purposes. The lines of a text block are
segmented by finding the valleys of the projection profile computed by
counting the number of black pixels in each row. The trough between
two consecutive peaks in this profile denotes the boundary between two
text lines. A text line lies between two consecutive boundary lines.
An Urdu text with its projection profile is shown in Fig.6. Line
segmentations are shown by dotted lines in this figure.
Fig.6. Horizontal projection profile of an Urdu text and its line
segmentations.
Character segmentation is done by a combination of component labeling
and vertical projection profile methods. A text line is scanned
vertically. If two or fewer object pixels are encountered in one
vertical scan, the scan is denoted by 0; otherwise the scan is denoted
by the number of object pixels in that column. In this way a vertical
scanning histogram is constructed. If the histogram contains a run of
at least K1 consecutive 0's, the midpoint of that run is taken as the
boundary of a character. The value of K1 is determined experimentally.
Sometimes, because of the kerned behavior [6] of Urdu script (kerned
characters are characters that overlap with neighbouring characters),
some characters of a line may not be segmented properly by the
projection profile method. To take care of such cases we apply a
component labeling approach along with the projection profile method.
If the distance between two consecutive character boundaries is large,
we suspect a mis-segmentation at this position and use component
labeling for further segmentation. During component labeling we check
the vertical overlap of the components. If two or more components
fully overlap vertically, we assume that they are different parts of
one character. If the horizontal overlap between the bounding boxes of
two consecutive components is less than 35%, we treat these two
components as parts of two different characters.
6. Feature selection
For the initial classification of characters, we consider topological
features, contour features, and features obtained from the concept of
water reservoirs. The topological features used include the existence
of holes (loops) and their number, the position of holes with respect
to the character bounding box, the ratio of hole height to character
height, the number of different components in a character, etc.
Contour features include characteristics of different profiles
obtained from a portion of the character's contour. The main water
reservoir based features used in the recognition scheme are (a) the
number of reservoirs from different sides of a component, (b) the
position of a reservoir with respect to its character bounding box,
(c) the height of a reservoir, (d) the water flow level of a
reservoir, (e) the direction of water overflow, (f) the ratio of
reservoir height to component height, etc. These features are used to
design a tree classifier where the decision at each node of the tree
is taken on the basis of the presence or absence of a particular
feature.
The features considered here are simple and easy to detect. They are
fairly robust to noise and stable with respect to font and style
variations. For the detection of holes a background component
labelling technique is used. The width and height of these holes are
also measured in units of text line height.
7. Character recognition
In the present paper we recognize the basic characters and numerals of
Urdu script. We do not consider the recognition of compound characters
here.
The characters are recognized in two stages. In the first stage, the
characters are grouped into a few subsets by a feature based tree
classifier. In the second stage, we use more sophisticated features to
recognize similar characters that belong to the same leaf node of the
classification tree. The design of a tree classifier has three
components: (1) a tree skeleton or hierarchical ordering of the class
labels, (2) the choice of features at each non-terminal node, and (3)
the decision rule at each non-terminal node. Our tree is a binary
tree, where the number of descendants of a non-terminal node is two.
While traversing the tree, only one feature is tested at each
non-terminal node. To choose a feature at a particular non-terminal
node, we consider two points: (a) the feature should be robust, and
(b) it should maximally divide the characters in the node into two
groups, to get an optimal tree. For a given non-terminal node, we
select a feature that best separates the group of patterns in the
above sense. A part of the tree classifier used for the recognition of
basic characters and numerals is shown in Fig.7. The features used in
the tree classifier are mainly based on the concept of water
reservoirs. For example, at some nodes we check whether any reservoir
exists from the top of a character. Similarly, reservoirs from the
bottom and left are checked at other nodes to classify some
characters. The use of other reservoir based features can be seen in
the classification tree. Note that in the tree some characters belong
to both sub-groups of a node. This is done to reduce the errors that
may occur because of the poor quality of documents.
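The traversal described above, with one presence/absence test per non-terminal node and leaves holding one or more candidate characters, can be sketched as follows. The feature names ("reservoir_top", "hole") and the labels ("c1" to "c4") are hypothetical placeholders, not the nodes of Fig.7. The helper most_balanced_feature is one possible reading of the "maximally divides" criterion (an even split) and is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class Leaf:
    labels: List[str]  # one character, or a group of similar characters


@dataclass
class Node:
    feature: str                    # the single feature tested here
    present: Union["Node", Leaf]    # branch taken if the feature exists
    absent: Union["Node", Leaf]     # branch taken otherwise


def classify(tree: Union[Node, Leaf], features: Dict[str, bool]) -> List[str]:
    """Walk the binary tree, testing one feature per non-terminal node."""
    node = tree
    while isinstance(node, Node):
        node = node.present if features.get(node.feature, False) else node.absent
    return node.labels


def most_balanced_feature(table: Dict[str, Set[str]]) -> str:
    """Pick the feature whose presence splits the labels most evenly.

    Assumption: an even split is used here as a stand-in for the
    paper's "maximally divides" criterion; robustness is not modelled.
    `table` maps each label to the set of features it exhibits.
    """
    features = sorted({f for fs in table.values() for f in fs})

    def imbalance(f: str) -> int:
        present = sum(1 for fs in table.values() if f in fs)
        return abs(2 * present - len(table))

    return min(features, key=imbalance)


if __name__ == "__main__":
    tree = Node("reservoir_top",
                present=Node("hole", Leaf(["c1"]), Leaf(["c2", "c3"])),
                absent=Leaf(["c4"]))
    print(classify(tree, {"reservoir_top": True, "hole": False}))
```

A label may be listed in the leaves of both branches of a node, which mirrors the paper's choice of letting some characters belong to both sub-groups to tolerate poor-quality input.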
Most leaf nodes of the tree point to a single character. Some leaf
nodes of the classification tree contain two similar-shaped
(confusion) characters. For example, see Fig.8, where the character
pairs of four different leaf nodes are shown. Here we discuss the
classification techniques for the characters of some leaf nodes.
Consider for example the two characters of Fig.8(a), which belong to
the same node of the tree. These two characters reach the same node
because the water flow level of the reservoir of both characters
coincides with the top of the character's bounding box. To separate
them, we divide each of the characters into two parts at the lowermost
point of its reservoir. The segmented parts are shown in Fig.9(a). For
the character marked as N2 in Fig.8(a) the right segmented part is
smaller than the left segmented part, whereas in the character marked
as N6 in Fig.8(a) the right segmented part is bigger than the left
segmented part.
Fig.8: Examples of similar-shaped characters in four leaf nodes of the
classification tree.
Fig.9: Recognition techniques of similar-shaped characters, panels (a)
to (d).
For separation of the two characters shown in Fig.8(b) we note whether
a component lies inside the bounding box of the other component in the
character. For example, see Fig.9(b). For the character marked as 8 in
Fig.8(b) a smaller component lies fully inside the bounding box of the
biggest component, while this is not true for the character marked as
3 in Fig.8(b).
For separation of the two characters of Fig.8(c) we use the water flow
direction of the reservoir. The two characters have different water
flow directions. The water flow direction is left for the character
marked as 18 in Fig.8(c), whereas it is right for the other character.
The water flow directions of these two characters are shown in
Fig.9(c).
The characters shown in Fig.8(d) are very similar. The character
marked as 38 in Fig.8(d) has only a small extra cavity. To separate
them we trace some of their border pixels. For illustration see
Fig.9(d). Starting from the topmost right pixel of the character,
border pixels are traced in the clockwise direction until the
lowermost point of the character is reached. During border tracing we
calculate the distance of each traced pixel from the right side of the
character's bounding box. We separate these characters by examining
this distance sequence. For example, in the character marked as 38 we
get at least one transition point in this distance sequence, while for
the character marked as 33 we do not get any transition point. By
transition we mean a change of the distance values from increasing
mode to decreasing mode or vice versa. This technique has also been
used for the recognition of some of the other similar characters.
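The transition test on the distance sequence can be sketched as follows, assuming the border pixels from the topmost right pixel to the lowermost point have already been traced and are given as (x, y) pairs. The function names are hypothetical. Steps where the distance stays constant are ignored, which is an assumption of this sketch since the paper does not say how plateaus are handled.

```python
def distances_to_right(border, right_x):
    """Distance of each traced border pixel from the right side of the
    character's bounding box (right_x is that side's x coordinate)."""
    return [right_x - x for x, _ in border]


def count_transitions(distances):
    """Count changes from increasing to decreasing mode or vice versa.

    Assumption: equal consecutive distances do not start a new mode.
    """
    transitions = 0
    prev_sign = 0  # 0 until the first strictly rising or falling step
    for a, b in zip(distances, distances[1:]):
        diff = b - a
        if diff == 0:
            continue
        sign = 1 if diff > 0 else -1
        if prev_sign and sign != prev_sign:
            transitions += 1
        prev_sign = sign
    return transitions


def has_extra_cavity(border, right_x):
    """True if the distance sequence shows at least one transition."""
    return count_transitions(distances_to_right(border, right_x)) >= 1


if __name__ == "__main__":
    smooth = [(9, 0), (8, 1), (7, 2), (6, 3)]            # distances only grow
    dented = [(9, 0), (7, 1), (8, 2), (6, 3), (5, 4)]    # grows, shrinks, grows
    print(has_extra_cavity(smooth, 9), has_extra_cavity(dented, 9))
```

On real scans a small tolerance on the size of a rise or fall would probably be needed so that single-pixel border noise does not count as a transition.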
Because of the page limitation of this conference we do not discuss
the recognition strategies for the characters of other nodes. We used
features such as the number of holes and their positions, the number
of horizontal crossings in a particular portion of a character,
reservoir heights, the position of a reservoir with respect to its
character bounding box, the ratio of reservoir height to component
height, etc. for the purpose.
Fig.7. A portion of the tree classifier for Urdu character
recognition. The Urdu characters are numbered in Fig.1, and the
numbers in the tree represent those characters.
The proposed OCR system was tested on a variety of printed Urdu
documents. Some of the documents contain good-quality printing on
clean paper; others are of inferior printing and paper quality (e.g. a
cheap alphabet book for children). In this section, we summarize the
results of our experiments.
Line and character segmentation: Our system identifies individual text
lines with an accuracy of 98.3%. Occasionally, when two adjacent text
lines are close to each other, the lower part of the upper line has
some overlap with the upper part of the lower line. In such
situations, there is no clear valley between the two lines in the
projection profile, and our system fails to detect the boundary
between them. All such errors were confined to the inferior-quality
documents.
The character segmentation accuracy of the system is 96.9%. Most
segmentation errors were caused by touching and compound characters.
Some errors were also caused by overlapping characters.
Character recognition: We tested our recognition system on 3050
characters and found that the system recognizes basic characters and
numerals with an accuracy of about 97.8%. The recognition errors can
be grouped into two classes:
(a) Segmentation errors: Incorrectly segmented characters are
recognized incorrectly. These errors contribute 0.7% to the total
error rate.
(b) Tree classification errors: Though water reservoir based features
are powerful and insensitive to font size and style variations,
topological features may not always correctly distinguish two similar
shaped characters. Thus, a character may sometimes be mis-recognized
as another character contained in the same leaf node of the
classification tree. This class of errors contributes 1.1% to the
overall error rate.
Other errors mainly occur during the recognition of confusion
characters in leaf nodes.
Conclusion:
This paper describes a system for OCR of printed Urdu script. The
recognition accuracy of our prototype is promising, but more work is
needed. Our character segmentation method should be improved to handle
a larger variety of characters that occur often in images obtained
from inferior-quality documents. We also need to recognize compound
characters to make it a complete OCR system. In general, the system
needs to be tested and fine-tuned on a wider variety of images
containing characters in diverse fonts and sizes.
References:
[1] A. Amin, “Off-line Arabic character recognition: The state of the
art”, Pattern Recognition, vol.31, pp.517-530, 1998.
[2] B. B. Chaudhuri and U. Pal, “A complete printed Bangla OCR
system”, Pattern Recognition, vol.31, pp.531-549, 1998.
[3] B. B. Chaudhuri, U. Pal and M. Mitra, “Automatic Recognition of
Printed Oriya Script”, Sadhana, vol.27, part 1, pp.23-34, 2002.
[4] V. K. Govindan and A. P. Shivaprasad, “Character recognition - a
survey”, Pattern Recognition, vol.23, pp.671-683, 1990.
[5] G. Lehal and C. Singh, “A Gurumukhi script recognition system”,
In Proc. 15th ICPR, vol.2, pp.557-560, 2000.
[6] Y. Lu, “Machine printed character segmentation - an overview”,
Pattern Recognition, vol.28, pp.67-80, 1995.
[7] S. Mori, C. Y. Suen and K. Yamamoto, “Historical review of OCR
research and development”, Proceedings of the IEEE, vol.80,
pp.1029-1058, 1992.
[8] G. Nagy, “Chinese character recognition - A twenty five years
retrospective”, In Proc. ICPR, pp.109-114, 1988.
[9] A. Negi, C. Bhagvati and B. Krishna, “An OCR system for Telugu”,
In Proc. 6th ICDAR, pp.1110-1114, 2001.
[10] U. Pal and B. B. Chaudhuri, “Printed Devnagari script OCR
system”, Vivek, vol.10, pp.12-24, 1997.
[11] U. Pal, A. Belaïd and Ch. Choisy, “Touching numeral segmentation
using water reservoir concept”, Pattern Recognition Letters, vol.24,
pp.261-272, 2003.
[12] R. M. K. Sinha, “Rule based contextual post processing for
Devnagari text recognition”, Pattern Recognition, vol.20, pp.475-485,
1995.