Preprocessing and Structural Feature Extraction for a Multi-Fonts Arabic / Persian OCR

Mandana Kavianifar Adnan Amin


School of Computer Science and Engineering
University of New South Wales, Sydney
NSW 2052, Australia
mandanak@cse.unsw.edu.au amin@cse.unsw.edu.au

Abstract

English and Chinese are languages that have attracted tremendous interest from character recognition researchers. In contrast, research on character recognition for Arabic / Persian scripts faces major problems, mainly related to the unique characteristics of these two scripts, such as being cursive, the multiple shapes of one character in different positions in a word, and the connectivity of characters on the baseline. The proposed work consists of three major phases. After digitizing the text, the original image is transformed into a gray scale image using a 300-dpi scanner. Different steps of preprocessing are then applied to the image file. In the next phase, the sub-words of all words are recognized and global features for each word are extracted. Contour tracing plays a very important role in the feature extraction phase.

1. Introduction

Character recognition, as one of the most important fields of pattern recognition, has been the center of attention for researchers in the last four decades. The modern version of OCR appeared in the middle of the 1940s with the development of digital computers [1]. Since then, several character recognition systems for English, Chinese and Japanese characters have been proposed [12]-[14]. However, the development of character recognition systems for other languages such as Arabic and Persian has not advanced as quickly.

Arabic is the official language of all countries in North Africa and most of the countries in the Middle East, and it is the sixth most commonly used language in the world. In addition, Farsi or Persian, with almost 90 million speakers (the official language of Iran, also spoken in Afghanistan), Kurdi, Pashto and Urdu are all written using modified Arabic alphabets [8], [2].

These figures support the fact that the slow advancement of Arabic/Persian OCR is not because these languages are unimportant or no longer living, but because of technical difficulties caused by the special characteristics of these scripts. These unique characteristics can be categorized as follows:

• Arabic / Persian script (handwritten or printed) is cursive, and letters are normally connected to each other on an imaginary line called the baseline. Hence, separating the characters from each other in this kind of script is a difficult task.
• Each letter in Arabic/Persian script can have two to four different shapes within a word (at the beginning, in the middle, at the end or stand-alone). Figure 1 shows these four different positions for the character “AIN”. This fact increases the number of characters to be recognized from 28 (for Persian 32) to 100 (for Persian 114).
• Arabic/Persian is written and read from right to left.
• Most letters have one, two or three dots that can be positioned above, below or inside them (Figure 2). In several characters, the main shapes are the same and only the number and position of the dots distinguish them. Hamzah (zigzag) also has this characteristic. Dot and Hamzah are called “complementary characters”.
• An Arabic/Persian word is composed of a number of portions, which are named “sub-words” in this paper. A sub-word comprises one stand-alone character or a group of connected characters. See Figure 3.

Figure 1. Four different shapes of character “AIN”. From left to right: stand alone, at the end, in the middle and at the beginning
Figure 2. Three different Arabic characters. (a) 1 dot below, (b) 1 dot in the middle, (c) 3 dots above the character

Figure 3. Three Arabic words. (a) contains one sub-word with 4 characters, (b) contains three sub-words with 1, 1 and 2 characters respectively, (c) contains two sub-words with 2 and 2 characters

Due to these major differences between Arabic/Persian and Latin or Chinese scripts, methods proposed for the latter are not suitable for the former. Dedicated research on this subject did not start until the early 1980s. IRAC, which was suggested by Amin et al. [3], used a structural classification method. IRAC II [4] was based on a segmentation technique. Badi and Shimura used the concept of contour tracing and the identification of the cursive component in their syntactic method [5]. In another method [6], sub-words are identified and separated in the text, and a histogram is then used to segment each sub-word. Mahmoud [7] adopted a combination of Fourier descriptors and contour tracing for Arabic characters. Contour tracing also plays a crucial role in the system proposed by Allam [2]. During the last five years, researchers have suggested several other methods; for a complete survey of the literature on Arabic OCR, the reader is referred to [9]. It should be mentioned that Sakhr Automatic Reader no-3.01 and Shonut’s Omnipage Pro Version 2.0 are two examples of commercially available Arabic character recognition systems [17].

In this paper, a multi-font Arabic/Persian character recognition system and its major phases are explained. Within the paper, “Arabic” refers to both Arabic and Persian unless explicitly stated otherwise.

2. The proposed recognition system

The proposed character recognition system for the Arabic character set is composed of the following phases:

2.1. Digitization

The first phase of this system is data acquisition, which is done by scanning an Arabic text at 300 dpi resolution. The input file for the system is a gray scale PGM-formatted file, so any other gray scale format is converted before being used as an input file. Gray scale image files preserve more of the original image than binary images, so the file provides more information about it. Also, noise or some features such as loops can be distorted in a binary version of an image file, so that, unlike in a gray scale file, useful information is corrupted.

2.2. Preprocessing

Preprocessing (pixel-level processing) of the input file makes it ready for further processing. This major phase includes the following steps:

2.2.1. Global thresholding. A suitable threshold among the potential thresholds is selected as a global threshold by employing Otsu’s method [15]. Pixels with a value greater than the global threshold are treated as background and the others as foreground pixels.

2.2.2. Connected components recognition. Connected components (cc) are rectangular boxes bounding regions of connected foreground pixels. The objective of this step is to form these rectangles around the distinct components of the input file [16].

2.2.3. Grouping. The next step is the grouping of neighboring connected components of similar dimension. The algorithm takes one cc at a time and tries to merge it into a group from a set of existing groups. If it succeeds, the group’s dimensions are altered to cater for the new cc. If the cc cannot be merged with any of the existing groups, a new group is formed with the cc as its sole member. Figure 4 shows an Arabic word, its connected components and group.

Figure 4. (a) An Arabic word, (b) Its connected components (sub-words), (c) Its group

2.2.4. Skew detection and correction. The adopted skew detection algorithm [11] attempts to determine the skew angle of the entire document by calculating a skew angle for each group. Then, a skew correction algorithm is applied to the input file.
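The first two preprocessing steps (Sections 2.2.1 and 2.2.2) can be illustrated with a minimal Python sketch. It is not the system’s own implementation: the function names, the image represented as a list of rows of integer gray values, the 256 gray levels and the choice of 8-connectivity are all assumptions made for illustration. Only the rule that pixels brighter than the Otsu threshold are background is taken from the text.

```python
from collections import deque


def otsu_threshold(gray, levels=256):
    """Return the gray level that maximises between-class variance (Otsu)."""
    hist = [0] * levels
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of their gray values
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray, threshold):
    """Foreground (True) = value <= threshold; brighter pixels are background."""
    return [[value <= threshold for value in row] for row in gray]


def bounding_boxes(binary):
    """Return (top, left, bottom, right) for each 8-connected foreground region."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Grouping (Section 2.2.3) would then merge neighbouring boxes of similar dimension. The merging criteria are not stated in the paper, so that step is left out of the sketch.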
2.3. Feature extraction

The proposed work has adopted a global approach for character recognition, and no character segmentation is required.

2.3.1. Contour tracing. The basic step for determining the sub-words within each word is tracing the outer contours of all its elements. Within the boundaries of each group (representing a word), a raster scan from top to bottom and left to right is started until the first foreground pixel is reached. From this point, contour tracing begins, adopting the Freeman chain code and the Left-Most-Looking (LML) rule [7]. External and loop contours are the two different types of contours produced in this step. An Arabic word and its contours are shown in Figure 5.

2.3.2. Contour analysis. In this step, first all the word’s sub-words are determined, and then each sub-word’s contours are analyzed and classified. Each sub-word can have three types of external contours: main body, complementary character or noise. Noise of noticeable size is another sort of contour that can be found in the image file. During the digitization phase, some spurious pixels may appear in the image file. Fortunately, they are not very large and can be recognized by their contour’s length.

Figure 5. (a) An Arabic word, (b) Its main body contours (external), (c) The contour of upper dot (external), (d) The contour of lower dot (external), (e) The contour of loop (internal)

Two thresholds, TN and TM, are set to help find the contour’s class. If the Contour’s Length (CL) is less than TN, the contour is considered noise and its class gets the value 0. If CL >= TM, the contour belongs to a main body and the value 1 is assigned to the field. If TN < CL < TM, a complementary character’s contour has been found and this field will have the value 2. At the end of this step, an array of information about the contours of each sub-word of the word is built. This array contains the contour’s type, the coordinates of the first pixel of the contour, the necessary information about each pixel of the contour, and the contour’s length and class.

2.3.3. Sub-words detection. Another analysis of the set of contours is carried out to find a main body, which determines a sub-word. To find the vertical boundaries of this contour, its two pixels with the largest and smallest values of the Y-coordinate must be identified. This is done by traversing the linked list of information about all pixels of the contour and comparing their Y-coordinates with the current minimum and maximum.

2.3.4. Detecting a sub-word’s complementary characters. All the contours of complementary characters that belong to the detected sub-word should be found. Therefore, another search through the array is carried out to find all the external contours whose starting points have a Y coordinate falling within the two boundaries. As mentioned before, dots can occur as 1, 2 or 3 dots. In some fonts, dots can be attached together and form a bigger island.

For detecting non-attached dots, the distances between the X and Y coordinates of the starting points of their contours are checked. If these distances are less than two certain thresholds, the contours belong to non-attached dots, and the number of contours determines the type of dots (2 or 3). To detect the type of attached dots (see Figure 6), three different methods have been designed and tested on all available Arabic fonts for Windows. These methods are based on comparing the length and area of the attached dots’ contour with different defined thresholds, and also on comparing the chain-code sequence of the contour with certain patterns. The results show that the third method works best among the three. In fact, the third method enables the proposed system to work as an omni-font OCR for all the fonts available in the Arabic for Windows word processor.

Figure 6. (a), (b) Two Arabic words, each with one group of 3 attached dots and one group of 2 attached dots

Dots in Arabic characters can be positioned under, above or in the middle of them. Because the position of dots with respect to the baseline can determine their position with respect to the character, correct detection of the baseline is important. Two different methods have been suggested and tested for baseline detection. The test results show that the second method has been more successful than the first one.

1. Baseline detection using the contour’s chain code [10].
2. Baseline detection using the horizontal projection profile and finding the row with the largest density.
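As an illustration, the second baseline method and the length-based contour classes of Section 2.3.2 might look like the following Python sketch. The function names, the binary image as a list of rows of 0/1 values, the `tolerance` parameter, the reading of a dot’s starting coordinate as a row index and the treatment of a contour whose length equals TN are assumptions; the paper does not specify them, and it gives no values for TN and TM.

```python
def classify_contour(length, tn, tm):
    """Return 0 (noise), 1 (main body) or 2 (complementary character)."""
    if length < tn:
        return 0
    if length >= tm:
        return 1
    # TN <= length < TM. The text leaves length == TN open;
    # it is assumed here to be a complementary character.
    return 2


def horizontal_projection(binary):
    """Count the foreground pixels in every row of a non-empty image."""
    return [sum(1 for pixel in row if pixel) for row in binary]


def detect_baseline(binary):
    """Return the index of the row with the most foreground pixels.

    Ties are resolved in favour of the topmost row (an assumption).
    """
    profile = horizontal_projection(binary)
    return max(range(len(profile)), key=lambda r: profile[r])


def dot_position(start_row, baseline_row, tolerance=0):
    """Place a dot relative to the baseline; row indices grow downwards."""
    if start_row < baseline_row - tolerance:
        return "above"
    if start_row > baseline_row + tolerance:
        return "below"
    return "middle"
```

The projection profile is cheap to compute because it needs only one pass over the pixels of the word, with no contour information.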
Comparing the X coordinate of a dot contour’s starting point with the baseline location determines the position of the dots.

2.3.5. Producing output file. At the end of the feature extraction step, an output file is produced which contains all the information about every word in the image file. Each line contains the following information: name of the word, number of sub-words, number of peaks in the histogram of the sub-word, type and position of complementary characters, and number of loops. The last four fields are repeated for every sub-word of the word.

3. Experimental Results and Conclusion

The proposed work has been tested on some printed Arabic words in 14 of the 19 different fonts available in Arabic for Windows (well-known fonts such as Naskh, Andalus, Kufi and Traditional Arabic). The system has been developed in Microsoft Visual C++ on the Windows NT platform. Table 1 shows the results of the feature extraction phase on several sample Arabic words.

Name of Entity    % Correctly Recognized
Sub-Word          97%
One dot           ~91%
Two dots          ~78%
Three dots        ~63%
Loop              100%

Table 1. Test results of the feature extraction phase

Some Arabic fonts and other complex features of Arabic script have caused poor results for the recognition of 2 and 3 dots. In these fonts, the contour’s length and area for 2 or 3 dots differ from those in most other fonts, so finding correct values for the thresholds is difficult and in some cases impossible. Another reason for the rather poor recognition of dots is that dots are attached to main bodies in some fonts. In these cases no contours are traced for the dots, and therefore they are not recognized as dots at all. The system has excellent performance for a single font in the feature extraction phase, because the problem of dots attached together and/or to the main body does not occur. Also, the behavior of the system’s algorithms is predictable on noisy documents. The system recognizes all the existing loops of the words.

References

[1] V. K. Govindan and A. P. Shivaprasad, “Character Recognition – A Review”, Pattern Recognition, Vol. 23, No. 7, pp. 671-683, 1990
[2] M. Allam, “Segmentation versus Segmentation-free for Recognizing Arabic Text”, Document Recognition II, SPIE, Vol. 2422, pp. 228-235, 1995
[3] A. Amin, A. Kaced, J. P. Haton and R. Mohr, “Handwritten Arabic Character Recognition by the IRAC system”, Proceedings of the 5th Int. Conference on Pattern Recognition, pp. 729-731, 1980
[4] A. Amin, “Machine recognition of hand written Arabic words by the IRAC II system”, Proceedings of the 6th Int. Joint Conference on Pattern Recognition, pp. 34-36, 1982
[5] K. Badi and M. Shimura, “Machine recognition of Arabic cursive scripts”, Pattern Recognition in Practice, Amsterdam: North Holland, 1980
[6] A. Amin and G. Masini, “Machine recognition of multi-font printed Arabic texts”, Proceedings of the 8th International Conference on Pattern Recognition, pp. 392-395, 1986
[7] S. A. Mahmoud, “Arabic character recognition using Fourier descriptors and character contour encoding”, Pattern Recognition, Vol. 27, No. 6, pp. 815-824, 1994
[8] M. R. Hashemi, O. Fatemi and R. Safavi, “Persian script recognition”, Proceedings of the Third Int. Conference on Document Analysis and Recognition, Vol. II, pp. 869-873, 1995
[9] A. Amin, “Off-line Arabic Character Recognition: The state of the art”, Pattern Recognition, Vol. 31, No. 5, pp. 517-530, 1998
[10] M. Kavianifar, “A Persian/Arabic Character Recognition System”, Proceedings of Pan-Sydney Area Workshop on Visual Information Processing, pp. 207-211, 1997
[11] A. Amin, S. Fischer, A. Parkinson and R. Shin, “Comparative Study of Skew Detection Algorithms”, Journal of Electronic Imaging 5 (4), pp. 443-451, 1996
[12] I. Sekita, K. Toraichi, R. Mori, K. Yamamoto and H. Yamada, “Feature extraction of handprinted Japanese characters by spline function or relaxation matching”, Pattern Recognition, Vol. 21, pp. 9-17, 1988
[13] X. L. Xie and M. Suk, “On machine recognition of handprinted Chinese characters by feature relaxation”, Pattern Recognition, Vol. 21, pp. 1-7, 1988
[14] H. Matsumura, K. Aoki, T. Iwahara, H. Oohama, and K. Kogura, “Desktop optical handwritten character reader”, Sanyo Tech. Rev., Vol. 18, pp. 3-12, 1986
[15] N. Otsu, “A threshold selection method from gray level histogram”, IEEE Trans. Syst. Man Cyb., SMC-9(1), pp. 62-66, 1979
[16] A. Amin and M. Kavianifar, “Automatic recognition of printed Arabic text using neural network classifier”, Proc. of 9th Int. Conference in Image Analysis and Processing, Vol. II, pp. 616-623, 1997
[17] http://www.sakhrsoft.com, http://www.shonut.com