
High Accuracy Optical Character Recognition Using Neural Networks with Centroid Dithering
Hadar I. Avi-Itzhak, Thanh A. Diep, and Harry Garland

Abstract—Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In this study a neural network approach is introduced to perform high accuracy recognition on multi-size and multi-font characters; a novel centroid-dithering training process with a low noise-sensitivity normalization procedure is used to achieve this accuracy. The study consists of two parts. The first part focuses on single-size and single-font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts. The performance of these two networks is evaluated on a testing database of more than one million character images.

Index Terms—Pattern recognition, optical character recognition, neural networks.

Manuscript received Jan. 21, 1993; revised Feb. 21, 1994. H. I. Avi-Itzhak and T. A. Diep are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305. H. Garland is with Canon Research Center America, Palo Alto, CA 94304. IEEECS Log Number P95053.

I. INTRODUCTION

In today's world of information, countless forms, reports, contracts, and letters are generated each day; hence, the need to archive, retrieve, update, replicate, and distribute printed documents has become increasingly important [1], [3]. An available technology that automates these tasks on computer media is optical character recognition (OCR): printed documents are transformed into ASCII files, which enable compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. An essential requirement for OCR lies in the development of an accurate recognition algorithm by which digitized images are analyzed and classified into the corresponding characters. A comprehensive and recent benchmark of OCR performance with proprietary commercial algorithms was reported by Jenkins et al. [4]; the best recognition accuracy achieved on data of three point sizes (10, 12, and 14) and three fonts (Courier, Helvetica, and Times-Roman) ranges from 99.21% to 99.95%. The published literature reports error rates on the order of one percent for single-font recognition and higher error rates for multiple fonts [5], [6]. While an error rate on the order of one percent may appear impressive, it would generate 30 errors on an average page containing 3,000 characters. Such error rates limit the usefulness of OCR in many applications [1], [2], [8] and illustrate the need for a more accurate recognition algorithm.

In this paper, we introduce a feedforward neural network scheme to recognize multi-size and multi-font character images with extreme accuracy. This high accuracy is achieved by using a centroid-dithering training process and a low noise-sensitivity normalization procedure. The study consists of two parts. The first part focuses on single-size and single-font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. (Courier is important because it is the font most often used in legal documents; the technique used in developing a neural network for Courier is general and can be applied to any other single font.) The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts.

The recognition process works as follows. First, the two-dimensional pixel array of the input character is preprocessed, normalized, and decomposed into a vector (see Section II.D). Second, the vector is processed by the neural network to yield an output of 94 numbers (see Section II.E). Third, the neuron in the output layer with the highest value is declared the winner, identifying the input character image (see Section II.E). Fourth, a simple postprocessing algorithm is used to detect invalid characters and to discriminate between characters whose images become indistinguishable during preprocessing; these include single quotes and commas of certain fonts, and the case information of some characters (see Section II.G). The latter part of postprocessing applies to the multi-size and multi-font case only.
II. NEURAL NETWORK IMPLEMENTATION

A. Network Architecture
The neural network used to recognize single-size and single-font character images has 3,000 inputs, 20 neurons in the first layer, and 94 neurons in the second (output) layer. The neural network used to recognize multi-size and multi-font character images has 2,500 inputs, 100 neurons in the first layer, and 94 neurons in the output layer. Both networks are fully connected and feedforward, with the nonlinearity in each neuron generated by the sigmoid function

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$
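For concreteness, a minimal sketch of the forward pass through such a network is shown below. Only the layer sizes, the full feedforward connectivity, and the sigmoid of equation (1) come from the text; the class structure, the NumPy usage, and the small random initialization (mentioned later in Section II.E) are otherwise illustrative.

```python
import numpy as np

def sigmoid(x):
    # Logistic nonlinearity of equation (1).
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerNet:
    """Fully connected, feedforward: inputs -> first layer -> 94 outputs."""

    def __init__(self, n_inputs, n_hidden, n_outputs=94, seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights (see Section II.E).
        self.W1 = rng.normal(0.0, 0.01, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.01, (n_outputs, n_hidden))
        self.b2 = np.zeros(n_outputs)

    def forward(self, v):
        h = sigmoid(self.W1 @ v + self.b1)  # first layer (20 or 100 neurons)
        y = sigmoid(self.W2 @ h + self.b2)  # output layer (94 neurons)
        return h, y

# Single-size and single-font configuration: 3,000 inputs, 20 hidden neurons.
net = TwoLayerNet(n_inputs=3000, n_hidden=20)
_, y = net.forward(np.zeros(3000))
winner = int(y.argmax())  # the winning output neuron identifies the character
```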


Theoretically, the multi-font neural network could be implemented by a bank of single-font neural networks, each dedicated to recognizing one font, with character recognition based on the highest output across the bank. The obvious advantage of this approach is that it is straightforward. The disadvantages lie in the computational costs involved in training and testing these networks. Perhaps more importantly, the networks cannot benefit from associating the correct characters and disassociating the wrong characters of other fonts, since each network would employ training data from only one particular font. The lack of interdependency between the networks creates redundancy and increases complexity.

B. Training Algorithm
The training algorithm used in this study is known as backpropagation [9]. It is a steepest-descent algorithm for finding a set of weights which minimizes the squared-error cost function:

$$C = \sum_{p} \sum_{i=1}^{N} \left( d_i^{(p)} - y_i^{(p)} \right)^2 \qquad (2)$$




Fig. 1. Preprocessing. (a) Single-size and single-font: thresholding and centering of a 12 point Courier character. (b) Multi-size and multi-font: thresholding, scaling, and centering of a 16 point Times character.

where $d_i^{(p)}$ and $y_i^{(p)}$ designate the desired and actual outputs of the $i$th neuron in the last layer in response to the $p$th input pattern, and $N$ is the number of neurons in the output layer. The desired output for the $i$th output neuron under the $p$th input pattern is a discrete delta function:

$$d_i^{(p)} = \begin{cases} 1, & i = p \\ 0, & i \neq p \end{cases} \qquad (3)$$

C. Database

The data consisted of pre-segmented (though not all perfect) character images, scanned at 400 d.p.i. in 8-bit gray scale. The first neural network deals with single-size and single-font character recognition, and its training and testing data are of 12 point Courier font. The training data comprise 94 digitized character images, with a one-to-one correspondence between each training image and each member of the 12 point Courier character set. The network is thoroughly evaluated with testing data comprising 1,072,452 character images from a library of English short stories.

The second neural network deals with multi-size and multi-font character recognition. The allowable point size ranges from 8 to 32, and the fonts include Arial, AvantGarde, Courier, Helvetica, Lucida Bright, Lucida Fax, Lucida Sans, Lucida Sans Typewriter, New Century Schoolbook, Palatino, Times, and Times New Roman. The training data comprise 1,128 (94 × 12) character images; each member of the complete character set for each font appears exactly once in the training set. It is important to note that all training character images are of 16 point size, even though the network is trained to perform recognition on multi-size characters. An explanation of this property is provided in the next section. The testing data consist of 347,712 characters, or 28,976 characters for each font, with an even mixture of 8, 12, 16, and 32 point sizes.

D. Data Preprocessing and Normalization

Before it is fed to a neural network, the digitized image is preprocessed and normalized. The preprocessing and normalization procedure serves several purposes: it reduces noise sensitivity and makes the system invariant to point size, image contrast, and position displacement.

The reduction of background noise sensitivity is achieved by thresholding. The input image is filtered by zeroing those pixels whose values are less than 20% of the peak pixel value, while the remaining pixels are left unchanged. The threshold setting is heuristic and has been empirically shown to work well for white paper; it should be adjusted accordingly when a different paper product is used, e.g., newspaper.

Following thresholding, the resultant image is centered by positioning the centroid of the image at the center of a fixed-size frame. The centroid $(\bar{x}, \bar{y})$ of an image with pixel values $I(x, y)$ is defined as follows:

$$\bar{x} = \frac{\sum_x \sum_y x \, I(x, y)}{\sum_x \sum_y I(x, y)} \qquad (4)$$

$$\bar{y} = \frac{\sum_x \sum_y y \, I(x, y)}{\sum_x \sum_y I(x, y)} \qquad (5)$$

For the 12 point Courier font case, a frame size of 50-by-60 pixels has been found adequate to enclose all character images. For the multi-size and multi-font case, a frame size of 50-by-50 pixels is employed, and an additional scaling process must be applied to the images. The scaling entails initially computing the radial moment $M_r$ of the image:

$$M_r = \frac{\sum_x \sum_y \sqrt{(x - \bar{x})^2 + (y - \bar{y})^2} \; I(x, y)}{\sum_x \sum_y I(x, y)} \qquad (6)$$

Next, the image is enlarged or reduced with a gain factor of $10/M_r$, producing images of constant radial moment.
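A sketch of this preprocessing chain for the multi-size and multi-font case is given below. The function names are illustrative, `scipy.ndimage.zoom` merely stands in for whatever resampling the authors used, and the radial-moment formula follows the reconstruction of equation (6) above.

```python
import numpy as np
from scipy.ndimage import zoom  # stand-in for the 10 / M_r enlargement/reduction

def threshold(img):
    # Zero every pixel below 20% of the peak pixel value (Section II.D).
    out = img.copy()
    out[out < 0.2 * out.max()] = 0.0
    return out

def centroid(img):
    # Intensity-weighted centroid (x_bar, y_bar) of equations (4)-(5).
    total = img.sum()
    ys, xs = np.indices(img.shape)
    return (xs * img).sum() / total, (ys * img).sum() / total

def radial_moment(img):
    # Intensity-weighted mean distance from the centroid, as in equation (6).
    cx, cy = centroid(img)
    ys, xs = np.indices(img.shape)
    return (np.hypot(xs - cx, ys - cy) * img).sum() / img.sum()

def center_in_frame(img, frame=(50, 50)):
    # Position the image centroid at the center of a fixed-size frame.
    out = np.zeros(frame)
    cx, cy = centroid(img)
    dy, dx = round(frame[0] / 2 - cy), round(frame[1] / 2 - cx)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            ty, tx = y + dy, x + dx
            if 0 <= ty < frame[0] and 0 <= tx < frame[1]:
                out[ty, tx] = img[y, x]
    return out

def preprocess_multifont(img):
    # Threshold, scale to a constant radial moment of 10, center in 50x50.
    img = threshold(img)
    img = zoom(img, 10.0 / radial_moment(img))
    return center_in_frame(img, frame=(50, 50))
```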


Fig. 2. Learning curve (m.s.e. versus training iterations, log scale) of the single-size and single-font neural network.

The value of this radial moment is linked to the selected frame size of 50-by-50. From a broader perspective, this scaling process is equivalent to a point-size normalization procedure and enables the neural network to treat all character images the same way regardless of point size. An illustration of the thresholding, scaling, and centering operations is provided in Fig. 1.

The next step in the preprocessing and normalization procedure is to convert the two-dimensional images into vectors. The conversion is achieved by concatenating the rows of the two-dimensional pixel array. It follows that the vector for the single-size and single-font case has 3,000 elements and that the vector for the multi-size and multi-font case has 2,500 elements. Additionally, each vector $v$ is normalized to unit power:

$$\hat{v}_i = \frac{v_i}{\sqrt{\sum_j v_j^2}} \qquad (7)$$

The normalization reduces sensitivity to varying scanner gains (image-to-background contrast) as well as different toner darkness (shades of ink). This unit-norm vector is then fed into the neural network.

During training, there is an additional step performed on the input data: centroid dithering. The centroid-dithering process applies to the single-size and single-font case as well as the multi-size and multi-font case, and it involves dithering the centroid of the two-dimensional input image. After centering and scaling, the input image is displaced randomly and independently in the horizontal and vertical directions over the range of [-2, +2] pixels in each dimension; that is, the image is shifted at random into one of twenty-five possible displacement positions. The resultant image is then converted into a vector, normalized, and fed into the network as previously described. Centroid dithering effectively creates many "different" images from a single image. The neural network is exposed to the same character at different displacement positions, making the recognition system invariant to input displacements. It is important to emphasize that the dithering is performed exclusively during training and not during testing. The approach also enables the network to tolerate width variations in character strokes, which might be caused by different printer settings, toner levels, and variations in font implementation. This is particularly useful when bold face characters are encountered.
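A minimal sketch of centroid dithering during training is shown below, assuming a character already centered in its 50-by-50 frame; the shift helper pads with zeros rather than wrapping around the frame edges.

```python
import numpy as np

def shift(img, dy, dx):
    # Displace an image inside its frame, padding with zeros (no wrap-around).
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def dithered_training_vector(img, rng):
    # Training only: pick one of the 25 displacement positions in [-2, +2]^2,
    # then convert to a row-concatenated vector of unit power (equation (7)).
    dy, dx = rng.integers(-2, 3, size=2)
    v = shift(img, dy, dx).ravel()
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
frame = np.zeros((50, 50))
frame[20:30, 22:28] = 1.0            # stand-in for a centered character image
v = dithered_training_vector(frame, rng)
```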

E. Training

The training procedure is executed with an initial learning-rate parameter of 10 for both the single-size and single-font neural network and the multi-size and multi-font neural network. The learning progress is monitored by computing the mean squared error (m.s.e.) per output neuron:

$$\text{m.s.e.} = \frac{C}{[\text{no. of shift positions}] \times [\text{no. of training patterns}] \times [\text{no. of fonts}] \times [\text{no. of output neurons}]} \qquad (8)$$


where C is the cost function defined in equation (2). With small random initial weights, each output neuron generates an approximate error of ±0.5, and the resultant mean squared error is 0.25. The m.s.e. values as a function of training iterations are plotted in Fig. 2 for the single-size and single-font neural network. The single-size and single-font neural network is trained with 430,000 iterations, and the multi-size and multi-font neural network with 8,650,000 iterations; in both cases the final m.s.e. falls several orders of magnitude below the initial value of 0.25.
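The update rule itself is ordinary backpropagation on the cost of equation (2). A minimal sketch of one training step is given below, reusing the TwoLayerNet from Section II.A; the per-pattern (stochastic) update and the fixed learning rate are assumptions, since the paper states only the initial learning-rate value.

```python
import numpy as np

def train_step(net, v, p, lr=10.0):
    # One backpropagation step [9] on input vector v whose true class is p,
    # for the TwoLayerNet sketched in Section II.A.
    h, y = net.forward(v)
    d = np.zeros_like(y)
    d[p] = 1.0                                  # delta-function target, eq. (3)
    delta2 = 2.0 * (y - d) * y * (1.0 - y)      # output layer; logistic derivative
    delta1 = (net.W2.T @ delta2) * h * (1.0 - h)
    net.W2 -= lr * np.outer(delta2, h)
    net.b2 -= lr * delta2
    net.W1 -= lr * np.outer(delta1, v)
    net.b1 -= lr * delta1
    return float(((d - y) ** 2).sum())          # this pattern's contribution to C
```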

F. Testing
The testing process entails holding the weights of the neural network constant and evaluating the network with testing data. Each testing pattern is fed to the network, and the squared error for the output layer is computed. This process is repeated for all of the testing patterns, and the cumulative squared error provides a measure of the neural network's performance. More discussion may be found in [9], [10], [11]. Fig. 3 graphically summarizes two types of error indicators for the single-size and single-font neural network: (a) the worst-case squared error over all 94 output neurons and all 25 offset positions for each input character, and (b) the worst-case squared error over all 94 output neurons for each input character without any offsets. Fig. 4 pertains to the multi-size and multi-font neural network; there, the worst-case squared errors are taken over all fonts at a fixed point size.
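A sketch of this evaluation loop is shown below, again reusing the TwoLayerNet sketch; the worst-case reduction over patterns mirrors the error indicators plotted in Figs. 3 and 4, though the exact bookkeeping here is illustrative.

```python
import numpy as np

def worst_case_error(net, patterns, labels):
    # Weights held fixed; report the worst per-neuron squared error observed
    # over all test patterns, in the spirit of Figs. 3 and 4.
    worst = 0.0
    for v, p in zip(patterns, labels):
        _, y = net.forward(v)
        d = np.zeros_like(y)
        d[p] = 1.0
        worst = max(worst, float(((d - y) ** 2).max()))
    return worst
```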


Fig. 3. Worst-case squared errors for the single-size and single-font neural network: worst case over all displacement positions, and zero displacement only.

Fig. 4. Worst-case squared errors for the multi-size and multi-font neural network: worst case over all displacement positions, and zero displacement only.

It is evident from the figures that the weights obtained from the training process have achieved extremely close approximations of the desired mappings.

G. Postprocessing
An important task of postprocessing pertains to the detection of invalid character inputs. More specifically, the detection is accomplished by observing the occurrence of small responses on all output neurons. This is an intrinsic property of a trained neural network and is very useful in discounting bad images which might result from segmentation errors or other defects. This information may be used as an error flag which provides feedback to the segmentation algorithm.
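A sketch of such a flag is below; the floor value is an illustrative assumption, as the paper does not state the response level it treats as "small."

```python
def is_invalid(y, floor=0.2):
    # An invalid image produces small responses on all 94 output neurons;
    # flag it when even the winning neuron stays below a floor. The value
    # 0.2 is illustrative -- the paper does not specify its threshold.
    return float(y.max()) < floor
```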

The second function of postprocessing involves recovering information lost in scaling and centering multi-size and multi-font character images, and it is used for the multi-size and multi-font system only. The characters c, C, k, K, o, O, p, P, s, S, v, V, w, W, x, X, z, and Z of certain fonts lose their case information after scaling and are therefore recognized by the neural network without an affirmative upper/lower case identification. This case information, however, can easily be reconstructed by a context-based approach. The technique resorts to examining the radial moments of the original character images prior to scaling and is best explained by an example.


Without loss of generality, assume that the neural network identifies an image as the character c. The first step is to deduce the point size of this c by computing the gain 10/M_r, where M_r is the radial moment of a neighboring character that is case-distinguishable. The next step is to calculate the radial moments of a fabricated upper case C and a fabricated lower case c of this point size. The case information is then obtained by comparing the radial moment of the input character with those of the fabricated ones.

Commas and single quotes of certain fonts also become indistinguishable after centering and scaling. The discrimination between these two characters is made by comparing the centroid location of the input character image before preprocessing to the height of the line. Finally, the numeral zero cannot be reliably distinguished from the letter O in some fonts, and similarly the numeral 1, lower case l, upper case I, and vertical bar | are ambiguous in some fonts. Under these circumstances the characters are left as they are, without any postprocessing.
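In code, the final comparison step of the case-recovery example might look as follows; the reference moments for the fabricated renderings are assumed to be computed elsewhere (e.g., with the radial_moment routine sketched in Section II.D).

```python
def recover_case(m_input, m_upper_ref, m_lower_ref):
    # Compare the radial moment of the original (pre-scaling) input with the
    # radial moments of fabricated upper- and lower-case renderings at the
    # point size deduced from a case-distinguishable neighboring character.
    if abs(m_input - m_upper_ref) <= abs(m_input - m_lower_ref):
        return "upper"
    return "lower"
```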

III. RECOGNITION PERFORMANCE

A. Results
In order to determine the recognition performance, a computer program is used to compare the output of the recognition system with the ASCII files which were used to generate the testing data. The computer program examines each and every output character, and all discrepancies excluding spaces, tabs, and carriage returns are recorded. The discrepancies are individually examined and classified as either an erroneous or a correct recognition.

There are two situations where discrepancies are classified as correct recognition. The first case involves image corruption which renders invalid images unrecognizable. Fig. 5 provides examples of invalid inputs due to segmentation error, scanning error, and paper residue. As explained in Section II.G, the neural network automatically generates small responses on all output neurons to indicate bad inputs. Such an occurrence is detected during postprocessing, and the discrepancy is not counted as a recognition error. The second case involves characters that are indistinguishable. These include the numeral 0, letter O, lower case l, upper case I, numeral 1, and the vertical bar | in some fonts. The ambiguity arises from the fact that one character from one font can look identical to a different character of another font. Hence, the neural network may output any one of the ambiguous characters when the input is ambiguous, and this is not counted as an error. Any other discrepancies which do not fall into one of these two categories are counted as recognition errors.

The character-ambiguity problem does not apply to the single-size and single-font experiment, since all characters of the Courier font are distinguishable; there, all discrepancies except those of corrupted images are treated as errors, and the neural network is required to recognize all 94 characters, including the difficult distinction between the lower case L (l) and the numeral one (1). The single-size and single-font neural network is tested with 1,072,452 characters of 12 point Courier font, and a perfect recognition accuracy has been achieved. This recognition performance exceeds any previously known results by at least an order of magnitude.

The multi-size and multi-font neural network was tested with 347,712 characters, or 28,976 characters for each font, with an even mixture of 8, 12, 16, and 32 point sizes (see Section II.C). Using the performance criteria previously described, the multi-size and multi-font neural network has also achieved perfect recognition accuracy. The same network was additionally tested with the data used for the single-size and single-font neural network; there was one recognition error among the 1,072,452 testing characters of 12 point Courier font. The error is documented in Fig. 6.

Fig. 5. Neural responses to invalid images as a result of (a) segmentation error, (b) scanning error, and (c) paper residue.

Fig. 6. The input image and recognized output for the single error incurred by the multi-size and multi-font neural network on the 12 point Courier test data.

B. Analysis

The question arises: if n independent trials of an experiment have all resulted in success, what is the probability that the next trial will also result in success? In this context, we employ Laplace's Special Rule of Succession [7], which yields an estimate of the probability of success:

$$p = \frac{n + 1}{n + 2}$$
For the single-size and single-font neural network, we obtain


$$p = \frac{1{,}072{,}453}{1{,}072{,}454} = 99.99991\% \,,$$

and for the multi-size and multi-font neural network,

$$p = \frac{347{,}713}{347{,}714} = 99.99971\% \,.$$

In order to avoid bias towards any one particular font, the n used in computing the latter probability of error is based on the multi-size and multi-font database and does not include the million Courier characters.

Alternately, we introduce the following statistical analysis in order to quantify a lower bound for the recognition accuracy on future testing data. Given a testing image corresponding to the pth character, where p ∈ {1, 2, ..., 94}, we define two random variables:

$$A_p = y_p \qquad (9)$$

$$B_p = \max_{i \neq p} y_i \qquad (10)$$

As in equation (2), y_i represents the output of the ith neuron of the output layer. While the random variable A_p is always linked to the pth neuron, B_p assumes the value of the highest output among the remaining neurons, whichever that may be. Correct recognition of a character requires that A_p > B_p. The conditional probability of error given an input image of the pth character is derived below:

$$\mathrm{Prob}(\mathrm{error} \mid p) = \mathrm{Prob}(A_p < B_p \mid p) \qquad (11)$$

$$= \mathrm{Prob}\left( E[A_p - B_p] - (A_p - B_p) > E[A_p - B_p] \right) \qquad (12)$$

$$\leq \frac{\mathrm{Var}(A_p - B_p)}{E^2[A_p - B_p]} \qquad (13)$$

$$\approx \mathrm{Var}(A_p - B_p) \qquad (14)$$

Inequality (13) invokes the Chebychev inequality, and approximation (14) takes A_p ≈ 1 and B_p ≈ 0. These assumptions are verified by the sample averages obtained from the testing data.
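The bound of (13) and (14) can be estimated directly from outputs recorded during testing. A minimal sketch, assuming arrays of recorded A_p and B_p values:

```python
import numpy as np

def chebychev_bound(a_samples, b_samples):
    # a_samples holds A_p (the correct neuron's output) and b_samples holds
    # B_p (the largest competing output) over many test images of character p.
    d = np.asarray(a_samples) - np.asarray(b_samples)
    return d.var() / d.mean() ** 2   # inequality (13); ~ Var(A_p - B_p) when mean ~ 1
```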

To illustrate the concept, we apply the Chebychev bound to the four most frequent characters in our data: e, t, a, and o. These letters are all in lower case except the letter o: since o is an ambiguous character in the multi-font case, its samples may also include the upper case O or the numeral zero. The following tables summarize the sample averages and variances computed during the test run and the resultant upper bounds on the probability of error.

TABLE I
CHEBYCHEV UPPER BOUND FOR THE PROBABILITY OF ERROR FOR THE SINGLE-SIZE AND SINGLE-FONT NEURAL NETWORK

Character    Samples    E(A_p - B_p)    Chebychev upper bound
e            113,060    0.99            1.7 · 10⁻
t             80,273    0.99            5.8 · 10⁻
a             72,565    0.99            2.9 · 10⁻
o             70,423    0.98            5.6 · 10⁻

TABLE II
CHEBYCHEV UPPER BOUND FOR THE PROBABILITY OF ERROR FOR THE MULTI-SIZE AND MULTI-FONT NEURAL NETWORK

Character    Samples
e             38,784
t             25,920
a            124,528
o            126,112

The Chebychev bound is known in general to be a conservative bound, and the upper bounds on the probabilities of error in Tables I and II are much higher than those estimated by the Laplace Rule of Succession.

IV. CONCLUSIONS

The study presents a neural network scheme with centroid dithering and a low noise-sensitivity normalization procedure for high accuracy optical character recognition. The single-size and single-font neural network has been successfully trained to recognize 12 point Courier font characters. The network was trained with a database of 94 character images, was tested on a database of 1,072,452 character images, and achieved perfect recognition. Even the lower case L (l) was successfully distinguished from the numeral one (1) in every instance. Based on the experience with this network, a larger neural network was successfully designed and trained to recognize characters of 12 commonly used fonts and point sizes from 8 to 32. The latter neural network was trained with 1,128 character images, and it achieved perfect recognition on a testing database of 347,712 multi-size and multi-font characters. To gauge the tradeoff between the two networks, the multi-size and multi-font neural network was also tested on the 1,072,452-character Courier database, and one error was incurred.

In addition to their high recognition accuracy, the neural networks are computationally efficient. The input layer performs 20 (100) dot-product operations on the full input image for the single-size and single-font (multi-size and multi-font) neural network, rather than the 94 (1,128) such operations inherent to most template-matching schemes. The second layer is computationally negligible. In this sense, the network is effectively performing recognition based on only 20 (100) features for the single-size and single-font (multi-size and multi-font) neural network, and the training process automatically determines these features.
ACKNOWLEDGMENT


The first two authors would like to thank the United States Air Force and the Fannie and John Hertz Foundation for their fellowship funding. The valuable suggestions and comments of the referees are very much appreciated.
REFERENCES
[1] G. R. Cote and B. Smith, "Profiles in Document Managing," Byte, pp. 198-212, Sept. 1992.
[2] J. C. Dvorak, "Some Wishful Thinking about Computing in the Coming Year," PC-Computing, vol. 4, no. 11, p. 62, Nov. 1991.
[3] L. Grunin, "OCR Software Moves into the Mainstream," PC Magazine, pp. 299-350, Oct. 30, 1990.
[4] F. Jenkins, J. Kania, and T. A. Nartker, "Using Ideal Images to Establish a Baseline of OCR Performance," Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, 1993.

[5] S. Kahan, T. Pavlidis, and H. S. Baird, "On the Recognition of Printed Characters of Any Font and Size," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, no. 2, pp. 274-288, 1987.
[6] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical Review of OCR Research and Development," Proceedings of the IEEE, vol. 80, no. 7, pp. 1029-1058, July 1992.
[7] E. Parzen, Modern Probability Theory and Its Applications. New York: John Wiley & Sons, 1960.
[8] S. V. Rice, J. Kanai, and T. A. Nartker, "A Report on the Accuracy of OCR Devices," Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, 1992.
[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing, vol. 1. Cambridge, MA: MIT Press, 1986, ch. 8, pp. 318-362.
[10] B. Widrow and M. Lehr, "30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation," Proceedings of the IEEE, vol. 78, no. 9, pp. 1415-1442, Sept. 1990.
[11] B. Widrow and R. G. Winter, "Neural Nets for Adaptive Filtering and Adaptive Pattern Recognition," IEEE Computer, vol. 21, no. 3, pp. 25-39, March 1988.

Erratum

Correction to "Analysis of Camera Movement Errors in Vision-Based Vehicle Tracking"

In the January 1995 issue of these Transactions, in the above-mentioned paper authored by W. Sohn and N. D. Kehtarnavaz, the equation following number (15) was incorrectly reproduced. The correct equation is

$$\sigma_x = \sigma_y = f \tan \varepsilon \,.$$

Also, in the second-to-last sentence of the paper, the phrase "absolute value of the relative height" should read "depression angle."
