
1. Introduction

Optical Character Recognition is an important and practical technology in the computer age. More people than ever before are using personal computers, laptops, tablets, and e-readers to read documents and books. This means that old print media must be scanned and converted to a digital format in order to be accessed from these devices. Optical Character Recognition (OCR) programs are used to read scanned images and convert them into a digital character-based format. This project provides a general survey and basic implementation of Optical Character Recognition. First, the history and current state of OCR technology is examined. Then, there is an overview of the methods employed by OCR programs and the classification algorithms therein. Finally, there is a basic MATLAB implementation of an OCR program that will take text in an image and convert it to plain text.

2. History
Since the late 1920s, many engineers have attempted to develop OCR systems; however, it was not until the 1950s that the first commercial OCR system became available. This was because the technology was not needed in many places, and the immature algorithms of the era made OCR too expensive to implement. David H. Shepard invented the technology and developed the first machine, called Gismo, to convert printed text into machine language for computer processing, for which he was issued U.S. Patent number 2,663,758. Based on this technology, he also founded Intelligent Machines Research Corporation (IMR) and sold the first commercial OCR machines. The practicality of the IMR scanner lay in a scanning approach different from Gismo's: while Gismo could only read printed text that was reasonably close and vertically aligned, IMR scanners were able to read any characters in the scanned field.

3. Applications
Although the market for OCR is not large, many developers are taking advantage of its basic principles. Currently, the largest consumer use of OCR comes from Adobe Acrobat and Microsoft OneNote. These programs allow for easy character recognition and conversion to digital form. The software incorporates simple drop-down menus and scans the digital format quickly. Supported sources include business cards, books, and whiteboards, and receipts can even be exported to spreadsheets. The software goes one step further and performs text-to-speech for visually impaired users who use OCR as a second set of eyes.

Figure no. 1: An Industrial OCR station manufactured by Vision Group Inc.

3.1 CAPTCHA
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) uses a form of anti-OCR. Its main purpose is to tell whether an input comes from a human user or from a script. It prompts for a randomized code to be re-entered into a text field. The code is distorted or disguised to fool OCR into generating false positives, ultimately limiting automated access. This helps limit spam from scripts and other computer-generated access to a site.

Figure no. 2: A typical CAPTCHA

A service owned by Google, Inc., known as reCAPTCHA, uses humans to help digitize old documents in very little time. Many OCR systems throw flags for words that are ambiguous or for which there is no answer. When such a flag is thrown, the word is saved into a database and paired with a known word in a CAPTCHA field. If the correct answer for the known word is entered into the text field, the program assumes the other word was also read correctly. This process is repeated until the certainty for the unknown word increases. This is an innovative method that also improves OCR recognition by comparing questionable positives with human input. These features are useful when scanning textbooks, since search functions become available, and novels can be exported to text for easier storage after scanning. Another application of OCR technology is at the post office: addresses and zip codes are often handwritten on an envelope, and OCR allows the post office to automatically read the address on a piece of mail and sort it into the appropriate bin for delivery. Currently, CAPTCHAs are displayed over 100 million times a day, with the most popular deployments on Facebook, Twitter, TicketMaster, and online forums.

Though CAPTCHAs are fairly secure, a presentation at the DEF CON hacking conference detailed a method to reverse the distortion of the image and produce a valid response 10% of the time. Recently, this algorithm has been updated to achieve a 38% success rate. It relies on a fairly complex OCR method to read the CAPTCHAs and is not publicly available. Countermeasures by Google, such as altering distortion patterns and locking out users after a certain number of failed attempts, may also frustrate potential abusers.

4. Procedure
There are a number of simple and heuristic algorithms for optical character recognition. Most algorithms, however, include two main areas of processing: training and recognition. The training portion of OCR involves teaching the program how to recognize letters based on shape, spelling, etc. The recognition portion of OCR uses the data obtained from training to classify each character and output the best-guess answer.

5. Program Training
Training an OCR program is a process that creates a library of characters that the program will later use to compare input data against known characters. To determine a character, the program searches through its library to determine which character is the closest match to the input data. The library is created by providing the program with a set of data in which the characters are known. The more sample images provided for each character, the more accurate the program will become. Training the program also allows an OCR program to recognize fonts and handwriting that are uncommon or oddly shaped.
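For illustration, a minimal MATLAB sketch of this training step is shown below. The folder layout (one subfolder of sample images per character), the file names, and the 16x16 template size are assumptions made for the sketch, not part of the original design.

% Build a template library from labelled sample images (illustrative sketch).
% Assumed layout: train/A/*.png, train/B/*.png, ... (one folder per character)
labels  = ['A':'Z' '0':'9'];                 % character classes to learn
library = struct('label', {}, 'template', {});
for k = 1:numel(labels)
    files = dir(fullfile('train', labels(k), '*.png'));
    acc = zeros(16, 16);                     % accumulator for normalized samples
    for f = 1:numel(files)
        img = imread(fullfile('train', labels(k), files(f).name));
        if size(img, 3) == 3, img = rgb2gray(img); end
        bw  = im2bw(img, graythresh(img));   % global (Otsu) threshold
        acc = acc + double(imresize(bw, [16 16]));
    end
    library(k).label    = labels(k);
    library(k).template = acc / max(numel(files), 1);   % average template
end
save('ocr_library.mat', 'library');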

Figure no.3: The two portions of Optical Character Recognition

6. Recognition
There are three main steps employed in order to determine a string of characters from an input image:
1) Pre-Processing: Background noise and unimportant elements in the image are removed so that only the characters remain. Images are often converted to grayscale. Individual characters are also segmented into their own blocks for further processing.
2) Feature Extraction: Important or unique features of each character are determined.
3) Classification: The character is classified. This can be based on a number of algorithmic approaches. In its most basic form, extracted features are simply compared to the library. In more advanced OCR programs, elements such as spelling, grammar, and previous history also aid in the recognition process.

For example, if the program has the word "lexicon" as its input and it successfully determines all letters except the "i" in the center, it can search its built-in dictionary for words that begin with "lex" and end with "con" to find the best match for the middle letter.
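A minimal MATLAB sketch of this kind of dictionary lookup is shown below; the dictionary file name and its cell-array format are assumptions made for the sketch.

% Dictionary-assisted correction of one uncertain character (illustrative sketch).
% Assumes 'dictionary.mat' contains a cell array 'words' of lowercase words.
load('dictionary.mat', 'words');
partial = 'lex_con';                              % '_' marks the unrecognized letter
pattern = ['^' strrep(partial, '_', '.') '$'];    % turn '_' into a regexp wildcard
hits = words(~cellfun(@isempty, regexp(words, pattern, 'once')));
% 'hits' now holds every dictionary word consistent with the recognized letters,
% e.g. {'lexicon'}; the classifier can then pick the most probable candidate.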

7. Algorithms
Many pattern recognition algorithms have been used for determining the correct character in an OCR program. There are three main categories of algorithms used (Liu, Cai, & Buse, 2003). The first is statistical pattern recognition, which selects the character from the library with the highest probability of being the character in the image. The second technique is structural pattern recognition, which uses 2-dimensional structures extracted from the image and attempts to match them to structures in the library. The third technique is soft computing for pattern recognition, which uses fuzzy logic as well as a combination of the other two approaches.

8. Technologies to be used

Software:
Operating System: Microsoft Windows (MATLAB compatible)
Software Required: a) MATLAB 7.0

9. Implementation
The implementation of a simple Optical Character Recognition program will be performed with MATLAB. Because of its built-in image-reading tools and native array types, MATLAB is a natural choice for a simple OCR program. The implementation will take place in three steps:
1) Create the library of characters. A simple and universal font should be chosen to maximize recognition of characters.
2) Create the pre-processing routines to convert to grayscale, remove noise, segment, etc.
3) Create the classification algorithm. The program will use a simple comparison algorithm to compare the characters (see the sketch below); no grammar, spelling, or language information will be used to classify the characters.
The testing of the implementation will not be overly complicated. Bitmap or JPG images will be created (using MS Paint or a similar program) that contain text in a uniform font. Handwritten but easily legible words can also be used to test the extremes of the program.
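A minimal sketch of such a comparison algorithm is given below. It assumes the template library produced in the training sketch above (a struct array with 'label' and 'template' fields holding 16x16 templates); those names are illustrative assumptions, not the project's final code.

function label = classify_char(charImg, library)
% Classify one segmented binary character image by direct template comparison
% (illustrative sketch, saved as classify_char.m).
    bw = double(imresize(charImg, [16 16]));      % normalize size to the templates
    best = -Inf;  label = '?';
    for k = 1:numel(library)
        score = corr2(bw, library(k).template);   % 2-D correlation coefficient
        if score > best
            best = score;  label = library(k).label;
        end
    end
end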

10. Process Model Chosen / Methodology
Waterfall Model

11. Modules
OCR systems consist of five major modules:

11.1 Pre-processing:

The raw data is subjected to a number of preliminary processing steps to make it usable in the descriptive stages of character analysis. Pre-processing aims to produce data that are easy for the OCR system to operate on accurately. The main objectives of pre-processing are:
11.1.1 Binarization
11.1.2 Noise reduction
11.1.3 Stroke width normalization
11.1.4 Skew correction
11.1.5 Slant removal
A minimal sketch of the first two objectives (binarization and noise reduction) follows the list.
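The sketch below assumes an 8-bit grayscale scan and uses a median filter followed by Otsu thresholding; the file name, filter size, and speck-size threshold are illustrative choices.

% Binarization and noise reduction (illustrative pre-processing sketch).
img = imread('page.png');                     % assumed input scan
if size(img, 3) == 3, img = rgb2gray(img); end
img = medfilt2(img, [3 3]);                   % median filter suppresses salt-and-pepper noise
bw  = ~im2bw(img, graythresh(img));           % Otsu threshold; invert so text pixels = 1
bw  = bwareaopen(bw, 10);                     % remove connected specks smaller than 10 pixels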

11.2 Segmentation:

Text Line Detection (Hough Transform, projections, smearing)
Word Extraction (vertical projections, connected component analysis)
Word Extraction 2 (RLSA)
A sketch of projection-based text line detection is given below.
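The sketch assumes the binary image 'bw' (text pixels = 1) produced by the pre-processing step above.

% Text line detection by horizontal projection profile (illustrative sketch).
rowSum = sum(bw, 2);                      % number of text pixels in each row
inText = rowSum > 0;                      % rows that contain any text
edges  = diff([0; inText; 0]);            % +1 at line starts, -1 just after line ends
lineStart = find(edges == 1);
lineEnd   = find(edges == -1) - 1;
lines = cell(numel(lineStart), 1);
for k = 1:numel(lineStart)
    lines{k} = bw(lineStart(k):lineEnd(k), :);   % crop one text line
end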

11.2.1 Explicit Segmentation:

In explicit approaches one tries to identify the smallest possible word segments (primitive segments) that may be smaller than letters but surely cannot be segmented further. Later in the recognition process these primitive segments are assembled into letters based on input from the character recognizer. The advantage of this strategy is that it is robust and quite straightforward, but it is not very flexible.

11.2.2 Implicit Segmentation:


In implicit approaches the words are recognized as a whole, without segmenting them into letters. This is effective and viable only when the set of possible words is small and known in advance, such as in the recognition of bank checks and postal addresses.

11.3 Feature Extraction:


In the feature extraction stage each character is represented as a feature vector, which becomes its identity. The major goal of feature extraction is to extract a set of features which maximizes the recognition rate with the least number of elements. Due to the nature of handwriting, with its high degree of variability and imprecision, obtaining these features is a difficult task. Feature extraction methods are based on three types of features:
Statistical
Structural
Global transformations and moments
A sketch of a simple statistical (zoning) feature extractor follows the list.
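The sketch below splits a normalized character image into a 4x4 grid and uses the ink density of each zone as a 16-element feature vector; the grid and image sizes are illustrative assumptions.

function fv = zoning_features(charImg)
% Zoning feature extraction: ink density per zone (illustrative sketch,
% saved as zoning_features.m).
    bw = imresize(charImg, [32 32]);          % normalize character size
    fv = zeros(1, 16);
    n  = 1;
    for r = 1:8:32                            % 4 x 4 grid of 8x8 zones
        for c = 1:8:32
            zone  = bw(r:r+7, c:c+7);
            fv(n) = sum(zone(:)) / 64;        % fraction of text pixels in the zone
            n = n + 1;
        end
    end
end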

11.4 Classification:
Common classifiers include k-Nearest Neighbor (k-NN), the Bayes classifier, Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM), etc. There is no such thing as the best classifier; the choice of classifier depends on many factors, such as the available training set, the number of free parameters, etc. A sketch of a nearest-neighbor classifier over the zoning features is given below.
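The sketch implements a 1-nearest-neighbor rule over the zoning feature vectors; 'trainFeats' and 'trainLabels' are illustrative assumptions (one row or entry per training sample), not part of the project's actual data layout.

function label = knn_classify(fv, trainFeats, trainLabels)
% 1-nearest-neighbor classification on feature vectors (illustrative sketch,
% saved as knn_classify.m).
% trainFeats : N x 16 matrix of zoning features for N training samples (assumption)
% trainLabels: N x 1 char vector of the corresponding character labels (assumption)
    d = sum((trainFeats - repmat(fv, size(trainFeats, 1), 1)).^2, 2);  % squared Euclidean distance
    [dmin, idx] = min(d);                    % nearest training sample
    label = trainLabels(idx);
end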

11.5 Post-processing:
Goal: the incorporation of context and shape information in all stages of an OCR system is necessary for meaningful improvements in recognition rates. The simplest way of incorporating context information is the use of a dictionary for correcting minor mistakes. In addition to a dictionary, a well-developed lexicon and a set of orthographic rules (lexicon-driven matching approaches) can be applied during or after the recognition stage for verification and improvement. Drawback: unrecoverable OCR decisions.

12. Module-wise Description:

1. Binarization and Noise Cleaning
2. Skew Detection and Correction
3. Text Orientation Detection
4. Image Cropping
5. Page Segmentation

12.1 Binarization and Noise Cleaning
Description: The input image is binarized or noise is removed from the input page, so that it is fit for segmentation and recognition. There are three methods available for this: Adaptive Binarization, Sauvola, and Morphological Noise Cleaning.
Actors: The users involved are either a general user/client or a system administrator.
Precondition: The input image should be grayscale and not colored.
User Interface: The input grayscale image is selected and any one of the three binarization routines is chosen. After the process has been completed, the GUI prompts for saving the result. A sketch of the Sauvola routine is given below, followed by the use case diagram for Binarization and Noise Cleaning.
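The sketch below computes Sauvola's local threshold T = m*(1 + k*(s/R - 1)) with an averaging filter for the local mean m and stdfilt for the local deviation s; the window size, k, and R values are typical illustrative choices, not the project's tuned parameters.

% Sauvola local binarization (illustrative sketch).
img = double(imread('page.png'));             % assumed 8-bit grayscale scan
w   = 15;                                     % local window size (assumption)
k   = 0.2;  R = 128;                          % Sauvola parameters (typical values)
m   = imfilter(img, fspecial('average', w), 'replicate');   % local mean
s   = stdfilt(img, ones(w));                                 % local standard deviation
T   = m .* (1 + k * (s / R - 1));             % Sauvola threshold surface
bw  = img < T;                                % text (dark) pixels set to 1
bw  = bwmorph(bw, 'clean');                   % remove isolated noise pixels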

Figure No.5 Use case diagram for Binarization and Noise Cleaning

12.2 Skew Detection and Correction


Description: The module is used to remove skew of up to +15 or -15 degrees. Skew gets introduced during scanning and hampers recognition accuracy.
Actors: General user and system administrator.
Precondition: The image should be grayscale. The document page should have pictures.
Scenario: While scanning a document page, the image can get skewed by a small amount.
User Interface: The skewed image is selected and the skew detection and correction process is run. The deskewed output image is displayed and the GUI prompts for saving. A sketch of projection-based skew detection is given below.
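The sketch searches, within the stated +/-15 degree range, for the rotation angle that maximizes the variance of the horizontal projection profile and then corrects with imrotate; the 0.5-degree step is an illustrative choice and 'bw' is the binary page from the pre-processing sketch.

% Skew detection and correction by projection-profile search (illustrative sketch).
angles  = -15:0.5:15;                         % candidate skew angles in degrees
bestVar = -Inf;  bestAng = 0;
for a = angles
    rot = imrotate(bw, a, 'nearest', 'crop'); % trial rotation
    p   = sum(rot, 2);                        % horizontal projection profile
    v   = var(p);                             % a peaked profile means rows align with text lines
    if v > bestVar
        bestVar = v;  bestAng = a;
    end
end
deskewed = imrotate(bw, bestAng, 'nearest', 'crop');   % apply the detected correction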

Figure no.6 Use case diagram for skew detection and correction

12.3 Text Orientation Detection


Description: This module checks whether the image is in portrait or landscape format.
Actors: General user or system administrator.
Precondition: The image should be binarized.
Scenario: The input document image is converted from landscape to portrait format.
User Interface: The user selects the input image, chooses to change the orientation, and saves the output image. A sketch of a simple orientation check is given below.
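The sketch compares the image dimensions and rotates a landscape page to portrait; this size-based rule is an illustrative assumption, not necessarily the project's actual detection method.

% Simple orientation check and correction (illustrative sketch).
[rows, cols] = size(bw);                  % 'bw' is the binarized page image
if cols > rows                            % wider than tall -> landscape
    bw = imrotate(bw, 90);                % rotate to portrait
end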

Figure no. 7 Use Case diagram for Text Orientation Correction

12.4 Image Cropping


Description: This module is used for cropping a selected area of text.
Actors: General user and system administrator.
Precondition: NIL.
Scenario: The user is interested in cropping a selected area from the input document page.
User Interface: The user chooses an input image, selects the cropping area by dragging the mouse over the selection, and then crops it. The GUI prompts for saving the cropped area.

Figure no. 8 Use Case Diagram for Cropping

12.5 Page Segmentation


Description: This module is used for segmenting document pages into component blocks. These blocks are labeled as text, graphics, or picture. There are two routines available for segmentation: Profile Based Segmentation and Texture Based Segmentation.
Actors: General user and system administrator.
Precondition: For profile based segmentation the input image should be binarized.
Scenario: To run the recognition engine, we need to extract text regions only.
User Interface: The GUI allows the user to choose an input image and select any one of the segmentation routines.

Figure no. 9 Use Case Diagram for Page Segmentation

Block Diagram

Figure no. 10

Figure no. 11

Figure no. 12

Figure no.13 Use case diagram for segmentation

Figure no. 14

Level 0 DFD

Figure no.15

Level 1 DFD

Figure no.16

Sequence Diagram

13. Testing
The testing phase of the implementation is simple and straightforward. Since the program is coded into modular parts, the same routines used in the training phase can be reused in the testing phase as well. Algorithm:
1. Load the image file.
2. Analyze the image for character lines; for each character line, detect consecutive character symbols.
3. Analyze and process each symbol image to map it into an input vector.
4. Feed the input vector to the OCR code in MATLAB and compute the output.
5. Convert the binary output to the corresponding character and render it to a box.
A minimal end-to-end sketch of this test pipeline is given below.
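The sketch reuses the illustrative routines from earlier sections (zoning_features, knn_classify) and assumes the training data and file names shown here; none of these are the project's final code.

% End-to-end test of the illustrative OCR pipeline (sketch).
img = imread('test_page.bmp');                        % assumed test image
if size(img, 3) == 3, img = rgb2gray(img); end
bw  = bwareaopen(~im2bw(img, graythresh(img)), 10);   % binarize, text = 1, drop specks
load('ocr_training.mat', 'trainFeats', 'trainLabels');% assumed training data (illustrative)

[lbl, num] = bwlabel(bw);                             % connected components = candidate symbols
stats  = regionprops(lbl, 'BoundingBox');
result = '';
for k = 1:num
    bb  = round(stats(k).BoundingBox);                % [x y width height]
    sym = bw(bb(2):bb(2)+bb(4)-1, bb(1):bb(1)+bb(3)-1);
    fv  = zoning_features(sym);                       % feature vector (sketched earlier)
    result(end+1) = knn_classify(fv, trainFeats, trainLabels);
end
disp(result)   % recognized characters (note: bwlabel order is column-wise, not reading order)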

Figure no.17 Flowchart

14. Current State and Future Work


Preliminary research on the subject has been conducted, and a short survey was compiled. More research will be required in order to support the final report. Specific algorithms used in OCR are being researched in order to further expand the Procedure section. Diagrams that aid in the explanation of these algorithms will need to be located through research or created by the group. The implementation of the MATLAB program is still in the research stage. Specific tools and functions built into MATLAB that can ease the implementation process are being researched. The first functions created will be those for training the program. Once training is fully implemented, classification functions will be written to verify the training. References are being compiled but still need to be added to the references section. In-text citations also need to be added where appropriate.

15. References
Liu, Z.-Q., Cai, J., & Buse, R. (2003). Handwriting Recognition. Germany: Springer.
Optical Character Recognition (OCR): What you need to know. (n.d.). Retrieved May 2, 2011, from Phoenix Software International: www.phoenixsoftware.com/pdf/ocrdataentry.pdf
Kae, A., Smith, D. A., & Learned-Miller, E. (2011, March 24). Learning on the fly: A font-free approach toward multilingual OCR.
Kae, A., Huang, G., Doersch, C., & Learned-Miller, E. (2010, August). Improving state-of-the-art OCR through high-precision document-specific modeling.
