
PRESENTED BY:

RASHMI RANJAN BEHERA


REGISTRATION NO
0501212312
ITER, BHUBANESWAR
(Sept 2008)
TALK FLOW
 Pixels

 What and Why of OCR

 Steps in OCR

 Conclusion
PIXELS
 Pixels (picture elements), also called pels: image sample areas that are almost always square.
 All pixels are identical in size and arrangement.
 All pixels are processed the same way.
 All pixels are scanned, displayed, and printed in the
same way.
 Each pixel has a location and a colour.
 Both are given as numbers.
 Location: by coordinates.
 Colour: amounts of red, green, and blue.
 Maximum on all three is white; minimum on all three is black.
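As a small illustration of the points above, a pixel can be read from an image array by its coordinates and comes back as three numbers. This sketch assumes 8 bits per channel (0 to 255) and a NumPy array indexed as [row, column]; the helper name pixel_at is made up for the example.

```python
import numpy as np

# A tiny 2-row by 3-column RGB image; every pixel starts as black (0, 0, 0).
image = np.zeros((2, 3, 3), dtype=np.uint8)
image[0, 1] = (255, 255, 255)  # maximum on all three channels: white
image[1, 2] = (255, 0, 0)      # red only

def pixel_at(img, x, y):
    """Return the (R, G, B) colour at column x, row y (hypothetical helper)."""
    return tuple(int(c) for c in img[y, x])

print(pixel_at(image, 1, 0))
print(pixel_at(image, 0, 0))
```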
WHAT AND WHY OF OCR…
 Optical Character Recognition (OCR) is the process of
translating scanned images of typewritten text into
machine-editable information.

 Problem Statement: Given a digitized image, how should an algorithm analyze its content, recognize the identity of any character contained in the image, and return this information?
BLOCK DIAGRAM OF OCR SYSTEM

(Courtesy : ISI Kolkata)


• Input image is scanned
from books or paper
documents.
• This image acts as the
input to OCR system.
Normalise the image for:

 Brightness
 Contrast
 Distance
 Type of Scanner
(Courtesy : Scanned from G K Today)
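One possible way to normalise brightness and contrast is a percentile-based contrast stretch. The slides do not say which method was used, so this is only a sketch under that assumption; the function name and the 1% and 99% cut-offs are illustrative choices.

```python
import numpy as np

def normalise_contrast(gray, low_pct=1.0, high_pct=99.0):
    """Stretch grey levels so the chosen percentiles map to 0 and 255."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to stretch.
        return np.zeros_like(gray, dtype=np.uint8)
    scaled = (gray.astype(float) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).round().astype(np.uint8)
```

Scans of the same page made at different brightness settings should then occupy a similar grey-level range before binarisation.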
BINARISATION
 Converts a grey-level (8-bit) TIFF (Tagged Image File Format) image to a binary image.
 Histogram-based global threshold approach.
 Each pixel becomes a single bit: one value for black, the other for white.
 Helps in segregating background from text.
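The slide names only a "histogram based global threshold" without saying which one. A common choice is Otsu's method, sketched below as an assumption: it picks the grey level that maximises the between-class variance of the histogram. The input is assumed to be a 2D uint8 array with dark text on a light background.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick a global threshold from the 256-bin histogram (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                 # pixels at or below each level
    w1 = hist.sum() - w0                 # pixels above each level
    sum0 = np.cumsum(hist * levels)
    valid = (w0 > 0) & (w1 > 0)
    mu0 = sum0 / np.maximum(w0, 1)               # mean of the dark class
    mu1 = (sum0[-1] - sum0) / np.maximum(w1, 1)  # mean of the light class
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0.0)
    return int(np.argmax(between))

def binarise(gray):
    """True = white (background), False = black (text)."""
    return gray > otsu_threshold(gray)
```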
SKEW CORRECTION
 Determine the degree of skewness.
 Use HEADLINE or page edges for correction by
rotating the image.
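The degree of skew can be estimated in several ways. The sketch below uses a projection-profile search, which is a stand-in rather than the headline method itself: for each candidate angle it projects the ink pixels along lines of that slope and keeps the angle whose profile is most sharply peaked. Long horizontal strokes such as headlines produce strong peaks in this profile. The search range and step are arbitrary assumptions.

```python
import numpy as np

def estimate_skew(binary, max_angle=5.0, step=0.25):
    """Return the slope of the text lines in degrees (image coordinates,
    y pointing down). binary: 2D array, nonzero = ink."""
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return 0.0
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        # Shear each ink pixel back along a line of this slope.
        proj = ys - xs * np.tan(np.radians(angle))
        bins = np.round(proj - proj.min()).astype(int)
        profile = np.bincount(bins).astype(float)
        score = float(np.sum(profile ** 2))  # peaked profiles score high
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The image would then be rotated by the opposite of the estimated angle to bring the text lines back to horizontal.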
BACKGROUND NOISE

Textual Noise
• Extraneous symbols from the neighboring page.
• Hand Written Material.

Non-Textual Noise
• Black Borders.
• Speckles.
TEXTUAL NOISE

(PHOTO CREDIT : TIMES OF INDIA)


TEXT FROM NEIGHBOURING PAGE

(Courtesy : Scanned from G K Today)


NON-TEXTUAL NOISE
BLACK BORDER
STRAY MARKS AND SPECKLES

(Courtesy : Pratt , Y . Kanetkar)


PRESENCE OF PICTURES

(Scanned from G K Today)


NOISE REMOVAL
 Page frame: Rectangular region enclosing all the foreground pixels in the document image.
 Document cleaning:
 Filtering
 Layout Analysis
 Parameters:
 Size
 Aspect Ratio
 Limitations: Fails if characters from the adjacent page are present.
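The page frame and the size and aspect-ratio filter listed above can be sketched as follows. The page frame follows the definition given (the bounding rectangle of the foreground pixels); the filter thresholds and the name looks_like_text are illustrative assumptions, not values from the slides.

```python
import numpy as np

def page_frame(binary):
    """Bounding rectangle (top, bottom, left, right) of all foreground
    pixels, with bottom and right exclusive. binary: nonzero = ink."""
    rows = np.flatnonzero(binary.any(axis=1))
    cols = np.flatnonzero(binary.any(axis=0))
    if rows.size == 0:
        return None  # blank page
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def looks_like_text(width, height, min_size=3, max_size=100, max_aspect=10.0):
    """Keep a component only if its size and aspect ratio are plausible
    for a character (hypothetical thresholds)."""
    short, long_side = min(width, height), max(width, height)
    if short < min_size or long_side > max_size:
        return False  # speckle, or something as large as a border
    return long_side / short <= max_aspect
```

A filter of this kind cannot reject characters from the adjacent page, because they have the same size and shape as genuine text, which is the limitation noted above.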
 Assumption: The page borders are very close to the edges, and the borders are separated from the page contents by white space.
 Relies upon:
 Classification of blank/textual/non-textual rows and columns.
 Object segmentation.
 Basic Idea: Dissect the image of a line of text at the locations between characters, i.e., the character breaks.
 Character breaks depend on:
 Fonts
 Type sizes
 Printing quality
 Binarization
 Imaging artifacts
 Bottom – Up

 Top – Down

 Mixed
BOTTOM UP APPROACH
Segmentation starts with individual letters on a page, then, based on text-layout conventions, groups letters into words, words into paragraphs, and so on. Line-art and half-tones are often detected by their size or their non-text layout.
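The first grouping step of the bottom-up approach, letters into words, can be sketched with a simple gap rule. It assumes the characters of one text line are already available as horizontal extents (x_start, x_end), and that a gap wider than max_gap separates two words; both the input format and the threshold are assumptions made for the example.

```python
def group_into_words(char_boxes, max_gap):
    """Group character extents on one line into words.
    char_boxes: list of (x_start, x_end); returns a list of word lists."""
    words = []
    for box in sorted(char_boxes):
        previous_end = words[-1][-1][1] if words else None
        if previous_end is not None and box[0] - previous_end <= max_gap:
            words[-1].append(box)   # small gap: same word
        else:
            words.append([box])     # large gap: start a new word
    return words

line = [(0, 5), (7, 12), (14, 19), (30, 35), (37, 42)]
print(group_into_words(line, max_gap=3))
```

The same rule, applied with larger thresholds and in the vertical direction, would group words into lines and lines into paragraphs.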
TOP DOWN APPROACH
 Method 1
 Top-down approaches take advantage of the fact that formatted documents usually have margins surrounding each region. The page can be subdivided into different regions by examining the white space in the document.
 Method 2
 Top-down methods also use the bit density or texture of the document to identify and classify regions.
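Method 1 can be sketched as a one-level white-space cut, in the spirit of recursive X-Y cut algorithms: the page is first split into horizontal bands at runs of blank rows, then each band is split at runs of blank columns. The function names and the min_gap value are assumptions; a fuller version would recurse on each region.

```python
import numpy as np

def nonblank_runs(profile, min_gap):
    """Spans (start, end) of non-blank entries, split wherever at least
    min_gap consecutive entries are blank (zero)."""
    spans, start, last, gap = [], None, 0, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            last, gap = i, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, last + 1))
                start = None
    if start is not None:
        spans.append((start, last + 1))
    return spans

def split_regions(binary, min_gap=3):
    """Return regions as (top, bottom, left, right), ends exclusive."""
    ink = binary != 0
    regions = []
    for top, bottom in nonblank_runs(ink.sum(axis=1), min_gap):
        band = ink[top:bottom]
        for left, right in nonblank_runs(band.sum(axis=0), min_gap):
            regions.append((top, bottom, left, right))
    return regions
```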
OVER SEGMENTATION
 Dot matrix printing or insufficient ink
 Characters tend to be fragmented

UNDER SEGMENTATION
 Ink smudging
 Small fonts
 Signatures
SEGMENTATION PROBLEMS

Character Merging
• Characters like “ry”, “%l”, “m” appear connected.
• Under threshold binarization, ink smudging.

Character Fragmentation
• Over threshold binarization. Poor quality printing.
• Thin strokes. Dot matrix or ink jet printing.
OVERCOMING SEGMENTATION PROBLEMS

Separation by Valley of Vertical Projection
 Searches for vertical white space between characters.
 Projection of character pixels along the vertical direction and the detection of valleys.

Separation by Connected White Path
 Very often in italic fonts, characters overlap vertically but do not touch. These characters can be segmented by finding a connected path of white pixels separating the characters from top to bottom. This technique is equivalent to segmenting characters according to connected components.
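Separation by valley of vertical projection can be sketched as follows: count the ink pixels in each column of a text-line image and cut wherever the count drops to zero. Treating only empty columns as valleys is a simplifying assumption; touching characters would need a threshold above zero or one of the other techniques.

```python
import numpy as np

def split_by_projection(line):
    """line: 2D array for one text line, nonzero = ink.
    Returns (start, end) column spans, end exclusive, one per character."""
    profile = (line != 0).sum(axis=0)   # ink pixels per column
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                   # leaving a valley
        elif count == 0 and start is not None:
            spans.append((start, x))    # entering a valley
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans
```

Overlapping italic characters leave no empty column between them, which is why the connected white path technique exists as an alternative.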
Fixed-Pitch Segmentation
 Nearly all typewriter fonts, dot matrix and daisy wheel printer fonts, and a large number of laser printer fonts are fixed pitch. Since all characters in these fonts are the same width, if the pitch can be determined, the characters can be easily separated by breaks at regular intervals.

Cut And Test
 This technique dissects the character image at several candidate locations and evaluates the result of the segmented pieces. The candidate locations are determined by considering factors such as the average break point distances.
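Fixed-pitch segmentation can be sketched in two steps: estimate the pitch, then place breaks at regular intervals. Estimating the pitch as the median distance between the starts of already separated characters is an assumption made for this example; the slides do not say how the pitch is determined.

```python
from statistics import median

def estimate_pitch(spans):
    """Median distance between successive character starts.
    spans: list of (start, end) extents that were separated reliably."""
    starts = sorted(start for start, _ in spans)
    return median(b - a for a, b in zip(starts, starts[1:]))

def fixed_pitch_breaks(left, right, pitch):
    """Break positions at regular intervals between left and right."""
    if pitch <= 0:
        raise ValueError("pitch must be positive")
    breaks = []
    x = left + pitch
    while x < right:
        breaks.append(x)
        x += pitch
    return breaks

pitch = estimate_pitch([(0, 8), (10, 18), (20, 27), (30, 38)])
print(fixed_pitch_breaks(0, 50, pitch))
```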
GLOBAL FEATURES
 A global feature is obtained by mapping the pixels of a character image into a feature vector according to some transformation function.
 The transformation function is designed to bring out some particular shape characteristic that is useful for classification.
 Projections of the character image along the horizontal, vertical, and diagonal axes.
 Fourier transform of the character contour, and moments of character pixels.

GEOMETRIC FEATURES
 Geometric features extract various shape primitives from a character.
 Allow compact, font- and size-independent representation of characters.
 Can be derived from the contour or the skeleton of the character, or from the character image itself.
 Lines and curves from contours, horizontal and vertical strokes from skeletons, and cavities and holes from the original character image.
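Two of the global features named above, projections and moments, can be sketched as a mapping from a character image to a feature vector. The choice of which moments to include and the layout of the vector are assumptions for illustration. The projection lengths depend on the image size, so characters are assumed to be scaled to a common size first.

```python
import numpy as np

def global_features(char):
    """Map a binary character image (nonzero = ink) to a feature vector:
    row projection, column projection, centroid, second central moments."""
    ink = char != 0
    h_proj = ink.sum(axis=1).astype(float)   # ink count per row
    v_proj = ink.sum(axis=0).astype(float)   # ink count per column
    ys, xs = np.nonzero(ink)
    if ys.size == 0:
        moments = np.zeros(5)
    else:
        cx, cy = xs.mean(), ys.mean()        # centroid of the ink pixels
        moments = np.array([
            cx, cy,
            ((xs - cx) ** 2).mean(),         # horizontal spread
            ((ys - cy) ** 2).mean(),         # vertical spread
            ((xs - cx) * (ys - cy)).mean(),  # slant (covariance)
        ])
    return np.concatenate([h_proj, v_proj, moments])
```

For a vertical bar, for example, the horizontal spread is zero and the vertical spread is large, which already separates it from a horizontal bar.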
CONCLUSION
 OCR is useful in converting typewritten text into a machine-editable format which can be further processed as required.
 Digitization of libraries.
 Digitization of old manuscripts.
 Helpful for the visually challenged.