SahanaOCR Research Paper

Extending the Sahana OCR project for automating the reading process of hand written forms.
Thilanka Kaushalya, Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka, 070235L@uom.lk
Abstract
The Sahana OCR project is carried out as a submodule of the main Sahana project. The Sahana project collects information about disaster victims via the Internet as well as from hand-filled forms. When data is collected on handwritten forms, entering it back into the Sahana system manually is difficult and time consuming. Sahana OCR is designed to automate the extraction of data from these forms. This paper discusses how the OCR module works and the extensions I made to it as my level 3 project.
1. Introduction
The Sahana Free and Open Source Disaster Management System was conceived during the 2004 Sri Lanka tsunami. It is a web-based collaboration tool that addresses common coordination problems during a disaster, such as finding missing people, managing aid, managing volunteers and tracking camps effectively, between government groups, civil society and the victims themselves. The system was developed to help manage the disaster and was deployed by the Sri Lankan government's Center of National Operations, which included the Center of Humanitarian Agencies. A second round of funding was provided by the Swedish International Development Agency. The project has since grown to become globally recognized, with deployments in many other disasters such as the Asian Quake in Pakistan (2005), the Southern Leyte Mudslide Disaster in the Philippines (2006) and the Jogjakarta Earthquake in Indonesia (2006). After the tsunami, the system was rebuilt from scratch on Apache, MySQL and PHP/Perl. The system is free for anyone to download and customize.
The objective of the Sahana project was to provide solutions to many of the problems involved in a disaster situation. In my view, however, the most useful and important feature of the project is that it protects victim data and reduces the opportunity for data abuse.
Many sub-projects are carried out under the main Sahana project in order to support it and increase its efficiency. The SahanaOCR project is a sub-project of the Sahana Disaster Management System that helps automate the collection of victim information. In this project, hand-filled forms are scanned and the resulting images are segmented into small images, each containing a single letter. These images are then sent to a neural network trained to identify characters. The recognized information is then sent to a database in the form of an XML file. Over the years this project has been developed by many contributors, including volunteers and GSoC contributors. My work is an extension to the Sahana OCR project that helps enhance the performance of the module.
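The last step of this pipeline, sending the recognized information as an XML file, can be illustrated with a minimal Python sketch. The element names (`result`, `field`), the `form_id` attribute and the sample values are hypothetical; the actual Sahana result schema is not given in this paper, and the project itself is written in Visual C++.

```python
import xml.etree.ElementTree as ET


def build_result_xml(form_id, recognized_fields):
    """Serialize recognized field values into a small XML document.

    The layout <result form_id="..."><field id="...">text</field></result>
    is an assumption made for illustration only.
    """
    root = ET.Element("result", {"form_id": form_id})
    for field_id, text in recognized_fields.items():
        field = ET.SubElement(root, "field", {"id": field_id})
        field.text = text
    return ET.tostring(root, encoding="unicode")


# Made-up sample values standing in for the recognizer's output.
xml_text = build_result_xml("missing_person", {"name": "KAMAL", "age": "34"})
print(xml_text)
```

A string like this could then be uploaded to the server, where it would be parsed and stored in the database.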
2. Literature review
A. Sahana Architecture
The Sahana project aims to provide an integrated set of pluggable, web-based disaster management applications that provide solutions to large-scale humanitarian problems in the relief phase of a disaster. Subsequent phases are planned to extend the scope to the prevention, rehabilitation and reconstruction phases. The Sahana phase 2 project has 7 main modules that address common disaster coordination and collaboration problems. They are:
Missing Person Registry - The Missing Person Registry is an online bulletin board of missing and found people. It captures information about the people missing and found, and also information about the person seeking them.
Organization Registry - The Organization Registry is a collaborative "who is doing what, where" tool which enables tracking of the relief organizations and other stakeholders working in the disaster region. It captures information about the places where each organisation is active and the range of services being provided.
Request/Aid Management System - The Sahana Request/Aid Management System is a central online repository where all relief organizations, relief workers, government agents and camps can effectively match requests for aid and supplies to pledges of support. It tracks aid provision from request to fulfillment.
Shelter Registry - This sub-application of Sahana keeps track of the location and basic data of all the shelters in the region. It also provides a geospatial view to plot the location of the camps in the affected area.
Inventory Management and Catalogue - Keeps track of inventories at a high enough granularity to account for the chaotic transfer of goods and aid.
Situation Awareness - This module gives an overview of the situation and allows people to add information on what is happening on the ground. It features the ability to plot a note and a photo with additional information on a map, so that people can collaboratively capture the current disaster situation.
Volunteer Coordination - The Volunteer Coordination System helps NGOs keep track of all their volunteers, their contact information, project allocation, availability and skills, so that volunteers can be allocated effectively, especially in a disaster.
B. Sahana OCR Architecture
In all these modules, especially in the Missing Person Registry, most of the data is collected using forms delivered to the victims after a disaster. When data is collected on handwritten forms, entering it back into the Sahana system manually is difficult and time consuming. Sahana OCR is designed to automate the extraction of data from these forms. It was built on the .NET Framework using Visual C++. The Sahana OCR module was developed as several components, which were integrated to produce the required output. The major components of the Sahana OCR project are as follows.
XFormParser - This is the part that handles XForm XML files. The component consists of the following classes: 1. XForm 2. XFormParser 3. DataField 4. TextArea 5. XFormDomErrorHandler. This module loads an XML file into an XForm data structure in memory. The XForm structure is later used by the FormProcessor to segment the scanned image. The module was implemented using the Xerces open source XML library.
ImageProcessor - All image processing methods are enclosed in this component. Its main objective is to segment the scanned image into letters. Currently all the image processing needed for segmentation and page alignment is implemented using the OpenCV open source vision library.
OCR - The most critical component in the system. This was done using the FANN framework, which is used for efficient neural network implementation.
ScannerMngr - This is a wrapper module for the TWAIN API. This library provides functions to query the available scanners, connect to the required scanner, and configure and view scanner properties. The application can then scan continuously if the scanner supports auto feed, or otherwise one sheet at a time.
NtMngr - This component deals with the network. Its main functions are downloading the correct XForm XML from the server and uploading the result to the server. For a given path, it loads the XML that contains the details about the server and the available XForms, so that the user can select the proper XForm from the list. The required XML is read from disk, and this module also has to handle proxy servers, if there are any.
FormProcessor - This is the class that coordinates all the above components. The FormProcessor segments the image according to the XForm using the ImageProcessor library, then passes the segmented images to the OCR and creates the result XML from the OCR output. Finally, the result XML is uploaded to the server through the NtMngr module.
The following diagram shows the flow of operations in the Sahana OCR module.
When the application starts, the user has to select the correct scanner from the list of available scanners, or a folder on disk that contains scanned images. If needed, the user can view or change the configuration of the scanner. Then the correct XForm XML has to be selected. When the user enters the correct server path, the application lists the available XForms. The user can then select the proper one, or select an XForm from disk. After selecting the scanner and the XForm, the user can start processing forms one by one, or continuously if the scanner supports automatic paper feeding. The user can choose either to upload results directly or to buffer the processed forms until they have been evaluated manually and any mistakes made by the OCR have been corrected.
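As a rough illustration of what the XFormParser does, the sketch below loads a form definition into an in-memory structure. It is written in Python for brevity rather than in Visual C++ with Xerces, and the XML layout, the attribute names and the members of `DataField` shown here are simplified assumptions, not the real XForm schema used by Sahana.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class DataField:
    """One input field on the form: its position, size and letter count."""
    field_id: str
    x: int
    y: int
    width: int
    height: int
    letters: int


# Hypothetical, simplified form definition (not the real Sahana XForm format).
FORM_XML = """
<form id="missing_person">
  <field id="name" x="40" y="120" width="400" height="30" letters="20"/>
  <field id="age" x="40" y="170" width="40" height="30" letters="2"/>
</form>
"""


def parse_form(xml_text):
    """Return the form id and its fields in document order."""
    root = ET.fromstring(xml_text.strip())
    fields = [
        DataField(
            field_id=node.get("id"),
            x=int(node.get("x")),
            y=int(node.get("y")),
            width=int(node.get("width")),
            height=int(node.get("height")),
            letters=int(node.get("letters")),
        )
        for node in root.findall("field")
    ]
    return root.get("id"), fields


form_id, fields = parse_form(FORM_XML)
for f in fields:
    print(form_id, f.field_id, f.width, f.letters)
```

With a structure like this in memory, a form processor knows where each field lies on the scanned page and how many letter boxes to expect in it.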
C. Project
In my project I mainly extended the FormProcessor, the neural network and the GUI of the OCR module. As mentioned above, the neural network was developed using the FANN library. Because it lacked training, however, its letter recognition was very weak. So instead of using the same library I used the Tesseract OCR engine, which is freely available software. It is probably one of the most accurate open source OCR engines available, and the core developer on the project is Ray Smith. The Tesseract project was later funded by Google. I then designed a GUI to take the necessary inputs and show the outputs to the user. Finally, I changed the FormProcessor so that it supports the changes mentioned above.
The system mainly consists of three functional classes working on three layers. The GUI class is on the topmost layer and handles all interaction with the user. The FormProcessor is in the layer that distributes tasks to the functional classes, and the Tesseract and ImageProcessor classes are in the layer that carries out the operations handed down by the upper layer.
Constraints that I faced
Since the existing system follows this architecture, my extension also had to follow it. The Tesseract and ImageProcessor extensions had to be able to communicate with the existing FormProcessor. I use the Tesseract library to develop the OCR system, so the performance of Tesseract directly influences the performance of the main system.
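The requirement that a replacement recognizer must plug into the existing FormProcessor can be pictured as a small interface that both the old and the new engine satisfy. The following Python sketch uses hypothetical class and method names (`Recognizer`, `recognize`, `process`) and stub engines that return fixed answers; it does not call FANN or Tesseract and is not the project's Visual C++ code.

```python
from typing import List, Protocol


class Recognizer(Protocol):
    """Anything that can turn one segmented letter image into a character."""

    def recognize(self, letter_image: bytes) -> str: ...


class StubNeuralNetwork:
    """Stand-in for the original engine; always answers '?'."""

    def recognize(self, letter_image: bytes) -> str:
        return "?"


class StubTesseract:
    """Stand-in for the replacement engine; decodes the bytes as ASCII."""

    def recognize(self, letter_image: bytes) -> str:
        return letter_image.decode("ascii")


class FormProcessor:
    """Middle layer: hands letter images to whichever recognizer it was given."""

    def __init__(self, recognizer: Recognizer) -> None:
        self.recognizer = recognizer

    def process(self, letter_images: List[bytes]) -> str:
        return "".join(self.recognizer.recognize(img) for img in letter_images)


# Fake "images": each is just the byte of the letter it represents.
letters = [b"A", b"B"]
old_result = FormProcessor(StubNeuralNetwork()).process(letters)
new_result = FormProcessor(StubTesseract()).process(letters)
print(old_result, new_result)
```

Because the middle layer depends only on the `recognize` call, the recognition engine can be exchanged without touching the layers above it.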
3. Design and implementation

Architectural Goals and Constraints
The architecture of the Sahana OCR system can be shown simply as the following model.
Logical View
Initially the XForm is loaded into memory. Then the image from the scanner or disk is loaded into memory. The FormProcessor preprocesses and aligns the image and segments it using the ImageProcessor library. The segmented letters are passed to Tesseract, and the FormProcessor creates the output from the result. The subsystem that my project involves is shown below.
FormProcessor - Sets the XForm and loads the image before processing it.
Tesseract - Provides a simple interface to perform optical character recognition.
ImageProcessor - All image processing methods are enclosed in this component. Its main objective is to align the image and segment the scanned image into letters.
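One common way to split a field image into letters is a column projection: columns that contain no ink separate one letter from the next. The Python sketch below applies this to a tiny binary image given as nested lists (1 = ink, 0 = background). It is a generic technique shown under that assumption and is not claimed to be the method the ImageProcessor implements with OpenCV; the function names `segment_columns` and `crop` are hypothetical.

```python
def segment_columns(image):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not image:
        return []
    width = len(image[0])
    # A column has ink if any row has a 1 in it.
    has_ink = [any(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a letter begins
        elif not ink and start is not None:
            segments.append((start, x))  # the letter ends at this blank column
            start = None
    if start is not None:                # letter touching the right edge
        segments.append((start, width))
    return segments


def crop(image, start, end):
    """Cut the columns [start, end) out of every row."""
    return [row[start:end] for row in image]


# A made-up 3x8 field containing three separate strokes.
field = [
    [0, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 1],
]
ranges = segment_columns(field)
letter_images = [crop(field, s, e) for s, e in ranges]
print(ranges)
```

Each cropped image would then be handed to the recognizer. Handwriting in which letters touch or slant across column boundaries needs a more robust approach than this.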
[Figure: subsystem diagram with the GUI, Form Processor, Image Processor and Tesseract components]
To meet these requirements and keep the design organized, the following development process was carried out during the project period.
First week
Tested Tesseract on some handwritten letters, and it gave correct results most of the time. Used the OpenCV library (IplImage*) to convert the JPG images to uncompressed TIFF format, the only format Tesseract supported, and fed them to Tesseract.
Second week
Initially the existing Sahana source code gave 123 errors during the build. I was finally able to build the existing Sahana OCR source code by including the missing libraries, Xerces and OpenCV. Tested it on the given data sheets. The segmentation of the existing system is quite good, but the recognition module is very poor. So I tried recognizing some of the segmented images separately using Tesseract. The results were 100% correct.
Third week
Until now I had used the Tesseract executable downloaded from the official site. When it runs, it writes a text file in each iteration and I have to read the data back from that text file, which consumes a lot of time. This week I was able to build the Tesseract source code, so I will now be able to change that code to make it compatible with my purpose. I made a small testing tool similar to the Sahana OCR module using Tesseract, and it works better than the existing code.
Fourth week
Developed the GUI separately and tried to connect it with the existing code.
Fifth week
Combined the whole project and built the code. Runtime errors occurred when trying to read the images in memory. Fixing those errors.
4. Results
The project was completed successfully after considerable effort, and the final product produced the following results for a given form image.
[Figures: Add paths window; Preview window; The progress window; The result window]
5. Conclusion
The project was successful in terms of improving the accuracy of the OCR module. The initial accuracy was below 30%, and it has now improved to about 80%. It can be improved further by training Tesseract on more handwritten data sets. In addition, since Tesseract has been trained for more than 6 languages, this module can be used easily when the system is localized.
REFERENCES
[1] Sahana official website, http://www.sahana.lk/
[2] Sahana wiki, http://en.wikipedia.org/wiki/Sahana_FOSS_Disaster_Management_System
[3] Chamindra de Silva (Sahana Project Lead), Sahana Free and Open Source Disaster Management System Project Overview, Draft 0.9, December 20, 2006
[4] Architectural document of the project
[5] Weekly reports of the project
