
Ref: 2022/Summer Internship II2 / Defended at the October 2022 session

University of Manouba
National School of Computer Science

Internship report of company immersion

Complicated Table Extraction Into CSV File Using


Paddle OCR Text Detection And Recognition

Author:
Melek SAHLIA

Organization: Axefinance
Supervised by: Abdelmajid AYADI
Address: 3ème étage, Immeuble La Couverture, Rue du Lac Huron, Tunis 1053
Tel: +216 28 98 90 43 / Fax: +216 71 96 32 29
Email: info@axefinance.com

University year: 2022-2023
Supervisor Signature
Abstract — This report was prepared as part of the second-year summer internship that took place under the supervision of the Axe Finance company.
The project involves developing a system for extracting data from balance sheets into a CSV file, in order to replace the manual data-entry process, which has proven time-consuming and error-prone.
The system must reliably detect the table presented in the balance sheet, detect and recognize its texts, reconstruct the table through a clear, mathematically grounded strategy, and finally convert the result to the desired format.
In order to achieve this goal, this report develops the basic concepts involved and the solutions built upon them.
Key Words: Python, OCR, CRNN, PaddleOCR, Computer Vision.

Acknowledgments

At the end of this internship, I would like to express my deepest and most sincere thanks to all those who kindly offered the assistance needed to make this period a truly enriching experience. May they find here, collectively and individually, the expression of all my gratitude.

More specifically, I want to express my deep gratitude to my supervisor at Axefinance, Mr. AYADI Abdelmajid, as well as to Mr. AMMOURI Bilel and Mr. AYAT Wissem, for the patience and availability they generously showed me and for the time they devoted to answering my questions. I would also like to express my deep gratitude to my family for their moral and financial support.

I cannot conclude this section without expressing my gratitude to the members of the jury, who took the trouble to evaluate this work with attention and patience.

My thanks also go to the staff of the National School of Computer Science (ENSI) and Axefinance, who gave me the opportunity to enrich my theoretical and practical knowledge.
I hope that this project will be an expression of my deep esteem and sincere gratitude.

Contents

General Introduction

1 Preliminary Study
  1.1 Academic Context
  1.2 Presentation of the host company
  1.3 Context of the Project
    1.3.1 Axe Credit Portal
    1.3.2 Analysis of the existing process
    1.3.3 Problem statement
    1.3.4 Objectives
    1.3.5 Expected Results
  1.4 Table Structure Recognition Challenges
    1.4.1 Study of existing methods
      1.4.1.1 Global-object-based methods
      1.4.1.2 Local-object-based methods
    1.4.2 Proposed Framework
      1.4.2.1 Overview of PaddleOCR Features
      1.4.2.2 PP-OCR Flow Chart
  1.5 Key Concepts
    1.5.1 Computer Vision
    1.5.2 Object Detection
    1.5.3 Instance Segmentation
    1.5.4 OCR
  1.6 Methodology

2 Specification and analysis of needs
  2.1 Requirements Analysis
    2.1.1 Identifying actors
    2.1.2 Software and hardware requirements
    2.1.3 Functional requirements
    2.1.4 Software attributes
      2.1.4.1 Quality requirements
      2.1.4.2 Performance requirements
  2.2 Requirements Specifications
    2.2.1 Diagram of General use cases (ACP)
    2.2.2 Statechart diagram: Table Structure Recognition System

3 Design
  3.1 Global Architecture
  3.2 Detailed Architecture
    3.2.1 Table Detection
      3.2.1.1 Deep Learning Modules for Layout Detection (LayoutParser table detection)
      3.2.1.2 Chosen Model for table detection
    3.2.2 PP-OCR Text Detection and Recognition
      3.2.2.1 Text Detection
      3.2.2.2 Detection Frame Calibration
      3.2.2.3 Text Recognition

4 Achievement
  4.1 Work Environment
  4.2 Implementation and results
    4.2.1 Layout
      4.2.1.1 Installation
      4.2.1.2 Table Detection and Extraction
    4.2.2 Text Detection and Recognition
  4.3 Reconstruction
    4.3.1 Get Horizontal and Vertical Lines
    4.3.2 Non-Max Suppression
  4.4 CSV Conversion
  4.5 Added Value
  4.6 Work Timeline

Conclusion and Perspectives

Bibliography



List of Figures

1.1 Axefinance's logo
1.2 Analysis of the existing process
1.3 Complicated Table Structure Recognition Example
1.4 Overview of PaddleOCR Features
1.5 Table Structure Recognition Using PP-OCR Flow Chart
1.6 The R-CNN Family

2.1 The ACP Product Use Case Diagram
2.2 Table Structure Recognition System Statechart

3.1 Global Architecture
3.2 Deep Learning Modules for Layout Detection
3.3 PP-OCR's architectural style
3.4 The architecture of the text detector DB
3.5 CRNN's architectural style
3.6 Feature sequence extraction
3.7 The structure of a basic LSTM unit

4.1 Work Environment
4.2 Test Image
4.3 Table Detection and Extraction
4.4 Text Detection and Recognition
4.5 Get Horizontal and Vertical Lines
4.6 Horiz_Vert.jpg
4.7 Horizontal Boxes Interference
4.8 Chosen Horizontal Box (highest probability)
4.9 Nms_im.jpg
4.10 Result.CSV
4.11 Added Value
4.12 Work Timeline



General Introduction

While financial institutions have faced difficulties over the years for a multitude of reasons,
the major cause of serious banking problems continues to be directly related to lax credit stan-
dards for borrowers and counterparties, poor portfolio risk management, or a lack of attention
to changes in economic or other circumstances that can lead to a deterioration in the credit
standing of a bank’s counterparties.

To assist them, Axe Credit Portal (ACP) is a highly flexible end-to-end credit process automation solution for all aspects of the lending lifecycle, including application processing, credit assessment, automatic document generation, limits and collateral management, covenant management, portfolio management, loss recovery, and provisioning.

Analyzing the problem, we can notice that financial companies usually deal with overwhelm-
ing amounts of documents that need to be digitized and entered into a system.

Previously, document processing and entry were done manually, and one can only imagine how time-consuming and mundane that was, not to mention that manual document processing is usually very error-prone, as the possibility of human error is high.

Computer vision technology can successfully replace human employees when it comes to doc-
ument processing. The technology can analyze the documents, digitize them, and enter them
into the system, thus enabling employees to perform the review only.

As part of the ACP product solution, and knowing that financial documents usually contain complicated table structures, my main work as an intern at Axefinance focused on how to extract data from these tables into a CSV file; I chose to adopt PaddleOCR text detection and recognition for this project.



Chapter 1

Preliminary Study

Introduction

This chapter is intended to give a general overview of our project. It discusses the different contexts of this work and makes clear how relevant it is.
To begin with, we expose the academic context. Then we switch to the professional side by presenting basic concepts.
This chapter also includes the problem statement, the ACP, and the goals that we aim to achieve, and it ends by explaining the methodology adopted during this work.

1.1 Academic Context


This project was carried out as the second-year summer internship, which is one of the assessments required at the National School of Computer Science (ENSI).
The internship took place between 04/07/2022 and 31/08/2022 as part of Axefinance's Product Machine Learning team.

1.2 Presentation of the host company


Founded in 2004, Axe Finance is a global software provider focused on lending automation
for financial institutions looking for an edge in productivity and customer service for any and
all client segments:
— Retail Lending.
— Commercial Lending.
— Corporate Lending.
— Limit, collateral management, collection and provisioning.
Joining Axe Finance means joining an expanding FinTech company where the projects are varied and the vision is global. Axe Finance is a fast-growing provider of lending automation solutions to a wide range of financial institutions and a trusted partner of respected global financial institutions.

1.3 Context of the Project


In this section we present ACP, the issues it faces, our solution's expected results, and the methodology we adopted.


Figure 1.1 – Axefinance's logo

1.3.1 Axe Credit Portal


Faced with increasingly sophisticated customers and more and more pressure from regula-
tors and competitors, but often weighed down by manual processes or too rigid legacy systems,
financial institutions can increase their efficiency, develop their market position and raise their
profitability by further automating the lending process and risk measurement techniques.
To assist them, Axe Credit Portal (ACP) is a highly flexible end-to-end credit process automation solution for all aspects of the lending lifecycle, including application processing, credit assessment, automatic document generation, limits and collateral management, covenant management, portfolio management, loss recovery, and provisioning.

1.3.2 Analysis of the existing process


Following a credit application, no one can deny that processing financial documents can take a long time, in addition to the high rate of error when reading and extracting the data contained in the financial balance sheet table.

Figure 1.2 – Analysis of the existing process


1.3.3 Problem statement


Before granting a loan to a company, the credit bank manager must ensure that the company is able to repay it. The financial balance sheet is the best way to verify this, as it makes it possible to analyze the company's financial situation from a solvency standpoint.
There are several issues with the lengthy application process (especially while entering the data into the platform). We believe that the manual process is:
— Highly time-consuming.
— Very error-prone, as the possibility of human error is high.

1.3.4 Objectives
In order to digitize the manual process and develop a solution for extracting the data contained in the table of the financial balance sheet, we follow this procedure:
— Financial statements database collection.
— Table detection model selection.
— Table recognition model selection.
— Table reconstruction.
— CSV file conversion.

1.3.5 Expected Results


A solution to perform this task already existed, based on CascadeTabNet, but it generated several errors, especially in recognizing characters and the signs of numbers. Recognizing a number as positive when it is actually negative can lead to catastrophic calculations and decisions, especially in the study of credit risk.

The system to be designed is therefore expected to add value by providing exact and reliable recognition.

1.4 Table Structure Recognition Challenges


Due to their varied structures and complex cell-spanning relationships, table structure recognition has become a major task. Previous methods handled the problem starting from elements at different granularities (rows/columns, text regions) and tended to fall into issues like lossy heuristic rules or neglect of empty-cell division.

Figure 1.3 illustrates an intuitive example of a complicated table with spanning cells. The presented table is shown in (a), and (b) is the desired structure of the dashed box area. The structures recognized by existing methods are shown in (c)-(e). Note that the four cells on the right side in (c) are incorrectly recognized as a single cell.

1.4.1 Study of existing methods


The existing table structure recognition methods can be classified into two categories:


Figure 1.3 – Complicated Table Structure Recognition Example

1.4.1.1 Global-object-based methods


Overview:
These methods focus on the characteristics of global table components. For the most part, detection starts from row/column or grid boundaries. It is worth enumerating the early works related to this approach, which were done in different ways:
— First method: The grid cells are the result of the intersection between the row and column regions obtained from detection or segmentation models.
— Second method: In an end-to-end manner, the table detection and recognition tasks are handled by mask learning applied to the table region and to the rows/columns of the table.
— Third method: Separate cells can be merged together after predicting an indicator, which is obtained after detecting the rows and columns by learning the segmentation of the interval areas between them.
— Other methods: A few methods directly perceive the entire image and output table structures as text sequences within an encoder-decoder framework. Although these methods appear elegant and avoid human involvement entirely, the models are generally difficult to train and rely on a large amount of training data.
Limits: these methods are unable to handle various complicated table structures, such as:
— Cells spanning multiple rows/columns.
— Cells containing multi-line text.

1.4.1.2 Local-object-based methods


Overview:
These methods start from the smallest basic component: the cell.
Given text box annotations at the cell level, the text detection task is relatively easy to complete with general detection methods like YOLO, Faster R-CNN, etc.


Subsequently, a group of methods may be applied, and it is worth listing the various methods related to this approach:
— First method: Attempts to recover cell relationships on the basis of heuristic rules and algorithms.
— Second method: Treats detected boxes as nodes in a graph and tries to predict relationships using Graph Neural Network (GNN) techniques.
— Third method: Predicts relationships among nodes in three classes (horizontal connection, vertical connection, no connection) using multiple features such as visual features, text positions, word embeddings, etc.
— Other methods: Adopt a graph attention mechanism to improve prediction precision.
Limits:
— Since empty cells are not detected, local-object-based methods usually fall into the ambiguity of empty cells.
— The problem of a large number of graph nodes.

1.4.2 Proposed Framework


To satisfy the required task, the proposed solution will be built upon the PaddleOCR
(PP-OCR) text detection and recognition framework.

1.4.2.1 Overview of PaddleOCR Features


PaddleOCR supports a variety of cutting-edge OCR-related algorithms, develops the industrial-grade models/solutions PP-OCR and PP-Structure on this basis, and covers the whole process of data production, model training, compression, inference, and deployment.

Figure 1.4 – Overview of PaddleOCR Features
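As a quick illustration, the PP-Structure solution mentioned above can be driven in a few lines. This is a hedged sketch, not the pipeline built in this project; the input file name is a placeholder and the exact constructor parameters vary across paddleocr releases:

```python
import cv2
from paddleocr import PPStructure

# Sketch of PaddleOCR's PP-Structure document-analysis entry point;
# "balance_sheet.jpg" is a hypothetical input file.
engine = PPStructure(show_log=False)
image = cv2.imread("balance_sheet.jpg")
for region in engine(image):
    # Each detected region carries a layout type and its bounding box.
    print(region["type"], region["bbox"])
```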

1.4.2.2 PP-OCR Flow Chart


The table recognition flow chart is illustrated by the following image:


Figure 1.5 – Table Structure Recognition Using PP-OCR Flow Chart

1.5 Key Concepts


1.5.1 Computer Vision:
Computer vision is a field of artificial intelligence (AI) that enables computers and systems
to derive meaningful information from visual inputs and take action or make recommendations
based on that information. If AI enables computers to think, computer vision enables them to
see, observe and understand.

1.5.2 Object Detection:


Object detection is the process of finding and classifying objects in an image.
One deep learning approach, region-based convolutional neural networks (R-CNN), combines rectangular region proposals with convolutional neural network features.
R-CNN is a two-stage detection algorithm:
— First stage: Identifies a subset of regions in an image that might contain an object.
— Second stage: Classifies the object in each region.
The Computer Vision Toolbox offers object detectors for the R-CNN, Fast R-CNN, and Faster R-CNN algorithms.

1.5.3 Instance Segmentation:


Instance segmentation expands on object detection to provide pixel-level segmentation of individual detected objects.
The Computer Vision Toolbox provides layers that support a deep learning approach to instance segmentation called Mask R-CNN. Figure 1.6 represents the R-CNN family.

1.5.4 OCR:
OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image and is generally used to identify text in scanned documents and images.
OCR software can be used to convert a physical paper document or a picture into an electronic version with accessible text.


Figure 1.6 – The R-CNN Family

For instance, if you scan a paper document or photograph using a scanner, it will typically create a file containing a digital image.

1.6 Methodology
This project started with gaining an understanding of Axefinance's ACP product and its objectives; the idea of extracting data from the financial statements was defined and stated from the beginning.
Following the methodology adopted at Axefinance, I was expected to attend the daily meetings and to prepare a weekly presentation summarizing progress.
A lot of time was devoted to research and then to development, in order to better understand the framework and the approach with which the project should be implemented.

Conclusion
In this preliminary study chapter, we started with the general context, continued with the exposition of the problem and the proposed solution, and then developed the key concepts useful for understanding the model.
In the next chapter, we will focus on the specification and analysis of needs.



Chapter 2

Specification and analysis of needs

Introduction

After defining the general context of the project, we start the second chapter, which covers the analysis and specification of needs.
In order to orient our solution, we begin by targeting the principal actor and their different needs, which will be detailed in what follows. We then present the hardware and software requirements, the functional requirements, and the software attributes.
Finally, we present a use case diagram for the ACP product, then a statechart diagram for the table structure recognition task to elaborate on the different finite states of our system.

2.1 Requirements Analysis


2.1.1 Identifying actors
The only actor considered in our system is the Credit Bank Manager (CBM) who is the
principal actor.

2.1.2 Software and hardware requirements


There is no additional equipment involved apart from a device connected to the Internet.
Our table structure recognition system will be embedded in the Axe Credit Portal interface.
Therefore, the user does not have to install any third-party plugins, extensions, or software
other than the ACP platform and the system environment.

2.1.3 Functional requirements


The services offered by our system in relation to our single actor are:
— Detect table structure from an input Image.
— Detect and recognize texts in the detected table.
— Reconstruct table structure.
— Convert the reconstructed table into a CSV file.

2.1.4 Software attributes


The following quality and performance specifications shall be satisfied:


2.1.4.1 Quality requirements


— Privacy: Private information will be protected in the system.
— Compatibility: The system will be ACP-compliant and will not affect its operation.
— Maintainability: Thanks to a responsive code-refactoring approach, the source code will be fully documented and clearly implemented.

2.1.4.2 Performance requirements


— Reliability: The program will fully satisfy all functional criteria without unexpected behavior.
— Scalability: The program can process large volumes of applications with ease.
— Efficiency: Particular precautions must be taken to ensure algorithmic efficiency in terms of response time and resource usage.

2.2 Requirements Specifications


2.2.1 Diagram of General use cases (ACP)
In order to give a global view of the ACP product (Axe Credit Portal), the general specifications are presented using a UML use case diagram.
The diagram sets out the actions to be carried out by the actor of the system. During this work, we focus on table structure recognition.

Figure 2.1 – The ACP Product Use Case Diagram


2.2.2 Statechart diagram: Table Structure Recognition System


In order to better describe the transition states of our table structure recognition system,
the following diagram presents a behavioral model of the system during the data extraction
mission.

Figure 2.2 – Table Structure Recognition System Statechart

Conclusion
In this part of the report, we started by specifying the main actor and developing the functional and non-functional needs. Then we went through diagrams describing the states related to the development and organization of our system. In the next chapter, we will explain the design of our system.



Chapter 3

Design

Introduction

In this chapter, we aim to explain the logic behind the realization of our system.
For this, we first present the global architecture to be followed. Then we give an overview of the models used for table detection as well as for text detection and recognition.

3.1 Global Architecture


The primary purpose of the design is to enable the creation of a system or process that meets a need, taking constraints into account. The system must be sufficiently defined to be installed, manufactured, built, functional, and able to meet customer needs.
After thorough research, the overall architecture illustrated by the following figure was adopted for the implementation of the system.

Figure 3.1 – Global Architecture


3.2 Detailed Architecture


3.2.1 Table Detection
The detection of tables is undoubtedly one of the most important features of any balance analysis application, particularly when analyzing the data contained in the table cells. For the table detection task, we chose the LayoutParser toolkit. LayoutParser is intended to provide a wide variety of tools to streamline document image analysis (DIA) tasks.
LayoutParser is delivered with a set of layout data structures with carefully crafted APIs that are optimized for document image review tasks. For example:
— Running OCR on each detected layout region.
— Flexible APIs for displaying detected layouts.
— Loading layout information stored in JSON, CSV, and even PDF.
Behind the scenes is a series of carefully engineered features, which can be grouped into four key components:
— Model customization.
— Deep learning models for layout detection.
— The LayoutParser Open Platform.
— Infrastructure APIs.
Let us focus on probably the most important one, the deep learning models for layout detection:

3.2.1.1 Deep Learning Modules for Layout Detection (LayoutParser table detection)
— Challenges: There are no standard ways of sharing and reusing existing models, so LayoutParser's authors built a series of standardized model APIs.
— Standardized Modeling API: The first step in creating a model is to specify its configuration file, which is highly semantic: it is composed of the training dataset name and the model architecture. The standardized model API can be used to initialize different models, and we can specify the deep learning backend. Finally, we just need to run the same detection API to identify layouts in any image, as sketched after the figure below.

Figure 3.2 – Deep Learning Modules for Layout Detection
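A minimal sketch of this standardized API follows; the config string and label maps come from LayoutParser's published model zoo, while the input file name is an assumption:

```python
import cv2
import layoutparser as lp

# The config string encodes the training dataset (PubLayNet) and the
# model architecture (faster_rcnn_R_50_FPN_3x).
model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")
image = cv2.imread("document.jpg")[..., ::-1]  # OpenCV loads BGR; convert to RGB
layout = model.detect(image)  # the same detect() call works for any model
```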


Model Zoo: LayoutParser provides models pre-trained on six different datasets, with the following label maps:
— HJDataset: {1: "Page Frame", 2: "Row", 3: "Title Region", 4: "Text Region", 5: "Title", 6: "Subtitle", 7: "Other"}
— PubLayNet: {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
— PrimaLayout: {1: "TextRegion", 2: "ImageRegion", 3: "TableRegion", 4: "MathsRegion", 5: "SeparatorRegion", 6: "OtherRegion"}
— NewspaperNavigator: {0: "Photograph", 1: "Illustration", 2: "Map", 3: "Comics/Cartoon", 4: "Editorial Cartoon", 5: "Headline", 6: "Advertisement"}
— TableBank: {0: "Table"}
— MFD: {1: "Equation"}

3.2.1.2 Chosen Model for table detection


For table detection, I used a PubLayNet model, mask_rcnn_X_101_32x8d_FPN_3x, which is trained on the whole training set and has an evaluation result (mAP) of 88.98.

3.2.2 PP-OCR Text Detection and Recognition


The PP-OCR architecture is based on improved lightweight neural networks. The proposed framework consists of three steps:
— Text border detection.
— Detection frame calibration.
— Text recognition.
All three modules use lightweight backbones to speed things up, which allows the trained models to be used on embedded devices. The following figure illustrates PP-OCR's architectural style.

Figure 3.3 – PP-OCR’s architectural style


3.2.2.1 Text Detection


Text detection focuses on locating the text boxes in the image; in general, the detection outcome is expected to be divided into fields. PP-OCR builds upon the segmentation-based DB (Differentiable Binarization) text detection algorithm: segmentation-based methods can obtain accurate text bounding boxes, and DB's post-processing is relatively simple and convenient for practical applications.

Figure 3.4 – The architecture of the text detector DB
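For reference, the differentiable binarization at the core of DB (as given in the original DB paper, not in this report) approximates the hard binarization step as follows, where \(P\) is the probability map, \(T\) the learned threshold map, and \(k\) an amplifying factor (typically 50):

\[
\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}
\]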

3.2.2.2 Detection Frame Calibration:


In order to improve text recognition on the detected bounding box and maintain the consistency of the text, it is generally desired that the text frame to be recognized lies in a positive horizontal direction. Given that the DB text detection result is represented by a four-point polygon, it is easy to transform the detection result into a horizontal direction with an affine transformation. If the transformed image is vertically oriented, it becomes horizontal after a 90-degree rotation. But after switching to a horizontal orientation, the text can still be upside down; thus a text direction classifier is necessary to determine whether the text is inverted. If the text is upside down, it can be recognized once corrected. Training a text direction classifier is a basic image classification task.

3.2.2.3 Text Recognition


The purpose of text recognition is to convert text row images into text. PP-OCR uses CRNN, which is widely used for text recognition.
CRNN Architecture:
CRNN is a combination of convolutional and recurrent networks; this is where the name Convolutional Recurrent Neural Network (CRNN) comes from. The network is made up of three parts: CNN layers, followed by RNN layers, and then the transcription layer. CRNN uses the CTC (Connectionist Temporal Classification) loss, which is responsible for aligning the predicted sequences. The following figure illustrates CRNN's architectural style:


Figure 3.5 – CRNN’s architectural style

— Feature Extraction:
The convolutional neural network (CNN), composed of convolutional and max-pooling layers, is the first component. These layers are in charge of sifting through the input images to extract features; the outputs are feature maps. The feature maps are converted into a sequence of feature vectors, which are then fed to the following layer. Each feature vector in the sequence is generated from left to right on the feature maps: the i-th feature vector is produced by concatenating the i-th columns of all the maps.
Each column of the feature maps corresponds to a rectangular area of the input image, referred to as the receptive field. The receptive field is associated with each feature vector in the feature sequence, and these vectors can be thought of as appearance descriptors for that area. The following layer of RNNs then receives the feature sequence.
— Sequence Labeling:
This layer, built on top of the convolutional network, is the Recurrent Neural Network (RNN). In order to tackle the vanishing gradient problem and build a deeper network, two bi-directional LSTMs are used in the CRNN architecture. For each feature vector or frame in the sequence received from the CNN layers, the recurrent layers predict a label: a label y_t for each frame x_t in the feature sequence x = x_1, ..., x_T.


— Transcription:
This layer is in charge of creating the final sequence from the per-frame predictions with the highest probability. The model learns to decode the output by computing the CTC (Connectionist Temporal Classification) loss over these predictions.
CTC Loss:
The RNN layer's output is a tensor that contains the probability of each label for each receptive field. But how does this influence the final output? This is where the Connectionist Temporal Classification (CTC) loss enters the picture: both inference (decoding the output tensor) and training of the network are carried out through the CTC loss. CTC rests on the following central principles:

Figure 3.6 – Feature sequence extraction

— Text encoding:
CTC solves the problem that arises when a character spans more than one time step. It does so by merging all the repeating characters into one, and inserting a blank character "-" between repetitions that must be kept separate. For example, in the previous figure, the 'S' in 'STATE' spans three time steps, so the network might predict those time steps as 'SSS'; CTC merges those outputs and predicts 'S'. For the whole word, a possible encoding could be SSS-TT-A-TT-EEE, hence the output 'STATE'.
— Loss Calculation:
For a model to learn, a loss needs to be calculated and back-propagated through the network. Here, the probability of the output sequence is obtained by adding up the scores of all possible alignments at each time step; the loss is then the negative logarithm of this probability, which is used for back-propagation through the network.
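In equation form (a standard statement of the CTC objective, supplied here for clarity): writing \(\mathcal{A}(y)\) for the set of per-frame alignments that collapse to the label sequence \(y\),

\[
p(y \mid x) = \sum_{a \in \mathcal{A}(y)} \prod_{t=1}^{T} p_t(a_t \mid x),
\qquad
\mathcal{L}_{\text{CTC}} = -\log p(y \mid x).
\]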

Figure 3.7 – The structure of a basic LSTM unit


— Decoding:
At inference time, we need a clear and accurate output. For this, CTC computes the best possible sequence from the output tensor by taking the character with the highest probability at each time step, then decodes it by removing blanks "-" and repeated characters. Through these steps, we obtain the final output.
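The merge-then-drop-blanks rule can be sketched in a few lines of Python (greedy best-path decoding, a simplification of what full CTC decoders do):

```python
def ctc_greedy_decode(per_step_labels, blank="-"):
    """Collapse repeated characters, then drop blanks:
    'SSS-TT-A-TT-EEE' -> 'STATE'."""
    out, prev = [], None
    for ch in per_step_labels:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

assert ctc_greedy_decode("SSS-TT-A-TT-EEE") == "STATE"
```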

Conclusion
In this chapter, we have presented the architectural and detailed design, which is a funda-
mental phase in which we specify the skeleton of the system that will be realized in the next
chapter.



Chapter 4

Achievement

Introduction

In this chapter, we describe the implementation phase, the culmination of the previous stages of preliminary study and design, while presenting the working environment, the implementation process, and the tools used to successfully complete a functional solution.

4.1 Work Environment

Figure 4.1 – Work Environment

4.2 Implementation and results


In this section, we review the software and hardware environments, as well as the technologies, frameworks, and libraries we used. Figure 4.1 gives an overview of the work environment.
— Financial statements database collection:
In order to test the performance of the models used, an internal database of financial statements was collected from the Axefinance company; it contains about 256 images.
— Frameworks and libraries installation:
In order to prepare a good general working environment for the execution of the code, we started by installing the frameworks and libraries mentioned before.


4.2.1 Layout
4.2.1.1 Installation
We imported the PaddleOCR library from the PaddlePaddle project available on GitHub, which contains all the libraries we need. We also imported the pre-trained LayoutParser model that will be used for table detection, and loaded our test image into the environment:

Figure 4.2 – Test Image

4.2.1.2 Table Detection and Extraction


In order to detect the presence of the table in the input image, we use a pre-trained LayoutParser (LP) model, more specifically "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config". The detected table is then extracted from the image through a bounding box whose limits are described by the coordinates (x1, y1) and (x2, y2), as explained in the figure below. We name the extracted image "ext_im.jpg".


Figure 4.3 – Table Detection and Extraction
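The detection-and-cropping step just described can be sketched as follows (file names follow the report; the score threshold is an assumption):

```python
import cv2
import layoutparser as lp

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
image = cv2.imread("test_image.jpg")
layout = model.detect(image[..., ::-1])  # LayoutParser expects RGB
for block in layout:
    if block.type == "Table":
        x1, y1, x2, y2 = map(int, block.coordinates)
        cv2.imwrite("ext_im.jpg", image[y1:y2, x1:x2])
        break  # keep the first detected table
```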

4.2.2 Text Detection and Recognition


— By applying the OCR text detection and recognition model to "ext_im.jpg", we perform text detection, obtaining for each text the four sides of its bounding box, the recognized text, and the probability that it matches.
— In order to organize the result, we store, in named variables, the coordinates ((x1, y1) and (x2, y2)) specifying the bounding box of each detected text, the text of each bounding box, and the associated probability (see the sketch after the figure).
— We can then generate an image called "detections.jpg" to verify that the detection and recognition of texts are completely successful.

Figure 4.4 – Text Detection and Recognition
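A sketch of this step follows; the result layout shown matches paddleocr 2.x releases, where each detection is a pair of box points and a (text, confidence) tuple:

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("ext_im.jpg", cls=True)

boxes, texts, probabilities = [], [], []
for box_points, (text, prob) in result[0]:
    xs = [p[0] for p in box_points]
    ys = [p[1] for p in box_points]
    boxes.append((min(xs), min(ys), max(xs), max(ys)))  # (x1, y1, x2, y2)
    texts.append(text)
    probabilities.append(prob)
```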


4.3 Reconstruction
4.3.1 Get Horizontal and Vertical Lines
For the reconstruction of the table, we follow a two-pronged approach. For any horizontal box, given by (x1, y1) and (x2, y2), we expand the rectangle across the width of the detected image. In other words:
x_h1, y_h1 = 0, y1
x_h2, y_h2 = x2 + width, y_h1 + (y2 - y1)

For any vertical box, given by (x1, y1) and (x2, y2), we expand the rectangle across the height of the detected image. In other words:
x_v1, y_v1 = x1, 0
x_v2, y_v2 = x_v1 + (x2 - x1), y_v1 + height
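Expressed as a small Python helper (a sketch; boxes are (x1, y1, x2, y2) tuples, and coordinates may be clipped to the image bounds when drawing):

```python
def expand_boxes(boxes, width, height):
    """Stretch each text box into a full-width horizontal band and a
    full-height vertical band, per the formulas above."""
    horizontal, vertical = [], []
    for (x1, y1, x2, y2) in boxes:
        horizontal.append((0, y1, x2 + width, y1 + (y2 - y1)))
        vertical.append((x1, 0, x1 + (x2 - x1), height))
    return horizontal, vertical
```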

The following figure shows the principle:

Figure 4.5 – Get Horizontal and Vertical Lines

The output of our system is an image named "Horiz_Vert.jpg", where the horizontal lines are drawn in red and the vertical lines in green.

Figure 4.6 – Horiz_Vert.jpg


4.3.2 Non-Max Suppression


As shown by the following image, there is interference between the boxes obtained. We want a single box surrounding all the texts at the same line level. As a result, the following two questions must be answered:

Figure 4.7 – Horizontal Boxes Interference

— How can we verify that these bounding boxes interfere?
Answer: By using the IoU (Intersection over Union) score:
if IoU(box1, box2) > 0.1, the boxes interfere; otherwise they do not.
— Which of the aligned boxes should we choose?
Answer: We choose the box with the highest probability (non-max suppression algorithm).

Figure 4.8 – Chosen Horizontal Box (highest probability)
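Both answers can be captured in a short sketch (threshold 0.1 as above; a plain-Python rendering of the standard algorithm):

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def non_max_suppression(boxes, probs, iou_thresh=0.1):
    """Among interfering boxes, keep only the highest-probability one."""
    order = sorted(range(len(boxes)), key=lambda i: probs[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```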

After applying the non-max suppression algorithm, we obtain "Nms_im.jpg".

Figure 4.9 – Nms_im.jpg


4.4 CSV Conversion


The obtained CSV file is shown as follows:

Figure 4.10 – Result.CSV
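The conversion itself can be sketched as follows (hypothetical helpers: the row and column bands stand for the horizontal and vertical boxes kept after non-max suppression):

```python
import csv

def band_index(center, bands):
    """Index of the (lo, hi) band containing a coordinate, else None."""
    for i, (lo, hi) in enumerate(bands):
        if lo <= center <= hi:
            return i
    return None

def table_to_csv(texts, boxes, row_bands, col_bands, path="Result.csv"):
    """Place each recognized text in its (row, column) cell and write CSV."""
    grid = [["" for _ in col_bands] for _ in row_bands]
    for text, (x1, y1, x2, y2) in zip(texts, boxes):
        r = band_index((y1 + y2) / 2, row_bands)
        c = band_index((x1 + x2) / 2, col_bands)
        if r is not None and c is not None:
            grid[r][c] = (grid[r][c] + " " + text).strip()
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(grid)
```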

4.5 Added Value


Previously, Axe Finance tried to carry out this table structure recognition task using CascadeTabNet. The latter showed catastrophic results, especially in the recognition of data: as shown in the following figure, CascadeTabNet failed to detect the negative signs in the balance sheet, recognized numbers only partially, and made many other errors.

Such recognition errors do not occur with PP-OCR.

Figure 4.11 – Added Value


4.6 Work Timeline


The project ran over a period of two months, from July 4 to August 31. The following figure illustrates the planned schedule, showing the main steps that led to a solution for the requested system.

Figure 4.12 – Work Timeline

Conclusion
In this last chapter, we have opted to highlight the hardware and software environment of
work. The results of our system and the approach that we followed. Mention was made of the
value added to the ACP product and the chronogram of the major parts of the development of
our solution and their planning.



Conclusion and Perspectives

The purpose of this report is to synthesize the work done during the company immersion internship at Axe Finance.
During this internship, we focused on the problem of recognizing and extracting the data located in financial statements.
The proposed solution builds a system for extracting the data contained in the financial statements, in order to reduce the errors and the time consumption observed in the manual process.

To do this, we started by setting up the general context of the project. After presenting the basic problem, we developed a solution while defining the actors as well as the functional and non-functional needs.

We then explained the design part, covering the overall architecture of the project as well as the detailed architecture of the different models used. Next, we presented the approach and the implementation of the solution step by step, and we ended by presenting the added value to the ACP product.

Implementing such a solution was not easy. Through this project, we discovered the field of computer vision through object detection and character recognition.
To conclude, the solution we have proposed and developed meets expectations.
However, it would be possible to improve the results by fine-tuning the pre-trained models on the company's internal balance sheet database, a process of making small improvements in order to attain the required output or performance. This improvement would make our results more consistent.

It should be mentioned that the references presented below [1] [2] [3] [4] [5] [6] [7] greatly helped in developing this report and the approach taken during the development of the project.



Bibliography

[1] https://arxiv.org/pdf/2105.06224.pdf
[2] https://github.com/PaddlePaddle/PaddleOCR
[3] https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/table/README.md
[4] https://layoutparser.readthedocs.io/en/latest/
[5] https://learnopencv.com/optical-character-recognition-using-paddleocr/
[6] https://www.axefinance.com/brochure-axefinance-ACP-solution-lending-automation-2019.pdf
[7] https://www.ibm.com/topics/computer-vision

