University of Manouba
National School of Computer Science
Author:
Melek SAHLIA
Organization: Axefinance
Supervised by: Abdelmajid AYADI
Address: 3ème étage, Immeuble La Couverture, Rue du Lac Huron, Tunis 1053
Tel: +216 28 98 90 43 Fax: +216 71 96 32 29
Email: info@axefinance.com
University year:
2022-2023
Supervisor Signature
Abstract — This report was prepared as part of the second-year summer internship that
took place under the supervision of the Axe Finance company.
The project involves developing a system for extracting data from balance sheets into a
CSV file, in order to replace the manual data-entry process, which has proven
time-consuming and error-prone.
The system must reliably detect the table presented in the balance sheet, detect and
recognize its text, reconstruct the table through a clear, mathematically grounded
strategy, and finally convert it to the desired format.
To achieve this goal, we cover the relevant basics and the solutions developed in this report.
Key Words: Python, OCR, CRNN, Paddle OCR, Computer Vision.
Résumé — This report was prepared as part of a second-year summer internship carried
out under the supervision of the Axe Finance company.
The project consists of developing a system for extracting the data found in financial
balance sheets into a CSV file, in order to simplify the manual data-entry process, which
has proven time-consuming and error-prone.
The system must reliably detect the table presented in the financial balance sheet, detect
and recognize its text, reconstruct the table through a clear, mathematically grounded
strategy, and finally convert it to the desired format.
To achieve this goal, we cover some basic notions and the solutions developed in this report.
Mots Clés: Python, OCR, CRNN, Paddle OCR, Computer Vision.
Acknowledgments
At the end of this internship, I would like to express my deepest and most sincere thanks to
all those who kindly offered the assistance needed to make this period a very enriching
experience. May they find here, collectively and individually, the expression of all my gratitude.
I cannot conclude this section without expressing my gratitude to the members of the jury,
who took the trouble to evaluate this work with attention and patience.
My thanks also go to the staff of the National School of Computer Science (ENSI) and
Axefinance, who gave me the opportunity to enrich my theoretical and practical knowledge.
I hope that this project will be an expression of my deep esteem and sincere gratitude.
Contents
General Introduction

1 Preliminary Study
  1.1 Academic Context
  1.2 Presentation of the host company
  1.3 Context of the Project
    1.3.1 Axe Credit Portal
    1.3.2 Analysis of the existing
    1.3.3 Problem statement
    1.3.4 Objectives
    1.3.5 Expected Results
  1.4 Table Structure Recognition Challenges
    1.4.1 Study of the existing
      1.4.1.1 Global-object-based methods
      1.4.1.2 Local-object-based methods
    1.4.2 Proposed Framework
      1.4.2.1 Overview of PaddleOCR Features
      1.4.2.2 PP-OCR Flow Chart
  1.5 Key Concepts
    1.5.1 Computer Vision
    1.5.2 Object Detection
    1.5.3 Instance Segmentation
    1.5.4 OCR
  1.6 Methodology

4 Achievement
  4.1 Work Environment
  4.2 Implementation and results
    4.2.1 Layout
      4.2.1.1 Installation
      4.2.1.2 Table Detection and Extraction
    4.2.2 Text Detection and Recognition
  4.3 Reconstruction
    4.3.1 Get Horizontal and Vertical Lines
    4.3.2 Non-Max Suppression
  4.4 CSV Conversion
  4.5 Added Value
  4.6 Work Timeline

Conclusion And Perspective

Bibliography
While financial institutions have faced difficulties over the years for a multitude of reasons,
the major cause of serious banking problems continues to be directly related to lax credit
standards for borrowers and counterparties, poor portfolio risk management, or a lack of
attention to changes in economic or other circumstances that can lead to a deterioration in
the credit standing of a bank's counterparties.
To assist them, Axe Credit Portal (ACP) is a highly flexible end-to-end credit process
automation solution covering all aspects of the lending lifecycle, including application
processing, credit assessment, automatic document generation, limits and collateral
management, covenant management, portfolio management, loss recovery, and provisioning.
Analyzing the problem, we notice that financial companies usually deal with overwhelming
amounts of documents that need to be digitized and entered into a system.
Previously, document processing and entry were done manually; one can only imagine how
time-consuming and mundane that was, not to mention that manual document processing is
very error-prone, as the possibility of human error is high.
Computer vision technology can successfully replace human employees when it comes to
document processing: it can analyze the documents, digitize them, and enter them into the
system, leaving employees only the review to perform.
As part of the ACP product solution, and knowing that financial documents usually contain
complicated table structures, my main work as an intern at Axefinance focuses on extracting
data from these tables into a CSV file; I chose to adopt Paddle OCR for text detection and
recognition in this project.
Preliminary Study
Introduction
This chapter is intended to give a general overview of our project. It will discuss the different
contexts of this work and make it clear how relevant it is.
To begin with, we will present the academic context. Then we will switch to the professional
side by presenting basic concepts.
Also, this chapter will include the problem statement, the ACP, and the goals that we aim to
achieve, and will end by explaining the methodology adopted during this work.
1.3.4 Objectives
In order to digitize the manual process and develop a solution for extracting the data
contained in the table of the financial balance sheet, we aim to follow this procedure:
— Financial statements database collection.
— Table detection model selection.
— Table recognition model selection.
— Table reconstruction.
— CSV file conversion.
We expect the system to add value by delivering accurate and reliable recognition.
Figure 1.3 illustrates an intuitive example of a complicated table with spanning cells. The
presented table is shown in (a), and (b) is the desired structure of the dashed box area. The
recognized structures using existing methods are shown in (c) - (e). Note that the four cells, on
the right side in (c), are unfortunately recognized as a single cell.
Several groups of methods may be applied to this approach; below are the various groups
and their explanations:
— First method: Attempts to recover cellular relationships on the basis of some heuristic
rules and algorithms.
— Second method: Treats detected boxes as nodes in a graph and tries to predict relation-
ships based on techniques of Graph Neural Network (GNN).
— Third method: Predicts relationships among nodes in three classes (horizontal
connection, vertical connection, and no connection) using multiple features such as visual
features, text positions, word embeddings, etc.
— Other methods: Adopt a graph attention mechanism to improve prediction precision.
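As a rough illustration of the first, rule-based family, a heuristic might classify the relation between two detected cell boxes from their coordinate overlap alone. This is a hypothetical sketch, not taken from any of the cited methods; the `relation` function and its thresholds are our own illustration:

```python
def relation(box_a, box_b):
    """Classify the relation between two cell boxes given as (x1, y1, x2, y2):
    'horizontal' if they share a row band, 'vertical' if they share a
    column band, 'none' otherwise. A simplified, heuristic stand-in for
    the relation classes predicted by the GNN-based methods."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap of the vertical (row) and horizontal (column) extents.
    row_overlap = min(ay2, by2) - max(ay1, by1)
    col_overlap = min(ax2, bx2) - max(ax1, bx1)
    if row_overlap > 0:
        return "horizontal"
    if col_overlap > 0:
        return "vertical"
    return "none"
```

Two cells side by side share a row band and are classified as horizontally connected; two cells stacked on top of each other share a column band and are vertically connected.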
Limits:
— Since no empty cells are detected, local-object-based methods usually fall into the ambi-
guity of empty cells.
— The large number of graph nodes makes these methods computationally expensive.
1.5.4 OCR:
OCR stands for "optical character recognition." It is a technology that recognizes text within
a digital image, and it is generally used to identify text in scanned documents and images.
OCR software can convert a physical paper document or a picture into an electronic version
with accessible text.
For instance, if you scan a paper document or photograph, the scanner will
create a file containing a digital image.
1.6 Methodology
This project started with an overview of the ACP product of the company Axefinance
and its objectives; the goal of extracting data from the financial statements was defined and
stated from the beginning.
Following the methodology adopted at Axefinance, it was recommended to be present at
the daily meetings and to prepare a presentation summarizing the progress of each week.
A lot of time was devoted to research, and then to development, in order to better
understand the framework and the approach by which the project should be implemented.
Conclusion
In this preliminary study chapter, we started with the general context while continuing with
the exposition of the problem as well as the proposed solution.
Then, we ended up developing the key and useful concepts for the understanding of the model.
In the next chapter, we will focus on the specification and analysis of needs.
Introduction
After defining the general context of the project, we will start the second chapter, which
covers the analysis and specification of needs.
In order to orient our solution, we will start by targeting the principal actor and his different
needs, which will be detailed in what follows. We will then present the hardware and
software components in addition to the functional requirements and the software attributes.
Finally, we will present a use case diagram for the ACP product, then a statechart diagram for
the table structure recognition task to elaborate on the different finite states of our system.
Conclusion
In this part of the report, we specified the main actor and developed the functional and
non-functional needs. Then we went through diagrams describing the states related to the
development and organization of our system. In the next chapter, we will explain the design
we have used in our system.
Design
Introduction
In this chapter, we aim to explain the logic behind the realization of our system.
For this, we will first present the global architecture to be followed. Then we will give an
overview of the used models for table detection as well as for text detection and recognition.
3.2.1.1 Deep Learning Modules for Layout Detection (Layout Parser Table detection)
— Challenges: There are no standard ways of sharing and reusing existing models;
therefore, we built a series of standardized model APIs.
— Standardizing the Model API: The first step in creating a model is to specify the model
configuration file, which is highly semantic: it is composed of the training dataset name
and the model architecture. The standardized model API can then be used to initialize
different models, and we can specify the deep learning backend. Finally, we just run the
same detection API to identify layouts in any image.
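As a minimal sketch of this standardized API, the following assumes layoutparser with a Detectron2 backend is installed; the image path is a placeholder, and the config path and label map come from the PubLayNet entry of the model zoo:

```python
# PubLayNet label map, as listed in the Layout Parser model zoo.
PUBLAYNET_LABELS = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}

def detect_tables(image_path):
    """Detect table regions in a document image with Layout Parser.

    A sketch only: heavy imports are kept inside the function so the
    module loads even without layoutparser/cv2 installed.
    """
    import cv2
    import layoutparser as lp

    image = cv2.imread(image_path)[..., ::-1]  # OpenCV loads BGR; flip to RGB
    # The config path selects the training dataset and model architecture.
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
        label_map=PUBLAYNET_LABELS,
    )
    layout = model.detect(image)  # the same detection API works on any image
    return [block for block in layout if block.type == "Table"]
```

The returned blocks carry coordinates that can then be used to crop the table region out of the page image.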
Model Zoo: Layout Parser provides models pre-trained on six different datasets, with the
following label maps:
— HJDataset:
1:"Page Frame", 2:"Row", 3:"Title Region", 4:"Text Region", 5:"Title", 6:"Subtitle", 7:"Other"
— PubLayNet:
0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"
— PrimaLayout:
1:"TextRegion", 2:"ImageRegion", 3:"TableRegion", 4:"MathsRegion", 5:"SeparatorRegion",
6:"OtherRegion"
— NewspaperNavigator:
0: "Photograph", 1: "Illustration", 2: "Map", 3: "Comics/Cartoon", 4: "Editorial Cartoon",
5: "Headline", 6: "Advertisement"
— TableBank:
0: "Table"
— MFD:
1: "Equation"
— Feature Extraction:
The convolutional neural network (CNN), with its convolutional and max-pooling layers,
is the first component. These layers are in charge of sifting through the input images to
extract features, and their outputs are feature maps. The feature maps are converted into a
sequence of feature vectors, which is then fed to the following layer. Each feature vector in
the sequence is generated from left to right on the feature maps: the i-th columns of all the
maps are combined to produce the i-th feature vector.
As a result of the feature extraction, each column of the feature maps corresponds to a
rectangular area of the input image, referred to as the receptive field. Each feature vector
in the sequence is associated with its receptive field and can be thought of as an appearance
descriptor for that area. The following layer of RNNs then receives the feature sequence.
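The column-to-vector conversion can be sketched with NumPy; the map sizes here are arbitrary toy values, not those of the actual CRNN backbone:

```python
import numpy as np

# Toy CNN output: C feature maps of height H and width W.
C, H, W = 4, 3, 10
feature_maps = np.arange(C * H * W, dtype=float).reshape(C, H, W)

# The i-th feature vector combines the i-th column of every map,
# producing a left-to-right sequence of W vectors of size C * H.
sequence = [feature_maps[:, :, i].reshape(-1) for i in range(W)]
```

Each of the W vectors corresponds to one vertical slice (receptive field) of the input image.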
— Sequence Labeling:
This layer, built on top of the convolutional network, is the recurrent neural network
(RNN). In order to tackle the vanishing gradient problem and build a deeper network, two
bi-directional LSTMs are used in the CRNN architecture. The recurrent layers predict a
label for each feature vector, or frame, in the feature sequence received from the CNN
layers: they predict the label y_t for each frame x_t in the sequence x = x_1, ..., x_T.
— Transcription:
This layer is in charge of creating the final sequence from the per-frame predictions with
the highest probability. The model learns to decode the output by computing the
Connectionist Temporal Classification (CTC) loss on these predictions.
CTC Loss:
The RNN layer output is a tensor that includes the probability of each label for each
receptive field. But how does this influence the final product? This is where the
Connectionist Temporal Classification (CTC) loss enters the picture: both inference
(decoding the output tensor) and training of the network are carried out through the CTC
loss. CTC rests on the following central principles:
— Text encoding:
CTC addresses the case where a character spans more than one time step: all repeating
characters are merged into one, and a blank character “-” is inserted to mark the boundary
between characters. For example, in the previous figure, ‘S’ in ‘STATE’ spans three time
steps, so the network might predict those time steps as ‘SSS’; CTC merges those outputs
and predicts ‘S’. For the whole word, a possible encoding is SSS-TT-A-TT-EEE, which
collapses to the output ‘STATE’.
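The merge-and-drop-blanks rule can be sketched in a few lines of Python:

```python
def ctc_collapse(encoded, blank="-"):
    """Collapse a CTC encoding: merge runs of repeated symbols,
    then drop the blank separators."""
    out, prev = [], None
    for ch in encoded:
        # Keep a symbol only when it starts a new run and is not a blank.
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("SSS-TT-A-TT-EEE"))  # STATE
```

Note that the blank between the two T runs is what preserves the genuine double occurrence of ‘T’ in ‘STATE’.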
— Loss Calculation:
For a model to learn, a loss needs to be calculated and back-propagated through the
network. Here, the loss is calculated by summing the scores of all possible alignments
that produce the target; that sum is the probability of the output sequence. The loss is
the negative logarithm of this probability, which is then used for back-propagation
through the network.
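As a toy illustration of this sum, consider a one-character target "a" over two time steps; the per-step probabilities below are made up for the example:

```python
import math

# Hypothetical per-time-step probabilities over the alphabet {"a", "-"}.
probs = [{"a": 0.6, "-": 0.4}, {"a": 0.7, "-": 0.3}]

# The alignments that collapse to the target "a": "aa", "a-", "-a".
alignments = ["aa", "a-", "-a"]
p_target = sum(
    math.prod(probs[t][al[t]] for t in range(len(probs))) for al in alignments
)
loss = -math.log(p_target)  # negative log-probability of the target
```

Here p_target = 0.6·0.7 + 0.6·0.3 + 0.4·0.7 = 0.88, so the loss is small; a real CTC implementation computes this sum efficiently with the forward-backward algorithm rather than enumerating alignments.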
— Decoding:
At inference time, we need a clear and accurate output. For this, CTC computes the best
possible sequence from the output tensor, or matrix, by taking the character with the
highest probability at each time step. Decoding then removes the blanks “-” and the
repeated characters, yielding the final output.
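This greedy, best-path decoding can be sketched as an argmax per time step followed by the collapse step; the probability matrix below is a made-up example:

```python
def best_path_decode(matrix, alphabet, blank="-"):
    """Greedy CTC decoding: take the most probable symbol at each
    time step, then merge repeats and remove blanks."""
    best = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in matrix]
    out, prev = [], None
    for ch in best:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Hypothetical 4-time-step output over the alphabet ["a", "b", "-"].
matrix = [
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.7, 0.1],
]
print(best_path_decode(matrix, ["a", "b", "-"]))  # ab
```

The per-step winners are a, a, -, b, which collapse to the final output "ab".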
Conclusion
In this chapter, we have presented the architectural and detailed design, a fundamental
phase in which we specify the skeleton of the system that will be realized in the next
chapter.
Achievement
Introduction
In this chapter, we describe the implementation phase, which is the culmination of the
previous stages, from the preliminary study to the design. We present the working
environment, the implementation process, and the tools used to successfully complete a
functional solution.
4.2.1 Layout
4.2.1.1 Installation
We imported the PaddleOCR library from the PaddlePaddle project available on GitHub,
which contains all the libraries we need. In addition, we imported the pre-trained
LayoutParser model that will be used for table detection, and then loaded our test image
into the environment.
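A minimal sketch of the text detection and recognition call, assuming paddleocr is installed; the image path is a placeholder:

```python
def recognize_text(image_path, language="en"):
    """Run PP-OCR text detection and recognition on one image and
    return the (box, (text, confidence)) results.

    A sketch only: the heavy import is kept inside the function so the
    module loads even without paddleocr installed.
    """
    from paddleocr import PaddleOCR

    # use_angle_cls enables the text-angle classifier for rotated text.
    ocr = PaddleOCR(use_angle_cls=True, lang=language)
    return ocr.ocr(image_path, cls=True)
```

Each result pairs the detected text box coordinates with the recognized string and its confidence score, which is what the reconstruction step consumes.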
4.3 Reconstruction
4.3.1 Get Horizontal and Vertical Lines
For the reconstruction of the table, we follow a two-pronged approach. For any horizontal
box, given by (x1, y1) and (x2, y2), we expand the rectangle to the width of the detected
image; that is:

xh1, yh1 = 0, y1
xh2, yh2 = x2 + width, yh1 + (y2 - y1)

For any vertical box, given by (x1, y1) and (x2, y2), we expand the rectangle to the height
of the detected image; that is:

xv1, yv1 = x1, 0
xv2, yv2 = xv1 + (x2 - x1), yv1 + height
The output of our system is an image named “Horizvert.jpg”, where the horizontal lines
are drawn in red and the vertical lines are drawn in green.
Conclusion
In this last chapter, we highlighted the hardware and software work environment, the
results of our system, and the approach that we followed. We also discussed the value
added to the ACP product and the timeline and planning of the major parts of the
development of our solution.
The purpose of this report is to synthesize and present the work done during the company
immersion internship at Axe Finance.
During this internship, we focused on the problem of recognition and the extraction of data
located in the financial statements.
The purpose of the proposed solution is to build a system for extracting the data contained
in the financial statements, in order to reduce the errors and the time consumed in the
manual process.
To do this, we started by setting up the general context of the project. Indeed, after
presenting the basic problem, we tried to develop a solution while defining the actors as
well as the functional and non-functional needs.
Then we explained the design part, where we described the overall architecture of the
project as well as the detailed architecture of the different models used. Next, we
presented the approach and implementation of the solution step by step, and we ended by
presenting the added value to the ACP product.
Implementing such a solution was not easy to achieve. Through this project, we discovered
the field of computer vision through the detection of objects and the recognition of characters.
To conclude, the solution we have proposed and developed meets expectations.
However, it will be possible to improve the results by fine-tuning the pre-trained models on
the company's internal balance sheet database; fine-tuning is a process of making small
adjustments in order to attain the required output or performance. This improvement
would make our results more consistent.
It should be mentioned that the references presented below have greatly helped to develop
this report and the approach taken during the development of the project.
[1] https://arxiv.org/pdf/2105.06224.pdf
[2] https://github.com/PaddlePaddle/PaddleOCR
[3] https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/table/README.md
[4] https://layout-parser.readthedocs.io/en/latest/
[5] https://learnopencv.com/optical-character-recognition-using-paddleocr/
[6] https://www.axefinance.com/brochure-axefinance-ACP-solution-lending-automation-2019.pdf
[7] https://www.ibm.com/topics/computer-vision