
Face detection by surveillance camera

Using Machine Learning

Computer Science Department


University of batna 2 (Mustapha Ben Boulaïd)

By
Djouama Ihcene
Oulmi Saliha

A dissertation submitted to the University of Batna 2 in


accordance with the requirements of the degree of
Master: Artificial Intelligence and Multimedia of
Mathematics and Computer Science Faculty.

Directed by
Dr Larbi GUEZOULI

September 2020

Abstract
Face detection has been a trending research field in recent years, owing to the growing need for this technology in this decade. It can be found in many applications (for example, Snapchat and Instagram face filters) and in many areas where it helps to ensure security, such as airports, train stations, and security cameras in homes and shops. This is what makes it important and motivates researchers to achieve remarkable work with high performance and guaranteed real-time efficiency, and what makes it an interesting subject for our project: detecting faces in video in real time. As a first step, we select four well-known methods that achieve state-of-the-art results in face detection and carry out a comparative study between them. We then select the best one that satisfies all the conditions and requirements needed to reach real time (at least 20 frames per second) and high accuracy on the dataset that we built for surveillance scenarios.
Keywords: Security camera, face detection, FaceBoxes, Deep Learning, CNN, Real time, Machine Learning.


Résumé
La détection des visages est le domaine de recherche en vogue de ces dernières années, ce qui s’explique par les besoins de cette décennie ; c’est l’une des technologies les plus utilisées. On la trouve dans différentes applications (exemples : filtres de visage Instagram, Snapchat) et dans différents domaines pour assurer la sécurité dans les aéroports, les gares, les caméras de sécurité dans les maisons et les magasins, etc. C’est ce qui la rend importante et motive les chercheurs à réaliser des travaux remarquables avec des performances élevées et une efficacité en temps réel sur la vidéo. C’est ce qui rend ce sujet intéressant pour notre projet de détection de visages en temps réel. Dans un premier temps, nous sélectionnons quatre méthodes connues qui atteignent l’état de l’art en détection de visages et nous en faisons une étude comparative, puis nous sélectionnons la meilleure, celle qui répond à toutes les conditions et exigences nécessaires pour atteindre le temps réel (au moins 20 images par seconde) et une grande précision sur la base de données que nous avons construite pour les scénarios de surveillance.
Mots clés : Caméra de sécurité, détection des visages, Faceboxes, Apprentissage profond, CNN, Temps réel, Apprentissage automatique.


‫ملخص‬
‫إن الكشف عن الوجوه هو من مجا ت البحث العلمي ا ٔكثر استكشافا في السنوات القليلة الماضية ‪ ،‬وهذا‬
‫يرجع إلى الحاجة إليه في هذا العقد و هي من التكنولوجيا ا ٔكثر استخدا ًما ؛ حيث يمكن أن نجدها في‬
‫تطبيقات مختلفة )أمثلة ‪ ،‬فلتر ا نستقرام و سناب شات( ‪،‬وفي مناطق مختلفة لضمان ا ٔمن في المطارات‬
‫ومحطات القطارات وكاميرات ا ٔمن في المنازل والمتاجر ‪ ...‬إلخ ‪ ،‬وهذا ما يجعلها مهمة ومحفزة للباحثون‬
‫عال وضمان كفاءة الوقت الحقيقي‪ .‬مما يجعل ا ٔمر مثي ًرا ل هتمام للقيام به في‬ ‫ٕ نجاز أعمال رائعة بأداء ٍ‬
‫مشروعنا للكشف عن الوجوه في الوقت الفعلي في الفيديو‪ .‬لذلك كخطوة أولى نختار أربع طرق معروفة في‬
‫الكشف عن الوجوه وحققوا نتائج مبهرة في مختلف الشروط ‪،‬ثم من خ ل إجراء دراسة مقارنة بينها ‪ ،‬نختار‬
‫أفضل طريقة تستجيب لجميع الشروط والمتطلبات ال زمة حراز الكشف في الوقت الحقيقي في وقت يقل‬
‫عن ‪ 20‬صورة في الثانية ‪ ،‬ودقة عالية مع مجموعة البيانات التي أنشأناها لسيناريوهات المراقبة‪.‬‬
‫الكلمات المفتاحية‪ :‬كاميرات المراقبة ‪،‬الكشف عن الوجوه ‪ ، Faceboxes ،‬التعلم العميق ‪ ، Cnn ،‬الوقت‬
‫الحقيقي ‪ ،‬التعلم ا ٓلي‪.‬‬

Dedication

To the pure soul, to the flower of my life, to you my mother in heaven.


To my hero, to my source of motivation, to you my father.

Acknowledgements

Foremost, I thank Allah for giving me the strength and patience to complete our thesis in these unexpected circumstances of the coronavirus pandemic.
I would like to express my sincere gratitude to our supervisor, Dr LARBI GUEZOULI, for the guidance that helped me during this research; without his assistance, corrections, planning and dedicated involvement in every step of the process, this work would never have been accomplished.
My sincere thanks also go to all my teachers throughout all these years; without them I would not have reached this level of knowledge.
Special thanks to my friends and to my colleague in this thesis for their motivation and support during this work, to my two best friends who have always been there for me, and to my friend in Mostaganem who helped me see this project more clearly by giving me tips.
Last but not least, I would like to thank my family, my brothers and my dearest sister; none of this would have happened without their trust in me and their constant spiritual encouragement. Most importantly, I thank my father, DJOUAMA ABDELAZIZ, who did everything he could to offer me all that I need and to bring me to where I am now through his advice, his constant support, and his certainty that I could do anything to be successful during all my years of study.
Thanks to myself for not giving up.

Acknowledgements

First and foremost, I am deeply grateful and thankful to ALLAH for giving us the strength, ability and knowledge to achieve this work.
My gratitude knows no bounds toward my colleague DJOUAMA IHCENE, who truly supported and encouraged me the most during my hard and expectant time this year. Best wishes to her.
Moreover, I would like to express my thanks to my dad and mom, my children DJANNA and MOUATEZ, my sister and brothers, and my husband.
Finally, a special thank you to our supervisor, Dr LARBI GUEZOULI.

Author’s declaration

We declare that the work in this dissertation was carried out in accordance with the requirements of the University’s Regulations and Code of Practice for Research Degree Programs, and that the work is the candidate’s own work. Work done in collaboration with, or with the assistance of, others is indicated as such.

Email :

SIGNED: .................................................... DATE: ..........................................

Table of Contents

Page

List of Tables xiii

List of Figures xiv

1 Introduction 1
1.1 Problems and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Lexicon of used expressions 4


2.1 Video surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Convolutional neural network (CNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.2 Convolution Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.3 Strides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.4 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.5 Non Linearity (ReLU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.6 Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.7 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.8 Anchors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.9 Resnet50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.10 Data Augmentation [23] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.11 Data augmentation methods in computer vision . . . . . . . . . . . . . . . 12
2.3.12 Image Annotation [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.13 Non-maximum Suppression (NMS) [11] . . . . . . . . . . . . . . . . . . . . 14
2.3.14 The Jaccard Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.15 Xavier initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Epoch vs Batch Size vs Iterations [22] . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Epochs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


2.4.3 Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


2.4.4 Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Related Work 18
3.1 Body based Face Detection BFD on the UCCS dataset [16] . . . . . . . . . . . . 18
3.1.1 The characteristics of Body based face detection BFD proposed by Cao
et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 The processing of the real-time face detection approach . . . . . . . . . . 19
3.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Selective Refinement Network for High Performance Face Detection [4] . . . . . 20
3.2.1 Main contributions to the face detection studies . . . . . . . . . . . . . . . 20
3.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Feature Agglomeration Networks for Single Stage Face Detection [27] . . . . . . 21
3.3.1 The contribution of this methods . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.2 The final FANet model and its results . . . . . . . . . . . . . . . . . . . . . 22
3.4 FaceBoxes: A CPU Real-time Face Detector with High Accuracy [31] . . . . . . 23
3.4.1 The contribution of this methods . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Comparison between methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 SRN and faceboxes 26


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Selective Refinement Network for High Performance Face Detection (SRN) . . . 27
4.2.1 Why SRN? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 How does Selective Refinement Network (SRN) works? . . . . . . . . . . 29
4.2.3 The network structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.4 Selective Two-Step Classification (STC) . . . . . . . . . . . . . . . . . . . 31
4.2.5 Selective Two-Step Regression(STR) . . . . . . . . . . . . . . . . . . . . . 32
4.2.6 Receptive Field Enhancement(RFE) . . . . . . . . . . . . . . . . . . . . . . 33
4.2.7 Training, Expirements and Results . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Faceboxes: a cpu real-time and accurate unconstrained face detector . . . . . . . 35
4.3.1 Why Faceboxes? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 How does Faceboxes works? . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.3 Rapidly Digested Convolutional Layers . . . . . . . . . . . . . . . . . . . . 37
4.3.4 Multiple Scale Convolutional Layers . . . . . . . . . . . . . . . . . . . . . . 38
4.3.5 Anchor densification strategy . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4.3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.7 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5 Implementation 46
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 The method used in the implementation . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Operating system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Which programming language . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.3 Used machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Creating our dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.1 Starting idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.2 Collecting images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.3 Frames annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.5 Run the codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6 Results and discussions 59


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 How do we calculate the AP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2.1 Running the code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.2 Create the ground-truth files . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2.3 Create the detection-results files . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3.1 Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3.2 Average precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.1 Average precision (AP) on our personal machine . . . . . . . . . . . . . . 70
6.4.2 Average precision (AP) on the HPC . . . . . . . . . . . . . . . . . . . . . . 70
6.4.3 The speed of calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7 Conclusion 72

8 Appendix 74
8.1 The environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2 Things to avoid in the environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3 Collection of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75


8.3.1 Resources for our data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


8.4 Errors during training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.5 Commands might be useful for Linux users . . . . . . . . . . . . . . . . . . . . . . 76

Bibliography 79

List of Tables

Table Page

3.1 Comparative table between methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 AP performance of the two-steps classification applied to each pyramid level. [4] . 32
4.2 AP performance of the two-step regression applied to each pyramid level. [4] . . . 33
4.3 Evaluation on Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Overall CPU inference time and mAP compared on different methods. The FPS is
for VGA-resolution images on CPU and the mAP means the true positive rate at
1000 false positives on FDDB. Notably, for STN, its mAP is the true positive rate
at 179 false positives and with ROI convolution, its FPS can be accelerated to 30
with 0.6% recall rate drop. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Ablative results of the FaceBoxes on FDDB dataset. Accuracy (mAP) means the
true positive rate at 1000 false positives. Speed (ms) is for the VGA-resolution
images on the CPU. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Result of ablation of each component of the method beside the loss function where
DCFPN=Architecture+Strategy+Loss [31] . . . . . . . . . . . . . . . . . . . . . . . . 44

List of Figures

Figure Page

2.1 Difference between face recognition and face detection . . . . . . . . . . . . . . . . . 5


2.2 Neural network with many convolutional layers . . . . . . . . . . . . . . . . . . . . . 6
2.3 Image matrix multiplies kernel or filter matrix . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Image matrix multiplies kernel or filter matrix . . . . . . . . . . . . . . . . . . . . . . 7
2.5 3 x 3 Output matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Some common filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Stride of 2 pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 ReLU operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 Max pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.10 The pooling layer output flattened as an FC layer input . . . . . . . . . . . . . . . . 10
2.11 Complete CNN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.12 Residual Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.13 Data augmentation – Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.14 Data augmentation – Random cropping . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.15 Data augmentation – Color shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.16 Intersection over Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.17 Gradient descent optimisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.18 Type of curves that network pass by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.1 A WIDERFACE dataset for face detection. The annotated face bounding boxes are
denoted in green color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Network structure of SRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Architecture of the FaceBoxes and the detailed information table about our anchor
designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 (a) The C.ReLU modules where Negation simply multiplies −1 to the output of
Convolution. (b) The Inception modules. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Examples of anchor densification. For clarity, we only densify anchors at one recep-
tive field center (i.e., the central black cell), and only color the diagonal anchors. . 40
4.6 Face boxes on AFW dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


5.1 SRN result in wider-face testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


5.2 Collected images from different datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Samples of frames from our dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 Image annotation using MakeSense tool. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 Start training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Finishing training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.7 Training on HPC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6.1 Testing on CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


6.2 Testing on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Testing on CPU HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4 Testing on GPU HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 The Average precision obtained for test images by Faceboxes . . . . . . . . . . . . . 64
6.6 Bonding boxes detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.7 High confidence for detecting face 71% . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.8 Confidence of 42% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.9 Result of Bounding boxes. Red: false positives , Green: true positives , Blue: ground
truth detected , Pink: ground-truth not detected. . . . . . . . . . . . . . . . . . . . . 66
6.10 The Average precision obtained for video test by Faceboxes . . . . . . . . . . . . . . 67
6.11 Bonding boxes detected (true and false positives) . . . . . . . . . . . . . . . . . . . . 67
6.12 High confidence for detecting face 85% . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.13 Confidence of 38% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.14 Result of Bounding boxes. Red: false positives, Green: true positives, Blue: ground-
truth detected, Pink: ground-truth not detected. . . . . . . . . . . . . . . . . . . . . 68
6.15 The Average precision obtained for test images by Faceboxes in HPC . . . . . . . . 69
6.16 Bonding boxes detected in HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.17 Face detected without ground-truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

8.1 XML file errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


8.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.3 Problem with the image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4 Python libraries path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.5 Unlock files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Chapter 1
Introduction

In the area of security, motion detection remains a field of research despite the evolution it has known since its creation. A surveillance camera can capture any type of movement. The movement that interests us in this project is the movement of human beings: the camera should only detect the movement of human beings. This detection is based on facial analysis techniques, in our case face detection.
Face detection is a long-standing problem in computer vision with extensive applications including face recognition, animation, expression analysis and human-computer interaction [29]. Face detection is thus a fundamental step for all facial analysis algorithms: its goal is to determine the presence of faces in an image and, if present, to return the location and extent of each face.
To detect faces efficiently and accurately, different detection pipelines have been designed after the pioneering work of Viola-Jones [24]. Most early face detection methods focused on designing effective hand-crafted features (e.g., Haar (Viola and Jones 2004) [24] and HOG (Dalal and Triggs 2005) [5]) and classifiers (e.g., AdaBoost (Freund and Schapire 1997) [7]), and on combining local features in global models such as the deformable parts model (DPM) (Felzenszwalb et al. 2010) [6]. However, these methods typically optimize each component of the detector separately, which limits their performance when they are deployed in real-life complex scenarios. [29]
Further improving the performance of face detection has become a challenging issue. In recent years, with the advent of deep convolutional neural networks (CNNs), a new generation of more effective face detection methods based on CNNs has significantly improved the state-of-the-art performance and rapidly become the tool of choice [29]. These detectors perform much better than approaches based on hand-crafted features thanks to the capability of deep CNNs to extract discriminative representations from data. Modern face detectors based on deep CNNs can easily detect faces under moderate variations in pose, scale, facial expression, occlusion and lighting conditions. Consequently, deep learning-based face detectors are now widely used in a myriad of consumer products, e.g., video surveillance systems, digital cameras and social networks, because of their remarkable face detection results [15]. However, although these methods have considerably improved in terms of detection speed or accuracy, methods that focus on accuracy tend to be extremely slow due to the use of complicated classifiers, while methods that focus on detection speed have limited accuracy in the bounding box location and in the detection of tiny faces or extreme-pose faces, especially while working in real time. [26]

1.1 Problems and objectives


The issues that will be addressed in this project are:

• Recall Efficiency

– Check if the moving object is a human being by checking if it contains a face.


– The number of false positives needs to be reduced at high recall rates.
– AP is very high, but precision is not high enough at high recall rates.

• Location Accuracy

– Detection and location of a moving object.


– Accuracy of the bounding box location needs to be improved.
– Put more emphasis on the bounding box location accuracy.
– As the IoU (Intersection over union) threshold increases, the Average Precision(AP)
drops dramatically.

• Real time

– To achieve face detection in surveillance scenarios in real time, the speed should be
high.
– To achieve the performance in real time, the frame per second should be at least
20FPS.

• Datasets

– The dataset should be big enough to do efficient training.


– The dataset should guarantee the conditions of surveillance scenarios.

2
CHAPTER 1. INTRODUCTION

1.2 Thesis plan


This thesis is structured into six chapters divided into two parts:

1. Part one: Theoretical study

a) Chapter 1: Introduction
b) Chapter 2: This chapter contains definitions of the expressions and terms used in this manuscript to facilitate its reading.
c) Chapter 3: This chapter presents a comparative study between four methods working
on face detection. We present advantages and disadvantages of each with the aim
of selecting the best one.
d) Chapter 4: Presentation of selected methods.

2. Part two: Practical study

a) Chapter 5: This chapter is dedicated to experiments. We will present the used


environment, the used dataset, and execution steps.
b) Chapter 6: This chapter presents the results and discussions of the carried out work.

Chapter 2
Lexicon of used expressions

”Computers are able to see, hear and learn. Welcome to the future.”

These expressions are used in this manuscript, so to better understand the subject, let us explain them briefly.

2.1 Video surveillance


Because state-of-the-art object detection techniques can accurately identify and track multiple
instances of a given object in a scene, these techniques naturally lend themselves to automating
video surveillance systems.
For instance, object detection models are capable of tracking multiple people at once, in
real-time, as they move through a given scene or across video frames. From retail stores to
industrial factory floors, this kind of granular tracking could provide invaluable insights into
security, worker performance and safety, retail foot traffic, and more. [8]

2.2 Face detection


Face detection is a computer vision technique that works to identify and locate faces within an
image or video. Specifically, face detection draws bounding boxes around these detected faces,
which allow us to locate where said faces are in (or how they move through) a given scene.


Face detection is commonly confused with face recognition, so before we proceed, it is important to clarify the distinction between them.
Face recognition assigns a label to a face: a picture of a person named ”Ali” receives the label “Ali”, together with a bounding box drawn around the face. Face detection, on the other hand, only draws a box around each face; the model predicts where each face is (see figure 2.1).

Figure 2.1: Difference between face recognition and face detection

2.3 Convolutional neural network (CNN)


Among neural networks, the Convolutional Neural Network (ConvNet or CNN) is one of the main categories used for image recognition, image classification, face detection and face recognition, which are some of the areas where CNNs are widely applied. CNN image classification takes an input image, processes it and classifies it under certain categories (e.g., dog, cat, tiger, lion).
A computer sees an input image as an array of pixels whose size depends on the image resolution: h × w × d (h = height, w = width, d = depth). For example, a 6 × 6 × 3 array for an RGB image (3 refers to the RGB channels) and a 4 × 4 × 1 array for a gray-scale image.
Technically, to train and test deep learning CNN models, each input image is passed through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers, and a softmax function is applied to classify the object with probability values between 0 and 1. Figure 2.2 shows the complete flow of a CNN processing an input image and classifying it based on objects. [17]

2.3.1 Input
This is the input image, here a photo with a car. When the input is a video, we split it into individual frames and apply the CNN to each one (generally with tracking, to avoid having to run the detection too often). [18]


Figure 2.2: Neural network with many convolutional layers

2.3.2 Convolution Layer


Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter, or kernel. [17]

Figure 2.3: Image matrix multiplies kernel or filter matrix

Consider a 5 x 5 image whose pixel values are 0 and 1, and a 3 x 3 filter matrix, as shown below.
The convolution of the 5 x 5 image matrix with the 3 x 3 filter matrix produces a 3 x 3 output called the “Feature Map”, as shown below.
Convolving an image with different filters can perform operations such as edge detection, blurring and sharpening. The example below shows various convolved images obtained after applying different types of filters (kernels) (see Figure 2.6).


Figure 2.4: Image matrix multiplies kernel or filter matrix

Figure 2.5: 3 x 3 Output matrix
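To make this computation concrete, the following is a minimal NumPy sketch (not taken from the thesis code) of a valid, stride-1 convolution of a 5 x 5 binary image with a 3 x 3 filter; the pixel and filter values are illustrative.

```python
import numpy as np

# Illustrative 5 x 5 binary image and 3 x 3 filter (values chosen arbitrarily).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(img, k, stride=1):
    """'Valid' (no padding) 2D convolution: slide the kernel, multiply element-wise, sum."""
    kh, kw = k.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

print(convolve2d(image, kernel))  # a 3 x 3 feature map, as in Figure 2.5
```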

2.3.3 Strides
Stride is the number of pixels a filter moves across the input image. When the stride is 1 then
we move the filters by 1 pixel at a time. When the stride is 2 then we move the filters by 2
pixels at a time and so on. Figure 2.7 shows convolution with a stride of 2 pixels. [17]

2.3.4 Padding
Sometimes filter does not fit perfectly the input image. We have two options: Pad the picture
with zeros (zero-padding) so that it fits the part of the image where the filter did not fit. This
is called valid padding which keeps only valid part of the image.
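As a small illustration of how stride and padding affect the output size (this is not code from the thesis), the PyTorch sketch below applies a 3 x 3 convolution to a 6 x 6 input with different settings; the last comment gives the standard output-size rule.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)  # one 6 x 6 gray-scale image: (batch, channels, height, width)

valid   = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)  # valid padding: 6 -> 4
padded  = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)  # zero-padding:  6 -> 6
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=0)  # stride of 2:   6 -> 2

print(valid(x).shape, padded(x).shape, strided(x).shape)
# output size = floor((input - kernel + 2 * padding) / stride) + 1
```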

2.3.5 Non Linearity (ReLU)


ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is ƒ(x) = max(0, x).
Why is ReLU important? ReLU's purpose is to introduce non-linearity into our ConvNet; since most real-world data that the ConvNet has to learn is non-linear, the ReLU layer provides this non-linearity (see figure 2.8). [17]


Figure 2.6: Some common filters

Figure 2.7: Stride of 2 pixels


Figure 2.8: ReLU operation

There are other non-linear functions, such as tanh or sigmoid, that can also be used instead of ReLU. Most data scientists use ReLU because it performs better than the other two.

2.3.6 Pooling Layer


The pooling layer reduces the number of parameters when the images are too large. Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:

• Max Pooling.

• Average Pooling.

• Sum Pooling.

Max pooling takes the largest element from the rectified feature map. Average pooling takes the average of the elements of each window instead, and summing all the elements of the window is called sum pooling. (see figure 2.9) [17]

Figure 2.9: Max pooling
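A minimal NumPy sketch of 2 x 2 max pooling with stride 2 (illustrative values, not the thesis code); average pooling would simply replace window.max() with window.mean().

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep only the largest value of each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()   # average pooling: window.mean(); sum pooling: window.sum()
    return out

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(max_pool(fmap))  # [[6. 8.] [3. 4.]]
```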


2.3.7 Fully Connected Layer


The output matrix of the pooling layer is flattened into a vector and fed into a fully connected layer of the neural network.

Figure 2.10: The pooling layer output flattened as an FC layer input

In figure 2.10, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features together to create a model. Finally, we apply an activation function such as softmax or sigmoid to classify the outputs as cat (y1), dog (y2), car (y3), etc.

Figure 2.11: Complete CNN architecture
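The sketch below (an illustration only, with hypothetical sizes and classes) shows this last stage in PyTorch: the pooled feature maps are flattened, passed through a fully connected layer, and softmax turns the scores into probabilities.

```python
import torch
import torch.nn as nn

pooled = torch.randn(1, 16, 5, 5)       # hypothetical pooled feature maps: 16 channels of 5 x 5

flatten = nn.Flatten()                   # 16 * 5 * 5 = 400-dimensional vector (x1, x2, x3, ...)
fc = nn.Linear(16 * 5 * 5, 3)            # 3 illustrative classes: cat, dog, car

logits = fc(flatten(pooled))
probs = torch.softmax(logits, dim=1)     # probabilities between 0 and 1 that sum to 1
print(probs)
```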

2.3.8 Anchors
We denote the reference bounding box as “anchor box”, which is also called “anchor” for
simplicity. However,it is called too “default box”. [28]
Anchor boxes are used in computer vision object detection algorithms to help locate objects.
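As a generic illustration (not the anchor design of any particular method discussed later), the sketch below tiles one square anchor at the center of every cell of a feature map; the feature-map size, stride and anchor size are hypothetical.

```python
import numpy as np

def tile_anchors(feature_size, stride, anchor_size):
    """Place one square anchor (x1, y1, x2, y2) at the center of each feature-map cell."""
    boxes = []
    for i in range(feature_size):
        for j in range(feature_size):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride   # cell center in image pixels
            half = anchor_size / 2.0
            boxes.append([cx - half, cy - half, cx + half, cy + half])
    return np.array(boxes)

anchors = tile_anchors(feature_size=32, stride=32, anchor_size=64)
print(anchors.shape)  # (1024, 4): one reference box per cell of a 32 x 32 feature map
```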

2.3.9 Resnet50
ResNet (Residual Network) [10], the winner of the ImageNet challenge in 2015, has been used as an architecture for many computer vision works; it allows deep networks with various numbers of layers, up to 150+ layers, to be trained successfully. ResNet-50 is a pretrained deep learning model of the Convolutional Neural Network (CNN, or ConvNet) family.
What characterizes a residual network is its identity connections, which take the input directly to the end of each residual block, as shown by the curved arrow in figure 2.12.

Figure 2.12: Residual Network architecture

Specifically, the ResNet-50 model consists of 5 stages, each with a residual block. Each block has 3 layers with both 1×1 and 3×3 convolutions. The concept of residual blocks is quite simple: in traditional neural networks, each layer feeds only the next layer; in a network with residual blocks, each layer also feeds the layers about 2–3 hops away directly, through the identity connections. ResNet solves the problem of vanishing gradients, which occurs when the gradient becomes so small that the weights are no longer updated effectively, which may completely stop the neural network from training further.
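Assuming the torchvision package is available, a pretrained ResNet-50 can be loaded in a few lines; this is only a usage sketch, not part of the thesis pipeline.

```python
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)   # downloads ImageNet weights on first use
model.eval()

x = torch.randn(1, 3, 224, 224)            # one RGB image of size 224 x 224
with torch.no_grad():
    scores = model(x)                      # 1000 ImageNet class scores
print(scores.shape)                        # torch.Size([1, 1000])
```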

2.3.10 Data Augmentation [23]


Deep convolutional neural networks have performed remarkably well on many computer vision tasks. However, these networks are heavily reliant on big data to avoid overfitting, which refers to the phenomenon where a network learns a function with very high variance so as to perfectly model the training data. Unfortunately, many application domains, such as medical image analysis, do not have access to big data. Data Augmentation is a data-space solution to this problem of limited data: it encompasses a suite of techniques that enhance the size and quality of training datasets, such as geometric transformations, color space augmentations, kernel filters, mixing images, random erasing and feature space augmentation, so that better deep learning models can be built using them.


2.3.11 Data augmentation methods in computer vision


2.3.11.1 Mirroring

Perhaps the simplest data augmentation method is mirroring along the vertical axis. If we have an example in our training set, we can flip it horizontally to get the image on the right. For most computer vision tasks, if the left picture is a cat then its mirror image is still a cat. Hence, if the mirroring operation preserves whatever we are trying to recognize in the picture, this is a good data augmentation technique to use. (Figure 2.13)

Figure 2.13: Data augmentation – Mirroring

2.3.11.2 Random cropping

Another commonly used technique is random cropping. From the given dataset we pick a few random crops. Random cropping is not a perfect data augmentation method: what if we randomly end up taking a crop that does not look much like a cat? In practice it works well as long as our random crops are reasonably large subsets of the original image. (Figure 2.14)

Figure 2.14: Data augmentation – Random cropping


2.3.11.3 Color shifting

Another type of data augmentation that is commonly used is color shifting. For the picture below, let us say we add different distortions to the red, green and blue channels; in this example we add to the red and blue channels and subtract from the green channel (Figure 2.15). By introducing these color distortions we make our learning algorithm more robust to changes in the colors of our images.

Figure 2.15: Data augmentation – Color shifting
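The three techniques above can be combined in a single pipeline; the torchvision sketch below is only illustrative (the parameter values are hypothetical). Note that for face detection, flips and crops also require the bounding-box annotations to be transformed accordingly.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: mirroring, random cropping and color shifting.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # mirroring
    transforms.RandomResizedCrop(224),                         # random cropping
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3),                    # color shifting
    transforms.ToTensor(),
])

# augmented = augment(pil_image)   # apply to a PIL image loaded elsewhere
```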

2.3.12 Image Annotation [1]


Image annotation is one of the most important tasks in computer vision. With numerous
applications, computer vision essentially strives to give a machine eyes – the ability to see and
interpret the world. Image annotation is the human-powered task of annotating an image with
labels. These labels are predetermined by the AI engineer and are chosen to give the computer
vision model information about what is shown in the image. Depending on the project, the
amount of labels on each image can vary. Some projects will require only one label to represent
the content of an entire image (image classification). Other projects could require multiple
objects to be tagged within a single image, each with a different label. For example, in our project we need image annotations of the faces in each image, together with the coordinates of their bounding boxes, in order to do the training.
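A common way to store such annotations is a Pascal VOC-style XML file with one <object> entry per face; the sketch below (hypothetical file names and coordinates, written with Python's standard library) shows the general idea.

```python
import xml.etree.ElementTree as ET

def write_voc_annotation(filename, width, height, boxes, out_path):
    """Write a minimal Pascal VOC-style XML file with one <object> per face box."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for (xmin, ymin, xmax, ymax) in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = "face"
        bnd = ET.SubElement(obj, "bndbox")
        ET.SubElement(bnd, "xmin").text = str(xmin)
        ET.SubElement(bnd, "ymin").text = str(ymin)
        ET.SubElement(bnd, "xmax").text = str(xmax)
        ET.SubElement(bnd, "ymax").text = str(ymax)
    ET.ElementTree(root).write(out_path)

# Hypothetical frame with one annotated face.
write_voc_annotation("frame_001.jpg", 640, 480, [(120, 80, 180, 150)], "frame_001.xml")
```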


2.3.13 Non-maximum Suppression (NMS) [11]


Once the detector outputs a large number of bounding boxes, it is necessary to pick the best ones; NMS is the most commonly used algorithm for this task.
Input: a list of proposal boxes B, the corresponding confidence scores S, and an overlap threshold N.
Output: a list of filtered proposals D.
Algorithm:

1. Select the proposal with highest confidence score, remove it from B and add it to the
final proposal list D. (Initially D is empty).

2. Now compare this proposal with all the proposals — calculate the Intersection over Union
(see Figure 2.16) (IoU) of this proposal with every other proposal. If the IOU is greater
than the threshold N, remove that proposal from B.

3. Again take the proposal with the highest confidence from the remaining proposals in B
and remove it from B and add it to D.

4. Once again calculate the IoU of this proposal with all the proposals in B and eliminate the boxes that have a higher IoU than the threshold.

5. This process is repeated until there are no more proposals left in B.

IOU calculation is actually used to measure the overlap between two proposals.

Figure 2.16: Intersection over Union
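A direct NumPy implementation of the algorithm above (an illustrative sketch, not the thesis code); boxes are assumed to be given as NumPy arrays in (x1, y1, x2, y2) format, with one confidence score per box.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold):
    """Keep the highest-scoring box, drop overlapping ones, and repeat (steps 1-5 above)."""
    order = scores.argsort()[::-1]            # proposals sorted by decreasing confidence
    keep = []
    while order.size > 0:
        best = order[0]                       # highest remaining confidence
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= threshold]   # remove proposals whose IoU exceeds N
    return keep
```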

2.3.14 The Jaccard Similarity


Also called the Jaccard Index or Jaccard Similarity Coefficient, it is a classic measure of similarity between two sets, introduced by Paul Jaccard in 1901. Given two sets A and B, the Jaccard Similarity is defined as the size of the intersection of A and B (i.e., the number of common elements) over the size of the union of A and B (i.e., the number of unique elements).


js(A, B) = |A ∩ B| / |A ∪ B|        (2.1)
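For two finite sets this is a one-liner in Python (the values below are illustrative); applied to the areas covered by two bounding boxes, the same ratio is exactly the IoU of the previous section.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B| (equation 2.1)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3, 4}, {3, 4, 5}))  # 2 common / 5 unique elements = 0.4
```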

2.3.15 Xavier initialization


Assigning the network weights before starting the training seems to be a random process when we do not know anything about the data, so an initialization method called Xavier initialization was introduced to save the day. The idea is to randomize the initial weights so that the inputs of each activation function fall within the sweet range of that activation function; ideally, none of the neurons should start in a saturated situation.
Xavier initialization helps signals reach deep into the network.

• If the weights in a network start too small, then the signal shrinks as it passes through
each layer until it’s too tiny to be useful.

• If the weights in a network start too large, then the signal grows as it passes through
each layer until it’s too massive to be useful.

Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
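In PyTorch this initialization is available out of the box; the short sketch below (layer sizes are arbitrary) applies Xavier uniform initialization to one linear layer.

```python
import torch.nn as nn

layer = nn.Linear(256, 128)              # arbitrary layer sizes for illustration

# Xavier (Glorot) uniform initialization: the range of the weights depends on the
# number of input and output units, keeping activations in a reasonable range.
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

print(layer.weight.abs().max())          # bounded by sqrt(6 / (fan_in + fan_out))
```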

2.4 Epoch vs Batch Size vs Iterations [22]


To understand the difference between these terms, we first need to know some machine learning terms such as gradient descent.

2.4.1 Gradient Descent


Gradient descent is an algorithm that searches for the best result (the minimum of a curve); machine learning uses it to move toward optimal results over many iterations. Gradient descent has a parameter called the learning rate. As shown in Figure 2.17, the steps are initially big, and as the point goes down toward the minimum the steps become smaller.
Since machine learning works with big data, it is impossible to pass all of it to the computer in one step. Three terminologies have therefore been created to solve this problem: epochs, batch size and iterations. The data is divided into small portions that are given to the network one by one, and the weights are updated at the end of every round to fit the data seen.
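A toy sketch of gradient descent on a one-dimensional function (the function and learning rate are illustrative): the update repeatedly moves the weight against the gradient until it settles near the minimum.

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)                 # derivative of (w - 3)^2

w = 0.0
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * grad(w)    # move against the gradient
print(round(w, 4))                     # close to 3.0: steps shrink as the gradient shrinks
```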

2.4.2 Epochs
When the entire dataset has been passed forward and backward through the neural network once, that is one epoch.


Figure 2.17: Gradient descent optimisation.

Yet, if the dataset is too big to be processed at once, the epoch needs to be divided into several smaller batches that are passed through the neural network one after the other.

2.4.2.1 Why do we use more than one epoch?

Passing the entire dataset through a neural network once is not enough; the full dataset needs to be passed multiple times through the same neural network to reach the optimal weights.
One epoch leads to the underfitting curve in the graph below (Figure 2.18).

Figure 2.18: Type of curves that network pass by.

As the number of epochs increases, the weights are updated more times in the neural network and the curve goes from underfitting to optimal and then to overfitting.

2.4.3 Batch Size


The batch size is the number of training examples processed in one batch; it is obtained by dividing the total dataset size by the number of batches (iterations) that fits the machine. As we cannot pass the entire dataset to the neural network at once, we divide the dataset into a number of batches, sets or parts.


2.4.4 Iterations
In order to train on the whole dataset, which is usually big in machine learning, we need a specific number of iterations, each processing one of the small batches into which the initial dataset was divided.
Note: the number of batches is equal to the number of iterations for one epoch.
Example: let us say we have 4000 training examples in our dataset. We can divide the dataset of 4000 examples into batches of 250; it will then take 16 iterations to complete 1 epoch (4000/250 = 16).
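The relationship between the three terms can be written directly in code; the loop body below is only a placeholder for the real forward and backward passes.

```python
dataset_size = 4000
batch_size = 250
iterations_per_epoch = dataset_size // batch_size   # 16, as in the example above
epochs = 5                                          # illustrative number of epochs

for epoch in range(epochs):
    for iteration in range(iterations_per_epoch):
        start = iteration * batch_size
        batch = range(start, start + batch_size)    # indices of one batch of examples
        # forward pass, loss computation, backward pass and weight update go here

print(iterations_per_epoch)   # 16 iterations complete 1 epoch
```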

Chapter 3
Related Work

”We have indeed created humankind in the best of moulds”

Quran 95:4 (Surat At-Tin, The Fig)

In this chapter we carry out a comparative study between existing methods related to our research, and then select the method that gives the best results; that method will be explored in more detail in the next chapter.

3.1 Body-based Face Detection (BFD) on the UCCS dataset [16]
The Face Detection Data Set and Benchmark (FDDB) database contains 2845 images with a total of 5171 faces. These images were collected from the Yahoo! News website and were later cleaned and annotated. The WIDER FACE dataset consists of 32203 images with 393703 labeled faces, but most of them are not representative of face images collected in a surveillance scenario: in most face detection or recognition databases the majority of images are “posed”. The authors therefore perform a comparative face detection analysis with their body-based face detection method on the UCCS dataset. In the UnConstrained College Students (UCCS) dataset, subjects are photographed using a long-range high-resolution surveillance camera without their knowledge. The faces in these images show various poses and varying levels of blurriness and occlusion. [2]

Why did we choose it? The most interesting point is that the UCCS dataset is a collection of images captured in a surveillance scenario, which makes it very relevant to our subject.


3.1.1 The characteristics of Body-based Face Detection (BFD) proposed by Cao et al.
• Face detection in real time regardless of the number of faces in an image.

• Their method is based on Convolutional Neural Networks (CNNs).

• This method efficiently detects the 2D pose of multiple people in an image.

• The face detection algorithm is based on detected joints on face.

3.1.2 The processing of the real-time face detection approach


1. Extract the coordinates of the joints, such as the shoulder center, waist and nose, etc. If a joint is undetected, its coordinates are set to null.

2. Apply frontal/side face detection based on the detected joints and draw the boundary boxes for all detected frontal/side faces. The detection is based on the information of the five joints on the face (nose, left eye, right eye, left ear and right ear). In order to reduce false alarms, a confidence threshold is set: the threshold is applied to all detected joints of the face, and the joints whose confidence is lower than the threshold are deleted. The authors consider all the different detection situations (angles of the face) and build a well-defined frontal/side face detection rule.

3. Apply a boundary-box size check to the detected faces in order to decrease the false alarm rate. Two thresholds (thre_min and thre_max) are set for checking the size of the boundary box of each face: if size(boundary box) > thre_max or size(boundary box) < thre_min, the detected face is deleted.

4. Finally, a skin detector was trained using part of the training set, which helped to remove more false alarms. (An illustrative sketch of steps 2 and 3 is given after this list.)
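The sketch below is only an illustration of the confidence-threshold and box-size checks of steps 2 and 3 (it is not the authors' code); the joint names, data layout and threshold values are hypothetical.

```python
def face_box_from_joints(joints, conf_thresh=0.4, thre_min=100, thre_max=40000):
    """Keep face joints above a confidence threshold, build a box around them,
    and reject boxes whose area is outside [thre_min, thre_max]."""
    face_joints = ("nose", "left_eye", "right_eye", "left_ear", "right_ear")
    pts = [(x, y) for name, (x, y, conf) in joints.items()
           if name in face_joints and conf >= conf_thresh]
    if not pts:
        return None                                   # no reliable face joints found
    xs, ys = zip(*pts)
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    area = (x2 - x1) * (y2 - y1)
    if area < thre_min or area > thre_max:
        return None                                   # boundary-box size check (step 3)
    return (x1, y1, x2, y2)
```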

3.1.3 Experiments
• Out of 23350 faces: 95% detection with 5000 false alarms.

• Out of 11110 faces: 92% detection with 4000 false alarms.

• For precision and recall on the training set, the Area Under the Curve (AUC) is 0.94.

• For precision and recall on the validation set, the AUC is 0.944.


3.1.4 Results
• This methods has the problem of many false alarms.

• We know in surveillance scenario the camera is usually placed at a distance to capture


a wide angle and set above human height. This installation usually allows to capture
images containing multiple faces with the whole body, because of the distance and height
of the position of the camera position. For this reason, authors think, taking the cue from
the body detection can help us more to effectively find the face region.

3.2 Selective Refinement Network for High Performance Face Detection [4]
The Selective Refinement Network (SRN) introduces novel two-step classification and regression operations.
The SRN consists of two modules:

1. The Selective Two-step Classification (STC) module aims to filter out most simple negative anchors from the low-level detection layers to reduce the search space; it has two classes (face or background).

2. The Selective Two-step Regression (STR) module adjusts the locations and sizes of anchors from the high-level detection layers to provide better initialization for the subsequent regressor.

The authors also design a Receptive Field Enhancement (RFE) module, which helps to better capture faces in some extreme poses. The RFE module is responsible for providing the necessary information to predict the classification and location of objects.
The face detector used in this work is based on RetinaNet and is trained on the WIDER FACE dataset.

Why did we choose it? The SRN method works on blurred, small and variously posed faces and gives high results, so it may be useful in our case because we are going to train on faces under these conditions.

3.2.1 Main contributions to the face detection studies


• They present an STC module to filter out most simple negative samples from low level
layers to reduce the classification search space.

• They design an STR module to coarsely adjust the locations and sizes of anchors from
high level layers to provide better initialization for the subsequent regressor.


• They introduce an RFE module to provide more diverse receptive fields for detecting
extreme-pose faces.

• They achieve state-of-the-art results on AFW, PASCAL face, FDDB, and WIDER FACE
datasets.

• They work with anchors, including small anchors, to detect small faces.

3.2.2 Experiments and Results


• Experimental results of applying the two-step classification to each pyramid level indicate that it improves the performance, especially on tiny faces.

• After using the STC module, the AP scores of the detector are improved from 95.1%, 93.9% and 88% to 95.3%, 94.4% and 89.4% on the Easy, Medium and Hard subsets, respectively. To verify whether the improvements come from reducing false positives, the number of false positives is counted under different recall rates: the STC effectively reduces the false positives across different recall rates, demonstrating the effectiveness of the STC module.

• The STR module produces much better results than the baseline, with 0.8%, 0.9% and 0.8% AP improvements on the Easy, Medium and Hard subsets. STR also produces more accurate localization and consistently more accurate detection results than the baseline method.

• When STC and STR are coupled, the performance is further improved to 96.1%, 95.0% and 90.1% on the Easy, Medium and Hard subsets, respectively.

• The RFE is used to diversify the receptive fields of the detection layers in order to capture faces with extreme poses. RFE consistently improves the AP scores on the different subsets, i.e., by 0.3%, 0.3% and 0.1% AP on the Easy, Medium and Hard categories. These improvements can mainly be attributed to the diverse receptive fields, which are useful to capture faces in various poses for better detection accuracy.

3.3 Feature Agglomeration Networks for Single Stage Face Detection [27]
The key idea of this work is to exploit the inherent multi-scale features of a single convolutional neural network by aggregating higher-level semantic feature maps of different scales as contextual cues to augment lower-level feature maps, in a hierarchical agglomeration manner and at marginal extra computation cost, while using a Hierarchical Loss to effectively train the FANet model.


They evaluate the proposed FANet detector on several public face detection benchmarks, including the PASCAL face, FDDB and WIDER FACE datasets, and achieve state-of-the-art results. Their detector can run in real time for VGA-resolution images on a GPU.

Why did we choose it? The reason is to see the effect of using contextual information in detecting faces, especially when faces are small and appear at different scales, so it may help our study.

3.3.1 The contributions of this method


• They introduce an Agglomeration Connection module to enhance the feature representation power in high-resolution shallow layers.

• They propose a simple yet effective framework of Feature Agglomeration Networks (FANet) for single stage face detection, which creates a new hierarchical effective feature pyramid with rich semantics at all scales.

• An effective Hierarchical Loss based training scheme is presented to train the proposed
FANet model in an end-to-end manner, which guides a more stable and better training
for discriminative features.

• Comprehensive experiments are carried out on several public Face Detection benchmarks
to demonstrate the superiority of the proposed FANet framework, in which promising
results show that the FANet detector not only achieves the state-of-the-art performances
but also runs efficiently with real-time speed on GPU.

• In this work the authors propose a new Agglomeration Connection module which can aggregate multi-scale features more effectively than the skip connection module. Besides, they also introduce a novel Hierarchical Loss on the proposed FANet framework, which enables this powerful detector to be trained effectively and robustly in an end-to-end manner.

3.3.2 The final FANet model and its results


• The final model is trained with a 3-level Hierarchical Loss.

• WIDER FACE is a very challenging face benchmark and the results strongly prove the
effectiveness of FANet in handling high scale variances, especially for small faces.

• The final FANet improves by +3.9% over vanilla S3FD while still reaching real-time speed.

• FANet introduces two key novel components: the “Agglomeration Connection” module for context-aware feature enhancing and multi-scale feature agglomeration with a hierarchical structure, which effectively handles scale variance in face detection; and the Hierarchical Loss to guide a more stable and better training in an end-to-end manner.


• On the WIDER FACE dataset, the FANet model is robust to blur, occlusion, pose, expression, makeup, illumination, etc., and it is also able to handle faces with a wide range of face scales, even extremely small faces.

• On the FDDB dataset, the FANet model is robust to occlusion and scale variance.

3.4 FaceBoxes: A CPU Real-time Face Detector with High Accuracy [31]
FaceBoxes is a challenging work that runs on a CPU to achieve a real-time face detector while fulfilling two important requirements: 1) real-time speed; 2) maintained high performance, despite the large search space of possible face positions and face sizes and the face/non-face classification problem.

Why did we choose it? It focuses on detecting small faces and on the trade-off between accuracy and efficiency; besides that, the idea of achieving remarkable real-time results on a CPU is a really challenging piece of work nowadays.

3.4.1 The contributions of this method


• Developing a novel face detector (DCFPN) with high performance as well as CPU real-time speed, by using Rapidly Digested Convolutional Layers (RDCL) which quickly reduce the spatial size by 16 times with narrow but large convolution kernels.

• Designing a lightweight-but-powerful network with both efficiency and accuracy in mind, by using Densely Connected Convolutional Layers (DCCL) which enrich the receptive field to learn visual patterns for different scales of faces and combine coarse-to-fine information.

• Proposing a fair L1 loss and using a dense anchor strategy to handle small faces well, which uniformly tiles several anchors around the center of one receptive field instead of tiling only one.

• Achieving state-of-the-art performance on common benchmark datasets at a speed of 20 FPS on CPU and 125 FPS on GPU for VGA images.

• Using the challenging WIDER-FACE dataset in training and PASCAL, AFW, FDDB
datasets in testing.

• Using VGA-resolution images to detect faces ≥ 40 pixels.


3.4.2 Results
• The fair L1 loss is promising: +0.7%, owing to locating small faces well.

• The dense anchor strategy is effective: +0.8% shows the importance of this strategy.

• The designed architecture is crucial: +0.5% demonstrates the effectiveness of enriching the receptive fields and combining coarse-to-fine information across different layers.

3.5 Comparison between methods


Table 3.1 shows the performed comparative study between existing methods.

3.6 Conclusion
After seeing the results produced by the methods mentioned in this study, we find that SRN and FaceBoxes are the best methods in terms of AP values, while FANet is the best method in terms of good properties (robust to blur, occlusion, small faces, etc.) and its speed on GPU. Concerning the BFD method, we find its implementation idea brilliant, but we think that extracting a lot of data such as body joints takes more time, and it still produces a lot of false alarms.
So, we can say that the best methods for our study are SRN and FaceBoxes, because of their high average precision (AP) and their production of more accurate locations with a focus on tiny faces and extreme-pose faces; besides that, they are among the newest approaches to solving face detection problems. These two methods are discussed in more detail in the next chapter.


Training dataset:
– BFD: UCCS, a dataset with images for surveillance scenarios.
– SRN: WIDER FACE, different images with different positions.
– FANet: WIDER FACE.
– FaceBoxes: WIDER FACE.

Speed target: all four methods aim at real-time detection.

Main mechanism:
– BFD: detects body joints to find the face joints more easily.
– SRN: classification step (STC module) filters out most simple negative samples to reduce the search space.
– FANet: applies the inherent multi-scale features of a single convolutional neural network.
– FaceBoxes: shrinks images quickly and reduces the output channels with the RDCL design to reach real-time speed on a CPU device.

Refinement:
– BFD: threshold to select the box boundary and decrease the false-alarm rate.
– SRN: regression step (STR module) coarsely adjusts the locations and sizes of anchors to provide better initialization for the subsequent regressor.
– FANet: hierarchical feature pyramid with rich semantics at all scales.
– FaceBoxes: DCCL enriches the receptive field to learn visual patterns for different scales of faces, combining coarse-to-fine information to improve the recall rate and precision of detection.

Additional component:
– BFD: applies a skin detector to further reduce the false alarms.
– SRN: RFE helps to better capture faces in some extreme poses.
– FANet: effective Hierarchical Loss guides a more stable and better training for discriminative features.
– FaceBoxes: applies the new anchor densification strategy to improve the tiling density of the small anchors and guarantee the balance between different scales of anchors.

Reported results:
– BFD: validation accuracy 94.9%.
– SRN: improves the performance, especially on tiny faces; AP on the validation set: 94.4% (Easy), 95.3% (Medium), 90.2% (Hard).
– FANet: robust to blur, occlusion, pose, expression, makeup, illumination, etc., and able to handle faces with a wide range of scales, even extremely small faces; AP on the validation set: 95.6% (Easy), 94.7% (Medium), 89.5% (Hard).
– FaceBoxes: 96% on FDDB, 98.91% on AFW, 96.30% on PASCAL.

Backbone:
– BFD: VGG-19 + CNN.
– SRN: ResNet-50.
– FANet: VGG-16.
– FaceBoxes: CNN + Inception.

Speed:
– BFD: 2.37 seconds per image.
– SRN: not reported.
– FANet: 35.6 FPS.
– FaceBoxes: 20 FPS on CPU and 125 FPS on GPU.

Table 3.1: Comparative table between methods
Chapter 4
SRN and FaceBoxes

”We can only see a short distance


ahead, but we can see plenty
there that needs to be done.”

Alan Turing

In this chapter, we will discuss the two methods that we have selected for our problem: the Selective Refinement Network (SRN) and FaceBoxes, a CPU real-time and accurate unconstrained face detector. They fit the conditions and fulfil the requirements thanks to their impressive average precision results and the variety of conditions covered by their experiments.

4.1 Introduction
Face detection in surveillance scenarios requires very specific methods that can be adapted to the conditions and problems posed by the environment and the acquisition setup, such as occlusion (e.g., wearing a medical mask or a hat); large scale variation from big to small faces, because, as we know, the surveillance camera captures from a specific distance and angle that produces a wide range of face scales; different illumination conditions (daytime, night); and various facial poses, since people clearly do not look directly at the camera when they walk or shop. So, it is very important to train the machine to know all sides of the face. Blurry faces appear when the camera has a low quality or when frames are taken while the face moves very fast.
Researchers have carried out many studies, as we saw in the previous chapter. They aim to determine whether there is any face in the input image and to return the coordinates


of a bounding box close to the ground truth, while improving accuracy and recall in real time.

4.2 Selective Refinement Network for High Performance Face Detection (SRN)
4.2.1 Why SRN?
4.2.1.1 The Benchmark WiderFace

The WIDER FACE dataset is 10 times larger than existing datasets. It contains rich annotations, including occlusions, poses, event categories, and face bounding boxes [25]. Faces in this dataset are extremely challenging due to the large variations in scale, pose and occlusion, as well as plenty of tiny faces in various complex scenes, as shown in Figure 4.1 [30]. Furthermore, the WIDER FACE dataset is an effective training source for face detection [25] with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination. Because it is a challenging dataset with this high degree of variability, it was used by the SRN method.

Figure 4.1: Samples from the WIDER FACE dataset for face detection. The annotated face bounding boxes are denoted in green color.

It performs favourably against the state-of-the-art based on the average precision (AP)
across the three subsets, especially on the Hard subset which contains a large amount of small
faces. Specifically, it produces the best AP score in all subsets of both validation and testing
sets, i.e., 96.4% (Easy), 95.3% (Medium) and 90.2% (Hard) for validation set, and 95.9%


(Easy), 94.9% (Medium) and 89.7% (Hard) for testing set, surpassing all approaches, which
demonstrates the superiority of this detector. [4]
So, WIDERFACE dataset demonstrates that SRN achieves the state-of-the-art detection
performance. [4]

4.2.1.2 High performance

It is a high-performance face detector based on deep convolutional neural networks (CNNs), designed for scenes where many tiny faces exist. As a result, it performs better detection and increases the recall.

4.2.1.3 Using the Residual Networks 50 (ResNet50)

ResNet-50 has several advantages:

• ResNet makes it possible to train up to hundreds or even thousands of layers and still
achieves compelling performance. [9, 12, 20, 32]

• ResNet solves the vanishing gradient problem by using identity shortcut (skip) connections that skip one or more layers, which means that the performance won't degrade even if the network gets deeper, so it trains better. [9, 12, 20, 32]

4.2.1.4 Using the Feature Pyramid Network (FPN)

FPN is not an object detector by itself. It is a feature extractor that works with object detectors. Face detection at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (featurized image pyramids for short) form the basis of a standard solution. FPN achieves significant improvements over several strong baselines [13]; for example, it has been used in Faster R-CNN together with the Region Proposal Network (RPN) [13, 19].
FPN has several advantages, which are: [13]

• It is for object detection at different scales.

• It can run at 6 FPS on GPU.

• FPN improves AP.

• It provides a practical solution for research and applications of feature pyramids, without
the need of computing image pyramids.

• FPN has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for
ResNet-50.


• Despite the effectiveness of ResNet and Faster R-CNN, FPN shows significant improve-
ments over several strong baselines and competition winners. [13, 19]

4.2.1.5 Using two-stage methods [28]

In the current state-of-the-art, two-stage methods, e.g. Faster R-CNN, R-FCN, and FPN, have
three advantages over the one-stage methods:

1. Using two-stage structure with sampling heuristics to handle class imbalance.

2. Using two-step cascade to regress the object box parameters.

3. Using two-stage features to describe the objects.

4.2.2 How does the Selective Refinement Network (SRN) work?


The SRN is inspired by the multi-step classification and regression in RefineDet [28] and by the focal loss in RetinaNet, from which this state-of-the-art face detector has been developed.
The two-stage approach consists of two parts, where the first one (e.g., Selective Search,
EdgeBoxes, DeepMask , RPN ) generates a sparse set of candidate object proposals, and the
second one determines the accurate object regions and the corresponding class labels using
convolutional networks. [28]
This study consists of the following main contributions which are [4] :

• The Selective Two-step Classification (STC) module to filter out most simple negative
samples from low level layers to reduce the classification search space.

• The Selective Two-step Regression (STR) module to coarsely adjust the locations and
sizes of anchors from high level layers to provide better initialization for the subsequent
regressor.

• A Receptive Field Enhancement (RFE) module to provide more diverse receptive fields
for detecting extreme-pose faces.

• And achieving state-of-the-art results on the AFW, PASCAL face, FDDB, and WIDER FACE datasets.

4.2.3 The network structure


4.2.3.1 Network Structure

The overall framework of SRN is shown in figure 4.2. It consists of STC, STR, and RFB.
STC uses the first-step classifier to filter out most simple negative anchors from low level


detection layers to reduce the search space for the second-step classifier. STR applies the first-
step regressor to coarsely adjust the locations and sizes of anchors from high level detection
layers to provide better initialization for the second-step regressor. RFE provides more diverse
receptive fields to better capture extreme-pose faces. We describe each component as follows:

Figure 4.2: Network structure of SRN

4.2.3.2 Backbone

The ResNet-50 [10] has been adopted, with a 6-level feature pyramid structure, as the backbone network of SRN. The feature maps extracted from its four residual blocks are denoted as C2, C3, C4, and C5, respectively. C6 and C7 are extracted by two simple down-sampling 3 × 3 convolution layers after C5. The lateral structure between the bottom-up and the top-down pathways is the same as in (Lin et al. 2017a) [13]. P2, P3, P4, and P5 are the feature maps extracted from the lateral connections, corresponding to C2, C3, C4, and C5, which are respectively of the same spatial sizes, while P6 and P7 are simply down-sampled by two 3 × 3 convolution layers after P5. [4]
ResNet is used for deep feature extraction.
Feature Pyramid Network (FPN) is used on top of ResNet for constructing a rich
multi-scale feature pyramid from one single resolution input image.
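To make this structure more concrete, the following is a minimal PyTorch sketch (not the released SRN code) of a ResNet-50 backbone with two extra C6/C7 layers and an FPN-style top-down pathway; the 256-channel width and the exact layer composition are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SRNBackboneSketch(nn.Module):
    """Illustrative ResNet-50 + 6-level feature pyramid (not the official SRN implementation)."""
    def __init__(self, out_ch=256):
        super().__init__()
        r = torchvision.models.resnet50()
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.res2, self.res3, self.res4, self.res5 = r.layer1, r.layer2, r.layer3, r.layer4
        # C6 and C7: two 3x3 stride-2 convolutions after C5
        self.c6 = nn.Conv2d(2048, out_ch, 3, stride=2, padding=1)
        self.c7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
        # 1x1 lateral convolutions for the top-down pathway (FPN style)
        self.lat = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in (256, 512, 1024, 2048)])
        # P6 and P7: down-sampled by two 3x3 convolutions after P5
        self.p6 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        c2 = self.res2(self.stem(x))
        c3 = self.res3(c2)
        c4 = self.res4(c3)
        c5 = self.res5(c4)
        c6 = self.c6(c5)
        c7 = self.c7(c6)
        # top-down pathway with lateral connections (P5 -> P2)
        p5 = self.lat[3](c5)
        p4 = self.lat[2](c4) + F.interpolate(p5, size=c4.shape[-2:], mode='nearest')
        p3 = self.lat[1](c3) + F.interpolate(p4, size=c3.shape[-2:], mode='nearest')
        p2 = self.lat[0](c2) + F.interpolate(p3, size=c2.shape[-2:], mode='nearest')
        p6 = self.p6(p5)
        p7 = self.p7(p6)
        return (c2, c3, c4, c5, c6, c7), (p2, p3, p4, p5, p6, p7)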

4.2.3.3 Dedicated Modules

The STC module selects C2, C3, C4,P2, P3, and P4 to perform two-step classification, while
the STR module selects C5, C6, C7, P5, P6, and P7 to conduct two-step regression. The RFE
module is responsible for enriching the receptive field of features that are used to predict the
classification and location of objects.[4]


4.2.3.4 Anchor Design


At each pyramid level, there are two specific scales of anchors (i.e., 2S and 2√2S, where S represents the total stride size of each pyramid level) and one aspect ratio (i.e., 1.25). In total, there are A = 2 anchors per level and they cover the scale range of 8 to 362 pixels across levels with respect to the network's input image.

4.2.3.5 Loss Function

A hybrid loss is appended at the end of the deep architecture, which leverages the merits of the focal loss and the smooth L1 loss to drive the model to focus on harder training examples and learn better regression results.

4.2.4 Selective Two-Step Classification (STC)


• The classification predicts the probability of object presence at each spatial position of
each anchor.

• It aims to remove negative anchors so as to reduce search space for the classifier. [28]

For one-stage detectors, numerous anchors with extreme positive/negative sample ratio
(e.g., there are about 300k anchors and the positive/negative ratio is approximately 0.006% in
SRN) leads to quite a few false positives. Hence it needs another stage like RPN to filter out
some negative examples. Selective Two-step classification, inherited from RefineDet, effectively
rejects lots of negative anchors and alleviates the class imbalance problem.
Specifically, most of anchors (i.e., 88.9%) are tiled on the first three low level feature maps,
which do not contain adequate context information. So it is necessary to apply STC on these
three low level features. Other three high level feature maps only produce 11.1% anchors with
abundant semantic information, which is not suitable for STC. To sum up, the application of
STC on three low level features brings advanced results, while on three high level ones will
bring ineffective results and more computational cost. STC module suppresses the amount of
negative anchors by a large margin, leading the positive/negative sample ratio about 38 times
increased (i.e., from around 1:15441 to 1:404). The shared classification convolution module
and the same binary Focal Loss are used in the two-step classification, since both of the targets
are distinguishing the faces from the background. [30]
Therefore, the STC module selects C2, C3, C4, P2, P3, and P4 to perform two-step clas-
sification. The STC increases the positive/negative sample ratio by approximately 38 times,
from around 1:15441 to 1:404. In addition, we use the focal loss in both two steps to make
full use of samples. Unlike RefineDet [28], the SRN shares the same classification module in
the two steps, since they have the same task to distinguish the face from the background. The
experimental results of applying the two-step classification on each pyramid level are shown in


Table 4.1. Consistent with our analysis, the two-step classification on the three lower pyramid levels helps to improve performance, while on the three higher pyramid levels it is ineffective.

STC B (baseline detector) P2 P3 P4 P5 P6 P7


Easy 95.1 95.2 95.2 95.2 95.0 95.1 95.0
Medium 93.9 94.2 94.3 94.1 93.9 93.7 93.9
Hard 88.0 88.9 88.7 88.5 87.8 88.0 87.7
Table 4.1: AP performance of the two-steps classification applied to each pyramid level. [4]

The loss function for STC consists of two parts, i.e., the loss in the first step and the second
step. For the first step, calculating the focal loss for those samples selected to perform two-step
classification. For the second step, just focusing on those samples that remain after the first
step filtering. With these definitions, the loss function defined as:

L_STC({p_i}, {q_i}) = (1/N_s1) Σ_{i∈Ω} L_FL(p_i, l_i*) + (1/N_s2) Σ_{i∈Φ} L_FL(q_i, l_i*)    (4.1)
where i is the index of an anchor in a mini-batch, p_i and q_i are the predicted confidence of the anchor i being a face in the two steps, l_i* is the ground truth class label of anchor i, N_s1 and N_s2 are the numbers of positive anchors in the first and second steps, Ω represents the collection of samples selected for two-step classification, and Φ represents the sample set that remains after the first step filtering. The binary classification loss L_FL is the sigmoid focal loss over two classes (face vs. background).
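As a rough illustration of Eq. 4.1 (a sketch, not the released SRN code), the two-step classification loss could be written as below, using torchvision's sigmoid_focal_loss as the binary focal loss L_FL and boolean masks for the index sets Ω and Φ:

import torch
from torchvision.ops import sigmoid_focal_loss

def stc_loss_sketch(p_logits, q_logits, labels, omega_mask, phi_mask):
    """Illustrative two-step classification loss (Eq. 4.1); labels are binary face/background."""
    targets = labels.float()
    # first step: focal loss over the samples selected for two-step classification (Omega)
    l1 = sigmoid_focal_loss(p_logits[omega_mask], targets[omega_mask], reduction='sum')
    # second step: focal loss over the samples kept after the first-step filtering (Phi)
    l2 = sigmoid_focal_loss(q_logits[phi_mask], targets[phi_mask], reduction='sum')
    n_s1 = targets[omega_mask].sum().clamp(min=1)   # positive anchors in the first step
    n_s2 = targets[phi_mask].sum().clamp(min=1)     # positive anchors in the second step
    return l1 / n_s1 + l2 / n_s2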

4.2.5 Selective Two-Step Regression(STR)


• Improves the accuracy of bounding box locations, especially in some challenging scenes(small
faces).

After the STC has filtered anchors and reduced the search space, STR applies regression on the three higher pyramid levels. The reason for not choosing the lower pyramid levels is that: 1) the three lower pyramid levels are associated with plenty of small anchors to detect small faces; these small faces have very coarse feature representations, so it is difficult for these small anchors to perform two-step regression; 2) in the training phase, if we let the network pay too much attention to the difficult regression task on the low pyramid levels, the loss will be biased towards the regression problem and hinder the essential classification task. Meanwhile, the motivation behind this design is to make the framework more efficient: STR uses the detailed features of large faces on the three higher pyramid levels to regress more accurate locations of bounding boxes, while letting the three lower pyramid levels pay more attention to the classification task.


The loss function of STR also consists of two parts, which is shown as below:
L_STR({x_i}, {t_i}) = Σ_{i∈Ψ} [l_i* = 1] L_r(x_i, g_i*) + Σ_{i∈Φ} [l_i* = 1] L_r(t_i, g_i*)    (4.2)

where g_i* is the ground truth location and size of anchor i, x_i are the refined coordinates of the anchor i in the first step, and t_i are the coordinates of the bounding box in the second step, which locate the face's bounding box in a precise way. We can see the effectiveness of STR in Table 4.2.

STR B P2 P3 P4 P5 P6 P7
Easy 95.1 94.8 94.3 94.8 95.4 95.7 95.6
Medium 93.9 93.4 93.7 93.9 94.2 94.4 94.6
Hard 88.0 87.5 87.7 87.0 88.2 88.2 88.4
Table 4.2: AP performance of the two-step regression applied to each pyramid level. [4]

4.2.6 Receptive Field Enhancement(RFE)


• Solves the problem of the mismatch between receptive fields and the aspect ratio of faces, which affects the detection performance.

• Propose RFE to diversify receptive fields before predicting classes and locations.

• RFE replaces the middle two convolution layers in the class and box subnet of RetinaNet.

Current networks usually possess square receptive fields, which affect the detection of objects
with different aspect ratios. To address this issue, SRN designs a Receptive Field Enhancement
(RFE) to diversify the receptive field of features before predicting classes and locations, which
helps to capture faces well in some extreme poses.

4.2.7 Training, Experiments and Results


• The training dataset used is WIDER FACE.

• The data augmentation strategies adapted for faces are

– Photometric distortions.
– Expanding the images with a random factor in the interval [1, 2] by the zero-padding
operation.
– Cropping two square patches and randomly selecting one for training.
– Flipping the selected patch randomly and resizing it to 1024x1024.


Anchor Matching.

• Using Intersection over Union (IoU) to divide the samples into negative and positive anchors.

• Anchors are assigned to a ground truth face if their IoU is above the threshold θp, and to the background if their IoU is in [0, θn) (a minimal sketch of this assignment is given right after this list).
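The following is a minimal sketch of this IoU-based assignment (the thresholds θp and θn below are placeholders, not the exact values used by SRN), assuming boxes are given as (x1, y1, x2, y2):

import torch
from torchvision.ops import box_iou

def assign_anchors_sketch(anchors, gt_boxes, theta_p=0.5, theta_n=0.4):
    """Label each anchor as positive (1), background (0) or ignored (-1) from its best IoU."""
    iou = box_iou(anchors, gt_boxes)           # [num_anchors, num_gt]
    best_iou, best_gt = iou.max(dim=1)         # best ground-truth face for each anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[best_iou < theta_n] = 0             # background: IoU in [0, theta_n)
    labels[best_iou >= theta_p] = 1            # face: IoU >= theta_p
    return labels, best_gt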

Optimization.

• Loss function of SRN: L = L_STC + L_STR.

• Initializing the parameters of the newly added convolution layers with the "xavier" method.

• Fine-tune the SRN model using SGD with 0.9 momentum, 0.0001 weight decay.

• Using batch of size 32.

• Setting the learning rate to 10^-2 for the first 100 epochs, and decaying it to 10^-3 and 10^-4 for another 20 and 10 epochs, respectively.

• Implementing SRN using Pytorch library.

Inference.

• STC first filters out the regularly tiled anchors on the selected pyramid levels whose negative confidence scores are larger than the threshold θ = 0.99.

• Then STR adjusts the locations and sizes of selected anchors.

• The second step takes over these refined anchors, and outputs top 2000 high confident
detections.

• Finally, applying the non-maximum suppression (NMS) with jaccard overlap of 0.5 to
generate the top 750 high confident detections.
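As an illustration of this post-processing (using torchvision's standard NMS operator; the actual SRN implementation details may differ):

from torchvision.ops import nms

def postprocess_sketch(boxes, scores, iou_thr=0.5, pre_top_k=2000, keep_top_k=750):
    """Keep the top detections, suppress overlaps with NMS, then keep the final top-k."""
    order = scores.argsort(descending=True)[:pre_top_k]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_thr)[:keep_top_k]
    return boxes[keep], scores[keep]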

The results of the evaluation on the benchmarks are shown in Table 4.3.

4.2.8 Conclusion
The SRN method is one of the most powerful works. With its structure of modules and strategies, it achieves the face detection task with high performance on the challenging benchmark. It has two steps: the first is STC, which aims to filter out negative samples and improve the precision at high recall rates; the second is STR, which makes the location of the bounding box more accurate. Moreover, the RFE is introduced to provide diverse receptive fields to better capture faces in some extreme poses. Extensive experiments on the AFW, PASCAL face, FDDB and WIDER FACE datasets demonstrate that SRN achieves state-of-the-art detection performance.


Dataset                    Criterion                                     Value
AFW                        Average Precision (AP)                        99.87
PASCAL face                Average Precision (AP)                        99.09
FDDB                       True Positive Rate at 1000 False Positives    98.8
WIDER FACE (validation)    Average Precision (AP)                        96.4 (Easy), 95.3 (Medium), 90.2 (Hard)
WIDER FACE (test)          Average Precision (AP)                        95.9 (Easy), 94.9 (Medium), 89.7 (Hard)
Table 4.3: Evaluation on benchmarks

4.3 FaceBoxes: a CPU real-time and accurate unconstrained face detector
4.3.1 Why Faceboxes?
4.3.1.1 The Benchmark WiderFace

See Section 4.2.1.1; in addition, the authors further improve the state-of-the-art performance on the AFW, PASCAL face, and FDDB datasets.

4.3.1.2 Superior performance

• It achieves real-time speed on the CPU as well as on the GPU; regarding this challenging case of study, we can confirm the strength of this method in terms of computing speed: it takes about 0.008 s to process an image on the GPU.

• It trades off between accuracy and efficiency by shrinking the input image while focusing on detecting small faces.

• The speed of FaceBoxes is invariant to the number of faces on the image.

4.3.1.3 Applying anchor new strategy

The low recall rate on small faces was caused by the small anchors being too sparse compared to the large anchors. The authors propose a dense anchor strategy to solve this tiling density imbalance problem: it uniformly tiles several anchors around the center of one receptive field instead of tiling only one.

4.3.1.4 Powerful network structure

• The Rapidly Digested Convolutional Layers (RDCL): designed to enable FaceBoxes to achieve real-time speed on the CPU, by quickly reducing the spatial size by 16 times with narrow but large convolution kernels.


• The Multiple Scale Convolutional Layers (MSCL): aims at enriching the receptive fields
and discretizing anchors over different layers to handle faces of various scales, combining
coarse-to-fine information to improve the recall rate and precision of detection.

4.3.1.5 Solve problems of cascaded CNN

Inspired by the RPN in Faster R-CNN [19] and the multi-scale mechanism in SSD [14], they
develop a state-of-the-art face detector with real-time speed on the CPU to avoid those prob-
lems:

• Their speed is negatively related to the number of faces on the image. The speed would
dramatically degrade as the number of faces increases.

• The cascade based detectors optimize each component separately, making the training
process extremely complicated and the final model sub-optimal.

• For the VGA-resolution images (high quality), their run time efficiency on the CPU is
about 14 FPS, which is not fast enough to reach the real-time speed.

4.3.1.6 Using the Inception network in the architecture

• It uses a lot of tricks to push performance, both in terms of speed and accuracy.

• With the Inception network you gain speed and a lower computational cost, which gives the benefit of using less memory.

4.3.2 How does FaceBoxes work?


FaceBoxes was inspired by the RPN in Faster R-CNN [19] and the multi-scale mechanism in SSD [14] to develop a state-of-the-art face detector with real-time speed on the CPU. Specifically, it is a novel face detector which only contains a single fully convolutional neural network and can be trained end-to-end. The proposed method has a lightweight yet powerful network structure (as shown in Figure 4.3) that consists of the Rapidly Digested Convolutional Layers (RDCL), designed to achieve real-time speed on a CPU device, and the Multiple Scale Convolutional Layers (MSCL), which aim at enriching the receptive fields and discretizing anchors over different layers to handle various scales of faces. Besides, the authors add a new anchor densification strategy to make different types of anchors have the same density on the input image, which significantly improves the recall rate of small faces. FaceBoxes demonstrates state-of-the-art detection performance on several benchmark datasets, including AFW, PASCAL and FDDB.
The main contributions of this work can be summarized as four-fold:

• RDCL enables face detection to achieve real-time speed on the CPU.


• MSCL handles various scales of face via enriching receptive fields and discretizing anchors
over layers.

• Proposing a fair L1 loss and using a new anchor densification strategy to improve the recall rate of small faces.

• Achieve the state-of-the-art performance on the AFW, PASCAL face and FDDB datasets.

Figure 4.3: Architecture of the FaceBoxes and the detailed information table about our anchor
designs.

4.3.3 Rapidly Digested Convolutional Layers


Face detection methods based on CNNs are known to be limited by the heavy time cost when the sizes of the input, kernels and outputs are large, especially on CPU devices. To enable FaceBoxes to reach real-time speed, the RDCL was designed to quickly shrink the input spatial size by choosing suitable kernel sizes and reducing the number of output channels, as follows:

Shrinking the spatial size of the input: setting a series of large stride sizes for the convolution and pooling layers to rapidly shrink the spatial size of the input. As illustrated in Figure 4.3, the stride sizes of Conv1, Pool1, Conv2 and Pool2 are 4, 2, 2 and 2, respectively. The total stride size of the RDCL is 32, which means the input spatial size is quickly reduced by 32 times.

Choosing suitable kernel sizes: to speed up the network, the kernel sizes of the first few layers should be small, while still being large enough to alleviate the information loss brought by the spatial size reduction. As shown in Figure 4.3, to keep both effectiveness and efficiency, 7×7, 5×5 and 3×3 kernel sizes are chosen for Conv1, Conv2 and all Pool layers, respectively.


Reducing the number of output channels: We utilize the C.ReLU activation function
(illustrated in Figure 4.4(a)) to reduce the number of output channels. C.ReLU [21] is motivated
from the observation in CNN that the filters in the lower layers form pairs (i.e.,filters with
opposite phase). From this observation, C.ReLU can double the number of output channels
by simply concatenating negated outputs before applying ReLU. Using C.ReLU significantly
increases speed with negligible decline in accuracy.
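A minimal PyTorch sketch of the C.ReLU idea described above (the exact composition in FaceBoxes, e.g. the normalization and scaling around it, may differ):

import torch
import torch.nn as nn

class CReLU(nn.Module):
    """Concatenated ReLU: concatenate the negated outputs before applying ReLU,
    which doubles the number of output channels of the preceding convolution."""
    def forward(self, x):
        return torch.relu(torch.cat([x, -x], dim=1))

# e.g. a 24-channel 7x7/stride-4 convolution behaves like a 48-channel layer after C.ReLU
conv1 = nn.Sequential(nn.Conv2d(3, 24, kernel_size=7, stride=4, padding=3), CReLU())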

Figure 4.4: (a) The C.ReLU modules where Negation simply multiplies −1 to the output of
Convolution. (b) The Inception modules.

4.3.4 Multiple Scale Convolutional Layers


The MSCL is introduced to solve two problems of a plain RPN-based detector: first, the anchors in the RPN are only associated with the last convolutional layer, whose features and resolution are too weak to handle faces of various sizes; second, an anchor-associated layer is responsible for detecting faces within a corresponding range of scales, but it only has a single receptive field that cannot match different scales of faces. So the MSCL is designed along the following two dimensions:

Multi-scale design along the dimension of network depth: As shown in Figure 4.3,
the designed MSCL consists of several layers. These layers decrease in size progressively and
form the multi-scale feature maps. Similar to [14], the default anchors are associated with
multi-scale feature maps (i.e., Inception3, Conv3_2 and Conv4_2). These layers, as a multi-


scale design along the dimension of network depth, discretize anchors over multiple layers with
different resolutions to naturally handle faces of various sizes.

Multi-scale design along the dimension of network width: To learn visual patterns for
different scales of faces, output features of the anchor-associated layers should correspond to
various sizes of receptive fields, which can be easily fulfilled via Inception modules. The Incep-
tion module consists of multiple convolution branches with different kernels. These branches,
as a multi-scale design along the dimension of network width, are able to enrich the receptive fields.
As shown in Figure 4.3, the first three layers in MSCL are based on the Inception module. Fig-
ure 4.4(b) illustrates the Inception implementation, which is a cost-effective module to capture
different scales of faces.

4.3.5 Anchor densification strategy


An aspect ratio of 1:1 is chosen for the anchors (as illustrated in Figure 4.3), since the face box is approximately a square.
The anchor scales for the Inception layer are 32, 64 and 128 pixels, and for the Conv3_2 and Conv4_2 layers they are 256 and 512 pixels, respectively.
The tiling interval of an anchor on the image is equal to the stride size of the corresponding anchor-associated layer. For example, the stride size of Conv3_2 is 64 pixels and its anchor is 256 × 256, indicating that there is a 256 × 256 anchor for every 64 pixels on the input image. The tiling density of an anchor (i.e., A_density) can be defined as follows:

A_density = A_scale / A_interval    (4.3)

Here, A_scale is the scale of the anchor and A_interval is its tiling interval. The tiling intervals for the default anchors are 32, 32, 32, 64 and 128, respectively. According to Eq. 4.3, the corresponding densities are 1, 2, 4, 4 and 4, so there is obviously a tiling density imbalance problem between anchors of different scales. Compared with the large anchors (i.e., 128×128, 256×256 and 512×512), the small anchors (i.e., 32×32 and 64×64) are too sparse, which results in a low recall rate on small faces. To eliminate this imbalance, a new anchor densification strategy is proposed. Specifically, to densify one type of anchor n times, A_number = n^2 anchors are uniformly tiled around the center of one receptive field instead of tiling only one at the center of this receptive field. Some examples are shown in Figure 4.5. To improve the tiling density of the small anchors, this strategy is used to densify the 32×32 anchor 4 times and the 64×64 anchor 2 times, which guarantees that different scales of anchors have the same density (i.e., 4) on the image, so that various scales of faces can match almost the same number of anchors.
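To illustrate the n^2 uniform tiling rule, here is a small sketch; the exact anchor offsets of the official implementation may differ, this only shows the uniform-tiling idea:

import numpy as np

def densify_anchor_centers(cx, cy, stride, n):
    """Return the n*n anchor centers tiled uniformly around one receptive-field center."""
    # offsets split one stride cell into n equal parts centred on (cx, cy)
    offsets = ((np.arange(n) + 0.5) / n - 0.5) * stride
    return [(cx + dx, cy + dy) for dy in offsets for dx in offsets]

# densifying the 32x32 anchor 4 times gives 16 anchors per 32-pixel cell (density 4)
print(len(densify_anchor_centers(16, 16, 32, 4)))   # -> 16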


Figure 4.5: Examples of anchor densification. For clarity, we only densify anchors at one recep-
tive field center (i.e., the central black cell), and only color the diagonal anchors.

4.3.6 Training
Training dataset: The model is trained on 12880 images of the WIDER FACE[25] training
subset.

Data augmentation: Each training image has been treated by the following data augmen-
tation:

• Color distortion: Applying some photo-metric distortions

• Random cropping: randomly crop five square patches from the original image: one is the
biggest square patch, and the size of the others range between [0.3, 1] of the short size
of the original image. Then arbitrarily selecting one patch for subsequent operations.

• Scale transformation: After random cropping, the selected square patch is resized to 1024
× 1024.

• Horizontal flipping: The resized image is horizontally flipped with probability of 0.5.


• Face-box filter: keeping the overlapped part of the face box if its center is in the above
processed image, then filter out these face boxes whose height or width is less than 20
pixels.

Matching strategy: during training, we need to determine which anchors correspond to a face bounding box. This is done as follows:

• First match each face to the anchor with the best jaccard overlap.

• Then match anchors to any face with a jaccard overlap higher than a threshold (i.e., 0.35).

Loss function: the loss function in FaceBoxes is the same as in the RPN of Faster R-CNN [19]: a 2-class softmax loss for classification and the smooth L1 loss for regression.
Fair L1 loss: the regression target of the fair L1 loss is as follows:

t_x = x - x_a,  t_y = y - y_a,  t_w = w,  t_h = h;    (4.4)

t_x* = x* - x_a,  t_y* = y* - y_a,  t_w* = w*,  t_h* = h*;    (4.5)

where x, y, w, h denote the center coordinates, width and height; x, x_a, x* are for the predicted box, anchor box and ground-truth box, respectively (likewise for y, w, h). The scale normalization is implemented to have a scale-invariant loss value as follows:

L_reg(t, t*) = Σ_{j∈(x,y,w,h)} fairL1(t_j - t_j*)    (4.6)

where

fairL1(z_j) = |z_j| / w*  if j ∈ (x, w),  and  |z_j| / h*  otherwise    (4.7)

It treats small and big faces equally by directly regressing the box's relative center coordinates, width and height.
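A rough sketch of this regression target and fair L1 loss (Eqs. 4.4–4.7) for a single anchor; this is an illustration, not the released code:

def fair_l1_loss_sketch(pred, anchor, gt):
    """pred, anchor, gt: dicts with keys 'x', 'y', 'w', 'h' (box center, width and height)."""
    # regression targets (Eq. 4.4) and ground-truth targets (Eq. 4.5)
    t  = {'x': pred['x'] - anchor['x'], 'y': pred['y'] - anchor['y'],
          'w': pred['w'],               'h': pred['h']}
    ts = {'x': gt['x'] - anchor['x'],   'y': gt['y'] - anchor['y'],
          'w': gt['w'],                 'h': gt['h']}
    # scale normalization by the ground-truth width/height (Eqs. 4.6 and 4.7)
    loss = 0.0
    for j in ('x', 'y', 'w', 'h'):
        denom = gt['w'] if j in ('x', 'w') else gt['h']
        loss += abs(t[j] - ts[j]) / denom
    return loss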

Hard negative mining: after matching the anchors, most of them are found to be negative, which results in a significant imbalance between positive and negative examples. The negative samples are therefore sorted and only the top ones are kept for faster optimization, so that the ratio between negatives and positives is at most 3:1.

Other implementation details: all the parameters are randomly initialized with the "xavier" method. The resulting model is fine-tuned using SGD with 0.9 momentum, 0.0005 weight decay and batch size 32 (variable depending on the device capability). The maximum number of iterations is 120k: a learning rate of 10^-3 is used for the first 80k iterations, then


training continues for 20k iterations with 10^-4 and 10^-5, respectively. The method was originally implemented in the Caffe library and has also been re-implemented in the PyTorch library.

4.3.7 Experiments and results


4.3.7.1 Runtime efficiency

CNN-based methods have always had low runtime efficiency unless accelerated with a GPU. As we can see in the table below, the CPU takes up the challenge and achieves real-time efficiency thanks to the following steps:

• During inference, the method outputs a large number of boxes (e.g., 8,525 boxes for a VGA-resolution image).

• First filter out most boxes by a confidence threshold of 0.05 and keep the top 400 boxes
before applying NMS.

• Then performing NMS with jaccard overlap of 0.3 and keep the top 200 boxes.

• Then measuring the speed using Titan X (Pascal) and cuDNN v5.1 with Intel Xeon
E5-2660v3@2.60GHz.

As listed in Table 4.4, comparing with recent CNN-based methods, the FaceBoxes can run at
20 FPS on the CPU with state-of-the-art accuracy. Besides, it can run at 125 FPS using a
single GPU and has only 4.1 MB in size.

Approach CPU-model mAP(%) FPS


ACF i7-3770@3.40 85.2 20
CasCNN E5-2620@2.00 85.7 14
FaceCraft N/A 90.8 10
STN i7-4770K@3.50 91.5 10
MTCNN N/A@2.60 94.4 16
Faceboxes E5-2660v3@2.60 96.0 20
Table 4.4: Overall CPU inference time and mAP compared on different methods. The FPS is
for VGA-resolution images on CPU and the mAP means the true positive rate at 1000 false
positives on FDDB. Notably, for STN, its mAP is the true positive rate at 179 false positives
and with ROI convolution, its FPS can be accelerated to 30 with 0.6% recall rate drop. [31]

4.3.7.2 Model analysis

Extensive ablation experiments were applied to the FaceBoxes model on AFW, PASCAL and FDDB. The FDDB results are the most convincing because it is the most difficult dataset.


Ablative setting: to better understand FaceBoxes, each component is ablated one after another to examine how each proposed component affects the final performance and to show that none of them is dispensable. The test proceeds as follows:

• First, the ablation of the anchor densification strategy.

• Then replacing MSCL with three convolutional layers, which all have 3×3 kernel size and
whose output number is the same as the first three Inception modules of MSCL.

• Finally, replace C.ReLU with ReLU in RDCL.

The ablative results are listed in Table 4.5

Contribution       FaceBoxes   variant 2   variant 3   variant 4
RDCL               ✓           ✓           ✓           ×
MSCL               ✓           ✓           ×           ×
Strategy           ✓           ×           ×           ×
Accuracy (mAP)     96.0        94.9        93.9        94.0
Speed (ms)         50.98       48.27       48.23       67.48
Table 4.5: Ablative results of FaceBoxes on the FDDB dataset (× indicates that the component is ablated in that column). Accuracy (mAP) means the true positive rate at 1000 false positives. Speed (ms) is for VGA-resolution images on the CPU. [31]

Discussion of the ablation results: the ablation experiment shows how important each module is in the FaceBoxes method.

Anchor densification strategy is crucial: the anchor densification strategy is used to increase the density of small anchors (i.e., 32 × 32 and 64 × 64) in order to improve the recall rate of small faces and to fix the imbalance between the different anchor sizes. From the results listed in Table 4.5, we can see that the mAP on FDDB is reduced from 96.0% to 94.9% after ablating the anchor densification strategy. The sharp decline (i.e., 1.1%) demonstrates the effectiveness of the proposed anchor densification strategy.

MSCL is better: The comparison between the second and third columns in Table 4.5
indicates that MSCL effectively increases the mAP by 1.0%, owning to the diverse receptive
fields and the multi-scale anchor tiling mechanism.

RDCL is efficient and accuracy-preserving: the design of the RDCL enables FaceBoxes to achieve real-time speed on the CPU. As reported in Table 4.5, the RDCL leads to a negligible decline in accuracy but a significant improvement in speed. Specifically, the FDDB mAP decreases by 0.1% in return for an approximately 19.3 ms speed improvement.


Component                DCFPN   variant 2   variant 3   variant 4
Designed architecture    *       *           *
Dense anchor strategy    *       *
Fair L1 loss             *
Accuracy (mAP)           99.2    94.5        93.7        93.2
Table 4.6: Results of ablating each component of the method besides the loss function, where DCFPN = Architecture + Strategy + Loss (* indicates that the component is included) [31]

A second ablation experiment (Table 4.6) shows the following results:

• Fair L1 loss is promising: +0.7%, owing to locating small faces well.

• Dense anchor strategy is effective: +0.8% shows the importance of this strategy.

• Designed architecture is crucial: +0.5% demonstrates the effectiveness of enriching the receptive fields and combining coarse-to-fine information across different layers.

4.3.7.3 Evaluation on benchmark

FaceBoxes was evaluated on the common face detection benchmark datasets, including Annotated Faces in the Wild (AFW), PASCAL Face, and the Face Detection Data Set and Benchmark (FDDB).

AFW dataset: it has 205 images with 473 faces. FaceBoxes achieves 98.91% on it (see the result in Figure 4.6).

PASCAL face dataset: it is collected from the test set of the PASCAL person layout dataset and consists of 1335 faces with large face appearance and pose variations from 851 images. FaceBoxes achieves 96.30% on it.

FDDB dataset: it has 5171 faces in 2845 images taken from news articles on Yahoo websites. FaceBoxes achieves 96% on the discontinuous ROC curve and 82.9% on the continuous ROC curve.

4.3.8 Conclusion
Achieving real time on a CPU device was a challenging issue, and FaceBoxes worked on improving performance and producing high results despite the usual disadvantages of CNN-based methods. It takes a big step forward with its structure: the RDCL to achieve time performance, the MSCL to enrich the receptive field and learn different face scales, and a new anchor densification strategy proposed to improve the recall rate of small faces. The experiments demonstrate the state of the art by achieving 20 FPS on CPU and 125 FPS on GPU, besides giving high accuracy on the AFW, PASCAL and FDDB datasets.


Figure 4.6: FaceBoxes on the AFW dataset.

4.4 Conclusion
In this chapter we discussed the two methods, SRN and FaceBoxes, in more detail. After seeing their high results, the conditions they handle and the experiments applied, we believe they can be efficient for our study. We are going to train FaceBoxes on our dataset, which is a collection of pictures taken from surveillance scenarios, test it as well on a surveillance video, and see what we will get.

Chapter 5
Implementation

”Good fortune is what happens


when opportunity meets with
planning.”

Thomas Edison

In this chapter we will highlight our contribution in the field of face detection: which method is used in which environment to detect faces in real time in surveillance scenarios, in order to achieve good performance in terms of accuracy and recall.

5.1 Introduction
Computer vision is a field with many requirements for achieving good performance: not only which network, structure or method is used, but also which machine is going to process the large amount of data, whether it is fast enough, and whether the memory capacity can hold all the input simultaneously. So, it is a global package that needs to be complete to reach the target. We performed the experiments with the FaceBoxes method in our environment, using our dataset for training and testing; let's see what we got in the following sections.

5.2 The method used in the implementation


The method that we selected between SRN and FaceBoxes is FaceBoxes, and there are several reasons for this choice:


• Among the SRN method's requirements, the memory capacity of the GPU should be at least 11 GB, while our GPU has only 2 GB.

• When we run the testing code with the WIDER FACE test dataset, it gives a result of AP = 0, which makes it hard to judge the performance of the method (see Figure 5.1).

• The training code of SRN has not been released yet, and its testing code is tied to WIDER FACE only, which makes it difficult to adapt it to our dataset.

• For FaceBoxes, both the training and testing codes are available and we can modify them.

• Faceboxes method can be run in both CPU and GPU devices.

Figure 5.1: SRN result in wider-face testing.

5.3 Environment
5.3.1 Operating system
Both methods need a Linux system to be executed, so we chose Ubuntu 18.04 LTS as the operating system to have a compatible environment for the requirements, which are:

• The shell files.

• Cuda and cudnn.

• Pytorch library.

To avoid many errors during execution, it is better to be careful when choosing the version of each package to install.


5.3.2 Which programming language


• Python was the common programming language of both methods. Ubuntu 18.04 comes with Python 3.6, which is the needed version.

• Working in an Anaconda environment to avoid library import errors; it was the solution for many errors, and when installing new libraries it brings the extra dependencies with them.

• Using GitHub as the support between us and our supervisor, so that he can correct our errors and follow our progress.

The original FaceBoxes code was implemented in Caffe and then re-implemented with the PyTorch library. We chose the latter because it is more familiar to us.

5.3.3 Used machine


• RAM: 12 GB.

• Processor: Intel Core i7-4500U CPU @ 1.80 GHz × 4.

• Graphics: GeForce GT 740M/PCIe/SSE2.

5.4 Creating our dataset


We created a dataset of 1277 training images containing 4537 faces, 425 testing images containing 1342 faces, and 205 video-test frames with 386 faces.

5.4.1 Starting idea


The idea of creating our own dataset came when we were looking for one with surveillance scenarios and conditions (specific angle and distance). Some existed but were either not available or very old. Besides that, the FaceBoxes method works with images labelled in the VOC XML format, which means that even with a ready-made dataset we would need to annotate it again, so why not build a full dataset that fits our case.
The idea of a video dataset is a bit more difficult to implement and requires more changes and further study. We will see more details in the following sections.

5.4.2 Collecting images


This phase is a very important step. To collect the right data for our case we need a lot of
work to have an impeccable dataset.


5.4.2.1 From where we get the data?

• We downloaded more than 24 videos for training and 14 videos for testing from YouTube security camera footage, and one video from a supermarket that we used for testing.

• 139 images from different datasets (PASCAL, FDDB, AFW) and 126 images from the WIDER FACE dataset. These images are acquired in different conditions (see Figure 5.2), which is helpful to feed the network; they were selected carefully:

– Choosing images where the faces appear at different scales.

– From a specific angle.

– Collecting full faces so the machine learns how a face looks in a close-up position.

– People usually do not look directly at the camera most of the time, and the camera captures the face from its side, so including such samples is a must.

Figure 5.2: Collected images from different datasets.

5.4.2.2 How do we get frames?

As we said before, the videos were downloaded from YouTube. So, how do we extract frames from them and collect the best ones?


• We wrote a simple code using the OpenCV library which extracts one frame every 2 seconds from the video:

import cv2

# load the downloaded video
vidcap = cv2.VideoCapture('videotest/12.mp4')

def getFrame(sec, i):
    vidcap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    hasFrames, image = vidcap.read()
    if hasFrames:
        # save the frame as a JPG file
        cv2.imwrite("t" + str(i) + ".jpg", image)
    return hasFrames

i = 0
sec = 0
# it will capture one image every 2 seconds
frameRate = 2
success = getFrame(sec, i)
while success:
    sec = sec + frameRate
    sec = round(sec, 2)
    i += 1
    success = getFrame(sec, i)

• After extracting the videos into frames, we keep only the ones which contain faces and delete the rest (see Figure 5.3 for samples of the selected frames).

5.4.3 Frames annotation


This is the hardest and most important step, especially since we did it manually, which took a lot of time and concentration. Doing it manually has one strong justification: a human can see all the faces in the image and label them, which produces higher accuracy, whereas automatic labelling is not yet flawless and still misses faces. Here is how we did it: we used an available program named MakeSense.


Figure 5.3: Samples of frames from our dataset.

MakeSense: an open-source tool used to annotate images, released under the GPLv3 license. It does not require any advanced installation; you just need a web browser to run it (open-source, free, web-based). The user interface is simple and easy to use. MakeSense supports multiple annotation types: bounding box, polygon and point annotation. You can export the labels in different formats, including YOLO, VOC XML, VGG JSON and CSV. In our case we need VOC XML, which is required by FaceBoxes. Here is a step-by-step guide to using the MakeSense annotation tool:

1. Go to the link http://www.makesense.ai

2. Click the 'Get Started' button to go to the annotation page, where you can upload the images you want to annotate (maximum of 600 images at a time).

3. After selecting and uploading images, click “Object Detection” button.

4. Since we do not have any labels loaded, we create the label for our project, which is 'face'. To add a new label, click the '+' sign in the top left corner of the message box, enter the label in the 'Insert Label' text field, then click 'Start project'. Then select 'doing it by my own' and start labelling (see Figure 5.4).


5. After annotating all the images, it's time to export the labels. To export, click the 'Export Labels' button at the top right of the page, and select the XML VOC format.

Figure 5.4: Image annotation using MakeSense tool.

5.5 Run the codes


5.5.0.1 Requirements

Before running the training code we need to prepare the environment.

• We have three folders for data which are:

1. The images folder that contains all extracted images with extension ’.jpg’.
2. The annotations folder that contains an xml file for each image. Make sure the xml
file has the same name as the associated image.
3. The img_list.txt is a file which contains couples of <image name, xml file name>; they have to be in this format:
<imageName>.jpg <imageName>.xml
Make sure you put all of them in this one file (a small helper sketch to generate this file automatically is given at the end of this subsection).

• Changing the batch size, to be compatible with your device; in our case we set it to 15 to avoid CUDA out-of-memory errors.

• Add the data path with this instruction:

parser.add_argument('--training_dataset', default='./data/finalbdd',
                    help='Training dataset directory')
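A small helper of the kind we mean (paths are examples, adapt them to your layout) to generate the img_list.txt file automatically from the images folder:

import os

images_dir = 'data/finalbdd/images'          # example path
list_path = 'data/finalbdd/img_list.txt'     # example path

with open(list_path, 'w') as f:
    for name in sorted(os.listdir(images_dir)):
        if name.lower().endswith('.jpg'):
            stem = os.path.splitext(name)[0]
            # one line per image: <imageName>.jpg <imageName>.xml
            f.write('{}.jpg {}.xml\n'.format(stem, stem))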


5.5.0.2 Training

The training in this case took more than 10 hours, with 300 epochs of 86 iterations each (25800 iterations in total). It took approximately 1.2 seconds per batch (see Figure 5.5 and Figure 5.6) and gives us the weights file at the end.

Figure 5.5: Start training.

Figure 5.6: Finishing training.

5.5.0.3 Testing

• We ran the test on our dataset, which contains 425 images, using both the CPU and the GPU. As a result we got a file containing the bounding box coordinates with the confidence of each one, which we are going to use to calculate the Average Precision (AP) in the next chapter. Before execution, we need to prepare the following:


1. The images folder that contains all the testing images.


2. The img_list.txt that contains the image names; be sure to put all the image names in it.
3. Add the dataset name with this instruction:

parser.add_argument('--dataset', default='PASCAL', type=str,
                    choices=['AFW', 'PASCAL', 'FDDB', 'testmybdd'], help='dataset')

4. To execute,
on GPU:
python3 test.py --dataset testmybdd
on CPU:
python3 test.py --dataset testmybdd --cpu

• We wrote the testing code for the video dataset. To do that, we selected a video from a supermarket and added code to extract the video's frames and run the testing in parallel. On the other hand, we also extracted the frames with the same code used before (Section 5.4.2.2) and labelled them with the MakeSense tool (Section 5.4.3), so that we can calculate the AP. To execute the code we add these instructions:
vidcap = cv2.VideoCapture('samples/vidd.mp4')

def getFrame(sec):
    vidcap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    hasFrames, image = vidcap.read()
    return hasFrames, image

and

while True:
    cap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    has_frame, img_raw = cap.read()
    sec += 1
    if not has_frame:
        print('[i] ==> Done processing!!!')
        break
    # testing begins here



Then use this command:

python testVideo.py --video samples/vidd.avi

• We wrote a code to do a demonstration using an input video or the laptop webcam. First we check whether the input is a video or the webcam:

if args.video:
    if not os.path.isfile(args.video):
        print("[!] ==> Input video file {} doesn't exist".format(args.video))
        sys.exit(1)
    cap = cv2.VideoCapture(args.video)
    output_file = args.video[:-4].rsplit('/')[-1] + '_Facebox.avi'
else:
    # get data from the webcam
    cap = cv2.VideoCapture(args.src)
    output_file = 'webcamFaceBox.avi'

While True, keep testing the frames:

while True:

    has_frame, img = cap.read()

    # stop the program if we reached the end of the video
    if not has_frame:
        print('[i] ==> Done processing!!!')
        print('[i] ==> Output file is stored at',
              os.path.join(args.output_dir, output_file))
        cv2.waitKey(1000)
        break

And visualizing the frames at the same time:

if len(faces) != 0:
    for b in faces:
        if b[4] < args.vis_thres:
            continue
        text = "{:.4f}".format(b[4])
        b = list(map(int, b))
        cv2.rectangle(img, (b[0], b[1]), (b[2], b[3]), (0, 0, 255), 2)
        cx = b[0]
        cy = b[1] + 12
        cv2.putText(img, text, (cx, cy),
                    cv2.FONT_HERSHEY_DUPLEX, 0.5, (255, 255, 255))

# save the output video
video_writer.write(img.astype(np.uint8))
# show the frame
cv2.imshow('res', img)
key = cv2.waitKey(1)
if key == 27 or key == ord('q'):
    print('[i] ==> Interrupted by user!')
    break

For execution:
run MyTest.py with a video input:
python MyTest.py --video samples/vid.mp4 --output-dir outputs/
run MyTest.py with your own webcam:
python MyTest.py --src 0 --output-dir outputs/

Hypotheses. In order to verify the impact of the machine capacity on the results, we ran the training and testing on a more powerful machine, the HPC (High Performance Computing) cluster located in the data center of our university, by accessing it remotely through SSH and running it via this script:

#!/bin/bash

#SBATCH -J FaceDetection_Job          # Job name
#SBATCH -o FaceDetection_.%j.out      # Name of stdout output file (%j expands to jobId)
#SBATCH -N 1                          # Total number of nodes requested
##SBATCH -n 1                         # Number of tasks per node (default = 1)
#SBATCH -p gpu                        # Partition (queue) to submit to
#SBATCH --nodelist=node11             # Run on this specific node
#SBATCH --gpu 2                       # Number of GPUs requested
#SBATCH -t 06:00:00                   # Run time (hh:mm:ss) - 6 hours

# Launch
./make.sh
python train.py                       # in case of training
python test.py --dataset testmybdd    # in case of testing

It completed the training in only 2 hours, while our personal machine took over 10 hours. We also changed the batch size to 32 and used two GPUs. The result of the training is shown in Figure 5.7.

Figure 5.7: Training on HPC.

5.6 Conclusion
After successfully completing the training and testing tasks using our dataset for face detection in surveillance image scenarios, we will now see whether the weights obtained in the training phase are good enough to give AP results similar to those of the first FaceBoxes experiments. Do they feed the network well enough to give the same accuracy? Does changing the batch size reduce the speed performance or affect the results?


We will answer all these questions in the next chapter. You can check the code via this URL:
https://github.com/IhceneDjou/FaceBoxe-surveillanceVideo

Chapter 6
Results and discussions

”Good fortune is what happens


when opportunity meets with
planning.”

Thomas Edison

After doing all the work mentioned in the previous chapters, it is now time to see the results obtained by our implementation using our own dataset. How did we get them? Do they satisfy the requirements? Do the conditions cover all the cases needed to achieve the target? We will discuss all those questions in this chapter.

6.1 Introduction
Face detection in surveillance scenarios in real time is a challenging task: it must achieve time-speed efficiency and accurate bounding boxes, with high average precision and few false positives. Our dataset is ready with labels and our model has been defined and trained. Now, does the test phase give the desired results?

6.2 How do we calculate the AP


We use some existing code [3] to calculate the Average Precision for both test images and test
video.

• The code calculates the Average Precision (AP) for each of the classes present in the ground truth. In our case, there is only one class, 'face'. It uses the IoU (Intersection over Union): we sort the detection results by descending confidence, and there is "a match" when a detection and a ground-truth box share the same label and have an IoU >= 0.5 (in our case we changed it to 0.35 because it was more suitable for our results). This "match" is considered a true positive (a small sketch of this computation is given right after this list).

• Then calculating the mAP (mean Average Precision) value.
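For reference, the matching and AP computation described above boil down to the following sketch (single 'face' class, one image, IoU threshold 0.35); it mirrors the logic of the code we used [3] but is not that code:

import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision_sketch(detections, gt_boxes, iou_thr=0.35):
    """detections: list of (confidence, box); gt_boxes: list of ground-truth boxes."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched, tp, fp = set(), [], []
    for conf, box in detections:
        ious = [iou(box, g) for g in gt_boxes]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thr and j not in matched:
            tp.append(1); fp.append(0); matched.add(j)   # true positive
        else:
            tp.append(0); fp.append(1)                   # false positive
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    rec = np.concatenate(([0.0], tp / max(len(gt_boxes), 1)))
    prec = np.concatenate(([1.0], tp / np.maximum(tp + fp, 1e-9)))
    # AP = area under the precision-recall curve
    return float(np.sum((rec[1:] - rec[:-1]) * prec[1:]))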

6.2.1 Running the code


1. Create the ground-truth files.

2. Copy the ground-truth files into the folder input/ground-truth/.

3. Create the detection-results files.

4. Copy the detection-results files into the folder input/detection-results/.

5. Run the code: python main.py.

6. To show an animation during the calculation: insert the images into the folder input/images-optional/.

6.2.2 Create the ground-truth files


• Create a separate ground-truth text file for each image. So, we need to convert all the ground-truth files built in the VOC XML format. To convert the XML to our format:
– Insert ground-truth xml files into ground-truth.


– Run the python script: python convert_gt_xml.py (an illustrative sketch of this conversion is given at the end of this subsection).

• Use matching names for the files (e.g. image: ”t1.jpg”, ground-truth: ”t1.txt”).

• In these files, each line should be in the following format:


<class_name> <left> <top> <right> <bottom>

• E.g. ”t1.txt”:
face 2 10 173 238
face 439 157 556 241
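
As an illustration, here is a minimal conversion sketch; it assumes the standard VOC tags (object, name, bndbox) and a hypothetical ground-truth/ folder, and the actual convert_gt_xml.py script may differ.

import glob
import os
import xml.etree.ElementTree as ET

# Turn every VOC-style XML annotation into a '<class_name> <left> <top> <right> <bottom>' text file.
for xml_path in glob.glob('ground-truth/*.xml'):
    lines = []
    for obj in ET.parse(xml_path).getroot().iter('object'):
        name = obj.find('name').text          # always 'face' in our dataset
        box = obj.find('bndbox')
        left, top = box.find('xmin').text, box.find('ymin').text
        right, bottom = box.find('xmax').text, box.find('ymax').text
        lines.append(f'{name} {left} {top} {right} {bottom}')
    with open(os.path.splitext(xml_path)[0] + '.txt', 'w') as out:
        out.write('\n'.join(lines))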


6.2.3 Create the detection-results files


• Create a separate detection-results text file for each image. The testing phase produces a single '.txt' file which contains, for each image, the detected bounding boxes sorted in descending order of confidence, like this:

– frame0 0.051 573.1 682.6 614.0 724.4


– frame0 0.050 8.0 881.6 35.6 918.4
– frame0 0.050 964.6 267.3 981.0 282.8
– frame0 0.050 9.3 369.1 56.8 420.6
– frame1 0.760 1330.1 379.1 1398.1 427.9
– frame1 0.736 1327.2 379.4 1400.0 428.8
– frame1 0.637 1325.0 376.8 1403.0 429.1
– frame1 0.620 1629.9 219.6 1657.6 250.6
– -
– -

• Then we copied each image's bounding boxes into a separate '.txt' file and changed the class name, which at this point was the image name, to 'face', so that each file looks like this (this splitting step is sketched in code after this list):

– face 0.760 1330.1 379.1 1398.1 427.9


– face 0.736 1327.2 379.4 1400.0 428.8
– face 0.637 1325.0 376.8 1403.0 429.1
– face 0.620 1629.9 219.6 1657.6 250.6
– -
– -

• Use matching names for the files (e.g. image: ”f1.jpg”, detection-results: ”f1.txt”).

• In these files, each line should be in the following format:


<class_name> <confidence> <left> <top> <right> <bottom>

• E.g. ”f1.txt”:
face 0.471781 0 13 174 244
face 0.414941 274 226 301 265
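
The splitting described above can be automated. The sketch below assumes a single hypothetical detections.txt file whose lines follow the '<image_name> <confidence> <left> <top> <right> <bottom>' layout shown earlier; it groups the lines per image and replaces the image name with the class 'face'.

import os
from collections import defaultdict

per_image = defaultdict(list)
with open('detections.txt') as f:        # hypothetical aggregated output of test.py
    for line in f:
        parts = line.split()
        if len(parts) != 6:
            continue                     # skip empty or malformed lines
        image_name, confidence, left, top, right, bottom = parts
        per_image[image_name].append(f'face {confidence} {left} {top} {right} {bottom}')

os.makedirs('input/detection-results', exist_ok=True)
for image_name, rows in per_image.items():
    with open(f'input/detection-results/{image_name}.txt', 'w') as out:
        out.write('\n'.join(rows))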


6.3 Results
6.3.1 Time
• Testing using CPU: it took 1.283 seconds per image, as shown in Figure 6.1.

Figure 6.1: Testing on CPU

• Testing using GPU: it took 0.02 seconds per image, which means 50 FPS, as shown in Figure 6.2.

Figure 6.2: Testing on GPU

• Testing using the CPU of the HPC: it gives 0.745 seconds per image, as shown in Figure 6.3.

• Testing using the GPU of the HPC: it gives 0.02 seconds per image, which means 50 frames per second, as shown in Figure 6.4.
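
These per-image times were measured around the detection call only. A minimal sketch of how such a measurement can be made, with a hypothetical detect() function standing in for the network's forward pass:

import time

def measure_speed(images, detect):
    # 'detect' is the face-detection call being timed; loading and drawing are excluded.
    total = 0.0
    for img in images:
        start = time.time()
        detect(img)
        total += time.time() - start
    seconds_per_image = total / len(images)
    return seconds_per_image, 1.0 / seconds_per_image   # (seconds per image, FPS)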


Figure 6.3: Testing on CPU HPC

Figure 6.4: Testing on GPU HPC


6.3.2 Average precision


6.3.2.1 For test images

• As shown in the curve of Figure 6.5, the average precision obtained is 34.03%.

• Figure 6.6 shows that the program detects 1096 of the 1342 ground-truth faces (true positives), which means 246 ground-truth faces are not detected. It also shows 25272 false alarms (false positive bounding boxes).

• Figure 6.7 shows an example of a face detected with a high confidence of 71%. The green rectangle is the ground-truth and the blue one is the detected bounding box.

• Figure 6.8 shows a face detected with a confidence of 42.22%; this is why we lowered the threshold from 50% to 35%: the program detects many faces, but with low confidence.

• Figure 6.9 shows the detected bounding boxes (true and false). The red ones are false positives, the green ones are the bounding boxes that match the ground-truth, the blue ones are the ground-truth boxes that were detected by the program and the pink ones are the ground-truth boxes that were not detected.

Figure 6.5: The Average precision obtained for test images by Faceboxes

6.3.2.2 For video tests

• As shown in the curve of Figure 6.10, the average precision obtained is 22.71%.


Figure 6.6: Bounding boxes detected

Figure 6.7: Face detected with a high confidence of 71%

Figure 6.8: Confidence of 42%


Figure 6.9: Detected bounding boxes. Red: false positives, Green: true positives, Blue: ground-truth detected, Pink: ground-truth not detected.

• Figure 6.11 shows that the program detects 277 of the 386 ground-truth faces (true positives), which means 109 ground-truth faces are not detected. It also shows 12971 false alarms (false positive bounding boxes).

• Figure 6.12 shows an example of a face detected with a high confidence of 85.04%. The green rectangle is the ground-truth and the blue one is the detected bounding box.

• Figure 6.13 shows a face detected with a confidence of 38.03%; again, this is why we lowered the threshold from 50% to 35%: the program detects many faces, but with low confidence.

• Figure 6.14 shows the detected bounding boxes (true and false). The red ones are false positives, the green ones are the bounding boxes that match the ground-truth, the blue ones are the ground-truth boxes that were detected by the program and the pink ones are the ground-truth boxes that were not detected.

6.3.2.3 Tests on HPC

• As shown in the curve of Figure 6.15, the average precision obtained is 64.91%.


Figure 6.10: The Average precision obtained for video test by Faceboxes

Figure 6.11: Bounding boxes detected (true and false positives)


Figure 6.12: Face detected with a high confidence of 85%

Figure 6.13: Confidence of 38%

Figure 6.14: Detected bounding boxes. Red: false positives, Green: true positives, Blue: ground-truth detected, Pink: ground-truth not detected.


• Figure 6.16 shows that the program detects 1212 of the 1342 ground-truth faces (true positives), which means 130 ground-truth faces are not detected. It also shows 13920 false alarms (false positive bounding boxes).

Figure 6.15: The Average precision obtained for test images by Faceboxes on the HPC

Figure 6.16: Bounding boxes detected on the HPC

6.4 Discussion
Having seen the results obtained by our model based on the Faceboxes method, let us now look at the reasons behind them:


6.4.1 Average precision (AP) on our personal machine


As we saw above, the AP obtained for the image and video tests is 34.03% and 22.71% respectively, which is quite low compared to the Faceboxes results on AFW (98.55%), PASCAL (97.05%) and FDDB (96.00%). We can explain this as follows:

• We did the training using our dataset, which contains 1277 images with 4537 faces; this is quite small compared to the training done by Faceboxes on the WIDER FACE dataset, which has 12881 training images and around 157481 faces, so their model is stronger than ours.

• Labelling the images manually means that we sometimes missed drawing boxes, as shown in Figure 6.17: the program detects the face but does not find any matching ground-truth.

Figure 6.17: Face detected without ground-truth

• Our tests use 425 images containing 1342 faces and a video of 205 frames containing 386 faces. The program gives good results with these small numbers, and a wider test dataset would allow a higher and more reliable AP.

• Another reason for the reduced results is the change of the batch size from 30 to 15, imposed by our machine's memory capacity; the network's efficiency is reduced because of this smaller input.

• The false positives outnumber the true positives, as shown in Figures 6.6, 6.9, 6.11 and 6.14, which is a limit of this approach.

6.4.2 Average precision (AP) on the HPC


After doing the training on the HPC, we obtained a new model with a batch size of 32, as mentioned in the previous chapter, and applied it to our test images. It gives an AP of 64.91%, which is clearly better than the previous result obtained on our machine (34.03%). It detects 1212 ground-truth faces, compared to the 1096 detected by the first model, and in addition it gives fewer false alarms. This confirms our hypothesis that the capacity of the machine affects the results. Yet the results are still far from the optimal results obtained in the Faceboxes experiments, so we can say that the remaining gap comes from the dataset, which needs to be much wider to feed the network and produce a more powerful model, and collecting that much data requires more time.

6.4.3 The speed of calculations


As shown earlier in Figures 6.1 and 6.2, the speed achieved on our machine was 1.283 s per image on the CPU and 0.02 s per image (50 FPS) on the GPU. On the HPC, as shown in Figures 6.3 and 6.4, we achieved 0.745 s per image on the CPU and 0.02 s per image (50 FPS) on the GPU; this is slower than Faceboxes, which achieved 20 FPS on CPU and 125 FPS on GPU. So we achieve real time on the GPU but not on the CPU since, as we said in the previous chapters, real-time efficiency requires at least 20 FPS. The two reasons behind the results obtained on our machine are the machine capacity and how optimal our model is: our machine has only one GPU and a CPU with 2 cores and 4 threads, compared to the CPU used in the Faceboxes experiments, which has 10 cores and 20 threads.

6.5 Conclusion
In the face detection tests, our model gives an AP of 34.03% on the test images, 22.71% on the test video and, with the HPC-trained model, 64.91% on the test images. On our machine it runs at 1.283 s per image on the CPU and 50 FPS on the GPU, and on the HPC at 0.745 s per image on the CPU and 50 FPS on the GPU, which is below the results reported for the Faceboxes method. This is due, first, to the dataset, which is not big enough to produce an optimal model; the test dataset is also small, so we cannot judge the speed reliably. Second, it is due to the machine capacity: by training and testing on the HPC and obtaining improved AP results, we confirmed that the machine capacity affects the results. Besides that, we get many false alarms, again because the dataset is not big enough. Third, it is due to the limited accuracy of the manually drawn image labels.

Chapter 7: Conclusion

Face detection has always been a challenging problem, and researchers keep working to achieve higher performance. Because face detection has become important in the world of technology, we worked on surveillance scenarios. Security cameras are installed everywhere, and we aim to add a face detection option to them. This option will meet the needs of anyone looking to detect humans by detecting their faces, for example in shops, police stations, airports and many other places. Face detection is useful for statistics, by counting the number of people in a specific area; it can also serve as the first step of face recognition, especially when it gives efficient results; and it can be used in secured places where no human should be present, so that applying face detection triggers efficient alarms. In our work, we studied four existing face detection methods by comparing them, and we selected the Faceboxes method for our study because of its efficiency. We created our own dataset covering different surveillance conditions, with 1277 images containing 4537 faces for training, and 425 images containing 1342 faces plus one video of 205 frames containing 386 faces for testing.
Our model reaches 1.283 s per image on the CPU and 50 FPS on the GPU of our personal machine, and 0.745 s per image on the CPU and 50 FPS on the GPU of the university's HPC, which means that we achieved real time only on the GPU. We reach an Average Precision of 34.03% and 22.71% for the image and video tests respectively on our machine, and 64.91% for the image test on the HPC. Our results are below the Faceboxes results on FDDB, PASCAL and AFW, due to the capacity of our machine and the lack of more data. Otherwise, we can say that the results obtained are good enough for the requirements we set.
As perspectives, we have to reduce the false positives shown in Figures 6.6, 6.11 and 6.16 by cropping the images around the faces and using these crops in the training phase, which means


enriching our dataset through data augmentation. This solution will give more accurate results and an improved AP by producing a more robust model (a rough sketch of this cropping is given below). We also have to build a more precise ground-truth and to use more powerful machines and bigger data.
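
As a rough illustration of this cropping idea, the sketch below cuts every ground-truth box out of an image with OpenCV; the function name, paths and margin are ours and not part of the existing code.

import cv2

def crop_faces(image_path, boxes, out_prefix, margin=10):
    # 'boxes' are ground-truth (left, top, right, bottom) integer tuples for this image.
    img = cv2.imread(image_path)
    height, width = img.shape[:2]
    for i, (left, top, right, bottom) in enumerate(boxes):
        x1, y1 = max(0, left - margin), max(0, top - margin)
        x2, y2 = min(width, right + margin), min(height, bottom + margin)
        cv2.imwrite(f'{out_prefix}_{i}.jpg', img[y1:y2, x1:x2])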

Chapter 8: Appendix

In this section we mention the most common errors that we faced during the development and preparation, in Python, of the existing SRN and Faceboxes code. As it was our first time working with this kind of code and environment, it took a lot of time to satisfy the conditions needed to execute the code.

8.1 The environment


Producing a suitable environment is the part that took the most time. It was hard and full of obstacles that appeared during the installations, from the operating system down to the Python libraries. These are the main points to take into consideration:

1. The first thing we faced is the need to switch the operating system to a Linux environment (we chose Ubuntu 18.04, which ships Python 3.6 by default as requested by the code owners), because the code includes a Bash file that can only be run under Linux.

2. If you already have Windows on your machine, install Linux beside it and make sure to prepare a free partition for the Linux system before the installation.

3. Both codes use CUDA/cuDNN, so your machine should have a GPU. Before installing CUDA/cuDNN you first need to install the NVIDIA driver.

4. You need to choose the version of each package carefully to obtain a compatible environment (the SRN code requests PyTorch < 1.0.0 and torchvision 0.2.1; CUDA should be version 9.0 or under, with cuDNN V9.5.6 and NVIDIA driver 380).

5. The Faceboxes code requests PyTorch > 1.0.0, and the CUDA version should be > 9.0.

Working under an Anaconda environment may save time and lets each code keep its own folder and libraries with the required versions.
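
A quick way to confirm that an installed environment matches these version constraints is to query them from Python itself; the sketch below only prints what is installed and changes nothing.

import torch
import torchvision

# Versions that matter for the SRN / Faceboxes requirements listed above.
print('PyTorch      :', torch.__version__)
print('Torchvision  :', torchvision.__version__)
print('CUDA (torch) :', torch.version.cuda)
print('cuDNN        :', torch.backends.cudnn.version())
print('GPU available:', torch.cuda.is_available())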

8.2 Things to avoid in the environment


• Do not try to uninstall or purge Python, because it will damage your graphical system. If you want to change the version, install the new one first and then change the priorities as follows.
Example: to change the default in Ubuntu 18.04 from Python 3.6.8 to Python 3.7:

1. Install Python 3.7 and configure it as the default interpreter.
   Install the python3.7 package using apt-get:
   $ sudo apt-get install python3.7
2. Add Python 3.6 and Python 3.7 to update-alternatives:
   $ sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1
   $ sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 2
3. Update python3 to point to Python 3.7:
   $ sudo update-alternatives --config python3
   (enter 2 to select Python 3.7)
4. Test the version of Python:
   $ python3 -V
   Python 3.7.1

• If you do not want a dual-boot machine and want to keep only Linux, be very careful when uninstalling the other operating system.

8.3 Collection of data


The second thing that took time is the collection of data. The data must satisfy the following points:

• The more data we collect, the better it is for training.

• The data should be varied, with different positions and scenes.

• It should cover different conditions (blurry, far, occluded, multiple sizes).

• Select data that matches the goal, which is detecting faces in surveillance scenarios.

Accomplishing these points takes a lot of time and searching in order to find the right sources and build a dataset that fits the study and has the quality required.


8.3.1 Resources for our data


Most of our data comes from YouTube videos. For the video test phase we asked a supermarket to give us some footage, but it was difficult for them to provide much, so we got just one video, which we cropped to obtain sequences containing only face scenes, giving 205 frames.

8.4 Errors during training


During the training, you may face a few errors:

• When creating the image labels in the XML VOC format, if you change something, be sure that all the tags are closed and that you did not delete any character; otherwise, with many files, you will have to go through all of them to find the issue (see Figure 8.1; a small checking script is sketched after the figure).

• A second error may appear if you have a limited memory capacity; be sure to change the parameters to suit your device, as shown in Figure 8.2.

• Make sure that the names and extensions in the img_list.txt file match the actual files (for example, .jpg and .JPG are not the same); see Figure 8.3.

• Be sure that the path in all the image labels is the same, because otherwise it will confuse the program.

Figure 8.1: XML file errors
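
To avoid inspecting every label file by hand, their well-formedness can be checked in one pass. A minimal sketch, assuming the labels live in a hypothetical annotations/ folder:

import glob
import xml.etree.ElementTree as ET

# Report every annotation file that fails to parse (unclosed tag, deleted character, ...).
xml_files = glob.glob('annotations/*.xml')
broken = []
for xml_path in xml_files:
    try:
        ET.parse(xml_path)
    except ET.ParseError as err:
        broken.append((xml_path, str(err)))

for path, reason in broken:
    print(f'{path}: {reason}')
print(f'{len(broken)} broken file(s) out of {len(xml_files)}')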

8.5 Commands that might be useful for Linux users


• To know the path of your Python libraries, see Figure 8.4:
$ python3 -m site --user-site


Figure 8.2: Parameters to adapt to the machine's memory capacity

Figure 8.3: Problem with the image

Figure 8.4: Python libraries path


• When installing PyCharm CE for Python through Ubuntu Software, interruptions sometimes happen (for example because of the Internet connection); as a result, an error message appears when you try to install it again. To fix it:

1. Open a terminal using Ctrl + Alt + T.
2. $ snap changes
3. Select the number of the change with the error message and use it in the next command.
4. $ sudo snap abort number_selected (example: sudo snap abort 3).
5. Then go back to Ubuntu Software and install it again.
6. Sometimes a file downloaded from the Internet arrives locked; to remove the lock see Figure 8.5, or use this command:
   $ sudo chown -R $USER: $HOME

Figure 8.5: Unlock files

Bibliography

[1] L. Ambalina, What is image annotation? – an intro to 5 image annotation services. https://hackernoon.com/what-is-image-annotation-an-intro-to-5-image-annotation-services-yt6n3xfj, 2019. Accessed on 2020-09-03.

[2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, CoRR, abs/1611.08050 (2016).

[3] J. Cartucho, R. Ventura, and M. Veloso, Robust object recognition through symbiotic deep learning in mobile robots, in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 2336–2341.

[4] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou, Selective refinement network for high performance face detection, CoRR, abs/1809.02693 (2018).

[5] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893.

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (2010), pp. 1627–1645.

[7] Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55 (1997), pp. 119–139.

[8] FRITZ.AI, Object detection guide. https://www.fritz.ai/object-detection/, 2020. Accessed on 2020-09-03.

[9] V. Fung, An overview of resnet and its variants. https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035, 2017. Accessed on 2020-09-03.

[10] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CoRR, abs/1512.03385 (2015).

[11] S. K, Non-maximum suppression (nms). https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c, 2019. Accessed on 2020-09-03.

[12] R. Khandelwal, Deep learning using transfer learning - python code for resnet50. https://towardsdatascience.com/deep-learning-using-transfer-learning-python-code-for-resnet50-8acdfb3a2d38, 2019. Accessed on 2020-09-03.

[13] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, Feature pyramid networks for object detection, in CVPR, IEEE Computer Society, 2017, pp. 936–944.

[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, SSD: single shot multibox detector, CoRR, abs/1512.02325 (2015).

[15] C. C. Loy, Computer Vision: A Reference Guide, Springer International Publishing, Cham, 2020, ch. Face Detection.

[16] M. I. Nouyed and G. Guo, Face detection on surveillance images, arXiv preprint arXiv:1910.11121 (2019).

[17] Prabhu, Understanding of convolutional neural network (cnn). https://medium.com/RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148, 2018. Accessed on 2020-09-03.

[18] L. R., Focus: Mobilenet, a powerful real-time and embedded image recognition, 2018. Accessed on 2020-09-03.

[19] S. Ren, K. He, R. Girshick, and J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (2017), pp. 1137–1149.

[20] P. Ruiz, Understanding and visualizing resnets. https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8, 2018. Accessed on 2020-09-03.

[21] W. Shang, K. Sohn, D. Almeida, and H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, CoRR, abs/1603.05201 (2016).

[22] S. Sharma, Epoch vs batch size vs iterations, 2017. Accessed on 2020-09-03.

[23] C. Shorten and T. M. Khoshgoftaar, A survey on image data augmentation for deep learning, Journal of Big Data, 6 (2019).

[24] P. Viola and M. Jones, Robust real-time face detection, International Journal of Computer Vision, 57 (2004), pp. 137–154.

[25] S. Yang, P. Luo, C. C. Loy, and X. Tang, WIDER FACE: A face detection benchmark, CoRR, abs/1511.06523 (2015).

[26] J. Yoon and D. Kim, An accurate and real-time multi-view face detector using orfs and doubly domain-partitioning classifier, Real-Time Image Processing, 16 (2019).

[27] J. Zhang, X. Wu, J. Zhu, and S. C. H. Hoi, Feature agglomeration networks for single stage face detection, CoRR, abs/1712.00721 (2017).

[28] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, Single-shot refinement neural network for object detection, CoRR, abs/1711.06897 (2017).

[29] S. Zhang, L. Wen, H. Shi, Z. Lei, S. Lyu, and S. Z. Li, Single-shot scale-aware network for real-time face detection, Int. J. Comput. Vision, 127 (2019), pp. 537–559.

[30] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z. Li, Improved selective refinement network for face detection, CoRR, abs/1901.06651 (2019).

[31] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, Faceboxes: A CPU real-time face detector with high accuracy, CoRR, abs/1708.05234 (2017).

[32] F. Zuppichini, Residual networks: Implementing resnet in pytorch. https://towardsdatascience.com/residual-network-implementing-resnet-a7da63c7b278, 2019. Accessed on 2020-09-03.

