By
Djouama Ihcene
Oulmi Saliha
Directed by
Dr Larbi GUEZOULI
September 2020
Abstract
Face detection has been a trending field of research over the last few years, driven by the growing need for it in this decade and by the many technologies that rely on it. We find it in different applications (for example, Snapchat and Instagram face filters) and in different areas where it helps ensure security: airports, train stations, security cameras in homes and shops, etc. This is what makes it important and motivates researchers to achieve remarkable work with high performance and guaranteed real-time efficiency, and what makes it an interesting subject for our project: detecting faces in video in real time. As a first step, we select four well-known methods that achieve state-of-the-art results in face detection and carry out a comparative study between them. We then select the one that best satisfies all the conditions and requirements needed to reach real time (at least 20 frames per second) with high accuracy on the dataset we built for surveillance scenarios.
Keywords: Security camera, face detection, FaceBoxes, deep learning, CNN, real time, machine learning.
Résumé
La détection de visages est le domaine de recherche en vogue de ces dernières années, en raison des besoins de cette décennie et des technologies qui l'utilisent ; on peut la trouver dans différentes applications (exemples : filtres de visage Instagram, Snapchat) et dans différents domaines pour assurer la sécurité dans les aéroports, les gares, les caméras de sécurité dans les maisons et les magasins, etc. C'est ce qui la rend importante et motive les chercheurs à réaliser des travaux remarquables avec des performances élevées et à garantir une efficacité en temps réel sur la vidéo, et c'est ce qui rend le sujet intéressant pour notre projet de détection de visages en temps réel. Donc, dans un premier temps, nous sélectionnons quatre méthodes connues qui atteignent l'état de l'art dans la détection des visages en faisant une étude comparative entre ces méthodes, puis nous sélectionnons la meilleure, celle qui répond à toutes les conditions et exigences nécessaires pour atteindre le temps réel (au moins 20 images par seconde) avec une grande précision sur la base de données que nous avons construite pour les scénarios de surveillance.
Mots clés : Caméra de sécurité, détection de visages, Faceboxes, Apprentissage profond, CNN, Temps réel, Apprentissage automatique.
ملخص
إن الكشف عن الوجوه هو من مجالات البحث العلمي الأكثر استكشافاً في السنوات القليلة الماضية، وهذا يرجع إلى الحاجة إليه في هذا العقد وهو من التكنولوجيات الأكثر استخداماً؛ حيث يمكن أن نجدها في تطبيقات مختلفة (أمثلة: فلتر الانستقرام وسناب شات)، وفي مناطق مختلفة لضمان الأمن في المطارات ومحطات القطارات وكاميرات الأمن في المنازل والمتاجر... إلخ، وهذا ما يجعلها مهمة ومحفزة للباحثين لإنجاز أعمال رائعة بأداءٍ عالٍ وضمان كفاءة الوقت الحقيقي، مما يجعل الأمر مثيراً للاهتمام للقيام به في مشروعنا للكشف عن الوجوه في الوقت الفعلي في الفيديو. لذلك كخطوة أولى نختار أربع طرق معروفة في الكشف عن الوجوه حققت نتائج مبهرة في مختلف الشروط، ثم من خلال إجراء دراسة مقارنة بينها نختار أفضل طريقة تستجيب لجميع الشروط والمتطلبات اللازمة لإحراز الكشف في الوقت الحقيقي بما لا يقل عن 20 صورة في الثانية، ودقة عالية مع مجموعة البيانات التي أنشأناها لسيناريوهات المراقبة.
الكلمات المفتاحية: كاميرات المراقبة، الكشف عن الوجوه، Faceboxes، التعلم العميق، CNN، الوقت الحقيقي، التعلم الآلي.
Dedication
Acknowledgements
Foremost, I have to thank Allah for giving me the strength and patience to complete our thesis in these unexpected circumstances of the Corona Virus.
I would like to express my sincere gratitude to our supervisor Dr LARBI GUEZOULI for the guidance that helped me during this research; without his assistance, corrections, planning, and dedicated involvement in every step of the process, this work would never have been accomplished.
My sincere thanks also go to all my teachers over all these years; without them I would not have reached this level of knowledge.
Special thanks to my friends and to my colleague in this thesis for their motivation and support during this work, to my two best friends who have always been there for me, and to my friend in Mostaganem who helped me see this project more clearly by giving me tips.
Last but not least, I would like to thank my family, my brothers, and my dearest sister; none of this would have happened without their trust in me and their pushing and encouraging me spiritually. Most importantly, I thank my father DJOUAMA ABDELAZIZ, who did everything he could to offer me all that I need and to bring me to where I am now, through his advice, his constant support, and his certainty that I can do anything to be successful during all my years of study.
Thanks to myself for not giving up.
Acknowledgements
First and foremost, I am deeply grateful and thankful to ALLAH for giving us the strength, ability, and knowledge to achieve this work.
My gratitude knows no bounds to my colleague DJOUAMA IHCENE, who has truly been the one supporting and encouraging me the most in my hard and expectant time this year. Best wishes to her.
Moreover, I would like to express my thanks to my dad and mom, my children DJANNA and MOUATEZ, my sister and brothers, and my husband.
Finally, a special thank you to our supervisor Dr LARBI GUEZOULI.
Author’s declaration
We declare that the work in this dissertation was carried out in accordance with the requirements of the University's Regulations and Code of Practice for Research Degree Programs, and that the work is the candidate's own. Work done in collaboration with, or with the assistance of, others is indicated as such.
Email :
Table of Contents
Page
1 Introduction 1
1.1 Problems and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Related Work 18
3.1 Body based Face Detection BFD on the UCCS dataset [16] . . . . . . . . . . . . 18
3.1.1 The characteristics of Body based face detection BFD proposed by Cao
et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 The processing of the real-time face detection approach . . . . . . . . . . 19
3.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Selective Refinement Network for High Performance Face Detection [4] . . . . . 20
3.2.1 Main contributions to the face detection studies . . . . . . . . . . . . . . . 20
3.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Feature Agglomeration Networks for Single Stage Face Detection [27] . . . . . . 21
3.3.1 The contribution of this methods . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.2 The final FANet model and its results . . . . . . . . . . . . . . . . . . . . . 22
3.4 FaceBoxes: A CPU Real-time Face Detector with High Accuracy [31] . . . . . . 23
3.4.1 The contribution of this methods . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Comparison between methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.7 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 Implementation 46
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 The method used in the implementation . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Operating system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Which programming language . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.3 Used machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Creating our dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.1 Starting idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.2 Collecting images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.3 Frames annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.5 Run the codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7 Conclusion 72
8 Appendix 74
8.1 The environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2 Things to avoid in the environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3 Collection of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Bibliography 79
List of Tables
Table Page
4.1 AP performance of the two-steps classification applied to each pyramid level. [4] . 32
4.2 AP performance of the two-step regression applied to each pyramid level. [4] . . . 33
4.3 Evaluation on Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Overall CPU inference time and mAP compared on different methods. The FPS is
for VGA-resolution images on CPU and the mAP means the true positive rate at
1000 false positives on FDDB. Notably, for STN, its mAP is the true positive rate
at 179 false positives and with ROI convolution, its FPS can be accelerated to 30
with 0.6% recall rate drop. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Ablative results of the FaceBoxes on FDDB dataset. Accuracy (mAP) means the
true positive rate at 1000 false positives. Speed (ms) is for the VGA-resolution
images on the CPU. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Result of ablation of each component of the method beside the loss function where
DCFPN=Architecture+Strategy+Loss [31] . . . . . . . . . . . . . . . . . . . . . . . . 44
List of Figures
Figure Page
4.1 A WIDERFACE dataset for face detection. The annotated face bounding boxes are
denoted in green color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Network structure of SRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Architecture of the FaceBoxes and the detailed information table about our anchor
designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 (a) The C.ReLU modules where Negation simply multiplies −1 to the output of
Convolution. (b) The Inception modules. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Examples of anchor densification. For clarity, we only densify anchors at one recep-
tive field center (i.e., the central black cell), and only color the diagonal anchors. . 40
4.6 Face boxes on AFW dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 1
Introduction
In the area of security, motion detection remains a field of research despite the evolution it has known since its creation. A surveillance camera can capture any type of movement. The movement that interests us in this project is the movement of human beings: the camera should only detect the movement of human beings. This detection is based on facial analysis techniques, which in our case means face detection.
Face detection is a long-standing problem in computer vision with extensive applications including face recognition, animation, expression analysis, and human-computer interaction [29]. It is therefore a fundamental step for all facial analysis algorithms: the goal of face detection is to determine the presence of faces in the image and, if present, to return the location and extent of each face.
To detect faces efficiently and accurately, different detection pipelines have been designed after the pioneering work of Viola-Jones [24]. Most early face detection methods focused on designing effective hand-crafted features (e.g., Haar (Viola and Jones 2004) [24] and HOG (Dalal and Triggs 2005) [5]) and classifiers (e.g., AdaBoost (Freund and Schapire 1997) [7]), and on combining local features in global models such as the deformable parts model (DPM) (Felzenszwalb et al. 2010) [6]. However, these methods typically optimize each component of the detector separately, which limits their performance when they are deployed in real-life complex scenarios. [29]
Further improving the performance of face detection has become a challenging and hard issue. However, in recent years, with the advent of the deep convolutional neural network (CNN), a new generation of more effective face detection methods based on CNNs has significantly improved the state-of-the-art performance and rapidly become the tool of choice [29]. These detectors perform much better than approaches based on hand-crafted features, thanks to the capability of deep CNNs to extract discriminative representations from data. Modern face detectors based on deep CNNs can easily detect faces under moderate variations in pose, scale, facial expression, occlusion, and lighting conditions. Consequently, deep learning-based face detectors are now widely used in a myriad of consumer products, e.g., video surveillance systems, digital cameras, and social networks, because of their remarkable face detection results [15]. But, although these methods have considerably improved in terms of detection speed or accuracy, some methods that focus on accuracy tend to be extremely slow due to the use of complicated classifiers, while some other methods, focusing on detection speed, have limited accuracy of the bounding box and in the detection of tiny faces or extreme-pose faces, especially while working in real time. [26]
1.1 Problems and objectives
• Recall Efficiency
• Location Accuracy
• Real time
– To achieve face detection in surveillance scenarios in real time, the speed should be high.
– To achieve real-time performance, the frame rate should be at least 20 FPS.
• Datasets
1.2 Thesis plan
a) Chapter 1: Introduction.
b) Chapter 2: This chapter contains definitions of the expressions and terms used in this manuscript, to make it easier to read.
c) Chapter 3: This chapter presents a comparative study between four methods working on face detection. We present the advantages and disadvantages of each, with the aim of selecting the best one.
d) Chapter 4: Presentation of the selected methods.
Chapter 2
Lexicon of used expressions
These are expressions used in this manuscript; to better understand the subject, let us explain them briefly.
Face detection is commonly confused with face recognition, so before we proceed, it is important that we clarify the distinction between them.
Face recognition assigns a label to a face: a picture of a person named "Ali" receives the label "Ali". Face detection, on the other hand, only draws a bounding box around each face; the model predicts where each face is (see figure 2.1).
2.3.1 Input
This is the input image, here a photo with a car. When we have a video input, we split it into its individual frames and apply the CNN to each of them (generally with tracking to avoid having to detect too often). [18]
Consider a 5 × 5 image whose pixel values are 0 or 1, and a 3 × 3 filter matrix, as shown below.
The convolution of the 5 × 5 image matrix with the 3 × 3 filter matrix produces an output called the "Feature Map", as shown below.
Convolving an image with different filters can perform operations such as edge detection, blurring, and sharpening. The example below shows various convolved images after applying different types of filters (kernels) (see Figure 2.6).
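To make the feature-map computation concrete, here is a minimal sketch in Python/NumPy; the 5 × 5 image and the 3 × 3 filter values are illustrative and are not the ones shown in the figures.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) convolution with stride 1, producing the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the image patch and the filter, then sum
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))  # 3 x 3 feature map
```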
2.3.3 Strides
Stride is the number of pixels a filter moves across the input image. When the stride is 1 then
we move the filters by 1 pixel at a time. When the stride is 2 then we move the filters by 2
pixels at a time and so on. Figure 2.7 shows convolution with a stride of 2 pixels. [17]
2.3.4 Padding
Sometimes the filter does not fit the input image perfectly. We then have two options: pad the picture with zeros (zero-padding) so that the filter fits, or drop the part of the image where the filter does not fit. The latter is called valid padding, which keeps only the valid part of the image.
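The resulting spatial size can be computed from the input size n, filter size f, padding p, and stride s with the standard formula floor((n + 2p − f)/s) + 1. A small sketch (the example sizes are illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(5, 3))        # valid padding, stride 1 -> 3
print(conv_output_size(5, 3, p=1))   # zero-padding of 1 keeps the size -> 5
print(conv_output_size(5, 3, s=2))   # stride 2 -> 2
```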
There are other non linear functions such as ’tanh’ or ’sigmoid’ that can also be used
instead of ReLU. Most of the data scientists use ReLU because its performance is better than
the other two.
• Max Pooling.
• Average Pooling.
• Sum Pooling.
Max pooling takes the largest element from each window of the rectified feature map. Average pooling instead takes the average of the elements in each window, and summing all the elements in the window is called sum pooling. (see figure 2.9) [17]
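A minimal sketch of the three pooling variants on a small feature map (the values and the 2 × 2 window are illustrative):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool a 2-D feature map with a size x size window (max, average, or sum)."""
    op = {"max": np.max, "average": np.mean, "sum": np.sum}[mode]
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [1, 2, 9, 8],
                 [0, 3, 4, 7]], dtype=float)
print(pool2d(fmap, mode="max"))      # [[6. 5.] [3. 9.]]
print(pool2d(fmap, mode="average"))  # [[3.5 2. ] [1.5 7. ]]
```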
In the figure 2.10, the feature map matrix will be converted as vector (x1, x2, x3, …). With
the fully connected layers, we combined these features together to create a model. Finally, we
have an activation function such as softmax or sigmoid to classify the outputs as cat (y1), dog
(y2), car (y3) etc.
2.3.8 Anchors
We denote the reference bounding box as an "anchor box", also called an "anchor" for simplicity; it is sometimes called a "default box" as well. [28]
Anchor boxes are used in computer vision object detection algorithms to help locate objects.
2.3.9 Resnet50
ResNet (Residual Networks) [10] is the winner of the ImageNet challenge in 2015 and has been used as an architecture for many computer vision works; it allows deep networks with different numbers of layers, up to 150+ layers, to be trained successfully. ResNet-50 is a pretrained deep learning model of the convolutional neural network (CNN, or ConvNet) family.
What characterizes a residual network is its identity connections, which take the input directly to the end of each residual block, as shown with the curved arrow in figure 2.12.
Specifically, the ResNet-50 model consists of 5 stages, each with residual blocks. Each block has 3 layers with both 1×1 and 3×3 convolutions. The concept of residual blocks is quite simple: in traditional neural networks, each layer feeds the next layer, whereas in a network with residual blocks, each layer also feeds the layers about 2–3 hops away through so-called identity connections. ResNet solves the problem of vanishing gradients, which occurs when the gradient becomes so small that the weights are no longer updated effectively and the neural network may completely stop training.
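The following is a minimal PyTorch sketch of a ResNet-style bottleneck residual block with its identity connection; the channel sizes are illustrative and this is not the exact ResNet-50 configuration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions plus an identity (skip) connection."""
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # project the input when its channel count differs from the block output
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))

    def forward(self, x):
        identity = self.shortcut(x)          # the identity connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)     # add the skipped input back in

block = Bottleneck(64, 64, 256)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```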
Perhaps the simplest data augmentation method is mirroring along the vertical axis: if we have an example in our training set, we can flip it horizontally to get the image on the right. For most computer vision tasks, if the left picture is a cat then its mirror image is still a cat. Hence, as long as the mirroring operation preserves whatever we are trying to recognize in the picture, it is a good data augmentation technique to use. (Figure 2.13)
Another commonly used technique is random cropping: from a given image we pick a few random crops. Random cropping is not a perfect data augmentation method, since we may randomly end up taking a crop that does not look much like the cat; in practice it works well as long as our random crops are reasonably large subsets of the original image. (Figure 2.14)
Another type of data augmentation that is commonly used is color shifting. For the picture below, let's say we add different distortions to the red, green, and blue channels; in this example we add to the red and blue channels and subtract from the green channel. (Figure 2.15) By introducing these color distortions, we make our learning algorithm more robust to changes in the color of our images.
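A minimal NumPy sketch of the three augmentations described above (mirroring, random cropping, and color shifting); the shift values, crop size, and stand-in image are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mirror(image):
    """Flip horizontally (mirroring along the vertical axis)."""
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w):
    """Take a random crop of size crop_h x crop_w from an H x W x C image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def color_shift(image, shift=(10, -10, 10)):
    """Add a different distortion to each of the R, G and B channels."""
    shifted = image.astype(np.int16) + np.array(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
augmented = color_shift(random_crop(mirror(img), 200, 200))
print(augmented.shape)  # (200, 200, 3)
```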
1. Select the proposal with the highest confidence score, remove it from B, and add it to the final proposal list D (initially D is empty).
2. Now compare this proposal with all the other proposals: calculate the Intersection over Union (IoU) (see Figure 2.16) of this proposal with every other proposal. If the IoU is greater than the threshold N, remove that proposal from B.
3. Again take the proposal with the highest confidence from the remaining proposals in B, remove it from B, and add it to D.
4. Once again calculate the IoU of this proposal with all the proposals in B and eliminate the boxes whose IoU is higher than the threshold.
The IoU calculation is used to measure the overlap between two proposals.
js(A, B) = \frac{|A \cap B|}{|A \cup B|}    (2.1)
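A minimal sketch of the IoU computation and of the NMS procedure described above, for boxes given as (x1, y1, x2, y2) tuples with a confidence score; the helper names and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(proposals, threshold):
    """proposals: list of (box, score). Keep the best boxes, dropping heavy overlaps."""
    B = sorted(proposals, key=lambda p: p[1], reverse=True)
    D = []
    while B:
        best = B.pop(0)                       # proposal with the highest confidence
        D.append(best)
        # remove every remaining proposal whose IoU with `best` exceeds the threshold
        B = [p for p in B if iou(best[0], p[0]) <= threshold]
    return D

boxes = [((10, 10, 60, 60), 0.9), ((12, 12, 62, 62), 0.8), ((100, 100, 150, 150), 0.7)]
print(nms(boxes, threshold=0.5))  # the second box overlaps the first too much and is dropped
```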
• If the weights in a network start too small, then the signal shrinks as it passes through
each layer until it’s too tiny to be useful.
• If the weights in a network start too large, then the signal grows as it passes through
each layer until it’s too massive to be useful.
Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reason-
able range of values through many layers.
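A minimal sketch of Xavier (Glorot) initialization using the common uniform-bound variant, which keeps the weight variance on the order of 1/fan so the signal neither shrinks nor explodes; the layer sizes are illustrative.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Draw weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(512, 256)
print(W.std())  # about 0.05: 'just right', neither too small nor too large
```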
2.4.2 Epochs
When ALL of the dataset has passed forward and backward through the neural network one time, that is one epoch.
Yet, if the dataset is too big to be fed in at once, the epoch needs to be divided into several smaller batches that are passed through the neural network one by one.
Passing the entire dataset through the neural network once is not enough: the full dataset needs to pass through the same neural network multiple times to obtain the optimal weights.
One epoch leads to underfitting of the curve in the graph below (Figure 2.18). As the number of epochs increases, the weights are updated more times in the neural network, and the curve goes from underfitting to optimal to overfitting.
2.4.4 Iterations
In order to train on the entire dataset, which is usually big in machine learning, we need a certain number of iterations; the dataset is divided into smaller pieces called batches, and one iteration processes one batch.
Note: the number of batches is equal to the number of iterations for one epoch.
Example: let's say we have 4000 training examples in our dataset. We can divide the dataset of 4000 examples into batches of 250; it will then take 16 iterations to complete 1 epoch (4000/250 = 16).
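The relationship between dataset size, batch size, epochs, and iterations in a short sketch (values taken from the example above):

```python
dataset_size = 4000
batch_size = 250
epochs = 10

iterations_per_epoch = dataset_size // batch_size  # 16 batches = 16 iterations per epoch
total_iterations = iterations_per_epoch * epochs   # 160 weight updates in total
print(iterations_per_epoch, total_iterations)
```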
Chapter 3
Related Work
Quran 95:4 (Surat At-Tin, The Fig)
In this chapter we carry out a comparative study between existing methods related to our research, and then we select the method with the best results, which will be explored in more detail in the next chapter.
Why did we choose it? The most interesting point is that its dataset, UCCS, is a collection of images in surveillance scenarios, which makes it highly relevant to our subject.
2. Apply frontal/side face detection based on the details of the joints and draw the boundary boxes for all detected frontal/side faces. Based on the information of the five joints of the face (nose, left eye, right eye, left ear and right ear), and in order to reduce false alarms, a confidence threshold is set. The threshold is applied to all detected joints of the face, and the joints whose confidence is lower than the threshold are deleted. All the different detection situations (angles of the face) are considered to build a well-defined frontal/side face detection rule.
3. Apply a boundary box size check to the detected faces in order to decrease the false alarm rate. Two thresholds (thre_min and thre_max) are set for checking the size of the boundary box of each face. If size(boundary box) > thre_max or size(boundary box) < thre_min, the detected face is deleted.
4. Finally, a skin detector was trained using part of the training set. This helped to remove more false alarms.
3.1.3 Experiments
• Out of 23350 faces, 95% were detected, with 5000 false alarms.
• For precision and recall on the training set, the Area Under the Curve (AUC) is 0.94.
• For precision and recall on the validation set, the AUC is 0.944.
3.1.4 Results
• This method has the problem of producing many false alarms.
1. Selective Two-step Classification (STC) module: aims to filter out most simple negative anchors from the low-level detection layers to reduce the search space; it has two classes (face or background).
2. Selective Two-step Regression (STR) module: adjusts the locations and sizes of anchors from the high-level detection layers to provide better initialization for the subsequent regressor.
The authors also design a Receptive Field Enhancement (RFE) module that helps to better capture faces in some extreme poses. The RFE module is responsible for providing the necessary information to predict the classification and location of objects.
The face detector used in this work is based on RetinaNet, and it is trained on the WIDER FACE dataset.
Why did we choose it? The SRN method handles blurred and small faces in different positions, and it gives high results, so it may be useful in our case because we are going to train on faces in these conditions.
• They design an STR module to coarsely adjust the locations and sizes of anchors from
high level layers to provide better initialization for the subsequent regressor.
• They introduce an RFE module to provide more diverse receptive fields for detecting
extreme-pose faces.
• They achieve state-of-the-art results on AFW, PASCAL face, FDDB, and WIDER FACE
datasets.
• They work with anchors, including small anchors, to detect small faces.
• After using the STC module, the AP scores of the detector are improved from 95.1%, 93.9% and 88% to 95.3%, 94.4% and 89.4% on the Easy, Medium and Hard subsets, respectively. In order to verify whether the improvements come from reducing the false positives, the number of false positives is counted under different recall rates. The STC effectively reduces the false positives across different recall rates, demonstrating the effectiveness of the STC module.
• The STR module produces much better results than the baseline, with 0.8%, 0.9% and 0.8% AP improvements on the Easy, Medium, and Hard subsets. STR also produces more accurate localization and consistently more accurate detection results than the baseline method.
• When STR and STC are coupled, the performance is further improved to 96.1%, 95.0% and 90.1% on the Easy, Medium and Hard subsets, respectively.
• The RFE is used to diversify the receptive fields of the detection layers in order to capture faces with extreme poses. RFE consistently improves the AP scores in the different subsets, i.e., by 0.3%, 0.3%, and 0.1% AP on the Easy, Medium, and Hard categories. These improvements can be mainly attributed to the diverse receptive fields, which are useful to capture faces in various poses for better detection accuracy.
They evaluate the proposed FANet detector on several public face detection benchmarks, including the PASCAL face, FDDB, and WIDER FACE datasets, and achieve state-of-the-art results. Their detector can run in real time for VGA-resolution images on a GPU.
Why did we choose it? The reason is to see the effect of using contextual information in detecting faces, especially when the face is small and at different scales. So it may help our study.
• An effective Hierarchical Loss based training scheme is presented to train the proposed
FANet model in an end-to-end manner, which guides a more stable and better training
for discriminative features.
• Comprehensive experiments are carried out on several public Face Detection benchmarks
to demonstrate the superiority of the proposed FANet framework, in which promising
results show that the FANet detector not only achieves the state-of-the-art performances
but also runs efficiently with real-time speed on GPU.
• In this work the authors propose a new Agglomerate Connection module which aggregates multi-scale features more effectively than the skip connection module. Besides, they also introduce a novel Hierarchical Loss on the proposed FANet framework which enables training this powerful detector effectively and robustly in an end-to-end approach.
• WIDER FACE is a very challenging face benchmark, and the results strongly prove the effectiveness of FANet in handling large scale variance, especially for small faces.
• The final FANet improves by +3.9% over the vanilla S3FD while still reaching real-time speed.
• FANet introduces two key novel components: the “Agglomeration Connection” module
for context-aware feature enhancing and multi-scale features agglomeration with a hi-
erarchical structure, which effectively handles scale variance in face detection; and the
Hierarchical Loss to guide a more stable and better training in an end-to-end manner.
• On the WIDER FACE dataset, the FANet model is robust to blur, occlusion, pose, expression, makeup, illumination, etc., and it is also able to handle faces with a wide range of scales, even extremely small faces.
Why did we choose it? It focuses on detecting small faces and on the trade-off between accuracy and efficiency; besides that, the idea of reaching real time on a CPU while giving remarkable results is a really challenging goal nowadays.
• Proposing a fair L1 loss and using a dense anchor strategy to handle small faces well, which uniformly tiles several anchors around the center of one receptive field instead of tiling only one.
• Using the challenging WIDER FACE dataset for training and the PASCAL, AFW, and FDDB datasets for testing.
3.4.2 Results
• The fair L1 loss is promising: +0.7%, owing to locating small faces well.
• The dense anchor strategy is effective: +0.8% shows the importance of this strategy.
3.6 Conclusion
After seeing the results produced by the methods mentioned in this study, we see that SRN and FaceBoxes are the best methods in terms of AP values, while FANet is the best method in terms of good features (robust to blur, occlusion, small faces, etc.) and its speed on GPU. Concerning the BFD method, we find its implementation idea brilliant, but we think that extracting a lot of data, such as body joints, takes more time and still produces a lot of false alarms.
So, we can say that the best methods for our study are SRN and FaceBoxes because of their high average precision (AP) and their production of more accurate locations with a focus on tiny faces and extreme-pose faces; besides that, they are among the newest approaches to solving face detection problems. To learn more details about these methods, we discuss them in the next chapter.
Chapter 4
SRN and FaceBoxes
Alan Turing
In this chapter, we discuss the two methods that we have selected for our problem: the Selective Refinement Network (SRN) and FaceBoxes, a CPU real-time and accurate unconstrained face detector. They fit the conditions and fulfil the requirements through their impressive average precision results and their experiments carried out under many conditions.
4.1 Introduction
Face detection in surveillance scenarios requires very specific methods that can be adapted to the conditions and problems posed by the environment and the acquisition conditions, such as occlusion (e.g., wearing a medical bib, a hat, etc.); scale variation from big to small, because, as we know, the surveillance camera captures from a specific distance and angle, which produces a wide range of face scales; different illumination conditions (daytime, night); and various facial poses, since people clearly do not look directly at the camera when they walk or buy from shops. So, it is very important to train the machine to recognize all the sides of a face. Blurry faces appear when we have a low-quality camera or when frames are captured while the face moves very fast. Researchers have done a lot of the studies and research that we saw in the previous chapter. They aim to determine whether there is any face in the input image and to return coordinates of the bounding box which are close to the truth, as well as to improve accuracy and recall in real time.
The WIDER FACE dataset is 10 times larger than existing datasets. It contains rich annotations, including occlusions, poses, event categories, and face bounding boxes [25]. Faces in this dataset are extremely challenging due to the large variations in scale, pose, and occlusion, as well as the plenty of tiny faces in various complex scenes, as shown in Figure 4.1 [30]. Furthermore, the WIDER FACE dataset is an effective training source for face detection [25], with a high degree of variability in scale, pose, occlusion, expression, appearance, and illumination. Because it is such a challenging dataset with this high degree of variability, it was used by the SRN method.
Figure 4.1: A WIDERFACE dataset for face detection. The annotated face bounding boxes
are denoted in green color.
SRN performs favourably against the state of the art in terms of the average precision (AP) across the three subsets, especially on the Hard subset, which contains a large number of small faces. Specifically, it produces the best AP score in all subsets of both the validation and testing sets, i.e., 96.4% (Easy), 95.3% (Medium) and 90.2% (Hard) for the validation set, and 95.9% (Easy), 94.9% (Medium) and 89.7% (Hard) for the testing set, surpassing all approaches, which demonstrates the superiority of this detector. [4]
The WIDER FACE results thus demonstrate that SRN achieves state-of-the-art detection performance. [4]
SRN is a high-performance face detector based on deep convolutional neural networks (CNNs). Because many tiny faces exist in such scenes, it is designed to perform better detection and to increase the recall.
• ResNet makes it possible to train up to hundreds or even thousands of layers and still
achieves compelling performance. [9, 12, 20, 32]
• ResNet solves the vanishing gradient problem by using identity shortcut connection or
skip connections that skip one or more layers which means that the performance won’t
degrade even if the layers get more deeper so it performs better training. [9, 12, 20, 32]
FPN is not an object detector by itself; it is a feature extractor that works with object detectors. Face detection at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short, we call these featurized image pyramids) form the basis of a standard solution. FPN achieves significant improvements over several strong baselines [13]; for example, it has been used in Faster R-CNN together with the Region Proposal Network (RPN) [13, 19].
FPN has many advantages, among which: [13]
• It provides a practical solution for research and applications of feature pyramids, without
the need of computing image pyramids.
• FPN has inference time of 0.148 seconds per image on a single NVIDIA M40 GPU for
ResNet-50.
• Despite the effectiveness of ResNet and Faster R-CNN, FPN shows significant improve-
ments over several strong baselines and competition winners. [13, 19]
In the current state of the art, two-stage methods, e.g. Faster R-CNN, R-FCN, and FPN, have three advantages over the one-stage methods. SRN brings these advantages into a single-stage detector through the following components:
• The Selective Two-step Classification (STC) module to filter out most simple negative
samples from low level layers to reduce the classification search space.
• The Selective Two-step Regression (STR) module to coarsely adjust the locations and
sizes of anchors from high level layers to provide better initialization for the subsequent
regressor.
• A Receptive Field Enhancement (RFE) module to provide more diverse receptive fields
for detecting extreme-pose faces.
• Achieving state-of-the-art results on the AFW, PASCAL face, FDDB, and WIDER FACE datasets.
The overall framework of SRN is shown in figure 4.2. It consists of STC, STR, and RFE. STC uses the first-step classifier to filter out most simple negative anchors from the low-level detection layers to reduce the search space for the second-step classifier. STR applies the first-step regressor to coarsely adjust the locations and sizes of anchors from the high-level detection layers to provide better initialization for the second-step regressor. RFE provides more diverse receptive fields to better capture extreme-pose faces. We describe each component as follows:
4.2.3.2 Backbone
The ResNet-50 [10] has been adopted with 6-level feature pyramid structure as the backbone
network for SRN. The feature maps extracted from those four residual blocks are denoted as
C2, C3, C4, and C5, respectively. C6 and C7 are just extracted by two simple down-sample
3 × 3 convolution layers after C5. The lateral structure between the bottom-up and the top-
down pathways is the same as (Lin et al. 2017a) [13]. P2, P3, P4, and P5 are the feature maps
extracted from lateral connections, corresponding to C2, C3, C4, and C5 that are respectively
of the same spatial sizes, while P6 and P7 are just down-sampled by two 3 × 3 convolution
layers after P5.[4]
ResNet is used for deep feature extraction.
Feature Pyramid Network (FPN) is used on top of ResNet for constructing a rich
multi-scale feature pyramid from one single resolution input image.
The STC module selects C2, C3, C4,P2, P3, and P4 to perform two-step classification, while
the STR module selects C5, C6, C7, P5, P6, and P7 to conduct two-step regression. The RFE
module is responsible for enriching the receptive field of features that are used to predict the
classification and location of objects.[4]
A hybrid loss is appended at the end of the deep architecture, which leverages the merits of the focal loss and the smooth L1 loss to drive the model to focus on hard training examples and to learn better regression results.
• It aims to remove negative anchors so as to reduce search space for the classifier. [28]
For one-stage detectors, numerous anchors with an extreme positive/negative sample ratio (e.g., there are about 300k anchors and the positive/negative ratio is approximately 0.006% in SRN) lead to quite a few false positives. Hence another stage, like the RPN, is needed to filter out some negative examples. Selective two-step classification, inherited from RefineDet, effectively rejects lots of negative anchors and alleviates the class imbalance problem.
Specifically, most of the anchors (i.e., 88.9%) are tiled on the first three low-level feature maps, which do not contain adequate context information, so it is necessary to apply STC on these three low-level features. The other three high-level feature maps only produce 11.1% of the anchors, with abundant semantic information, which makes STC unnecessary there. To sum up, applying STC on the three low-level features brings improved results, while applying it on the three high-level ones brings ineffective results and more computational cost. The STC module suppresses the number of negative anchors by a large margin, increasing the positive/negative sample ratio by about 38 times (i.e., from around 1:15441 to 1:404). The shared classification convolution module and the same binary focal loss are used in the two-step classification, since both of the targets are to distinguish faces from the background. [30]
Therefore, the STC module selects C2, C3, C4, P2, P3, and P4 to perform two-step clas-
sification. The STC increases the positive/negative sample ratio by approximately 38 times,
from around 1:15441 to 1:404. In addition, we use the focal loss in both two steps to make
full use of samples. Unlike RefineDet [28], the SRN shares the same classification module in
the two steps, since they have the same task to distinguish the face from the background. The
experimental results of applying the two-step classification on each pyramid level are shown in
Table 4.1. Consistent with our analysis, the two-step classification on the three lower pyramid
levels helps to improve performance, while on the three higher pyramid levels is ineffective.
The loss function for STC consists of two parts, i.e., the loss in the first step and the loss in the second step. For the first step, the focal loss is calculated over the samples selected to perform two-step classification; for the second step, it focuses only on the samples that remain after the first-step filtering. With these definitions, the loss function is defined as:

L_{STC}(\{p_i\},\{q_i\}) = \frac{1}{N_{s1}} \sum_{i \in \Omega} L_{FL}(p_i, l_i^*) + \frac{1}{N_{s2}} \sum_{i \in \Phi} L_{FL}(q_i, l_i^*)    (4.1)

where i is the index of an anchor in a mini-batch, p_i and q_i are the predicted confidences of anchor i being a face in the two steps, l_i^* is the ground-truth class label of anchor i, N_{s1} and N_{s2} are the numbers of positive anchors in the first and second steps, Ω represents the collection of samples selected for two-step classification, and Φ represents the sample set that remains after the first-step filtering. The binary classification loss L_{FL} is the sigmoid focal loss over two classes (face vs. background).
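A minimal PyTorch sketch of the binary (sigmoid) focal loss L_FL used here; the alpha = 0.25 and gamma = 2 values are the usual defaults and are assumed rather than taken from the text.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training focuses on hard ones.

    logits:  raw scores for the 'face' class, shape (N,)
    targets: 1.0 for face anchors, 0.0 for background anchors, shape (N,)
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 0.0])
print(sigmoid_focal_loss(logits, targets))
```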
After the STC filters anchors and reduces the search space, STR applies regression on the three higher pyramid levels. The reason for not choosing the lower pyramid levels is that: 1) the three lower pyramid levels are associated with plenty of small anchors to detect small faces; these small faces are characterized by very coarse feature representations, so it is difficult for these small anchors to perform two-step regression; 2) in the training phase, if we let the network pay too much attention to the difficult regression task on the low pyramid levels, the loss will be biased towards the regression problem and hinder the essential classification problem. Meanwhile, the motivation is to make the framework more efficient: STR uses the detailed features of large faces on the three higher pyramid levels to regress more accurate locations of bounding boxes, while letting the three lower pyramid levels pay more attention to the classification task.
The loss function of STR also consists of two parts, which is shown below:

L_{STR}(\{x_i\},\{t_i\}) = \sum_{i \in \Psi} [l_i^* = 1]\, L_r(x_i, g_i^*) + \sum_{i \in \Phi} [l_i^* = 1]\, L_r(t_i, g_i^*)    (4.2)

where g_i^* is the ground-truth location and size of anchor i, x_i are the refined coordinates of anchor i in the first step, and t_i are the coordinates of the bounding box in the second step, used to locate the face's bounding box precisely. We can see the effectiveness of STR in Table 4.2.
STR B P2 P3 P4 P5 P6 P7
Easy 95.1 94.8 94.3 94.8 95.4 95.7 95.6
Medium 93.9 93.4 93.7 93.9 94.2 94.4 94.6
Hard 88.0 87.5 87.7 87.0 88.2 88.2 88.4
Table 4.2: AP performance of the two-step regression applied to each pyramid level. [4]
• Propose RFE to diversify receptive fields before predicting classes and locations.
• RFE replaces the middle two convolution layers in the class and box subnet of RetinaNet.
Current networks usually possess square receptive fields, which affect the detection of objects
with different aspect ratios. To address this issue, SRN designs a Receptive Field Enhancement
(RFE) to diversify the receptive field of features before predicting classes and locations, which
helps to capture faces well in some extreme poses.
– Photometric distortions.
– Expanding the images with a random factor in the interval [1, 2] by the zero-padding
operation.
– Cropping two square patches and randomly selecting one for training.
– Flipping the selected patch randomly and resizing it to 1024x1024.
Anchor Matching.
• The Intersection over Union (IoU) is used to divide the samples into negative and positive anchors.
Optimization.
• Fine-tune the SRN model using SGD with 0.9 momentum, 0.0001 weight decay.
• Setting the learning rate to 10−2 for the first 100 epochs, and decay it to 10−3 and 10−4
for another 20 and 10 epochs, respectively.
Inference.
• STC first filters the regularly tiled anchors on the selected pyramid levels with the neg-
ative confidence scores larger than the threshold θ = 0.99
• The second step takes over these refined anchors, and outputs top 2000 high confident
detections.
• Finally, applying the non-maximum suppression (NMS) with jaccard overlap of 0.5 to
generate the top 750 high confident detections.
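A minimal sketch of this inference filtering; the tensor names (boxes, scores, neg_scores) are hypothetical stand-ins for the detector's actual outputs, and torchvision's NMS is used for the last step.

```python
import torch
from torchvision.ops import nms

def srn_inference_filter(boxes, scores, neg_scores,
                         theta=0.99, keep_pre=2000, keep_post=750, iou_thr=0.5):
    """STC threshold on negative confidence, top-2000 selection, then NMS and top-750."""
    keep = neg_scores <= theta                      # drop anchors the first step calls background
    boxes, scores = boxes[keep], scores[keep]
    topk = scores.topk(min(keep_pre, scores.numel())).indices
    boxes, scores = boxes[topk], scores[topk]
    keep = nms(boxes, scores, iou_thr)[:keep_post]  # Jaccard overlap 0.5, keep the top 750
    return boxes[keep], scores[keep]

xy = torch.rand(5000, 2) * 100
boxes = torch.cat([xy, xy + torch.rand(5000, 2) * 50 + 1], dim=1)  # (x1, y1, x2, y2)
scores, neg_scores = torch.rand(5000), torch.rand(5000)
out_boxes, out_scores = srn_inference_filter(boxes, scores, neg_scores)
print(out_boxes.shape, out_scores.shape)
```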
You can see the results of evaluation on Benchmark on the table 4.3.
4.2.8 Conclusion
The SRN method is one of the most powerful works. With its structure of modules and strategies, it achieves the face detection task with high performance on challenging benchmarks. It has two steps: the first is STC, which aims to filter out negative samples and improve the precision at high recall rates; the second is STR, which makes the location of the bounding box more accurate. Moreover, RFE is introduced to provide diverse receptive fields to better capture faces in some extreme poses. Extensive experiments on the AFW, PASCAL face, FDDB and WIDER FACE datasets demonstrate that SRN achieves state-of-the-art detection performance.
See section 4.2.1.1, and beside that, they further improve the state-of-the-art performance on
the AFW, PASCAL face, and FDDB datasets.
• Achieves real-time speed on the CPU as well as on the GPU; regarding this challenging case of study, we can confirm the power of this method in terms of computing speed: it takes about 0.008 s to process an image on the GPU.
• Trades off accuracy and efficiency by shrinking the input image while focusing on detecting small faces.
To solve the problem of the low recall rate of small faces, which is caused by the small anchors being too sparse compared to the large ones, they propose a dense anchor strategy that solves this tiling density imbalance problem: it uniformly tiles several anchors around the center of one receptive field instead of tiling only one.
• The Multiple Scale Convolutional Layers (MSCL): aim at enriching the receptive fields and discretizing anchors over different layers to handle faces of various scales, combining coarse-to-fine information to improve the recall rate and the precision of detection.
Inspired by the RPN in Faster R-CNN [19] and the multi-scale mechanism in SSD [14], they develop a state-of-the-art face detector with real-time speed on the CPU that avoids the following problems of earlier detectors:
• Their speed is negatively related to the number of faces in the image: the speed degrades dramatically as the number of faces increases.
• The cascade-based detectors optimize each component separately, making the training process extremely complicated and the final model sub-optimal.
• For VGA-resolution (high quality) images, their run-time efficiency on the CPU is about 14 FPS, which is not fast enough to reach real-time speed.
In contrast, FaceBoxes:
• Uses a lot of tricks to push performance, both in terms of speed and accuracy.
• With the use of the Inception module, gains speed and a lower computation cost, which gives the benefit of using less memory.
• MSCL handles various scales of faces via enriching receptive fields and discretizing anchors over layers.
• Proposing a fair L1 loss and using a new anchor densification strategy to improve the recall rate of small faces.
• Achieving state-of-the-art performance on the AFW, PASCAL face and FDDB datasets.
Figure 4.3: Architecture of the FaceBoxes and the detailed information table about our anchor
designs.
Shrinking the spatial size of the input: A series of large strides is set for the convolution and pooling layers to rapidly shrink the spatial size of the input. As illustrated in Figure 4.3, the stride sizes of Conv1, Pool1, Conv2 and Pool2 are 4, 2, 2 and 2, respectively. The total stride of the RDCL is 32, which means the input spatial size is quickly reduced by a factor of 32.
Choosing suitable kernel sizes: To speed up the network, the kernel sizes of the first few layers should be small, while still being large enough to alleviate the information loss brought by the spatial size reduction. As shown in Figure 4.3, to keep both effectiveness and efficiency, they choose 7×7, 5×5 and 3×3 kernel sizes for Conv1, Conv2 and all Pool layers, respectively.
Reducing the number of output channels: We utilize the C.ReLU activation function
(illustrated in Figure 4.4(a)) to reduce the number of output channels. C.ReLU [21] is motivated
from the observation in CNN that the filters in the lower layers form pairs (i.e.,filters with
opposite phase). From this observation, C.ReLU can double the number of output channels
by simply concatenating negated outputs before applying ReLU. Using C.ReLU significantly
increases speed with negligible decline in accuracy.
Figure 4.4: (a) The C.ReLU modules where Negation simply multiplies −1 to the output of
Convolution. (b) The Inception modules.
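A minimal PyTorch sketch of the C.ReLU idea (concatenating the convolution output with its negation before the ReLU, so the channel count doubles); the layer sizes here are illustrative, not the exact FaceBoxes configuration.

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    """Conv -> concatenate with negated output -> BatchNorm -> ReLU."""
    def __init__(self, in_channels, out_channels, **conv_kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **conv_kwargs)
        self.bn = nn.BatchNorm2d(out_channels * 2)

    def forward(self, x):
        y = self.conv(x)
        y = torch.cat([y, -y], dim=1)  # Negation: multiply the conv output by -1 and concatenate
        return torch.relu(self.bn(y))

layer = CReLU(3, 24, kernel_size=7, stride=4, padding=3)
out = layer(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 48, 256, 256]) -- channels doubled by C.ReLU
```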
Multi-scale design along the dimension of network depth: As shown in Figure 4.3,
the designed MSCL consists of several layers. These layers decrease in size progressively and
form the multi-scale feature maps. Similar to [14], the default anchors are associated with
multi-scale feature maps (i.e., Inception3, Conv3_2 and Conv4_2). These layers, as a multi-scale design along the dimension of network depth, discretize anchors over multiple layers with different resolutions to naturally handle faces of various sizes.
Multi-scale design along the dimension of network width: To learn visual patterns for
different scales of faces, output features of the anchor-associated layers should correspond to
various sizes of receptive fields, which can be easily fulfilled via Inception modules. The Incep-
tion module consists of multiple convolution branches with different kernels. These branches,
as a multi-scale design along the dimension of network width, is able to enrich the receptive
fields.
As shown in Figure 4.3, the first three layers in MSCL are based on the Inception module. Fig-
ure 4.4(b) illustrates the Inception implementation, which is a cost-effective module to capture
different scales of faces.
The anchor density is defined as

A_{density} = \frac{A_{scale}}{A_{interval}}    (4.3)

where A_scale is the scale of the anchor and A_interval is its tiling interval. The tiling intervals of the default anchors are 32, 32, 32, 64 and 128, respectively. According to Eq. 4.3, the corresponding densities are 1, 2, 4, 4 and 4, so there is obviously a tiling density imbalance problem between anchors of different scales. Compared with the large anchors (i.e., 128×128, 256×256 and 512×512), the small anchors (i.e., 32×32 and 64×64) are too sparse, which results in a low recall rate for small faces. To eliminate this imbalance, a new anchor densification strategy is proposed. Specifically, to densify one type of anchor n times, A_number = n² anchors are uniformly tiled around the center of one receptive field instead of tiling only one at the center of this receptive field. Some examples are shown in Figure 4.5. To improve the tiling density of the small anchors, this strategy densifies the 32×32 anchor 4 times and the 64×64 anchor 2 times, which guarantees that different scales of anchors have the same density (i.e., 4) on the image, so that various scales of faces can match almost the same number of anchors.
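A minimal sketch of the densification idea: for a density factor n, tile an n × n grid of anchor centers uniformly around each receptive-field center instead of a single one. The receptive-field center and scale below are illustrative.

```python
import numpy as np

def densify_anchors(center_x, center_y, anchor_scale, interval, n):
    """Return n*n anchors (cx, cy, w, h) uniformly tiled around one receptive-field center."""
    step = interval / n
    # offsets place the n x n grid of centers symmetrically around (center_x, center_y)
    offsets = (np.arange(n) - (n - 1) / 2.0) * step
    anchors = [(center_x + dx, center_y + dy, anchor_scale, anchor_scale)
               for dy in offsets for dx in offsets]
    return np.array(anchors)

# densify the 32x32 anchor 4 times: 16 anchors at one receptive-field center
print(densify_anchors(16, 16, anchor_scale=32, interval=32, n=4).shape)  # (16, 4)
```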
Figure 4.5: Examples of anchor densification. For clarity, we only densify anchors at one receptive field center (i.e., the central black cell), and only color the diagonal anchors.
4.3.6 Training
Training dataset: The model is trained on 12880 images of the WIDER FACE[25] training
subset.
Data augmentation: Each training image has been treated by the following data augmen-
tation:
• Random cropping: randomly crop five square patches from the original image: one is the biggest square patch, and the sizes of the others range between [0.3, 1] of the short side of the original image. Then one patch is arbitrarily selected for the subsequent operations.
• Scale transformation: After random cropping, the selected square patch is resized to 1024
× 1024.
• Horizontal flipping: The resized image is horizontally flipped with probability of 0.5.
• Face-box filter: keep the overlapped part of a face box if its center is in the processed image described above, then filter out the face boxes whose height or width is less than 20 pixels.
• First match each face to the anchor with the best jaccard overlap.
• then match anchors to any face with jaccard overlap higher than a threshold (i.e., 0.35).
Loss function: The loss function in FaceBoxes is the same as that of the RPN in Faster R-CNN [19]: a 2-class softmax loss for classification and the smooth L1 loss for regression.
Fair L1 loss: the regression targets of the fair L1 loss are as follows:

t_x = x - x_a, \quad t_y = y - y_a, \quad t_w = w, \quad t_h = h    (4.4)

t_x^* = x^* - x_a, \quad t_y^* = y^* - y_a, \quad t_w^* = w^*, \quad t_h^* = h^*    (4.5)

where x, y, w, h denote the box center coordinates and its width and height, and x, x_a, x^* are for the predicted box, the anchor box, and the ground-truth box respectively (likewise for y, w, h). A scale normalization is implemented so that the loss value is scale-invariant:

L_{reg}(t, t^*) = \sum_{j \in \{x, y, w, h\}} \mathrm{fair}L_1(t_j - t_j^*)    (4.6)

where

\mathrm{fair}L_1(z_j) = \begin{cases} |z_j| / w^*, & \text{if } j \in \{x, w\} \\ |z_j| / h^*, & \text{otherwise} \end{cases}    (4.7)

It treats small and big faces equally by directly regressing the box's relative center coordinates, width, and height.
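A small sketch of the fair L1 regression loss as written in Eqs. 4.4–4.7, for a single box given as (cx, cy, w, h); the function and variable names are my own.

```python
def fair_l1_loss(pred, anchor, gt):
    """Fair L1 loss for one box; pred, anchor and gt are (cx, cy, w, h) tuples."""
    # regression targets (Eqs. 4.4 and 4.5)
    t = (pred[0] - anchor[0], pred[1] - anchor[1], pred[2], pred[3])
    t_star = (gt[0] - anchor[0], gt[1] - anchor[1], gt[2], gt[3])
    gt_w, gt_h = gt[2], gt[3]
    loss = 0.0
    for j, (tj, tj_star) in enumerate(zip(t, t_star)):
        z = abs(tj - tj_star)
        # normalize x and w by the ground-truth width, y and h by its height (Eq. 4.7)
        loss += z / gt_w if j in (0, 2) else z / gt_h
    return loss

# a small face and a large face with the same relative error give the same loss value
print(fair_l1_loss((10, 10, 20, 22), (9, 9, 16, 16), (10.5, 10.5, 21, 23)))
print(fair_l1_loss((100, 100, 200, 220), (90, 90, 160, 160), (105, 105, 210, 230)))
```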
Hard negative mining: This step sorts the samples and picks the top ones for faster optimization, because after matching the anchors most of them are found to be negative, which results in a significant imbalance between the positive and negative examples; the ratio between negatives and positives is kept at most at 3:1.
Other implementation details: All the parameters are randomly initialized with the "xavier" method. They fine-tune the resulting model using SGD with 0.9 momentum, 0.0005 weight decay, and a batch size of 32 (variable depending on the device capability). The maximum number of iterations is 120k; a learning rate of 10−3 is used for the first 80k iterations, and training then continues for 20k iterations each with 10−4 and 10−5. The method is implemented in the Caffe library, and it has also been implemented in the PyTorch library.
Methods based on CNNs have always had low run-time efficiency, although they have been accelerated with GPUs. We can now see in the table below that the CPU takes up the challenge and achieves real-time efficiency through the following steps:
• During inference, the method outputs a large number of boxes (e.g., 8, 525 boxes for a
VGA-resolution image).
• First filter out most boxes by a confidence threshold of 0.05 and keep the top 400 boxes
before applying NMS.
• Then performing NMS with jaccard overlap of 0.3 and keep the top 200 boxes.
• Then measuring the speed using Titan X (Pascal) and cuDNN v5.1 with Intel Xeon
E5-2660v3@2.60GHz.
As listed in Table 4.4, comparing with recent CNN-based methods, the FaceBoxes can run at
20 FPS on the CPU with state-of-the-art accuracy. Besides, it can run at 125 FPS using a
single GPU and has only 4.1 MB in size.
Extensive ablation experiments are applied to the FaceBoxes model on AFW, PASCAL, and FDDB. The FDDB results are the most convincing because it is the most difficult dataset.
Ablative Setting: To better understand FaceBoxes, each component is ablated one after another to examine how each proposed component affects the final performance and to show that no component is dispensable. This is how the test is done:
• Then MSCL is replaced with three convolutional layers, which all have a 3×3 kernel size and whose numbers of outputs are the same as those of the first three Inception modules of MSCL.
Contribution        FaceBoxes  FaceBoxes  FaceBoxes  FaceBoxes
RDCL                                                     ×
MSCL                                          ×          ×
Strategy                          ×           ×          ×
Accuracy (mAP)        96.0       94.9        93.9       94.0
Speed (ms)           50.98      48.27       48.23      67.48
Table 4.5: Ablative results of the FaceBoxes on FDDB dataset. Accuracy (mAP) means the
true positive rate at 1000 false positives. Speed (ms) is for the VGA-resolution images on the
CPU. [31]
Discussion of the ablation results: The ablation experiment shows how important each module is in the FaceBoxes method.
MSCL is better: The comparison between the second and third columns of Table 4.5 indicates that MSCL effectively increases the mAP by 1.0%, owing to its diverse receptive fields and the multi-scale anchor tiling mechanism.
Component                  DCFPN
Designed architecture        ✓       ✓       ✓       ×
Dense anchor strategy        ✓       ✓       ×       ×
Fair L1 loss                 ✓       ×       ×       ×
Accuracy (mAP)             99.2    94.5    93.7    93.2

Table 4.6: Ablation results of each component of the method, including the loss function, where DCFPN = Architecture + Strategy + Loss (✓ = component used, × = component removed). [31]
The second ablation experiment (Table 4.6) shows the following results:
• The dense anchor strategy is effective: the +0.8% gain (94.5% vs. 93.7%) shows the importance of this strategy.
FaceBoxes is evaluated on the common face detection benchmark datasets, including Annotated Faces in the Wild (AFW), PASCAL Face, and the Face Detection Data Set and Benchmark (FDDB).
AFW dataset: It contains 205 images with 473 faces. FaceBoxes achieves 98.91% on it (see the result in Figure 4.6).
PASCAL face dataset: It is collected from the test set of the PASCAL person layout dataset and consists of 1335 faces, with large variations in face appearance and pose, from 851 images. FaceBoxes achieves 96.30% on it.
FDDB dataset: It contains 5171 faces in 2845 images taken from news articles on Yahoo websites. FaceBoxes achieves 96.0% on the discontinuous ROC curve and 82.9% on the continuous ROC curve.
4.3.8 Conclusion
Achieving real time on a CPU device was a challenging issue, and FaceBoxes worked on improving performance to reach high results despite the usual drawbacks of CNN-based methods. FaceBoxes makes a big step through its structure: it uses RDCL to achieve fast run time, MSCL to enrich the receptive fields and learn faces at different scales, and a new anchor densification strategy proposed to improve the recall rate of small faces. The experiments demonstrate state-of-the-art results, reaching 20 fps on CPU and 125 fps on GPU, together with high accuracy on the AFW, PASCAL and FDDB datasets.
4.4 Conclusion
In this chapter we described the two methods, SRN and FaceBoxes, in more detail. After seeing their high results, the conditions they handle and the experiments applied, we consider that they can be suitable for our study. We are going to train FaceBoxes on our dataset, which is a collection of pictures taken from surveillance scenarios, test it as well on a surveillance video, and see what we obtain.
Chapter 5
Implementation
In this chapter we will highlight our contribution in the field of face detection: which method we used and in which environment, to detect faces in real time in surveillance scenarios and achieve good performance in terms of accuracy and recall.
5.1 Introduction
Computer vision is a field with many requirements to achieve good performance: not only which network, structure or method is used, but also which machine is going to process the large amount of data. Is it fast enough? Can its memory capacity hold all the input simultaneously? It is a complete set of requirements that must be satisfied to reach the target. By running the experiments with the FaceBoxes method in our environment, using our datasets for training and testing, let us see what we obtained in the following sections.
• Among the SRN method's requirements, the GPU memory capacity should be at least 11 GB, while our GPU has only 2 GB.
• When we run the testing code on the WIDER-Face test dataset, it gives a result of AP = 0, which makes it confusing to judge the performance of the method (see Figure 5.1).
• The training code of SRN has not been released yet, and its testing code is tied only to WIDER-Face, which makes it difficult to adapt to our dataset.
• For FaceBoxes, both the training and testing codes are available and we can modify them.
5.3 Environment
5.3.1 Operating system
Both methods need a Linux system to be executed, so we chose Ubuntu 18.04 LTS as the operating system to have an environment compatible with the requirements, which include:
• the PyTorch library.
To avoid many errors during execution, it is better to choose carefully the version of each package to be installed.
• We used GitHub as the collaboration tool between us and our supervisor, so that he could correct our errors and follow our progress.
The original FaceBoxes code was implemented in Caffe and later re-implemented in the PyTorch library. We chose the latter because it is more familiar to us.
• We downloaded more than 24 security-camera videos for training and 14 videos for testing from YouTube, plus one video from a supermarket that we used for testing.
• We also collected 139 images from different datasets (PASCAL, FDDB, AFW) and 126 images from the WIDER-Face dataset. These images were acquired under different conditions (see Figure 5.2), which is helpful for feeding the network; they were selected carefully.
As said before, the videos were downloaded from YouTube. So how do we extract their frames and collect the best ones?
• We wrote a simple script using the OpenCV library that extracts one frame every 2 seconds from the video:

import cv2

# open the downloaded video
vidcap = cv2.VideoCapture('videotest/12.mp4')

def getFrame(sec, i):
    vidcap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    hasFrames, image = vidcap.read()
    if hasFrames:
        # save the frame as a JPG file
        cv2.imwrite("t" + str(i) + ".jpg", image)
    return hasFrames

i = 0
sec = 0
# capture one image every 2 seconds
frameRate = 2
success = getFrame(sec, i)
while success:
    sec = sec + frameRate
    sec = round(sec, 2)
    i += 1
    success = getFrame(sec, i)
• After extracting the frames from the videos, we keep only the ones that contain faces and delete the rest (see Figure 5.3 for samples of the selected frames).
MakeSense: an open-source image annotation tool released under the GPLv3 license. It does not require any advanced installation; a web browser is enough to run it (open-source, free, web-based). The user interface is simple and easy to use. MakeSense supports multiple annotation types: bounding box, polygon and point annotation. The labels can be exported in different formats, including YOLO, VOC XML, VGG JSON and CSV. In our case we need VOC XML, which is the format required by FaceBoxes. Here is a step-by-step guide to the MakeSense annotation tool:
2. Click the "Get Started" box at the bottom to go to the annotation page, where you can upload the images you want to annotate (a maximum of 600 images at a time).
4. Since we do not have any labels loaded, we create the label for our project, which is "face". To add a new label, click the '+' sign in the top-left corner of the message box, enter the label in the "Insert Label" text field, then click "Start Project". Then select "doing it my own" and start labelling. (See Figure 5.4)
5. After annotating all the images, it is time to export the labels. To export, click the 'Export Labels' button at the top-right of the page and select the XML VOC format.
The training data must then be organized as follows:
1. The images folder, which contains all the extracted images with the '.jpg' extension.
2. The annotations folder, which contains an XML file for each image. Make sure each XML file has the same name as its associated image.
3. The img_list.txt file, which contains the <image name, xml file name> pairs. They have to be in this format:
<imageName>.jpg <imageName>.xml
Make sure all pairs are listed in this single file; an example script to generate it automatically is given below.
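As an example, the img_list.txt file can be built with a small script like the following (our own sketch; the folder layout is the one described above and the paths are assumptions):

import os

images_dir = 'data/finalbdd/images'            # assumed paths
annotations_dir = 'data/finalbdd/annotations'

with open('data/finalbdd/img_list.txt', 'w') as out:
    for name in sorted(os.listdir(images_dir)):
        if not name.lower().endswith('.jpg'):
            continue
        xml_name = os.path.splitext(name)[0] + '.xml'
        # only list images that actually have an annotation file
        if os.path.isfile(os.path.join(annotations_dir, xml_name)):
            out.write('{} {}\n'.format(name, xml_name))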
• Change the batch size to be compatible with your device; in our case we set it to 15 to avoid CUDA out-of-memory errors.
• Point the training script to our dataset directory:

parser.add_argument('--training_dataset', default='./data/finalbdd',
                    help='Training dataset directory')
5.5.0.2 Training
The training in our case took more than 10 hours, with 300 epochs of 86 iterations each (25,800 iterations in total). It took approximately 1.2 seconds per batch (see Figure 5.5 and Figure 5.6) and gave us the weights file at the end.
5.5.0.3 Testing
• We ran the test on our dataset, which contains 425 images, using both the CPU and the GPU. As a result we obtained a file containing the bounding-box coordinates with the confidence of each one, which we will use to calculate the Average Precision (AP) in the next chapter. Before execution, we need to prepare these two folders:
4. To execute,
on GPU:
python3 test.py --dataset testmybdd
on CPU:
python3 test.py --dataset testmybdd --cpu
• We wrote the testing code for the video dataset. To do so, we selected a video from a supermarket and added instructions to extract the video's frames and run the testing in parallel. We also extracted the frames using the same code as before (Section 5.4.2.2) and labelled them with the MakeSense tool, so that we can calculate the AP. To run this test we added the following instructions:
vidcap = cv2.VideoCapture('samples/vidd.mp4')

def getFrame(sec):
    vidcap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    hasFrames, image = vidcap.read()
    return hasFrames, image

and

while True:
    cap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000)
    has_frame, img_raw = cap.read()
    sec += 1
    if not has_frame:
        print('[i] ==> Done processing!!!')
        break
    # testing begins here
    # ...
• We wrote a script to run a demonstration using an input video or the laptop webcam. First we check whether the input is a video or the webcam:

if args.video:
    if not os.path.isfile(args.video):
        print("[!] ==> Input video file {} doesn't exist".format(args.video))
        sys.exit(1)
    cap = cv2.VideoCapture(args.video)
    output_file = args.video[:-4].rsplit('/')[-1] + '_Facebox.avi'
else:
    # get data from the camera
    cap = cv2.VideoCapture(args.src)
    output_file = args.video[:-4].rsplit('/')[-1] + '_webcamFaceBox.avi'
while True:
    # ... (frame reading and FaceBoxes inference are not shown here;
    #      the following lines run for each detected bounding box b)
        text = "{:.4f}".format(b[4])
        b = list(map(int, b))
        cv2.rectangle(img, (b[0], b[1]), (b[2], b[3]), (0, 0, 255), 2)
        cx = b[0]
        cy = b[1] + 12
        cv2.putText(img, text, (cx, cy),
                    cv2.FONT_HERSHEY_DUPLEX, 0.5, (255, 255, 255))
    # save the output video
    video_writer.write(img.astype(np.uint8))
    # show the video
    cv2.imshow('res', img)
    key = cv2.waitKey(1)
    if key == 27 or key == ord('q'):
        print('[i] ==> Interrupted by user!')
        break
For execution:
Run MyTest.py with a video input:
python MyTest.py --video samples/vid.mp4 --output-dir outputs/
Run MyTest.py with your own webcam:
python MyTest.py --src 0 --output-dir outputs/
Hypotheses. In order to verify the impact of the machine capacity on the results, we ran the training and testing on a more powerful machine, the HPC (High Performance Computing) cluster located in the data center of our university, by accessing it remotely through SSH and submitting the job with the following script:
#!/bin/bash
#SBATCH -J FaceDetection_Job       # Job name
#SBATCH -o FaceDetection_.%j.out   # Name of stdout output file (%j expands to jobId)
#SBATCH -N 1                       # Total number of nodes requested
##SBATCH -n 1                      # Number of tasks per node (default = 1)
#SBATCH -p gpu                     # Partition (queue) requested
#SBATCH --nodelist=node11          # Run on a specific node
#SBATCH --gpu 2                    # Number of GPUs requested
#SBATCH -t 06:00:00                # Run time (hh:mm:ss) - 6 hours

# Launch
./make.sh
python train.py                       # in case of training
python test.py --dataset testmybdd    # in case of testing
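To submit this job to the scheduler and follow its state, we use the standard SLURM commands (the script name facedetection.sh below is an assumed example):

sbatch facedetection.sh
squeue -u <username>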
On the HPC, the training took only 2 hours, while our personal machine took over 10 hours. We also changed the batch size to 32 and used two GPUs. The result of the training is shown in Figure 5.7.
5.6 Conclusion
After successfully completing the training and testing tasks using our dataset for face detection in surveillance scenarios, we will now see whether the weights obtained in the training phase are good enough to give AP results similar to those obtained in the original FaceBoxes experiments. Do they feed the network well enough to give the same accuracy? Does changing the batch size reduce the speed and affect the results? We will answer all these questions in the next chapter. The code is available at:
https://github.com/IhceneDjou/FaceBoxe-surveillanceVideo
Chapter 6
Results and discussions
After doing all the work described in the previous chapters, it is now time to see the results obtained by our implementation using our own dataset. How did we get them? Do they satisfy the requirements? Do the conditions cover all the cases needed to reach the target? We will discuss all these questions in this chapter.
6.1 Introduction
Face detection in surveillance scenarios in real time is a challenging task: it requires both speed efficiency and accurate bounding boxes, with a high average precision and few false positives. Our dataset is ready with its labels and our model has been defined and trained. Does the test phase now give the desired results?
• The code calculates the Average Precision (AP) for each of the classes present in the ground truth. In our case there is only one class, 'face', and the matching is done using the IoU (Intersection over Union); a simplified matching sketch is shown below.
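For reference, here is a simplified sketch of how a detection can be matched to the ground truth with the IoU (our own illustration, not the exact code of the evaluation tool we used):

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def count_matches(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (confidence, box); ground_truth: list of boxes.
    Greedily matches each detection (highest confidence first) to an unused GT box."""
    used = [False] * len(ground_truth)
    tp, fp = 0, 0
    for conf, det in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_gt = 0.0, -1
        for g, gt in enumerate(ground_truth):
            if not used[g]:
                overlap = iou(det, gt)
                if overlap > best_iou:
                    best_iou, best_gt = overlap, g
        if best_iou >= iou_threshold:
            used[best_gt] = True
            tp += 1
        else:
            fp += 1
    return tp, fp   # false negatives = ground-truth boxes left unused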
6. To do an animation during the calculation, insert images into the folder input/images-optional/.
• Use matching names for the files (e.g. image: "t1.jpg", ground-truth: "t1.txt").
• E.g. "t1.txt" (a script to generate these files from the XML annotations is sketched after this example):
face 2 10 173 238
face 439 157 556 241
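For instance, these ground-truth text files can be generated from the VOC XML annotations with a small script like this one (our own sketch; folder names are assumptions):

import os
import xml.etree.ElementTree as ET

annotations_dir = 'annotations'        # VOC XML files exported from MakeSense
output_dir = 'input/ground-truth'      # folder expected by the mAP code

os.makedirs(output_dir, exist_ok=True)
for xml_name in os.listdir(annotations_dir):
    if not xml_name.endswith('.xml'):
        continue
    tree = ET.parse(os.path.join(annotations_dir, xml_name))
    lines = []
    for obj in tree.findall('object'):
        box = obj.find('bndbox')
        # one line per face: "face xmin ymin xmax ymax"
        lines.append('face {} {} {} {}'.format(
            box.find('xmin').text, box.find('ymin').text,
            box.find('xmax').text, box.find('ymax').text))
    txt_name = os.path.splitext(xml_name)[0] + '.txt'
    with open(os.path.join(output_dir, txt_name), 'w') as f:
        f.write('\n'.join(lines))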
• Then we copied each image's bounding boxes into a separate '.txt' file and changed the class name to 'face' (previously it was the name of the image), so that each file looks like this:
• Use matching names for the files (e.g. image: "f1.jpg", detection-results: "f1.txt").
• E.g. "f1.txt":
face 0.471781 0 13 174 244
face 0.414941 274 226 301 265
6.3 Results
6.3.1 Time
• Testing on our CPU took 1.283 seconds per image, as shown in Figure 6.1.
• Testing on our GPU took 0.02 seconds per image, which means 50 FPS, as shown in Figure 6.2.
• Testing on the CPU of the HPC gives 0.745 seconds per image, as shown in Figure 6.3.
• Testing on the GPU of the HPC gives 0.02 seconds per image, which also means 50 frames per second, as shown in Figure 6.4.
• As shown by the curve in Figure 6.5, the average precision obtained is 34.03%.
• Figure 6.6 shows that the program detects 1096 faces out of the 1342 faces of the ground truth (true positives), which means that 246 ground-truth faces are not detected. It also shows 25272 false alarms (false-positive bounding boxes).
• Figure 6.7 shows an example of face detection with a high confidence of 71%. The green rectangle is the ground truth and the blue one is the detected bounding box.
• Figure 6.8 shows a face detection with a confidence of 42.22%; this is why we changed the threshold from 50% to 35%, because the program detects many faces but with low confidence.
• Figure 6.9 shows the detected bounding boxes (true and false). The red ones are false positives, the green ones are the bounding boxes that match the ground truth, the blue ones are the ground-truth boxes detected by the program, and the pink ones are the ground-truth boxes that were not detected.
Figure 6.5: The Average precision obtained for test images by Faceboxes
• As shown by the curve in Figure 6.10, the average precision obtained is 22.71%.
Figure 6.9: Result of Bounding boxes. Red: false positives , Green: true positives , Blue: ground
truth detected , Pink: ground-truth not detected.
• Figure 6.11 shows that the program detects 277 faces out of the 386 faces of the ground truth (true positives), which means that 109 ground-truth faces are not detected. It also shows 12971 false alarms (false-positive bounding boxes).
• Figure 6.12 shows an example of face detection with a high confidence of 85.04%. The green rectangle is the ground truth and the blue one is the detected bounding box.
• Figure 6.13 shows a face detection with a confidence of 38.03%; this is why we changed the threshold from 50% to 35%, because the program detects many faces but with lower confidence.
• Figure 6.14 shows the detected bounding boxes (true and false). The red ones are false positives, the green ones are the bounding boxes that match the ground truth, the blue ones are the ground-truth boxes detected by the program, and the pink ones are the ground-truth boxes that were not detected.
• As shown by the curve in Figure 6.15, the average precision obtained is 64.91%.
Figure 6.10: The Average precision obtained for video test by Faceboxes
Figure 6.14: Result of Bounding boxes. Red: false positives, Green: true positives, Blue: ground-
truth detected, Pink: ground-truth not detected.
• Figure 6.16 shows that the program detects 1212 faces out of the 1342 faces of the ground truth (true positives), which means that 130 ground-truth faces are not detected. It also shows 13920 false alarms (false-positive bounding boxes).
Figure 6.15: The Average precision obtained for test images by Faceboxes in HPC
6.4 Discussion
After seeing the results obtained by our model based on the FaceBoxes method, let us look at the reasons behind them:
• We did the training using our dataset, which contains 1277 images with 4537 faces. This is quite small compared to the training done by FaceBoxes on the WIDER-Face dataset, which has 12881 training images and around 157481 faces, so their model is stronger than ours.
• Labelling the images manually sometimes made us miss drawing some boxes, as shown in the image: the program detects the face but does not find any matching ground truth.
• Our tests were done with 425 images containing 1342 faces, and then with a video of 205 frames containing 386 faces. The program gives good results with these small numbers, and it would likely give a higher AP if the testing dataset were wider.
• Another reason for our reduced results is the change of the batch size from 30 to 15, forced by our machine's memory capacity; the network efficiency is reduced because of this smaller input.
• The false positives are more numerous than the true positives, as shown in Figures 6.4, 6.7, 6.9 and 6.12, which is a limit of this approach.
The model trained and tested on the HPC gives an AP of 64.91%, which is better and higher than the previous results obtained on our machine (30.88%). It detects 1212 ground-truth faces, compared to the first model which detected 1096; besides that, it gives fewer false alarms. This confirms our hypothesis that the capacity of the machine can affect the results; yet the results are still far from the optimal results obtained by the FaceBoxes experiments. We can therefore say that the remaining gap comes from the dataset, which needs to be really wide and big to feed the network and give a more powerful model, and collecting more data needs more time.
6.5 Conclusion
In the face detection tests, our model gives an AP of 34.03% on images, 22.71% on video, and 64.91% on images using the HPC. On our machine it reaches a speed of 1.283 s per image on the CPU and 50 FPS on the GPU, and on the HPC 0.7 s per image on the CPU and 50 FPS on the GPU, which is below the results reported for the FaceBoxes method. This is due, first, to the dataset, which is not big enough to give an optimal model; the test dataset is also small, so we cannot really judge the speed. Second, it is due to the machine capacity: by training and testing on the HPC and obtaining improved AP results, we confirmed that the machine capacity affects the results. Besides that, we got many false alarms, again because the dataset is not big enough. Third, it is due to the lower accuracy of our manual image labelling.
Chapter 7
Conclusion
Face detection has always been a challenging issue, and researchers keep working to achieve ever higher performance. Because the need for face detection has become important in the world of technology, we worked on surveillance scenarios. Security cameras are installed everywhere, and we aim to add a face detection option to them. This option will meet the needs of anyone looking to detect humans by detecting their faces, for example in shops, police offices, airports and much more. Face detection is also useful for statistics, by counting the number of humans in a specific area. It can be taken further with face recognition, using face detection as a first step, especially when it gives efficient results, and it can be used in secured places where no humans should be present: applying face detection there would provide efficient alarms. In our work, we studied and compared four existing face detection methods. We selected the FaceBoxes method for our study because of its efficiency. We created our own dataset covering different surveillance conditions, with 1277 images containing 4537 faces for training, and 425 images containing 1342 faces plus one video of 205 frames with 386 faces for testing.
Our model reaches 1.738 s per image on the CPU and 50 FPS on the GPU of our personal machine, and 0.7 s per image on the CPU and 50 FPS on the GPU of the university's HPC, which means that we achieved real time only on GPU. We obtained an Average Precision of 34.03% for the image test and 22.71% for the video test on our machine, and 64.91% for the image test on the HPC. Our results are lower than those of FaceBoxes on FDDB, PASCAL and AFW, due to the capacity of our machine and the lack of more data. Otherwise, we can say that the results obtained are good enough for the requirements that we set.
As perspectives, we plan to reduce the false positives shown in Figures 6.6, 6.11 and 6.16 by cropping the images around the faces and using these crops in the training phase, which means
enriching our dataset through data augmentation. This solution will give more accurate results and an improved AP thanks to a more robust model. We also plan to build a more precise ground truth, using more powerful machines and bigger data.
Chapter 8
Appendix
In this section we mention the most common errors that we faced during the development and preparation, in Python, of the existing SRN and FaceBoxes code. As it was our first time working with this kind of code and environment, it took a lot of time to satisfy the conditions needed to execute the code.
1. The first thing we faced is that we needed to switch our operating system to a Linux environment (we chose Ubuntu 18.04 to get Python 3.6 by default, which is requested by the code owners), because the code includes Bash files that can only be run under Linux.
2. If you already have Windows on your machine, install Linux beside it and make sure to prepare a free partition for the Linux system before the installation.
3. Both codes use CUDA/cuDNN, so your machine should have a GPU. Before installing CUDA and cuDNN, you first need to install the Nvidia driver.
4. You need to choose carefully the version of each package to be installed in order to have a compatible environment (the SRN code requests PyTorch < 1.0.0 and torchvision 0.2.1; CUDA should be version 9.0 or lower, with cuDNN V9.5.6 and Nvidia driver 380).
5. FaceBoxes requests a PyTorch version > 1.0.0, and the CUDA version should be > 9.0. Working inside an Anaconda environment can save time by keeping each code's libraries organized with the needed versions.
• If you do not want a dual-boot machine and keep only Linux, be very careful when uninstalling the other operating system.
• The data should be varied, with different positions and scenes.
• Select data that matches the goal, which is detecting faces in surveillance scenarios.
To accomplish these points you need a lot of time and searching in order to target the right area and build a dataset that fits the study, with good quality matching the requirements.
• When creating the image labels in XML VOC format, if you want to change something, make sure that all the tags are closed and that you did not delete any character; otherwise it is very confusing to hunt for the issue across many files (see Figure 8.1).
• A second error may appear if you have limited memory capacity; make sure to change the parameters to suit your device, as shown in Figure 8.2.
• Make sure that the name and extension are exactly the same in the img_list.txt file (for example, '.jpg' and '.JPG' are not the same). (See Figure 8.3.)
• Make sure that the path in all image labels is the same, because otherwise it will cause problems for the program. A small check script for these points is sketched below.
Figure 8.2:
• When installing PyCharm CE for Python via Ubuntu Software, interruptions sometimes happen (for example because of the Internet connection); as a result, an error message will appear when you try to reinstall it.
Bibliography
[2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh, Realtime multi-person 2d pose estimation
using part affinity fields, CoRR, abs/1611.08050 (2016).
[3] J. Cartucho, R. Ventura, and M. Veloso, Robust object recognition through sym-
biotic deep learning in mobile robots, in 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2018, pp. 2336–2341.
[4] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou, Selective refinement network
for high performance face detection, CoRR, abs/1809.02693 (2018).
[5] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, 2005, pp. 886–893 vol. 1.
[10] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition,
CoRR, abs/1512.03385 (2015).
[12] R. Khandelwal, Deep learning using transfer learning -python code for
resnet50. https://towardsdatascience.com/deep-learning-using-transfer-learning-python-
code-for-resnet50-8acdfb3a2d38, 2019. Accessed on 2020-09-03.
[16] M. I. Nouyed and G. Guo, Face detection on surveillance images, arXiv preprint
arXiv:1910.11121, (2019).
[18] L. R., Focus: Mobilenet, a powerful real-time and embedded image recognition, 2018. Ac-
cessed on 2020-09-03.
[19] S. Ren, K. He, R. Girshick, and J. Sun, Faster r-cnn: Towards real-time object detec-
tion with region proposal networks, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39 (2017), pp. 1137–1149.
[21] W. Shang, K. Sohn, D. Almeida, and H. Lee, Understanding and improving con-
volutional neural networks via concatenated rectified linear units, CoRR, abs/1603.05201
(2016).
[24] P. Viola and M. Jones, Robust real-time face detection, International Journal of Com-
puter Vision, 57 (2004), pp. 137–154.
[25] S. Yang, P. Luo, C. C. Loy, and X. Tang, WIDER FACE: A face detection benchmark,
CoRR, abs/1511.06523 (2015).
[26] J. Yoon and D. Kim, An accurate and real-time multi-view face detector using orfs and
doubly domain-partitioning classifier, Real-Time Image Processing, 16 (2019).
[27] J. Zhang, X. Wu, J. Zhu, and S. C. H. Hoi, Feature agglomeration networks for single
stage face detection, CoRR, abs/1712.00721 (2017).
[28] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, Single-shot refinement neural
network for object detection, CoRR, abs/1711.06897 (2017).
[29] S. Zhang, L. Wen, H. Shi, Z. Lei, S. Lyu, and S. Z. Li, Single-shot scale-aware
network for real-time face detection, Int. J. Comput. Vision, 127 (2019), p. 537–559.
[30] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z. Li,
Improved selective refinement network for face detection, CoRR, abs/1901.06651 (2019).
[31] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, Faceboxes: A CPU real-time
face detector with high accuracy, CoRR, abs/1708.05234 (2017).