
A real-time parallel architecture for human face detection based on the Algorithm Architecture Adequation approach


Dmitriev Ivan1, Grandpierre Thierry2, Akil Mohamed2, Ghorayeb Hicham3, Goto Satoshi1, Ikenaga Takeshi1

1 Waseda University, Graduate School of Information, Production and Systems
2 Ecole Supérieure des Ingénieurs en Eléctronique et Electricité, Architectures et Algorithmes pour Systèmes Intégrés Laboratory
3 Ecole Nationale Supérieure des Mines de Paris, Centre de Robotique

ABSTRACT

Today, human face detection is an essential part of bio-medical and security equipment. Following the introduction of AdaBoost and of new methodologies such as SVM, human face detection shows a steady increase in the diversity of its applications, but mainly in software form. This work focuses on the optimization of the AdaBoost/CA algorithm from CAOR for its implementation in a hardware architecture, and on the application of the Algorithm Architecture Adequation (AAA) methodology to the development of a real-time scalable parallel architecture prototype. Real-time performance is achieved by fully exploiting the data and command parallelism, using the AAA approach, and scalability is achieved and maintained by the use of generic components. The architecture prototype is capable of real-time processing and is scalable for images from 160 * 120 to 1920 * 1440 pixels.

Keywords: AdaBoost, scalable real-time architecture, scalable parallel architecture, AAA, face detection

1. Introduction

The rapid prototyping of complex applications for their implementation in hardware has become a very important issue, since those applications are increasingly varied. A face detection component is an example of such an application: it is present today in many software-based video processing platforms. The face detection method discussed in this paper is based on the AdaBoost/CA algorithm, derived from the original AdaBoost[2]; it uses multilevel classifiers to accelerate detection. Previously existing face detection applications were rarely implemented in hardware [1,7,6], due to the cost and the complex scheduling required for such an architecture to function; furthermore, the fact that the components of such an architecture would rarely be reusable, and that the architecture itself would not be scalable, made such developments unattractive.

The AAA methodology[8] facilitates the implementation and algorithmic optimization of any algorithm and, together with the working demonstrator technology, can provide a tool[8] for automatically generating the architecture itself. The AdaBoost hardware implementation discussed later was however developed without such a tool, but following the AAA method.

In this paper we present the development of a scalable parallel architecture from algorithm to prototype; it should be noted, however, that the weak learner of AdaBoost was not implemented in hardware. The main challenge was to build a real-time scalable face detection system, so the network training component was not included in the architecture. Therefore, at the beginning of the detection cycle, the architecture is preloaded with already trained classifiers from an external source or from an internal ROM.

The real-time scalable architecture prototype, which can be put on an FPGA, can process a 30 FPS gray-scale 8 bpp video of the widely used size of 640 * 480 pixels.

2. Notations of AAA and AdaBoost

The naming conventions of this paper follow the works on AAA[3,5], and the algorithm is employed as it was in its original context: upright-position human face detection.

The feature is a characteristic operator allowing a data set to be classified into two categories according to the probability of face appearance.

Figure 1: The feature mask applied on faces.
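To make these notions concrete, the feature/classifier pair can be sketched in a few lines of Python. This is our own illustration with hypothetical pixel-comparison features, not the actual AdaBoost/CA operators used in the paper:

```python
# Illustrative sketch: a "feature" casts a weighted vote on a sub-image,
# a "classifier" sums the votes of its features against a threshold.
# The point-comparison test below is a hypothetical stand-in.

def make_feature(p1, p2, weight):
    """A feature compares two representative points of a sub-image and
    casts a weighted vote (weight if the test passes, 0 otherwise)."""
    def vote(sub_image):
        (r1, c1), (r2, c2) = p1, p2
        return weight if sub_image[r1][c1] > sub_image[r2][c2] else 0.0
    return vote

def classify(features, threshold, sub_image):
    """A classifier is a set of features; it answers 'may contain a face'
    when the sum of votes reaches its threshold."""
    return sum(f(sub_image) for f in features) >= threshold

# A 2-feature classifier (N1 = 2 in the paper's notation), on a 2x2 patch.
c1 = [make_feature((0, 0), (1, 1), 0.6), make_feature((0, 1), (1, 0), 0.4)]
patch = [[9, 8], [7, 1]]          # bright top-left, dark bottom-right
print(classify(c1, 0.5, patch))   # both features vote -> True
```

In the real detector these decisions come from trained classifiers loaded into the feature RAM, as described above.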


The classifier is a set of features which gives a binary response to indicate whether a face may be absent in a sub-image.

The face detector uses a set of classifiers to ultimately detect the presence of a human face.

Notations in the algorithm:

Res0 : the original image at resolution 0; its dimensions are ImX * ImY pixels.
Res1 : the original image at resolution 1; its dimensions are ½ImX * ½ImY pixels.
Res2 : the original image at resolution 2; its dimensions are ¼ImX * ¼ImY pixels.
Ci : the classifier number i; as said above, a classifier is a set of features.
Ni : the number of features in the classifier Ci. For example, N1 = 2.
Threshi : the threshold of the classifier Ci.
Thglobal : the global sum of all the already used features of the classifier Ci.
M0 : the number of sub-images in the image of resolution Res0.
Fj,i : the feature number j of the classifier Ci.
Imk : the kth sub-image of the original image at resolution 0.

With those notations, the AdaBoost algorithm is described as follows.

Inputs: Res0, Facek = 1 for k from 1 to M0
Outputs: Facek, the decision for each sub-image Imk of Res0. Facek = 1 if there is a face, 0 otherwise.

1 Generate the 3 resolutions of the original image
2 For each classifier Ci (For i from 1 to S) // Classifiers loop
3   For each sub-image Imk (For k from 1 to M0) // Sub-images loop
4     If (Facek = 1) Then
5       For each feature Fj,i of classifier Ci (For j from 1 to Ni) // Features loop
6         Vj,i <= F(Imk, Res1, Res2); // Apply the feature Fj,i on sub-image Imk
7       EndFor;
8       Si,k <= Σ Vj,i // Si,k is the sum of votes of all features of the classifier Ci applied on Imk
9       Facek <= comparison(Si,k – Sglobal,i > Threshi);
10    EndIf; EndFor;
11 EndFor;
12 For k from 1 to M0 // Retrieve the faces
13   return Imk if Facek = 1;
14 EndFor;

In the AAA methodology, the algorithm is described as a directed graph, while the architecture is described as a non-directed graph. The AAA optimization is a series of heuristic-driven graph transformations, producing an optimized, scheduled implementation graph for the current architecture.

The notation of the AAA graph of the algorithm is described as follows[9]:

– a function is a general representation with no restriction: a function can contain dependences, references and ports;
– a sensor is a representation of a physical device producing data: it can only contain output ports;
– an actuator is a representation of a physical device consuming data: it can only contain input ports;
– a constant is a representation of a typed value: it can only contain one output port producing that value. For convenience, the value held by the constant can be given as a parameter to the constant definition;
– a delay is a representation of a memory region: it must contain one input port and one output port of the same type, but nothing more.

The notation of the AAA graph of the architecture is described as follows:

– square elements are operators: they perform operations on the input and intermediate data;
– round elements are medium locations, where the data is stored. There are 2 types of medium elements: Random Access Medium (RAM) and Sequential Access Medium (SAM). SAM has two subtypes: point-to-point and multipoint.

In our case we define the architecture on the basis of the optimized algorithm. We can therefore freely explore and implement the parallel sections of the AdaBoost algorithm in parallel hardware.

3. Optimization of the algorithm with AAA

The algorithm as presented here has a routine for applying the features to a sub-image. The feature in our case is represented by a set of representative points in a square mask of Xfeat pixels. The number of points in the set does not exceed a predefined quantity Pts. The routine consists in comparing the extremal values of those two sets in order to define the probability of obtaining a face in the sub-image.

Therefore we can notice that reducing the processing time of one sub-element is the way to improve our conceptual architecture. As described earlier, at the end of the processing of each classifier there is a conditional transfer of the sub-image to the next classifier. We naturally propose to introduce, after each feature, a verification of whether the sub-image can still proceed to the next classifier, so that the processing of a sub-image is stopped if it cannot pass the final threshold even when the weights of all subsequent features in the current classifier are added to the sub-image's current score.

That is:

if (Si,k + Srest < Threshi) then fetch the next sub-image,

where Srest is the sum of the weights of all subsequent features.

As the basic algorithm deals only with detection in the original-size image, we need an additional loop to create a set of increasingly small scales of the original image, in order to detect faces at various focal levels of the original image. The number of scales is:

Nscale = inf((min(ImX, ImY))/Xfeat); (3.1)

In order to have all the faces detected in real time, we have to process the whole image and all its scales in a stretch of time less than the threshold Tthresh required for the type and size of video used. In the worst case of the longest operational time, we would have an image composed of faces, each face of the same size as the feature mask. With this assumption we can establish the following formulas for the worst-case processing time:

Tpp = (ImX · ImY · FPS)/D; (3.2)

Tw = Σi=2..Nscale ((ImX · ImY · Tempty)/(i · Xfeat)²); (3.3)

Tthresh > Tface · (ImX · ImY)/Xfeat² + Tpp + Tw; (3.4)

Tpp is the preprocessing time needed to load all the images to process. D is the data channel rate; FPS is the number of image frames per second. Tw is the time needed to process all the non-face elements of all the scales of the source image except the first one, where faces are located. Tface is the average time to process the entire set of classifiers and to detect a face in a sub-image; Tempty is the average time needed to verify that there is no face in a sub-image.

Figure 2 illustrates the current algorithm in the AAA form.

As we can see from the above-mentioned graph, there is no data dependence between the images. The only data dependence present is the comparison, inside a classifier, between the sum of weights of the processed features and the classifier threshold. Knowing all those particular points, we can propose several schemes to reduce the detection time:

– apply the features in parallel, to reduce the individual sub-image processing time;
– apply the classifiers in parallel, to reduce the individual sub-image processing time;
– process several sub-images in parallel, to reduce the processing time of one image scale;
– process several image scales in parallel, to reduce the processing time of a single image;
– load the next image data in parallel with the processing, to maximize the effective operational time of the detection architecture;
– output the detected faces in parallel with the processing.

However, not all of this available algorithmic parallelism may be implemented in hardware, mainly due to the limitations of memory size and data path width.

4. Hardware optimizations for scalability

In order for its components to be reusable later, the architecture must use generic notions for the variable sizes of each component as often as possible. The architecture itself was designed to be implemented on
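As a software reading of this pseudocode, the detection loop can be transcribed into a short Python sketch. The feature evaluation is abstracted behind a hypothetical `apply_feature` helper, and the comparison of line 9 is simplified to a plain threshold test on the vote sum:

```python
# Software sketch of the detection pseudocode above (simplified: a
# sub-image stays a face candidate while each classifier's vote sum
# passes that classifier's threshold; apply_feature is a hypothetical
# helper standing in for line 6).

def detect(sub_images, classifiers, apply_feature):
    """classifiers: list of (features, threshold) pairs.
    Returns the indices k with Face_k = 1, as in lines 12-14."""
    face = [True] * len(sub_images)              # Face_k = 1 for all k
    for features, threshold in classifiers:      # classifiers loop (line 2)
        for k, img in enumerate(sub_images):     # sub-images loop (line 3)
            if face[k]:                          # line 4
                s = sum(apply_feature(f, img) for f in features)
                face[k] = s > threshold          # line 9, simplified
    return [k for k, ok in enumerate(face) if ok]
```

Note how a sub-image rejected by one classifier (Facek = 0) is skipped by all later classifiers, which is what makes the cascade cheap on non-face regions.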
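The early-exit test can be sketched as a guard inside the feature loop. This is an illustrative Python fragment under our own naming: Srest is precomputed per feature position as the suffix sum of the remaining weights, and `score_feature` is a hypothetical helper returning the feature's vote:

```python
# Sketch of the per-feature early exit: stop scoring a sub-image as soon
# as even winning all remaining votes cannot reach the threshold.
# weights[j] is the vote weight of feature j; score_feature is a
# hypothetical helper returning 0 or weights[j] for the sub-image.

def classifier_passes(sub_image, weights, threshold, score_feature):
    # s_rest[j] = sum of the weights of features j..end (suffix sums)
    s_rest = [0.0] * (len(weights) + 1)
    for j in range(len(weights) - 1, -1, -1):
        s_rest[j] = s_rest[j + 1] + weights[j]
    s = 0.0
    for j in range(len(weights)):
        if s + s_rest[j] < threshold:     # the early-exit test above
            return False                  # fetch the next sub-image
        s += score_feature(j, sub_image)
    return s > threshold
```

The suffix sums cost one pass per classifier and can be stored alongside the feature weights, so the guard itself is a single add-and-compare per feature.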
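To show the arithmetic of (3.1)-(3.4), here is a worked evaluation in Python. The parameter values below (mask size, channel rate, per-sub-image times) are hypothetical, chosen only to illustrate the formulas, and do not come from the paper's measurements:

```python
# Worked evaluation of formulas (3.1)-(3.4) with hypothetical parameters.
from math import floor

ImX, ImY = 640, 480       # image size (pixels)
Xfeat    = 20             # linear size of the feature mask (assumed)
FPS      = 30             # frames per second
D        = 100e6          # data channel rate, pixels/s (assumed)
Tempty   = 1e-6           # avg time to reject a non-face sub-image (assumed)
Tface    = 1e-5           # avg time to fully classify a sub-image (assumed)

Nscale = floor(min(ImX, ImY) / Xfeat)                       # (3.1)
Tpp = (ImX * ImY * FPS) / D                                 # (3.2)
Tw  = sum((ImX * ImY * Tempty) / (i * Xfeat) ** 2           # (3.3)
          for i in range(2, Nscale + 1))
Tthresh_min = Tface * (ImX * ImY) / Xfeat ** 2 + Tpp + Tw   # (3.4)

print(Nscale)  # 24
```

The first term of (3.4) dominates with these values: in the all-faces worst case, every sub-image of the full-resolution image pays the full classification time Tface.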
Figure 2: Graph of the AdaBoost algorithm. (Nodes include Function 1: application of the feature to a sub-image; Function 2: comparison of the sum with the threshold; Delay 1: sub-images; Delay 2: sum of weights; Delay 3: passed classifiers; plus INPUT and OUTPUT terminals.)

Figure 3: The architecture graph of the AdaBoost face detector created according to AAA. (It connects the main PE FIFOs, a buffer PE FIFO, and the feature memories RAM1 ... RAM N.)
an FPGA, so gate-level optimization of the architecture components was not always possible or suitable. The description was done in VHDL according to the adaptation of the architectural graph pictured above.

As can be seen from figure 3, there is a buffer memory storing several dozen sub-images for immediate processing. This is done because it is impossible to fit the memory containing the main image into an FPGA.

The first potential optimization concerning the memory is to create all the scales of the image in parallel and store them independently in external RAM. This is needed because, if we want to reload the memory as the new image arrives, we could load only the non-reduced image and generate all the needed scales later, as each of the current image's scales is processed.

The second optimization is to let the scale-generation and reload automata operate in parallel with the reload automaton of the internal memory.

This is also why the buffer memory must contain more sub-images than can be assigned to all processing elements in one clock cycle. This allows keeping a margin for continuous operation of the processing elements.

For the purpose of continuous operation and scalability, the buffer memory is composed of an array of fully shiftable 8-bit registers. Those registers can shift left and right by 8 bits, and also up ("far left") and down ("far right") by one memory row. The design was chosen to suit the snake reloading pattern, which saves 90% of the reload time for the buffer memory compared to a simple register array.

Figure 4: Processing element (PE) diagram. (The diagram shows the PE FSM, which controls the processing of a single sub-image; the conditional sum block, which processes 8 features simultaneously; the multisegment feature RAM, each segment containing a classifier; and the FIFO for face coordinates.)

The processing element is a generic structure: it is fully configurable, and its size varies depending on the linear size of the feature mask and on how many features we would like to process simultaneously.

Each feature contains from 40 to 140 bits of data, which need to be accessed and processed in parallel; because of that, the choice of the number of features that can be processed simultaneously by the processing element is the result of experiments with different processing element (PE) sizes.

It is unwise to implement wholly parallel classifiers in each PE, as classifiers vary greatly in size: the first classifier contains only 2 features, while the last one contains ~1600, so such a classifier alone would take up much of the FPGA surface. An architecture using a fully parallel classifier of this type would not adapt to different picture sizes and therefore would not be scalable. Following experimental results, it was chosen to implement processing elements handling 8 features.

Each processing element is connected to the buffer memory; it outputs the coordinates of the detected faces to a results FIFO and has its own independent feature RAM, which contains all the classifiers needed to detect a face.

At the end of processing, if there is a positive result, the coordinates of the face are recalculated and put into the corresponding FIFO.

The number of result-collecting FIFO queues is the same as the number of processing elements, to allow the processing elements to be reloaded immediately after the previous sub-image has been processed. The FIFO control logic operates a Round-Robin type automaton which gives priority to the full active queues, to prevent output buffer congestion and the resulting increase in PE idle time.

The scheduling is dynamic and done by a set of automata: each processing element contains one, so as to be able to detect a face in a sub-image independently of the others and to produce requests to the buffer memory as needed.

The decision to start or stop a processing element is taken by the buffer memory automaton in relation to its available sub-images and the need to process them.

The main difficulty in this scheduling technique is maintaining equality of load distribution, regardless of the varying number of processing elements and the buffer memory size, while avoiding the congestion which could result from such a configuration.

Congestion problems may arise due to the growing number of PEs and therefore the growing number of update requests to the main memory. Because of this we must calculate the optimal size of the buffer memory.
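The round-robin output arbitration described above, with priority given to full queues so that no PE stalls on its result FIFO, can be sketched as follows. This is our own illustration; the queue capacity and the tie-break policy are assumptions, not the paper's exact control logic:

```python
# Sketch of the result-FIFO arbiter: round-robin over the PE output
# queues, but a full queue is served first so its PE is never stalled.
from collections import deque

class ResultArbiter:
    def __init__(self, n_queues, capacity):
        self.queues = [deque() for _ in range(n_queues)]
        self.capacity = capacity
        self.next_q = 0                      # round-robin pointer

    def push(self, q, face_coord):
        self.queues[q].append(face_coord)

    def pop(self):
        """Drain one entry per cycle: full queues first, then round-robin."""
        for fifo in self.queues:
            if len(fifo) >= self.capacity:   # full queue gets priority
                return fifo.popleft()
        for step in range(len(self.queues)): # plain round-robin otherwise
            q = (self.next_q + step) % len(self.queues)
            if self.queues[q]:
                self.next_q = (q + 1) % len(self.queues)
                return self.queues[q].popleft()
        return None
```

In hardware this would be a small FSM plus a priority encoder over the FIFO full flags; the Python version only mirrors the scheduling decision.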
In figure 5, fully scalable components are shown in fine dashed contours; the components which change only partially are shown in a dashed line.

Figure 5: Resulting architecture diagram. (Main memory with Nscale scale segments; buffer memory of Hbuf x Lbuf pixels; an array of Nex PEs and Nex FIFOs; the main memory and processing automaton; and the output control Round-Robin type automaton.)

Congestion is prevented by the use of preemptive reloading and by keeping the buffer memory line size higher than the number of processing elements. The expression used to calculate the line size is given below:

Lbuf > Nex + Nex/2 · (Hbuf/Wch) · (Hbuf/Tempty); (4.1)

where Nex is the number of processing elements, Hbuf is the height of the buffer memory, and Wch is the width of the write data port.

If this proportion is maintained, the processing elements can operate at their maximal efficiency for each image line. Image resolutions from 160 by 120 pixels to 1920 by 1440 were tested without producing congestion.

As a result of the above-mentioned architectural optimizations, we have a scalable parallel architecture, which was subsequently subjected to evaluations to find out which configuration is sufficient to produce face detection in real time.

5. Experimental results

The simulation results of the actual architecture with 640 x 480 pixel 8 bpp gray-scale video – the maximal size of architecture which can fit on an FPGA – are shown in figure 6. The architecture was tested to process images from 160 x 120 to 1920 x 1440 and showed no sign of congestion or other malfunctions. After synthesis, and knowing the possible cycle time for this architecture, we have concluded that this particular design is capable of real-time processing.

Figure 6: Real-time capabilities for an image of 640 x 480 pixels. (The plot shows processing time in seconds versus the number of PEs, from 3 to 72, against the 12, 25 and 30 FPS real-time bounds.)

As we can see from figure 6, the results are satisfying for a PE number greater than 48, providing real-time performance for medium-sized images.

Figure 7 gives the magnitude of operational complexity after synthesis of the real-time architecture for different sizes of image. All the components, except those which did not need such capabilities, were made scalable. Each component has a generic base module, which is instantiated for simulation, and they have corresponding ready-for-synthesis models. However, not all of those models were optimized, either on the gate level or for a particular FPGA type, which explains the rapidly growing complexity.

The synthesis showed that the architecture for maximal sizes cannot fit the largest Xilinx FPGA, the XC5VLX200 model.

The reason for this is the fact that we do not track faces. Face tracking[6] would have spared much space on the FPGA, but would also introduce some additional complexity to the design; it is also not certain whether such a development would satisfy the time constraints for the detection part and achieve real-time performance as a whole.

6. Conclusion

The design technique proposed in this paper has proved itself, and we achieved real-time performance for a scalable architecture built using AAA and generic configurable components. However, this hardware implementation is too resource-consuming for big image sizes. It is possible that with further optimizations this architecture could be suitable for detecting faces in images of the 1080 HDTV format.

Moreover, if we choose to implement face tracking as a further development of this architecture, we could
Figure 7: The number of logic elements needed to implement a real-time architecture for a given image size. (The plot shows slices, flip-flops and LUTs, in thousands, for image sizes from 160x120 up to 1920x1440.)

produce less complex designs with the same performance.

Acknowledgments

This research was made possible by the kind help and cooperation of ENSMP. The simulation and synthesis environment, as well as the FPGA, was provided by Waseda University. This research was also supported by CREST, JST.

References

[1] Everingham M.R. and Zisserman A., "Automated person identification in video", 3rd Int. Conference on Image and Video Retrieval, 2004, vol. 1, pp. 289-298.
[2] Viola P. and Jones M., "Robust real-time object detection", Second Int. Workshop on Statistical and Computational Theories of Vision, 2001.
[3] Kaouane L., Sorel Y. et al., "Implantation optimisée sur circuit dédié d'algorithmes spécifiés sous la forme d'un Graphe Factorisé de Dépendances de Données : application aux traitements d'images", GRETSI 2003, 19ème Symposium on Signal and Image Processing, Paris, France, 2003, pp. 24-27.
[4] Quddus A., Fieguth P., Otman B., "Adaboost and Support Vector Machines for White Matter Lesion Segmentation in MR Images", Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 2005, pp. 463-466.
[5] Kaouane L., Sorel Y. et al., "From algorithm graph specification to automatic synthesis of FPGA circuit: a seamless flow of graph transformations", FPL 2003, Int. Conference on Field-Programmable Logic and Applications, Lisbon, Portugal, 2003, pp. 153-164.
[6] Burghardt T. and Calic J., "Analysing animal behaviour in wildlife videos using face detection and tracking", IEE Proceedings – Vision, Image and Signal Processing, vol. 153, no. 3, pp. 305-311, 2006.
[7] Kawato S. and Tetsutani N., "Scale Adaptive Face Detection and Tracking in Real Time with SSR Filter and Support Vector Machine", Proceedings of ACCV 2004, vol. 1, pp. 132-137, 2004.
[8] "Optimized Rapid Prototyping For Real Time Embedded Heterogeneous Multiprocessors", CODES'99, 7th International Workshop on Hardware/Software Co-Design, Rome, May 1999.
[9] For AAA applications and tools, see: http://www-rocq.inria.fr/syndex/pub.htm
