
Human behaviour analysis and interpretation based on the video modality: postures, facial expressions and head movements
A. Benoit, L. Bonnaud, A. Caplier, N. Eveno, V. Girondel, Zakia Hammal, M. Rombaut

Introduction

"Looking at people" domain: automatic analysis and interpretation of human actions (gestures, behaviour, expressions…).
This requires low-level information (a video analysis step answering "how are things?") and high-level interpretation (a data fusion step answering "what is happening?").

Applications

- multimodal interactions and interfaces (cf. the Similar NoE)
- mixed reality systems
- smart rooms
- smart surveillance systems (hypovigilance detection, detection of distress cases: elderly people surveillance, bus surveillance)
- e-learning assistance

Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?

System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

System overview
Indoor scene filmed by a static camera
Low level data extraction

Processing chain: video sequence → segmentation → temporal tracking → skin detection / face localization → static posture recognition.

Low level data: person segmentation

Adaptive background removal algorithm: consecutive frame differences + adaptive reference image.

A. Caplier, L. Bonnaud and J.-M. Chassery, "Robust fast extraction of video objects combining frame differences and reference image", in Proc. IEEE International Conference on Image Processing, pp. 785-788, September 2001.

Low-level data computed:
- rectangular bounding box (SRBB)
- principal axes box (SPAB)
- gravity center

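As an illustration only (this is not the exact published algorithm), here is a minimal sketch of combining consecutive frame differences with an adaptive reference image, plus extraction of the SRBB and gravity center from the resulting mask. The thresholds, the update rate and the function names are assumptions:

```python
import numpy as np

def segmentation_step(frame, prev_frame, reference, diff_thresh=25, alpha=0.05):
    """One step of a simplistic background-removal scheme: the mask
    combines the consecutive-frame difference with the difference to a
    reference image, and the reference adapts only where the scene is
    static.  Parameters are illustrative, not the published ones."""
    f = frame.astype(np.float32)
    motion = np.abs(f - prev_frame.astype(np.float32)) > diff_thresh
    foreground = np.abs(f - reference) > diff_thresh
    mask = motion | foreground
    # Adapt the reference image only where the scene appears static.
    reference = np.where(mask, reference, (1 - alpha) * reference + alpha * f)
    return mask, reference

def srbb_and_gravity_center(mask):
    """Rectangular bounding box and gravity center of a non-empty mask."""
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max()), (xs.mean(), ys.mean())
```
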
Low level data: skin detection

YCbCr color space: no conversion needed, luminance and chrominance are kept separate.
Skin databases: Von Luschan's, and one acquired with the camera.

Method: thresholding in the CbCr plane with initial thresholds
Cb ∈ [86, 140] and Cr ∈ [139, 175].
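A minimal sketch of this thresholding step, using the initial intervals above; the function name and the H×W×3 YCbCr array layout are assumptions:

```python
import numpy as np

def skin_mask(ycbcr, cb_range=(86, 140), cr_range=(139, 175)):
    """Pixel-wise skin detection by thresholding in the CbCr plane.
    The default intervals are the initial thresholds quoted above;
    they are later adapted around the detected skin's mean values."""
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))
```
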
Low level data: temporal tracking

Computation of the overlap between SRBBs of frames T-1 and T (forward and backward matching).

Low-level data computed:
- identification numbers (IDs)
- temporal split/merge information
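A sketch of the overlap-based matching idea (the box format and helper names are assumptions; real split/merge handling would also inspect boxes matched more than once):

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def match_boxes(prev_boxes, curr_boxes):
    """Forward/backward matching between two non-empty lists of SRBBs:
    each box is paired with the box of the other frame it overlaps most.
    Several previous boxes mapping to one current box signal a merge;
    the symmetric situation signals a split."""
    forward = {i: max(range(len(curr_boxes)),
                      key=lambda j: overlap_area(p, curr_boxes[j]))
               for i, p in enumerate(prev_boxes)}
    backward = {j: max(range(len(prev_boxes)),
                       key=lambda i: overlap_area(prev_boxes[i], c))
                for j, c in enumerate(curr_boxes)}
    return forward, backward
```
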
Low level data: face localization

Automatic threshold adaptation: translation and reduction of the detection intervals towards the Cb and Cr mean values.

Face and hands identification:
- sorted lists of skin patches
- criteria related to temporal tracking and human morphology

V. Girondel, A. Caplier, and L. Bonnaud, "Hands Detection and Tracking for Interactive Multimedia Applications", in Proc. International Conference on Computer Vision and Graphics, pp. 282-287, September 2002.

System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

Posture recognition: measures

Distance measurements D1, D2, D3, D4 (taken on the SRBB and SPAB).
Ideas: person height and shape compactness.

Reference posture: the "Da Vinci Vitruvian man" posture (standing, arms stretched horizontally), giving the reference values Diref.
Normalization: ri = Di / Diref.

Posture recognition: belief theory

Advantages:
- handles imprecise, conflicting data
- less computationally expensive than alternatives (HMMs, NNs…)

Universe: Ω = {Hi}, i = 1…n, with 2^n subsets A of Ω.
Hypotheses: the Hi are disjoint; if they are exhaustive, the universe is closed, otherwise it is open.

Considered postures:
standing (H1), sitting (H2), squatting (H3) and lying (H4);
one hypothesis added for unknown postures (H0).

Belief mass distribution m: confidence degree in each subset A, with
m: 2^Ω → [0, 1] and Σ_{A⊆Ω} m(A) = 1.

Posture recognition: measures evolution

[Plot: temporal evolution of the r1 measurement over the frames of a sequence]

Posture recognition: measures modeling

Belief mass distributions mri model the imprecision of the measurements over subsets of Ω.
The thresholds of the models are obtained by human expertise from statistics of the ri data.

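The actual models and thresholds come from expert analysis of the training statistics; as a generic stand-in, a piecewise-linear (trapezoidal) profile can map one normalized measurement ri to a mass distribution over subsets of Ω. The shape, the normalization and every name below are assumptions:

```python
def trapezoid(x, a, b, c, d):
    """Piecewise-linear profile: 0 outside [a, d], 1 on [b, c],
    linear ramps in between (assumes a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def mass_from_measurement(r, models):
    """Belief mass distribution m_ri built from one measurement r.
    `models` maps subsets of hypotheses (frozensets) to trapezoid
    thresholds; the raw values are normalized so masses sum to 1."""
    raw = {A: trapezoid(r, *abcd) for A, abcd in models.items()}
    total = sum(raw.values()) or 1.0
    return {A: v / total for A, v in raw.items() if v > 0}
```
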
Posture recognition: data fusion

Final belief mass distribution: mr1234 = mr1 ⊕ mr2 ⊕ mr3 ⊕ mr4.
Orthogonal sum of Dempster:
m(A) = Σ_{B∩C=A} m1(B)·m2(C)

Conflict: a non-null belief mass on the empty set indicates contradictory measurements.
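A direct transcription of the orthogonal sum, with subsets of hypotheses represented as frozensets; the unnormalized form is used so that the mass left on the empty set exposes the conflict, as on the slide:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Orthogonal sum of two mass distributions: the product mass of
    every pair of focal sets goes to their intersection; mass landing
    on the empty frozenset measures the conflict between the sources."""
    combined = {}
    for (A, vA), (B, vB) in product(m1.items(), m2.items()):
        combined[A & B] = combined.get(A & B, 0.0) + vA * vB
    return combined

# Fusing the four measurement distributions:
# m_r1234 = dempster_combine(dempster_combine(m_r1, m_r2),
#                            dempster_combine(m_r3, m_r4))
# conflict = m_r1234.get(frozenset(), 0.0)
```
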
System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

Results: implemented system and computing time

Implemented system:
- Sony DFW-VL500 camera
- YCbCr 4:2:0 format, 30 fps, 640x480 resolution
- low-end PC: 1.8 GHz
- unoptimized C++ code

Computing time breakdown:
- segmentation: 42%
- temporal tracking: 3%
- skin detection and face localization: 50%
- static posture recognition: 5%

Results are obtained at approximately 11 fps.

Results: databases

Training set:
- 6 video sequences, different persons of various heights
- 10 consecutive postures
- "normal" postures, facing the camera

Test set:
- 6 video sequences, other persons of various heights
- 7 different postures
- "free" postures, i.e. moving the arms, sitting sideways…

Results: recognition rates

Training set: mean recognition rate of 88.2%.
Test set: mean recognition rate of 80.8%.

Results: video example
Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?


Facial expressions analysis

1. Assumptions
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Assumptions

Facial expression recognition is based on the analysis of the deformations of the permanent facial features (lips, eyes and brows).
6 universal emotions: surprise, joy, disgust, anger, fear, sadness.

Is it possible to recognize the facial expression?

Facial expressions analysis

1. Assumptions and Applications
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial features extraction: models choice

- Open eye: a circle, a parabola and a Bezier curve
- Closed eye: a line between the eye corners P1 and P2
- Brow: a Bezier curve
- External lips: 4 cubic curves and 2 broken lines

The more complex the model, the more deformations it can represent.

Facial features extraction: models initialisation

Detection of characteristic points (eye corners, mouth corners…):
- luminance gradient information for the eyes and brows
- luminance and chrominance gradient information for the mouth

Facial features extraction: models deformations (1)

Maximisation of the flow of the luminance and/or chrominance gradient through the model contour, e.g. for the eye circle:

E = Σ_{p ∈ circle} ∇I(p) · n(p)

where ∇I(p) is the image gradient and n(p) the outward normal at contour point p.
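For the circular part of the eye model, the energy above can be sketched as a sum of gradient·normal dot products over sampled contour points (the nearest-pixel lookup and the sampling density are simplifications, and the function name is an assumption):

```python
import numpy as np

def circle_gradient_flow(image, cx, cy, radius, n_samples=64):
    """Flow of the luminance gradient through a circle: sum over the
    contour of the dot product between the image gradient and the
    outward normal n(p).  Maximizing this energy over (cx, cy, radius)
    fits the circle to the feature contour."""
    grad_y, grad_x = np.gradient(image.astype(float))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    nx, ny = np.cos(thetas), np.sin(thetas)          # outward normals
    xs = np.clip((cx + radius * nx).astype(int), 0, image.shape[1] - 1)
    ys = np.clip((cy + radius * ny).astype(int), 0, image.shape[0] - 1)
    return float(np.sum(grad_x[ys, xs] * nx + grad_y[ys, xs] * ny))
```
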
Facial features extraction: models deformations (2)

- Single control point displacement: gradient flow of luminance
- 2 or 3 control points displacement: gradient flow of luminance
- Mouth corners displacement: gradient flow of chrominance and luminance

Facial features extraction: some results

Segmentation examples illustrate the flexibility and the accuracy of the chosen models.

Facial expressions analysis

1. Assumptions and Applications
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial expressions recognition: recognition on facial skeletons

[Facial skeletons for: joy, surprise, sadness, disgust, fear, anger]


Facial expressions recognition: characteristic distances

Facial feature deformations are related to characteristic distances D1…D5 measured on the face.

Neutral expression => reference values Dni.

Joy:
- {open mouth} => D3 > Dn3 and D4 > Dn4
- {mouth corners pulled backwards} => D5 < Dn5
- {relaxed brows} => D2 unmodified

Surprise:
- {raised brows} => D2 > Dn2
- {wide open eyes} => D1 > Dn1
- {open mouth} => D3 < Dn3 and D4 > Dn4
Facial expressions recognition: distances discretisation

Each distance is discretised into 3 states (Dni is the distance for the neutral expression):
- S: stable
- C+: Di >> Dni
- C-: Di << Dni

Doubt modelling:
- SC+: Di > Dni (state S or C+)
- SC-: Di < Dni (state S or C-)

[Plots: D2 evolution during surprise; D5 evolution during joy]
Facial expressions recognition: basis of rules

Expression   D1      D2      D3      D4      D5
joy          C-      S/C-    C+      C+      C-
surprise     C+      C+      C-      C+      C+
disgust      C-      C-      S/C+    C+      S/C-
anger        C+      C-      S       S/C-    S
sadness      C-      C+      S       S       S
fear         S/C+    S/C+    S/C-    S/C+    S
neutral      S       S       S       S       S
Facial expressions recognition: evidence mass distribution and modelling

To each Di is associated a mass of evidence:
m_Di: 2^Ω → [0, 1], A ↦ m_Di(A)

The model of m_Di as a function of Di uses thresholds (a…h) related to each Di, estimated after a training step (analysis of the distances evolution for 4 facial expressions and 13 different persons).
Facial expressions recognition: method principle (1)

- Measurement of the distances Di and determination of their symbolic states
- Computation of the mass distribution for each Di state
- Using the basis of rules, computation of the evidence mass for each expression and each Di
- Combination of the evidence mass distributions, in order to take all the measures into account before making a decision
Facial expressions recognition: method principle (2)

Combination of the masses of evidence (orthogonal sum):
m = m_D1 ⊕ m_D2, with m(A) = Σ_{B∩C=A} m_D1(B)·m_D2(C)

where A, B and C are expressions or subsets of expressions.

Reject class (E8: unknown): an expression that differs from all the expressions described in the rules table.
Facial expressions recognition: method principle (3)
Facial expressions recognition: results (1)

[Example results on the Hammal-Caplier database and on the Cohn-Kanade database]
Facial expressions recognition: results (2)

neutral: 100%   unknown: 100%   joy: 100%

Only 3 frames separate neutral from joy => in between, the transitory expression is classified as unknown.
Facial expressions recognition: results (3)

Results on the Hammal-Caplier database (21 samples for each considered expression, 630 images):

System \ Expression   joy      surprise   disgust   neutral
E1 joy                76.36%   6.06%      9.48%     0
E2 surprise           0        12%        0         0
E3 disgust            0        0          43.10%    0
E7 neutral            6.66%    0.78%      15.51%    88%
E8 unknown            6.06%    11.08%     12.06%    0
E1 ∪ E3               10.90%   0          8.62%     0
E2 ∪ E6               0        72.44%     0         0
other                 0.02%    2.08%      11.32%    12%
Total                 87.26%   84.44%     51.72%    88%


Facial expressions analysis

1. Assumptions
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial expressions analysis based on audio (collaboration with Mons University)

Idea: characterization of expressions in the speech signal => use of statistical speech features such as speech rate, SPI, energy and pitch.

Problem: the expression classes are different. After a preliminary study, 2 classes, active (joy, surprise, anger) and passive (neutral, sadness), are suitable for speech.

Perspectives: definition of a multimodal system for facial expression recognition.

Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?



Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Introduction

Idea: global head motions such as nods, and local facial motions such as blinking, are involved in the human-to-human communication process.
Aim: automatic analysis and interpretation of such "gestures".

Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Head motion estimation: biological modelling
Algorithm overview: human visual system modelling
Head motion estimation: retina filtering

OPL stage: spatio-temporal filtering
- contour enhancement
- noise attenuation
- removal of illumination variations

IPL stage: temporal high-pass filtering dedicated to moving stimuli
- extraction of moving contours (perpendicular to the motion direction)
- removal of static contours
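The real retina model is much richer than this, but as a crude stand-in the OPL stage can be approximated by a difference of Gaussians and the IPL stage by a first-order temporal high-pass. The sigmas, the gain and the class name are all assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

class RetinaSketch:
    """Very rough stand-in for the retina filter: difference of
    Gaussians (OPL: contour enhancement, illumination removal) followed
    by a frame-to-frame difference (IPL: moving contours only)."""
    def __init__(self, sigma_center=1.0, sigma_surround=4.0, gain=1.0):
        self.sc, self.ss, self.gain = sigma_center, sigma_surround, gain
        self.prev_opl = None

    def step(self, frame):
        f = frame.astype(float)
        # OPL: band-pass in space, keeps contours, drops slow illumination.
        opl = gaussian_filter(f, self.sc) - gaussian_filter(f, self.ss)
        # IPL: temporal high-pass, keeps only moving contours.
        ipl = np.zeros_like(opl) if self.prev_opl is None \
            else self.gain * (opl - self.prev_opl)
        self.prev_opl = opl
        return ipl
```
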
Head motion estimation: log-polar spectrum computation

Computation of the spectrum of the retina-filtered image in the log-polar domain => the spectrum becomes easier to analyse:
- roll and zoom = global translations of the energy spectrum
- pan and tilt = local translations of the energy spectrum
- translations = no change in the energy spectrum from frame to frame
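A minimal way to resample the amplitude spectrum on a log-polar grid (nearest-neighbour lookup; the grid sizes are illustrative, and the function name is an assumption), from which the energy-per-orientation curve used on the next slides is a row-wise sum:

```python
import numpy as np

def log_polar_spectrum(filtered, n_theta=180, n_rho=64):
    """Amplitude spectrum of the retina-filtered image resampled on a
    log-polar grid: rows index orientation, columns log-frequency."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(filtered)))
    cy, cx = spec.shape[0] / 2.0, spec.shape[1] / 2.0
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.exp(np.linspace(0.0, np.log(min(cx, cy)), n_rho))
    tt, rr = np.meshgrid(thetas, rhos, indexing="ij")
    ys = np.clip((cy + rr * np.sin(tt)).astype(int), 0, spec.shape[0] - 1)
    xs = np.clip((cx + rr * np.cos(tt)).astype(int), 0, spec.shape[1] - 1)
    return spec[ys, xs]

# Cumulated energy per orientation; the abscissa of its maxima
# indicates the motion direction:
# energy_per_orientation = log_polar_spectrum(filtered).sum(axis=1)
```
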
Head motion estimation: log-polar spectrum interpretation (1)

The energy maxima lie on the contours perpendicular to the motion direction => cumulated curve of energy per orientation:
- abscissa of the maxima = motion direction
- temporal evolution of the abscissa = motion type
- amplitude of a maximum proportional to the motion amplitude => the energy decreases or vanishes when the motion stops
Head motion estimation: log-polar spectrum interpretation (2)
Head motion estimation: log-polar spectrum interpretation (3)

Each minimum of energy is related to a motion stop.
Head motion estimation: log-polar spectrum interpretation (4)

To summarize, properties of the retina-filtered energy spectrum in the log-polar domain:
- energy maxima are associated with the contours perpendicular to the motion direction
- no motion = no energy
- orientation of the energy maxima = motion directions
- "movements" of the energy maxima = motion type
Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Head nods of approbation or negation (1)

Goal: recognition of head nods of approbation and negation (e.g. signalling "I am still with you").

Idea: detection of periodic head motions:
- approbation: periodic head tilting
- negation: periodic head panning

Approach: apply the biological head motion detector to the face bounding box and monitor all the head movements; a minimal periodicity test is sketched below.
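One plausible periodicity test (not necessarily the authors' detector): look for a dominant spectral peak of the per-frame motion-energy signal inside a nodding-frequency band. The band, the dominance ratio and the signal definition are assumptions:

```python
import numpy as np

def is_periodic_nod(motion_energy, fps=30.0, band=(0.5, 4.0), ratio=0.4):
    """True if the motion-energy signal (e.g. vertical-motion energy
    for tilts, horizontal for pans) has a dominant peak inside the
    assumed nodding band.  Assumes a window long enough for the band
    to contain at least one FFT bin."""
    x = np.asarray(motion_energy, dtype=float)
    x -= x.mean()                                  # remove DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return bool(power[in_band].max() > ratio * power[1:].sum())
```
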
Head nods of approbation or negation (2)
Blinking detection

Blink: vertical motion of the eyelids.
Approach: apply a biological motion detector to a bounding box around the eyes.
Yawning detection

Yawn: vertical motion of the mouth.
Approach: apply a biological motion detector to a bounding box around the mouth.
Hypovigilance detection

Hypovigilance signs: short or long eye closures, multiple head rotations, frequent yawning.
Approach: combine the information coming from the 3 biological motion detectors.
Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?

Conclusion

- Human activity analysis and recognition based on video and image data
- Unified approaches: extraction of low-level data and a fusion process for high-level semantic interpretation
- Correlation with applications; example: project 4 of the Enterface Workshop, on detecting the attention level of a driver