
Human behaviour analysis and interpretation based on the video modality: postures, facial expressions and head movements
A. Benoit, L. Bonnaud, A. Caplier, N. Eveno, V. Girondel, Zakia Hammal, M. Rombaut

Introduction

"Looking at people" domain: automatic analysis and interpretation of human actions (gestures, behaviour, expressions…).
This requires low-level information (a video analysis step answering "how are things?") and high-level interpretation (a data fusion step answering "what is happening?").

Applications

- multimodal interactions and interfaces (cf. the Similar NoE)
- mixed reality systems
- smart rooms
- smart surveillance systems (hypovigilance detection, detection of distress cases: elderly people surveillance, bus surveillance)
- e-learning assistance

Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?

System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

System overview
Indoor scene filmed by a static camera
Low level data extraction

Processing chain: video sequence → segmentation → temporal tracking → skin detection / face localization → static posture recognition.

Low level data: person segmentation

Adaptive background removal algorithm: consecutive frame differences + adaptive reference image.

A. Caplier, L. Bonnaud and J.-M. Chassery, "Robust fast extraction of video objects combining frame differences and reference image", in Proc. IEEE International Conference on Image Processing, pp. 785-788, September 2001.

Low-level data computed:
- rectangular bounding box (SRBB)
- principal axes box (SPAB)
- gravity center

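As an illustration only (this is not the exact published algorithm), here is a minimal sketch of combining consecutive frame differences with an adaptive reference image, plus extraction of the SRBB and gravity center from the resulting mask. The thresholds, the update rate and the function names are assumptions:

```python
import numpy as np

def segmentation_step(frame, prev_frame, reference, diff_thresh=25, alpha=0.05):
    """One step of a simplistic background-removal scheme: the mask
    combines the consecutive-frame difference with the difference to a
    reference image, and the reference adapts only where the scene is
    static.  Parameters are illustrative, not the published ones."""
    f = frame.astype(np.float32)
    motion = np.abs(f - prev_frame.astype(np.float32)) > diff_thresh
    foreground = np.abs(f - reference) > diff_thresh
    mask = motion | foreground
    # Adapt the reference image only where the scene appears static.
    reference = np.where(mask, reference, (1 - alpha) * reference + alpha * f)
    return mask, reference

def srbb_and_gravity_center(mask):
    """Rectangular bounding box and gravity center of a non-empty mask."""
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max()), (xs.mean(), ys.mean())
```
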
Low level data: skin detection

YCbCr color space: no conversion needed, luminance and chrominance are kept separate.
Skin databases: Von Luschan's, and one acquired with the camera.

Method: thresholding in the CbCr plane with initial thresholds
Cb ∈ [86, 140] and Cr ∈ [139, 175].
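A minimal sketch of this thresholding step, using the initial intervals above; the function name and the H×W×3 YCbCr array layout are assumptions:

```python
import numpy as np

def skin_mask(ycbcr, cb_range=(86, 140), cr_range=(139, 175)):
    """Pixel-wise skin detection by thresholding in the CbCr plane.
    The default intervals are the initial thresholds quoted above;
    they are later adapted around the detected skin's mean values."""
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))
```
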
Low level data: temporal tracking

Computation of the overlap between SRBBs of frames T-1 and T (forward and backward matching).

Low-level data computed:
- identification numbers (IDs)
- temporal split/merge information
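A sketch of the overlap-based matching idea (the box format and helper names are assumptions; real split/merge handling would also inspect boxes matched more than once):

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def match_boxes(prev_boxes, curr_boxes):
    """Forward/backward matching between two non-empty lists of SRBBs:
    each box is paired with the box of the other frame it overlaps most.
    Several previous boxes mapping to one current box signal a merge;
    the symmetric situation signals a split."""
    forward = {i: max(range(len(curr_boxes)),
                      key=lambda j: overlap_area(p, curr_boxes[j]))
               for i, p in enumerate(prev_boxes)}
    backward = {j: max(range(len(prev_boxes)),
                       key=lambda i: overlap_area(prev_boxes[i], c))
                for j, c in enumerate(curr_boxes)}
    return forward, backward
```
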
Low level data: face localization

Automatic threshold adaptation: translation and reduction of the detection intervals towards the Cb and Cr mean values.

Face and hands identification:
- sorted lists of skin patches
- criteria related to temporal tracking and human morphology

V. Girondel, A. Caplier, and L. Bonnaud, "Hands Detection and Tracking for Interactive Multimedia Applications", in Proc. International Conference on Computer Vision and Graphics, pp. 282-287, September 2002.

System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

Posture recognition: measures

Distance measurements D1, D2, D3, D4 (taken on the SRBB and SPAB).
Ideas: person height and shape compactness.

Reference posture: the "Da Vinci Vitruvian man" posture (standing, arms stretched horizontally), giving the reference values Diref.
Normalization: ri = Di / Diref.

Posture recognition: belief theory

Advantages:
- handles imprecise, conflicting data
- less computationally expensive than alternatives (HMMs, NNs…)

Universe: Ω = {Hi}, i = 1…n, with 2^n subsets A of Ω.
Hypotheses: the Hi are disjoint; if they are exhaustive, the universe is closed, otherwise it is open.

Considered postures:
standing (H1), sitting (H2), squatting (H3) and lying (H4);
one hypothesis added for unknown postures (H0).

Belief mass distribution m: confidence degree in each subset A, with
m: 2^Ω → [0, 1] and Σ_{A⊆Ω} m(A) = 1.

Posture recognition: measures evolution

[Plot: temporal evolution of the r1 measurement over the frames of a sequence]

Posture recognition: measures modeling

Belief mass distributions mri model the imprecision of the measurements over subsets of Ω.
The thresholds of the models are obtained by human expertise from statistics of the ri data.

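The actual models and thresholds come from expert analysis of the training statistics; as a generic stand-in, a piecewise-linear (trapezoidal) profile can map one normalized measurement ri to a mass distribution over subsets of Ω. The shape, the normalization and every name below are assumptions:

```python
def trapezoid(x, a, b, c, d):
    """Piecewise-linear profile: 0 outside [a, d], 1 on [b, c],
    linear ramps in between (assumes a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def mass_from_measurement(r, models):
    """Belief mass distribution m_ri built from one measurement r.
    `models` maps subsets of hypotheses (frozensets) to trapezoid
    thresholds; the raw values are normalized so masses sum to 1."""
    raw = {A: trapezoid(r, *abcd) for A, abcd in models.items()}
    total = sum(raw.values()) or 1.0
    return {A: v / total for A, v in raw.items() if v > 0}
```
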
Posture recognition: data fusion

Final belief mass distribution: mr1234 = mr1 ⊕ mr2 ⊕ mr3 ⊕ mr4.
Orthogonal sum of Dempster:
m(A) = Σ_{B∩C=A} m1(B)·m2(C)

Conflict: a non-null belief mass on the empty set indicates contradictory measurements.
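A direct transcription of the orthogonal sum, with subsets of hypotheses represented as frozensets; the unnormalized form is used so that the mass left on the empty set exposes the conflict, as on the slide:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Orthogonal sum of two mass distributions: the product mass of
    every pair of focal sets goes to their intersection; mass landing
    on the empty frozenset measures the conflict between the sources."""
    combined = {}
    for (A, vA), (B, vB) in product(m1.items(), m2.items()):
        combined[A & B] = combined.get(A & B, 0.0) + vA * vB
    return combined

# Fusing the four measurement distributions:
# m_r1234 = dempster_combine(dempster_combine(m_r1, m_r2),
#                            dempster_combine(m_r3, m_r4))
# conflict = m_r1234.get(frozenset(), 0.0)
```
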
System overview

1. Low level data extraction:
   - segmentation
   - temporal tracking
   - skin detection and face localization
2. Static posture recognition:
   - low-level data
   - belief theory: definitions, models, data fusion, decision
3. Results:
   - training and test sets
   - recognition results
   - video sequence

Results: implemented system and computing time

Implemented system:
- Sony DFW-VL500 camera
- YCbCr 4:2:0 format, 30 fps, 640x480 resolution
- low-end PC: 1.8 GHz
- unoptimized C++ code

Computing time breakdown:
- segmentation: 42%
- temporal tracking: 3%
- skin detection and face localization: 50%
- static posture recognition: 5%

Results are obtained at approximately 11 fps.

Results: databases

Training set:
- 6 video sequences, different persons of various heights
- 10 consecutive postures
- "normal" postures, facing the camera

Test set:
- 6 video sequences, other persons of various heights
- 7 different postures
- "free" postures, i.e. moving the arms, sitting sideways…

Results: recognition rates

Training set: mean recognition rate of 88.2%.
Test set: mean recognition rate of 80.8%.

Results: video example
Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?


Facial expressions analysis

1. Assumptions
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Assumptions

Facial expression recognition is based on the analysis of the deformations of the permanent facial features (lips, eyes and brows).
6 universal emotions: surprise, joy, disgust, anger, fear, sadness.

Is it possible to recognize the facial expression?

Facial expressions analysis

1. Assumptions and Applications
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial features extraction: models choice

- Open eye: a circle, a parabola and a Bezier curve
- Closed eye: a line between the eye corners P1 and P2
- Brow: a Bezier curve
- External lips: 4 cubic curves and 2 broken lines

The more complex the model, the more deformations it can represent.

Facial features extraction: models initialisation

Detection of characteristic points (eye corners, mouth corners…):
- luminance gradient information for the eyes and brows
- luminance and chrominance gradient information for the mouth

Facial features extraction: models deformations (1)

Maximisation of the flow of the luminance and/or chrominance gradient through the model contour, e.g. for the eye circle:

E = Σ_{p ∈ circle} ∇I(p) · n(p)

where ∇I(p) is the image gradient and n(p) the outward normal at contour point p.
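For the circular part of the eye model, the energy above can be sketched as a sum of gradient·normal dot products over sampled contour points (the nearest-pixel lookup and the sampling density are simplifications, and the function name is an assumption):

```python
import numpy as np

def circle_gradient_flow(image, cx, cy, radius, n_samples=64):
    """Flow of the luminance gradient through a circle: sum over the
    contour of the dot product between the image gradient and the
    outward normal n(p).  Maximizing this energy over (cx, cy, radius)
    fits the circle to the feature contour."""
    grad_y, grad_x = np.gradient(image.astype(float))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    nx, ny = np.cos(thetas), np.sin(thetas)          # outward normals
    xs = np.clip((cx + radius * nx).astype(int), 0, image.shape[1] - 1)
    ys = np.clip((cy + radius * ny).astype(int), 0, image.shape[0] - 1)
    return float(np.sum(grad_x[ys, xs] * nx + grad_y[ys, xs] * ny))
```
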
Facial features extraction: models deformations (2)

- Single control point displacement: gradient flow of luminance
- 2 or 3 control points displacement: gradient flow of luminance
- Mouth corners displacement: gradient flow of chrominance and luminance

Facial features extraction: some results

Segmentation examples illustrate the flexibility and the accuracy of the chosen models.

Facial expressions analysis

1. Assumptions and Applications
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial expressions recognition: recognition on facial skeletons

[Facial skeletons for: joy, surprise, sadness, disgust, fear, anger]


Facial expressions recognition: characteristic distances

Facial feature deformations are related to characteristic distances D1…D5 measured on the face.

Neutral expression => reference values Dni.

Joy:
- {open mouth} => D3 > Dn3 and D4 > Dn4
- {mouth corners pulled backwards} => D5 < Dn5
- {relaxed brows} => D2 unmodified

Surprise:
- {raised brows} => D2 > Dn2
- {wide open eyes} => D1 > Dn1
- {open mouth} => D3 < Dn3 and D4 > Dn4
Facial expressions recognition: distances discretisation

Each distance is discretised into 3 states (Dni is the distance for the neutral expression):
- S: stable
- C+: Di >> Dni
- C-: Di << Dni

Doubt modelling:
- SC+: Di > Dni (state S or C+)
- SC-: Di < Dni (state S or C-)

[Plots: D2 evolution during surprise; D5 evolution during joy]
Facial expressions recognition: basis of rules

Expression   D1      D2      D3      D4      D5
joy          C-      S/C-    C+      C+      C-
surprise     C+      C+      C-      C+      C+
disgust      C-      C-      S/C+    C+      S/C-
anger        C+      C-      S       S/C-    S
sadness      C-      C+      S       S       S
fear         S/C+    S/C+    S/C-    S/C+    S
neutral      S       S       S       S       S
Facial expressions recognition: evidence mass distribution and modelling

To each Di is associated a mass of evidence:
m_Di: 2^Ω → [0, 1], A ↦ m_Di(A)

The model of m_Di as a function of Di uses thresholds (a…h) related to each Di, estimated after a training step (analysis of the distances evolution for 4 facial expressions and 13 different persons).
Facial expressions recognition: method principle (1)

- Measurement of the distances Di and determination of their symbolic states
- Computation of the mass distribution for each Di state
- Using the basis of rules, computation of the evidence mass for each expression and each Di
- Combination of the evidence mass distributions, in order to take all the measures into account before making a decision
Facial expressions recognition: method principle (2)

Combination of the masses of evidence (orthogonal sum):
m = m_D1 ⊕ m_D2, with m(A) = Σ_{B∩C=A} m_D1(B)·m_D2(C)

where A, B and C are expressions or subsets of expressions.

Reject class (E8: unknown): an expression that differs from all the expressions described in the rules table.
Facial expressions recognition: method principle (3)
Facial expressions recognition: results (1)

[Example results on the Hammal-Caplier database and on the Cohn-Kanade database]
Facial expressions recognition: results (2)

neutral: 100%   unknown: 100%   joy: 100%

Only 3 frames separate neutral from joy => in between, the transitory expression is classified as unknown.
Facial expressions recognition: results (3)

Results on the Hammal-Caplier database (21 samples for each considered expression, 630 images):

System \ Expression   joy      surprise   disgust   neutral
E1 joy                76.36%   6.06%      9.48%     0
E2 surprise           0        12%        0         0
E3 disgust            0        0          43.10%    0
E7 neutral            6.66%    0.78%      15.51%    88%
E8 unknown            6.06%    11.08%     12.06%    0
E1 ∪ E3               10.90%   0          8.62%     0
E2 ∪ E6               0        72.44%     0         0
other                 0.02%    2.08%      11.32%    12%
Total                 87.26%   84.44%     51.72%    88%


Facial expressions analysis

1. Assumptions
2. Facial features segmentation: low level data
3. Facial expressions recognition: high level interpretation
4. Facial expression recognition based on audio: a step towards a multimodal system

Facial expressions analysis based on audio (collaboration with Mons University)

Idea: characterization of expressions in the speech signal => use of statistical speech features such as speech rate, SPI, energy and pitch.

Problem: the expression classes are different. After a preliminary study, 2 classes, active (joy, surprise, anger) and passive (neutral, sadness), are suitable for speech.

Perspectives: definition of a multimodal system for facial expression recognition.

Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?



Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Introduction

Idea: global head motions such as nods, and local facial motions such as blinking, are involved in the human-to-human communication process.
Aim: automatic analysis and interpretation of such "gestures".

Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Head motion estimation: biological modelling
Algorithm overview: human visual system modelling
Head motion estimation: retina filtering

OPL stage: spatio-temporal filtering
- contour enhancement
- noise attenuation
- removal of illumination variations

IPL stage: temporal high-pass filtering dedicated to moving stimuli
- extraction of moving contours (perpendicular to the motion direction)
- removal of static contours
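The real retina model is much richer than this, but as a crude stand-in the OPL stage can be approximated by a difference of Gaussians and the IPL stage by a first-order temporal high-pass. The sigmas, the gain and the class name are all assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

class RetinaSketch:
    """Very rough stand-in for the retina filter: difference of
    Gaussians (OPL: contour enhancement, illumination removal) followed
    by a frame-to-frame difference (IPL: moving contours only)."""
    def __init__(self, sigma_center=1.0, sigma_surround=4.0, gain=1.0):
        self.sc, self.ss, self.gain = sigma_center, sigma_surround, gain
        self.prev_opl = None

    def step(self, frame):
        f = frame.astype(float)
        # OPL: band-pass in space, keeps contours, drops slow illumination.
        opl = gaussian_filter(f, self.sc) - gaussian_filter(f, self.ss)
        # IPL: temporal high-pass, keeps only moving contours.
        ipl = np.zeros_like(opl) if self.prev_opl is None \
            else self.gain * (opl - self.prev_opl)
        self.prev_opl = opl
        return ipl
```
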
Head motion estimation: log-polar spectrum computation

Computation of the spectrum of the retina-filtered image in the log-polar domain => the spectrum becomes easier to analyse:
- roll and zoom = global translations of the energy spectrum
- pan and tilt = local translations of the energy spectrum
- translations = no change in the energy spectrum from frame to frame
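A minimal way to resample the amplitude spectrum on a log-polar grid (nearest-neighbour lookup; the grid sizes are illustrative, and the function name is an assumption), from which the energy-per-orientation curve used on the next slides is a row-wise sum:

```python
import numpy as np

def log_polar_spectrum(filtered, n_theta=180, n_rho=64):
    """Amplitude spectrum of the retina-filtered image resampled on a
    log-polar grid: rows index orientation, columns log-frequency."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(filtered)))
    cy, cx = spec.shape[0] / 2.0, spec.shape[1] / 2.0
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.exp(np.linspace(0.0, np.log(min(cx, cy)), n_rho))
    tt, rr = np.meshgrid(thetas, rhos, indexing="ij")
    ys = np.clip((cy + rr * np.sin(tt)).astype(int), 0, spec.shape[0] - 1)
    xs = np.clip((cx + rr * np.cos(tt)).astype(int), 0, spec.shape[1] - 1)
    return spec[ys, xs]

# Cumulated energy per orientation; the abscissa of its maxima
# indicates the motion direction:
# energy_per_orientation = log_polar_spectrum(filtered).sum(axis=1)
```
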
Head motion estimation: log-polar spectrum interpretation (1)

The energy maxima lie on the contours perpendicular to the motion direction => cumulated curve of energy per orientation:
- abscissa of the maxima = motion direction
- temporal evolution of the abscissa = motion type
- amplitude of a maximum proportional to the motion amplitude => the energy decreases or vanishes when the motion stops
Head motion estimation: log-polar spectrum interpretation (2)
Head motion estimation: log-polar spectrum interpretation (3)

Each minimum of energy is related to a motion stop.
Head motion estimation: log-polar spectrum interpretation (4)

To summarize, properties of the retina-filtered energy spectrum in the log-polar domain:
- energy maxima are associated with the contours perpendicular to the motion direction
- no motion = no energy
- orientation of the energy maxima = motion directions
- "movements" of the energy maxima = motion type
Head motion interpretation

1. Introduction
2. Head motion estimation: biological modelling
3. Examples of head motion interpretation

Head nods of approbation or negation (1)

Goal: recognition of head nods of approbation and negation (e.g. signalling "I am still with you").

Idea: detection of periodic head motions:
- approbation: periodic head tilting
- negation: periodic head panning

Approach: apply the biological head motion detector to the face bounding box and monitor all the head movements; a minimal periodicity test is sketched below.
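One plausible periodicity test (not necessarily the authors' detector): look for a dominant spectral peak of the per-frame motion-energy signal inside a nodding-frequency band. The band, the dominance ratio and the signal definition are assumptions:

```python
import numpy as np

def is_periodic_nod(motion_energy, fps=30.0, band=(0.5, 4.0), ratio=0.4):
    """True if the motion-energy signal (e.g. vertical-motion energy
    for tilts, horizontal for pans) has a dominant peak inside the
    assumed nodding band.  Assumes a window long enough for the band
    to contain at least one FFT bin."""
    x = np.asarray(motion_energy, dtype=float)
    x -= x.mean()                                  # remove DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return bool(power[in_band].max() > ratio * power[1:].sum())
```
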
Head nods of approbation or negation (2)
Blinking detection

Blink: vertical motion of the eyelids.
Approach: apply a biological motion detector to a bounding box around the eyes.
Yawning detection

Yawn: vertical motion of the mouth.
Approach: apply a biological motion detector to a bounding box around the mouth.
Hypovigilance detection

Hypovigilance signs: short or long eye closures, multiple head rotations, frequent yawning.
Approach: combine the information coming from the 3 biological motion detectors.
Outline

1. Global posture recognition
2. Facial expressions recognition
3. Head motion analysis and interpretation
4. Conclusion

Which expression? Which head motion? Which posture?

Conclusion

- Human activity analysis and recognition based on video and image data
- Unified approaches: extraction of low-level data and a fusion process for high-level semantic interpretation
- Correlation with applications; example: project 4 of the Enterface Workshop, on detecting the attention level of a driver