…describes our experimental methodology and shows our comparative results.

… [17, 24, 15, 21] and some hybrid approaches [13, 2]. Static pose analysis is inherently immune to accuracy drift, but it also ignores highly useful temporal information that could improve estimation accuracy.

Figure 1. Generalized Adaptive View-based Appearance Model (GAVAM). The pose of the current frame x_t is estimated using pose-change measurements from three paradigms: differential tracking y_{t-1}^t, keyframe tracking y_{k_2}^t and static pose estimation y_{k_0}^t. During the same pose update process (described in Section 3.3), the poses {x_{k_1}, x_{k_2}, ...} of keyframes acquired online are automatically adapted. The star (*) depicts the referential keyframe (I_{k_0}, x_{k_0}) set at the origin.

Very accurate shape models are possible using the Active Appearance Model (AAM) methodology [8], such as was applied to 3D head data in [5]. However, tracking 3D AAMs with monocular intensity images is currently a time-consuming process, and it requires that the trained model be general enough to include the class of the user being tracked.

Early work in the dynamic paradigm assumed simple shape models (e.g., planar [4], cylindrical [20], or ellipsoidal [3]). Tracking can also be performed with a 3D face texture mesh [25] or a 3D face feature mesh [31]. Some recent work has looked at morphable models rather than rigid models [6, 7, 27]. Differential registration algorithms are known for user independence and high precision for short-time-scale estimates of pose change, but they are typically susceptible to long-time-scale accuracy drift due to accumulated uncertainty over time.

Some earlier work in the static user-dependent paradigm includes nearest-neighbor prototype methods [32, 13] and template-based approaches [19]. Vacchetti et al. suggested a method to merge online and offline keyframes for stable 3D tracking [28]. These approaches are more accurate and suffer only bounded drift over time, but they lack the relative precision of dynamic approaches.

Several previous pose detection algorithms combine both tracking and static pose analysis. Huang and Trivedi [18] combine a subspace method of static pose estimation with head tracking. Their static pose detector uses a continuous-density HMM to predict the pose parameters; these are filtered with a Kalman filter and then passed back to the head tracker to improve face detection in the next video frame. Sherrah and Gong [26] detect head position and pose jointly using conditional density propagation, with the combined pose and position vectors as the state. To the best of our knowledge, no previous pose detection work has combined the three paradigms of dynamic tracking, keyframe tracking, and static pose detection into one algorithm.

Morency et al. [23] presented the Adaptive View-based Appearance Model (AVAM) for head tracking from stereo images, which integrates two paradigms: differential (dynamic) and keyframe (user-dependent) tracking. GAVAM generalizes the AVAM approach by integrating all three paradigms and operating on intensity images from a single monocular camera. This generalization faced three difficult challenges:

• integrating the static user-independent paradigm into the probabilistic framework (see Section 3);

• segmenting the face and selecting the base frame set without any depth information, using a multiple-face-hypotheses approach (described in Section 3.1);

• computing accurate pose-change estimates between two frames from intensity images alone, using the iterative Normal Flow Constraint (described in Section 4.1).

GAVAM also includes new functionality such as keyframe management and a 4D pose tessellation space for keyframe acquisition (see Section 3.4 for details). The following two sections formally describe this generalization.
3. Generalized Adaptive View-based Appearance Model

The two main components of the Generalized Adaptive View-based Appearance Model (GAVAM) are the view-based appearance model M, which is acquired and adapted over time, and the series of pose-change measurements Y estimated every time a new frame is grabbed. Figure 1 shows an overview of our GAVAM framework. Algorithm 1 presents a high-level overview of the main steps of head pose estimation using GAVAM.

Algorithm 1 Tracking with a Generalized Adaptive View-based Appearance Model (GAVAM).
for each new frame (I_t) do
  Base frame set selection: Select the n_b keyframes most similar to the current frame and add them to the base frame set. Always include the previous frame (I_{t-1}, x_{t-1}) and the referential keyframe (I_{K_0}, x_{K_0}) in the base frame set (see Section 3.1);
  Pose-change measurements: For each base frame, compute the relative transformation y_s^t, and its covariance Λ_{y_s^t}, between the current frame and the base frame (see Sections 3.2 and 4 for details);
  Model adaptation and pose estimation: Simultaneously update the poses of all keyframes and compute the current pose x_t by solving Equations 1 and 2, given the pose-change measurements {y_s^t, Λ_{y_s^t}} (see Section 3.3);
  Online keyframe acquisition and management: Ensure a constant tessellation of the pose space in the view-based model by adding the new frame (I_t, x_t) as a keyframe if it differs from every other view in M, and by removing redundant keyframes after the model adaptation (see Section 3.4).
end for
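For concreteness, the loop of Algorithm 1 can be outlined in code. The sketch below is our own illustration, not the authors' implementation: the four helper functions are stubs standing in for the steps detailed in Sections 3.1 to 3.4, and the default n_b is an arbitrary placeholder.

```python
def select_base_frames(model, frame, n_b):            # Section 3.1 (stub)
    raise NotImplementedError

def measure_pose_changes(model, frame, base_frames):  # Sections 3.2 and 4 (stub)
    raise NotImplementedError

def adapt_model(model, measurements):                 # Section 3.3 (stub)
    raise NotImplementedError

def manage_keyframes(model, frame, pose):             # Section 3.4 (stub)
    raise NotImplementedError

def track(model, frames, n_b=5):
    """Head pose estimation loop mirroring Algorithm 1."""
    for I_t in frames:
        # 1. Base frame set selection (always includes the previous
        #    frame and the referential keyframe).
        base_frames = select_base_frames(model, I_t, n_b)
        # 2. Pose-change measurements {y_s^t, Λ_{y_s^t}} per base frame.
        measurements = measure_pose_changes(model, I_t, base_frames)
        # 3. Joint Kalman update: adapt keyframe poses, estimate x_t.
        x_t = adapt_model(model, measurements)
        # 4. Keyframe acquisition/removal keeps the pose tessellation constant.
        manage_keyframes(model, I_t, x_t)
```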
A conventional view-based appearance model [9] consists of different views of the same object of interest (e.g., images representing the head at different orientations). GAVAM extends the concept of a view-based appearance model by associating a pose and a covariance with each view. Our view-based model M is formally defined as

M = {{I_i, x_i}, Λ_X}

where each view i is represented by I_i and x_i, which are respectively the intensity image and its associated pose modeled with a Gaussian distribution, and Λ_X is the covariance matrix over all random variables x_i. For each pose x_i, there exists a sub-matrix Λ_{x_i} on the diagonal of Λ_X that represents the covariance of the pose x_i. Each pose is a 6-dimensional vector consisting of the translation and the three Euler angles, [T_x T_y T_z Ω_x Ω_y Ω_z]. The pose estimates in our view-based model are adapted using the Kalman filter update, with the pose-change measurements Y as observations and the concatenated poses as the state vector. Section 3.3 describes this adaptation process in detail.
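To make the definition concrete, a minimal container for M might look as follows. This is our own NumPy sketch, not code from the paper, and the class and method names are invented for illustration. Each view pairs an intensity image with a 6-vector pose, and a single joint covariance Λ_X keeps all poses correlated so they can be adapted together:

```python
import numpy as np

class ViewBasedModel:
    """M = {{I_i, x_i}, Λ_X}: views with 6-DOF Gaussian poses, joint covariance."""
    def __init__(self):
        self.images = []                 # I_i: intensity images
        self.poses = []                  # x_i: 6-vectors [Tx, Ty, Tz, Ox, Oy, Oz]
        self.cov = np.zeros((0, 0))      # Λ_X: joint covariance over all poses

    def add_view(self, image, pose, pose_cov):
        """Append a view; grow Λ_X with the new 6x6 block Λ_{x_i} on the diagonal."""
        n = self.cov.shape[0]
        grown = np.zeros((n + 6, n + 6))
        grown[:n, :n] = self.cov         # existing cross-correlations are preserved
        grown[n:, n:] = pose_cov         # new pose block, initially uncorrelated
        self.cov = grown
        self.images.append(image)
        self.poses.append(np.asarray(pose, dtype=float))
```

Storing one joint Λ_X, rather than isolated per-view covariances, is what allows a measurement against one keyframe to correct the poses of all correlated keyframes during the Kalman update.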
The views (I_i, x_i) represent the object of interest (i.e., the head) as it appears from different angles and depths. Different pose estimation paradigms use different types of views:

• A differential tracker uses only two views: the current frame (I_t, x_t) and the previous frame (I_{t-1}, x_{t-1}).

• A keyframe-based (or template-based) approach uses 1 + n views: the current frame (I_t, x_t) and the j = 1...n keyframes {I_{K_j}, x_{K_j}}. Note that GAVAM acquires keyframes online and adapts their poses during tracking, so n, {x_{K_j}} and Λ_X change over time.

• A static user-independent head pose estimator uses only the current frame (I_t, x_t) to produce its estimate. In GAVAM, this pose estimate is modeled as a pose-change measurement between the current frame (I_t, x_t) and a reference keyframe (I_{K_0}, x_{K_0}) placed at the origin.

Since GAVAM integrates all three estimation paradigms, its view-based model M consists of 3 + n views: the current frame (I_t, x_t), the previous frame (I_{t-1}, x_{t-1}), a reference keyframe (I_{K_0}, x_{K_0}), and n keyframe views {I_{K_j}, x_{K_j}}, where j = 1...n. The keyframes are selected online to best represent the head under different orientations and positions. Section 3.4 describes the details of this tessellation.

3.1. Base Frame Set Selection

The goal of the base frame set selection process is to find a subset of views (base frames) in the current view-based appearance model M that are similar in appearance (and implicitly in pose) to the current frame I_t. This step reduces the computation time, since pose-change measurements will be computed only on this subset.

To perform good base frame set selection (and pose-change measurements) we need to segment the face in the current frame. In the original AVAM algorithm [23], face segmentation was simplified by using the depth images from the stereo camera: with only an approximate estimate of the 2D position of the face and a simple 3D model of the head (i.e., a 3D box), AVAM was able to segment the face. Since GAVAM uses only a monocular camera, its base frame set selection algorithm is necessarily more sophisticated. Algorithm 2 summarizes the base frame set selection process.

The ellipsoid head model used to create the face mask for each keyframe is a half ellipsoid with the dimensions of an average head (see Section 4.1 for more details). The ellipsoid is rotated and translated based on the keyframe pose x_{K_j} and then projected into the image plane using the camera's internal calibration parameters (focal length and image center).
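The projection step can be illustrated with a few lines of geometry. The following sketch is ours (the paper gives no code): it poses a sampled half ellipsoid with a 6-DOF pose and projects it through a pinhole model with focal length f and image center (c_x, c_y); the ellipsoid axes, Euler convention, and sampling scheme are placeholder choices.

```python
import numpy as np

def euler_to_rot(ox, oy, oz):
    """Rotation matrix from Euler angles (Z*Y*X convention chosen for this sketch)."""
    cx, sx = np.cos(ox), np.sin(ox)
    cy, sy = np.cos(oy), np.sin(oy)
    cz, sz = np.cos(oz), np.sin(oz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_ellipsoid(pose, f, cx, cy, axes=(0.08, 0.11, 0.09), n=400):
    """Project sampled points of a half ellipsoid (axes in meters, placeholder
    average-head values) posed at [Tx, Ty, Tz, Ox, Oy, Oz] into the image plane.
    Assumes the head lies in front of the camera (Z > 0)."""
    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, np.pi, n)           # front half of the ellipsoid only
    v = rng.uniform(0.0, np.pi, n)
    pts = np.stack([axes[0] * np.sin(v) * np.cos(u),
                    axes[1] * np.cos(v),
                    axes[2] * np.sin(v) * np.sin(u)], axis=1)
    pose = np.asarray(pose, dtype=float)
    R = euler_to_rot(*pose[3:])
    cam = pts @ R.T + pose[:3]               # rotate and translate into camera frame
    return np.stack([f * cam[:, 0] / cam[:, 2] + cx,   # pinhole projection
                     f * cam[:, 1] / cam[:, 2] + cy], axis=1)
```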
The face hypotheses set represents the different positions and scales at which the face could appear in the current frame. The first hypothesis is created by projecting the pose x_{t-1} of the previous frame into the image plane of the current frame. Further face hypotheses are created around this first hypothesis based on the trace of the previous pose covariance, tr(Λ_{x_{t-1}}): if tr(Λ_{x_{t-1}}) is larger than a preset threshold, face hypotheses are created around the first hypothesis with increments of one pixel along both image plane axes and of 0.2 meters along the Z axis. The thresholds were set based on preliminary experiments, and the same values were used for all experiments. For each face hypothesis and each keyframe, an L2-norm distance is computed, and the n_b best keyframes are then selected and added to the base frame set. The previous frame (I_{t-1}, x_{t-1}) and the referential keyframe (I_{K_0}, x_{K_0}) are always added to the base frame set.

Algorithm 2 Base Frame Set Selection. Given the current frame I_t and the view-based model M, return a set of selected base frames {I_s, x_s}.
Create face hypotheses for the current frame: Based on the previous frame pose x_{t-1} and its associated covariance Λ_{x_{t-1}}, create a set of face hypotheses for the current frame (see Section 3.1 for details). Each face hypothesis is composed of a 2D coordinate and a scale factor representing the face center and its approximate depth.
for each keyframe (I_{K_j}, x_{K_j}) do
  Compute the face segmentation in the keyframe: Position the ellipsoid head model (see Section 4.1) at pose x_{K_j}, back-project it into the image plane I_{K_j}, and compute the valid face pixels.
  for each face hypothesis of the current frame do
    Align the current frame: Based on the face hypothesis, scale and translate the current image to align it with the center of the keyframe face segmentation.
    Compute the distance: Compute the L2-norm distance between the keyframe and the aligned current frame over all valid pixels from the keyframe face segmentation.
  end for
  Select a face hypothesis: The face hypothesis with the smallest distance is selected to represent this keyframe.
end for
Base frame set selection: Based on their correlation scores, add the n_b best keyframes to the base frame set. Note that the previous frame (I_{t-1}, x_{t-1}) and the referential keyframe (I_{K_0}, x_{K_0}) are always added to the base frame set.
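A condensed sketch of this procedure, combining the hypothesis grid with the masked L2 scoring of Algorithm 2, is shown below. It is our own illustration, not the paper's code: `align`, the face masks, the trace threshold, and the search radius are placeholders, and the previous frame and referential keyframe are assumed to be appended by the caller.

```python
import numpy as np

def face_hypotheses(cov_prev, trace_threshold=1.0, radius=2):
    """Offsets (du, dv, dz) around the projected previous-frame pose:
    1-pixel steps in the image plane, 0.2 m steps along Z."""
    hyps = [(0, 0, 0.0)]                          # first hypothesis: previous pose
    if np.trace(cov_prev) > trace_threshold:      # pose uncertain: search a grid
        hyps += [(du, dv, dz)
                 for du in range(-radius, radius + 1)
                 for dv in range(-radius, radius + 1)
                 for dz in (-0.2, 0.0, 0.2)
                 if (du, dv, dz) != (0, 0, 0.0)]
    return hyps

def select_base_frames(keyframes, frame, hyps, align, n_b):
    """Pick the n_b keyframes with the smallest masked L2 distance to the
    current frame, each scored under its best face hypothesis."""
    scores = []
    for k, (I_k, mask) in enumerate(keyframes):   # mask: valid face pixels of k
        d = min(np.linalg.norm((align(frame, h) - I_k)[mask]) for h in hyps)
        scores.append((d, k))
    return [k for _, k in sorted(scores)[:n_b]]
```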
3.2. Pose-Change Measurements

Pose-change measurements are relative pose differences between the current frame and one of the other views in our model M. We presume that each pose-change measurement is probabilistically drawn from a Gaussian distribution N(y_s^t | x_t − x_s, Λ_{y_s^t}). Since pose increments must, by definition, be additive, the pose-changes are assumed to be Gaussian. Formally, the set of pose-change measurements Y is defined as

Y = {y_s^t, Λ_{y_s^t}}

Different pose estimation paradigms return different pose-change measurements:

• The differential tracker computes the relative pose between the current frame and the previous frame, and returns the pose-change measurement y_{t-1}^t with covariance Λ_{y_{t-1}^t}. Section 4.1 describes the view registration algorithm.

• The keyframe tracker uses the same view registration algorithm (described in Section 4.1) to compute the pose-change measurements {y_{K_s}^t, Λ_{y_{K_s}^t}} between the current frame and the selected keyframes.

• The static head pose estimator (described in Section 4.2) returns the pose-change measurement (y_{K_0}^t, Λ_{y_{K_0}^t}) based on the intensity image of the current frame.
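All three paradigms thus deliver the same kind of observation: a 6-vector y_s^t with covariance Λ_{y_s^t}, assumed drawn from N(x_t − x_s, Λ_{y_s^t}). A small sketch of this shared representation (our own, with invented names) and the corresponding Gaussian log-likelihood:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PoseChange:
    s: int              # index of the base frame (previous frame, keyframe, or K_0)
    y: np.ndarray       # 6-vector pose-change measurement y_s^t
    cov: np.ndarray     # 6x6 covariance Λ_{y_s^t}

    def log_likelihood(self, x_t, x_s):
        """log N(y | x_t - x_s, Λ): how well the measurement fits the poses."""
        r = self.y - (x_t - x_s)                      # residual
        sign, logdet = np.linalg.slogdet(self.cov)
        return -0.5 * (r @ np.linalg.solve(self.cov, r)
                       + logdet + len(r) * np.log(2 * np.pi))
```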
3.3. Model Adaptation and Pose Estimation

The simultaneous model adaptation and pose estimation uses the Kalman filter formulation described in [23]. The state vector X is the concatenation of the view poses {x_t, x_{t-1}, x_{K_0}, x_{K_1}, x_{K_2}, ...} described in Section 3, and the observation vector Y is the concatenation of the pose-change measurements {y_{t-1}^t, y_{K_0}^t, y_{K_1}^t, y_{K_2}^t, ...} described in the previous section. The covariance between the components of X is denoted Λ_X.

The Kalman filter update computes a prior for p(X_t | Y_{1..t-1}) by propagating p(X_{t-1} | Y_{1..t-1}) one step forward using a dynamic model. Each pose-change measurement y_s^t ∈ Y between the current frame and a base frame of X is modeled as having come from

y_s^t ∼ N(x_t − x_s, Λ_{y_s^t})
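Because the measurement mean x_t − x_s is linear in the concatenated state, the update can be written with an observation matrix C_s holding a +I block at the current-frame pose and a −I block at the base-frame pose. The following is a minimal sketch of a single measurement update in that form (our own code; the complete formulation, including the dynamic model and prior propagation, is given in [23]):

```python
import numpy as np

def kalman_pose_update(X, cov_X, y, cov_y, t_idx, s_idx, d=6):
    """One Kalman update of the concatenated pose vector X with a single
    pose-change measurement y_s^t ~ N(x_t - x_s, Λ_{y_s^t})."""
    C = np.zeros((d, X.size))
    C[:, t_idx * d:(t_idx + 1) * d] = np.eye(d)    # +I on the current-frame block
    C[:, s_idx * d:(s_idx + 1) * d] = -np.eye(d)   # -I on the base-frame block
    S = C @ cov_X @ C.T + cov_y                    # innovation covariance
    K = cov_X @ C.T @ np.linalg.inv(S)             # Kalman gain
    X_new = X + K @ (y - C @ X)                    # correct all poses jointly
    cov_new = (np.eye(X.size) - K @ C) @ cov_X     # updated Λ_X
    return X_new, cov_new
```

Because C_s couples x_t and x_s through the joint Λ_X, this one update simultaneously refines the current pose and the base frame's pose, which is how keyframes acquired online are adapted over time.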
[8] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. PAMI, 23(6):681–684, June 2001.
[9] T. F. Cootes, G. V. Wheeler, K. N. Walker, and C. J. Taylor. View-based active appearance models. Image and Vision Computing, 20:657–664, August 2002.
[10] Videre Design. MEGA-D Megapixel Digital Stereo Head. http://www.ai.sri.com/~konolige/svs/, 2000.
[11] M. Eckhardt, I. Fasel, and J. Movellan. Towards practical facial feature detection. Submitted to IJPRAI, 2008.
[12] The Flock of Birds. P.O. Box 527, Burlington, Vt. 05402.
[13] Y. Fu and T. Huang. Graph embedded analysis for head pose estimation. In Proc. IEEE Intl. Conf. Automatic Face and Gesture Recognition, pages 3–8, 2006.
[14] Y. Fu and T. S. Huang. hMouse: Head tracking driven virtual computer mouse. In IEEE Workshop on Applications of Computer Vision, pages 30–35, 2007.
[15] N. Gourier, J. Maisonnasse, D. Hall, and J. Crowley. Head pose estimation on low resolution images. Lecture Notes in Computer Science, 4122:270–280, 2007.
[16] C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face detection. PAMI, 29(4), 2007.
[17] J. Huang, X. Shao, and H. Wechsler. Face pose discrimination using support vector machines (SVM). In Proc. Intl. Conf. Pattern Recognition, pages 154–156, 1998.
[18] K. Huang and M. Trivedi. Robust real-time detection, tracking, and pose estimation of faces in video streams. In ICPR, 2004.
[19] R. Kjeldsen. Head gestures for computer control. In Proc. Second International Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, pages 62–67, 2001.
[20] M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. PAMI, 22(4):322–336, April 2000.
[21] A. Lanitis, C. Taylor, and T. Cootes. Automatic interpretation of human faces and hand gestures using flexible models. In FG, pages 98–103, 1995.
[22] C. Lawson and R. Hanson. Solving Least Squares Problems. Prentice-Hall, 1974.
[23] L.-P. Morency, A. Rahimi, and T. Darrell. Adaptive view-based appearance model. In CVPR, volume 1, pages 803–810, 2003.
[24] E. Murphy-Chutorian, A. Doshi, and M. M. Trivedi. Head pose estimation for driver assistance systems: A robust algorithm and experimental evaluation. In Intelligent Transportation Systems, 2007.
[25] A. Schodl, A. Haro, and I. Essa. Head tracking using a textured polygonal model. In PUI, 1998.
[26] J. Sherrah and S. Gong. Fusion of perceptual cues for robust tracking of head pose and position. Pattern Recognition, 34(8):1565–1572, 2001.
[27] L. Torresani and A. Hertzmann. Automatic non-rigid 3D modeling from video. In ECCV, 2004.
[28] L. Vacchetti, V. Lepetit, and P. Fua. Fusing online and offline information for stable 3D tracking in real-time. In CVPR, 2003.
[29] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. In ICCV, pages 722–729, 1999.
[30] P. Viola and M. Jones. Robust real-time face detection. In ICCV, page II: 747, 2001.
[31] L. Wiskott, J. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. PAMI, 19(7):775–779, July 1997.
[32] J. Wu and M. M. Trivedi. An integrated two-stage framework for robust head pose estimation. In International Workshop on Analysis and Modeling of Faces and Gestures, 2005.