Real-Time Fingertip Tracking and Gesture Recognition

Kenji Oka and Yoichi Sato
University of Tokyo
2 EnhancedDesk's two-handed drawing system.
3 Fingertip detection.
window's size. However, we found that a fixed-size search window works reliably because the distance from the infrared camera to a user's hand on the augmented desk interface system remains relatively constant.

We then search for fingertips within the new window. A cylinder with a hemispherical cap approximates finger shape, and the projected finger shape in an input image appears to be a rectangle with a semicircle at its tip, so we can search for a fingertip based on its geometrical features. Our method uses normalized correlation with a template of a properly sized circle corresponding to a user's fingertip size.

Although a semicircle reasonably approximates projected fingertip shape, we must consider false detection from the template matching and must also find a sufficiently large number of candidates. Our current implementation selects 20 candidates with the highest matching scores inside each search window, a sample we consider large enough to include all true fingertips.

Once we've selected the fingertip candidates, we remove false candidates using two methods. We remove multiple matching around the fingertip's true location by suppressing neighbor candidates around a candidate with the highest matching score. We then remove matching that occurs in the finger's middle by examining surrounding pixels around a matched template's center. If multiple diagonal pixels lie inside the hand region, we consider the candidate not part of a fingertip and therefore discard it (Figure 3b).

Finding a palm's center. In our method, the center of a user's hand is given as the point whose distance to the closest region boundary is the maximum. This makes the hand's center insensitive to changes such as opening and closing of the hand. We compute this location by a morphological erosion operation of an extracted hand region. First, we obtain a rough shape of the user's palm by cutting out the hand region at the estimated wrist. We assume the wrist's location is at the predetermined distance from the top of the search window and perpendicular to the hand region's principal direction (Figure 3c).
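To make these two steps concrete, here is a minimal Python/OpenCV sketch of the candidate selection with neighbor suppression and of the erosion-based palm-center computation. The function names, the suppression radius, and the 3 × 3 structuring element are our choices for illustration, not values from the paper.

```python
import cv2
import numpy as np

def fingertip_candidates(window, template, n_candidates=20):
    """Select fingertip candidates inside a search window by normalized
    correlation with a circular fingertip template, suppressing weaker
    matches around each accepted candidate."""
    radius = template.shape[0] // 2  # suppression radius (our choice)
    scores = cv2.matchTemplate(window, template, cv2.TM_CCORR_NORMED)
    candidates = []
    for _ in range(n_candidates):
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        if max_val <= 0:
            break
        candidates.append((max_loc, max_val))
        # suppress neighbor candidates around the accepted one
        cv2.circle(scores, max_loc, radius, 0, thickness=-1)
    return candidates

def palm_center(hand_mask):
    """Palm center = point whose distance to the closest region boundary
    is maximal, found by repeated morphological erosion of the mask."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    mask = (hand_mask > 0).astype(np.uint8)
    center = None
    while cv2.countNonZero(mask) > 0:
        ys, xs = np.nonzero(mask)
        center = (int(xs.mean()), int(ys.mean()))  # centroid of survivors
        mask = cv2.erode(mask, kernel)
    return center
```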
4 Taking fingertip correspondences: (a) detecting fingertips and (b) comparing detected and predicted fingertip locations to determine trajectories.
Determining trajectories.
Suppose that we detect $n_t$ fingertips in the $t$th image frame $I_t$. We refer to these $n_t$ fingertips' locations as $F_{i,t}$ ($i = 1, 2, \ldots, n_t$), as Figure 4a shows. First, we predict the locations $F'_{i,t+1}$ of the fingertips in the next frame $I_{t+1}$. Then we compare the locations $F_{j,t+1}$ ($j = 1, 2, \ldots, n_{t+1}$) of the $n_{t+1}$ fingertips detected in the $(t+1)$th image frame $I_{t+1}$ with the predicted locations $F'_{i,t+1}$ (Figure 4b). Finding the best combination among these two sets of fingertips lets us reliably determine multiple fingertip trajectories in real time (Figure 5).

Predicting fingertip locations. We use the Kalman filter to predict fingertip locations in one image frame based on their locations detected in the previous frame. We apply this process separately for each fingertip.

First, we measure each fingertip's location and velocity in each image frame. Hence we define the state vector as

$$\mathbf{x}_t = \left( x(t),\ y(t),\ v_x(t),\ v_y(t) \right)^T \qquad (1)$$

where $(x(t), y(t))$ is the fingertip's location and $(v_x(t), v_y(t))$ its velocity in the $t$th image frame. We define the observation vector $\mathbf{y}_t$ to represent the location of the fingertip detected in the $t$th frame. The state vector $\mathbf{x}_t$ and observation vector $\mathbf{y}_t$ are related by the following basic system equations:

$$\mathbf{x}_{t+1} = F \mathbf{x}_t + G \mathbf{w}_t \qquad (2)$$

$$\mathbf{y}_t = H \mathbf{x}_t + \mathbf{v}_t \qquad (3)$$

where $F$ is the state transition matrix, $G$ is the driving matrix, $H$ is the observation matrix, $\mathbf{w}_t$ is system noise added to the velocity of the state vector $\mathbf{x}_t$, and $\mathbf{v}_t$ is the observation noise, that is, the error between the real and detected locations.

Here we assume approximately uniform straight motion for each fingertip between two successive image frames because the frame interval $\Delta T$ is short. Then $F$, $G$, and $H$ are given as follows:

$$F = \begin{pmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (4)$$

$$G = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}^T \qquad (5)$$

$$H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \qquad (6)$$
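Under this uniform straight-motion model, Equations 4 through 6 translate directly into code. A minimal NumPy sketch follows; the 30-Hz frame rate is an assumption for illustration, not a value stated in the text.

```python
import numpy as np

DT = 1.0 / 30.0  # frame interval ΔT; a 30-Hz camera is assumed here

# State transition, driving, and observation matrices (Equations 4-6)
F = np.array([[1.0, 0.0,  DT, 0.0],
              [0.0, 1.0, 0.0,  DT],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
G = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]).T  # 4x2: noise drives velocity only
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])    # 2x4: we observe location only
```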
6 Correspondences of detected and predicted fingertips: (a) fingertip order and (b) incorrect thumb and finger detection.
The $(x, y)$ coordinates of the state vector $\mathbf{x}_t$ coincide with those of the observation vector $\mathbf{y}_t$ defined with respect to the image coordinate system. This is for simplicity of discussion without loss of generality. The observation matrix $H$ should be in an appropriate form, depending on the transformation between the world coordinate system defined in the work space (for example, a desktop of our augmented desk interface system) and the image coordinate system.

Also, we assume that both the system noise $\mathbf{w}_t$ and the observation noise $\mathbf{v}_t$ are constant Gaussian noise with a zero mean. Thus the covariance matrices for $\mathbf{w}_t$ and $\mathbf{v}_t$ become $\sigma_w^2 I_{2\times 2}$ and $\sigma_v^2 I_{2\times 2}$, respectively, where $I_{2\times 2}$ represents a $2 \times 2$ identity matrix. This is a rather coarse approximation, and those two noise components should be estimated for each image frame based on some clue such as a matching score for normalized correlation for template matching. We plan to study this in the future.

Finally, we formulate a Kalman filter as

$$K_t = \tilde{P}_t H^T \left( I_{2\times 2} + H \tilde{P}_t H^T \right)^{-1} \qquad (7)$$

$$\tilde{\mathbf{x}}_{t+1} = F \left\{ \tilde{\mathbf{x}}_t + K_t \left( \mathbf{y}_t - H \tilde{\mathbf{x}}_t \right) \right\} \qquad (8)$$

$$\tilde{P}_{t+1} = F \left( \tilde{P}_t - K_t H \tilde{P}_t \right) F^T + \frac{\sigma_w^2}{\sigma_v^2} \Lambda \qquad (9)$$

where $\tilde{\mathbf{x}}_t$ equals $\hat{\mathbf{x}}_{t|t-1}$, the estimated value of $\mathbf{x}_t$ from $\mathbf{y}_0, \ldots, \mathbf{y}_{t-1}$; $\tilde{P}_t$ equals $\hat{\Sigma}_{t|t-1}/\sigma_v^2$, where $\hat{\Sigma}_{t|t-1}$ represents the covariance matrix of the estimation error of $\hat{\mathbf{x}}_{t|t-1}$; $K_t$ is the Kalman gain; and $\Lambda$ equals $GG^T$.

Then the predicted location of the fingertip in the $(t+1)$th image frame is given as $(x(t+1), y(t+1))$ of $\tilde{\mathbf{x}}_{t+1}$. If we need a predicted location after more than one image frame, we can calculate it as follows:

$$\hat{\mathbf{x}}_{t+m|t} = F^m \left\{ \tilde{\mathbf{x}}_t + K_t \left( \mathbf{y}_t - H \tilde{\mathbf{x}}_t \right) \right\} \qquad (10)$$

$$\hat{P}_{t+m|t} = F^m \left( \tilde{P}_t - K_t H \tilde{P}_t \right) \left( F^T \right)^m + \frac{\sigma_w^2}{\sigma_v^2} \sum_{l=0}^{m-1} F^l \Lambda \left( F^T \right)^l \qquad (11)$$

where $\hat{\mathbf{x}}_{t+m|t}$ is the estimated value of $\mathbf{x}_{t+m}$ from $\mathbf{y}_0, \ldots, \mathbf{y}_t$, and $\hat{P}_{t+m|t}$ equals $\hat{\Sigma}_{t+m|t}/\sigma_v^2$, where $\hat{\Sigma}_{t+m|t}$ represents the covariance matrix of the estimation error of $\hat{\mathbf{x}}_{t+m|t}$.
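In matrix form, this update loop is compact. Here is a NumPy sketch of Equations 7 through 11; the variable names are ours (Lam stands for $\Lambda = GG^T$ and w2_over_v2 for $\sigma_w^2/\sigma_v^2$), and the code is a reading of the equations above, not the authors' implementation.

```python
import numpy as np

def kalman_step(x_tilde, P_tilde, y, F, H, Lam, w2_over_v2):
    """One pass of Equations 7-9: Kalman gain, next-frame state
    prediction, and propagation of the normalized error covariance."""
    K = P_tilde @ H.T @ np.linalg.inv(np.eye(2) + H @ P_tilde @ H.T)   # (7)
    x_next = F @ (x_tilde + K @ (y - H @ x_tilde))                     # (8)
    P_next = F @ (P_tilde - K @ H @ P_tilde) @ F.T + w2_over_v2 * Lam  # (9)
    return x_next, P_next, K

def predict_m_frames(x_tilde, K, y, F, H, m):
    """m-frame-ahead state prediction of Equation 10."""
    return np.linalg.matrix_power(F, m) @ (x_tilde + K @ (y - H @ x_tilde))
```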
Fingertip correspondences between successive frames. For each image frame, we detect fingertips as described earlier and examine correspondences between the detected fingertips' locations and the predicted fingertip locations from Equation 8 or 10.

More precisely, we compute the sum of the squares of distances between a detected fingertip and a predicted fingertip for all possible combinations and consider the combination with the least sum to be the most reasonable.

To avoid a high computational cost for examining all possible combinations, we reduce the number of combinations by considering the clockwise (or counterclockwise) fingertip order around the hand's center (Figure 6a). In other words, we assume the fingertip order in input images doesn't change. For instance, in Figure 6a we consider only three combinations. This reduces the maximum possible combinations from $_5P_5$ to $_5C_5$.

Occasionally, the system doesn't detect one or more fingertips in an input image frame. Figure 6b illustrates an example where an error prevents detection of the thumb and little finger. To improve our method's reliability for tracking multiple fingertips, we use a missing fingertip's predicted location to continue tracking it. If we find no fingertip corresponding to the predicted one, we examine the element (1, 1) of the covariance matrix $\tilde{P}_{t+1}$ in Equation 9 for the predicted fingertip. This element represents the ambiguity of the predicted fingertip's location; if it's smaller than a predetermined ambiguity threshold, we consider the fingertip to be undetected because of an image frame error. We then use the fingertip's predicted location as its true location and continue tracking it.

If the element (1, 1) of the covariance matrix exceeds a predetermined threshold, we determine that the fingertip prediction is unreliable and terminate its tracking.
Our current implementation fixes an experimentally chosen ambiguity threshold.

If we detect more fingertips than predicted, we start tracking a fingertip that doesn't correspond to any of the predictions. We treat its trajectory as that of a new fingertip after the predicted fingertip location's ambiguity falls below a predetermined threshold.

[Figure: tracking accuracy plot; vertical axis: Accuracy (percent), 90 to 100.]
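A brute-force version of the order-preserving correspondence search might look like the following sketch. It assumes both lists are already sorted consistently by angle around the palm's center and that no more fingertips are detected than predicted; these simplifications are ours, not the paper's.

```python
import itertools
import numpy as np

def match_fingertips(detected, predicted):
    """Order-preserving correspondence search between detected and
    predicted fingertip locations. Trying only subsets that keep the
    fingertip order turns the nPn permutations into nCn combinations;
    the assignment with the least sum of squared distances wins."""
    det = np.asarray(detected, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    best_cost, best_subset = np.inf, None
    for subset in itertools.combinations(range(len(pred)), len(det)):
        cost = float(np.sum((det - pred[list(subset)]) ** 2))
        if cost < best_cost:
            best_cost, best_subset = cost, subset
    # detected[i] corresponds to predicted[best_subset[i]]; any predicted
    # fingertip left unmatched would then be checked against the
    # ambiguity threshold described above.
    return best_subset, best_cost
```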
8 Interaction based on direct manipulation and symbolic gestures. [Diagram: the tracking result (direct manipulation or symbolic gesture) feeds a gesture mode selector; in direct manipulation mode, fingertip locations go to an application controller for direct manipulation; in symbolic gesture mode, trajectories go to a symbolic gesture recognizer, which passes the kind of gesture, with the gesture's size and location, to an application controller for symbolic gestures.]
The system therefore needs a symbolic gesture recognizer in addition to recognizing gesture locations and sizes based on trajectories.

To distinguish symbolic gestures from direct manipulation, our system locates a thumb in the measured trajectories. As described earlier, we regard gestures with an extended thumb as direct manipulation and those with a bent thumb as symbolic gestures, and we use the HMM to recognize the segmented symbolic gestures. Our gesture recognition system should prove useful for augmented desk interface applications such as the drawing tool shown in Figure 2.

Detecting a thumb
Our method for distinguishing a thumb from the other tracked fingertips uses the angle θ between finger direction (the direction from the hand's center to the finger's base) and arm orientation (Figure 9). We use the finger's base because it's more stable than the tip even if the finger moves.

9 Definition of the angle θ between finger direction and arm orientation, for thumb detection.

In the initialization stage, we define the thumb's standard angle $\theta_T$ and that of the forefinger $\theta_F$ ($\theta_T > \theta_F$). First, we apply morphological processing and image subtraction to a binarized hand image to extract finger regions. We regard the end of the extracted finger opposite the fingertip as the base of the finger and calculate $\theta$.

Here we define $\theta_k$ as $\theta$ in the $k$th frame from the finger trajectory's origin, with the current frame being the $N$th frame from the origin. Then the score $s_T$, which represents a thumb's likelihood, is given as follows:

$$s'_T(k) = \begin{cases} 1.0 & \text{if } \theta_k > \theta_T \\[4pt] \dfrac{\theta_k - \theta_F}{\theta_T - \theta_F} & \text{if } \theta_F \le \theta_k \le \theta_T \\[4pt] 0.0 & \text{if } \theta_k < \theta_F \end{cases} \qquad (12)$$

$$s_T = \frac{\sum_{k=1}^{N} s'_T(k)}{N} \qquad (13)$$

If $s_T$ exceeds 0.5, we regard the finger as a thumb.

To evaluate this method's reliability, we performed three kinds of tests mimicking actual desktop work: drawing with only a thumb, picking up with a thumb and forefinger, and drawing with only a forefinger. Table 1 shows the results, demonstrating that the method reliably distinguishes the thumb from other hand parts.
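Equations 12 and 13 reduce to a clipped-linear average, as this small sketch shows; thumb_likelihood is a hypothetical helper name of our choosing.

```python
import numpy as np

def thumb_likelihood(thetas, theta_T, theta_F):
    """Thumb-likelihood score of Equations 12 and 13: each per-frame
    score s'_T(k) is the angle theta_k scaled linearly between the
    forefinger and thumb standard angles and clipped to [0, 1];
    s_T is the mean over the N frames of the trajectory."""
    thetas = np.asarray(thetas, dtype=float)
    s = np.clip((thetas - theta_F) / (theta_T - theta_F), 0.0, 1.0)
    return float(s.mean())

# Per the text, the finger is regarded as a thumb when the score
# exceeds 0.5:
# is_thumb = thumb_likelihood(trajectory_angles, theta_T, theta_F) > 0.5
```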
Symbolic gesture recognition
Like other recognition techniques,2,3 our symbolic gesture recognition system uses HMM. The input to the HMM consists of two components for recognizing multiple fingertip trajectories: the number of tracked fingertips and a discrete code from 1 to 16 that represents the direction of the tracked fingertips' average motion. It's unlikely that we would move each of our fingers independently unless we consciously tried to do so. Thus, we decided to use the direction of the multiple fingertips' average motion instead of each fingertip's direction. We used code 17 to represent no motion.
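A plausible encoding of that input code is sketched below. The sector boundaries and the no-motion threshold eps are our assumptions; the paper doesn't specify them.

```python
import numpy as np

def motion_code(avg_dx, avg_dy, eps=1.0):
    """Quantize the fingertips' average inter-frame motion into the
    discrete HMM input: codes 1-16 for one of 16 equal direction
    sectors, code 17 for no motion (eps is in pixels, our choice)."""
    if np.hypot(avg_dx, avg_dy) < eps:
        return 17
    angle = np.arctan2(avg_dy, avg_dx) % (2.0 * np.pi)
    return int(angle // (2.0 * np.pi / 16.0)) + 1
```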
10 Symbolic gesture examples.

We tested our recognition system using 12 kinds of hand gestures with the fingertip trajectories shown in Figure 10. As a training data set for each gesture, we used 80 hand gestures made by a single person to initialize the HMM. Six other people also participated in testing. For each trial, a test subject made one of the 12 gestures 20 times at arbitrary locations and with arbitrary sizes. Table 2 shows this experiment's results, indicating average accuracy and standard deviation for single-finger and double-finger gestures.

Table 2. Evaluating gesture recognition.

Gesture Type                     Single-Finger    Double-Finger
Average (percent)                    99.2             97.5
Standard deviation (percent)          0.5              1.8

Our system offers reliable, near-perfect recognition of single-finger gestures and high accuracy for double-finger gestures. Our gesture recognition system proves suitable for natural interactions using hand gestures.
Future work
We plan to improve our tracking method's reliability by incorporating additional sensors. Although using an infrared camera had some advantages, it didn't work well on cold hands. We'll solve this problem by using a color camera in addition to the infrared camera.

We also plan to extend our system for 3D tracking. Currently, our tracking method is limited to 2D motion on a desktop. Although this is enough for our augmented desk interface system, other application types require interaction based on 3D hand and finger motion. We'll therefore investigate a practical 3D hand and finger tracking technique using multiple cameras. ■

Kenji Oka is a PhD candidate at the University of Tokyo Graduate School of Information Science and Technology. His research interests include human–computer interaction and computer vision, particularly perceptual user interfaces and human behavior understanding. He received BS and MS degrees in information and communication engineering from the University of Tokyo.