
Real-time Robust Body Part Tracking for Augmented Reality Interface

Jinki Jung, Kyusung Cho, Hyun S. Yang
AI & Media Lab., KAIST
jk@paradise.kaist.ac.kr, qtboy@paradise.kaist.ac.kr, hsyang@kaist.ac.kr

ABSTRACT

We propose real-time robust body part tracking for an augmented reality interface that does not limit the user's freedom. The generality of the system is improved over earlier body part tracking by establishing the ability to handle details such as whether the user wears long sleeves or short sleeves. For precise body part tracking, we obtain images of the hands, head, and feet separately via a single camera, and for each body part we choose the features appropriate to that part. Using a calibrated camera, we transfer the detected 2D body parts into an approximate 3D posture. In experiments conducted to evaluate the body part tracking module, an application built on the proposed interface showed strong hand tracking performance in real time (43.5 fps).

Keywords: Body-part Tracking, Augmented Reality, Interface System.

1. INTRODUCTION

The use of human-computer interaction, or HCI, is experiencing continuous and accelerated growth. As shown by the Nintendo Wii [Nintendo 2009] and Project Natal [Microsoft 2009], the advance of HCI towards more natural human-computer interaction methods continues. In particular, extensive work in HCI for Augmented Reality, or AR, has been carried out in efforts to provide more natural and intuitive means for interacting with computers [Feng et al. 2008; Poupyrev et al. 2002; Azuma et al. 2001]. Most augmented reality applications have an interface between a virtual agent and a user who wears a head-mounted display or operates a handheld device. We define natural interaction as giving the user the capacity to act naturally without encumbrances such as bothersome devices, markers, or restrictive clothes. To achieve natural interaction, we propose a fully vision-based approach that uses only the whole body as input.

Although a considerable proportion of augmented reality research involves the use of an interface, the interfaces developed to date fail to deliver diverse expression; they typically provide only monotonous interaction [Buchmann et al. 2004; Perttu et al. 2005], e.g., selecting a virtual object or hitting a virtual agent. To address this shortcoming, we have designed body part tracking for an augmented reality interface that allows natural and diverse interactions.

Our goals are enhanced expressiveness in gestures and real-time performance. To accomplish these two goals, we detect the most expressive parts of the human body: the head, hands, and feet. Detection of major body parts from a single image is difficult because of image variations caused by varying shape, viewpoint, clothing, and degree of occlusion. We present a new approach and show results on estimation of an approximate 3-dimensional pose in real time. We have also implemented an application, BeNature, an augmented reality ecosystem in which the user's gestures are recognized by the proposed interface.

2. Body part tracking system structure

Figure 1 shows the block diagram of the system. It consists of three major tracking parts that run in order of precedence. All of the tracking parts send or receive feature information to or from the feature extraction module. If all tracking results are successful, the positions of the body parts are used to obtain the 3D position and the user's gesture.

The flow of the system is as follows:

Step 0. By removing the background and the user's shadow from the original image, a foreground image is obtained.

Step 1. Using face texture, the system detects a face in the foreground image. The system then tracks the head using a particle filter, whose measurement is based on face color. We use face texture as the starting point for detection because it is the most distinctive feature of the human body and is independent of skin color.

Step 2. The system extracts contour features, as well as the skin blob feature, using an adaptive skin color model.
Step 3. The system tracks the user's head using a particle filter.

Step 4. By segmenting the skin blob image, the system detects and tracks the two hands.

Step 5. Using the contour of the lower body, the system detects and tracks the feet. The 3D feet position is then used to estimate the 3D body pose.

Step 6. With the position of the right hand, the system extracts meaningful gestures.

Step 7. The system visualizes all augmented objects and the user.

Figure 1: The block diagram of the body part tracking system
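The paper contains no code; purely as an illustration of how the per-frame flow of Steps 0 to 7 could be organized, the following Python sketch is given. All class and function names are hypothetical placeholders rather than the authors' implementation, and the heavy stages (adaptive skin segmentation, particle filtering, contour analysis) are reduced to stubs.

import cv2
import numpy as np

class BodyPartTracker:
    """Hypothetical skeleton of the tracking pipeline described in Section 2."""

    def __init__(self):
        # Step 0 helper: background/shadow removal (MOG2 also labels shadows).
        self.bg = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
        # Step 1 helper: face detection by texture (Haar cascade used as a stand-in).
        self.face = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def process(self, frame):
        # Step 0: foreground image without background and shadow (shadow label 127 is dropped).
        fg = self.bg.apply(frame)
        fg = np.where(fg == 255, 255, 0).astype(np.uint8)

        # Step 1: detect the face in the image, then track the head.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = self.face.detectMultiScale(gray, 1.1, 5)

        # Steps 2-5: feature extraction and per-part tracking (stubs here).
        skin_blobs = self.extract_skin_blobs(frame, fg, faces)   # adaptive skin color model
        head = self.track_head(faces, skin_blobs)                # particle filter on face color
        hands = self.track_hands(skin_blobs)                     # skin blob segmentation
        feet = self.track_feet(fg)                               # lower-body contour

        # Steps 6-7: gesture extraction and visualization would follow here.
        return head, hands, feet

    # Placeholder stubs for the stages detailed in Section 3.
    def extract_skin_blobs(self, frame, fg, faces): return []
    def track_head(self, faces, skin_blobs): return None
    def track_hands(self, skin_blobs): return []
    def track_feet(self, fg): return []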
3. Real-time body part tracking for Augmented Reality interface

The goals of the proposed interface are to obtain precise body parts from the foreground image and to achieve real-time performance. To accomplish these two goals, we detect the parts of the human body using the most efficient features: the skin blob, edge information, and the contour (see Figure 2). Because of the heavy processing time required to obtain 3D body parts with a single camera, it is difficult to track the 3D human body parts in real time [Sminchisescu et al. 2001]. By approximating the 3D body parts, the proposed interface can process the entire interaction in real time.

3.1 Visual feature extraction

As noted above, our approach takes human gestures as input; however, rather than using the whole body [Perttu et al. 2005], we use only the major body parts that can express a variety of gestures. In detecting body parts, we employ different features that can be detected more stably for each part, as shown in Figure 2. We use face texture and skin color as features to detect the head, skin color and edge information to detect the hands, and the contour of the lower body to detect the feet.

The reasons for selecting these features are as follows. First, face texture yields fewer false detections than other features in head detection. Because of their large motion scale, diversity of shape, and agile movement, the hands are the most difficult part to detect, and thus they are detected using skin color. Because we use skin color, the system can track both hands even when the hands are inside the body contour. The feet are always close to the floor and have a smaller range of movement than the hands; therefore, we use the lower-body contour to detect the feet. With these body parts, the system recognizes simple human gestures and visualizes interactions with augmented objects.

Figure 2: System features: the skin blob for head and hands, the contour for feet, edge information for hands, and the foreground

3.2 Enhanced hand detection

The hand is essential to many human gestures and is used when we create or control objects. Our body part tracking module shows its most robust performance for the hand, compared to other parts of the body. We developed a scheme that detects and tracks the hands at the same time. In addition, the length of the sleeve worn by the user does not affect the detection process, owing to the hand energy map that is applied.

The hand energy map uses two types of features, edges and skin color blobs, because the hand area contains more edges and more skin color than other areas of the arm. The likelihood function of the hand energy map can be expressed as:

(1)

where h is the edge information and s is the skin color blob. As can be seen in Figure 3, the hand energy map lets us find the hand correctly within the arm area.
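Equation (1) is not reproduced in this text. Purely to illustrate the idea, the sketch below scores positions inside the arm region by combining a local edge-density term h with a local skin-probability term s; the product combination and the window size are assumptions, not the authors' exact likelihood.

import cv2
import numpy as np

def hand_energy_map(arm_bgr, skin_prob, win=31):
    """Hypothetical hand energy map: combine edge density (h) and skin
    probability (s) over a sliding window and pick the strongest response."""
    gray = cv2.cvtColor(arm_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150).astype(np.float32) / 255.0

    # Local averages of the two cues over a win x win neighbourhood.
    h = cv2.boxFilter(edges, -1, (win, win))
    s = cv2.boxFilter(skin_prob.astype(np.float32), -1, (win, win))

    energy = h * s                      # assumed combination of the two cues
    y, x = np.unravel_index(np.argmax(energy), energy.shape)
    return energy, (int(x), int(y))     # energy map and best hand centre

Here skin_prob would come from the adaptive skin color model described in Section 3.3.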

Figure 3: The hand energy map for detecting a hand in the arm area

3.3 Generality

We have attempted to give our interface generality. To this end, we use an adaptive skin color model to increase robustness to changes in lighting. As the initial value, we take the color information of the face found by the face detector. The color information in the image region of the detected face is used to initialize a hue-saturation space histogram. The mean values of hue and saturation are then selected and used to train a Gaussian model as the skin color model. The adaptive skin color model is robust to changes in lighting and to different skin colors. To give the system further generality, we also establish the ability to recognize body parts whether the user wears long sleeves or short sleeves, as stated above. Figure 4 shows that the interface is robust to changes in lighting and clothing.

Figure 4: Robustness of the proposed interface
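As a rough sketch of how such an adaptive model could be built and applied (the single-Gaussian form, the regularization, and the use of a Mahalanobis score below are assumptions, not the paper's exact parameters):

import cv2
import numpy as np

def fit_skin_model(face_bgr):
    """Fit a Gaussian in hue-saturation space to the detected face region."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    hs = hsv[..., :2].reshape(-1, 2).astype(np.float32)
    mean = hs.mean(axis=0)
    cov = np.cov(hs.T) + 1e-3 * np.eye(2)   # regularize for numerical stability
    return mean, np.linalg.inv(cov)

def skin_probability(frame_bgr, mean, inv_cov):
    """Per-pixel skin likelihood of a frame under the fitted Gaussian."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hs = hsv[..., :2].astype(np.float32) - mean
    d2 = np.einsum("...i,ij,...j->...", hs, inv_cov, hs)   # squared Mahalanobis distance
    return np.exp(-0.5 * d2)

Re-fitting the model from the currently tracked face region is what would make it adapt to lighting changes over time.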
3.4 Registration

To augment virtual objects appropriately around the user in an augmented reality scene filled with virtual objects, it is necessary to know the person's 3D position. For this, we need to know the real environment, identified through the camera in advance, and to use that information as the floor in order to adjust the size and position of the feet. Converting virtual 3D position information into real information is called registration. For registration, information about the actual background is additionally needed. Before tracking starts, registration information about the floor is obtained using our own registration tool.

The system needs at least 3 points to define the floor as a plane. Therefore, we provide 3 point positions in both screen coordinates and real-world coordinates to define a virtual plane associated with the real world. However, if we provide only the minimum number of points the system needs, the error may be considerably large; we therefore provide 9 points, which is deemed sufficient. With these 9 points, we acquire the registration data, which includes the internal and external parameters, using Tsai's method [Tsai 1987]. The intrinsic parameters include the skew, focal length, and location of the principal point, while the extrinsic parameters include the translation and rotation of the camera.
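Tsai's calibration itself is not reproduced here. As an illustrative stand-in built on the same idea, known floor points and their image projections yielding the camera pose, OpenCV's solvePnP can recover the extrinsic parameters when intrinsics are assumed; all numeric values below are hypothetical.

import cv2
import numpy as np

# Nine hypothetical registration points on the floor plane (z = 0, metres).
floor_world = np.array([[x, y, 0.0] for y in (0.0, 1.0, 2.0) for x in (0.0, 1.0, 2.0)],
                       dtype=np.float32)

# Assumed intrinsics (in the paper these too come from Tsai's method).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# A made-up ground-truth pose, used only to synthesise the clicked image points.
rvec_true = np.array([[-1.2], [0.0], [0.0]])
tvec_true = np.array([[-1.0], [0.5], [3.0]])
floor_image, _ = cv2.projectPoints(floor_world, rvec_true, tvec_true, K, dist)
floor_image = floor_image.reshape(-1, 2)

# Extrinsics: rotation and translation of the camera relative to the floor plane.
ok, rvec, tvec = cv2.solvePnP(floor_world, floor_image, K, dist)
R, _ = cv2.Rodrigues(rvec)
print("camera rotation:\n", R, "\ncamera translation:\n", tvec.ravel())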

3.5 Approximation of 3D pose

To obtain the 3D pose of the user, we use the 3D position of the user's feet in real coordinates. We calculate the distances between all the contour points and the center point of the contour, and then obtain 3 local maximum points to compute the cross product. We then select the two convex points as the positions of the feet. With the registration data mentioned above, the 3D position of the feet can be established.

Furthermore, our method can also be applied to obtain the orientation vector of the user's body. First, we select two points that are located at a particular distance from the vertices of the contour of one of the two feet (both points maintain the same distance from the contour). We then calculate the middle angle by dividing the internal angle between the selected points equally, and use this angle as the angle of the foot. We can then obtain the orientation vector of the user's body by interpolating the two foot angles.

To derive a planar human model, we build a plane that is perpendicular to the floor and that passes through the two endpoints of the feet. We then select the points in that plane that meet the lines projected through the camera center C (see Figure 5) as the approximate values of the human pose. Because the user is distant from the camera in the proposed system, the error between the real and approximate values is deemed negligible.

Figure 5: The planar human model for approximation of the 3D pose
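A simplified sketch of the foot-finding step is given below (OpenCV 4 is assumed; the smoothing window and the rule of keeping the lowest two candidates are placeholders for the paper's convexity test):

import cv2
import numpy as np

def find_feet(fg_mask, num_candidates=3):
    """Pick foot candidates as the contour points farthest from the contour centre."""
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return []
    body = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float32)

    centre = body.mean(axis=0)
    dist = np.linalg.norm(body - centre, axis=1)

    # Smooth the distance profile and keep local maxima as candidates.
    kernel = np.ones(15, np.float32) / 15.0
    smooth = np.convolve(dist, kernel, mode="same")
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    peaks = sorted(peaks, key=lambda i: smooth[i], reverse=True)[:num_candidates]

    # Keep the two candidates lowest in the image as the feet.
    feet = sorted((body[i] for i in peaks), key=lambda p: p[1], reverse=True)[:2]
    return [tuple(map(float, p)) for p in feet]

The 3D position of each foot would then follow by intersecting the camera ray through the foot pixel with the registered floor plane obtained in Section 3.4.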

4. Experiments

In the experiment, we designed and implemented a simple AR application to evaluate the efficiency and accuracy of our interface when used with an AR application. Experiments were conducted on a dual 2.20 GHz notebook computer. We were particularly interested in determining how quickly the system could recognize the user's body pose and simple gestures.

To test the accuracy of our system, we manually annotated the positions of the modeled body parts in the test sequences (150 frames) and computed the image distance error of those parts as estimated by our system. To compare the accuracy of our method to that of other systems, we also applied the 2D whole-body part detection method of [Mun Wai and Ram 2007]. Table 1 shows the average error for each part of the body for each system. The total average error of our method is 5.59 pixels, a significant improvement over the previous method [Mun Wai and Ram 2007], which yields an overall average error of 9.5 pixels.

Table 1. Breakdown of the error in pixels after 2D tracking

                      Head    Hands   Ankles/Feet
Mun Wai's method      9.28    9.52    9.70
Our method            5.26    7.32    4.20
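The error metric could be computed along these lines (a hypothetical sketch; the annotation format is an assumption):

import numpy as np

def average_pixel_error(annotations, estimates):
    """Mean Euclidean image distance between annotated and estimated part positions.

    Both arguments map a part name ("head", "left_hand", ...) to an array of
    shape (num_frames, 2) holding pixel coordinates for each frame.
    """
    errors = {}
    for part, truth in annotations.items():
        diff = np.asarray(truth, float) - np.asarray(estimates[part], float)
        errors[part] = float(np.linalg.norm(diff, axis=1).mean())
    errors["overall"] = float(np.mean(list(errors.values())))
    return errors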

To obtain a convincing accuracy evaluation, we defined 3 cases that cause problematic situations for hand tracking in particular. First, we considered a self-occlusion situation in which the two hands are folded together and then released. The second situation is a crossing-arms situation in which the two arms return from a crossed state. The last is occlusion with the face, in which a hand touches the face, which is similar to it in color. All of these situations are instances in which the skin blob is split into two.

Figure 6: Hand tracking performance with long sleeves

In our experiment, we evaluated 2D tracking performance separately for the cases in which the user wears long sleeves and short sleeves. Figure 6 shows the results of body part tracking in each situation that we defined. It shows that the estimated areas of the body parts match the actual body accurately. The overall average error when the user wears long sleeves is 5.48 pixels (150 frames). Figure 7 shows slightly lower hand tracking performance than Figure 6. Even though the skin blobs do not provide a full solution for finding the hands, the result still illustrates the capability of the hand energy map. The average error of tracking with short sleeves is 9.16 pixels.

Figure 7: Hand tracking performance with short sleeves

Table 2 shows the computation time of each phase of the tested tracking processes. The overall computation time of our proposed method is only 23 ms (43.5 fps), a significant improvement over the previous method [Mun Wai and Ram 2007], which requires 14 seconds per frame on a 2.8 GHz computer. Our proposed method is also faster than that of [Siddiqui and Medioni 2006], which runs at 10 fps (Intel Xeon dual 3 GHz). This improvement is achieved because our method reduces the computation area in each stage: each stage receives image data covering only the minimum area needed for its computation. The speed of our method is also enabled by the use of a constant velocity model, a particle filter, and Adaboost, all of which have a low computational load.

Table 2. The time required for body-part tracking

Background Subtraction   Head Tracking   Hands Tracking   Feet Tracking   Total
14.4 ms                  3.5 ms          2.35 ms          2.82 ms         23 ms
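The constant velocity model and particle filter mentioned above are standard components; as a generic illustration only (the particle count and noise parameters are not the authors' choices), a minimal constant-velocity particle filter update could look like this:

import numpy as np

class ConstantVelocityParticleFilter:
    """Minimal particle filter with a constant-velocity motion model.

    State per particle: (x, y, vx, vy). The measurement model is supplied as a
    function that scores a 2D position, e.g. a face-color or hand-energy response.
    """

    def __init__(self, init_xy, n=200, pos_noise=5.0, vel_noise=2.0):
        self.p = np.zeros((n, 4))
        self.p[:, :2] = init_xy + np.random.randn(n, 2) * pos_noise
        self.pos_noise, self.vel_noise = pos_noise, vel_noise

    def step(self, likelihood, dt=1.0):
        # Predict: constant velocity plus Gaussian diffusion.
        self.p[:, :2] += self.p[:, 2:] * dt + np.random.randn(len(self.p), 2) * self.pos_noise
        self.p[:, 2:] += np.random.randn(len(self.p), 2) * self.vel_noise

        # Weight each predicted position by the measurement likelihood.
        w = np.array([likelihood(xy) for xy in self.p[:, :2]])
        w = np.maximum(w, 1e-12)
        w /= w.sum()

        # Systematic resampling keeps the particle count fixed.
        idx = np.searchsorted(np.cumsum(w), (np.arange(len(w)) + np.random.rand()) / len(w))
        idx = np.minimum(idx, len(w) - 1)
        self.p = self.p[idx]
        return self.p[:, :2].mean(axis=0)   # tracked position estimate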

5. Application

To test the expressiveness of an interface built with our method, we implemented the BeNature system, which recognizes simple gestures using the proposed interface and visualizes feedback on a screen. BeNature recognizes human gestures in a static location and manipulates or controls virtual natural objects on the screen according to the user's gestures. For example, to create a natural object for the user's ecosystem, the user touches a virtual point of creation that moves randomly in 3D space. The user can also cultivate the natural environment: with gestures, the user can grow grass, trees, and flowers by making rain and sunshine. If the user grows these plants well, the artificial location becomes greener autonomously; the walls become covered with ivy and the floor grows mossy.

Figure 8 shows the virtual ecosystem created by the user in the artificial space.

BeNature consists of two parts: body part tracking and augmented reality. To provide input to the system, body part tracking is employed as an interface that detects the major parts of the whole body and recognizes simple gestures of the user. As the output, augmented reality is the visualization component, in which virtual objects are placed at 3-dimensional locations in the real image using the registration information. Using the proposed method, we can implement the body part tracking module so that it runs faster than 30 fps, because the visualization component itself requires considerable time.

Figure 8: Snapshot of BeNature

BeNature also provides numerous interactions, such as moving a cloud with the user's hand, creating a grass field by touching a virtual point, or stepping on grass with the feet. In this way, the user maintains his or her own virtual ecosystem of natural objects through gestures.
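Purely as an illustration of how recognized gestures might drive the ecosystem, the toy dispatcher below maps gesture events to scene updates; the gesture names and growth rule are hypothetical, not BeNature's actual event set.

from dataclasses import dataclass, field

@dataclass
class Ecosystem:
    """Toy stand-in for the BeNature scene state."""
    plants: list = field(default_factory=list)
    raining: bool = False
    sunny: bool = False

    def handle(self, gesture, position=None):
        # Dispatch a recognized gesture to a scene update.
        if gesture == "touch_creation_point":
            self.plants.append({"kind": "grass", "pos": position, "growth": 0.0})
        elif gesture == "rain":
            self.raining = True
        elif gesture == "sunshine":
            self.sunny = True
        # Plants grow once both rain and sunshine have been provided.
        if self.raining and self.sunny:
            for p in self.plants:
                p["growth"] = min(1.0, p["growth"] + 0.1)

eco = Ecosystem()
eco.handle("touch_creation_point", position=(1.2, 0.0, 0.8))
eco.handle("rain")
eco.handle("sunshine")
print(eco.plants)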

6. Conclusion

In this paper we proposed body part tracking for an augmented reality interface that detects and tracks each body part efficiently in real time. The approximate 3D position of each human body part is obtained using only a single camera. Our approach works robustly and efficiently, and has been tested qualitatively. Owing to the use of an adaptive skin color model and a hand energy map, the system imposes only loose constraints regarding lighting, clothing, and skin color. We reduce the amount of image-to-image computation, which carries a heavy load, and all computation is performed on the smallest area possible. We also use tools that have a low computational load. By using multiple features, the system gains robustness against failure to track any one of the features.

In future work we will adopt a statistical model to calculate the 3D human body configuration. Based on the body configuration, the system will recognize the precise 3D hand position rather than simply obtaining an approximate value. With the precise position of the hands, we expect that the user will have a more immersive experience.

7. REFERENCES

[1] Feng, Z., Duh, H.B.-L., and Billinghurst, M. 2008. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR 2008), pp. 193-202.

[2] Poupyrev, I., Tan, D.S., Billinghurst, M., Kato, H., Regenbrecht, H., and Tetsutani, N. 2002. Developing a generic augmented-reality interface. Computer, vol. 35, no. 3, pp. 44-50.

[3] Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., and MacIntyre, B. 2001. Recent advances in augmented reality. IEEE Computer Graphics and Applications, vol. 21, no. 6, pp. 34-47.

[4] Buchmann, V., Violich, S., Billinghurst, M., and Cockburn, A. 2004. FingARtips: gesture based direct manipulation in Augmented Reality. In Proceedings of GRAPHITE '04, ACM, New York, NY, pp. 212-221.

[5] Perttu H., Tommi I., Johanna H., Mikko L., and Ari Nykaenen. 2005. Martial arts in artificial reality. In CHI '05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 781-790.

[6] Sminchisescu, C., and Triggs, B. 2001. Covariance scaled sampling for monocular 3D body tracking. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. I-447-I-454.

[7] Tsai, R.Y. 1987. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-344.

[8] Siddiqui, M., and Medioni, G. 2006. Robust real-time upper body limb detection and tracking. In VSSN '06: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, pp. 53-60.

[9] Mun Wai, L., and Ram, N. 2007. Body part detection for human pose estimation and tracking. In Proceedings of the IEEE Workshop on Motion and Video Computing (WMVC '07).

[10] Microsoft Project Natal, http://www.xbox.com/en-US/live/projectnatal/, accessed September 2009.

[11] Nintendo Wii, http://wii.nintendo.com, accessed September 2009.

