Step5. Using the contour of the lower body, the system detects and tracks the feet. The 3D position of the feet is then used to estimate the 3D body pose.

Step6. With the position of the right hand, the system extracts meaningful gestures.

Step7. The system visualizes all augmented objects and the user.

Figure 1: The block diagram of the body part tracking

3. Real-time body part tracking for Augmented Reality interface

The goals of the proposed interface are to obtain precise body parts from the foreground image and to achieve real-time performance. To accomplish these two goals, we detect the parts of the human body using the most efficient features: the skin blob, edge information, and the contour (see Figure 2). Because of the heavy processing time required to obtain 3D body parts with a single camera, it is difficult to track the 3D human body parts in real time [Sminchisescu et al. 2001]. By approximating the 3D body parts, the proposed interface can process the entire interaction in real time.

The reasons for selecting these features are as follows. First, face texture yields fewer false detections than other features for head detection. Because of their large motion scale, diversity of shape, and agile movement, the hands are the most difficult part to detect, and thus they are detected using skin color. Because we use skin color, the system can track both hands even when they lie inside the body contour. The feet are always close to the floor and have a smaller range of movement than the hands, so we use the lower-body contour to detect them. With these body parts, the system recognizes simple human gestures and visualizes interactions with augmented objects.

Figure 2: System features: the skin blob for head and hands, the contour for feet, edge information for hands, and foreground

3.2 Enhanced hand detection

The hand is essential to many human gestures and is used when we create or control objects. Our body part tracking module shows its most robust performance for the hand, compared to other parts of the body. We developed a scheme that detects and tracks the hands at the same time. In addition, the length of the sleeve worn by the user does not affect the detection process, thanks to the hand energy map that is applied.

The hand energy map uses two types of features, edges and skin color blobs, because the hand area contains many edges and much skin color compared with the rest of the arm. The likelihood function of the hand energy map combines these two feature responses.
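The exact likelihood function is not reproduced here, but the idea of fusing the two cues can be sketched as follows. This is an illustration only: the window size, the exponent `alpha`, and the geometric-mean combination are assumptions, not the paper's formula.

```python
import numpy as np

def edge_density(gray, win=5):
    """Normalized local edge density from simple finite-difference gradients."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    # box-filter the gradient magnitude to get a local density
    pad = win // 2
    padded = np.pad(mag, pad, mode="edge")
    out = np.zeros_like(mag)
    for dy in range(win):
        for dx in range(win):
            out += padded[dy:dy + mag.shape[0], dx:dx + mag.shape[1]]
    return out / (out.max() + 1e-9)

def hand_energy_map(gray, skin_prob, alpha=0.5):
    """Hypothetical per-pixel energy: weighted geometric mean of edge and skin cues.

    skin_prob is a per-pixel skin-color probability in [0, 1]; the combination
    rule is an assumption, standing in for the paper's likelihood function.
    """
    e = edge_density(gray)
    return (e ** alpha) * (skin_prob ** (1.0 - alpha))
```

A peak of this map inside the arm region would then be taken as the hand position; pixels that are skin-colored but textureless (forearm) or textured but not skin-colored (clothing) both score low.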
Figure 3: The hand energy map for detecting a hand in the arm area.

3.3 Generality

We have attempted to give our interface generality. To this end, we use an adaptive skin color model to increase robustness to lighting changes. As an initial value, we select the color information of the face found by the face detector. The color information in the detected face region is then used to initialize a hue-saturation histogram. The mean values of hue and saturation are selected and used to train a Gaussian model as the skin color model. The adaptive skin color model is robust to changes in lighting and to different skin colors. To give the system generality, we also enable it to recognize body parts whether the user wears long or short sleeves, as stated above. Figure 4 shows that the interface generalizes robustly across changes in lighting and clothing.

3.4 Registration

To facilitate appropriate augmentation between a user and virtual objects in a virtual-object-filled augmented reality, it is necessary to know the person's 3D position. For this, the real environment must be identified through the camera in advance; this information is used to define the floor and to adjust the size and position of the feet. Converting virtual 3D position information into real-world information is called registration, and it additionally requires information about the actual background. Before tracking starts, registration information about the floor is obtained using our own registration tool.

The system needs at least 3 points to define the floor as a plane. Therefore, we provide 3 point positions in screen coordinates and real-world coordinates to define a virtual plane associated with the real world. However, if we provide only the minimum number of points the system needs, the error may be considerably large; we therefore provide 9 points, which is deemed sufficient. With these 9 points, we acquire the registration data, which includes the intrinsic and extrinsic parameters, using Tsai's method [Tsai 1987]. The intrinsic parameters include the skew, focal length, and location of the principal point, while the extrinsic parameters include the translation and rotation of the camera.

3.5 Approximation of 3D pose

To obtain the 3D pose of the user, we use the 3D position of the user's feet in real coordinates. We calculate the distances between all the contour points and the center point of the contour, and then obtain 3 local maximum points to compute the cross product. We then select two convex points as the position of the feet. With the registration data mentioned above, the 3D position of the feet can be established.

Furthermore, our method can also be applied to obtain the vector of the user's body. First, we select two points located at a particular distance from the vertices of the contour of one of the two feet (both points maintain the same distance from the contour). We then calculate the middle angle by bisecting the internal angle between the selected points, and use this angle as the angle of the foot. We can then obtain the vector of the user's body by interpolating the two foot angles.

In order to derive a planar human model, we build a plane that is perpendicular to the floor and meets the two endpoints of the feet. We then select the points in that plane that meet the line projected from the camera center C (see Figure 5) as the approximate values of the human pose. Because the user is far from the camera in the proposed system, the error between the real and approximate values is deemed negligible.

Figure 5: The planar human model for approximation of the 3D pose
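As an illustration of the adaptive skin color model described in Section 3.3, a minimal Gaussian hue-saturation model seeded from the face region might look like the following. The diagonal covariance, the exponential-moving-average update, and all parameter values are assumptions; the paper does not specify them.

```python
import numpy as np

class SkinColorModel:
    """Gaussian skin-color model in hue-saturation space, seeded from a face region.

    Sketch only: a diagonal covariance and a fixed adaptation rate are assumed.
    """

    def __init__(self, face_hs):
        # face_hs: (N, 2) array of [hue, saturation] samples from the detected face
        self.mean = face_hs.mean(axis=0)
        self.var = face_hs.var(axis=0) + 1e-6  # avoid zero variance

    def likelihood(self, hs):
        # unnormalized per-sample Gaussian likelihood under the diagonal model
        d = (hs - self.mean) ** 2 / self.var
        return np.exp(-0.5 * d.sum(axis=-1))

    def adapt(self, hs, rate=0.05):
        # slowly track lighting changes with an exponential moving average
        self.mean = (1 - rate) * self.mean + rate * hs.mean(axis=0)
        self.var = (1 - rate) * self.var + rate * (hs.var(axis=0) + 1e-6)
```

Seeding the model from the detected face each session is what makes it adapt to the individual user and the current lighting, rather than relying on a fixed skin-color threshold.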
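Given the intrinsic and extrinsic parameters from the registration step, the 3D feet position can be recovered by intersecting the camera ray through the image point with the floor plane. The sketch below assumes the floor is the world plane Z = 0 and uses a generic ray-plane intersection; it is not the paper's exact registration code.

```python
import numpy as np

def foot_position_on_floor(u, v, K, R, t):
    """Back-project an image point (u, v) onto the floor plane Z = 0.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation, as a
    Tsai-style calibration would produce. Assumes the floor is Z = 0 in world
    coordinates (an assumption for this sketch).
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    ray_world = R.T @ ray_cam                           # rotate ray into world frame
    cam_center = -R.T @ t                               # camera center in world frame
    # intersect with Z = 0: cam_center_z + s * ray_z = 0
    s = -cam_center[2] / ray_world[2]
    return cam_center + s * ray_world
```

With the two feet located this way, their midpoint and the foot angles give the base of the planar human model of Section 3.5.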
4. Experiments

In the experiment, we designed and implemented a simple AR application to evaluate the efficiency and accuracy of our interface. Experiments were conducted on a dual-core 2.20GHz notebook computer. We were particularly interested in how quickly the system could recognize the user's body pose and simple gestures.

To test the accuracy of our system, we manually annotated the position of the body parts to be modeled in the test sequences (150 frames) and computed the image distance error of those parts as estimated by our system. To compare the accuracy of our method with that of other systems, we also applied the 2D whole-body part detection method of [Mun Wai and Ram 2007]. Table 1 shows the average error for each part of the body for each system. Our total average error is 5.59 pixels, a significant improvement over the previous method [Mun Wai and Ram 2007], which yields an overall average error of 9.5 pixels.

To make the accuracy evaluation convincing, we defined 3 problematic cases for hand tracking. First, we considered a self-occlusion situation in which the two hands are folded and then released. The second situation is a crossing-arms situation in which the two arms return from a crossed state. The last is occlusion with the face, in which a hand touches a face of similar color. All of these situations are instances in which the skin blob is split in two.

Figure 6: Hands tracking performance with long sleeves

In our experiment, we evaluated 2D tracking performance separately for long sleeves and short sleeves. Figure 6 shows the results of body part tracking in each situation that we defined. It shows that the estimated areas of the body parts match the actual body accurately. The overall average error when the user wears long sleeves is 5.48 pixels (150 frames). Figure 7 shows slightly lower hand-tracking performance than Figure 6. Even though the skin blobs alone do not provide a full solution for finding the hands, the result illustrates the capability of the hand energy map. The average tracking error with short sleeves is 9.16 pixels.

Table 2 shows the computation time of each phase of the tested tracking processes. The overall computation time of our proposed method was only 23ms (43.5 fps), a significant improvement over the previous method [Mun Wai and Ram 2007], which requires 14 seconds per frame on a 2.8GHz computer. Our method is also faster than that of [Siddiqui and Medioni 2006], which runs at 10 fps (Intel Xeon dual 3GHz). This improvement is achieved because our method reduces the computation area in each stage: each stage receives image data covering only the minimum area needed for its computation. The speed of our method is also enabled by the use of a constant velocity model, a particle filter, and Adaboost, all of which have low computational load.

5. Application

To test the expressiveness of an interface built with our method, we implemented the BeNature system, which recognizes simple gestures using the proposed interface and visualizes feedback on a screen. BeNature recognizes human gestures in a static location and manipulates or controls virtual natural objects on screen according to the user's gestures. For example, to create a natural object for the user's ecosystem, the user touches virtual points of creation that move randomly in 3D space. The user can also cultivate the natural environment: with gestures, the user can grow grass, trees, and flowers by making rain and sunshine. If the user grows these plants well, the artificial location becomes greener autonomously; ivy creeps up the wall, and the floor
is mossy. Figure 8 shows the virtual ecosystem created in the artificial space by the user.

BeNature consists of two parts: body part tracking and augmented reality. To provide input to the system, body part tracking is employed as an interface that detects the major parts of the whole body and recognizes simple gestures of the user. As the output, augmented reality is the visualization component, wherein virtual objects are placed at 3-dimensional locations in the real scene using the registration information. Using the proposed method, we implement the body part tracking module to run faster than 30 fps, because the visualization component requires considerable time.

Figure 8: Snapshot of BeNature

BeNature also provides numerous interactions, such as moving a cloud with the user's hand, creating a grass field by touching a virtual point, and stepping on grass with the feet. In this way, the user maintains his or her own virtual ecosystem of natural objects through gesture.

6. Conclusion

In this paper we proposed body part tracking for an augmented reality interface, which detects and tracks each body part efficiently in real time. The approximate 3D position of each body part is obtained using only a single camera. Our approach works robustly and efficiently, and has been tested qualitatively. Due to the adaptive skin color model and the hand energy map, the system places loose constraints on lighting, clothing, and skin color. We reduce the amount of heavy image-to-image calculation, and all calculation is performed in the smallest area possible. We also use tools with low computational load. By using multiple features, the system gains robustness against failure to track any one feature.

In future work we will adopt a statistical model to calculate the 3D human body configuration. Based on the body configuration, the system will recognize the precise 3D hand position rather than simply obtaining an approximate value. With the precise position of the hands, we expect the user to have a more immersive experience.

7. References

[1] Feng Z., Duh, H.B.-L., and Billinghurst, M. 2008. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR 2008), pp. 193-202.

[2] Poupyrev, I., Tan, D.S., Billinghurst, M., Kato, H., Regenbrecht, H., and Tetsutani, N. 2002. Developing a generic augmented-reality interface. Computer, vol. 35, no. 3, pp. 44-50.

[3] Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., and MacIntyre, B. 2001. Recent advances in augmented reality. IEEE Computer Graphics and Applications, vol. 21, no. 6, pp. 34-47.

[4] Buchmann, V., Violich, S., Billinghurst, M., and Cockburn, A. 2004. FingARtips: gesture based direct manipulation in Augmented Reality. In GRAPHITE '04, ACM, New York, NY, pp. 212-221.

[5] Hämäläinen, P., Ilmonen, T., Höysniemi, J., Lindholm, M., and Nykänen, A. 2005. Martial arts in artificial reality. In CHI '05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, NY, USA, pp. 781-790.

[6] Sminchisescu, C., and Triggs, B. 2001. Covariance scaled sampling for monocular 3D body tracking. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. 447-454.

[7] Tsai, R.Y. 1987. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-344.

[8] Siddiqui, M., and Medioni, G. 2006. Robust real-time upper body limb detection and tracking. In VSSN '06: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, pp. 53-60.

[9] Mun Wai, L., and Ram, N. 2007. Body part detection for human pose estimation and tracking. In Proceedings of the IEEE Workshop on Motion and Video Computing (WMVC '07).

[10] Microsoft Project Natal, http://www.xbox.com/en-US/live/projectnatal/, accessed September 2009.

[11] Nintendo Wii, http://wii.nintendo.com, accessed September 2009.