facebook.com/IntelRealSense
twitter.com/IntelRealSense
intel.com/RealSense
intel.com/RealSense/SDK
Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or
death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY
AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF
EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF,
DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH
MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE,
OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or
characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change
without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling
1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
*Other names and brands may be claimed as the property of others.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or
other countries.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Intel and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries.
Java is a registered trademark of Oracle and/or its affiliates.
Copyright © 2014, Intel Corporation. All rights reserved.
Contents

Introduction
  Welcome
  Intel RealSense SDK Architecture
  Camera Specs
  Capture Volumes

Overview
  Input Modalities
  High-Level Design Principles

Hands
  Contour Mode
  Skeleton Tracking
  Gesture Recognition
  Gestures
  Common Actions
  Supported Hand Positions
  Best Practices
  Designing Gesture Interactions
  How to Minimize Fatigue
  Visual Feedback
  General Principles
  User/World Interactions
  Action/Object Interactions
  Cursor Interactions

Face
  Face Detection
  Head Orientation
  Landmark Tracking
  Facial Expressions
  Emotion Recognition
  Avatar Control
  Use Cases
  Best Practices

Speech
  Speech Recognition
  Command Mode
  Dictation Mode
  Speech Synthesis
  Best Practices

Background Removal
  Background Removal
  Use Cases
  Best Practices

Object Tracking
  Object Tracking

Samples

Intel RealSense technology will change how you interact, not simply with your devices, but with the world around you. You'll work and play like never before, because your devices can see, hear, and feel you.
Introduction
Welcome!
Imagine new ways of navigating the world with more senses and sensors integrated into the computing platforms of the future. Give your users a new, natural, engaging way to experience your applications. At Intel, we are excited to provide the tools as the foundation for this journey with the Intel RealSense SDK, and we look forward to seeing what you come up with.

Over the next few months, you will be able to incorporate new capabilities into your applications, including close-range hand tracking, speech recognition, face tracking, background segmentation, and object tracking, to fundamentally change how people interact with their devices and the world around them.
Intel RealSense SDK Architecture

[Architecture diagram: an SDK application, built with the SDK samples, demos, and tools, talks to the SDK through language interfaces (C#, Java*, and others). The SDK core provides module management, pipeline execution, interoperability, and I/O. Beneath it, capability modules cover multiple modalities, each with multiple possible implementations, all exposed through the SDK interfaces.]
Features and Requirements

[Table: Required OS, Supported IDE, Multi-mode Support, Extensible Framework]
Camera Specs

[Camera callouts: alignment hole, IR laser projector, SoC, color camera, IR depth camera, green LED]

Resolution: up to 1080p @ 30FPS (FHD)
Active Pixels: 1920x1080 (2M)
Aspect Ratio: 16:9 and 4:3
Frame Rate: 30/60/120FPS*
FOV: 90 x 59 x 73 (cone); IR projector FOV: N/A x 56 x 72 (pyramid)
Color Formats: YUV4:2:2 (Skype/Lync modes**)
Effective Range: 0.2m-1.2m
Environment: indoor/outdoor (depending on conditions)
Capture Volumes
The capture volume of a depth-sensing camera is visualized as a frustum defined by near and far planes and a field of
view (FOV). Capture volume constraints limit the practical range of motion of the user and the physical space within
which users can interact successfully. You must make sure to be aware of, and make your users aware of, the available
interaction zone. Enthusiastic users can inadvertently move outside of the capture volume, so the feedback you
provide must take these situations into account.
The user is performing a hand gesture inside the camera's capture volume.
The user is performing a hand gesture outside of the capture volume. The camera will not see this gesture.
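In code, the capture volume can be modeled as a simple frustum test. Below is a minimal, self-contained C++ sketch (not SDK API; the struct name and the near/far/FOV values are illustrative assumptions taken from the ranges quoted in this guide) for checking whether a tracked point is inside the interaction zone:

```cpp
#include <cmath>

// Illustrative capture-volume test; values are assumptions based on the
// specs quoted earlier in this guide, not calibrated camera parameters.
struct CaptureVolume {
    float nearM   = 0.2f;   // near plane, meters
    float farM    = 1.2f;   // far plane, meters
    float hFovDeg = 73.0f;  // horizontal field of view
    float vFovDeg = 59.0f;  // vertical field of view

    // x, y, z in meters, camera-centered, z pointing away from the camera.
    bool contains(float x, float y, float z) const {
        if (z < nearM || z > farM) return false;
        const float kDegToRad = 3.14159265f / 180.0f;
        float halfW = std::tan(hFovDeg * 0.5f * kDegToRad) * z;
        float halfH = std::tan(vFovDeg * 0.5f * kDegToRad) * z;
        return std::fabs(x) <= halfW && std::fabs(y) <= halfH;
    }
};
```

An app can run this test on each tracked hand position and trigger a gentle prompt (see the Visual Feedback section) when the user drifts toward the edge of the volume.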
[Figure: interaction ranges for a single hand and for 2 hands]

Contour mode blob tracking works up to 1m/s for up to 2 blobs in VGA mode from 20-85cm, and up to 2m/s in HVGA mode from 20-75cm. 3D face tracking works best in the 35-70cm range, while 2D face tracking works from 35-120cm.

[Range diagram: MIN 20cm; VGA mode: 20-60cm; HVGA mode: 20-55cm; MAX 120cm]
Overview
Input Modalities
The new Intel RealSense technology offers amazing opportunities to completely redefine how we interact with our computing devices. To design a successful app for this platform, you must understand its strengths. Make sure to take advantage of combining different input modalities. This will make the experience more exciting and natural for the user, and can minimize fatigue of the hands, fingers, or voice.

Design in such a way that extending to different modalities and combinations of modalities is easy. Also keep in mind that some of your users may prefer certain modalities over others, or have differing abilities. Regardless of the input method used, it is always critical to study and evaluate how users actually engage with their devices and then to build the interface in support of those natural movements.
High-Level Design Principles
Extensible. Keep future SDK enhancements in mind. Unlike mouse interfaces, the power, robustness, and flexibility of
platforms with Intel RealSense will improve over time. How will your app function in the future when sensing of hand
poses improves dramatically? How about when understanding natural language improves? Design your app such that
it can be improved as technology improves and new senses are integrated together.
Reliable and Recoverable. It only takes a small number of false positives to discourage a user from your application.
Focus on simplicity where possible to minimize errors. Forgive your users when they make errors, and find ways to help
users recover gracefully. For example, if a user's hand goes out of the field of view of the camera, make sure that your application doesn't crash or do something completely unexpected. Intelligently handle such types of situations and
provide feedback.
Contextually appropriate. Are you designing a game? A medical application? A corporate content-sharing application?
Make sure that the interactions you provide match the context. For example, users expect to have more fun
interactions in a game, but may want more straightforward interactions in a more serious context. Pay attention to
modalities (e.g., don't rely on voice in a noisy environment).
Comfortable. Make sure using the application does not require hand positions or other physical activity that is
obviously awkward, painful, or extremely tiring.
Designed with the user in mind. Take user-centered design seriously. Make sure you know who your audience is before
choosing the users you work with. Not all users will perform the actions the same ways, or want the same experience.
Your intended users need to test even the best-designed applications. Don't do this right before you plan to launch
your application or product. Unexpected issues will come up and require you to redesign your application. Iterate!
Test, evaluate, tune, and retest!
When using a laptop, the user's hands tend to be very close to the screen. The screen is usually lower, relative to the user's head.
When using an AIO, the user's hands are farther away from the screen. The screen is also higher, relative to the user's head.
Hands
Contour Mode
Skeleton Tracking
Gesture Recognition
Best Practices
Visual Feedback
Contour Mode
The Intel RealSense SDK provides articulated hand and finger skeleton tracking, as well as object tracking. For simpler,
quicker blob tracking, the SDK also has a mode called contour mode. Contour mode can track up to 2 blobs in the
scene. These blobs can be anything (e.g., an open hand, a fist, a hand holding a phone, 2 hands touching) that is a
connected component.
A blob is formed and selected by choosing the biggest blob that passes a virtual wall. You can adjust the distance of the virtual wall (the default is 55cm). You can also decide the maximum depth of the object to be tracked (the default is 30cm); this is useful, for example, for segmenting the hand without the arm. You can also set the minimum blob size.
[Diagram: the virtual wall distance and the maximum depth of the tracked object]
Each blob has a mask, a contour line, and a pixel count associated with it. The blob also has information about its location in 3D space and its extremity points. The blobs and contours are smoothed; this can also be controlled through the SDK.
[Diagram: a blob with its mask, contour, and pixel count]
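To make the virtual-wall and max-depth parameters concrete, here is a self-contained C++ sketch that re-creates the selection logic described above on a raw depth map. It is illustrative only; the real SDK exposes this through its contour-mode configuration, and the function and parameter names here are assumptions (the defaults follow the text: 55cm wall, 30cm depth):

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Illustrative parameters mirroring the defaults described in the text.
struct BlobParams {
    uint16_t virtualWallMm = 550;  // only blobs nearer than this are candidates
    uint16_t maxDepthMm    = 300;  // depth extent kept behind the nearest point
    int      minPixels     = 500;  // minimal blob size
};

// Returns a mask (255 = blob) of the largest connected component that
// passes the virtual wall. depth: row-major depth map in millimeters, 0 = no data.
std::vector<uint8_t> selectBlob(const std::vector<uint16_t>& depth,
                                int w, int h, const BlobParams& p) {
    std::vector<uint8_t> best(depth.size(), 0);
    std::vector<int> visited(depth.size(), 0);
    int bestCount = 0;
    for (int start = 0; start < (int)depth.size(); ++start) {
        if (visited[start] || depth[start] == 0 || depth[start] >= p.virtualWallMm)
            continue;
        uint16_t nearest = depth[start];
        std::vector<int> blob;
        std::queue<int> q;
        q.push(start); visited[start] = 1;
        while (!q.empty()) {
            int i = q.front(); q.pop();
            // Clip anything far behind the blob's nearest point, e.g. the arm.
            if (depth[i] > nearest + p.maxDepthMm) continue;
            blob.push_back(i);
            if (depth[i] < nearest) nearest = depth[i];
            int x = i % w, y = i / w;
            const int nx[4] = {x - 1, x + 1, x, x};
            const int ny[4] = {y, y, y - 1, y + 1};
            for (int k = 0; k < 4; ++k) {
                if (nx[k] < 0 || nx[k] >= w || ny[k] < 0 || ny[k] >= h) continue;
                int j = ny[k] * w + nx[k];
                if (!visited[j] && depth[j] != 0) { visited[j] = 1; q.push(j); }
            }
        }
        if ((int)blob.size() >= p.minPixels && (int)blob.size() > bestCount) {
            bestCount = (int)blob.size();
            std::fill(best.begin(), best.end(), 0);
            for (int i : blob) best[i] = 255;  // this blob's mask
        }
    }
    return best;  // the pixel count is bestCount; a contour can be traced from the mask
}
```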
Skeleton Tracking
[Hand skeleton diagram, shown for both the left and right hands: thumb, index, middle, ring, and pinky fingertips; joints A, B, and C along each finger (labeled in the figure for the thumb and pinky); palm; wrist]
Gesture Recognition
Gestures can refer to static poses (e.g., open hand), or to the movement within and between poses, commonly called
gestures (e.g., wave).
The Intel RealSense SDK has a set of built-in gestures to start from. There will be a larger set of gestures as the SDK
matures. All single-handed gestures can be performed with either the right or left hand. You can use these gestures as is,
combine them into composite gestures, or create custom gestures that aren't included in the SDK using the hand skeleton points. If you choose to create a custom gesture, ensure that it is important to your application design. Make sure there are no conflicts in the gesture definitions used in the same context (e.g., if you enable Big 5 and Wave in the same level of a game, when the user waves, the spread-hand gesture will be triggered as well).
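As an example of building a custom gesture from the skeleton points, here is a minimal C++ sketch of a pinch test based on fingertip distances. The struct, thresholds, and joint names are illustrative assumptions, not the SDK's types; the SDK's hand module supplies equivalent 3D joint positions under its own names:

```cpp
#include <cmath>

// Hypothetical joint data for a custom-gesture sketch.
struct Point3 { float x, y, z; };

float dist(const Point3& a, const Point3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// A custom "pinch" pose: thumb tip and index tip nearly touching while the
// middle fingertip stays away from the thumb. The 2.5cm/6cm thresholds are
// illustrative and should be tuned with real users.
bool isPinch(const Point3& thumbTip, const Point3& indexTip,
             const Point3& middleTip) {
    return dist(thumbTip, indexTip)  < 0.025f &&
           dist(thumbTip, middleTip) > 0.060f;
}
```

As with built-in gestures, check that a custom pose like this does not conflict with any SDK gesture enabled in the same application state.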
Gestures
Here are the gestures currently recognized as part of the Intel RealSense SDK:
Big 5
Pose

V-Sign
Pose
Extend the index and middle fingers and curl the other fingers. There is some tolerance in the hand orientation, as long as the fingers are stretched up.

Tap
Gesture
Keep your hand in a natural relaxed pose and move it in Z as if you are pressing a button.

Wave
Gesture
Wave an open hand facing the screen. The wave can be any number of repetitions.
Fist
Pose

Side-Swipe
Gesture

Thumbs Down
Pose

Thumbs Up
Pose
Common Actions
Think of the poses and gestures in the Intel RealSense SDK as primitives that can be used alone or in combinations to
perform certain actions. Below are some common actions that may come up in many applications. When these actions
exist in your application, they should generally be performed using the given gestures. Providing feedback for these
gestures is critical, and is discussed in the Visual Feedback section. We don't require that you conform to these guidelines, but if you depart from them you should have a compelling user experience reason to do so. This set of universal gestures will be learned by users as a standard and will become more expansive over time.
All dynamic gestures require a specific pose to register with the gesture recognition system, as well as a pose to terminate the action. For actions that use only a single SDK primitive (like Big 5 or Wave), no explicit activation or closure pose is needed. There are a few ways of activating a dynamic gesture, including using a time variable, a pinch pose, an open-hand pose, or having the user enter a virtual plane.
Types of actions

[Figure: two example action flows. In the first, a pinch is detected (REGISTRATION) and movement is then tracked (CONTINUATION). In the second, a Big 5 is detected (REGISTRATION) and the action is completed after time has passed (TERMINATION).]
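The registration/continuation/termination flow maps naturally onto a small state machine. Here is a minimal C++ sketch using a pinch as the activation pose; the class and method names are illustrative, not SDK API:

```cpp
// Minimal sketch of the registration/continuation/termination flow.
enum class ActionState { Idle, Registered, Continuing };

class DragAction {
public:
    // Call once per frame with the current pose flag and hand position.
    void update(bool pinchDetected, float x, float y) {
        switch (state_) {
        case ActionState::Idle:
            if (pinchDetected) {            // REGISTRATION: pinch detected
                state_ = ActionState::Registered;
                startX_ = x; startY_ = y;
            }
            break;
        case ActionState::Registered:
        case ActionState::Continuing:
            if (!pinchDetected) {           // TERMINATION: pinch released
                state_ = ActionState::Idle;
            } else {                        // CONTINUATION: movement is tracked
                state_ = ActionState::Continuing;
                onDrag(x - startX_, y - startY_);
            }
            break;
        }
    }

private:
    void onDrag(float dx, float dy) { /* move the grabbed object */ }
    ActionState state_ = ActionState::Idle;
    float startX_ = 0, startY_ = 0;
};
```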
Here is a list of common actions, along with examples of how to implement them using the gestures in the SDK* or the
touchless controller interface**:
Engage**

Scroll**
1) engage the system, 2) move the cursor with your palm to one of the screen edges, and 3) hold the cursor at the edge to scroll in that direction.

Swipe

Escape/Reset**
1) engage the system, and 2) wave with an open palm from side to side naturally to reset or escape from an application mode.

Go Back**

Push to Select**

Hover Select*
Supported Hand Positions
Single-Handed:

[Figure: single-handed positions, showing which strong and weak self-occlusions are supported]
2-Handed:

[Figure: two-handed positions; hands touching with strong occlusions are unsupported, separated hands are supported]
Other Considerations
Rate- vs. Absolute-Controlled Gesture Models
You can use an absolute-controlled model or a rate-controlled model to
control gesture-adjusted parameters such as rotation, translation (of
object or view), and zoom level. In an absolute model, the magnitude to
which the hand is rotated or translated in the gesture is translated directly
into the parameter being adjusted, e.g., rotation or translation. For
example, a 90-degree rotation by the input hand results in a 90-degree
rotation of the virtual object.
In a rate-controlled model, the magnitude of rotation/translation is
translated into the rate of change of the parameter, e.g. rotational velocity
or linear velocity. For example, a 90-degree rotation could be translated
into a rate of change of 10 degrees/second (or some other constant rate).
With a rate-controlled model, users release the object or return their
hands to the starting state to stop the change.
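The contrast between the two models is easy to see in code. A minimal C++ sketch follows; the tuning constant is an illustrative assumption, chosen to match the 90-degree-to-10-degrees-per-second example above:

```cpp
// Absolute model: the object's rotation mirrors the hand's rotation.
float absoluteRotation(float handAngleDeg) {
    return handAngleDeg;                     // 90 deg of hand -> 90 deg of object
}

// Rate model: the hand's rotation sets a rotational velocity, integrated
// per frame; the user returns the hand to neutral to stop the change.
float rateRotation(float currentObjectDeg, float handAngleDeg, float dtSec) {
    const float kRate = 0.11f;               // e.g. 90 deg of hand -> ~10 deg/s
    return currentObjectDeg + handAngleDeg * kRate * dtSec;
}
```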
Occlusion

For applications involving mid-air gestures, keep in mind the problem of a user's hands occluding the user's view of the screen. It is awkward if users raise their hand to grab an object on the screen but can't see the object because their hand caused it to be hidden. When mapping the hand to screen coordinates, map it in such a way that the hand is not in the line of sight of the screen object to be manipulated.

The user's hand is mapped onto the screen on the line of sight, which leads to occlusion and prevents the user from seeing the action on the screen.
The user's hand is mapped onto the screen away from the line of sight, and the user can naturally navigate without occluding the screen.
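One simple way to realize this is to offset the cursor from the physical hand position. The C++ sketch below is illustrative; the offset values are assumptions to be tuned per form factor:

```cpp
// Map a tracked hand (normalized 0..1 coordinates) to a cursor that sits
// off the line of sight, so the hand does not hide the on-screen target.
struct Cursor { float x, y; };

Cursor mapHandToScreen(float handNormX, float handNormY,
                       float screenW, float screenH) {
    const float offsetX = -0.10f * screenW;   // shift the cursor left of the hand
    const float offsetY = -0.15f * screenH;   // and above it
    Cursor c;
    c.x = handNormX * screenW + offsetX;
    c.y = handNormY * screenH + offsetY;
    // Clamp so the cursor stays on screen.
    if (c.x < 0) c.x = 0;
    if (c.x > screenW) c.x = screenW;
    if (c.y < 0) c.y = 0;
    if (c.y > screenH) c.y = screenH;
    return c;
}
```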
Best Practices
Don't model your app on existing user interfaces; build user experiences that are tuned for gesture input. Leverage the third dimension of the hand, and don't force gestures when other methods work better.
Designing Gesture Interactions
Comfortable
Design gestures to be ergonomically comfortable. If the user gets tired or uncomfortable, they will likely stop using your application. See the next section for specific tips on how to avoid user fatigue.
Natural and Intuitive
Make gestures understandable. Use appropriate mappings of motions to conceptual actions.
As part of your app design, you need to teach the user that they are being recognized and will be interacting
in a new way with the system, especially before this modality becomes commonplace.
Stay away from abstract gestures that require users to memorize a sequence of poses. Abstract gestures are gestures that do not have a real-life equivalent and don't fit any existing mental models. An example of a confusing pose is a v-sign to delete something, as opposed to placing or throwing an item into a trash can.
Some gestures will be innate to the user (e.g., grabbing an object on the screen), while some will have to be
learned (e.g., waving to escape a mode). Make sure you keep the number of learned gestures small for a low
cognitive load on the user.
Design for the right space. Be aware of designing for a larger world space (e.g., with larger gestures, more arm
movement) versus a smaller more constrained space (e.g., manipulating a single object).
Distinguish between environmental and object interaction.
Gestures can have different meanings in different cultures so be conscious of context when designing your
app. For example, the thumbs up and peace signs both have positive connotations in North America but
quite the opposite in Greece and areas of the Middle East, respectively. It may be a good idea to do a quick
check on gestures before going live in other countries.
Be aware of which gestures should be actionable. What will you do if the user fixes her hair, drinks some
coffee, or turns to talk to a friend? The user should not have to keep their hands out of view of the camera in
order to avoid accidental gestures. Normal resting hand poses or activity should not trigger gestures.
How to Minimize Fatigue
Break up activities into small, short actions. Try to keep actions relatively brief with rest periods in between rather
than having users move all over the screen. Long-lasting gestures, especially ones where the arms must be held in a static
pose, quickly induce fatigue in the user's arm and shoulder (e.g., holding the arm up for several seconds to make a selection).
Design for breaks. Users naturally, and often subconsciously, take quick breaks. Short, frequent breaks are better than
long, infrequent ones.
Allow users to interact with elbows rested on a surface. Perhaps the best way to alleviate arm fatigue is by resting
elbows on a chair's armrest. Support this kind of input when possible. This, however, reduces the usable range of motion of
the hand to an arc in the left and right direction. Evaluate whether interaction can be designed around this type of motion.
Do not require many repeating gestures. If you require users to constantly move their hands in a certain way for a long
period of time (e.g., moving through a long list of items by panning right), they will become tired and frustrated very quickly.
Avoid actions that require your users to lift their hand above the height of their shoulder. This can get pretty
tiring pretty quickly, and it can even be challenging for some users to have to lift their arms high.
Design for left-right or arced gesture movements. Whenever presented with a choice, design for movement in the
left-right directions versus up-down for ease and ergonomic considerations.
Allow for relative motion instead of absolute motion wherever it makes sense. Relative motion allows the user
to reset her hand representation on the screen to a location more comfortable for her hand. Do not feel the need to map the
interaction zone 1:1 with the screen coordinates as this could get very tiring.
Do not require precise input. Precision is a good thing... up to a point. Imagine using your finger instead of a mouse to select a specific cell in Excel. This would be incredibly frustrating and tiring. Users naturally tense up their muscles when trying to perform very precise actions. This, in turn, accelerates fatigue. Allow for gross gestures and make your interactive objects large (see more in the Visual Feedback section).
Visual Feedback
For apps that use gesture, effective feedback is especially critical. This is a novel input modality for most people, and there are no explicit physical confirmations of interaction (as with keyboards, mice, and touchscreens). When developing your app, you need to ensure that the user understands how to control the application and feels the system is responsive, accurate, and satisfying.
Designers will have to step outside the existing WIMP paradigm, and
create interfaces that are well attuned to human motions. One
example is creating an arc-based menu that enables a user to rest
their elbow on their desk while still controlling the menu.
[Diagram: Don Norman's user action framework. Feed-forward from the system answers the user's questions "What can I do?" and "How do I do it?"; feedback answers "What happened?" and "What does it mean?"]
Be informative
Make sure that your visual designs for icons or text feedback
are legible and written to communicate effectively.
We will cover 3 general levels of interaction and feedback that you should pay attention to when building your app:
User/World, Actions/Objects, and Cursors.
User/World Interactions
When designing your app, you need to communicate information to the user about how the camera is seeing them. Are
they in the interaction zone? How should they position themselves in front of the camera?
The field of view/interaction zone of the camera is discussed in the Introduction. Here are a few recommended ways of giving the user feedback; choose the one that makes sense for your particular app, and don't combine them.
View of User
Distance Prompts
Simple prompts indicate the near and far boundaries of the interaction zone. Without prompts, users see the system become unresponsive and don't understand what to do next. Filter the distance data and show the prompt after a slight delay. Use positive instructions instead of error alerts.
[On-screen prompts: MOVE CLOSER / MOVE BACK]
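A minimal C++ sketch of this prompt logic follows. The thresholds come from the 0.2m-1.2m effective range quoted earlier; the filtering constant and delay are illustrative assumptions:

```cpp
#include <string>

// Filter the distance, wait a moment before prompting, and use positive
// instructions, as described above.
class DistancePrompter {
public:
    // Call per frame with the user's distance in meters and the frame time.
    std::string update(float rawDistanceM, float dtSec) {
        filtered_ = 0.9f * filtered_ + 0.1f * rawDistanceM;  // simple low-pass
        bool outOfZone = filtered_ < 0.2f || filtered_ > 1.2f;
        outTimer_ = outOfZone ? outTimer_ + dtSec : 0.0f;
        if (outTimer_ < 0.5f) return "";      // slight delay avoids flicker
        return filtered_ < 0.2f ? "MOVE BACK" : "MOVE CLOSER";
    }

private:
    float filtered_ = 0.7f;   // start in the middle of the zone
    float outTimer_ = 0.0f;
};
```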
World Diagrams
World diagrams orient the user and introduce them to the notion of a depth camera with an interaction zone. This is recommended for help screens, tutorials, and games for users new to the camera. Don't show this every time; show it only during a tutorial or on a help screen. Don't make it too technical, and consider the audience.
Introduce the user to the optimum interaction range, using diagrams similar to the above.
Action/Object Interactions
When designing apps for gesture, you need to consider how to visually show the user how they will be interacting with
objects on the screen.
Actionable Buttons
Show what to do and what is actionable. Use color and animation to draw objects according to interaction state (e.g., hover,
selected, not selected). Distinguish actionable 3D objects from the rest of the scene. Use standard selection gestures, and
suggest the select action.
[Button states: Hover and Press]
Grabbable Objects
2D input is most robust and will be most familiar to users. Since we are currently using 2D displays, use it for all simple
interactions, especially in productivity apps. 2.5D adds simple depth input for special actions. Full 3D-space interaction is
most challenging for users. To show that the user is operating in a 3D space, aggressively use depth cues: shadowing, shading, texture, and perspective.
[Figure: 2D, 2.5D, and 3D interaction]
Cursor Interactions
For air gestures, users prefer a cursor that mimics their hand over an abstract design. Cursors are recommended for
apps that do pointing and selecting. Do not use a cursor for apps that implement a user overlay or user viewport, or
with apps that use heavy 3D interaction. A 2D visual cursor assumes that the user is mostly operating on a plane in 3D
space that is parallel to the screen. Make sure to smooth the cursor motion to avoid visual jitter. Animate the cursor to
show the appropriate level of detail. If relevant to your interactions, you should show individual finger motions.
A cursor will focus the user's attention; don't use it if it's not necessary. Put important alerts about system state at the cursor.
Here is an example of a dynamic hand cursor showing the left or right hand according to which hand is seen.
Here we see the dynamic hand cursor giving visual feedback that the user has made a pinch gesture.
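Cursor smoothing can be as simple as an exponential moving average over the raw tracked position. A minimal C++ sketch; the smoothing factor is an illustrative value that trades responsiveness against stability:

```cpp
// Smooth the cursor motion to avoid visual jitter.
struct SmoothedCursor {
    float x = 0, y = 0;
    bool started = false;

    void update(float rawX, float rawY) {
        const float alpha = 0.3f;  // 0 = frozen, 1 = raw (jittery)
        if (!started) { x = rawX; y = rawY; started = true; return; }
        x += alpha * (rawX - x);
        y += alpha * (rawY - y);
    }
};
```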
Face
Face Detection
The Intel RealSense SDK provides accurate 3D detection and tracking of all faces in the scene at a range of up to 1.2 meters. A maximum of 4 faces can be tracked at once. You can choose which 4 to detect (e.g., the 4 closest to the screen, or the 4 farthest to the right). Each marked face will be outlined with an approximated rectangle (you can get the x, y, and z coordinates of this rectangle). Compared to 2D tracking, the 3D head tracking capability maintains tracking with much greater head movement, and it works in a wide range of lighting conditions. You can get alerts for when the face is in the FOV, partially in the FOV, or occluded.
Up to four faces can be tracked at the same time. One face has complete landmark tracking.
Face Recognition
The SDK provides the ability to recognize specific faces. Once a face is registered, an ID is assigned to it and some
information about it is stored in the memory of the Face library. If the same face is registered multiple times, it will
improve the chances of correct recognition of that face in the future. Whenever there is an unrecognized face in the
frame, the recognition module will compare it against the available data in the database, and if it finds a match it will
return the stored ID for that face.
Users do not need to worry about their images being stored. The data saved is a collection of features gathered from the image by the algorithm.
Head Orientation
[Diagram: head orientation axes: roll, pitch, and yaw]

The SDK provides you with the 3D orientation of the head. This gives you an idea of where the user's head is pointing (in some cases, you could infer they are likely looking in that direction as well). You can experiment with this as a very coarse version of eye tracking. Stay tuned for finer-tuned gaze tracking in the next release! For now, tracking head movements works best on the yaw and pitch axes.
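A coarse-gaze experiment can be as simple as bucketing the yaw angle into screen regions. This C++ sketch is illustrative; the 15-degree bands are assumptions to be tuned per setup:

```cpp
// Use yaw as a very coarse gaze estimate, e.g. to highlight the screen
// region the user is probably facing.
enum class Region { Left, Center, Right };

Region coarseGaze(float yawDeg) {
    if (yawDeg < -15.0f) return Region::Left;   // head turned left
    if (yawDeg >  15.0f) return Region::Right;  // head turned right
    return Region::Center;
}
```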
Landmark Tracking
[Landmark diagram: right eyebrow, left eyebrow, right eye, left eye, nose, mouth, jaw]
Facial Expressions
The SDK also includes higher-level facial expression recognition. This can make the creation of cartoonish avatars easier. Each of the expressions has an intensity level from 0 to 100 to allow for smoother, more natural animation.
The following expressions are in the SDK:
Smile
Mouth Opened
Tongue out
Kiss
Eye Closed
Pupil Move
Eye-brow Raised
Eye-brow Lowered
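The 0-100 intensity maps naturally onto an animation weight. Here is a minimal C++ sketch of driving an avatar blend shape from an expression intensity; the struct name and smoothing step are illustrative assumptions, not SDK API:

```cpp
// Most animation rigs expect weights in [0, 1]; smooth the SDK's 0..100
// intensity to avoid popping between frames.
struct BlendShape {
    float weight = 0.0f;

    void drive(int sdkIntensity /* 0..100 */) {
        float target = sdkIntensity / 100.0f;
        weight += 0.25f * (target - weight);  // smoothing step
    }
};
```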
Emotion Recognition
Emotient's emotion recognition algorithms in our SDK use 2D RGB data. The Emotion module is not part of the Face module. The faces in an image or video are located in real time at up to 30fps, and interpolation and smoothing of face data is applied across consecutive frames of a video. For emotion recognition to work, the faces in the image have to be at least 48x48 pixels. You can query for the largest face in the FOV. The algorithms are not limited to RGB data and can be used with greyscale data as well.
With the SDK, you can detect and estimate the intensity of the six primary expressions of emotion (anger, disgust, fear, joy,
sadness, surprise) seen below:
Anger
Disgust
Fear
Joy
Sadness
Surprise
The emotion channels operate independently of one another. There are two ways to access the channel data: by intensity (between 0 and 1) or by evidence (the odds, on a log scale, of a target expression being present). See the SDK
documentation for more details. You can also access aggregate indicators of positive and negative emotion. There is a
neutral emotion channel to help calibrate the other channels, and an experimental contempt channel. Currently, joy,
surprise, and disgust are the easiest emotions to recognize.
Facial artifacts, such as glasses or facial hair, may make it advisable to focus on changes in the emotion outputs rather than
absolute values. For example, a person may have facial hair that makes him appear less joyful than he would if he did not
have facial hair.
You can combine a few emotions together to estimate negative emotion. For example, you could combine varying intensities of disgust, fear, and anger and assume your user is likely expressing a negative emotion.
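One way to do this combination in C++ is sketched below. The weighting is an illustrative assumption, not part of the SDK, which also offers its own aggregate positive/negative indicators:

```cpp
// Combine the independent negative channels (each intensity in [0, 1])
// into one rough "negative emotion" score.
float negativeEmotionScore(float disgust, float fear, float anger,
                           float sadness) {
    // Take the strongest signal, with a small boost when several channels
    // fire at once.
    float peak = disgust;
    if (fear    > peak) peak = fear;
    if (anger   > peak) peak = anger;
    if (sadness > peak) peak = sadness;
    float avg = (disgust + fear + anger + sadness) / 4.0f;
    float score = 0.8f * peak + 0.2f * avg;
    return score > 1.0f ? 1.0f : score;
}
```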
Avatar Control
The SDK provides simple avatar control for use in your applications by combining the facial expressions and
landmarks available in the Face module. The SDK provides sample code for a Character Animation that enables your
application to use any face model and animate the user as part of your application. All the code, assets and rigging
are part of the code distribution of the SDK.
You can:
Add your own avatar models
Adjust smoothing on the landmarks and expressions for various usages and behaviors
Enable/disable specific expressions or landmark subsets (e.g., you may want to only move the eyes on your avatar)
Allow mirroring (e.g., when the user blinks their right eye, the left eye of the avatar will blink)
Tune the resolution of the animation graphics
Plug the avatar into different environments (you can provide your own background, e.g. an image, video, or camera stream)
Use Cases
There are many reasons why you might want to use head tracking, face tracking, and emotion recognition in your apps.
We outline a few examples that can work well leveraging the Intel RealSense SDK:
Gaming/App Enhancements
Face Augmentation
Avatar Creation
Affective Computing
Best Practices
Lighting and Environment
Good lighting is very important, especially for 2D RGB tracking.
For optimal tracking:
use indoor, uniform, diffused illumination
have ambient light or light facing the users face (avoid shadows)
avoid backlighting or other strong directional lighting
User Feedback
Give feedback to the user to make sure they are at a typical working distance from the computer, for optimal feature detection.
Notify the user if they are moving too fast to properly track facial features.
User Interactions
Design short interactions
Avoid user fatigue. Do not ask or expect the users to move their head/neck quickly or to tilt their head on any
axis more than 30 degrees from the screen. If you are asking the user to tilt their head, make sure they can still
see relevant cues and information on the screen.
Test emotion detection and landmark-based expression detection by testing with your audience. People
express their emotions very differently based on culture, age, and situation. Also, remember that every face
may have a different baseline for any given emotion.
User Privacy
Communicate to users about where images go and what happens to them. For most applications, you will not need to save the actual images, but you could save information about them (e.g., emotion intensity, landmark coordinates, angle, size).
Speech
Speech Recognition
Speech Synthesis
Best Practices
Speech Recognition
[Illustration: spoken phrases floating toward the device, e.g. "my documents", "play jazz", "once upon a time", "meeting notes"]
Command Mode
Command mode is for issuing commands tied to discrete actions (e.g., saying "fire" to trigger cannon fire in a game). In command mode, the SDK module recognizes phrases only from a predefined list of context phrases that you have set. The developer can use multiple command lists, which we will call grammars. Good application design creates multiple grammars and activates the one that is relevant to the current application state (this limits what the user can do at any given point in time based on the command grammar used). You can get recognition confidence scores for command and control grammars. To invoke command mode, provide a grammar.
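The per-state grammar idea can be sketched with plain data structures. The C++ below is illustrative only; the real SDK takes grammars through its speech-recognition configuration rather than this hypothetical class:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Keep one command list (grammar) per application state and activate the
// one relevant to the current state.
class CommandGrammars {
public:
    void define(const std::string& state, std::vector<std::string> phrases) {
        grammars_[state] = std::move(phrases);
    }
    void activate(const std::string& state) { active_ = state; }

    // True if the recognized phrase is valid in the current state.
    bool accepts(const std::string& phrase) const {
        auto it = grammars_.find(active_);
        if (it == grammars_.end()) return false;
        for (const auto& p : it->second)
            if (p == phrase) return true;
        return false;
    }

private:
    std::map<std::string, std::vector<std::string>> grammars_;
    std::string active_;
};

// Usage: only combat phrases are live during combat.
//   g.define("menu",   {"start", "options", "quit"});
//   g.define("combat", {"fire", "reload", "duck"});
//   g.activate("combat");
```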
Dictation Mode
Dictation mode is for open-ended language from the user (e.g., entering the text for a Facebook status update). Dictation mode has a predefined vocabulary; it is a large, generic vocabulary containing 50k+ common words. Highly domain-specific terms (e.g., medical terminology) may not be widely represented in the generic vocabulary file, but you can customize your app to a specialized domain if desired. You can add your own custom vocabulary to the dictation engine if you find that specific words you need are not in the dictionary.
Absence of a provided command grammar will invoke the SDK in dictation mode. Dictation is limited to 30 seconds.
Currently, grammar mode and dictation mode cannot be run at the same time.
Speech Synthesis
You can also generate speech using the built-in Nuance speech synthesis that comes with our SDK. Currently a female voice is used for text-to-speech (TTS). Speech can be synthesized dynamically.

Make sure to use speech synthesis where it makes sense. If there is a narrator or a character speaking throughout your application, it may make more sense to pre-record speech where dynamic speech synthesis isn't needed. Have an alternative for people who cannot hear well, or for when speakers are muted.
Best Practices
Here are some best practices for using speech in your applications:

Be aware of speech's best uses. Some great uses of speech are short dictations, triggers, or shortcuts. For example, speech could be used as a shortcut for a multi-menu action (something that requires more than a first-level menu and a single mouse click). However, to scroll down a menu, it may make more sense to use the touchscreen or a gesture rather than repeatedly have the user say "Down, Down, Down".
Environment
Be aware that speech can be socially awkward in public, and background noise can easily get in the way of
successful voice recognition.
Test your application in noisy backgrounds and different environmental spaces to ensure robustness of
sound input.
Naturalness/Comfort
People do not speak the way they write. Be aware of pauses and interjections such as "um" and "uh".
Don't design your application such that the user must speak constantly. Make verbal interactions short, and allow for breaks to alleviate fatigue.
Listening to long synthesized speech will be tiresome. Synthesize and speak only short sentences.
User Control
Watch out for false positives; some sounds could unexpectedly crop up as background noise. Do not implement voice commands for dangerous or unrecoverable actions (such as deleting a file without verification).
Give users the ability to make the system start or stop listening. You can provide a button or key press for push-to-talk control.
Give users the ability to edit/redo/change their dictation quickly. It might be easier for the user to use the mouse, keyboard, or touchscreen at some point to edit their dictations.
Feedback
Teach the user how to use your system as they use it. Give more help initially, then fade it away as the user
gets more comfortable (or have it as a customizable option).
Always show the listening status of the system. Is your application listening? Not listening? Processing
sound? Let the user know how to initiate listening mode.
Let the user know what commands are possible. It is not obvious to the user what your application's current grammar is. This information can be shown in an easily accessible location.
Let the user know that their commands have been understood. The user needs to know this to trust the system, and to know which part is broken if something doesn't go the way they planned. One easy way to do this is to relay back a command. For example, the user could say "Program start", and the system could respond by saying "Starting program, please wait."
If you give verbal feedback, make sure it is necessary, important, and concise! Don't overuse verbal feedback, as it could get annoying to the user.
If sound is muted, provide visual components and feedback, and have an alternative for any synthesized
speech.
Background
Removal
User Segmentation
Use Cases
Best Practices
Background Removal
The Intel RealSense SDK allows you to remove a user's background without the need for specialized equipment. You can mimic green-screen techniques in real time without post-processing. A segmented image can be generated per frame, which can remove or replace portions of the image behind the user's head and shoulders. If the user is holding an object, it will also be segmented.
Background removal
Background replacement
The resolution of the segmented output image matches the resolution of the input color image, and contains a copy of the input color data and an alpha channel mask. Pixels that are part of the background have an alpha value less than 128, and pixels that correspond to the user's head and torso have an alpha value greater than or equal to 128.
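Given that alpha convention, background replacement is a straightforward per-pixel composite. A minimal C++ sketch; the function name and buffer layout (interleaved 8-bit RGBA, row-major, same dimensions) are illustrative assumptions:

```cpp
#include <cstdint>
#include <vector>

// Composite the segmented image over a new backdrop: alpha >= 128 marks
// the user, alpha < 128 marks background pixels to be replaced.
void replaceBackground(const std::vector<uint8_t>& segmented,  // RGBA from the SDK
                       const std::vector<uint8_t>& newBg,      // RGBA backdrop
                       std::vector<uint8_t>& out) {
    out.resize(segmented.size());
    for (size_t i = 0; i < segmented.size(); i += 4) {
        bool isUser = segmented[i + 3] >= 128;
        const std::vector<uint8_t>& src = isUser ? segmented : newBg;
        out[i]     = src[i];
        out[i + 1] = src[i + 1];
        out[i + 2] = src[i + 2];
        out[i + 3] = 255;
    }
}
```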
Use Cases
Some common use cases for background removal include:
Sharing content
Sharing workspaces
Video Chat/Teleconferencing
Best Practices
Utilize with good lighting conditions.
Background removal is optimized for a single user to be segmented out of the scene, but will
work for multiple users if they fit into the field of view.
Don't rely on it for privacy applications, in case confidential information in the background is not completely removed.
Offer the user the ability to turn segmentation on and off.
White backgrounds work best for background removal. Slight user adjustments can affect the
quality of the background removal.
Play around with the visual blending effects of the edges between the segmented object and
the background for different applications.
A person coming in and out of the scene might result in jagged segmentation. You can either
inform the user of this or try to take care of it with visual effects.
Object Tracking
Object Tracking
The Metaio* 3D object tracking module provides optical-based tracking techniques that can detect and track known or
unknown objects in a video sequence or scene. The Metaio Toolbox is provided to train, create, and edit 3D models that
can be passed to various object detection and tracking algorithms.
The tracking techniques available are 2D object tracking for planar objects, feature-based 3D tracking, edge-based 3D
tracking from CAD models, and instant 3D tracking.
2D Object Tracking
2D object tracking is configured by providing a
reference image. The algorithm tracks the image in a
video sequence and returns the tracking parameters.
Feature-Based 3D Tracking
3D feature-based tracking can track any real-world
3D object. Tracking is based on a 3D feature map.
Instant 3D Tracking
Instant 3D tracking (also known as SLAM) allows you to create a point cloud of the scene on the fly and immediately use it as a tracking reference. With SLAM you don't need to provide any input files.
Samples

Intro
Find code samples for accessing the raw color and depth streams here: http://www.intel.com/realsense/SDK

Hands
Find code samples for hand analysis algorithms (including using the 22 hand skeleton joints and gesture detection) here: http://www.intel.com/realsense/SDK

Face
Find code samples for face analysis algorithms (including face detection, landmark detection, and pose detection) here: http://www.intel.com/realsense/SDK

Speech
Find code samples for accessing the raw audio data, and voice recognition and voice synthesis, here: http://www.intel.com/realsense/SDK