--------------------
A Thesis Proposal
Presented to the Faculty of the
Department of Electronics and Communications Engineering
College of Engineering, De La Salle University
--------------------
In Partial Fulfillment of
The Requirements for the Degree of
Bachelor of Science in Electronics and Communications Engineering
--------------------
by
Arcellana, Anthony A.
Ching, Warren S.
Guevara, Ram Christopher M.
Santos, Marvin S.
So, Jonathan N.
July 2006
1. Introduction
Human-computer interaction (HCI) is the study of the interaction between users and computers. The basic goal of HCI is to improve the interaction between users and computers by making computers more user-friendly and accessible to the user. It is a concern within several disciplines, each with different emphases: computer science, psychology, sociology and industrial design (Hewett et al., 1996). The ultimate goal of HCI is to design systems that minimize the barrier between the human's cognitive model of what they want to accomplish and the computer's understanding of the user's task.
The thesis applies a new way to interact with sources of information using an interactive projected display. For a long time the ubiquitous mouse and keyboard have been used to control a graphical display. With the advent of increased processing power and technology, there has been great interest from the academic and industrial communities in richer interaction techniques over the past decades (Myers et al., 1996). Recent advances and research in human-computer interaction (HCI) have paved the way for techniques such as vision, sound, speech recognition, and context-aware devices that allow for a much richer, multimodal interaction between man and machine (Turk, 1998; Porta, 2002). This type of recent research moves away from traditional input devices, which are essentially blind and unaware, toward the so-called Perceptual User Interfaces (PUI). PUIs are interfaces that emulate the natural capabilities of humans to sense, perceive, and reason; they model human-computer interaction after human-human interaction. Some of the advantages of PUIs are as follows: (1) they reduce the dependence on being in proximity that is required by keyboards and mouse systems, (2) they make use of communication skills that people already have and are natural to use, (3) they allow interfaces to be built for a wider range of users and tasks, (4) they create interfaces that are user-centered and not device-centered, and (5) they allow the design of intuitive interface methods that make use of body language. A subset of PUI is Vision Based Interfaces (VBI), which focuses on the visual awareness of computers to the people using them. Here computer vision algorithms are used to locate and identify individuals, track human body motions, model the head and the face, track facial features, and interpret human motion and actions (Porta, 2002). A certain class of this research falls under bare-hand human-computer interaction, which is what this study is about.
Bare-hand human interaction uses as its basis of input the actions and gestures of the user's bare hands. Conventional mouse- and keyboard-controlled computers in a public area would require significant space and have maintenance concerns on the physical hardware being used by the common public. Using a projected display and a camera-based input device eliminates the hardware problems associated with space and maintenance. It also attracts people, since projected displays are new and novel. An interactive projected display in these environments will enhance the presentation and immersion of an exhibit.
1.3.Objectives
1.3.1.General Objectives
The general objective of this study is to design and implement an interactive projected display system using a projector and a camera. The projector would display the interactive content and the user would use his hand to select objects in the projected display. Computer vision is used to detect and track the hand and to generate the corresponding response.
1.3.2.Specific Objectives
1.4.2.2. The projector and the camera set-up will be fixed in such a way that
1.4.2.5. The system will be designed to handle only a single user. In the
presence of multiple users, the system would respond to the first user
triggering an event.
The system displays information and allows the user to interact with it. A projected display conserves space as the system is ceiling-mounted and there is no hardware that directly involves the user. Using only the hands of the user as input, the system is intuitive and natural, key qualities of a perceptual user interface.

The system is comparable to the large-screen displays used in malls and similar venues. Since the system is also a novel way of presenting information, it can be used to make interactive advertisements that are very attractive to consumers. The display can transform from an inviting advertisement into detailed product information. With this said, the cost of operating the projector can possibly be justified by the revenue generated from effective advertising. Other applications of the system may be in exhibits, which may have particular requirements for uniqueness and a higher level of immersion, and which aim for a high level of impact on their viewers.
The system offers an interaction that is natural, requiring no special goggles or gloves that the user has to wear. In public spaces where information is very valuable, a system that can provide an added dimension to reality is very advantageous, and the use of nothing but the hands means the user can instantly tap the content of the projected interface. Computer vision provides the implementation of a perceptual user interface, and the projection provides a display that can be placed on everyday surfaces. This means the presence of computers can be found throughout everyday life without necessarily being attached to it. With the development of PUI there is no need for physical interface hardware, only the use of natural interaction skills present in every person.
The system is composed of three main components: (1) the PC, which houses the information and control content, (2) the projector, which displays the information, and (3) the camera, which captures the user's actions.

The system would acquire images from the camera. Preliminary image processing is done to prepare the frames of the video for hand detection and recognition. The system would then detect the position and action of the user's hands relative to the screen. It will then generate a response from a specific action.
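A minimal sketch of this processing loop is shown below, assuming OpenCV in Python; the routines detect_hand and handle_event are hypothetical placeholders for the detection and response stages, not part of the proposed system itself.

    import cv2

    def detect_hand(frame):
        # Placeholder: return the (x, y) position of the user's hand, or None.
        # A real implementation would use segmentation and fingertip finding.
        return None

    def handle_event(position):
        # Placeholder: map a detected hand position to an action on the display.
        print("hand at", position)

    cap = cv2.VideoCapture(0)                            # acquire images from the camera
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # preliminary processing
        gray = cv2.GaussianBlur(gray, (5, 5), 0)
        position = detect_hand(gray)                     # detect hand position/action
        if position is not None:
            handle_event(position)                       # generate a response
        if cv2.waitKey(1) == 27:                         # Esc to quit
            break
    cap.release()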
As a demonstration, the system will project the campus directory of De La Salle University Manila. The camera will capture the images needed and send them to the computer. The user will then pick which building he/she would like to explore using his/her hand as the pointing tool. Once the user has chosen a building, a menu will appear that gives information about the building. Information includes a brief history, floor plans, facilities, faculties, etc. Once the user is finished exploring the building, he/she can touch the back button to select another building in the campus. The cycle continues until the user is satisfied.
1.7.Methodology
The development of the system involves setting up the hardware and programming of the PC. On a hardware level, acquiring and setting up the camera and projector is needed in the initial stages. Hardware requirements will be looked up when acquiring the camera and selecting the development PC. However, cost considerations for the projector entail a base-model projector that will have a lower luminance than may be desired. This has led to a delimitation that lighting conditions should be controlled.
It is necessary for the researchers to acquire the necessary skills in programming. Quick familiarization and efficiency with the libraries and tools for computer vision are necessary for timely progress in the study. Development of the system will proceed in stages: image acquisition, preprocessing, hand detection and tracking, and then event generation. At each stage, the outputs will be tested before proceeding to the next. Computer vision tools have acquisition routines that we can use for the study, which reduces the amount of low-level programming. After a working prototype is made, the system will be evaluated through trials by other users for their inputs. Initial adjustments and fixes will be made based on the feedback gathered during the study. Advice in programming will be very helpful, since the implementation is PC-based. Advice from the panel, adviser, and other people about the interface will be helpful in removing biases the proponents may have in developing the system.
1.8.
1.9.Estimated Cost
Projector P 50,000
PC Camera P 1,500-2,000
2.1. Bare Hand Human Computer Interaction
Hardenburg and Berard in their work Bare Hand Human Computer Interaction (2001) describe techniques for bare-handed computer interaction, covering hand segmentation, finger finding, and hand posture classification. They applied their work to the control of an on-screen mouse pointer for applications such as a browser and a presentation tool. They have also developed a multi-user application intended as a brainstorming tool that allows different users to arrange text across the space of the screen.
For hand segmentation, the authors give an overview of present algorithms and point out the weaknesses of the different approaches. Image differencing tries to segment a moving foreground (i.e. the hand) from a static background; by also comparing against a reference image, the algorithm can detect resting hands. Additional modifications to image differencing are described as well.
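A rough sketch of image differencing for hand segmentation is given below, assuming OpenCV in Python; the thresholds and the use of a stored reference frame are illustrative assumptions, not the exact procedure of Hardenburg and Berard.

    import cv2

    cap = cv2.VideoCapture(0)
    ok, reference = cap.read()                  # empty scene captured at start-up
    reference = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        moving = cv2.absdiff(gray, prev)        # moving foreground (frame-to-frame)
        resting = cv2.absdiff(gray, reference)  # resting hands (frame vs. reference)
        _, moving_mask = cv2.threshold(moving, 25, 255, cv2.THRESH_BINARY)
        _, resting_mask = cv2.threshold(resting, 25, 255, cv2.THRESH_BINARY)
        hand_mask = cv2.bitwise_or(moving_mask, resting_mask)
        prev = gray
        cv2.imshow("hand mask", hand_mask)
        if cv2.waitKey(1) == 27:
            break
    cap.release()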
After segmentation, the authors discuss the techniques used for detecting the fingers and hands. They describe a simple and reliable algorithm based on finding fingertips, from which the fingers and eventually the whole hand can be identified. The algorithm is based on a simple model of a fingertip as a circle mounted on a long protrusion. After searching for the fingertips, a model of the fingers and eventually the hand can be generated and used for posture classification.
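A very simplified sketch of this circle-on-a-protrusion idea is shown below; it assumes a binary hand mask from the segmentation step, and the radii and arc fractions are illustrative values rather than the authors' parameters.

    import numpy as np

    def find_fingertips(mask, r_inner=5, r_outer=12, arc_min=0.15, arc_max=0.40):
        """Keep points whose small neighbourhood is filled (the round tip)
        while only a short arc of a larger surrounding circle is filled
        (the single protrusion of the finger)."""
        fingertips = []
        h, w = mask.shape
        ys, xs = np.nonzero(mask)
        angles = np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False)
        for y, x in zip(ys[::20], xs[::20]):      # subsample candidates for speed
            if not (r_outer <= x < w - r_outer and r_outer <= y < h - r_outer):
                continue
            window = mask[y - r_inner:y + r_inner + 1, x - r_inner:x + r_inner + 1]
            if window.mean() < 0.9 * 255:         # inner neighbourhood must be filled
                continue
            cx = (x + r_outer * np.cos(angles)).astype(int)
            cy = (y + r_outer * np.sin(angles)).astype(int)
            filled = (mask[cy, cx] > 0).mean()    # fraction of outer circle filled
            if arc_min <= filled <= arc_max:
                fingertips.append((int(x), int(y)))
        return fingertips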
The end system that was developed ran in real time at around 20-25 Hz. Data from their evaluation show that about 6 frames out of 25 are misclassified with a fast-moving foreground. Positional accuracy was off by between 0.5 and 1.9 pixels. They concluded in their paper that the system developed was simple and effective, capable of real-time bare-hand interaction.
2.2. Using Marking Menus to Develop Command Sets for Computer Vision Based Hand Gesture Interfaces
The authors Lenman, Bretzner and Thuresson focus on the use of hand gestures as a replacement for present interaction tools such as the remote control, mouse, etc. Perceptual and Multimodal User Interfaces are the two main scenarios discussed for gestural interfaces. The Perceptual User Interface scenario aims for automatic recognition of human gestures integrated with other human expressions like facial expressions or body movements, while the Multimodal User Interface scenario focuses more on hand poses and gestures as command sets. The authors consider three dimensions in designing such command sets. The first dimension is the cognitive aspect, which refers to how easy commands are to learn and remember; command sets should therefore be practical for the human user. The second dimension, the articulation aspects, tackles how easy gestures are to perform and how tiring they will be for the user. The last dimension is the technical aspects, which refers to whether the command sets can be recognized with present, state-of-the-art technology.
The authors concentrate on the cognitive side. Here they consider that having a menu structure would be of great advantage because commands can then be easily recognized. Pie and marking menus are the two types of menu structures that the authors discuss and explain. Pie menus are pop-up menus whose alternatives are laid out around the point of invocation. Marking menus are a further development of pie menus that allows more complex choices by implementing sub-menus.
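As an illustration of how a pie menu maps a pointing position to a command, the following sketch (an assumption for illustration, not the authors' implementation) selects one of N alternatives from the angle of the hand relative to the menu center.

    import math

    def pie_menu_choice(hand_x, hand_y, center_x, center_y, n_items):
        """Return the index of the pie-menu sector the hand points at."""
        angle = math.atan2(hand_y - center_y, hand_x - center_x)   # -pi..pi
        angle = (angle + 2.0 * math.pi) % (2.0 * math.pi)          # 0..2*pi
        sector = 2.0 * math.pi / n_items
        return int(angle // sector)

    # Example: four alternatives, hand above and to the right of the menu center.
    print(pie_menu_choice(320, 180, 300, 200, 4))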
Lenman, Bretzner and Thuresson have chosen a hierarchic menu system for controlling the functions of a TV, CD player and a lamp. Their chosen computer vision system is based on a hierarchical representation of the hand. The system searches for and then recognizes hand poses based on a combination of a multiscale color detector and particle filtering. Hand poses are represented by image features and their interrelations in terms of position, orientation and scale. Their menu system has three hierarchical levels and four choices. Menus are then shown on a computer screen.
As for their future work, they are attempting to increase the speed and tracking
stability of the system in order to acquire more position independence for gesture
recognition, increase the tolerance for varying light conditions and increase recognition
performance.
2.3. Computer Vision-Based Gesture Recognition for an Augmented Reality Interface
Granum et al. (2004) present a computer vision-based gesture recognition system intended for an augmented reality application. The paper covers the different areas, such as gesture recognition and segmentation, that are needed to complete the research, and the techniques used for them. There has already been a lot of research on vision-based hand gesture recognition and finger tracking applications. Because of growing technology, researchers are finding ways for computer interfaces to perform naturally, addressing limitations in sensing the environment through sight and hearing, with hand gesture recognition and finger tracking used as an interface to the PC. Their setup projects a display on a Place Holder Object (PHO), and by the use of the bare hand the system can create controls and situations for the display; movements and gestures of the hand are detected by a head-mounted camera, which serves as the input for the system.
There are two main problem areas, and the presentation of their solutions forms the main bulk of the paper. The first is segmentation; segmentation is used to detect the PHO and the hands in the 2D images captured by the camera. Problems in detecting the hands include the varying form of the hand as it moves and the variation in its size across different gestures. To solve this problem the study used a colour pixel-based segmentation, which provides an extra dimension compared to gray-tone methods. Colour values, however, vary with both intensity changes and colour changes. This problem is resolved by using normalized RGB, also called chromaticities, but implementing this method creates several issues, one of which is that cameras normally have a limited dynamic intensity range.
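A minimal sketch of normalized-RGB (chromaticity) segmentation is given below, assuming OpenCV in Python; the chromaticity bounds are illustrative assumptions for skin-like colours, not values from the paper.

    import cv2
    import numpy as np

    def chromaticity_mask(frame_bgr, r_range=(0.38, 0.55), g_range=(0.25, 0.37)):
        """Segment skin-like pixels using normalized RGB (chromaticities)."""
        img = frame_bgr.astype(np.float32)
        b, g, r = cv2.split(img)
        total = r + g + b + 1e-6                  # avoid division by zero
        r_norm = r / total                        # chromaticities are intensity-invariant
        g_norm = g / total
        mask = ((r_norm > r_range[0]) & (r_norm < r_range[1]) &
                (g_norm > g_range[0]) & (g_norm < g_range[1]))
        return mask.astype(np.uint8) * 255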
After segmentation of the hand pixels from the image, the next task is to recognize the gesture. Gestures are subdivided into two approaches: the first is detection of the number of outstretched fingers, and the second is the point-and-click gesture. For gesture recognition a simple approach is used. Counting of fingers is done by a polar transformation around the center of the hand, counting the number of fingers, which appear roughly rectangular in shape, present at each radius; in order to speed up the algorithm, the segmented image is sampled along concentric circles.
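A rough sketch of this finger counting on one concentric circle around the palm center is shown below; the radius and run-length thresholds are illustrative assumptions.

    import numpy as np

    def count_fingers(mask, center, radius, samples=360, min_run=8):
        """Count finger crossings along one circle around the palm center."""
        cx, cy = center
        angles = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
        xs = np.clip((cx + radius * np.cos(angles)).astype(int), 0, mask.shape[1] - 1)
        ys = np.clip((cy + radius * np.sin(angles)).astype(int), 0, mask.shape[0] - 1)
        ring = mask[ys, xs] > 0                   # samples along the concentric circle
        if ring.all():
            return 0                              # circle lies entirely inside the palm
        start = int(np.argmin(ring))              # rotate to start on a background sample
        ring = np.roll(ring, -start)
        runs, run_len = [], 0
        for filled in ring:
            if filled:
                run_len += 1
            elif run_len:
                runs.append(run_len)
                run_len = 0
        if run_len:
            runs.append(run_len)
        # Each sufficiently long run of foreground samples is counted as one finger.
        return sum(1 for r in runs if r >= min_run)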
The second area of concern is the detection of point-and-click gestures. The same gesture recognition algorithm is used; when it detects only one outstretched finger, this represents a pointing gesture, and the tip of the finger is taken as the actual pointing position. The center of the finger is found at each radius and the values are fitted to a straight line; this line is then searched outward to locate the fingertip.
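A small sketch of this line-fitting step, assuming NumPy: the finger-center samples are the per-radius centers described above, and taking the outermost sample as the fingertip is an illustrative simplification.

    import numpy as np

    def fit_pointing_line(centers):
        """Fit finger-center samples (one (x, y) per radius, ordered from the palm
        outward) to a straight line and return the line and an estimated tip."""
        pts = np.asarray(centers, dtype=float)
        xs, ys = pts[:, 0], pts[:, 1]
        slope, intercept = np.polyfit(xs, ys, 1)   # least-squares straight line
        tip = pts[-1]                              # outermost sample approximates the tip
        return (slope, intercept), tuple(tip)

    # Example: finger centers sampled at increasing radii.
    line, tip = fit_pointing_line([(100, 200), (110, 190), (120, 181), (130, 171)])
    print(line, tip)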
In summary, the paper presents a computer-vision system for augmented reality. The research has shown qualitatively that it can be a useful alternative interface for use in augmented reality.
Claudio Pinhanez and colleagues (2003) started working a few years ago on an interactive "touch-screen" style projected display. In this paper, the authors demonstrated the technology named Everywhere Display (ED), which can be used for human-computer interaction (HCI). This particular technology was implemented using an LCD projector with motorized focus and zoom and a computer-controlled pan-tilt-zoom camera. They also came up with a low-end version, which they called ED-lite, which functions the same as the high-end version and differs only in the devices used. In the low-end version the group used a portable projector and an ordinary camera. Several groups of professionals have been researching and working on new methods of improving present HCI. The most common methods used for HCI are the mouse, keyboard and touch-screens, but these methods require an external device. The goal is a system that eliminates such external devices linking the communication between human and computer. Computer vision was used to implement ED and ED-lite. With the aid of computer vision, the system was able to steer the projected display from one surface to another, creating a touch-screen-like interaction without the external devices mentioned earlier. The ED unit was installed at ceiling height on a tripod to cover a greater space. A computer is used to control the ED unit and performs all other functions, such as vision processing for interaction and running the application software. The specific test conducted was a slide presentation application using Microsoft Powerpoint controlled via hand gestures. There is a designated location in the projected image which the user can use to navigate the slides or to move the content of the projected display from one surface area to another. The user controls the slides by touching buttons superimposed on the specified projected surface area. With this technology the user interacts with the computer using the bare hand and without any external input device.
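A small sketch of the touch-button idea: checking whether a detected hand position falls inside a button region superimposed on the projected image. The button names and coordinates below are illustrative assumptions, not the ED system's layout.

    # Buttons defined as rectangles (x, y, width, height) in projected-image coordinates.
    BUTTONS = {
        "next": (540, 400, 80, 60),
        "back": (20, 400, 80, 60),
    }

    def hit_button(hand_x, hand_y):
        """Return the name of the button the hand is touching, or None."""
        for name, (x, y, w, h) in BUTTONS.items():
            if x <= hand_x <= x + w and y <= hand_y <= y + h:
                return name
        return None

    # Example: a fingertip detected near the lower-right corner of the display.
    print(hit_button(560, 430))   # -> "next"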
Projector designs are now shrinking and are just at the threshold of being compact enough for handheld use. That is why Beardsley and his colleagues at Mitsubishi Electric Research Labs propose a portable handheld interactive projector. Their work is an investigation of mobile, opportunistic projection, which can turn almost any surface into an interactive display. The prototype has buttons that serve as the I/O of the device. It also has a built-in camera to detect the input of the user. The paper discusses three broad classes of applications of interactive projection. The first class uses a clean display surface for the projected display. Another class creates a projection on a physical surface; this, typically, is what we call augmented reality. The first stage is object recognition and the next is to project an overlay that gives some information about the object. The last class is to project onto the physical environment a selection mechanism similar to a mouse that creates a box to select a region of interest, but using the handheld device itself instead of a mouse.
There are two main issues when using a handheld device to create projections. First is keystone correction, to produce an undistorted projection with the correct aspect ratio. Keystoning occurs when the projector is not perpendicular to the screen, producing a trapezoidal shape instead of a square; keystone correction is used to fix this kind of problem. Second is the removal of the effects of hand motion. Here the paper describes a technique for keeping the projection static on a surface even when the device is in motion. They use distinctive visual markers called fiducials to define a coordinate frame on the display surface. Basically, the camera is used to sense the markers and to infer the target area in camera image coordinates; these coordinates are transformed to projector image coordinates, and the projection data is mapped into these coordinates, giving the right placement of the projection.
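A short sketch of this coordinate mapping, assuming OpenCV in Python: a homography is estimated from fiducial positions seen by the camera against their known positions in the projector image, then used to map camera coordinates into projector coordinates. The point values are illustrative.

    import cv2
    import numpy as np

    # Fiducial marker positions as detected in the camera image (illustrative values).
    camera_pts = np.float32([[102, 85], [540, 92], [548, 410], [96, 402]])
    # The same four markers' known positions in the projector image.
    projector_pts = np.float32([[0, 0], [800, 0], [800, 600], [0, 600]])

    # Homography that maps camera coordinates to projector coordinates.
    H, _ = cv2.findHomography(camera_pts, projector_pts)

    def camera_to_projector(x, y):
        """Map a point detected in the camera image into projector coordinates."""
        pt = np.float32([[[x, y]]])
        return cv2.perspectiveTransform(pt, H)[0, 0]

    print(camera_to_projector(320, 240))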
Examples of applications for each main class given above are also discussed. An example of the first class is a projected web browser. This is basically a desktop Windows environment that is modified so that the display goes to the projector and the input is taken from the buttons of the device. An example application of the second class involves defining a Region of Interest (ROI), just as on a desktop but without the use of a mouse.
Pinhanez et al. (2003) of IBM Research propose an interactive display that is set up with a projector and uses computer vision methods to detect interaction with the projected image. They call this technology the Everywhere Display projector (ED projector). They proposed using it in a retail environment to help customers find a certain product, give them information about it, and tell them where the product is located. The ED projector is installed on the ceiling and can project images on boards that are hung on every aisle of the store. At the entrance of the store there is a table where a larger version of the product finder is projected. Here, a list of products is projected on the table and the user can move a wooden red slider to find a product. The camera detects this motion and the list scrolls accordingly.
Professor Kenji Oka and Yoichi Sato of the University of Tokyo, together with Hideki Koike, propose a method for tracking multiple fingertips and their trajectories across image frames. They also propose a mechanism for combining direct manipulation and symbolic gestures. Several augmented desk interfaces have been developed recently. DigitalDesk is one of the earliest attempts at an augmented desk interface: using only a charge-coupled device (CCD) camera and a video projector, users can operate a projected desktop application with a fingertip. Inspired by DigitalDesk, the group developed an augmented desk interface called EnhancedDesk that lets users perform tasks by manipulating both physical and electronically displayed objects simultaneously with their own hands and fingers. An example application demonstrated in the paper was EnhancedDesk's two-handed drawing system. The application uses the proposed tracking and gesture recognition methods, which assign different roles to each hand. The gesture recognition lets users draw objects of different shapes and directly manipulate those objects using the right hand and fingers. Figure (#) shows the set-up used by the group, which includes an infrared camera.
The tracking method consists of three steps: extracting hand regions, finding fingertips, and finding the palm's center. In the extraction of hand regions, an infrared camera was used to measure temperature and cope with the complicated background and dynamic lighting by raising the pixel values corresponding to human skin above other pixels. In finding fingertips, a search window for fingertips was defined rather than performing arm extraction, since the searching process in this method is more efficient; fingertips are then searched for within the window based on the expected fingertip size. Detection of fingertips in each subsequent frame is done and then compared to the previous frame. Finding the best combination between these two sets of fingertips determines multiple fingertip trajectories in real time. A Kalman filter is used to predict the fingertip locations in one image frame from the previous frames.
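A minimal sketch of such a predictor is shown below, assuming OpenCV's built-in Kalman filter with a constant-velocity model for a single fingertip; the matrix values are the usual textbook choices, not the authors' parameters.

    import cv2
    import numpy as np

    # State: (x, y, vx, vy); measurement: (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)   # constant velocity
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

    def track(measured_x, measured_y):
        """Predict the fingertip position, then correct with the new measurement."""
        predicted = kf.predict()                                  # prior estimate for this frame
        kf.correct(np.array([[measured_x], [measured_y]], np.float32))
        return float(predicted[0, 0]), float(predicted[1, 0])

    # Example: feed a few detected fingertip positions.
    for pt in [(100, 100), (104, 102), (108, 105)]:
        print(track(*pt))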
In the evaluation of the tracking method, the group used a Linux-based PC with an Intel Pentium III 500-MHz processor, a Hitachi IP5000 image processing board, and a Nikon Laird-S270 infrared camera. The testing involved seven test subjects and examined how reliably the method found fingertip correspondences between successive image frames. The method reliably tracks multiple fingertips and could prove useful in real-time human-computer interaction applications. Gesture recognition works well with the tracking method and enables the user to achieve interaction based on symbolic gestures while performing direct manipulation with hands and fingers. Interaction based on direct manipulation and symbolic gestures works by first determining, from the measured fingertip trajectories, whether the user's hand motion represents direct manipulation or a symbolic gesture. If direct manipulation is detected, the system selects operating modes such as rotate, move, or resize and other control-mode parameters. If a symbolic gesture is detected, the system recognizes the gesture type using a symbolic gesture recognizer, in addition to recognizing gesture locations and sizes based on the trajectories.
The group plans to improve the tracking method's reliability by incorporating additional sensors. Additional sensors are needed because the infrared camera did not work well on cold hands; a solution is to use a color camera in addition to the infrared camera. The group is also planning to extend the system to 3D tracking.
Hilario and Cooperstock present a method for detecting occlusions of a front projection for interactive displays. Occlusion happens in interactive display systems when a user interacts with the display or inadvertently blocks the projection. Occlusion in these systems can lead to distortions in the projected image, and information is lost in the occluded regions. Detecting occlusions makes it possible to compensate for these unwanted effects; occlusion detection can also be used for hand and object tracking. The work presents a camera-projector calibration algorithm that estimates the RGB camera response to projected colors. This allows predicted camera images to be generated for the projected scene, and occlusion detection is then performed for each video frame by comparing the observed camera image against the predicted one. Calibration is used for constructing the predicted images of the projected scene. This is needed because Hilario and Cooperstock's occlusion detection system is used with a single camera and projector; it also assumes a planar Lambertian surface, constant lighting conditions and negligible intra-projector color variation. Calibration is done in two steps. The first is offline geometric registration, which computes the transformation from the projector to the camera frame of reference; it centers the projected image and aligns the images to the specified world coordinate frame. For geometric registration the paper adopts an approach based on earlier work, which locates the corners of a projected and printed grid in the camera view. The second step in the calibration process is offline color calibration. Due to certain dynamics, a projected display is unlikely to produce an image whose colors match exactly those of the source image. To determine predicted camera images correctly, the color transfer function from the projection to the camera must be determined. This is done by iterating through a set of projected color patches, measuring the camera response to each, and storing it as a color lookup table. The response is the average RGB color over corresponding patch pixels measured over multiple camera images. The predicted camera response can then be computed by summing the predicted camera responses to each of the projected color components. The camera-projector calibration results are used to construct the predicted camera images for occlusion detection.
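A rough sketch of the occlusion-detection idea under these assumptions: each incoming camera frame is compared per pixel against the predicted image for the currently projected content, and large differences are flagged as occluded. The threshold below is an illustrative value, not one from the paper.

    import cv2
    import numpy as np

    def occlusion_mask(observed_bgr, predicted_bgr, threshold=40):
        """Flag pixels where the observed camera image departs from the prediction."""
        diff = cv2.absdiff(observed_bgr, predicted_bgr)
        diff_gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(diff_gray, threshold, 255, cv2.THRESH_BINARY)
        # Clean up isolated pixels so only coherent occluders (hands, objects) remain.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        return mask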
Figure 10. Projector image (left), predicted camera image (middle), and observed camera image (right).
Figure 11. (a) Shadows cast by the occluding object; (b) direct contact of the occluding object with the display surface.
References
Granum, E., Liu, Y., Moeslund, T., & Storring, M. (2004). Computer vision-based gesture recognition for an augmented reality interface. Proceedings of the 4th International Conference on Visualization, Imaging and Image Processing. Marbella, Spain.
Hewett, T., et al. (1996). Chapter 2: Human-computer interaction. ACM SIGCHI Curricula for Human-Computer Interaction. Retrieved June 2, 2006, from http://sigchi.org/cdg/cdg2.html#2_3
Lenman, S., Bretzner, L., & Thuresson, B. (2002). Using marking menus to develop command sets for computer vision based hand gesture interfaces. Proceedings of the Second Nordic Conference on Human-Computer Interaction. Aarhus, Denmark.
Myers, B., et al. (1996). Strategic directions in human-computer interaction. ACM Computing Surveys, 28(4).
Oka, K., et al. (2002). Real-time fingertip tracking and gesture recognition. IEEE Computer Graphics and Applications, 22(6), 64-71.
Pinhanez, C., et al. (2003). Creating touch-screens anywhere with interactive projected display. Proceedings of the Eleventh ACM International Conference on Multimedia.