following the stories. However, to our knowledge, no games currently on the market allow children to communicate with the computer via their native language of ASL. Games that do prompt children to mimic signs have no measure of evaluation to help the child improve the clarity and correctness of their signs. This lack of repetition with feedback prevents children from benefiting fully from the software.
Figure 1: Screen shot of ASL game. a) tutor video b) live camera feed c) attention button d) animated
character and environment e) action buttons
Action Button | Level 1 | Level 2 | Level 3
#1 | Go chase snake | Go chase snake | Go chase snake
#2 | Go chase snake | Go chase spider | Go chase spider
#3 | Go chase snake | Go chase spider | Go chase alligator
#4 | White kitten behind wagon | White kitten under chair | White kitten in flowers
#5 | Black kitten under chair | Black kitten in flowers | Black kitten in bedroom
#6 | Orange kitten in flowers | Orange kitten in bedroom | Orange kitten on wall
#7 | Blue kitten in bedroom | Blue kitten on wall | Blue kitten behind wagon
#8 | Green kitten on wall | Green kitten behind wagon | Green kitten under chair
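For reference, the button-to-phrase mapping above can be read as a simple lookup table. The sketch below is illustrative (the structure and names are ours, not the game's actual code); the phrases themselves are transcribed from the table.

```python
# Lookup table transcribed from the action-button phrase table above:
# (button, level) -> phrase the child is prompted to sign.
PHRASES = {
    (1, 1): "go chase snake", (1, 2): "go chase snake", (1, 3): "go chase snake",
    (2, 1): "go chase snake", (2, 2): "go chase spider", (2, 3): "go chase spider",
    (3, 1): "go chase snake", (3, 2): "go chase spider", (3, 3): "go chase alligator",
    (4, 1): "white kitten behind wagon", (4, 2): "white kitten under chair", (4, 3): "white kitten in flowers",
    (5, 1): "black kitten under chair", (5, 2): "black kitten in flowers", (5, 3): "black kitten in bedroom",
    (6, 1): "orange kitten in flowers", (6, 2): "orange kitten in bedroom", (6, 3): "orange kitten on wall",
    (7, 1): "blue kitten in bedroom", (7, 2): "blue kitten on wall", (7, 3): "blue kitten behind wagon",
    (8, 1): "green kitten on wall", (8, 2): "green kitten behind wagon", (8, 3): "green kitten under chair",
}

def expected_phrase(button, level):
    """Return the phrase the recognizer should expect for a button press."""
    return PHRASES[(button, level)]
```

Because the game knows which button was pressed at which level, it can label each captured phrase automatically in this way.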
game play. This push-to-sign mechanism is similar to those found in many speech recognition systems (called push-to-talk for speech) and allows our ASL recognition system to perform recognition on only pertinent phrases of the children's sign. The data can be automatically labeled at the phrase level using information from the game. We use this method to remove both out-of-context and unscripted signing, as well as to ignore the child's out-of-game comments. Thus the segmentation and labeling are done concurrently with the data collection. Post-processing and labeling of the data are still required, but the workload and boundary accuracy are greatly improved.

The Wizard of Oz setup and push-to-sign mechanism in the game allowed us to collect a large amount of signing during testing. This data set is unusual because it consists of samples of children signing and interacting naturally with the game. Most sign language data sets are collected in the lab under controlled conditions with well-enunciated signing. Our data set contains a variety of signing inflections and emphases as well as sign accents common among the children.

3.2 Image Processing
Our data consists of video of the user's signing and accelerometer data from glove-mounted, wireless accelerometers. In our system, we require the children to wear small pink-colored gloves. This bright color is easily identified by a computer vision algorithm. Tracking skin tones can be particularly problematic for computer vision in unconstrained environments. Additionally, it is difficult to distinguish the hands when they perform signs near the face. Many algorithms have been suggested to segment the hand region robustly, even under illumination change [17, 28, 23, 24, 21]. However, some of them address only a narrow range of illumination change, and some do not guarantee real-time processing (at least 10 fps with 720 × 480 sized images in our system) or robustness for long image sequences of gestures. Some methods extract similar color regions along with the hand color region, and their performance strongly depends on the result in the first image frame.

In our approach, the image pixel data is converted to HSV color space and used to create histograms for segmentation of the hand region and background, as shown in Figure 4. The HSV histograms are used to produce a binary mask using a Bayes classifier [21], and noise is removed by morphological filters including size filtering and hole filtering. The position of the desk and the colored gloves provide a significant marker for starting the gesture recognition; the light color of the desk provides a high-contrast environment. The children click the mouse to start and end each phrase, which provides both location and color cues, as well as a start and end gesture. From these cues, we can extract the hand region well for the first frame of the image sequence by simply applying a threshold.

We initially use the hand segmentation to create the starting histogram. Each frame is a segmentation cycle, which provides feedback to the system and helps enhance the discrimination of the color models. The HSV histograms are updated with a weight value ω (0 < ω < 1), based on the obtained mask, and then the histograms are normalized:

H ← (1 − ω)H + ωH_new

where H denotes the histogram value for each bin [21]. Figure 4 shows the hand tracking process for later frames. The segmentation of the hand region and the update of the HSV histograms are the same as the procedure for the first frame. To find both hands in the binary mask, we consider the size of the hand shapes and the distance between the center positions of the candidate blobs, as well as the hand positions in the previous image frame.

Figure 5 shows the results of the image processing for several image sequences. Processing occurs at 48.574 ms/frame (20.59 fps) on a laptop computer with a 1 GHz processor. We found that the tracking results were acceptable, even when the child wears a shirt with similar color patterns.

3.3 Accelerometer Processing
Our accelerometers (shown in Figure 3) are a custom in-house design created for wearable, wireless sensing. These small wireless sensor platforms provide a Bluetooth serial port profile link to three axes of accelerometer data. The accelerometer is sampled by a PIC microcontroller at approximately 100 Hz. The sensors run on a standard camera battery. Each accelerometer data packet consists of four values: a 16-bit hexadecimal sequence number and three 10-bit hexadecimal values (one each for the X, Y, and Z axes). The axis values represent the gravitational effect of acceleration. Once the data packets are read from the accelerometer, they are post-processed. First, they are synchronized with our video feed so that the accelerometer data packets for each hand are associated with the correct video frames. Second, the data is smoothed to account for the variable number of packets associated with each frame. Because of sampling issues, each video frame can vary by one or two accelerometer packets.
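The histogram adaptation step in Section 3.2, H ← (1 − ω)H + ωH_new followed by normalization, can be sketched in a few lines. This is a minimal illustration over plain hue-bin lists, with an assumed bin count and weight value, not the authors' implementation:

```python
def update_histogram(hist, new_hist, omega=0.2):
    """Blend the running color histogram with the histogram computed
    from the current frame's segmented hand mask:
    H <- (1 - omega) * H + omega * H_new, then renormalize the bins."""
    blended = [(1.0 - omega) * h + omega * n for h, n in zip(hist, new_hist)]
    total = sum(blended)
    return [b / total for b in blended] if total > 0 else blended

# Toy example with 8 hue bins (the real system uses full HSV histograms):
running = [1.0 / 8] * 8   # uniform starting histogram
frame = [0.0] * 8
frame[2] = 1.0            # current hand pixels fall entirely in bin 2
running = update_histogram(running, frame)
```

Because each frame's mask feeds back into the histograms, the color model gradually adapts to illumination changes over long image sequences.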
Figure 5: Segmented hand regions in the image sequences. green: right hand, red: left hand, cyan: occluded
hands
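The packet handling in Section 3.3 (a 16-bit hexadecimal sequence number plus three 10-bit axis values, smoothed onto 10 fps video frames) can be sketched as follows. The field layout, separator, and fixed packets-per-frame grouping are simplifying assumptions; the real system synchronizes packets to frames via the video feed:

```python
def parse_packet(line):
    """Parse one accelerometer packet: a 16-bit hex sequence number
    followed by three 10-bit hex axis values (X, Y, Z).
    Whitespace-separated fields are an assumption for illustration."""
    seq_hex, x_hex, y_hex, z_hex = line.split()
    return int(seq_hex, 16), (int(x_hex, 16), int(y_hex, 16), int(z_hex, 16))

def average_per_frame(packets, packets_per_frame=10):
    """Group ~100 Hz packets onto 10 fps video frames and average them,
    smoothing over the one-or-two-packet jitter per frame."""
    frames = []
    for i in range(0, len(packets), packets_per_frame):
        chunk = [axes for _, axes in packets[i:i + packets_per_frame]]
        n = len(chunk)
        frames.append(tuple(sum(c[k] for c in chunk) / n for k in range(3)))
    return frames

raw = ["0001 1FF 200 201", "0002 1FF 200 203"]
packets = [parse_packet(line) for line in raw]
```

Averaging per frame yields one (x, y, z) triple per video frame, which is what the feature vectors in Section 3.4 consume.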
3.4 Feature Vectors
Our feature vectors consist of the combination of both vision data and accelerometer data. The accelerometer data consists of (x, y, z) values for the accelerometers on both hands. The vision data consists of the following hand shape characteristics: the change in x, y center positions between frames, mass, the length of the major and minor axes, eccentricity, the orientation angle of the major axis, and the direction of the major axis in x, y offset. The camera captures images at 10 frames per second, and each frame is synchronized with the averaged accelerometer values. For recognition we adopt left-to-right HMMs with four states.

3.5 Experiment
The data collected represents each of the five children playing all three levels of the game at least five times each. Each sample consists of one signed phrase from the game. Samples were initially classified as correctly or incorrectly signed phrases, based on feedback from our consultants from AASD. The "correct" samples were those that were signed correctly according to game play. These samples were further pruned to remove any samples which were evaluated as correct for content but had problems with the signing such as false starts, fidgeting, or poorly formed signing. This set of good samples was then labeled to create a transcript of each signed phrase. The final data set represented 541 signed sentences and 1,959 individual signs.

We used the Georgia Tech Gesture Toolkit (GT2K) [27] to train and test our system. GT2K adapts HTK (the Hidden Markov Model Toolkit) for gesture recognition. HTK provides HMMs in the context of a language infrastructure for use in speech recognition [9]. We used the language tools to train a single model for each sign and then ran additional training for context (equivalent to speech triphone modeling). This procedure allowed us to create stable individual models, and then to combine those models to represent the co-articulation effects of continuous signing. HTK also provides an infrastructure for rule-based and statistical grammars.

ASL is a structured language complete with a grammar, vocabulary, and other linguistic features. Thus, the application of a relevant grammar together with statistical word models can provide a practical solution for removing ambiguity due to disfluency in the deaf children's signing. Table 4 shows the grammar adopted for our system, given as an HTK expression, where sil0, sil1, sil2, and sil3 denote silence models by which transitional motions and pauses between words can be segmented in the signing. Coarticulation is the effect that words or signs have on each other when they precede or succeed one another. The ordering of the silences helps maintain consistency for modeling coarticulation effects.

3.6 Results

3.6.1 User Dependent Models
We evaluated our approach using several different methods. Table 2 shows results from testing user-dependent models: models which are trained and tested using a single child's data. The user-dependent models were generated by training on 90% of the samples randomly selected from a single child's dataset and testing on the other 10%. This was done 100 times for each child and averaged. These results show how well the models perform for training and testing on individual users. We achieved an average word accuracy of 93.39% for the user-dependent models.

3.6.2 User Independent Models
Table 3 shows results from testing user-independent models using leave-one-out validation. User-independent models can be used to recognize signs from multiple children. The user-independent models were generated by training on a dataset consisting of four children and testing on the
Table 2: Word accuracy and sentence correctness for user-dependent models

Data Set | Word Accuracy Mean | Word Accuracy StdDev | Sentence Correctness Mean | Sentence Correctness StdDev
Participant 1 (90/10 split) | 94.1105% | 3.162212 | 69.3663% | 11.95803
Participant 2 (90/10 split) | 91.4754% | 4.283267 | 69.3663% | 11.95803
Participant 3 (90/10 split) | 94.9271% | 2.404333 | 70.9118% | 11.48396
Participant 4 (90/10 split) | 95.6477% | 2.61991 | 74.6389% | 12.1403
Participant 5 (90/10 split) | 90.8016% | 3.868346 | 55.779% | 12.2334
other child’s dataset. This is done for each child. This helps
us understand how the models generalize across users. We
achieved an average word accuracy of 86.28% for the user–
independent models.
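The two evaluation protocols (a 90/10 split repeated 100 times per child, and leave-one-out validation across children) can be sketched as below. The `train_model` and `word_accuracy` functions here are simple stand-ins for the real GT2K/HTK training and scoring steps, included only so the protocol logic runs:

```python
import random

def train_model(phrases):
    """Stand-in for GT2K/HTK training: the real system trains HMMs.
    Here we just collect the training vocabulary."""
    return {word for phrase in phrases for word in phrase.split()}

def word_accuracy(model, phrases):
    """Stand-in scorer: fraction of test words covered by the model."""
    words = [w for phrase in phrases for w in phrase.split()]
    return sum(w in model for w in words) / len(words)

def user_dependent(samples, trials=100, holdout=0.1):
    """90/10 protocol: hold out 10% of one child's samples at random,
    train on the rest, repeat `trials` times, and average."""
    scores = []
    for _ in range(trials):
        shuffled = random.sample(samples, len(samples))
        cut = max(1, int(len(shuffled) * holdout))
        model = train_model(shuffled[cut:])
        scores.append(word_accuracy(model, shuffled[:cut]))
    return sum(scores) / len(scores)

def user_independent(per_child):
    """Leave-one-out protocol: train on all children but one, test on
    the held-out child; repeat for each child."""
    results = {}
    for child, test in per_child.items():
        train = [p for c, ps in per_child.items() if c != child for p in ps]
        results[child] = word_accuracy(train_model(train), test)
    return results
```

Averaging over 100 random splits reduces the variance introduced by any single lucky or unlucky holdout set, which is why both a mean and a standard deviation are reported per participant.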
3.6.4 Summary
We ran three sets of experiments with our data sets: user-dependent, user-independent, and across all samples. The user-dependent models show how well the recognizer performs for a single individual. All of the students' signing can be modeled fairly well, with greater than 90% word accuracy.
However, when recognizing sentences, words can be inserted and deleted, which creates errors that lower sentence accuracy. The range in word accuracies, from 90.8% to 95.6%, shows strong variation across the users' signing samples; some are clearly modeled better than others.
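The effect of insertions and deletions on accuracy can be made concrete with the standard HTK-style word accuracy, Acc = (N − S − D − I)/N, where S, D, and I are the substitutions, deletions, and insertions in a minimum-edit-distance alignment against the N-word reference. The sketch below is an illustration of that metric, not the paper's scoring code:

```python
def word_accuracy(reference, hypothesis):
    """HTK-style word accuracy: (N - S - D - I) / N, computed from a
    minimum-edit-distance (Levenshtein) alignment over words, where the
    total distance equals S + D + I at the optimal alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return (m - d[m][n]) / m
```

Note that heavy insertion errors can drive this accuracy below zero, which is why high per-word accuracy does not guarantee high sentence-level correctness.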
Participant four has the highest word accuracy and the
second lowest standard deviation on the tests. These re-
sults can be an indicator that participant four probably had
clear, consistent signing. Participant two had the second lowest word accuracy and the highest standard deviation. When
used as the test set for leave–one–out validation, partici-
pant two scored the worst against the models. Participants
three and four were the top two performers for word accura-
cies and standard deviation in both tests. Participants that
signed clear and consistent data (evident by high word accu-
racies and low standard deviations over the user dependent
Figure 4: Hand segmentation for the first image models) were also well modeled by the more general user
frame independent models.
The limited size of the participant pool restricts the generalizations that can be drawn from the experiment, but these experiments show interesting trends that should be investigated for larger population sizes. Increasing the number of users for both creating and testing models will give a broader idea of the generalization. There is a tension between the usefulness of generalization and the added accuracy of user-specific training; this can be seen in speech recognition packages that ship with models trained on large populations and allow the user to help train the models by providing additional training samples.

The leave-one-out validation shows how the models generalize to users they have never seen. The wider range of word accuracies, from 76.9% to 92.6%, shows how including or excluding certain students from the training set can affect the outcome. The consistency in performance by participants in the user-dependent and user-independent tests indicates that the user-independent models are doing a good job of generalizing. The independent models generalize well, even across models of varying quality.

4. CONCLUSIONS
We present continued work on a computer game designed to help children practice their ASL skills and encourage their linguistic development. Linguistically, the system is designed to help children practice their vocabulary and encourage them to generate phrases which convey complete thoughts and ideas. CopyCat provides a new domain for the sign language recognition community and has provided a unique data set of native signers interacting naturally with a computer system. The recognition engine represents an important step towards the recognition of conversational sign, in contrast to other systems which largely use scripted sign collected in controlled laboratory environments.

We have expanded the functionality of our previous systems [22, 2, 15] to handle non-scripted, live signing which includes pauses and variable coarticulation effects. We show progress with the gesture recognition component of the game. We use a hand segmentation algorithm which is robust against illumination change and guarantees real-time processing. We also adopted a strong grammar to alleviate the effects of disfluencies and to achieve a 92.96% word accuracy. Our dataset is unique in the sign recognition community because it uses signers conversing in a spontaneous, unscripted manner. These results show the feasibility of the system and provide a platform for further developing the recognition capabilities.

4.1 Future Work
Though this work has extended the functionality of our recognition system, it is clear that there is much work to do. The dataset collected for our experiments is both exciting and challenging. This evaluation uses samples selected from the dataset for their correctness and clarity in signing. These samples are used for building models and recognizing signs. Our next challenge is enabling the recognition engine to deal with the disfluencies present in otherwise linguistically correct samples. Disfluencies of importance include long pauses, fidgets, hesitations, and false starts in signing.

We have completed the next round of iterative development with a user study at AASD. The study included another round of interface development and data collection, as well as pre-study and post-study linguistic evaluations to begin to explore the learning effects of the system. Preliminary analysis of the data from this study has been extremely informative. We are currently focused on classifying disfluencies and out-of-band communication by the children during game play to provide more informed models and increase recognition accuracy. The full analysis of this study will provide us with further insight into the educational value of the system, as well as increase our data bank for further recognition engine work.

5. ACKNOWLEDGEMENTS
This work is supported by the NSF, Grants #0093291 and #0511900, and the RERC on Mobile Wireless Technologies for Persons with Disabilities, which is funded by the NIDRR of the US Dept. of Education (DoE), Grant #H133E010804. The opinions and conclusions of this publication are those of the grantee and do not necessarily reflect those of the NSF or US DoE.

Special thanks to the students and staff at the Atlanta Area School for the Deaf for their generous help with this project.

6. REFERENCES
[1] B. Bauer, H. Hienz, and K. Kraiss. Video-based continuous sign language recognition using statistical methods. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, pages 463–466, September 2000.
[2] H. Brashear, T. Starner, P. Lukowicz, and H. Junker. Using multiple sensors for mobile sign language recognition. In Proceedings of the Seventh IEEE International Symposium on Wearable Computers, pages 45–52, 2003.
[3] A. Dix, J. Finlay, G. Abowd, and R. Beale. Human-Computer Interaction, chapter 6.4 Iterative Design and Prototyping. Prentice Hall, 2004.
[4] G. Fang, W. Gao, and D. Zhao. Large vocabulary sign language recognition based on hierarchical decision trees. In International Conference on Multimodal Interfaces, pages 125–131, 2003.
[5] Gallaudet University. Regional and national summary report of data from the 1999–2000 annual survey of deaf and hard of hearing children and youth. Washington, D.C., 2001.
[6] W. Gao, G. Fang, D. Zhao, and Y. Chen. Transition movement models for large vocabulary continuous sign language recognition (CSL). In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pages 553–558, 2004.
[7] V. Henderson, S. Lee, H. Brashear, H. Hamilton, T. Starner, and S. Hamilton. Development of an American sign language game for deaf children. In Proceedings of the 4th International Conference for Interaction Design and Children, Boulder, CO, 2005.
[8] J. L. Hernandez-Rebollar, N. Kyriakopoulos, and R. W. Lindeman. A new instrumented approach for translating American sign language into sound and text. In Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pages 547–552, 2004.
[9] HTK: Hidden Markov Model Toolkit. http://htk.eng.cam.ac.uk.
[10] IDRT. Con-sign-tration. Product Information on the World Wide Web, Institute for Disabilities Research and Training Inc.,
$predator = snake | spider | alligator ;
$color1 = white | green | blue ;
$color2 = black | white | green ;
$color3 = orange | black | white ;
$color4 = blue | orange | black ;
$color5 = green | blue | orange ;
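The alternation variables above constrain which words may fill each slot of a phrase; the full HTK phrase-structure rule is not reproduced here. As an illustration only (the function and names are ours, not part of the HTK grammar), the slot vocabularies can be transcribed and checked in Python:

```python
# Slot vocabularies transcribed from the grammar fragment above.
PREDATOR = {"snake", "spider", "alligator"}
COLOR1 = {"white", "green", "blue"}
COLOR2 = {"black", "white", "green"}
COLOR3 = {"orange", "black", "white"}
COLOR4 = {"blue", "orange", "black"}
COLOR5 = {"green", "blue", "orange"}

def fills_slot(word, slot):
    """Check whether a recognized word is permitted by a grammar slot,
    mirroring how the HTK grammar restricts the recognizer's search."""
    return word in slot
```

Restricting each slot to a small vocabulary is what lets the strong grammar prune implausible word sequences during decoding.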