Multimedia
Semantics
Uma Srinivasan
CSIRO ICT Centre, Australia
Surya Nepal
CSIRO ICT Centre, Australia
IRM Press
Acquisitions Editor: Rene Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editor: Michael Jaquish
Typesetter: Jennifer Neidig
Cover Design: Lisa Tosheff
Printed at: Integrated Book Technology
Managing Multimedia Semantics
Table of Contents
Preface ........................................................................................................................... vi
SECTION 1: SEMANTIC INDEXING AND RETRIEVAL OF IMAGES
Chapter 1
Toward Semantically Meaningful Feature Spaces for Efficient Indexing in Large
Image Databases ............................................................................................................. 1
Anne H.H. Ngu, Texas State University, USA
Jialie Shen, The University of New South Wales, Australia
John Shepherd, The University of New South Wales, Australia
Chapter 2
From Classification to Retrieval: Exploiting Pattern Classifiers in Semantic
Image Indexing and Retrieval ......................................................................................... 30
Joo-Hwee Lim, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Chapter 3
Self-Supervised Learning Based on Discriminative Nonlinear Features and Its
Applications for Pattern Classification ......................................................................... 52
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA
SECTION 2: AUDIO AND VIDEO SEMANTICS: MODELS AND STANDARDS
Chapter 4
Context-Based Interpretation and Indexing of Video Data ............................................. 77
Ankush Mittal, IIT Roorkee, India
Cheong Loong Fah, The National University of Singapore, Singapore
Ashraf A. Kassim, The National University of Singapore, Singapore
Krishnan V. Pagalthivarthi, IIT Delhi, India
Chapter 5
Content-Based Music Summarization and Classification ............................................. 99
Changsheng Xu, Institute for Infocomm Research, Singapore
Xi Shao, Institute for Infocomm Research, Singapore
Namunu C. Maddage, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Qi Tian, Institute for Infocomm Research, Singapore
Chapter 6
A Multidimensional Approach for Describing Video Semantics .................................. 135
Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia
Chapter 7
Continuous Media Web: Hyperlinking, Search and Retrieval of Time-Continuous
Data on the Web ............................................................................................................. 160
Silvia Pfeiffer, CSIRO ICT Centre, Australia
Conrad Parker, CSIRO ICT Centre, Australia
Andre Pang, CSIRO ICT Centre, Australia
Chapter 8
Management of Multimedia Semantics Using MPEG-7 ................................................ 182
Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA
SECTION 3: USER-CENTRIC APPROACH TO MANAGE SEMANTICS
Chapter 9
Visualization, Estimation and User Modeling for Interactive Browsing of Personal
Photo Libraries .............................................................................................................. 193
Qi Tian, University of Texas at San Antonio, USA
Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA
Neal Lesh, Mitsubishi Electric Research Laboratories, USA
Chia Shen, Mitsubishi Electric Research Laboratories, USA
Thomas S. Huang, University of Illinois, USA
Chapter 10
Multimedia Authoring: Human-Computer Partnership for Harvesting Metadata from
the Right Sources .......................................................................................................... 223
Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia
Chapter 11
MM4U: A Framework for Creating Personalized Multimedia Content ........................ 246
Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany
Chapter 12
The Role of Relevance Feedback in Managing Multimedia Semantics: A Survey ........ 288
Samar Zutshi, Monash University, Australia
Campbell Wilson, Monash University, Australia
Shonali Krishnaswamy, Monash University, Australia
Bala Srinivasan, Monash University, Australia
SECTION 4: MANAGING DISTRIBUTED MULTIMEDIA
Chapter 13
EMMO: Tradeable Units of Knowledge-Enriched Multimedia Content ......................... 305
Utz Westermann, University of Vienna, Austria
Sonja Zillner, University of Vienna, Austria
Karin Schellner, ARC Research Studio Digital Memory Engineering,
Vienna, Austria
Wolfgang Klaus, University of Vienna and ARC Research Studio Digital
Memory Engineering, Vienna, Austria
Chapter 14
Semantically Driven Multimedia Querying and Presentation ...................................... 333
Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA
SECTION 5: EMERGENT SEMANTICS
Chapter 15
Emergent Semantics: An Overview ............................................................................... 351
Viranga Ratnaike, Monash University, Australia
Bala Srinivasan, Monash University, Australia
Surya Nepal, CSIRO ICT Centre, Australia
Chapter 16
Emergent Semantics from Media Blending ................................................................... 363
Edward Altman, Institute for Infocomm Research, Singapore
Lonce Wyse, Institute for Infocomm Research, Singapore
Glossary ......................................................................................................................... 391
About the Authors .......................................................................................................... 396
Index .............................................................................................................................. 406
Preface
with the same material. Nevertheless, the need to retrieve multimedia information grows
inexorably, carrying with it the need to have tools that can facilitate search and retrieval
of multimedia content at a semantic or a conceptual level to meet the varying needs of
different users.
There are numerous conferences that are still addressing this problem. Managing
multimedia semantics is a complex task and continues to be an active research area that
is of interest to different disciplines. Individual papers on multimedia semantics can be
found in many journals and conference proceedings. Meersman, Tari and Stevens
(1999) present a compilation of the work presented at the IFIP Data Semantics
Working Conference held in New Zealand. The working group focused on issues that
dealt with semantics of the information represented, stored and manipulated by multimedia systems. The topics covered in this book include: data modeling and query
languages for multimedia; methodological aspects of multimedia database design, information retrieval, knowledge discovery and mining, and multimedia user interfaces.
The book covers six main thematic areas: Video Data Modeling and Use; Image Databases; Applications of Multimedia Systems; Multimedia Modeling; Multimedia Information Retrieval; and Semantics and Metadata. This book offers a good glimpse
of the issues that need to be addressed from an information systems design perspective. Here semantics is addressed from the point of view of querying and retrieving
multimedia information from databases.
In order to retrieve multimedia information more effectively, we need to go deeper
into the content and exploit results from the vision community, where the focus has
been in understanding inherent digital signal characteristics that could offer insights
into semantics situated within the visual content. This aspect is addressed in Bimbo
(1999), where the focus is mainly on visual feature extraction techniques used for content-based retrieval of images. The topics discussed are image retrieval by colour similarity, image retrieval by texture similarity, image retrieval by shape similarity, image
retrieval by spatial relationships, and finally one chapter on content-based video retrieval. The focus here is on low-level feature-based content retrieval. Although several algorithms have been developed for detecting low-level features, the multimedia
community has realised that content-based retrieval (CBR) research has to go beyond
low-level feature extraction techniques. We need the ability to retrieve content at more
abstract levels: the levels at which humans view multimedia information. The vision
research then moved on from low-level feature extraction in still images to segment
extraction in videos. Semantics becomes an important issue when identifying what
constitutes a meaningful segment. This shifts the focus from image and video analysis
(of single features) to synthesis of multiple features and relationships to extract more
complex information from videos. This idea is further developed in Dorai and Venkatesh
(2002), where the theme is to derive high-level semantic constructs from automatic
analysis of media. That book uses media production and principles of film theory as the
bases to extract higher-level semantics in order to index video content. The main chapters include applied media aesthetics, space-time mappings, film tempo, modeling colour
dynamics, scene determination using auditive segmentation, and determining effective
events.
In spite of the realisation within the research community that multimedia research
needs to be enhanced with semantics, research output has been discipline-based. Therefore, there is no single source that presents all the issues associated with modeling,
representing and managing multimedia semantics in order to facilitate information retrieval at a semantic level desired by the user. And, more importantly, research has
progressed by handling one medium at a time. At the user level, we do know that
multimedia information is not just a collection of monomedia types. Although each
media type has its own inherent properties, multimedia information has a coherence
that can only be perceived if we take a holistic approach to managing multimedia semantics. It is our hope that this book fills this gap by addressing, from an application perspective that adds value to the user community, the whole spectrum of problems that must be solved in order to manage multimedia semantics.
OUR APPROACH TO
ADDRESS THIS CHALLENGE
The objective of the book, Managing Multimedia Semantics, is to assemble in
one comprehensive volume the research problems, theoretical frameworks, tools and
technologies that contribute towards managing multimedia semantics. The complexity
of managing multimedia semantics has given rise to many frameworks, models, standards and solutions. The book aims to highlight both current techniques and future
trends in managing multimedia semantics.
We systematically define the problem of multimedia semantics and present approaches that help to model, represent and manage multimedia content, so that information systems deliver the promise of providing access to the rich content held in the
vaults of multimedia archives. We include topics from different disciplines that contribute to this field and synthesise the efforts towards addressing this complex problem. It is our hope that the technologies described in the book could lead to the development of new tools to facilitate search and retrieval of multimedia content at a semantic or a conceptual level to meet the varying needs of the user community.
Chapter 2 addresses the semantic gap that exists between a user's query and the low-level visual features that can be extracted from an image. This chapter presents a state-of-the-art review of pattern classifiers in content-based image retrieval systems, and
then extends these ideas from pattern recognition to object recognition. The chapter
presents three new indexing schemes that exploit pattern classifiers for semantic indexing.
Chapter 3 takes the next step in the object recognition problem, and proposes a
self-supervised learning algorithm, KDEM (Kernel Discriminant-EM), to speed up
semantic classification and recognition problems. The algorithms are tested for image
classification, hand posture recognition and fingertip tracking.
We then move on from image indexing to context-based interpretation and indexing of videos.
abstraction. The chapter presents a discussion on application development using MPEG7 descriptions. Finally the chapter discusses some strengths and weaknesses of the
standard in addressing multimedia semantics.
CONCLUDING REMARKS
In spite of large research output in the area of multimedia content analysis and
management, current state-of-the-art technology offers very little by way of managing
semantics that is applicable for a range of applications and users. Semantics has to be
inherent in the technology rather than an external factor introduced as an afterthought.
Situated and contextual factors need to be taken into account in order to integrate
semantics into the technology. This leads to the notion of emergent semantics, which is user-centered, in contrast to technology-driven methods that extract latent semantics. Automatic methods for semantic extraction tend to presuppose that semantics is static, which runs counter to the natural way semantics evolves. Other interactive technologies and developments in the area of the Semantic Web also address this problem. In
future, we hope to see the convergence of different technologies and research disciplines in addressing the multimedia semantic problem from a user-centric perspective.
REFERENCES
Bimbo, A.D. (1999). Visual information retrieval. San Francisco: Morgan Kaufmann.
Dorai, C., & Venkatesh, S. (2002). Computational media aesthetics. Boston: Kluwer
Academic Publishers.
Meersman, R., Tari, Z., & Stevens, S. (1999, January 4-8). Database semantics: Semantic issues in multimedia systems. IFIP TC2/WG2.6 Eighth Working Conference on Database Semantics (DS-8), Rotorua, New Zealand.
Acknowledgments
The editors would like to acknowledge the help of a number of people who contributed in various ways, without whose support this book could not have been published in its current form. Special thanks go to all the staff at Idea Group, who participated from inception of the initial idea to the final publication of the book. In particular,
we acknowledge the efforts of Michele Rossi, Jan Travers and Mehdi Khosrow-Pour
for their continuous support during the project.
No book of this nature is possible without the commitment of the authors. We
wish to offer our heart-felt thanks to all the authors for their excellent contributions to
this book, and for their patience as we went through the revisions. The completion of
this book would have been impossible without their dedication.
Most of the authors of chapters also served as referees for chapters written by
other authors, and they deserve a special note of thanks. We also would like to acknowledge the efforts of other external reviewers: Zahar Al Aghbhari, Saied Tahaghoghi,
A.V. Ratnaike, Timo Volkner, Mingfang Wu, Claudia Schremmer, Santha Sumanasekara,
Vincent Oria, Brigitte Kerherve, and Natalie Colineau.
Last, but not least, we would like to thank CSIRO (Commonwealth Scientific
and Industrial Research Organization) and the support from the Commercial group, in
particular Pamela Steele, in managing the commercial arrangements and letting us get
on with the technical content.
Finally we wish to thank our families for their love and support throughout the
project.
Uma Srinivasan and Surya Nepal
CSIRO ICT Centre, Sydney, Australia
September 2004
Section 1
Semantic Indexing and
Retrieval of Images
Chapter 1
Toward Semantically
Meaningful Feature
Spaces for Efficient
Indexing in Large
Image Databases
Anne H.H. Ngu, Texas State University, USA
Jialie Shen, The University of New South Wales, Australia
John Shepherd, The University of New South Wales, Australia
ABSTRACT
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
This chapter presents a dimension reduction technique called Combining Multiple Visual Features (CMVF) that integrates multiple
visual features to get better query effectiveness. Our approach is able to produce lowdimensional image feature vectors that include not only low-level visual properties but
also high-level semantic properties. The hybrid architecture can produce feature
vectors that capture the salient properties of images yet are small enough to allow the
use of existing high-dimensional indexing methods to provide efficient and effective
retrieval.
INTRODUCTION
With advances in information technology, there is an ever-growing volume of
multimedia information from emerging application domains such as digital libraries,
the World Wide Web, and Geographical Information Systems (GIS), available online.
However, effective indexing and navigation of large image databases still remains one
of the main challenges for modern computer systems. Currently, intelligent image retrieval
systems are mostly similarity-based. The idea of indexing an image database is to extract
the features (usually in the form of a vector) from each image in the database and then
to transform features into multidimensional points. Thus, searching for similarity
between objects can be treated as a search for close points in this feature space and the
distance between multidimensional points is frequently used as a measurement of
similarity between the two corresponding image objects.
To efficiently support this kind of retrieval, various kinds of novel access methods
such as Spatial Access Methods (SAMs) and metric trees have been proposed. Typical
examples of SAMs include the SS-tree (White & Jain, 1996), R+-tree (Sellis, 1987) and grid
files (Faloutsos, 1994); for metric trees, examples include the vp-tree (Chiueh, 1994), mvp-tree (Bozkaya & Ozsoyoglu, 1997), GNAT (Brin, 1995) and M-tree (Ciaccia, 1997). While
these methods are effective in some specialized image database applications, many open
problems in image indexing still remain.
Firstly, typical image feature vectors are high dimensional (e.g., some image feature
vectors can have up to 100 dimensions). Since the existing access methods have an
exponential time and space complexity as the number of dimensions increases, for
indexing high-dimensional vectors, they are no better than sequential scanning of the
database. This is the well-known curse of dimensionality. For instance, methods
based on R-trees can be efficient if the fan-out of the R-tree nodes remains greater than two and the number of dimensions is under five. The search time with linear quad trees is proportional to the size of the hypersurface of the query region, which grows with the
number of dimensions. With grid files, the search time depends on the directory whose
size also grows with the number of dimensions.
Secondly, there is a large semantic gap between low-level media representations and high-level concepts such as person, building, sky, landscape, and so forth.
In fact, while the extraction of visual content from digital images has a long history, it has
so far proved extremely difficult to determine how to use such features to effectively
represent high-level semantics. This is because similarity in low-level visual features may
not correspond to high-level semantic similarity. Moreover, human beings perceive and
identify images by integrating different kinds of visual features in a nonlinear way. This
implies that assuming each type of visual feature contributes equally to the recognition
of the images is not supported by the human perceptual system, and that an efficient content-based image retrieval system cannot be achieved by considering simple visual features independently.
In terms of developing indexing methods for effective similarity searching in large
image repositories, we are faced with the problem of producing a composite feature vector
that accurately mimics human visual perception. Although many research works have
claimed to support queries on composite features by combining different features into
an integrated index structure, very few of them explain how the integration is implemented. There are two main problems that need to be addressed here. The first one is that
the integrated features (or composite features) typically generate very high-dimensional
feature space, which cannot be handled efficiently by the existing access methods. The
other problem is the discovery of image similarity measures that reflect semantic
similarity at a high level.
There are two approaches to solving the indexing problem. The first approach is to
develop a new spatial index method that can handle data of any dimension and employ
a k-nearest neighborhood (k-NN) search. The second approach is to map the raw feature
space into a reduced space so that an existing access method can be applied. Creating
a generalized high-dimensional index that can handle hundreds of dimensions is still an
unsolved problem. The second approach is clearly more practical. In this chapter, we
focus on how to generate a small but semantically meaningful feature vector so that
effective indexing structures can be constructed.
The second problem is how to use low-level media properties to represent high-level
semantic similarity. In the human perceptual process, the various visual contents in an
image are not weighted equally for image identification. In other words, the human visual
system has different responses to color, texture and shape information in an image. When
the feature vectors extracted from an image represent these visual features, the similarity
measure for each feature type between the query image and an image in the database is
typically computed by a Euclidean distance function. The similarity measure between the
two images is then expressed as a linear combination of the similarity measures of all the
feature types. The question that remains here is whether a linear combination of the
similarity measures of all the feature types best reflects how we perceive images as similar.
So far, no experiments have been conducted to verify this belief.
The main contribution of this work is in building a novel dimension reduction
scheme, called CMVF (Combining Multiple Visual Features), for effective indexing in
large image databases. The scheme is designed based on the observation that humans use
multiple kinds of visual features to identify and classify images via a robust and efficient
learning process. The objective of the CMVF scheme is to mimic this process in such a
way as to produce relatively small feature vectors that incorporate multiple features and
that can be used to effectively discriminate between images, thus providing both efficient
(small vectors) and effective (good discrimination) retrieval. The core of the work is to
use a hybrid method that incorporates PCA and neural network technology to reduce the
size of composite image features (nonlinear in nature) so that they can be used with an
existing distance-based index structure without any performance penalty. On the other
hand, improved retrieval effectiveness can, in principle, be achieved by compressing
more discriminating information (i.e., integrating more visual features) into the final
vector. Thus, in this chapter, we also investigate precisely how much improvement in
BACKGROUND
Image Feature Dimension Reduction
Trying to implement computer systems that mimic how the human visual system
processes images is a very difficult task, because humans use different features to identify and classify images in different contexts, and do not give equal weight to the various features even within a single context.
S = w_c S_c + w_t S_t        (1)

where S_c and S_t are the color and texture similarity functions, respectively, and w_c and w_t are weighting factors. However, the criteria for selecting these weighting factors are
not mentioned in their research work. From the statistics viewpoint, by treating the
weighting factors as normalization factors, the definition is just a natural extension of the
Euclidean distance function to a high-dimensional space in which the coordinate axes
are not commensurable.
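The weighted combination in Equation 1 can be sketched in a few lines. The inverse-distance form of the per-feature similarity function and the equal weights below are illustrative assumptions; as noted above, the criteria for selecting the weighting factors are typically left unspecified.

```python
import numpy as np

def feature_similarity(q, x):
    """Similarity derived from Euclidean distance between two feature
    vectors (an illustrative choice; Equation 1 does not fix the form
    of the per-feature similarity functions)."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(q) - np.asarray(x)))

def combined_similarity(query, image, w_c=0.5, w_t=0.5):
    """S = w_c * S_c + w_t * S_t (Equation 1).
    `query` and `image` are dicts holding a color and a texture vector."""
    s_c = feature_similarity(query["color"], image["color"])
    s_t = feature_similarity(query["texture"], image["texture"])
    return w_c * s_c + w_t * s_t

q = {"color": [0.2, 0.8], "texture": [0.1, 0.4]}
x = {"color": [0.2, 0.8], "texture": [0.1, 0.4]}
print(combined_similarity(q, x))  # identical vectors give S = 1.0
```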
The question that remains to be answered is whether a Euclidean distance function
for similarity measures best correlates with the human perceptual process for image
recognition. That is, when humans perceive two images as similar, can a distance function
given in the form in Equation 1 be defined? Does this same function hold for another pair
of images that are also perceived as similar? So far, no experiments have been conducted
that demonstrate (or counter-demonstrate) whether linear combinations of different
image features are valid similarity measures based on human visual perception. Also, designing a distance function that mimics human perception by approximating the perceptual weights of the various visual features has not been attempted before. Thus,
incorporating human visual perception into image similarity measurement is the other
major motivation behind our work.
Color Features
It is known that the human eye responds well to color. In this work, the color feature
is extracted using the histogram technique (Swain & Ballard, 1991). Given a discrete color
space defined by some color axes, the color histogram is obtained by discretizing the
image colors and counting the number of times each discrete color occurs in the image.
In our experiments, the color space we apply is CIE L*u*v. The reason that we select CIE
L*u*v instead of normal RGB or other color space is that it is more perceptually uniform.
The three axes of L*u*v space are divided into four sections respectively, so we get a
total of 64 (4x4x4) bins for the color histogram. However, for the image collection that we
use, there are bins that never receive any count. In our experiments, the color features
are represented as 37-dimensional vectors after eliminating the bins that have zero count.
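The 4×4×4 quantization described above can be sketched as follows. The random array standing in for an L*u*v image, and the per-channel scaling to [0, 1), are assumptions for illustration; a real pipeline would first convert RGB pixels to CIE L*u*v.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an image already converted to CIE L*u*v, with each
# channel rescaled to [0, 1); the RGB-to-L*u*v conversion is omitted.
luv = rng.random((64, 64, 3))

# Quantize each axis into four sections: 4 * 4 * 4 = 64 bins.
bins = np.clip((luv * 4).astype(int), 0, 3)
flat = bins[..., 0] * 16 + bins[..., 1] * 4 + bins[..., 2]
hist = np.bincount(flat.ravel(), minlength=64)

# Bins that never receive a count over the whole collection are
# dropped; the chapter reports 37 nonzero bins for its collection.
nonzero = hist[hist > 0]
print(hist.size, hist.sum())  # 64 bins, counts sum to the pixel total
```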
Texture Features
Texture characterizes objects by providing measures of properties such as smoothness, coarseness and regularity. In this work, the texture feature is extracted using a filter-based method. This method uses the amplitude spectra of images. It detects the global
periodicity in the images by identifying high-energy, narrow peaks in the spectrum. The
advantage of filter-based methods is their consistent interpretation of feature data over
both natural and artificial images.
The Gabor filter (Turner, 1986) is a frequently used filter in texture extraction. It
measures a set of selected orientations and spatial frequencies. Six frequencies are
required to cover the range of frequencies from 0 to 60 cycles/degree. We choose 1, 2,
4, 8, 16 and 32 cycles/degree to cover the whole range of human visual perception.
Therefore, the total number of filters needed for our Gabor filter is 30, and texture features
are represented as 30-dimensional vectors.
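A minimal sketch of such a filter bank follows. The chapter does not state the number of orientations, so five are assumed (6 frequencies × 5 orientations = the 30 filters reported); the kernel size, Gaussian width, and the mapping of cycles/degree onto pixel units are likewise illustrative choices.

```python
import numpy as np

def gabor_kernel(frequency, theta, size=15, sigma=3.0):
    """Real part of a 2-D Gabor kernel: a Gaussian envelope modulating
    a cosine wave at the given frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * frequency * xr)

frequencies = [1, 2, 4, 8, 16, 32]                # cycles/degree, as in the text
orientations = [k * np.pi / 5 for k in range(5)]  # assumed: 5 orientations

bank = [gabor_kernel(f, t) for f in frequencies for t in orientations]
print(len(bank))  # 30 filters -> a 30-dimensional texture vector
```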
Shape Features
Shape is an important and powerful attribute for image retrieval. It can represent
spatial information that is not presented in color and texture histograms. In our system,
the shape information of an image is described based on its edges. A histogram of the
edge directions is used to represent global information of shape attribute for each image.
We used the Canny edge operator (Canny, 1986) to generate edge histograms for images
in the preprocessing stage. To solve the scale invariance problem, the histograms are normalized by the number of edge points in each image. In addition, the smoothing procedures presented in Jain and Vailaya (1996) are used to make the histograms invariant to
rotation. The histogram of edge directions is represented by 30 bins. Shape features are
thus presented as 30-dimensional vectors.
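The edge-direction histogram can be sketched as below. For brevity, a simple gradient-magnitude threshold stands in for the Canny operator, and the smoothing step of Jain and Vailaya (1996) is omitted.

```python
import numpy as np

def edge_direction_histogram(image, n_bins=30, thresh=0.1):
    """Histogram of edge directions, normalized by the number of edge
    points for scale invariance.  A gradient-magnitude threshold
    stands in for the Canny detector used in the chapter."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    edges = magnitude > thresh
    angles = np.arctan2(gy[edges], gx[edges])          # range (-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(edges.sum(), 1)                  # normalize

img = np.zeros((32, 32))
img[:, 16:] = 1.0                                      # one vertical edge
h = edge_direction_histogram(img)
print(h.shape)  # a 30-bin shape vector summing to 1
```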
When forming composite feature vectors from the three types of features described
above, the most common approach is to use the direct sum operation. Let xc, xt and xs be
the color, texture and shape feature vectors; the direct sum operation, denoted by the symbol ⊕, of these three feature vectors is defined as follows:

x = x_c ⊕ x_t ⊕ x_s        (2)
Figure 1. A hybrid image feature dimension reduction scheme. The linear PCA appears
at the bottom, the nonlinear neural network is at the top, and the representation of
lower dimension vector appears in the hidden layer.
The number of dimensions of the composite feature vector x is then the sum of those of the single feature vectors, that is, dim(x) = dim(x_c) + dim(x_t) + dim(x_s).
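Since the direct sum of real vectors amounts to concatenation, forming the composite vector is straightforward. The dimensions (37 + 30 + 30 = 97) follow the feature sizes reported above; the random vectors are placeholders for real extracted features.

```python
import numpy as np

rng = np.random.default_rng(1)
x_c = rng.random(37)   # color histogram (37 nonzero bins)
x_t = rng.random(30)   # Gabor texture vector
x_s = rng.random(30)   # edge-direction shape vector

# Direct sum of the single-feature vectors = plain concatenation,
# so dim(x) = dim(x_c) + dim(x_t) + dim(x_s).
x = np.concatenate([x_c, x_t, x_s])
print(x.shape[0])  # 97
```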
Given the set of feature vectors {x_k = (x_k1, x_k2, ..., x_kn) ∈ R^n | k = 1...N} and the mean vector x̄, the covariance matrix S can be calculated as

S = (1/N) Σ_{k=1}^{N} (x_k − x̄)(x_k − x̄)^T
Let v_i and λ_i be a pair of eigenvector and eigenvalue of the covariance matrix S. Then v_i and λ_i satisfy the following:

λ_i = (1/N) Σ_{k=1}^{N} (v_i^T (x_k − x̄))^2

The eigenvalues λ_i, i = 1...n, measure the variance of the data along the corresponding eigenvectors, and since the λ_i can be arranged in decreasing order, that is, λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0, if the m (where m < n) largest eigenvalues account for a large percentage of the variance then, with an n×m linear transformation matrix T defined as

T = [v_1, v_2, ..., v_m],        (3)

the m×n transformation T^T transforms the original n-dimensional feature vectors to m-dimensional ones. That is,

T^T (x_k − x̄) = y_k,   k = 1...N        (4)

where y_k ∈ R^m for all k. The matrix T above has orthonormal columns because {v_i | i = 1...n} form an orthonormal basis.
The key idea in dimension reduction via PCA is in the computation of the eigenvalues and eigenvectors of S, the user-determined value m, and finally the m×n orthogonal matrix T^T, which is the required linear transformation. The feature vectors in the original n-dimensional space can be projected onto an m-dimensional subspace via the transformation T^T. The value of m is normally determined by the percentage of variance that the system can afford to lose. The i-th component of the y_k vector in (4) is called the i-th principal component (PC) of the original feature vector x_k. Alternatively, one may consider just the i-th column of the matrix T defined in (3), and then the i-th principal component of x_k is simply

y_ki = v_i^T (x_k − x̄)
PCA has been employed to reduce the dimensions of single feature vectors so that
an efficient index can be constructed for image retrieval in an image database (Euripides
& Faloutsos, 1997; Lee, 1993). It has also been applied to image coding, for example, for
removing correlation from highly correlated data such as face images (Sirovich & Kirby,
1987). In this work, PCA is used as the first step in the NLDR method where it provides
optimal reduced dimensional feature vectors for the three-layer neural network, and thus
speed up the NLDR training time.
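The PCA procedure just described can be sketched in a few lines of Python (a minimal sketch using NumPy; function and variable names are ours, not the chapter's C++/Java implementation):

```python
import numpy as np

def pca_reduce(X, variance_cutoff=0.99):
    """Project n-dimensional feature vectors onto the m leading principal
    components that retain at least `variance_cutoff` of the total variance.

    X: (N, n) array, one feature vector x_k per row.
    Returns (Y, T, mean): reduced vectors (N, m), transform T (n, m), mean x_bar.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                               # center: x_k - x_bar
    S = (Xc.T @ Xc) / X.shape[0]                # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)        # S is symmetric: real eigenpairs
    order = np.argsort(eigvals)[::-1]           # sort lambda_1 >= ... >= lambda_n
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance fraction
    m = int(np.searchsorted(ratio, variance_cutoff)) + 1
    T = eigvecs[:, :m]                          # T = [v_1, ..., v_m] (Equation 3)
    Y = Xc @ T                                  # y_k = T^T (x_k - x_bar) (Equation 4)
    return Y, T, mean
```

With the 99% cut-off used later in the chapter, m is simply the smallest number of leading eigenvalues whose cumulative variance reaches that threshold.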
net_i = \sum_{j} s_j w_{ij} + \theta_i   (5)

where j is a predecessor unit of i, the term w_{ij} is the interconnection weight from unit j
to unit i, and \theta_i is the bias value of unit i. Passing the value net_i through a nonlinear
activation function, the activation value s_i of unit i can be obtained. The sigmoid logistic
function

s_i = \frac{1}{1 + e^{-net_i}}   (6)

is a typical choice. A supervised learning process governed by the training patterns will adjust the weights in the network so that
a desired mapping of input to output activation can be obtained. Given that we have a
set of feature vectors and their appropriate class number classified by the subjects, the
goal of the supervised learning is to seek the global minimum of the cost function E:

E = \frac{1}{2} \sum_{p} \sum_{j} (t_{pj} - o_{pj})^2   (7)
where t pj and opj are, respectively, the target output and the actual output for feature
vector p at node j. The rule for updating the weights of the network can be defined as
follows:
\Delta w_{ij}(t) = \epsilon \, d(t)   (8)

w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)   (9)

where \epsilon is the parameter that controls the learning rate, and d(t) is the direction along
which the weights need to be adjusted in order to minimize the cost function E. There are
many learning algorithms for performing weight updates. The quickprop algorithm is one
of the most frequently used adaptive learning paradigms. The weight update can be obtained
by the equation

\Delta w_{ij}(t) = \frac{\dfrac{\partial E}{\partial w_{ij}}(t)}{\dfrac{\partial E}{\partial w_{ij}}(t-1) - \dfrac{\partial E}{\partial w_{ij}}(t)} \; \Delta w_{ij}(t-1)   (10)
When the total error falls below a predefined threshold, the network is said to have converged. The total error is the sum, over all training patterns, of the differences between the actual and the desired outputs. Since the network also functions as a pattern classifier, convergence can alternatively be measured by the total number of error bits, that is, the number of output units whose actual value differs from the desired value.
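The quickprop update in Equation 10 can be sketched as follows (a hedged sketch with our own names; the growth limit μ = 1.75 matches the setting reported later for the experiments, and the fallback to plain gradient descent when no previous step exists is our own choice):

```python
def quickprop_step(grad_now, grad_prev, delta_prev, epsilon=0.35, mu=1.75):
    """One quickprop weight change Delta w_ij(t) (Equation 10).

    grad_now, grad_prev: dE/dw_ij at steps t and t-1
    delta_prev: previous weight change Delta w_ij(t-1)
    epsilon: learning rate used to bootstrap the very first step
    mu: maximum growth factor over the previous step
    """
    if delta_prev == 0.0:
        return -epsilon * grad_now              # no history yet: gradient descent
    denom = grad_prev - grad_now
    if denom == 0.0:
        return mu * delta_prev                  # flat secant: take the maximum step
    step = (grad_now / denom) * delta_prev      # Equation 10 (secant step)
    if abs(step) > mu * abs(delta_prev):        # limit growth to mu times previous
        step = mu * abs(delta_prev) * (1.0 if step > 0 else -1.0)
    return step
```

On a quadratic error surface the secant step lands exactly on the minimum: with E = w^2, moving from w = 2 (gradient 4) to w = 1 (gradient 2) via a previous step of -1 yields a next step of -1, that is, w = 0.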
During the network training process, the network weights gradually converge and
the required mapping from image feature vectors to the corresponding classes is
implicitly stored in the network. After the network has been successfully trained, the
weights that connect the input and hidden layers are the entries of a transformation that maps
the feature vectors to lower-dimensional vectors. When a high-dimensional feature
vector is passed through the network, its activation values in the hidden units form a
lower-dimensional vector. This lower-dimensional feature vector keeps the most important
discriminative information of the original feature vectors.
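This use of the trained network as a dimension reducer can be sketched as follows (layer sizes and names are hypothetical, not taken from the chapter's implementation); the hidden-unit activations of Equations 5 and 6 form the reduced feature vector:

```python
import numpy as np

def sigmoid(net):
    # Equation 6: s_i = 1 / (1 + e^{-net_i})
    return 1.0 / (1.0 + np.exp(-net))

def hidden_features(x, W_ih, theta_h):
    """Map a feature vector x to the activations of the hidden units:
    net_i = sum_j w_ij s_j + theta_i (Equation 5), then the sigmoid."""
    return sigmoid(W_ih @ x + theta_h)

# hypothetical sizes: 40 PCA-reduced inputs, 10 hidden units
rng = np.random.default_rng(1)
W_ih = rng.normal(scale=0.1, size=(10, 40))   # trained input-to-hidden weights
theta_h = np.zeros(10)                        # hidden biases
x = rng.normal(size=40)
z = hidden_features(x, W_ih, theta_h)         # 10-dimensional reduced vector
```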
Step 3: Compute the total variance s = \sum_i \lambda_i and select the m largest eigenvalues whose
sum just exceeds s \cdot \alpha\%, where \alpha is a predefined cut-off value. This step selects
the m largest eigenvalues that account for \alpha\% of the total variance of the feature
vectors.
Step 4: Construct matrix T using the m corresponding eigenvectors as given in Equation 3.
Step 5: Obtain the new representation yk for each image feature vector xk by applying the
PCA transformation given in Equation 4.
Step 6: Select the training samples from the image collection. Group these training
samples into different classes as determined by the experiments described in
Section 3.2.2.
Step 7: Construct the composite feature vectors zk from the color, texture and shape
feature vectors using the direct sum operation defined in Equation 2.
Step 8: Prepare the training patterns (z_k, c_k) for all k, where c_k is the class number to which the
composite feature vector z_k belongs.
Step 9: Set all the weights and node offsets of the network to small random values.
Step 10: Present the training patterns z k as input and ck as output to the network. The
training patterns can be different on each trial; alternatively, the training patterns
can be presented cyclically until the weights in the network stabilize.
Step 11: Use the quickprop-learning algorithm to update the weights of the network.
Step 12: Test the convergence of the network. If the condition of convergence of the
network is satisfied, then stop the network training process. Otherwise, go back
to Step 10 and repeat the process. If the network does not converge, it needs a new
starting point. Thus, it is necessary to go back to Step 9 instead of Step 10.
Steps 1~5 cover the PCA dimension reduction procedure, which was applied to all
images in the data rather than only to the training samples. This has the advantage that
the covariance matrix for each type of single feature vector contains the global variance
of images in the database. The number of principal components to be used is determined
by the cut-off value \alpha. There is no formal method to define this cut-off value. In Step
3, the cut-off value \alpha is set to 99, so the minimum variance that is retained after PCA
dimension reduction is at least 99%.
After the completion of PCA, the images are classified into classes in Step 6. Steps
7~12 then prepare the necessary input and output values for the network training
process. The network training corresponds to Steps 8~11. As noted above, the weight
of each link is initialized to a random small continuous value. In the quickprop-learning
algorithm, the parameter \mu that limits the step size is set to 1.75, and the learning rate \epsilon
for the gradient descent can vary from 0.1 to 0.9. Each time we apply the quickprop-learning algorithm, the weight of each link in the network is updated. After a specified
number of applications of the quickprop-learning algorithm, the convergence of the
network is tested in Step 12. At this point, it is decided whether the network has
converged or a new starting weight is required for each link of the network. In the latter
case, the process involved in Steps 9~12 is repeated.
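The control flow of Steps 9~12 can be summarized in the following skeleton (a sketch with our own names; `update_fn` stands in for one pass of the quickprop weight update and returns the updated weights together with the current error):

```python
def train_network(patterns, init_weights, update_fn, max_restarts=5,
                  max_epochs=10000, error_threshold=0.02):
    """Skeleton of Steps 9~12: initialize weights to small random values
    (Step 9), repeatedly present patterns and update weights (Steps 10-11),
    test convergence (Step 12), and restart from a fresh initialization
    if the network fails to converge."""
    for restart in range(max_restarts):
        weights = init_weights()                    # Step 9
        for epoch in range(max_epochs):             # Steps 10-11
            weights, error = update_fn(weights, patterns)
            if error < error_threshold:             # Step 12: converged
                return weights, error, restart
    raise RuntimeError("network failed to converge after all restarts")
```

The 0.02 default mirrors the ultimate network error reported in the discussion of Table 4.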
The CMVF
The CMVF framework has been designed and fully implemented with the C++ and
Java programming languages, and an online demonstration with a CGI-based Web
interface is available for users to evaluate the system (Shen, 2003).
Figure 3 presents the various components of this system. Users can submit an
image, either from the existing image database or from another source, as a query. The system will
search for the images that are most similar in visual content; the matching images are
displayed in similarity order, starting from the most similar, and users can score the
results. The query can be executed with any of the following retrieval methods: PCA only,
neural network only, and CMVF with different visual feature combinations. Users can also
choose a distorted version of the selected image as the query example to demonstrate
CMVF's robustness against image variability.
buildings, plants, animals, rocks, flags, and so forth. All images were scaled to the same
size (128 × 128 pixels).
A subset of this collection was then selected to form the training samples (test-images). There were three steps involved in forming the training samples. Firstly, we
decided on the number of classes according to the themes of the image collection and
selected one image for each class from the collection of 10,000 images. This can be done
with the help of a domain expert. Next, we built three M-tree image databases for the
collection. The first one used color as the index, the second used texture as the index and
the third one used shape as the index. For each image in each class, we retrieved the most
similar images in color using the color index to form a color collection. We then repeated
the same procedure to get images similar in texture and in shape for each image in each
class to form a texture collection and a shape collection. Finally, we got our training
samples1 that are similar in color, in texture and in shape by taking the intersection of
images from the color, texture and shape collections. The training samples (test-images)
were presented to the subjects for classification. To test the effectiveness of additional
feature integration in image classification and retrieval, we used the same procedure as
described in the previous section to generate test-images with an additional visual
feature.
Evaluation Metrics
In our experiment, since not all relevant images are examined, some common
measurements such as standard recall and precision are inappropriate. Thus, we select
the concepts of normalized precision (Pn) and normalized recall (Rn) (Salton & McGill,
1993) as metrics for evaluation. High precision means that we have few false alarms (i.e.,
few irrelevant images are returned) while high Recall means we have few false dismissals
(i.e., few relevant images are missed). The formulas for these two measures are
R_n = 1 - \frac{\sum_{i=1}^{R} (rank_i - i)}{R (N - R)}

P_n = 1 - \frac{\sum_{i=1}^{R} \log rank_i - \sum_{i=1}^{R} \log i}{\log \frac{N!}{(N - R)! \, R!}}
where N is the number of images in the dataset and is equal to 10,000, R is the number
of relevant images and the rank order of the i-th relevant image is denoted by ranki. During
the test, the top 60 images are evaluated in terms of similarity.
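The two measures can be implemented directly from the formulas above (our own sketch of the standard Salton & McGill definitions; the example ranks are hypothetical):

```python
import math

def normalized_recall(ranks, N):
    """Rn = 1 - sum_{i=1..R}(rank_i - i) / (R(N - R)); ranks sorted ascending."""
    R = len(ranks)
    return 1.0 - sum(r - i for i, r in enumerate(ranks, start=1)) / (R * (N - R))

def normalized_precision(ranks, N):
    """Pn = 1 - (sum log rank_i - sum log i) / log(N! / ((N-R)! R!))."""
    R = len(ranks)
    num = sum(math.log(r) for r in ranks) - sum(math.log(i) for i in range(1, R + 1))
    # log of the binomial coefficient via log-gamma, to avoid huge factorials
    den = math.lgamma(N + 1) - math.lgamma(N - R + 1) - math.lgamma(R + 1)
    return 1.0 - num / den
```

A perfect ranking (the R relevant images occupy ranks 1..R) gives Rn = Pn = 1; both measures fall toward 0 as relevant images drift down the ranked list.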
Figure 4. Comparing the normalized recall and precision rates of PCA, CMVF, and the neural network across image classes (x-axis: Class ID)
Table 1. Average recall and precision rates and training cost of the three methods

Method         | Ave. Recall Rate (%) | Ave. Prec. Rate (%) | Training Cost (epochs)
PCA            | 63.1                 | 44.6                | N/A
Neural network | 77.2                 | 60.7                | 7035
CMVF           | 77.2                 | 60.7                | 4100
from high-dimensional raw feature vectors via PCA and a trained neural network
classifier, which can compress not only various kinds of visual features but also semantic
classification information into a small feature vector. Moreover, we can also see from
Figure 4 that the recall and precision values of the neural network and the hybrid method are
almost the same. The major difference between the two approaches is the time required
to train the network. Based on Table 1, compared with the pure neural network, CMVF saves
nearly 40% of the training time. This efficiency is gained by using a relatively
small number of neural network inputs. One can therefore conclude that it is advantageous to use hybrid dimension reduction to reduce the dimensions of image features
for effective indexing.
An example illustrating the query effectiveness of the different dimension reduction
methods is shown in Appendix A. We use an image of a cat as the query example.
Compared with PCA, CMVF achieves superior retrieval results: in the first nine results,
CMVF returns nine out of nine matches, while PCA retrieves only two similar images among the
top nine. On the other hand, the query effectiveness of the reduced feature space produced by
CMVF is very close to that of the pure neural network, also with nine out of nine
matches; the major difference is the order of the images in the final result list. We
conclude from this experiment that, by incorporating human visual perception, CMVF
is indeed an effective and efficient dimension reduction technique for indexing large
image databases.
Figure 5a. Comparing precision and recall rate of CMVF with different visual feature combinations (x-axis: Class ID)
Figure 5b. Comparing precision and recall rate of neural network with different visual
feature combinations
Figure 5c. Comparing precision and recall rate of PCA with linear concatenation of
different visual feature combinations
considers two features, respectively. However, the advantage for CMVF over pure neural
network is that it requires less training cost to achieve results with the same quality. On
the other hand, from Figure 5c we can see that the query effectiveness of a feature vector
generated by PCA does not show any improvement with additional visual feature
integration. In fact, there is a slight drop in precision and recall rates in some
cases. For example, in image class 5, if the system uses only color and texture, a 61%
normalized recall rate can be achieved; interestingly, the normalized recall rate with a feature
combination that includes color, texture and shape is only 60%, which remains close to
that achieved using just color and texture.
Appendix B shows an example of the query effectiveness gained by adding the
shape feature; clearly, its addition produces better query results. We
used an image of a cat as the query example. With a feature configuration including
color, texture and shape, CMVF retrieved 12 cat images in the first 12 matches.
Without the shape feature, only seven cat images were returned in the top
12 matches.
Robustness
Robustness is a very important feature for a Content-Based Image Retrieval (CBIR)
system. In this section, we investigate CMVF robustness against both image distortion
and the initial configuration of neural network.
Image Distortion
Humans are capable of correctly identifying and classifying images, even in the
presence of moderate amounts of distortion. This property is potentially useful in real-life image database applications, where the query image may have accompanying noise
and distortion; a typical example is a low-quality scan of a
photograph. Since CMVF is trained to reduce the dimensionality of raw visual
feature vectors, this suggests that if we were to train it using not only the original
images but also distorted versions of those images, it might become more robust in recognizing
images with minor noise or distortion.
We used modified versions of images as additional learning examples for training
purposes and carried out a series of experiments to determine how much
improvement would occur with this additional training. We randomly chose 10 images
from each category in the training data, and applied a specific distortion to each image
and included the distorted image in the training data. This process was repeated for each
type of distortion, to yield a neural network that should have been trained to recognize
images in the presence of any of the trained distortions. In order to evaluate the effect
of this on query performance, we ran the same set of test queries to measure precision
and recall rate. However, each query image was distorted before using it as query, and
the ranks of the result images for this query were compared against the ranks of result
images for the nondistorted query image. This was repeated for varying levels of
distortion.
Figure 6 summarizes the results, and Appendix C shows a query example. With the
incorporation of human visual perception, CMVF is a robust indexing technique: it
performs well under different kinds of image variation, including color distortion, sharpness
changes, shifting and rotation (Gonzalez & Woods, 2002). The experiment shows that, on
average, CMVF is robust to blurring with an 11 × 11 Gaussian or median filter,
random spread by 10 pixels, pixelization by nine pixels, and various kinds of noise,
including Gaussian and salt-and-pepper noise.
rank\_dev = \frac{1}{N \cdot S} \sum_{n=1}^{N} \sum_{s=1}^{S} \left| rank_{ns} - ini\_rank_n \right|
where N is the total number of reference images in the study list, ini_rankn is the initial
rank for the reference image n, rankns is the rank for reference image n in system s, and
the number of systems with different initial states is denoted by S. If the CMVF is
insensitive to its initialization, reference images should have roughly the same ranking
in each of the systems. Table 3 shows that this is not the case. The average rank_dev
for all reference images is 16.5. Thus, in fact, overall the initialization of the neural network
does influence the result. However, in order to study this effect in more detail, we divided
the reference images into six groups (study lists) based on their initial position in system
one: group 1 represents the top 10 (most similar) images (with initial rank from 1 to 10),
group 2 contains the next most similar images (with initial rank from 11 to 20), and so on,
up to group 6, which contains images initially ranked 51-60. If we look at the lower part
of the reference image list (such as group 5 and group 6), we can see that rank_dev is
quite large. This means the initial status of the neural network has a big impact on the
order of results. However, the rank_dev is fairly small for the top part (such as group 1)
of the ranked list. This indicates that for top-ranked images (the most similar images),
the results are relatively insensitive to differences in the neural network initial configuration.
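Reading rank_dev as the mean absolute deviation of each reference image's rank from its initial rank, averaged over images and systems (an assumption on our part, since the printed formula is partially garbled), the measure can be computed as:

```python
def rank_dev(ini_ranks, system_ranks):
    """Mean absolute rank deviation (our reading of the rank_dev formula).

    ini_ranks: dict mapping reference image n -> initial rank ini_rank_n
    system_ranks: list of S dicts, one per differently initialized system,
                  each mapping image n -> rank_ns in that system
    """
    N, S = len(ini_ranks), len(system_ranks)
    total = sum(abs(ranks[n] - ini_ranks[n])
                for ranks in system_ranks
                for n in ini_ranks)
    return total / (N * S)
```

If every system reproduces the initial ranking exactly, rank_dev is 0; larger values indicate stronger sensitivity to the network's initial configuration.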
Figure 6. Robustness of CMVF under varying levels of image distortion: (a) blur (x-axis: size of filter); pixelize and random spread (x-axis: pixels of variation); Gaussian noise (x-axis: standard deviation); salt-and-pepper noise (x-axis: percentage of noise pixels); brighten, darken, sharpen, and more/less saturation (x-axis: percentage of variation)
applied alone. In this section we present a discussion of the issues related to the
performance of this hybrid method.
Table 3. rank_dev for all reference images and for groups 1-6 (group 1 = initial ranks 1-10, ..., group 6 = initial ranks 51-60), per image class

Class No | All  | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6
1        | 14.5 | 0.4     | 1.2     | 5.7     | 10.4    | 26.4    | 42.7
2        | 18.6 | 0.5     | 1.3     | 7.1     | 12.3    | 38.3    | 52.1
3        | 16.3 | 0.7     | 1.8     | 6.6     | 11.8    | 28.8    | 47.6
4        | 17.2 | 0.4     | 1.9     | 5.9     | 12.9    | 32.9    | 48.9
5        | 17.8 | 0.6     | 1.3     | 7.5     | 11.7    | 36.7    | 49.5
6        | 15.4 | 0.3     | 1.8     | 7.8     | 10.5    | 33.5    | 38.8
7        | 15.9 | 0.8     | 1.7     | 7.6     | 10.9    | 34.9    | 39.6
8        | 15.7 | 0.5     | 2.8     | 6.7     | 11.4    | 32.4    | 40.7
9        | 15.9 | 0.7     | 2.1     | 7.5     | 12.4    | 31.4    | 41.5
10       | 17.4 | 0.6     | 2.3     | 6.8     | 9.8     | 35.8    | 48.8
11       | 17.1 | 0.6     | 1.9     | 6.9     | 10.7    | 33.3    | 46.1
12       | 15.9 | 0.5     | 1.7     | 6.7     | 12.1    | 34.6    | 47.4
13       | 16.1 | 0.7     | 1.6     | 7.1     | 12.5    | 32.9    | 44.1
14       | 16.9 | 0.6     | 2.0     | 6.9     | 10.3    | 31.6    | 42.8
Average  | 16.5 | 0.6     | 1.8     | 6.9     | 11.4    | 33.1    | 45.1
dimension reduction for the collection of images in this section. Table 4 shows the
learning time for different numbers of PCs.
It can be seen that the number of PCs that gives the best network training in our application
depends on their total variance. There are no significant differences in the time required
for network training from 35 to 50 PCs, since these account for more than 99% of the total
variance. Moreover, since the eigenvalues are in decreasing order, increasing the number
of PCs beyond the first 40 does not require much extra time to train the network; for
example, there is a difference of only 40 epochs between 45 PCs and 50 PCs. However, if we
choose a number of PCs whose total variance is less than 90% of the total variance,
then the differences are significant. It takes 7100 epochs for 10 PCs, which account for 89.7%
of the total variance, to reach the ultimate network error of 0.02, far more than
the number of epochs needed when 35 or more PCs are used.
Table 4. Total variance and learning errors for different numbers of PCs

Total Variance (%) | 81.5 | 89.7 | 93.8 | 95.5 | 97.5 | 98.1 | 99.1 | 99.4 | 99.7 | 99.8
Learning Errors    | 57.3 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02
SUMMARY
To tackle the dimensionality curse problem for multimedia databases, we have
proposed a novel indexing scheme by combining different types of image features to
support queries that involve composite multiple features. The novelty of this approach
is that various visual features and semantic information can be easily fused into a small
feature vector that provides effective (good discrimination) and efficient (low dimensionality) retrieval. The core of this scheme is to combine PCA and a neural network into a
hybrid dimension reducer. PCA provides the optimal selection of features to reduce the
training time of the neural network. Through the learning phase of the network, the context
that the human visual system uses for judging the similarity of visual features in images is
acquired; after training, it is implicitly represented in the network weights.
The feature vectors computed at the hidden units of the neural network (which have a small
number of dimensions) become our reduced-dimensional composite image
features. The distance between any two feature vectors at the hidden layer can be used
directly as a measure of similarity between the two corresponding images.
We have developed a learning algorithm to train the hybrid dimension reducer. We
tested this hybrid dimension reduction method on a collection of 10,000 images. The
result is that it achieved the same level of accuracy as the standard neural network
approach with a much shorter network training time. We have also presented the output
quality of our hybrid method for indexing the test image collection using M-trees. This
shows that our proposed hybrid dimension reduction of image features can correctly and
efficiently reduce the dimensions of image features and accumulate the knowledge of
human visual perception in the weights of the network. This suggests that other existing
access methods may also be used efficiently. Furthermore, the experimental results
illustrate that, by integrating additional visual features, CMVF's retrieval effectiveness can be improved significantly. Finally, we have demonstrated that CMVF can be
made robust against a range of image distortions and is not significantly affected by the
initial configuration of the neural network. An issue that remains to be studied is
the establishment of a formal framework for studying the effectiveness and efficiency of
additional visual feature integration. There is also a need to investigate more advanced
machine learning techniques that can incrementally reclassify images as new images are added.
REFERENCES
Behrens, R. (1984). Design in the visual arts. Englewood Cliffs, NJ: Prentice Hall.
Bozkaya, T., & Özsoyoglu, M. (1997). Distance-based indexing for high-dimensional
metric spaces. In Proceedings of the 16th ACM SIGMOD International Conference
on Management of Data (SIGMOD'97), Tucson, Arizona, USA (pp. 357-368).
Brin, S. (1995). Near neighbor search in large metric spaces. In Proceedings of the 21st
International Conference on Very Large Data Bases (VLDB95), Zurich, Switzerland (pp. 574-584).
Canny, J. (1986). A computational approach to edge detection. IEEE Trans. Pattern Anal.
Mach. Intell., 8(6), 679-698.
Chiueh, T. (1994). Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Databases (VLDB94), Santiago de Chile, Chile
(pp. 582-593).
Ciaccisa, P., & Patella, M. (1998). Bulk loading the m-tree. In Proceedings of the Ninth
Australian Database Conference (ADC98), Perth, Australia (pp. 15-26).
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for
similarity search in metric spaces. In Proceeding of the 23rd VLDB International
Conference on Very Large Databases (VLDB97), Athens, Greece (pp. 426-435).
Euripides, G.M.P., & Faloutsos, C. (1997). Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering, 9(3), 435-447.
Fahlman, S.E. (1988). An empirical study of learning speed for back-propagation
networks. Technical Report CMU-CS-88-162, Carnegie Mellon University.
Faloutsos, C., Barber, R., Flickner, M., Niblack, W., Peetkovic, D., & Equitz, W. (1994).
Efficient and effective querying by image content. Journal of Intelligent Information System, 3(3/4), 231-261.
Fukunaga, K., & Koontz, W. (1970). Representation of random processes using the
Karhunen-Loève expansion. Information and Control, 16(1), 85-101.
Hellerstein, J.M., Naughton, J.F., & Pfeffer, A. (1995). Generalized search trees for
database systems. In Proceedings of the 21 st International Conference on Very
Large Data Bases (VLDB95), Zurich, Switzerland (pp. 562-573).
Gonzalez, R., & Woods, R. (2002). Digital image processing. New York: Addison Wesley.
Jain, A.K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition, 29(8), 1233-1244.
Kittler, J., & Young, P. (1973). A new approach to feature selection based on the
Karhunen-Loève expansion. Pattern Recognition, 5(4), 335-352.
Lee, D., Barber, R.W., Niblack, W., Flickner, M., Hafner, J., & Petkovic, D. (1993). Indexing
for complex queries on a query-by-content image. In Proceedings of SPIE Storage
and Retrieval for Image and Video Database III, San Jose, California (pp. 24-35).
Lerner, R.M., Kendall, P.C., Miller, D.T., Hultsch, D.F., & Jensen, R.A. (1986). Psychology. New York: Macmillan.
Lowe, D.G. (1985). Perceptual organization and visual recognition. Kluwer Academic.
Salton, G., & McGill, M. (1993). Introduction to modern information retrieval. New York:
McGraw-Hill.
Sellis, T., Roussopoulos, N., & Faloutsos, C. (1987). The R+-tree: A dynamic index for
multidimensional objects. In Proceedings of the 12th International Conference on
Very Large Databases (VLDB87), Brighton, UK (pp. 507-518).
Shen, J., Ngu, A.H.H., Shepherd, J., Huynh, D., & Sheng, Q.Z. (2003). CMVF: A novel
dimension reduction scheme for efficient indexing a large image database. In
Proceedings of the 22nd ACM SIGMOD International Conference on Management
of Data (SIGMOD03), San Diego, California (p. 657).
Sirovich, L., & Kirby, M. (1987). A low-dimensional procedure for the characterization of
human faces. Journal of the Optical Society of America A, 4(3), 519-524.
Swain, M.J., & Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision,
7(1), 11-32.
Turner, M. (1986). Texture discrimination by Gabor functions. Biological Cybernetics, 55, 71-82.
White, D., & Jain, R. (1996). Similarity indexing with the ss-tree. In Proceedings of the
12 th International Conference on Data Engineering, New Orleans (pp. 516-523).
Wu, J.K. (1997). Content-based indexing of multimedia databases. IEEE Transactions on
Knowledge and Data Engineering, 9(6), 978-989.
ENDNOTE
1. The size of the training sample is predefined. In this study, the size is 163.
APPENDIX A
This example compares the query effectiveness of different dimension reduction methods, including CMVF, the pure neural network, and PCA, with a feature combination
including color, texture and shape.
APPENDIX B
An example that demonstrates query effectiveness improvement due to integration
of shape information
Query result of CMVF with color and texture: Seven out of twelve matches
Query result of CMVF with color, texture and shape: Twelve out of twelve matches
APPENDIX C
Chapter 2
From Classification
to Retrieval:
ABSTRACT
Users query images by using semantics. Though low-level features can be easily
extracted from images, they are inconsistent with human visual perception. Hence, low-level features cannot provide sufficient information for retrieval. High-level semantic
information is useful and effective in retrieval. However, semantic information is
heavily dependent upon semantic image regions and beyond, which are difficult to
obtain themselves. Bridging this semantic gap between computed visual features and
user query expectation poses a key research challenge in managing multimedia
semantics. As a spin-off from pattern recognition and computer vision research more
than a decade ago, content-based image retrieval research focuses on a different
problem from pattern classification though they are closely related. When the patterns
concerned are images, pattern classification could become an image classification
problem or an object recognition problem. While the former deals with the entire image
as a pattern, the latter attempts to extract useful local semantics, in the form of objects,
in the image to enhance image understanding. In this chapter, we review the role of
pattern classifiers in state-of-the-art content-based image retrieval systems and discuss
their limitations. We present three new indexing schemes that exploit pattern classifiers
for semantic image indexing, and illustrate the usefulness of these schemes on the
retrieval of 2,400 unconstrained consumer images.
INTRODUCTION
Users query images by using semantics. For instance, in a recent paper, Enser
(2000) gave a typical request to a stock photo library, using broad and abstract
semantics to describe the images one is looking for:
Pretty girl doing something active, sporty in a summery setting, beach not wearing
lycra, exercise clothes more relaxed in tee-shirt. Feature is about deodorant so girl
should look active not sweaty but happy, healthy, carefree nothing too posed or
set up nice and natural looking.
Using existing image processing and computer vision techniques, low-level features such as color, texture, and shape can be easily extracted from images. However, they
have proved to be inconsistent with human visual perception, let alone incapable
of capturing broad and abstract semantics as illustrated by the example above. Hence, low-level features cannot provide sufficient information for retrieval. High-level semantic
information is useful and effective in retrieval. However, semantic information is heavily
dependent upon semantic image regions and beyond, which are difficult to obtain
themselves. Between low-level features and high-level semantic information there is a
so-called semantic gap. Content-based image retrieval research has yet to bridge this
gap between the information that one can extract from the visual data and the
interpretation that the same data have for a user in a given situation (Smeulders et al.,
2000).
In our opinion, the semantic gap is due to two inherent problems. One is that extracting complete semantics from image data is extremely hard, as it demands general object recognition and scene understanding; this is called the semantics extraction problem. The other is the complexity, ambiguity, and subjectivity of user interpretation, that is, the semantics interpretation problem. Both are illustrated in Figure 1. We think that these two problems are manifestations of two one-to-many relations.
In the first one-to-many relation, which makes the semantics extraction problem difficult, a real-world object, say a face, can appear in many different guises in an image. This could be due to the illumination conditions when the image of the face is recorded; the parameters of the image capture device (focus, zoom, angle, distance, etc.); the pose of the person; the facial expression; artifacts such as spectacles and hats; variations due to moustache, aging, and so forth. Hence, the same real-world object may not have consistent color, texture, and shape as far as computer vision is concerned.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Figure 1. The semantics extraction problem and the semantics interpretation problem
RELEVANT RESEARCH
User studies on the behavior of users of image collections are limited. The most comprehensive effort to understand what users want to do with an image collection is Enser's work on image (Enser, 1993; Enser, 1995) and video (Armitage & Enser, 1997) libraries for media professionals. Other user studies have focused on newspaper photo archives (Ornager, 1996; Markkula & Sormunen, 2000), art images (Frost et al., 2000), and medical image archives (Keister, 1994). Typically, knowledgeable users searched and casual users browsed, but all users found both searching and browsing useful.
As digital cameras and camera phones proliferate, managing personal image collections effectively and efficiently, with semantic organization of and access to the images, is becoming a genuine problem to be tackled in the near future. The most relevant findings on how consumers manage their personal digital photos come from the user studies by K. Rodden (Rodden & Wood, 2003; Rodden, 1999). In particular, Rodden and Wood (2003) found that few people will perform annotation, and that comprehensive annotation, whether typed or spoken, is not practical. Without text annotation, text-based retrieval is impossible. Hence, the semantic gap problem remains unsolved.
Content-based image retrieval research has progressed from the pioneering feature-based approach (Bach et al., 1996; Flickner et al., 1995; Pentland et al., 1995) to the region-based approach (Carson et al., 1997; Li et al., 2000; Smith & Chang, 1996). In order to bridge
the semantic gap (Smeulders et al., 2000) that exists between computed perceptual visual
features and conceptual user query expectation, detecting semantic objects (e.g., faces,
sky, foliage, buildings, etc.) based on trained pattern classifiers has been an active trend
(Naphade et al., 2003; Town & Sinclair, 2000).
The MiAlbum system uses relevance feedback (Lu et al., 2000) to produce annotations for consumer photos. The text keywords in a query are assigned to positive feedback examples (i.e., retrieved images that are considered relevant by the user who issues the query). This requires constant user intervention (in the form of relevance feedback), and the keywords issued in a query might not correspond to what is considered relevant in the positive examples. As an indirect form of annotation, the process is slow and inconsistent across users. There is also the small-sample problem in retrieval using relevance feedback: the small number of samples lacks statistical significance. Learning with feedback is not stable due to the inconsistency of users' feedback, and the similarity will also vary when people use the system for different applications.
Town and Sinclair (2000) use a semantic labeling approach. An image is segmented into regular non-overlapping regions, and each region is classified into visual categories of outdoor scenes by neural networks. Similarity between a query and an image is computed either as the sum, over all grid cells, of the Euclidean distance between classification vectors, or as their cosine correlation. The evaluation was carried out on more than 1,000 Corel Photo Library images and about 500 home photos, and better classification and retrieval results were obtained for the professional Corel images.
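As a concrete illustration, the grid-based matching just described can be sketched in a few lines. The function name, the array shapes, and the flattened-vector form of the cosine variant are our own illustrative assumptions, not Town and Sinclair's implementation:

```python
import numpy as np

def grid_similarity(query_grids, image_grids, metric="euclidean"):
    """Compare two images represented as per-cell classification vectors.

    query_grids, image_grids: arrays of shape (n_cells, n_classes), where
    each row holds one grid cell's class-membership scores. Names and
    shapes are illustrative placeholders.
    """
    q = np.asarray(query_grids, dtype=float)
    x = np.asarray(image_grids, dtype=float)
    if metric == "euclidean":
        # Sum over all grid cells of the Euclidean distance between
        # classification vectors (smaller means more similar).
        return float(np.linalg.norm(q - x, axis=1).sum())
    # Cosine correlation between the flattened classification vectors
    # (larger means more similar).
    qf, xf = q.ravel(), x.ravel()
    return float(qf @ xf / (np.linalg.norm(qf) * np.linalg.norm(xf)))
```

Note that the two metrics run in opposite directions: the Euclidean sum is a distance, while the cosine correlation is a similarity.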
In a leading effort by the IBM (International Business Machines, Inc.) research
group to design and detect 34 visual concepts (both objects and sites) in the TREC 2002
benchmark corpus (www.nlpir.nist.gov/projects/trecvid/), support vector machines are
trained on segmented regions in key frames using various color and texture features
(Naphade et al., 2003; Naphade & Smith, 2003). Recently the vocabulary has been
extended to include 64 visual concepts for the TREC 2003 news video corpus (Amir et
al., 2003). Several months of effort were contributed by the TREC participants to the manual labeling of the training samples using the VideoAnnEx annotation tool (Lin et al., 2003).
However, highly accurate segmentation of objects is a major bottleneck except for selected narrow domains where a few dominant objects are recorded against a clear background (Smeulders et al., 2000, p. 1360). The challenge of object segmentation is acute for polysemic images in broad domains such as unconstrained consumer images. The interpretation of such scenes is usually not unique, as the scenes may have numerous conspicuous objects, some of unknown object classes (Smeulders et al., 2000).
Our Semantic Region Indexing (SRI) scheme addresses the issue of local region classification differently. We have also adopted statistical learning to extract local semantics from image content, though our detection-based approach does not rely on region segmentation. In addition, our innovation lies in the reconciliation of multiscale view-based object detection maps and the spatial aggregation of soft semantic histograms as the image content signature. Our local semantic interpretation scheme can also be viewed as a systematic extension of the signs designed for domain-specific applications (Smeulders et al., 2000, p. 1359) and the visual keywords built for explicit query specification (Lim, 2001).
Image classification is another approach to bridging the semantic gap that has received more attention lately (Bradshaw, 2000; Lipson et al., 1997; Szummer & Picard, 1998; Vailaya et al., 2001). In particular, efforts to classify photos based on content have addressed indoor versus outdoor (Bradshaw, 2000; Szummer & Picard, 1998), natural versus man-made (Bradshaw, 2000; Vailaya et al., 2001), and categories of natural scenes (Lipson et al., 1997; Vailaya et al., 2001). In general, the classifications were based on low-level features such as color, edge directions, and so forth. Vailaya et al. presented the most comprehensive coverage of the problem by dealing with a hierarchy of eight categories (plus three others) progressively, with separately designed features. The vacation photos used in their experiments are a mixture of Corel photos, personal photos, video key frames, and photos from the Web.
A natural and useful insight is to formulate image retrieval as a classification problem. In very general terms, the goal of image retrieval is to return images of a class C that the user has in mind, based on a set of features x computed for each image in the database. In a probabilistic sense, the system should return images ranked in descending order of P(C|x), however C may be defined. Under this general formulation, several approaches have emerged.
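The ranking step shared by all of these approaches is trivial once P(C|x) is available from some classifier. A minimal sketch (function and variable names are our own placeholders):

```python
import numpy as np

def rank_by_posterior(posteriors, image_ids):
    """Rank database images by P(C|x), highest first.

    posteriors: one P(C|x) value per image, produced by any
    probabilistic classifier; image_ids: parallel list of identifiers.
    """
    order = np.argsort(posteriors)[::-1]  # descending P(C|x)
    return [image_ids[i] for i in order]
```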
A Bayesian formulation to minimize the probability of retrieval error (i.e., the
probability of wrong classification) had been proposed by Vasconcelos and Lippman
(2000) to drive the selection of color and texture features and to unify similarity measures
with the maximum likelihood criteria. Similarly, in an attempt to classify indoor/outdoor and natural/man-made images, a Bayesian approach was used to combine class likelihoods resulting from multiresolution probabilistic class labels (Bradshaw, 2000). The class likelihoods were estimated based on local average color information and complex wavelet transform coefficients.
In a different way, Aksoy and Haralick (2002) as well as Wu and others (2000)
considered a two-class problem with only the relevance class and the irrelevance class.
A two-level classification framework was proposed by Aksoy and Haralick. Image feature
vectors were first mapped to two-dimensional class-conditional probabilities based on
simple parametric models. Linear classifiers were then trained on these probabilities and
their classification outputs were combined to rank images for retrieval. From a different
motivation, the image retrieval problem was cast as a transductive learning problem by
Wu et al. to include an unlabeled data set for training the image classifier. In particular,
a new discriminant-EM algorithm was proposed to generalize the mapping function
learned from the labeled training data to a specific unlabeled data set. The algorithm was
evaluated on a small database (134 images) of seven classes using 12 labeled images in
the form of relevance feedback.
This classification approach has been popular in specific domains. For medical images, images have been grouped by pathological classes for diagnostic purposes (Brodley et al., 1999) or by imaging modalities for visualization purposes (Mojsilovic & Gomes, 2002). In the case of facial images (Moghaddam et al., 1998), intrapersonal and extrapersonal classes of variation between two facial images were modeled. The similarity between the image intensities of two facial images was then expressed as a probabilistic measure in terms of the intrapersonal and extrapersonal class likelihoods and priors, using a Bayesian formulation.
T_i(z) = exp S_i(z) / Σ_j exp S_j(z)    (1)

λ_yz = (1/2) [ (y^c · z^c) / (|y^c| |z^c|) + (y^t · z^t) / (|y^t| |z^t|) ]    (2)

T_i(Z) = (1/n) Σ_k T_i(z_k)    (3)
For Query by Example (QBE), the content-based similarity λ between a query q and an image x can be computed in terms of the similarity between their corresponding local tessellated blocks. For example, the similarity based on the L1 distance measure (city block distance) between query q with m local blocks Y_j and image x with m local blocks Z_j is defined as

λ(q, x) = 1 − (1/(2m)) Σ_j Σ_i | T_i(Y_j) − T_i(Z_j) |    (4)
This is equivalent to histogram intersection (Swain & Ballard, 1991), with further averaging over the number of local histograms m, except that the bins have semantic interpretations as SSRs. There is a trade-off between content symmetry and spatial specificity. If we want images of similar semantics with different spatial arrangements (e.g., mirror images) to be treated as similar, we can use larger tessellated blocks (i.e., approaching a global histogram). However, in applications where spatial locations are considered differentiating, local histograms will provide good sensitivity to spatial specificity. Furthermore, we can attach different weights to the blocks (i.e., Y_j, Z_j) to emphasize the focus of attention (e.g., the center). In this chapter, we report experimental results based on even weights, as grid tessellation is used. We have attempted various similarity and distance measures (e.g., cosine similarity, L2 distance, Kullback-Leibler (KL) distance, etc.), and the simple city block distance in Equation 4 had the best performance.
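The indexing and matching steps of Equations 1, 3, and 4 can be sketched as follows. This is a minimal sketch under our own naming and shape assumptions, not the implementation used in the experiments:

```python
import numpy as np

def ssr_histogram(svm_scores):
    """Soft SSR histogram for one image block (Equation 1).

    svm_scores: array of shape (n_ssr,) holding the raw SVM outputs
    S_i(z) for one block. Returns the softmax values T_i(z).
    """
    e = np.exp(svm_scores - svm_scores.max())  # subtract max for stability
    return e / e.sum()

def block_index(block_scores):
    """Average the soft histograms of the n blocks falling in one
    tessellated region (Equation 3)."""
    hists = np.array([ssr_histogram(s) for s in block_scores])
    return hists.mean(axis=0)

def qbe_similarity(query_index, image_index):
    """City-block similarity between two image indexes (Equation 4).

    query_index, image_index: arrays of shape (m, n_ssr), one SSR
    histogram per tessellated block.
    """
    m = query_index.shape[0]
    return 1.0 - np.abs(query_index - image_index).sum() / (2.0 * m)
```

Because each histogram sums to one, the similarity of Equation 4 always falls in [0, 1], with 1 for identical indexes.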
Table 1. Training statistics of the 26 SSR classes

                              min.   max.   avg.
positive training examples      5     26    14.4
support vectors                 9     66    33.3
positive test examples          3     13     6.9
misclassified test examples     0     14     5.7
test error (%)                  0      7.8   3.2
Note that we have presented the features, distance measures, window sizes for SSR detection, and so forth, in concrete forms to facilitate understanding. The SSR methodology is in fact generic and flexible enough to adapt to other application domains.
For the data set and experiments reported in this chapter, we have designed 26 classes of SSRs (i.e., S_i, i = 1, 2, …, 26 in Equation 1), organized into eight superclasses as illustrated in Figure 2. We cropped 554 image regions from 138 images and used 375 of them (from 105 images) as training data for support vector machines to compute the support vectors of the SSRs, and the remaining one-third for validation. Among all the kernels evaluated, those with better generalization results on the validation set were used for the indexing and retrieval tasks. A polynomial kernel with degree 2 and constant 1 (C = 100) (Joachims, 1999) produced the best result on precision and recall; hence, it was adopted in the rest of our experiments.
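Training one such binary SSR detector with the reported kernel settings looks roughly as follows. The experiments used SVMlight (Joachims, 1999); scikit-learn's SVC is a stand-in here, and the function name and feature layout are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def train_ssr_classifier(X_train, y_train):
    """Train one binary SSR detector with the kernel settings reported
    in the chapter: polynomial kernel, degree 2, constant 1, C = 100.

    X_train: (n_samples, n_features) color/texture features of cropped
    regions; y_train: +1 for the SSR class, -1 for all other classes.
    """
    clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=100.0)
    clf.fit(X_train, y_train)
    return clf
```

One detector is trained per SSR class, with the positive examples of the other classes pooled as negatives.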
Table 1 lists the training statistics of the 26 SSR classes. The columns show, left to right, the minimum, maximum, and average of the number of positive training examples (from a total of 375), the number of support vectors computed from the training examples, the number of positive test examples (from a total of 179), the number of misclassified examples on the 179-example test set, and the percentage of error on the test set. The negative training (test) examples for an SSR class are the union of the positive training (test) examples of the other 25 classes. The minimum numbers of positive training and test examples come from the Interior:Wooden SSR, while the maximum numbers come from the People:Face class. The minimum and maximum numbers of support vectors are associated with the Sky:Clear and Building:Old SSRs, respectively. The SSR with the best generalization is the Interior:Wooden class, and the worst test error belongs to the Building:Old class.
When we are dealing with QBE, the set of relevant images R is obscure, and a query example q only provides a glimpse into it. In fact, the set of relevant images R does not exist until a query has been specified. However, to anchor the query context, we can define prior image classes C_k, k = 1, 2, …, M as prototypical instances of the relevance class R and compute the relative memberships of query q to these classes. Similarly, we can compute the interclass index for any database image x. These interclass memberships allow us to compute a form of categorical similarity between q and x (see Equation 7).
In this chapter, as our test images are consumer photos, we design a taxonomy for consumer photos as shown in Figure 3. This hierarchy is more comprehensive than that addressed by Vailaya et al. (2001). In particular, we consider subcategories for indoor and city scenes as well as more common subcategories for nature. We select the seven disjoint categories represented by the leaf nodes (except the miscellaneous category) in Figure 3 as semantic support classes (SSCs) to model the categorical context of relevance. That is, we trained seven binary SVMs C_k, k = 1, 2, …, 7 on these categories: interior or objects indoor (inob), people indoor (inpp), mountain and rocky area (mtrk), parks or gardens (park), swimming pool (pool), street scene (strt), and waterside (wtsd). Using the softmax function (Bishop, 1995), the output of classification R_k given an image x is computed as
R_k(x) = exp C_k(x) / Σ_j exp C_j(x)    (5)
The feature vector of an image for classification is the SRI image index, that is, T_i(Z_j) for all i, j as described above. To be consistent with the SSR training, we adopted the
Figure 3. Proposed taxonomy for consumer photos. The seven disjoint categories (the
leaf nodes except miscellaneous) are selected as semantic support classes to model
categorical context of relevance.
Table 2. Statistics related to SSC learning (left to right): SSC class labels, number of positive training examples (p-train), number of positive test examples (p-test), number of support vectors computed (sv), and the classification rate (rate) on the entire 2,400-image collection
SSC    p-train   p-test    sv    rate
inob      27      107     136    95.7
inpp     172      688     234    85.1
mtrk      13       54     116    98.0
park      61      243     158    92.4
pool      10       42      72    98.7
strt     129      516     259    84.4
wtsd      30      120     151    95.3
polynomial kernels and the similarity measure between image indexes u = Ti (Y j) and v =
Ti (Zj) as
λ_uv = (1/m) Σ_j [ Σ_i T_i(Y_j) T_i(Z_j) / ( (Σ_i T_i(Y_j)²)^{1/2} (Σ_i T_i(Z_j)²)^{1/2} ) ]    (6)
λ(q, x) = 1 − (1/2) Σ_k | R_k(q) − R_k(x) |    (7)
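The categorical matching of Equations 5 and 7 amounts to a softmax over per-class SVM outputs followed by a city-block comparison of the membership vectors. A minimal sketch, with illustrative names:

```python
import numpy as np

def class_memberships(svm_outputs):
    """Softmax class memberships R_k (Equation 5).

    svm_outputs: array of shape (n_classes,), raw SVM outputs C_k(x).
    """
    e = np.exp(svm_outputs - svm_outputs.max())  # stable softmax
    return e / e.sum()

def categorical_similarity(r_query, r_image):
    """Categorical similarity between query and image (Equation 7):
    one minus half the city-block distance between membership vectors."""
    return 1.0 - 0.5 * np.abs(r_query - r_image).sum()
```

Since the memberships sum to one, the similarity lies in [0, 1], reaching 1 only when the two membership vectors coincide.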
Similar to the SSR training, the support vector machines were trained using a polynomial kernel with degree 2 and constant 1 (C = 100) (Joachims, 1999). For each class, a human subject was asked to define the list of ground truth images from the 2,400-image collection, and 20% of the list was used for training. To ensure unbiased training samples, we generated 10 different sets of positive training samples from the ground truth list for each class, based on a uniform random distribution. The negative training (test) examples for a class are the union of the positive training (test) examples of the other six classes and the miscellaneous class. The classifier training for each class was carried out 10 times on these different training sets, and the support vector classifier of the best run was retained. Table 2 lists the statistics related to the SSC learning. The miscellaneous class (not shown in the table) has 171 images, which include images of dark scenes and bad quality.
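The best-of-ten training protocol just described can be sketched generically as follows; the callables `train_fn` and `score_fn` and all names are illustrative placeholders rather than part of the actual training code:

```python
import random

def train_best_of_runs(ground_truth, negatives, train_fn, score_fn,
                       n_runs=10, train_frac=0.2, seed=0):
    """Train a classifier on several uniformly random positive subsets
    and keep the best run, mirroring the SSC training protocol.

    ground_truth: list of positive example ids; negatives: list of
    negative example ids; train_fn(pos, neg) -> model;
    score_fn(model) -> validation score (higher is better).
    """
    rng = random.Random(seed)
    n_train = max(1, int(len(ground_truth) * train_frac))
    best_model, best_score = None, float("-inf")
    for _ in range(n_runs):
        pos = rng.sample(ground_truth, n_train)  # uniform random subset
        model = train_fn(pos, negatives)
        score = score_fn(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```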
number of labeled images. Then local semantic patterns are discovered from clustering
the image blocks with high classification output. Training samples are induced from
cluster memberships for support vector learning to form local semantic pattern detectors.
An image is then indexed as a tessellation of local semantic histograms and matched
using histogram intersection similar to that of the SRI scheme.
Given an application domain, some typical classes C_k with their image samples are identified. The training samples are tessellated image blocks z from the class samples. After learning, the class models will have captured the local class semantics, and a high SVM output (i.e., C_k(z) ≫ 0) suggests that the local region z is typical of the semantics of class k.
With the help of the learned class models C_k, we can generate sets of local image regions X_k that characterize the class semantics (which in turn capture the semantics of the content domain) as

X_k = { z | C_k(z) > θ },  θ ≥ 0    (8)
However, the local semantics hidden in each X_k are opaque and possibly multimodal. We would like to discover the multiple groupings in each class by unsupervised learning such as Gaussian mixture modeling or fuzzy c-means clustering. The result of the clustering is a collection of partitions m_kj, j = 1, 2, …, N_k in the space of local semantics for each class, where the m_kj are usually represented as cluster centers and the N_k are the numbers of partitions for each class. Once we have obtained the typical semantic partitions for each class, we can learn the models of Discovered Semantic Regions (DSR) S_i, i = 1, 2, …, N where N = Σ_k N_k (i.e., we linearize the ordering of m_kj as m_i). We label a local image block x ∈ ∪_k X_k as a positive example for S_i if it is closest to m_i, and as a negative example for S_j, j ≠ i:

X_i⁺ = { x | i = arg min_t | x − m_t | }    (9)

X_i⁻ = { x | i ≠ arg min_t | x − m_t | }    (10)
where |·| is some distance measure. Now we can perform supervised learning again on X_i⁺ and X_i⁻ using, say, support vector machines S_i(x) as the DSR models.

To visualize a DSR S_i, we can display the image block s_i that is most typical among those assigned to cluster m_i that belonged to class k:

C_k(s_i) = max_{x ∈ X_i⁺} C_k(x)    (11)
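The discovery steps of Equations 8 to 10 can be sketched as below. This sketch uses k-means as a stand-in for the fuzzy c-means clustering actually used, and all names and data layouts are our own assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_semantic_regions(typical_blocks, n_clusters_per_class):
    """Discover DSR cluster centers and induce labeled training sets
    in the spirit of Equations 8-10.

    typical_blocks: dict class_id -> array (n_blocks, n_features) of
    blocks that passed the typicality threshold C_k(z) > theta
    (Equation 8); n_clusters_per_class: dict class_id -> N_k.
    """
    centers = []
    for k, blocks in typical_blocks.items():
        km = KMeans(n_clusters=n_clusters_per_class[k], n_init=10,
                    random_state=0).fit(blocks)
        centers.extend(km.cluster_centers_)  # linearize m_kj as m_i
    centers = np.array(centers)

    all_blocks = np.vstack(list(typical_blocks.values()))
    # Assign each block to its nearest center: positives for that DSR,
    # negatives for all others (Equations 9 and 10).
    d = np.linalg.norm(all_blocks[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    positives = {i: all_blocks[nearest == i] for i in range(len(centers))}
    return centers, positives
```

The positive set of each DSR then feeds a binary SVM, with the remaining blocks as negatives.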
For the consumer images used in our experiments, we make use of the same seven disjoint categories represented by the leaf nodes (except the miscellaneous category) in Figure 3. The same color and texture features, as well as the modified dot product similarity measure used in the supervised learning framework (Equation 2), are adopted for the support vector classifier training with polynomial kernels (degree 2, constant 1, C = 100)
Table 3. Training statistics of the semantic classes Ck for bootstrapping local semantics.
The columns (left to right) list the class labels, the size of ground truth, the number of
training images, the number of support vectors learned, the number of typical image
blocks subject to clustering (Ck(z) > 2), and the number of clusters assigned.
Class   G.T.   #trg   #SV    #data   #clus
inob     134    15    1905    1429     4
inpp     840    20    2249     936     5
mtrk      67    10    1090    1550     2
park     304    15     955     728     4
pool      52    10    1138    1357     2
strt     645    20    2424     735     5
wtsd     150    15    2454     732     4
(Joachims, 1999). The training samples are 60 × 60 image blocks (tessellated with 20 pixels in both directions) from 105 sample images. Hence, each SVM was trained on 16,800 image blocks. After training, the samples from each class k are fed into classifier C_k to test their typicality. Those samples with SVM output C_k(z) > 2 (Equation 8) are subjected to fuzzy c-means clustering. The number of clusters assigned to each class is roughly proportional to the number of training images in that class. Table 3 lists the training statistics for these semantic classes: inob (indoor interior/objects), inpp (indoor people), mtrk (mountain/rocks), park (park/garden), pool (swimming pool), strt (street), and wtsd (waterside). Hence, we have 26 DSRs in total.
To build the DSR models, we trained 26 binary SVMs with polynomial kernels (degree 2, constant 1, C = 100) (Joachims, 1999), each on 7,467 positive and negative examples (Equations 9 and 10) (i.e., the sum of column 5 of Table 3). To visualize the 26 DSRs that have been learned, we compute the most typical image block for each cluster (Equation 11) and concatenate their appearances in Figure 4. Image indexing follows the same steps as in the SRI scheme (Equations 1 to 3), and matching uses the same similarity measure as given in Equation 4.
Figure 4. Most typical image blocks of the DSRs learned (left to right): china utensils
and cupboard top (first four) for the inob class; faces with different background and
body close-up (next five) for the inpp class; rocky textures (next two) for the mtrk class;
green foliage and flowers (next four) for the park class; pool side and water (next two)
for the pool class; roof top, building structures, and roadside (next five) for the strt
class; and beach, river, pond, far mountain (next four) for the wtsd class.
EXPERIMENTAL RESULTS
Dataset and Queries
In this chapter, we evaluate the SRI, CRI, and PDI schemes on 2,400 unconstrained consumer photos. These genuine consumer photos were taken over five years in several countries, in both indoor and outdoor settings. The images are those of the smallest resolution (i.e., 256 × 384) from Kodak PhotoCDs, in both portrait and landscape layouts. After removing possibly noisy marginal pixels, the images are of size 240 × 360. Figure 5 displays typical photos in this collection. As a matter of fact, this genuine consumer photo collection includes photos of bad quality (e.g., faded, over- and underexposed, blurred, etc.) (Figure 6). We retained them in our test to reflect the complexity of the original data. The indexing process automatically detects the layout and applies the corresponding tessellation template.
We defined 16 semantic queries and their ground truths (G.T.) among the 2,400 photos (Table 4). In fact, Figure 5 shows, in top-down, left-to-right order, two relevant images for each of queries Q01-Q16. As we can see from these sample images, the relevant images for any query considered here exhibit highly varied and complex visual appearances. Hence, to represent each query, we selected three relevant photos as query examples for our experiments, because a single query image is far from sufficient to capture the semantics of a query. Indeed, single query images resulted in poor precisions and recalls in our initial experiments. The precisions and recalls were computed without the query images themselves in the lists of retrieved images.
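The evaluation protocol above (precision and recall at a cutoff, with the query examples excluded from the retrieved list) can be sketched as follows; the function and argument names are illustrative:

```python
def precision_recall_at_k(retrieved, relevant, query_examples, k):
    """Precision and recall at cutoff k, excluding the query examples
    themselves from both the retrieved list and the relevant set.

    retrieved: ranked list of image ids; relevant: ground-truth ids;
    query_examples: ids of the query images; k: cutoff rank.
    """
    examples = set(query_examples)
    filtered = [r for r in retrieved if r not in examples][:k]
    relevant_set = set(relevant) - examples
    hits = sum(1 for r in filtered if r in relevant_set)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```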
Figure 5. Sample consumer photos from the 2,400-image collection. They also represent two relevant images (top-down, left-right) for each of the 16 queries used in our experiments.
Table 4. The 16 semantic queries and their ground truth (G.T.) sizes

Query   Description               G.T.
Q01     indoor                     994
Q02     outdoor                   1218
Q03     people close-up            277
Q04     people indoor              840
Q05     interior or object         134
Q06     city scene                 697
Q07     nature scene               521
Q08     at a swimming pool          52
Q09     street or roadside         645
Q10     along waterside            150
Q11     in a park or garden        304
Q12     at mountain area            67
Q13     buildings close-up         239
Q14     people close up, indoor     73
Q15     small group, indoor        491
Q16     large group, indoor         45
When a query has multiple examples, q = { q1, q2, …, qK }, the similarity λ(q, x) for any database image x is computed as

λ(q, x) = max_i λ(q_i, x)    (12)

λ(q, x) = w_c λ_c(q, x) + w_t λ_t(q, x),  w_c + w_t = 1    (13)

where λ_c and λ_t are similarities based on color and texture features, respectively. Among
the relative weights attempted at 0.1 intervals, the best fusion was obtained at Pavg = 0.38
and P30 = 0.61 with equal color influence and texture influence for global signatures. In
the case of local signatures, the fusion peaked when the local color histograms were given
a dominant influence of 0.9, resulting in Pavg = 0.38 and P30 = 0.59.
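The weight sweep behind these fusion results can be sketched as follows. The helper metric and all names are illustrative stand-ins, not the evaluation code used in the experiments:

```python
import numpy as np

def best_fusion_weight(color_sims, texture_sims, relevance, step=0.1):
    """Sweep the relative color/texture weight at fixed intervals and
    return the weight with the best average precision.

    color_sims, texture_sims: similarity scores per database image;
    relevance: 0/1 flags marking the relevant images.
    """
    def average_precision(scores, rel):
        order = np.argsort(scores)[::-1]          # rank by fused score
        rel = np.asarray(rel)[order]
        hits = np.cumsum(rel)
        ranks = np.arange(1, len(rel) + 1)
        prec_at_rel = (hits / ranks)[rel.astype(bool)]
        return float(prec_at_rel.mean()) if rel.any() else 0.0

    best_w, best_ap = 0.0, -1.0
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        fused = w * np.asarray(color_sims) + (1 - w) * np.asarray(texture_sims)
        ap = average_precision(fused, relevance)
        if ap > best_ap:
            best_w, best_ap = float(w), ap
    return best_w, best_ap
```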
The Precision/Recall curves (averaged over 16 queries) in Figure 7 illustrate the
precisions at various recall values for the four methods compared. All three proposed
indexing schemes outperformed the feature-based fusion approach.
Figure 7. Precision/Recall curves for CTO, SRI, CRI and PDI schemes
Table 5. Average precisions at top numbers of retrieved images (left to right): numbers
of retrieved images, average precisions based on CTO, SRI, CRI and PDI, respectively.
The numbers in parentheses are the relative improvement over the CTO method. The last
row shows the overall average precisions.
Avg. Prec.   CTO     SRI           CRI           PDI
At 20        0.54    0.76 (41%)    0.71 (31%)    0.71 (31%)
At 30        0.59    0.70 (19%)    0.68 (15%)    0.68 (15%)
At 50        0.52    0.62 (19%)    0.64 (23%)    0.63 (21%)
At 100       0.46    0.54 (17%)    0.58 (26%)    0.57 (24%)
Overall      0.38    0.45 (18%)    0.53 (39%)    0.48 (26%)
Table 5 shows the average precisions among the top 20, 30, 50 and 100 retrieved
images as well as the overall average precisions for the methods compared. Overall, the
proposed SRI, CRI and PDI schemes improve over the CTO method by 18%, 39% and 26%,
respectively. The CRI scheme has the best overall average precision of 0.53 while the SRI
scheme retrieves the highest number of relevant images at top 20 and 30 images.
DISCUSSION
The complex task of managing multimedia semantics has attracted a lot of research
interests due to the inexorable growth of multimedia information. While automatic feature
extraction does offer some objective measures to index the content of an image, it is far
from satisfactory to capture the subjective and rich semantics required by humans in
multimedia information retrieval tasks. Pattern classifiers provide a mid-level means to
bridge the gap between low-level features and higher level concepts (e.g., faces,
buildings, indoor, outdoor, etc.).
We believe that object and event detection in images and videos based on
supervised or semisupervised pattern classifiers will continue to be active research
areas. In particular, combining multiple modalities (visual, auditory, textual, Web) to
achieve synergy among the semantic cues from different information sources has been
accepted as a promising direction to create semantic indexes for multimedia contents
(e.g., combining visual and textual modalities for images; auditory and textual modalities
for music; auditory, visual and textual modalities for videos, etc.) in order to enhance
system performance. However, there is currently neither an established formalism nor a proven large-scale application to guide or demonstrate the exploitation of pattern classifiers and multiple modalities in semantic multimedia indexing. Hence, we believe principled representation and integration schemes for multiple modalities and multiple classifiers, as well as realistic large-scale applications, will be much sought after in the next few years. While some researchers push towards a generic methodology for broad applicability, we will also see many innovative uses of multimodal pattern classifiers that incorporate domain-specific knowledge to solve specific narrow-domain multimedia indexing problems.
Similarly, in the area of semantic image indexing and retrieval, we foresee three
promising trends, among other research opportunities. First, generic object detection
and recognition will continue to be an important research topic, especially in the direction
of unlabeled and unsegmented object recognition (e.g., Fergus et al., 2003). We hope that
the lessons learned in many forthcoming object recognition systems in narrow domains
can be abstracted into some generic and useful guiding principles. Next, complementary
information channels will be utilized to better index the images for semantic access. For
instance, in the area of consumer images, the time stamps available from digital cameras
can help to organize photos into events (Cooper et al., 2003). Associated text information
(e.g., stock photos, medical images, etc.) will provide a rich semantic source in addition
to image content (Barnard & Forsyth, 2001; Barnard et al., 2003b; Kutics et al., 2003; Li & Wang, 2003). Last, but not least, we believe that pattern discovery (as demonstrated
in this chapter) is an interesting and promising direction for image understanding and
indexing. These three trends (object recognition, text association and pattern discovery)
are not conflicting and their interaction and synergy would produce very powerful
semantic image indexing and retrieval systems in the future.
CONCLUDING REMARKS
In this chapter, we have reviewed several key roles of pattern classifiers in content-based image retrieval systems, ranging from segmented object detection to image scene classification. We pointed out the limitations related to region segmentation for object detection, image classification for similarity matching, and the manual labeling effort for supervised learning. Three new semantic image indexing schemes were introduced to address these issues respectively. They were compared to a feature-based fusion approach, which requires very high-dimensional features to attain reasonable retrieval performance, on 2,400 unconstrained consumer images with 16 semantic queries. Experimental results have confirmed that our three proposed indexing schemes are effective, especially when we consider precisions at the top retrieved images. We believe that pattern classifiers are very useful tools for bridging the semantic gap in content-based image retrieval. The potential for innovative use of pattern classifiers is promising, as demonstrated by the research results presented in this chapter.
ACKNOWLEDGMENTS
We thank T. Joachims for his great SVMlight software and J.L. Lebrun for his 2,400
family photos.
REFERENCES
Aksoy, S., & Haralick, R.M. (2002). A classification framework for content-based image
retrieval. In Proceedings of International Conference on Pattern Recognition
2002 (pp. 503-506).
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Bach, J.R. et al. (1996). The Virage image search engine: An open framework for image
management. In Storage and Retrieval for Image and Video Databases IV,
Proceedings of SPIE 2670 (pp. 76-87).
Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In
Proceedings of International Conference on Computer Vision 2001 (pp. 408-415).
Barnard, K. et al. (2003). The effects of segmentation of feature choices in a translation
model of object recognition. In Proceedings of IEEE Computer Vision and Pattern
Recognition 2003 (pp. 675-684).
Barnard, K. et al. (2003). Matching words and pictures. Journal of Machine Learning
Research, 3, 1107-1135.
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bradshaw, B. (2000). Semantic based image retrieval: A probabilistic approach. In
Proceedings of ACM Multimedia 2000, (pp. 167-176).
Brodley, C.E. et al. (1999). Content-based retrieval from medical image databases: A
synergy of human interaction, machine learning and computer vision. In Proceedings of AAAI (pp. 760-767).
Carson, C. et al. (2002). Blobworld: Image segmentation using expectation-maximization
and its application to image querying. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24(8), 1026-1038.
Cooper, M. et al. (2003). Temporal event clustering for digital photo collections. In
Proceedings of ACM Multimedia 2003 (pp. 364-373).
Duda, R.O., & Hart, P.E. (1973). Pattern classification and scene analysis. New York:
John Wiley & Sons.
Duygulu, P. et al. (2002). Object recognition as machine translation: Learning a lexicon
for a fixed image vocabulary. In Proceedings of European Conference on Computer Vision 2002 (vol IV, pp. 97-112).
Enser, P. (2000). Visual image retrieval: Seeking the alliance of concept based and content
based paradigms. Journal of Information Science, 26(4), 199-210.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised
scale-invariant learning. In Proceedings of IEEE Computer Vision and Pattern
Recognition 2003 (pp. 264-271).
Flickner, M. et al. (1995). Query by image and video content: The QBIC system. IEEE
Computer, 28(9), 23-30.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C.
Burges, & A. Smola (Eds.), Advances in kernel methods - Support vector learning
(pp. 169-184). Cambridge, MA: MIT Press.
Kapur, J.N., & Kesavan, H.K. (1992). Entropy optimization principles with applications. New York: Academic Press.
Kutics, A. et al. (2003). Linking images and keywords for semantics-based image retrieval.
In Proceedings of International Conference on Multimedia & Exposition (pp.
777-780).
Li, J., & Wang, J.Z. (2003). Automatic linguistic indexing of pictures by a statistical
modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 1-14.
Li, J., Wang, J.Z., & Wiederhold, G. (2000). Integrated region matching for image retrieval.
Proceedings of ACM Multimedia 2000 (pp. 147-156).
Lim, J.H. (2001). Building visual vocabulary for image indexation and query formulation.
Pattern Analysis and Applications, 4(2/3), 125-139.
Lipson, P., Grimson, E., & Sinha, P. (1997). Configuration based scene classification and
image indexing. In Proceedings of International Conference on Computer Vision
(pp. 1007-1013).
Manjunath, B.S., & Ma, W.Y. (1996). Texture features for browsing and retrieval of image
data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8),
837-842.
Moghaddam, B., Wahid, W., & Pentland, A. (1998). Beyond Eigenfaces: Probabilistic
matching for face recognition. In Proceedings of IEEE International Conference
on Automatic Face and Gesture Recognition (pp. 30-35).
Mojsilovic, A., & Gomes, J. (2002). Semantic based categorization, browsing and retrieval
in medical image databases. In Proceedings of IEEE International Conference on
Image Processing (pp. III 145-148).
Naphade, M.R. et al. (2003). A framework for moderate vocabulary semantic visual
concept detection. In Proceedings of International Conference on Multimedia &
Exposition (pp. 437-440).
Ortega, M. et al. (1997). Supporting similarity queries in MARS. In Proceedings of ACM
Multimedia (pp. 403-413).
Papageorgiou, P.C., Oren, M., & Poggio, T. (1997). A general framework for object
detection. In Proceedings of International Conference on Computer Vision (pp.
555-562).
Pentland, A., Picard, R.W., & Sclaroff, S. (1995). Photobook: Content-based manipulation
of image databases. International Journal of Computer Vision, 18(3), 233-254.
Robertson, S.E. (1977). The probability ranking principle in IR. Journal of Documentation, 33, 294-304.
Schmid, C. (2001). Constructing models for content-based image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition 2001 (pp. 39-45).
Selinger, A., & Nelson, R.C. (2001). Minimally supervised acquisition of 3D recognition
models from cluttered images. In Proceedings of IEEE Computer Vision and
Pattern Recognition 2001 (pp. 213-220).
Smeulders, A.W.M. et al. (2000). Content-based image retrieval at the end of the early
years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12),
1349-1380.
Smith, J.R., & Chang, S.-F. (1996). VisualSEEk: A fully automated content-based image
query system. In Proceedings of ACM Multimedia, Boston, November 20 (pp. 87-98).
Sung, K.K., & Poggio, T. (1998). Example-based learning for view-based human face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(1), 39-51.
Swain, M.J., & Ballard, D.H. (1991). Color indexing. International Journal of Computer
Vision, 7(1), 11-32.
Szummer, M., & Picard, R.W. (1998). Indoor-outdoor image classification. In Proceedings of IEEE International Workshop on Content-based Access of Image and
Video Databases (pp. 42-51).
Town, C., & Sinclair, D. (2000). Content-based image retrieval using semantic visual
categories. Technical Report 2000.14, AT&T Laboratories Cambridge.
Vailaya, A., et al. (2001). Bayesian framework for hierarchical semantic classification of
vacation images. IEEE Transactions on Image Processing, 10(1), 117-130.
Vasconcelos, N., & Lippman, A. (2000). A probabilistic architecture for content-based
image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition
(pp. 1216-1221).
Wang, L., Chan, K.L., & Zhang, Z. (2003). Bootstrapping SVM active learning by
incorporating unlabelled images for image retrieval. In Proceedings of IEEE
Computer Vision and Pattern Recognition (pp. 629-634).
Weber, M., Welling, M., & Perona, P. (2000). Unsupervised learning of models for
recognition. In Proceedings of European Conference on Computer Vision (pp. 18-32).
Wu, Y., Tian, Q., & Huang, T.S. (2000). Discriminant-EM algorithm with application to
image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition
(pp. 1222-1227).
Chapter 3
Self-Supervised Learning
Based on Discriminative
Nonlinear Features and
Its Applications for
Pattern Classification
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA
ABSTRACT
For learning-based tasks such as image classification and object recognition, the
feature dimension is usually very high. The learning is afflicted by the curse of
dimensionality as the search space grows exponentially with the dimension.
Discriminant expectation maximization (DEM) provides a framework for applying
self-supervised learning in a discriminating subspace. This chapter extends the linear
DEM to a nonlinear kernel algorithm, Kernel DEM (KDEM), and evaluates KDEM
extensively on benchmark image databases and synthetic data. Various comparisons
with other state-of-the-art learning techniques are investigated for several tasks of
image classification, hand posture recognition and fingertip tracking. Extensive
results show the effectiveness of our approach.
Self-Supervised Learning 53
INTRODUCTION
Invariant object recognition is a fundamental but challenging computer vision task,
since finding effective object representations is generally a difficult problem. Three
dimensional (3D) object reconstruction suggests a way to invariantly characterize
objects. Alternatively, objects could also be represented by their visual appearance
without explicit reconstruction. However, representing objects in the image space is
formidable, since the dimensionality of the image space is intractable. Dimension
reduction could be achieved by identifying invariant image features. In some cases,
domain knowledge could be exploited to extract image features from visual inputs, such
as in content-based image retrieval (CBIR). CBIR is a technique that uses visual content
to search for images in large-scale image databases according to users' interests, and it has
been an active and fast-advancing research area since the 1990s (Smeulders et al., 2000).
However, in many cases machines need to learn such features from a set of examples
when image features are difficult to define. Successful examples of learning approaches
in the areas of content-based image retrieval, face and gesture recognition can be found
in the literature (Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al., 2000;
Belhumeur, 1996).
Generally, characterizing objects from examples requires huge training datasets,
because input dimensionality is large and the variations that object classes undergo are
significant. Labeled or supervised information of training samples are needed for
recognition tasks. The generalization abilities of many current methods largely depend
on training datasets. In general, good generalization requires large and representative
labeled training datasets. Unfortunately, collecting labeled data can be a tedious, if not
impossible, process. Although unsupervised or clustering schemes have been proposed
(e.g., Basri et al., 1998; Weber et al., 2000), it is difficult for pure unsupervised approaches
to achieve accurate classification without supervision.
This problem can be alleviated by semisupervised or self-supervised learning
techniques which take hybrid training datasets. In content-based image retrieval (e.g.,
Smeulders et al., 2000; Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al.,
2000), there are a limited number of labeled training samples given by user query and
relevance feedback (Rui et al., 1998). Pure supervised learning on such a small training
dataset will have poor generalization performance. If the learning classifier is overtrained on the small training dataset, over-fitting will probably occur. However, there
are a large number of unlabeled images or unlabeled data in general in the given database.
Unlabeled data contain information about the joint distribution over features which can
be used to help supervised learning. These algorithms assume that only a fraction of the
data is labeled with ground truth, but still take advantage of the entire data set to generate
good classifiers; they make the assumption that nearby data are likely to be generated
by the same class. This learning paradigm could be seen as an integration of pure
supervised and unsupervised learning.
Discriminant-EM (DEM) (Wu et al., 2000) is a self-supervised learning algorithm for
such purposes; it uses a small set of labeled data together with a large set of unlabeled data. The
basic idea is to learn the discriminating features and the classifier simultaneously by
inserting a multiclass linear discriminant step into the standard expectation-maximization
(EM) (Duda et al., 2001) iteration loop. DEM makes the assumption that the probabilistic
the well-known kernel trick (Schölkopf & Smola, 2002). A kernel function k
computes a dot product in a feature space F: k(x, z) = (φ(x)^T φ(z)). By formulating an
algorithm in F using only dot products, we can replace every occurrence of a dot
product by the kernel function k, which amounts to performing the same linear algorithm
as before, but implicitly in the kernel feature space F. The kernel principle has quickly
gained attention in image classification in recent years (e.g., Zhou & Huang, 2001;
Wang et al., 2003; Wu et al., 2001; Tian et al., 2004; Schölkopf et al., 2002; Wolf &
Shashua, 2003).
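As a toy illustration of the kernel trick described above (our own sketch, not code from the chapter), the degree-2 polynomial kernel computes the same dot product as an explicit feature map, without ever forming that map:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k_poly2(x, z):
    """Degree-2 polynomial kernel: implicit dot product in the feature space F."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = float(np.dot(phi(x), phi(z)))   # dot product computed in F
implicit = k_poly2(x, z)                   # same value, phi never formed
assert abs(explicit - implicit) < 1e-9
```

Any algorithm written purely in terms of such dot products can therefore be "kernelized" by substituting k for the dot product.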
$$J(W) = \frac{|W^T S_1 W|}{|W^T S_2 W|} \qquad (1)$$
Here, W denotes the weight vector of a linear feature extractor (i.e., for an example
x, the feature is given by the projection $W^T x$), and $S_1$ and $S_2$ are symmetric matrices
designed such that they measure the desired information and the undesired noise along
the direction W. The ratio in Equation (1) is maximized when one covers as much as
possible of the desired information while avoiding the undesired.
If we look for discriminating directions for classification, we can choose $S_B$
(the between-class scatter) to measure the separability of the class centers, that is, $S_1$ in
Equation (1), and $S_W$ (the within-class scatter) as $S_2$ in Equation (1).
In this case, we recover the well-known Fisher discriminant (Fisher, 1936), where $S_B$
and $S_W$ are given by
$$S_B = \sum_{j=1}^{C} N_j (m_j - m)(m_j - m)^T \qquad (2)$$

$$S_W = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (x_i^{(j)} - m_j)(x_i^{(j)} - m_j)^T \qquad (3)$$
where we use $\{x_i^{(j)}, i = 1, \ldots, N_j\}$, $j = 1, \ldots, C$ (C = 2 for Fisher discriminant analysis (FDA)) to denote
the feature vectors of the training samples. C is the number of classes, $N_j$ is the number of
samples in the jth class, $x_i^{(j)}$ is the ith sample from the jth class, $m_j$ is the mean vector of the
jth class, and m is the grand mean of all examples.
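Equations (2) and (3) can be sketched numerically as follows (an illustrative NumPy snippet of ours, not the authors' implementation); the Fisher direction is recovered as the leading eigenvector of $S_W^{-1} S_B$:

```python
import numpy as np

# Two well-separated 2-D Gaussian classes as toy training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),   # class 0
               rng.normal([2, 1], 0.3, (50, 2))])  # class 1
y = np.array([0] * 50 + [1] * 50)

m = X.mean(axis=0)                                  # grand mean
S_B = np.zeros((2, 2))
S_W = np.zeros((2, 2))
for j in np.unique(y):
    Xj = X[y == j]
    mj = Xj.mean(axis=0)
    d = (mj - m)[:, None]
    S_B += len(Xj) * d @ d.T                        # Equation (2)
    S_W += (Xj - mj).T @ (Xj - mj)                  # Equation (3)

# Leading eigenvector of S_W^{-1} S_B maximizes the ratio in Equation (1).
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = np.real(vecs[:, np.argmax(np.real(vals))])

# The projected class means are well separated along w.
proj = X @ w
assert abs(proj[y == 0].mean() - proj[y == 1].mean()) > 1.0
```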
If $S_1$ in Equation (1) is the covariance matrix

$$S_1 = \frac{1}{C} \sum_{j=1}^{C} \frac{1}{N_j} \sum_{i=1}^{N_j} (x_i^{(j)} - m)(x_i^{(j)} - m)^T \qquad (4)$$
This is the same idea adopted by the support vector machine (Vapnik, 2000), kernel
PCA (Schölkopf et al., 1998), and invariant feature extraction (Mika et al., 1999; Roth &
Steinhage, 1999). The trick is to rewrite the MDA formulae using only dot products of
the form $\phi_i^T \phi_j$, so that the reproducing kernel matrix can be substituted into the
formulation and the solution, thus eliminating the need for the direct nonlinear transformation.
Using the superscript $\phi$ to denote quantities in the new space, and writing $S_B^\phi$ and $S_W^\phi$ for the
between-class scatter matrix and within-class scatter matrix, we have the objective
function in the following form:

$$J(W^\phi) = \frac{|(W^\phi)^T S_B^\phi W^\phi|}{|(W^\phi)^T S_W^\phi W^\phi|} \qquad (5)$$
and

$$S_B^\phi = \sum_{j=1}^{C} N_j (m_j^\phi - m^\phi)(m_j^\phi - m^\phi)^T \qquad (6)$$

$$S_W^\phi = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (\phi(x_i^{(j)}) - m_j^\phi)(\phi(x_i^{(j)}) - m_j^\phi)^T \qquad (7)$$

with $m^\phi = \frac{1}{N} \sum_{k=1}^{N} \phi(x_k)$ and $m_j^\phi = \frac{1}{N_j} \sum_{k=1}^{N_j} \phi(x_k^{(j)})$, where N and $N_j$ are the total number of samples and the number of samples in the jth class, respectively.
In general, there is no other way to express the solution $W_{opt}^\phi \in F$, either because F
is of too high or even infinite dimension, or because we do not even know the actual feature space
connected to a certain kernel. Schölkopf and Smola (2002) and Mika et al. (2003) showed
that any column $w_i$ of the solution $W_{opt}^\phi$ must lie in the span of all training samples in F, that
is, $w_i \in \mathrm{span}\{\phi(x_1), \ldots, \phi(x_N)\}$. Thus, for some expansion coefficients $\alpha = [\alpha_1, \ldots, \alpha_N]^T$,

$$w_i = \sum_{k=1}^{N} \alpha_k \phi(x_k) = \Phi \alpha, \qquad i = 1, \ldots, N \qquad (8)$$
where $\Phi = [\phi(x_1), \ldots, \phi(x_N)]$. We can therefore project a data point $x_k$ onto one coordinate of the linear subspace of F as follows (we will drop the subscript on $w_i$ in the ensuing
equations):

$$w^T \phi(x_k) = \alpha^T \Phi^T \phi(x_k) \qquad (9)$$
$$= \alpha^T \begin{bmatrix} k(x_1, x_k) \\ \vdots \\ k(x_N, x_k) \end{bmatrix} = \alpha^T \xi_k \qquad (10)$$

$$\xi_k = \begin{bmatrix} k(x_1, x_k) \\ \vdots \\ k(x_N, x_k) \end{bmatrix} \qquad (11)$$
where we have rewritten the dot products $\phi(x)^T \phi(y)$ using the kernel notation k(x, y). Similarly,
we can project each of the class means onto an axis of the subspace of the feature space F
using only dot products:

$$w^T m_j^\phi = \alpha^T \frac{1}{N_j} \sum_{k=1}^{N_j} \begin{bmatrix} \phi(x_1)^T \phi(x_k) \\ \vdots \\ \phi(x_N)^T \phi(x_k) \end{bmatrix} \qquad (12)$$

$$= \alpha^T \begin{bmatrix} \frac{1}{N_j} \sum_{k=1}^{N_j} k(x_1, x_k) \\ \vdots \\ \frac{1}{N_j} \sum_{k=1}^{N_j} k(x_N, x_k) \end{bmatrix} \qquad (13)$$

$$= \alpha^T \mu_j \qquad (14)$$
It follows that

$$w^T S_B^\phi w = \alpha^T K_B \alpha \qquad (15)$$

where $K_B = \sum_{j=1}^{C} N_j (\mu_j - \mu)(\mu_j - \mu)^T$, with $\mu$ defined as $\mu_j$ but averaged over all N training samples, and

$$w^T S_W^\phi w = \alpha^T K_W \alpha \qquad (16)$$

where $K_W = \sum_{j=1}^{C} \sum_{k=1}^{N_j} (\xi_k - \mu_j)(\xi_k - \mu_j)^T$. The goal of kernel multiple discriminant analysis
(KMDA) is to find

$$A_{opt} = \arg\max_{A} \frac{|A^T K_B A|}{|A^T K_W A|} \qquad (17)$$

where $A = [\alpha_1, \ldots, \alpha_{C-1}]$, C is the total number of classes, N is the size of the training set,
and $K_B$ and $K_W$ are $N \times N$ matrices which require only kernel computations on the training
samples (Schölkopf & Smola, 2002).
Now we can solve for the $\alpha$'s; the projection of a new pattern z onto w is given by
Equations (9) and (10). Similarly, algorithms using different matrices for $S_1$ and $S_2$ in
Equation (1) are easily obtained along the same lines.
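A minimal numerical sketch of KMDA on toy data, assuming an RBF kernel and a small ridge term added to $K_W$ for numerical stability (our illustration, not the chapter's code):

```python
import numpy as np

def rbf(X, Z, c=1.0):
    """RBF kernel matrix k(x, z) = exp(-||x - z||^2 / c)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(1, 0.2, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
N = len(X)

K = rbf(X, X)                       # kernel computations on training samples only
mu = K.mean(axis=1, keepdims=True)  # grand-mean kernel column
K_B = np.zeros((N, N))
K_W = np.zeros((N, N))
for j in (0, 1):
    idx = np.where(y == j)[0]
    mu_j = K[:, idx].mean(axis=1, keepdims=True)   # class-mean kernel column (Eq. 13-14)
    K_B += len(idx) * (mu_j - mu) @ (mu_j - mu).T  # Equation (15)
    for i in idx:
        d = K[:, [i]] - mu_j                       # xi_i - mu_j
        K_W += d @ d.T                             # Equation (16)

# Solve the generalized eigenproblem behind Equation (17), with a ridge on K_W.
vals, vecs = np.linalg.eig(np.linalg.solve(K_W + 1e-6 * np.eye(N), K_B))
alpha = np.real(vecs[:, np.argmax(np.real(vals))])

proj = K @ alpha                    # project training data via Equations (9)-(10)
r = abs(proj[y == 0].mean() - proj[y == 1].mean()) / (
    proj[y == 0].std() + proj[y == 1].std() + 1e-12)
assert r > 1.0                      # classes separate in the projected space
```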
$$S_{NP} = \sum_{i=1}^{N_y} (y_i - m_x)(y_i - m_x)^T \qquad (18)$$

$$S_{P} = \sum_{i=1}^{N_x} (x_i - m_x)(x_i - m_x)^T \qquad (19)$$

where $\{x_i, i = 1, \ldots, N_x\}$ denotes the positive examples, $\{y_i, i = 1, \ldots, N_y\}$ denotes the
negative examples, and $m_x$ is the mean vector of the set $\{x_i\}$. $S_{NP}$ is the
scatter matrix between the negative examples and the centroid of the positive examples,
and $S_P$ is the scatter matrix within the positive examples. The subscript NP indicates the asymmetric
property of this approach, that is, the user's biased opinion towards the positive class;
hence the name biased discriminant analysis (BDA) (Zhou & Huang, 2001).
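The asymmetry of Equations (18) and (19) is easy to see numerically (a toy sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
pos = rng.normal([0, 0], 0.2, (40, 2))             # positive examples x_i
neg = rng.normal([1.5, -1.0], 0.5, (40, 2))        # negative examples y_i

m_x = pos.mean(axis=0)                             # only the POSITIVE centroid is used
S_NP = (neg - m_x).T @ (neg - m_x)                 # Equation (18)
S_P = (pos - m_x).T @ (pos - m_x)                  # Equation (19)

# Negatives are scattered around the positive centroid, so their scatter
# dominates whenever they sit far from it; positives cluster tightly.
assert np.trace(S_NP) > np.trace(S_P)
```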
TRAINING ON A SUBSET
We still have one problem: although we could avoid working explicitly in the
extremely high- or even infinite-dimensional space F, we now face a problem in N variables,
a number which in many practical applications would not allow us to store or manipulate
$N \times N$ matrices on a computer anymore. Furthermore, solving, for example, an eigenproblem or a QP of this size is very time consuming ($O(N^3)$). To maximize Equation (17),
we need to solve an $N \times N$ eigen- or mathematical programming problem, which might be
intractable for a large N. Approximate solutions can be obtained by sampling representative subsets of the training data $\{x_k \mid k = 1, \ldots, M\}$, $M \ll N$, and using only these M samples as kernel vectors.
The first scheme is blind to the class labeling. We select representatives, or kernel
vectors, by identifying those training samples which are likely to play a key role in the
projection. In iteration k, the projection $y_i^{(k)} = w^{(k)T} \phi(x_i) = (A_{opt}^{(k)})^T \xi_i^{(k)}$ of the original data $x_i$ can be obtained. We
assume a Gaussian distribution $\theta^{(k)}$ for each class in the nonlinear discrimination space $\Omega$,
and the parameters $\theta^{(k)}$ can be estimated from $\{y^{(k)}\}$, such that the labeling and the training error
$e^{(k)}$ can be obtained by $l_i^{(k)} = \arg\max_j p(l_j \mid y_i, \theta^{(k)})$.
If $e^{(k)} < e^{(k-1)}$, we randomly select M training samples from the correctly classified
training samples as the kernel vectors $KV^{(k+1)}$ for iteration k+1. Another possibility is that if
any current kernel vector is correctly classified, we randomly select a sample in its
topological neighborhood to replace that kernel vector in the next iteration. Otherwise,
that is, if $e^{(k)} > e^{(k-1)}$, we terminate.
The evolutionary kernel vector selection algorithm is summarized below:

Evolutionary Kernel Vector Selection: Given a set of training data D = (X, L) = {(x_i, l_i), i = 1, ..., N},
identify a set of M kernel vectors KV = {v_i, i = 1, ..., M}.

    k = 0; e = infinity; KV(0) = random_pick(X);    // Init
    do {
        A_opt(k) = KMDA(X, KV(k));                  // Perform KMDA
        Y(k) = Proj(X, A_opt(k));                   // Project X to Omega
        Theta(k) = Bayes(Y(k), L);                  // Bayesian classifier
        L(k) = Labeling(Y(k), Theta(k));            // Classification
        e(k) = Error(L(k), L);                      // Calculate error
        if (e(k) < e) {
            e = e(k); KV = KV(k); k++;
            KV(k) = random_pick({x_i : l_i(k) == l_i});
        } else {
            KV = KV(k-1);
            break;
        }
    }
    return KV;
$$y_i = \arg\max_{j = 1, \ldots, C} p(y_j \mid x_i, L, U), \qquad x_i \in U \qquad (20)$$

where C is the number of classes and $y_i$ is the class label for $x_i$.
The expectation-maximization (EM) (Duda et al., 2001) approach can be applied to
this transductive learning problem, since the labels of the unlabeled data can be treated as
missing values. We assume that the hybrid data set is drawn from a mixture density
distribution of C components $\{c_j, j = 1, \ldots, C\}$, which are parameterized by $\Theta = \{\theta_j, j = 1, \ldots, C\}$. The mixture model can be represented as

$$p(x \mid \Theta) = \sum_{j=1}^{C} p(x \mid c_j; \theta_j)\, p(c_j \mid \theta_j) \qquad (21)$$

where x is a sample drawn from the hybrid data set $D = L \cup U$. We make the further assumption
that each component in the mixture model corresponds to one class, that is, $\{y_j = c_j, j = 1, \ldots, C\}$.
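A minimal sketch of EM on the mixture of Equation (21) with a hybrid (labeled plus unlabeled) data set, where the few labeled samples have their responsibilities clamped in the E-step (our simplified 1-D illustration, not the chapter's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
labels = {0: 0, 100: 1}            # indices of the few labeled samples

mu = np.array([-0.5, 0.5])         # deliberately poor initialization
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities z_ij; labeled samples are clamped to their class.
    r = pi * gauss(x[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)
    for i, c in labels.items():
        r[i] = np.eye(2)[c]
    # M-step: update the mixture parameters from ALL (hybrid) data.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

assert mu[0] < -1.0 and mu[1] > 1.0   # components recovered near -2 and +2
```

The unlabeled points supply most of the information about the joint density, while the two labeled points pin down which component corresponds to which class.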
Since the training data set D is the union of the labeled data set L and the unlabeled data
set U, the joint probability density of the hybrid data set can be written as:

$$p(D \mid \Theta) = \prod_{x_i \in U} \sum_{j=1}^{C} p(c_j \mid \Theta)\, p(x_i \mid c_j; \Theta) \prod_{x_i \in L} p(y_i = c_i \mid \Theta)\, p(x_i \mid y_i = c_i; \Theta) \qquad (22)$$
Equation (22) holds when we assume that each sample is independent of the others. The
first part of Equation (22) is for the unlabeled data set, and the second part is for the
labeled data set.
The parameters $\Theta$ can be estimated by maximizing the a posteriori probability $p(\Theta \mid D)$.
Equivalently, this can be done by maximizing $\log p(\Theta \mid D)$. Let

$$l(\Theta \mid D) = \log p(\Theta) + \sum_{x_i \in U} \log \sum_{j=1}^{C} p(c_j \mid \Theta)\, p(x_i \mid c_j; \Theta) + \sum_{x_i \in L} \log\left( p(y_i = c_i \mid \Theta)\, p(x_i \mid y_i = c_i; \Theta) \right) \qquad (23)$$

Since the log of a sum is hard to deal with, a binary indicator $z_i = (z_{i1}, \ldots, z_{iC})$ is introduced for each observation: $z_{ij} = 1$ if and only if $y_i = c_j$, and $z_{ij} = 0$ otherwise, so
that

$$l(\Theta \mid D, Z) = \log p(\Theta) + \sum_{x_i \in D} \sum_{j=1}^{C} z_{ij} \log\left( p(c_j \mid \Theta)\, p(x_i \mid c_j; \Theta) \right) \qquad (24)$$
$$p(y \mid \Theta) = \sum_{j=1}^{C} p(w^T \phi(x) \mid c_j; \theta_j)\, p(c_j \mid \theta_j) \qquad (25)$$
the transformed data in $\Omega$. Kernel DEM can be initialized by selecting all labeled data as
kernel vectors and training a weak classifier based on only the labeled samples. Then the
three steps of Kernel DEM are iterated until some appropriate convergence criterion is met:

E-step: set $\hat{Z}^{(k+1)} = E[Z \mid D; \Theta^{(k)}]$.

D-step: solve the KMDA problem $\max_A \frac{|A^T K_B A|}{|A^T K_W A|}$, and project a data point x to a linear
subspace of the feature space F.

The E-step gives probabilistic labels to the unlabeled data, which are then used by the
D-step to separate the data. As mentioned above, this assumes that the class distribution
is moderately smooth.
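The E/D/M iteration can be sketched end to end on toy data (our simplified, runnable paraphrase with a two-class kernel Fisher D-step; it is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)
lab = np.array([0, 1, 50, 51])          # indices of the few labeled points
N = len(X)

K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # RBF kernel matrix, c = 1

# Weak initial classifier: nearest labeled point gives hard initial labels z.
z = np.zeros((N, 2))
for i in range(N):
    j = lab[np.argmin(((X[i] - X[lab]) ** 2).sum(-1))]
    z[i, y_true[j]] = 1.0

for _ in range(5):
    # D-step: kernel Fisher direction alpha from the soft class means.
    n0, n1 = z[:, 0].sum(), z[:, 1].sum()
    mu0, mu1 = K @ z[:, 0] / n0, K @ z[:, 1] / n1
    Kw = np.eye(N) * 1e-3                            # small ridge for stability
    for i in range(N):
        d0, d1 = K[:, i] - mu0, K[:, i] - mu1
        Kw += z[i, 0] * np.outer(d0, d0) + z[i, 1] * np.outer(d1, d1)
    alpha = np.linalg.solve(Kw, mu1 - mu0)
    proj = K @ alpha                                 # 1-D discriminating space
    # M-step: per-class Gaussian parameters in the projected space.
    m = np.array([(z[:, c] * proj).sum() / z[:, c].sum() for c in (0, 1)])
    s = np.array([np.sqrt((z[:, c] * (proj - m[c]) ** 2).sum() / z[:, c].sum())
                  for c in (0, 1)]) + 1e-6
    # E-step: probabilistic labels for all points; labeled points stay clamped.
    p = np.exp(-0.5 * ((proj[:, None] - m) / s) ** 2) / s
    z = p / p.sum(axis=1, keepdims=True)
    for j in lab:
        z[j] = np.eye(2)[y_true[j]]

pred = z.argmax(axis=1)
assert (pred == y_true).mean() > 0.9
```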
positive constants $c \in \mathbb{R}$ and $d \in \mathbb{N}$, respectively (Schölkopf & Smola, 2002). RBF kernels
are used in all kernel-based algorithms.
In Table 1, KMDA-random is KMDA with kernel vectors randomly selected from the
training samples, KMDA-pca is KMDA with kernel vectors selected from the training
samples based on PCA, and KMDA-evolutionary is KMDA with kernel vectors selected by
the evolutionary scheme. The benchmark test shows that Kernel
MDA achieves performance comparable to other state-of-the-art techniques over
different training datasets, in spite of the use of a decimated training set. Comparing the three
schemes of selecting kernel vectors, it is clear that both the PCA-based and evolutionary-based schemes work slightly better than random selection, with smaller
error rates and/or smaller standard deviations. Finally, Table 1 clearly shows the superior
performance of KMDA over linear MDA.
Table 1. Benchmark test: Average test error and standard deviation in percentage

    Method               Banana         Breast-Cancer   Heart
    RBF                  10.8 ± 0.06    27.6 ± 0.47     17.6 ± 0.33
    AdaBoost             12.3 ± 0.07    30.4 ± 0.47     20.3 ± 0.34
    SVM                  11.5 ± 0.07    26.0 ± 0.47     16.0 ± 0.33
    KFD                  10.8 ± 0.05    25.8 ± 0.48     16.1 ± 0.34
    MDA                  38.43 ± 2.5    28.57 ± 1.37    20.1 ± 1.43
    KMDA-random          11.03 ± 0.26   27.4 ± 1.53     16.5 ± 0.85
    KMDA-pca             10.7 ± 0.25    27.5 ± 0.47     16.5 ± 0.32
    KMDA-evolutionary    10.8 ± 0.56    26.3 ± 0.48     16.1 ± 0.33
    (# Kernel Vectors)   120            40              20

Kernel Setting
There are two parameters that need to be determined for kernel algorithms using the RBF
(radial basis function) kernel. The first is the degree c, and the second is the number of
kernel vectors used. Kernel-based approaches are sensitive to the parameters
selected; for the Gaussian (radial basis function) kernel, for example,

$$k(x, z) = \exp(-\|x - z\|^2 / c)$$
Figure 2. Average error rate for KDEM with RBF kernel under varying degree c and
number of kernel vectors on heart data (curves for c = 1, 5, 10, 20, 40, 60, 80 and 100; x-axis: number of kernel vectors from 1 to 100; y-axis: error rate)
Figure 3. Comparison of KDEM and KBDA for face and non-face classification (x-axis: number of negative training examples from 10 to 200; y-axis: error rate)
positive examples. This works very well with a relatively small training set. However, BDA
is biased towards the centroid of the positive examples. It will be effective only if these
positive examples are the most-informative images (Cox et al., 2000; Tong & Wang, 2001),
for example, images close to the classification boundary. If the positive examples are
instead the most-positive images (Cox et al., 2000; Tong & Wang, 2001), that is, images far
away from the classification boundary, the optimal transformation found based on them
will not help the classification of images on the boundary.
Moreover, BDA ignores the unlabeled data and uses only the labeled data in learning.
In the third experiment, Kernel DEM (KDEM) is compared with Kernel BDA
(KBDA) on both an image database and synthetic data. Figure 3 shows the average
classification error rate in percentage for KDEM and KBDA with the same RBF kernel for
face and nonface classification. The face images are from the MIT facial image
database (CBCL Face Database) and the nonface images are from a Corel database. There
are 2,429 face images from the MIT database and 1,385 nonface images (14 categories
with about 99 images per category), a subset of the Corel database, in the experiment. Some
Figure 4. Examples of (a) face images from the MIT facial database and (b) nonface images
from the Corel database
examples of face and nonface images are shown in Figure 4. For training sets, the face
images are randomly selected from the MIT database with fixed size 100, and nonface
images are randomly selected from the Corel database with varying sizes from five to 200.
The testing set consists of 200 random images (100 faces and 100 nonfaces) from the two
databases. The images are resized to 16×16 and converted to column-wise concatenated feature vectors.
In Figure 3, when the number of negative examples is small (< 20), KBDA outperforms
KDEM, while KDEM performs better when more negative examples are provided. This
agrees with our expectation.
In this experiment, the size of negative examples is increased from five to 200. There
is a possibility that most of the negative examples are from the same class. To further test
the capability of KDEM and KBDA in classifying negative examples with a varying
number of classes, we perform experiments on synthetic data for which we have more
controls over data distribution.
A series of synthetic data sets is generated based on Gaussian or Gaussian mixture
models with feature dimensions of 2, 5 and 10 and a number of negative classes varying
from 1 to 9. In the feature space, the centroid of the positive samples is set at the origin and the
centroids of the negative classes are set randomly at distance 1 from the origin. The
variance of each class is a random number between 0.1 and 0.3. The features are
independent of each other. We include the 2D synthetic data for visualization purposes. Both
the training and testing sets have a fixed size of 200 samples, with 100 positive samples and
100 negative samples drawn from the varying number of classes.
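The synthetic-data protocol described above can be sketched as follows (our reading of the description; the helper name make_synthetic is ours):

```python
import numpy as np

def make_synthetic(dim=2, n_neg_classes=3, n_per_side=100, seed=0):
    """Positive centroid at the origin; negative centroids at unit distance
    in random directions; per-class std drawn from [0.1, 0.3]."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(0.0, rng.uniform(0.1, 0.3), (n_per_side, dim))
    neg = []
    for _ in range(n_neg_classes):
        d = rng.normal(size=dim)
        center = d / np.linalg.norm(d)               # distance 1 from origin
        neg.append(rng.normal(center, rng.uniform(0.1, 0.3),
                              (n_per_side // n_neg_classes, dim)))
    return pos, np.vstack(neg)

pos, neg = make_synthetic(dim=5, n_neg_classes=5)
assert pos.shape == (100, 5)
# Each empirical negative-class centroid sits near unit distance from the origin.
assert all(abs(np.linalg.norm(neg[i * 20:(i + 1) * 20].mean(0)) - 1) < 0.3
           for i in range(5))
```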
Figure 5 shows the comparison of the KDEM, KBDA and DEM algorithms on 2D, 5D
and 10D synthetic data. In all cases, as the number of negative classes increases from 1
Figure 5. Comparison of KDEM, KBDA and DEM algorithms on (a) 2-D, (b) 5-D, (c) 10-D synthetic data with varying number of negative classes (x-axis: number of negative classes; y-axis: error rate)
to 9, KDEM always performs better than KBDA and DEM, thus showing its superior
capability for multiclass classification. Linear DEM has comparable performance to
KBDA on 2D synthetic data and outperforms KBDA on 10D synthetic data. One possible
reason is that learning is performed on hybrid data in both DEM and KDEM, while only labeled data
are used in KBDA.
$$\hat{S}_P = (1 - \mu)\, S_P + \frac{\mu}{n}\, \mathrm{tr}[S_P]\, I \qquad (26)$$

$$\hat{S}_{NP} = (1 - \gamma)\, S_{NP} + \frac{\gamma}{n}\, \mathrm{tr}[S_{NP}]\, I \qquad (27)$$
The parameter $\mu$ controls shrinkage toward a multiple of the identity matrix, $\mathrm{tr}[\cdot]$
denotes the trace of a matrix, and $\gamma$ is the discounting factor. With different
combinations of the $(\mu, \gamma)$ values, the regularized and/or discounted BDA provides a rich
set of alternatives: $(\mu = 0, \gamma = 1)$ gives a subspace that is mainly defined by minimizing the
scatter among the positive examples, resembling the effect of a whitening transform
(the whitening transform is the special case when only positive examples are considered);
$(\mu = 1, \gamma = 0)$ gives a subspace that mainly separates the negatives from the positive centroid,
with minimal effort on clustering the positive examples; $(\mu = 0, \gamma = 0)$ is full BDA; and
$(\mu = 1, \gamma = 1)$ represents the extreme of discounting all configurations of the training
examples and keeping the original feature space unchanged.
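The effect of the shrinkage in Equations (26) and (27) can be illustrated on a rank-deficient scatter matrix (a toy sketch of ours, using μ for the shrinkage weight):

```python
import numpy as np

def shrink(S, mu):
    """Blend a scatter matrix with a multiple of the identity, as in Eqs. (26)-(27)."""
    n = S.shape[0]
    return (1.0 - mu) * S + (mu / n) * np.trace(S) * np.eye(n)

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 10))                  # 5 samples in 10-D: rank-deficient
S = (X - X.mean(0)).T @ (X - X.mean(0))       # singular scatter matrix

assert np.linalg.matrix_rank(S) < 10          # cannot be inverted as-is
S_reg = shrink(S, 0.1)
assert np.linalg.matrix_rank(S_reg) == 10     # full rank after shrinkage
np.linalg.inv(S_reg)                          # inversion is now well defined
```

This is exactly the small-sample situation in relevance feedback, where the number of labeled examples is far below the feature dimension.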
However, the set of (, ) was proposed without further testing. Zhou and Huang
(2001) only analyzed full BDA ( = 0, = 0). To take a step further, we also investigate the
various combinations of ( , ) values on the performance of BDA. We test on cropped
face images consisting of 94 facial images (48 male and 48 female). We feed BDA a small
number of training samples with different values of (,). We find that full BDA ( = 0,
= 0) could be further improved by 41.4% in terms of average error rate with a different value
( = 0.1, = 0.4). This is a promising result and we will further investigate the regularization
issue for all discriminant-based approaches in future work.
Table 2. Classification error rates (%) of the six algorithms

              MLP     NN     NN-G    EM     LDEM    KDEM
  I-Feature   33.3    30.2   15.8    21.4    9.2     5.3
  E-Feature   39.6    35.7   20.3    20.8    7.6     4.9
of the dataset resized to 20×20 pixels (these are eigenimages, or E-features). In our experiments, we use 140 labeled images (10 for each hand posture) and 10,000 unlabeled images (randomly selected from the whole database) for training both EM and DEM.
Table 2 shows the comparison; six classification algorithms are compared in this experiment. The multilayer perceptron (Haykin, 1999) used here has one hidden layer of 25 nodes. We experiment with two schemes for the nearest-neighbor classifier: one uses just the 140 labeled samples, while the other uses the 140 labeled samples to bootstrap the classifier through a growing scheme, in which newly labeled samples are added to the classifier according to their labels. The labeled and unlabeled data for both EM and DEM are 140 and 10,000 samples, respectively.
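The growing scheme behind the bootstrapped nearest-neighbor classifier (NN-G in Table 2) can be sketched as follows; this is an illustrative reconstruction of the scheme described in the text, not the authors' code:

```python
import numpy as np

def nn_grow(X_lab, y_lab, X_unlab):
    """1-NN classifier bootstrapped by a growing scheme: each newly
    classified sample is immediately added to the template set under its
    predicted label, so later decisions use a denser set of templates."""
    templates = [np.asarray(x, float) for x in X_lab]
    labels = list(y_lab)
    preds = []
    for x in X_unlab:
        x = np.asarray(x, float)
        dists = [np.linalg.norm(x - t) for t in templates]
        lab = labels[int(np.argmin(dists))]
        preds.append(lab)
        templates.append(x)   # grow the classifier
        labels.append(lab)
    return preds
```

Note that, as in any self-labeling scheme, an early misclassification is fed back as a template, which is one reason NN-G can still trail the discriminative methods in Table 2.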
We observe that multilayer perceptrons are often trapped in local minima, and the nearest-neighbor classifier suffers from the sparsity of the labeled templates. The poor performance of pure EM is due to the fact that the generative model does not capture the
Figure 6. Data distribution in the projected 2-D subspace under (a) linear MDA and (b) kernel MDA. Different postures are more separated and clustered in the nonlinear subspace by KMDA.
Figure 7. (a) Some images correctly classified by both LDEM and KDEM; (b) images mislabeled by LDEM but correctly labeled by KDEM; (c) images that neither LDEM nor KDEM can correctly label.
ground-truth distribution well, since the underlying data distribution is highly complex. It is not surprising that linear DEM (LDEM) and KDEM outperform the other methods, since the D-step optimizes class separability.
Comparing KDEM with LDEM, we find that KDEM often projects classes to approximately Gaussian clusters in the transformed space, which facilitates modeling them with Gaussians. Figure 6 shows typical transformed data sets for linear and nonlinear discriminant analysis, in projected 2-D subspaces of three different hand postures; the postures are more separated and clustered in the nonlinear subspace obtained by KMDA. Figure 7 shows examples of hand postures correctly classified and mislabeled by KDEM and linear DEM.
Fingertip Tracking
In some vision-based gesture interface systems, fingers could be used as accurate
pointing input devices. Also, fingertip detection and tracking play an important role in
recovering hand articulations. A difficulty of the task is that fingertip motion often
undergoes arbitrary rotations, which makes it hard to invariantly characterize fingertips.
In the last experiment, the proposed Kernel DEM algorithm is employed to discriminate
fingertips and nonfingertips.
We collected 1,000 training samples including both fingertips and nonfingertips; the nonfingertip samples are collected from the background of the working space. Some training samples are shown in Figure 8. Fifty samples for each of the two classes are manually labeled. Training images are resized to 20×20 and converted to gray-level images, and each training sample is represented by its coefficients on the 22 largest principal components. The Kernel DEM algorithm is performed on this training dataset to obtain a kernel transformation and a Bayesian classifier. Assume that at time t − 1 the fingertip location is
X_{t−1} in the image. At time t, the predicted location of the fingertip is X̃_t according to the Kalman prediction. For simplicity, the size of the search window is fixed at 10×10, centered at X̃_t. For each location in the search window, a fingertip candidate is constructed from the 20×20 image centered at that location; thus, 100 candidates are tested. A probabilistic label for each fingertip candidate is obtained by classification, and the candidate with the largest probability is taken as the tracked location at time t. We ran the tracking algorithm on sequences containing a large amount of fingertip rotation and complex backgrounds; the tracking results are fairly accurate.
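The candidate search just described can be sketched as below, with `score` standing in for the learned kernel-transform-plus-Bayesian classifier applied to the 20×20 patch at a location (a hypothetical stand-in, not the chapter's implementation):

```python
import numpy as np

def track_fingertip(score, pred_xy, win=10):
    """Evaluate every location in a win x win search window centered at
    the Kalman-predicted location and return the one whose candidate
    patch gets the highest fingertip probability (win=10 -> 100 tests)."""
    px, py = pred_xy
    best_p, best_xy = -np.inf, pred_xy
    for dx in range(-win // 2, win // 2):
        for dy in range(-win // 2, win // 2):
            p = score(px + dx, py + dy)
            if p > best_p:
                best_p, best_xy = p, (px + dx, py + dy)
    return best_xy
```

The Kalman prediction keeps the window small (100 candidates), which is what makes exhaustive scoring of every window position affordable per frame.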
CONCLUSION
Two sampling schemes are proposed for efficient, kernel-based, nonlinear, multiple
discriminant analysis. These algorithms identify a representative subset of the training
samples for the purpose of classification. Benchmark tests show that KMDA with these
adaptations not only outperforms the linear MDA but also performs comparably with the
best-known supervised learning algorithms. We also present a self-supervised discriminant analysis technique, Kernel DEM (KDEM), which employs both labeled and unlabeled data in training. On synthetic data and real image databases for several applications, such as image classification, hand posture recognition, and fingertip tracking, KDEM shows superior performance over biased discriminant analysis (BDA), naïve supervised learning, and several other existing semisupervised learning algorithms.
Our future work includes several aspects: (1) we will look further into the regularization-factor issue for discriminant-based approaches on a large database; (2) we will intelligently integrate biased discriminant analysis, suited to small numbers of training samples, with traditional multiple discriminant analysis on large numbers of training samples and varying numbers of classes; (3) to avoid heavy computation over the whole database, we will investigate schemes for selecting a representative subset of unlabeled data whenever unlabeled data helps, and perform parametric or nonparametric tests on the conditions under which it does not help; (4) Gaussian or Gaussian-mixture models are assumed for the data distribution in the projected optimal subspace, even when the initial data distribution is highly non-Gaussian; we will examine the data-modeling issue more closely with Gaussian (or Gaussian-mixture) and non-Gaussian distributions.
ACKNOWLEDGMENT
This work was supported in part by the National Science Foundation (NSF) under EIA-99-75019 at the University of Illinois at Urbana-Champaign, and by the University of Texas at San Antonio.
REFERENCES
Basri, R., Roth, D., & Jacobs, D. (1998). Clustering appearances of 3D objects. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
Belhumeur, P., Hespanha, J., & Kriegman, D. (1996). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. Proceedings of European Conference on Computer Vision, Cambridge, UK.
CBCL Face Database #1, MIT Center for Biological and Computational Learning. Online: http://www.ai.mit.edu/projects/cbcl
Cohen, I., Sebe, N., Cozman, F. G., Cirelo, M. C., & Huang, T. S. (2003). Learning Bayesian
network classifiers for facial expression recognition with both labeled and unlabeled data. Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Madison, WI.
Cox, I. J., Miller, M. L., Minka, T. P., & Papathomas, T. V. (2000). The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.
Cozman, F. G., & Cohen, I. (2002). Unlabeled data can degrade classification performance of generative classifiers. Proceedings of the 15th International Florida Artificial Intelligence Society Conference, Pensacola, FL (pp. 327-331).
Cui, Y., & Weng, J. (1996). Hand sign recognition from intensity image sequence with
complex background. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, San Francisco (pp. 88-93).
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks. New
York: John Wiley & Sons.
Duda, R. O., Hart, P. E., & Stork, D.G. (2001). Pattern classification (2nd ed.). New York:
John Wiley & Sons.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, vol. 7, 179-188.
Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of
Eugenics, vol. 8, 376-386.
Friedman, J. (1989). Regularized discriminant analysis. Journal of American Statistical
Association, 84(405), 165-175.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). NJ: Prentice
Hall.
Jain, A. K., & Farrokhnia, F. (1991). Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12), 1167-1186.
Martinez, A. M., & Kak, A. C., (2001). PCA versus LDA. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(2), 228-233.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K.-R. (1999a). Fisher discriminant analysis with kernels. Proceedings of IEEE Workshop on Neural Networks for Signal Processing.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K.-R. (1999b). Invariant feature extraction and classification in kernel spaces. Proceedings of Neural Information Processing Systems, Denver, CO.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K.-R. (2003). Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5).
Mitchell, T. (1999). The role of unlabeled data in supervised learning. Proceedings of
the Sixth International Colloquium on Cognitive Science, Spain.
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.
Roth, V., & Steinhage, V. (1999). Nonlinear discriminant analysis using kernel functions.
Proceedings of Neural Information Processing Systems, Denver, CO.
Rui, Y., Huang, T. S., Ortega, M., & Mehrotra, S. (1998). Relevance feedback: A power
tool in interactive content-based image retrieval. IEEE Transactions on Circuits
and Systems for Video Technology, 8(5), 644-655.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.
Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image
retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(12), 1349-1380.
Tian, Q., Hong, P., & Huang, T. S. (2000). Update relevant image weights for content-based image retrieval using support vector machines. Proceedings of IEEE International Conference on Multimedia and Expo, New York (vol. 2, pp. 1199-1202).
Tian, Q., Yu, J., Wu, Y., & Huang, T.S. (2004). Learning based on kernel discriminant-EM
algorithm for image classification. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Montreal, Quebec, Canada.
Tieu, K., & Viola, P. (2000). Boosting image retrieval. Proceedings of IEEE International
Conference on Computer Vision and Pattern Recognition, Hilton Head, SC.
Tong, S., & Chang, E. (2001). Support vector machine active learning for image retrieval. Proceedings of ACM International Conference on Multimedia, Ottawa, Canada (pp. 107-118).
Vapnik, V. (2000). The nature of statistical learning theory (2nd ed.). Springer-Verlag.
Wang, L., Chan, K. L., & Zhang, Z. (2003). Bootstrapping SVM active learning by
incorporating unlabelled images for image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.
Weber, M., Welling, M., & Perona, P. (2000). Towards automatic discovery of object
categories. Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Hilton Head Island, SC.
Wolf, L., & Shashua, A. (2003). Kernel principal angles for classification machines with applications to image sequence interpretation. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.
Wu, Y., & Huang, T. S. (2001). Self-supervised learning for object recognition based on
kernel discriminant-EM algorithm. Proceedings of IEEE International Conference
on Computer Vision, Vancouver, Canada.
Wu, Y., Tian, Q., & Huang, T. S. (2000). Discriminant EM algorithm with application to
image retrieval. Proceedings of IEEE International Conference on Computer
Vision and Pattern Recognition, Hilton Head Island, SC.
Zhou, X., & Huang, T.S. (2001). Small sample learning during multimedia retrieval using
biasMap. Proceedings of IEEE International Conference on Computer Vision and
Pattern Recognition, Hawaii.
ENDNOTES
1. A term used in the kernel machine literature to denote the new space after the nonlinear transform; not to be confused with the feature space concept used in content-based image retrieval, which denotes the space of features or descriptors extracted from the media data.
2. The benchmark data sets are obtained from http://mlg.anu.edu.au/~raetsch/.
3. The MIT facial database can be downloaded from http://www.ai.mit.edu/projects/cbcl/software-datasets/FaceData2.html.
4. The Corel database is widely used as a benchmark in content-based image retrieval.
Section 2
Audio and Video Semantics:
Models and Standards
Chapter 4
Context-Based
Interpretation and
Indexing of Video Data
Ankush Mittal, IIT Roorkee, India
Cheong Loong Fah, The National University of Singapore, Singapore
Ashraf Kassim, The National University of Singapore, Singapore
Krishnan V. Pagalthivarthi, IIT Delhi, India
ABSTRACT
Most video retrieval systems work with a single shot, without considering the temporal context in which the shot appears. However, the meaning of a shot depends on the context in which it is situated, and a change in the order of the shots within a scene changes the meaning of the shot. Recently, it has been shown that to find higher-level interpretations of a collection of shots (i.e., a sequence), intershot analysis is at least as important as intrashot analysis; several such interpretations would be impossible without a context. Contextual characterization of video data involves extracting patterns in the temporal behavior of video features and mapping these patterns to a high-level interpretation. A Dynamic Bayesian Network (DBN) framework is designed in which the temporal context of a video segment is considered at different granularities, depending on the desired application. Novel applications of the system include classifying a group of shots, called a sequence, and parsing a video program into individual segments by building a model of the video program.
INTRODUCTION
Many pattern recognition problems cannot be handled satisfactorily in the absence of contextual information, as the observed values under-constrain the recognition problem, leading to ambiguous interpretations. Context is here loosely defined as the local domain from which observations are taken; it often includes spatially or temporally related measurements (Yu & Fu, 1983; Olson & Chun, 2001), though our focus is on the temporal aspect, that is, on measurements and the formation of relationships over larger timelines. Note that our definition does not address contextual meaning arising from culturally determined connotations, such as a rose as a symbol of love.
A landmark in the understanding of film perception was the Kuleshov experiments (Kuleshov, 1974). Kuleshov showed that the juxtaposition of two unrelated images forces the viewer to find a connection between them, and that the meaning of a shot depends on the context in which it is situated. Experiments concerning contextual details performed by Frith and Robson (1975) showed that a film sequence has a structure that can be described through selection rules.
In video data, each shot contains only a small amount of semantic information. A
shot is similar to a sentence in a piece of text; it consists of some semantic meaning which
may not be comprehensible in the absence of sufficient context. Actions have to be
developed sequentially; simultaneous or parallel processes are shown one after the other
in a concatenation of shots. Specific domains contain rich temporal transitional structures that help in the classification process. In sports, the events that unfold are
governed by the rules of the sport and therefore contain a recurring temporal structure.
The rules of production of videos for such applications have also been standardized. For
example, in baseball videos, there are only a few recurrent views, such as pitching, close
up, home plate, crowd and so forth (Chang & Sundaram, 2000). Similarly, for medical
videos, there is a fixed clinical procedure for capturing different video views and thus the
temporal structures are exhibited.
The sequential order of events creates a temporal context or structure. Temporal
context helps create expectancies about what may come next, and when it will happen.
In other words, temporal context may direct attention to important events as they unfold
over time.
With the assumption that there is inherent structure in most video classes,
especially in a temporal domain, we can design a suitable framework for automatic
recognition of video classes. Typically in a Content Based Retrieval (CBR) system, there
are several elements which determine the nature of the content and its meaning. The
problem can thus be stated as extracting patterns in the temporal behavior of each
variable and also in the dynamics of relationship between the variables, and mapping
these patterns to a high-level interpretation. We tackle the problem in a dynamic Bayesian framework that can learn the temporal structure through the fusion of all the features (for a tutorial, see Ghahramani, 1997).
The chapter is organized as follows. A brief review of related work is presented first.
Next we describe the descriptors that we used in this work to characterize the video. The
algorithms for contextual information extraction are then presented along with a strategy
for building larger video models. Then we present the overview of the DBN framework
and structure of DBN. A discussion on what needs to be learned, and the problems in
using a conventional DBN learning approach are also presented in this section. Experiments and results are then presented, followed by discussion and conclusions.
RELATED WORK
Extracting information from the spatial context has found its use in many applications, primarily in remote sensing (Jeon & Landgrebe, 1990; Kittler & Foglein, 1984),
character recognition (Kittler & Foglein, n.d.), and detection of faults and cracks (Bryson
et al., 1994). Extracting temporal information is, however, a more complicated task but it
has been shown to be important in many applications like discrete monitoring (Nicholson,
1994) and plan recognition tasks, such as tracking football players in a video (Intille &
Bobick, 1995) and traffic monitoring (Pynadath & Wellman, 1995). Contextual information
extraction has also been studied for problems such as activity recognition using
graphical models (Hamid & Huang, 2003), visual intrusion detection (Kettnaker, 2003)
and face recognition.
In an interesting analysis done by Nack and Parkes (1997), it is shown how the
editing process can be used to automatically generate short sequences of video that
realize a particular theme, say humor. Thus for extraction of indices like humor, climax and
so forth, context information is very important. A use of context for finding an important
shot in a sequence is highlighted by the work of Aigrain and Joly (1996), who detect editing-rhythm changes through second-order regressive modeling of shot duration: the duration PRED(n) of the nth shot is predicted from the two preceding shot durations, with the coefficients a and b estimated in a 10-shot sliding window. The rule employed is that if Tn > 2·PRED(n) or Tn < PRED(n)/2, then the nth shot is likely to be an important (distinguished) shot in the sequence. This model was based solely on tapping the rhythm information
through shot durations. Dorai and Venkatesh (2001) have recently proposed an algorithmic framework called computational media aesthetics for understanding the dynamic structure of the narrative via analysis of the integration and sequencing of audio/video elements. They consider expressive elements such as tempo, rhythm, and tone. Tempo, or pace, is the rate of performance or delivery; it is a reflection of the speed and time of the underlying events being portrayed and affects the overall sense of time
of a movie. They define P (n), a continuous valued pace function, as
P (n) = W (s(n)) +
m(n) - m
m
where s refers to shot length in frames, m to motion magnitude, m and m, to the mean
and standard deviation of motion respectively and n to shot number. W (s(n)) is an overall
two-part shot length normalizing scheme, having the property of being more sensitive
near the median shot length, but slows in gradient as shot length increases into the
longer range.
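The pace function P(n) can be sketched as below. Since the exact two-part normalizer W is not reproduced in this chapter, a simple median-based placeholder is substituted; that placeholder is our assumption, not Dorai and Venkatesh's W:

```python
import numpy as np

def pace(shot_len, motion, W=None):
    """P(n) = W(s(n)) + (m(n) - mu_m) / sigma_m.  `W` normalizes shot
    length; the default below is only a placeholder shaped around the
    median shot length, not the original two-part scheme."""
    s = np.asarray(shot_len, float)
    m = np.asarray(motion, float)
    if W is None:
        med = np.median(s)
        W = lambda sn: med / sn          # placeholder normalizer (assumed)
    return np.array([W(v) for v in s]) + (m - m.mean()) / m.std()
```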
We would like to view the contextual information extraction from a more generic
perspective and consider the extraction of temporal pattern in the behavior of the
variables.
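As a concrete instance of tapping rhythm information, the shot-duration rule of Aigrain and Joly above can be sketched as follows; fitting a and b by least squares over the sliding window is our assumption about the estimator, not a detail given in the chapter:

```python
import numpy as np

def important_shots(T, window=10):
    """Flag shot n as important when its duration T[n] deviates from the
    second-order regressive prediction PRED(n) = a*T[n-1] + b*T[n-2],
    i.e., when T[n] > 2*PRED(n) or T[n] < PRED(n)/2.  The coefficients
    a, b are re-estimated over the preceding `window` shots."""
    flags = []
    for n in range(window + 2, len(T)):
        rows = np.array([[T[i - 1], T[i - 2]] for i in range(n - window, n)])
        targets = np.array([T[i] for i in range(n - window, n)])
        (a, b), *_ = np.linalg.lstsq(rows, targets, rcond=None)
        pred = a * T[n - 1] + b * T[n - 2]
        if T[n] > 2 * pred or T[n] < pred / 2:
            flags.append(n)
    return flags
```

On a run of roughly constant shot durations, a single unusually long (or short) shot violates the prediction band and is flagged as distinguished.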
THE DESCRIPTORS
The perceptual level features considered in this work are Time-to-Collision (TTC),
shot editing, and temporal motion activity. Since our emphasis is on presenting algorithms for contextual information, they are only briefly discussed here. The interested
reader can refer to Mittal and Cheong (2001) and Mittal and Altman (2003) for more details.
TTC is the time needed for the observer to reach the object if the instantaneous relative velocity along the optical axis is kept unchanged (Meyer, 1994). There exists a specific mechanism in the human visual system designed to cause one to blink or to avoid a looming object that approaches too quickly. A video shot with a small TTC evokes fear because it indicates a scene of impending collision; thus, TTC can serve as a potent cue for the characterization of an accident or violence.
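The chapter computes TTC from image measurements. One classical route, used here purely as an illustrative assumption rather than necessarily the authors' estimator, derives TTC from the divergence of the optical-flow field, which equals 2/TTC for translation toward a frontoparallel surface:

```python
import numpy as np

def ttc_from_flow(u, v, spacing=1.0):
    """Estimate time-to-collision from a dense optical-flow field (u, v).
    For translation along the optical axis toward a frontoparallel
    surface, div(flow) = 2 / TTC; non-positive divergence means the
    surface is receding, so TTC is reported as infinite."""
    div = np.gradient(u, spacing, axis=1) + np.gradient(v, spacing, axis=0)
    d = float(div.mean())
    return float("inf") if d <= 0 else 2.0 / d
```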
Although complex editing cues are employed by the cameramen for making a
coherent sequence, the two most significant ones are the shot transitions and shot
pacing. The use of a particular shot transition, like dissolve, wipe, fade-in, and so forth,
can be associated with the possible intentions of the movie producer. For example,
dissolves have been typically used to bring about smoothness in the passage of time or
place. A shot depicting a close-up of a young woman followed by a dissolve to a shot
containing an old woman suggests that the young woman has become old. Similarly, shot
pacing can be adjusted accordingly for creating the desired effects, like building up of
the tension by using a fast cutting-rate. The third feature, that is, temporal motion feature,
characterizes motion via several measures such as total motion activity, distribution of
motion, local-motion/global-motion, and so forth.
Figure 1. Shots leading to climax in a movie (chase, collision alarm, and climax scene)
Figure 2. Feature values for the shots in Figure 1. Shots in which TTC length is not
shown correspond to a case when TTC is infinite; that is, there is no possibility of
collision.
Figure 1 shows the effectiveness of TTC in content characterization, with the example of a climax. Before the climax of a movie there are generally some chase scenes, leading to the meeting of the bad protagonists of the movie with the good ones. One such movie is depicted in this figure, where the camera shows the perspectives of both the prey and the predator, leading to several large impending-collision lengths, as depicted in Figure 2. During the climax there is direct contact, and therefore the collision lengths are small and the collisions frequent. Combined with large motion and small shot length (both relative to the context), TTC can be used to extract climax sequences.
Many works in the past have focused on shot classification based on single-shot
features like color, intensity variation, and so forth. We believe that since each video
class represents structured events unfolding in time, the appropriate class signatures are
[Hierarchy-of-descriptors diagram: frame-level descriptors (optical flow, color statistics, frame difference) are combined, using context within a shot, into shot-level descriptors (collision detection, shot transition, motion descriptors); these are combined, using context over shots, into sequence-level descriptors (feelings, activity, scene characteristics); and these in turn are combined, using context over a large timeline, into media descriptors.]
also present in the temporal domain. Features such as shot length, motion, TTC, and so forth were chosen to illustrate the importance of temporal information, as they are perceptual in nature, as opposed to low-level features such as color. For example, a large number of collisions typically depicts a violent scene, whereas one or two collision shots followed by a fade to a long shot typically depicts a melancholy scene. If the context information over the neighboring shots is not considered, then these and many other distinctions would not be possible.
Finally, these indices are integrated over a large timeline to obtain semantic
structures (like for news and sports) or media descriptors. An action movie has many
scenes where violence is shown, a thriller has many climax scenes and an emotional movie
has many sentimental scenes (with close-ups, special effects, etc.). These make it
possible to perform automatic labeling as well as efficient screening of the media (for
example, for violence or for profanity).
Descriptors at different levels of the hierarchy need to be handled with different
strategies. One basic framework is presented in the following sections.
The Algorithm
Figure 4 depicts the steps in a context information extraction (mainly the first pass
over the shots). The digitized media is kept in the database and the shot transition module
segments the raw video into shots. The sequence composer groups a number of shots
(say ) based on the similarity values of the shots (Jain, Vailaya, & Wei, 1999) depending
on the selected application. For example, for applications like climax detection, the
sequences consist of a much larger number of shots than that for the sports identifier.
The feature extraction of a sequence yields an observation sequence, which
consists of feature vectors corresponding to each shot in the sequence. An appropriate
DBN Model is selected based on the application, which determines the complexity of the
mapping required. Thus, each application has its own model (although the DBN could
be made in such a way that a few applications could share a common model, but the
performance would not be optimal). During the training phase, only the sequences
corresponding to positive examples of the domain are used for learning. During the
querying or labeling phase, DBN evaluates the likelihood of the input observation
sequence belonging to the domain it represents. The sequence labeling module and the
application selector communicate with each other to define the task which needs to be
performed, with the output being a set of likelihoods. If the application involves
classifying into one of the exclusive genres, the label corresponding to the DBN model
with the maximum likelihood is assigned. In general, however, a threshold is chosen
(automatically during the training phase) over the likelihoods of a DBN model. If the
likelihood is more than the threshold for a DBN model, the corresponding label is
assigned. Thus, a sequence can have zero label or multiple labels (such as interesting,
soccer, and violent) after the first pass. Domain rules aid further classification in the
subsequent passes, the details of which are presented in the next section.
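The first-pass labeling logic just described can be sketched as follows. The model interface is hypothetical: each DBN model is represented by any callable that returns a log-likelihood for the observation sequence:

```python
def label_sequence(obs, models, thresholds, exclusive=False):
    """First-pass labeling.  With exclusive genres, assign the label of
    the DBN model with maximum likelihood; otherwise assign every label
    whose model likelihood clears its training-derived threshold, so a
    sequence may receive zero, one, or several labels."""
    scores = {name: model(obs) for name, model in models.items()}
    if exclusive:
        return [max(scores, key=scores.get)]
    return sorted(name for name, s in scores.items() if s > thresholds[name])
```

The non-exclusive branch is what allows a single sequence to come out of the first pass tagged, for instance, both "soccer" and "violent", with domain rules refining the labels in later passes.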
Figure 4. Steps in context information extraction (details of the first pass are shown
here)
[Flowchart: the media database feeds a shot-transition detector, whose shots enter sequence composition; the feature-extraction module (TTC, motion, color, …) turns each sequence into an observation sequence, which the application selector routes to DBN models 1 … k for sequence labeling; a second pass follows.]
(e.g., in English: qu, ee, tion) are used to provide a context within which the character may
be interpreted.
Three examples of algorithms at this level are briefly presented in this section as
follows:
1.
[DBN structure diagram: a window of 2M + 1 sequences centered at sequence i, with nodes Effect_{i−M} … Effect_{i+M}, Seq_{i−M} … Seq_{i+M}, and Label_{i−M} … Label_{i+M}, unrolled along the time axis.]
2.
3.
a facility to browse any one of them. Examples of such queries could be "Show me the sport clip that came in the news" and "Go to the weather report."
If a video can be segmented into its scene units, the user can more conveniently
browse through that video on a scene basis rather than on a shot-by-shot basis, as is
commonly done in practice. This allows a significant reduction of information to be
conveyed or presented to the user.
Zweig (1998) has shown that a Finite State Automaton (FSA) can be modeled using
DBNs. The same idea is explored in the domain of a CBR system. The video programs
evolve through a series of distinct processes, each of which is best represented by a
separate model. When modeling these processes, it is convenient to create submodels
for each stage, and to model the entire process as a composition of atomic parts. By
factoring a complex model into a combination of simpler ones, we achieve a combinatorial
reduction in the number of models that need to be learned. Thus a probabilistic
nondeterministic FSA can be constructed as shown in Mittal and Altman (2003). In the
FSA, each state represents a stage of development, and the automaton can be constructed either manually or automatically from a few hours of news programming.
Since most video programs begin and end with a specific video sequence (which
can be recognized), modeling the entire structure through FSA, which has explicitly
defined start and end states, is justified. The probabilistic FSA for news has transitions labeled with the type of shot cut (i.e., dissolve, wipe, etc.), each with an associated probability.
The modeling of the FSA by DBN can be done as follows. The position in the FSA
at a specific time is represented by a state variable in the DBN. The DBN transition
variable encodes which arc is taken out of the FSA state at any particular time. The number
of values the transition variable assumes is equal to the maximum out-degree of any of
the states in the FSA. The transition probabilities associated with the arcs in the
automaton are reflected in the class probability tables associated with the transition
variables in the DBN.
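The mapping from FSA to DBN described above can be sketched in code. The states and arc probabilities below are invented for illustration (the chapter's actual news FSA is not reproduced here); the point is that the DBN transition variable needs only as many values as the maximum out-degree of the automaton, with its class probability table mirroring the arc probabilities.

```python
import random

# Illustrative probabilistic FSA for a news program. States and
# probabilities are made up for this sketch, not taken from the chapter.
# Each state maps to a list of (next_state, probability) arcs.
fsa = {
    "intro":   [("anchor", 1.0)],
    "anchor":  [("report", 0.7), ("weather", 0.2), ("end", 0.1)],
    "report":  [("anchor", 0.8), ("report", 0.2)],
    "weather": [("end", 1.0)],
}

# The DBN transition variable takes as many values as the maximum
# out-degree of any FSA state; its class probability table per state
# mirrors the arc probabilities of the automaton.
max_out_degree = max(len(arcs) for arcs in fsa.values())

def step(state, rng):
    """Sample the transition variable, i.e., which arc leaves `state`."""
    arcs = fsa[state]
    r, acc = rng.random(), 0.0
    for nxt, p in arcs:
        acc += p
        if r <= acc:
            return nxt
    return arcs[-1][0]

def sample_program(rng, max_len=20):
    """Unroll the FSA over time, as the DBN does across time slices."""
    state, path = "intro", ["intro"]
    while state != "end" and len(path) < max_len:
        state = step(state, rng)
        path.append(state)
    return path

path = sample_program(random.Random(0))
```

Here the explicit start ("intro") and absorbing end state correspond to the recognizable opening and closing sequences of the program.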
An advantage of DBN over other learning tools, such as SVM, is that DBN offers an interpretation in terms of probability, which makes it suitable to be part of a larger model.
P(z_0, ..., z_T) = P_BN0(z_0) ∏_{t=0}^{T−1} P_BN(z_{t+1} | z_t)

The joint distribution over the states at t and t+1 is proportional to α_t(z_t) P_BN(z_{t+1} | z_t) β_{t+1}(z_{t+1}) P_BN(F_{t+1} | z_{t+1}), where α and β are the forward and backward variables.
The learning algorithm for dynamic Bayesian networks follows from the EM
algorithm. The goal of sequence decoding in DBN is to find the most likely state sequence of hidden variables given the observations: X*_T = arg max_{X_T} P(X_T | seq_T). This
task can be achieved by using the Viterbi algorithm (Viterbi, 1967) based on dynamic
programming. Decoding attempts to uncover the hidden part of the model and outputs
the state sequence that best explains the observations. The previous section presents
an application to parse a video program where each state corresponds to a segment of
the program.
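Viterbi decoding can be sketched as follows. The two-state "news vs. commercial" model, its probabilities, and the coarse black-frame observation are all hypothetical, chosen only to make the dynamic-programming recursion concrete.

```python
def viterbi(states, start_p, trans_p, emit_p, observations):
    """Most likely hidden state sequence X*_T = argmax P(X_T | seq_T)."""
    # V[t][s]: probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r] * trans_p[r][s] * emit_p[s][observations[t]], r)
                for r in states)
            V[t][s], back[t][s] = prob, prev
    # Backtrack from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state parser observed through a "black frame?" feature.
states = ["news", "commercial"]
start = {"news": 0.8, "commercial": 0.2}
trans = {"news": {"news": 0.9, "commercial": 0.1},
         "commercial": {"news": 0.3, "commercial": 0.7}}
emit = {"news": {"black": 0.05, "normal": 0.95},
        "commercial": {"black": 0.6, "normal": 0.4}}
obs = ["normal", "normal", "black", "black", "normal"]
best = viterbi(states, start, trans, emit, obs)
```

Each decoded state corresponds to a segment label of the program, exactly as in the parsing application described above.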
2.
3.
4.
separate DBN is employed; the number of hidden states is optimized by trial and error so as to achieve the best performance on the training data. Since feature extraction on video is very slow (three frames per second), we do not at present have enough data to learn the structure of the DBN for our application. We
assumed a simple DBN structure as shown in Figure 6.
Another characteristic of DBN learning is that features which are not relevant to the class have spread-out estimated probability density functions; such features can easily be identified and removed. For example, shot-transition length has no significance in cricket sequences, and this feature can therefore be removed.
Like most learning tools, DBN requires a few representative training sequences in order to extract the rules. For example, training only on interesting cricket sequences does not allow interesting badminton sequences to be extracted, because the parameter learning does not generalize: DBN learns the shot length specific to cricket, along with the length of the special effect (i.e., wipe) used to indicate a replay. A human expert, on the other hand, would probably be able to generalize these rules.
Of course, this learning strategy of DBN has its own advantages. Consider, for example, a sequence of cricket shots (which is noninteresting) having two wipe transitions separated by many flat-cut shots. Since DBN is trained to expect only one to three shots between two wipe transitions, it would not recognize this as an interesting sequence.

Figure 6. DBN architecture for the CBR system. The black frame feature has value 1 if the shot ends with a black frame and 0 otherwise; it is especially relevant in the detection of commercials.

[Figure: a state evolution model links State t-1, State t and State t+1; within time slice t, an observation model links each state to its feature vector (black frame, shot length, cut type, length of transition, presence of TTC, length of TTC, frame difference).]
Another practical problem we faced was deciding the optimum number of shots for DBN learning. If redundant shots are included, DBN fails to learn the pattern. For example, in the climax application, if we train the DBN only with the climax scene, it performs the classification with good accuracy; however, if we increase the number of shots, the classification accuracy drops. This implies that the training samples should have just enough shots to model or characterize the pattern (which at present requires human input).
Modeling video programs is usually difficult, as they can rarely be mapped onto fixed templates. Many factors contribute to variations in the parameters, which makes the stochastic learning of DBN highly suitable. The next section therefore considers subsequent passes, which take the initial probability assigned by DBN and try to improve the classification performance based on the task at hand.
Sequence Classifier
As discussed before, the problem of sequence classification is to assign labels from
one or more of the classes to the observation sequence seq_T, consisting of feature vectors F_0, ..., F_T. Table 2 shows the classification performance of the DBN models for six video classes. The experiment was conducted by training each DBN model with preclassified sequences and testing with 30 unclassified sequences of the same class and 50 sequences of other classes.

[Table: composition of the test corpus, flattened in extraction. Classes: Sports (6350), MTV (1210), Commercial (1713), Movie (2725), BBC News (6454), News from Singapore TCS Channel 5, consisting of 2 commercial breaks (8750); total duration 4 hr 33 min 22 sec.]

The DBN models for different classes had a different value
of T, which was based on optimization performed during the training phase. The number
of shots for DBN model of news was 5 shots, of soccer 8 shots, and of commercials 20
shots. The news class consists of all the segments, that is, newscaster shots, outdoor
shots, and so forth. The commercial sequences all have a black frame at the end.
There are two paradigms for training the DBN: (i) on a fixed number of shots (around 7 to 10), and (ii) on a fixed number of frames (400 to 700 frames). The first paradigm works better than the second, as is clear from the recall and precision rates. This could be due to two reasons: first, a fixed number of frames does not yield proper models for the classes; and second, the DBN output is a likelihood, which shrinks as the number of shots in a sequence grows. A larger number of shots implies more state transitions, and since each transition has a probability of less than one, the overall likelihood decreases. Thus, classes with longer shot lengths are favored in the second paradigm, leading to misclassifications.
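The likelihood argument above can be verified with a two-line calculation; the per-transition probabilities below are invented for illustration.

```python
import math

# Each shot transition multiplies the sequence likelihood by a factor < 1,
# so longer sequences inevitably score lower raw likelihoods.
def sequence_log_likelihood(n_shots, per_transition_prob):
    # (n_shots - 1) transitions; the probability value is illustrative.
    return (n_shots - 1) * math.log(per_transition_prob)

short = sequence_log_likelihood(5, 0.8)    # e.g., a 5-shot sequence
long_ = sequence_log_likelihood(20, 0.9)   # a 20-shot sequence

# Even with a higher per-transition probability, the longer sequence ends
# up with the lower overall log-likelihood, which is why a fixed number of
# frames biases the comparison toward classes with longer shots.
```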
In general, DBN modeling gave good results for all the classes except soccer. It is
interesting to note that commercials and news were detected with very high precision
because of the presence of the characteristic black frame and the absence of high motion, respectively. A large number of fade-in and fade-out effects are present in the MTV and commercial classes, and dissolves and wipes are frequently present in sports. The black
frame feature also prevents the MTV class from being classified as a commercial, though both have similar shot lengths and shot transitions.
The poor performance of the soccer class can be explained by the large standard deviations of the DBN parameters corresponding to the motion and shot-length features, signifying that such features characterize the soccer class poorly. Cricket, on the other hand, was much more structured, especially with its large number of wipes.
Highlight Extraction
In replays, the following conventions are typically used:
1.
Replays are bounded by a pair of identical gradual transitions, which can either be
a pair of wipes or a pair of dissolves.
[Table 2: classification performance of the DBN models, flattened in extraction. Recoverable entries include the classes Commercial (CO), Cricket (CR), MTV (MT), News (NE), Soccer (SO) and Tennis (TE), per-class misclassification counts out of 30, and precision/recall figures of 83.9%, 68.3%, 76.7% and 66.5%.]

2.
Cuts and dissolves are the only transition types allowed between two successive
shots during a replay. Figure 7 shows two interesting scenes from the soccer and
cricket videos. Figure 7(a) shows a soccer match in which the player touched the
football with his arm. A cut to a close-up of the referee showing a yellow card is
used. A wipe is employed in the beginning of the replay, showing how the player
touched the ball. A dissolve is used for showing the expression of the manager of
the team, followed by a wipe to indicate the end of the replay. Finally, a close-up
view of the player who received the yellow card is shown.
For the purpose of the experiments, only two sports, cricket and soccer, were considered, although the same idea can be extended to most sports, if not all, because the interesting-scene analysis is based on detecting replays, whose structure is the same for most sports. Below is a typical format of the transition effects around an interesting shot in cricket and soccer.
E^α · C · G1 · E^β · G2 · E^γ

where
E ∈ {Cut, Dissolve}
C ∈ {Cut}
G1, G2 ∈ {Wipe, Dissolve} are the gradual transitions before and after the replay
D ∈ {Dissolve}
α, β, γ ∈ {natural numbers}
· is the "followed by" operator.
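One plausible way to operationalize this pattern is a regular expression over a string of transition codes. The single-letter encoding (C = cut, D = dissolve, W = wipe) and the backreference enforcing an identical pair of bounding gradual transitions are assumptions of this sketch, not the chapter's implementation.

```python
import re

# Transition string: each character is one shot transition.
# A replay is bounded by a pair of identical gradual transitions
# (wipe or dissolve) and contains only cuts and dissolves in between.
REPLAY = re.compile(r"([WD])[CD]+\1")

def find_replays(transitions):
    """Return (start, end) index pairs of replay segments."""
    return [(m.start(), m.end()) for m in REPLAY.finditer(transitions)]

# A few cuts, a wipe opening the replay, cuts/dissolves inside it,
# and a closing wipe -- as in the soccer example above.
spans = find_replays("CCCWCDCWCC")
```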
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Figure 7. Typical formats of the transition effects used in the replay sequences of (a) a soccer match and (b) a cricket match
This special structure of replay is used for identifying interesting shots. The
training sequence for both sports, which was typically 5 to 7 shots, consisted of
interesting shots followed by the replays. Figure 8 shows the classification performance
of DBN for highlight extraction. A threshold could be chosen on the likelihood returned
by DBN based on training data (such that the threshold is less than the likelihood of most
of the training sequences). All sequences possessing a likelihood more than the
threshold of a DBN model are assigned an interesting class label. Figure 8 shows that
only two misclassifications result in testing. One recommendation could be to lower the threshold so that no interesting sequences are missed, although a few uninteresting sequences may then also be labeled and presented to the user. Once an interesting scene is identified, the replays are removed before presenting it to the user.
An interesting detail in this application is that one might want to show the
scoreboard during the extraction of highlights (scoreboards are generally preceded and
followed by many dissolves). DBN can encode in its learning both the previous model
of the shot followed by a replay and the scoreboard sequences with their characteristics.
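The threshold rule described above can be sketched as follows; the log-likelihood values and the safety margin are invented for illustration.

```python
# Choose the DBN decision threshold just below the likelihoods of the
# interesting training sequences, so that (almost) no interesting
# sequence is missed, at the cost of a few false positives.
def pick_threshold(training_log_likelihoods, margin=0.02):
    return min(training_log_likelihoods) - margin

train = [-0.08, -0.11, -0.09, -0.13]   # invented training likelihoods
threshold = pick_threshold(train)

def is_interesting(log_likelihood, threshold):
    return log_likelihood >= threshold

labels = [is_interesting(x, threshold) for x in [-0.07, -0.20, -0.12]]
```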
[Figure 8: log likelihood of test sequences (shots 0-11) plotted against the DBN threshold, with interesting, uninteresting and misclassified sequences marked; the likelihood axis spans roughly 0.05 to 0.3.]
Over the last two decades, a large body of literature has linked exposure to violent television with increased physical aggressiveness among children and violent criminal behavior (Kopel, 1995, p. 17; Centerwall, 1989). The censoring process can be modeled to restrict access to media containing violence. Motion alone is insufficient to characterize violence, as many acceptable classes (especially sports such as car racing) also exhibit high motion. On the other hand, shot length and especially TTC are highly relevant, because the camera is generally at a short distance during violence (so the movements of the actors give an impression of impending collisions). The cutting rate is also high, as many perspectives are generally covered. The set of features used, though, remains the same as in the sequence classifier application.
Figure 10 shows a few images from a violent scene of a movie.
Figure 9 shows the classification performance for a censoring application. For each
sequence in the training and test database, the opinions of three judges were sought in
terms of one of three categories: violent, nonviolent and cannot say. Majority voting was used to decide whether a sequence is violent. The test samples were from sports, MTV, commercials and action movies. Figure 9 shows that while the violent scenes were correctly detected, two MTV shots were misclassified as violent. For one of these MTV sequences, the opinion of two judges was cannot say, while the third classified it as violent (Figure 11). The other sequence contained many objects near the camera and a high cutting rate, although it should have been classified as nonviolent.
[Figure 9: log likelihood of test sequences (shots 0-11) plotted against the DBN threshold for the censoring application; the likelihood axis spans roughly 0.05 to 0.45, with misclassified sequences marked.]
The DBN approach removes the cumbersome task of manually designing a rule-based system. The design of such a rule-based system would have to be based on low-level details, such as thresholding; besides, many temporal structures are difficult to observe directly but can be extracted by automatic learning approaches. The initial labels that DBN assigns to the data prepare it for the subsequent passes, where expert knowledge can be applied without much difficulty.
The experiments conducted in this chapter employed a few perceptual-level features, whose temporal properties are more readily understood than those of low-level features. Though the inclusion of low-level features could enhance the characterization of the categories, it would raise the important issue of dealing with the high dimensionality of such features.
Figure 11. An MTV scene which was misclassified as violent. There are many moving
objects near the camera and the cutting rate is high.
REFERENCES
Aigrain, P., & Joly, P. (1996). Medium knowledge-based macro-segmentation of video
into sequences. Intelligent Multimedia Information Retrieval.
Boyen, X., Friedman, N., & Koller, D. (1999). Discovering the hidden structure of complex dynamic systems. Proceedings of Uncertainty in Artificial Intelligence (pp. 91-100).
Bryson, N., Dixon, R.N., Hunter, J.J., & Taylor, C. (1994). Contextual classification of cracks. Image and Vision Computing, 12, 149-154.
Centerwall, B. (1989). Exposure to television as a risk factor for violence. Journal of Epidemiology, 643-652.
Chang, S. F., & Sundaram, H. (2000). Structural and semantic analysis of video. IEEE
International Conference on Multimedia and Expo (pp. 687-690).
Dorai, C., & Venkatesh, S. (2001). Bridging the semantic gap in content management
systems: Computational media aesthetics. International Conference on Computational Semiotics in Games and New Media (pp. 94-99).
Frith, U., & Robson, J. E. (1975). Perceiving the language of film. Perception, 4, 97-103.
Garg, A., Pavlovic, V., & Rehg, J. M. (2000). Audio-visual speaker detection using
Dynamic Bayesian networks. IEEE Conference on Automatic Face and Gesture
Recognition (pp. 384-390).
Ghahramani, Z. (1997). Learning dynamic Bayesian networks. Adaptive Processing of Temporal Information. Lecture Notes in AI. Springer-Verlag.
Hamid, I. E., & Huang, Yan. (2003). Argmode activity recognition using graphical models.
IEEE CVPR Workshop on Event Mining: Detection and Recognition of Events in
Video (pp. 1-7).
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling
processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5,
267-287.
Intille, S. S., & Bobick, A. F. (1995). Closed-world tracking. IEEE International Conference on Computer Vision (pp. 672-678).
Jain, A. K., Vailaya, A., & Wei, X. (1999). Query by video clip. Multimedia Systems, 369-384.
Jeon, B., & Landgrebe, D. A. (1990). Spatio-temporal contextual classification of remotely
sensed multispectral data. IEEE International Conference on Systems, Man and
Cybernetics (pp. 342-344).
Kettnaker, V. M. (2003). Time-dependent HMMs for visual intrusion detection. IEEE
CVPR Workshop on Event Mining: Detection and Recognition of Events in Video.
Kittler, J., & Foglein, J. (1984). Contextual classification of multispectral pixel data. Image
and Vision Computing, 2, 13-29.
Kopel, D. B. (1995). Massaging the medium: Analyzing and responding to media violence without harming the First Amendment. Kansas Journal of Law and Public Policy, 4, 17.
Kuleshov, L. (1974). Kuleshov on film: Writings of Lev Kuleshov. Berkeley, CA: University of California Press.
Meyer, F. G. (1994). Time-to-collision from first-order models of the motion fields. IEEE
Transactions of Robotics and Automation (pp. 792-798).
Mittal, A., & Altman, E. (2003). Contextual information extraction for video data. The 9th International Conference on Multimedia Modeling (MMM), Taiwan (pp. 209-223).
Mittal, A., & Cheong, L.-F. (2001). Dynamic Bayesian framework for extracting temporal
structure in video. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2, 110-115.
Mittal, A., & Cheong, L.-F. (2003). Framework for synthesizing semantic-level indices.
Journal of Multimedia Tools and Application, 135-158.
Nack, F., & Parkes, A. (1997). The application of video semantics and theme representation in automated video editing. Multimedia Tools and Applications, 57-83.
Nicholson, A. (1994). Dynamic belief networks for discrete monitoring. IEEE Transactions on Systems, Man, and Cybernetics, 24(11), 1593-1610.
Olson, I. R., & Chun, M. M. (2001). Temporal contextual cueing of visual attention.
Journal of Experimental Psychology: Learning, Memory, and Cognition.
Pavlovic, V., Frey, B., & Huang, T. (1999). Time-series classification using mixed-state
dynamic Bayesian networks. IEEE Conference on Computer Vision and Pattern
Recognition (pp. 609-615).
Pavlovic, V., Garg, A., Rehg, J., & Huang, T. (2000). Multimodal speaker detection using
error feedback dynamic Bayesian networks. IEEE Conference on Computer Vision
and Pattern Recognition.
Pynadath, D. V., & Wellman, M. P. (1995). Accounting for context in plan recognition with
application to traffic monitoring. International Conference on Artificial Intelligence, 11.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE (vol. 77, pp. 257-286).
Sondhauss, U., & Weihs, C. (1999). Dynamic Bayesian networks for classification of business cycles. SFB Technical Report No. 17. Online at http://www.statistik.uni-dortmund.de/
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal
decoding algorithm. IEEE Transactions on Information Theory (pp. 260-269).
Yeo, B. L., & Liu, B. (1995). Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology (pp. 533-544).
Yu, T. S., & Fu, K. S. (1983). Recursive contextual classification using a spatial stochastic
model. Pattern Recognition, 16, 89-108.
Zweig, G. G. (1998). Speech recognition with dynamic Bayesian networks. PhD thesis,
Dept. of Computer Science, University of California, Berkeley.
Chapter 5
Content-Based Music
Summarization
and Classification
Changsheng Xu, Institute for Infocomm Research, Singapore
Xi Shao, Institute for Infocomm Research, Singapore
Namunu C. Maddage, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Qi Tian, Institute for Infocomm Research, Singapore
ABSTRACT
INTRODUCTION
Recent advances in computing, networking and multimedia technologies have resulted in tremendous growth of music-related data and have accelerated the need to analyze and understand music content. Music representation is multidimensional and time-dependent, and how to organize and process such a large variety and quantity of music information to allow efficient browsing, searching and retrieval has been an active research area in recent years. Audio content analysis, especially music content understanding, poses a big challenge for those who need to organize and structure music data. The difficulty lies in converting featureless collections of raw music data into forms that allow tools to automatically segment, classify, summarize, search and retrieve large databases. The research community is now at the point where the limitations and properties of the developed methods are well understood and are being used to create more advanced techniques, tailored to user needs and better able to bridge the semantic gap between current audio/music technologies and the semantic needs of interactive media applications.
The aim of this chapter is to provide a comprehensive survey of the technical
achievements in the area of content-based music summarization and classification and
to present our recent achievements. The next section introduces music representation
and feature extraction. Music summarization and music genre classification are then presented in detail in the following two sections. Semantic region detection in acoustical music
signals is described in the fifth section. Finally, the last section gives the concluding
remarks and discusses future research directions.
MUSIC REPRESENTATION
AND FEATURE EXTRACTION
Feature extraction is the first step of content-based music analysis. There are many
features that can be used to characterize the music signal. Generally speaking, these
features can be divided into three categories: timbral texture features, rhythmic content features and pitch content features.
Amplitude Envelope
The amplitude envelope describes the energy change of the signal in the time domain and is generally equivalent to the so-called ADSR (attack, decay, sustain and release) of a music piece. The envelope is computed with a frame-by-frame root mean square (RMS) followed by a third-order Butterworth low-pass filter (Ellis, 1994). RMS is a perceptually relevant measure that has been shown to correspond closely to the way we hear loudness. The length of the RMS frame determines the time resolution of the
envelope: a large frame length yields less transient information, while a small frame length captures more transient energy.
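A minimal sketch of the frame-by-frame RMS envelope follows; the Butterworth smoothing stage mentioned above is omitted, and the frame length and test tone are arbitrary choices of this sketch.

```python
import math

def rms_envelope(signal, frame_len):
    """Frame-by-frame RMS: a crude amplitude envelope. (The chapter also
    smooths the result with a third-order Butterworth low-pass filter,
    which is omitted here.)"""
    env = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        env.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return env

# A decaying 440 Hz tone sampled at 8 kHz: the envelope should fall,
# roughly tracing the "release" part of an ADSR shape.
sr = 8000
tone = [math.exp(-3.0 * n / sr) * math.sin(2 * math.pi * 440 * n / sr)
        for n in range(sr)]
env = rms_envelope(tone, frame_len=400)   # 50 ms frames -> 20 values
```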
Spectral Power
For a music signal s(n), each frame is weighted with a Hanning window h(n):

h(n) = (1/2) (1 − cos(2πn/N))    (1)

where N is the number of samples in each frame. The spectral power of the signal s(n) is calculated as

S(k) = 10 log10 [ (1/N) | Σ_{n=0}^{N−1} s(n) h(n) exp(−j2πnk/N) |² ]    (2)
Spectral Centroid
The spectral centroid (Tzanetakis & Cook, 2002) is defined as the centre of gravity of the STFT magnitude spectrum:

C_t = Σ_{n=1}^{N} n · M_t(n) / Σ_{n=1}^{N} M_t(n)    (3)

where M_t(n) is the magnitude of the Fast Fourier Transform (FFT) at the t-th frame and frequency bin n. The spectral centroid is a measure of spectral shape; higher centroid values correspond to brighter textures with more high frequencies.
Spectrum Rolloff
Spectrum Rolloff (Tzanetakis & Cook, 2002) is the frequency below which 85% of
spectrum distribution is concentrated. It is also a measure of the spectral shape.
Spectrum Flux
Spectrum flux (Tzanetakis & Cook, 2002) is defined as the variation of the spectrum between two adjacent frames:

SF = Σ_f ( N_t(f) − N_{t−1}(f) )²    (4)

where N_t(f) and N_{t−1}(f) are the normalized magnitudes of the FFT at the current frame t and the previous frame t−1, respectively. Spectrum flux is a measure of the amount of local spectral change.
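The centroid, rolloff and flux definitions above can be sketched on toy magnitude spectra. The bins-numbered-from-1 convention and the toy spectra are assumptions of this sketch.

```python
def spectral_centroid(mag):
    # Magnitude-weighted mean frequency bin (bins numbered from 1).
    return sum((n + 1) * m for n, m in enumerate(mag)) / sum(mag)

def spectral_rolloff(mag, fraction=0.85):
    # Smallest bin below which `fraction` of the magnitude lies.
    target, acc = fraction * sum(mag), 0.0
    for n, m in enumerate(mag):
        acc += m
        if acc >= target:
            return n + 1
    return len(mag)

def spectral_flux(mag_t, mag_prev):
    # Summed squared change of the normalized spectra.
    def norm(v):
        s = sum(v)
        return [x / s for x in v]
    a, b = norm(mag_t), norm(mag_prev)
    return sum((x - y) ** 2 for x, y in zip(a, b))

dull = [8, 4, 2, 1, 0.5, 0.25]    # energy concentrated in low bins
bright = list(reversed(dull))      # same energy, shifted to high bins
```

The "bright" spectrum should have both a higher centroid and a higher rolloff bin than the "dull" one, matching the intuition given above.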
Cepstrum
The mel-frequency cepstrum has proven highly effective in automatic speech recognition and in modeling the subjective pitch and frequency content of audio signals. Psychophysical studies have established the mel pitch scale and the critical band, and warping the frequency scale to the mel scale has led to the cepstrum-domain representation.
The cepstrum can be illustrated by means of the Mel-Frequency Cepstral Coefficients (MFCCs), which are computed from the FFT power coefficients (Logan & Chu, 2000). The power coefficients are filtered by a triangular bandpass filter bank consisting of K triangular filters with a constant mel-frequency interval, covering the frequency range 0-4000 Hz. Denoting the output of the k-th filter by S_k (k = 1, 2, ..., K), where K is typically 19 for speech recognition and larger than 19 for music signals (music has a wider spectrum than speech), the MFCCs are calculated as

c_n = sqrt(2/K) Σ_{k=1}^{K} (log S_k) cos[ nπ(k − 0.5)/K ],  n = 1, 2, ..., L    (5)
where L is the order of the cepstrum. Figure 1 illustrates the estimation procedure of
MFCC.
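The cepstral step of this procedure is a DCT of the log filter-bank outputs. The sketch below shifts the index to k = 0..K−1, which is equivalent to the (k − 0.5) form with k = 1..K; the flat filter-bank input is a contrived check, since equal log-energies make every coefficient (n ≥ 1) vanish.

```python
import math

def mfcc_from_filterbank(S, L):
    """DCT of the log filter-bank outputs S_k, giving L cepstral
    coefficients. Index shifted to k = 0..K-1, equivalent to the
    textbook (k - 0.5), k = 1..K form."""
    K = len(S)
    return [math.sqrt(2.0 / K) *
            sum(math.log(S[k]) * math.cos(n * math.pi * (k + 0.5) / K)
                for k in range(K))
            for n in range(1, L + 1)]

# Flat filter-bank output: every log-energy equal, so each cosine basis
# sums to zero and all cepstral coefficients are (numerically) zero.
flat = mfcc_from_filterbank([math.e] * 20, L=5)
```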
Z_s(m) = (1/N) Σ_{n=m−N+1}^{m} ... w(m − n)    (6)
[Figure 1: MFCC estimation procedure: FFT → mel-scale filter bank → sum → log → DCT → MFCC.]
[Figure: estimation of the octave-based spectral contrast feature: FFT → octave-scale filter bank → peak/valley selection → log → K-L transform → spectral contrast.]
The peaks of this function correspond to the main pitches for that short segment of sound and are accumulated into pitch histograms. The pitch content features can then be extracted from the pitch histograms.
MUSIC SUMMARIZATION
The creation of a concise and informative extraction that accurately summarizes
original digital content is extremely important in a large-scale information repository.
Currently, the majority of summaries used commercially are manually produced from the
original content. For example, a movie clip may provide a good preview of the movie.
However, as a large volume of digital content has become publicly available on the
Internet and in other physical storage media during recent years, automatic summarization has become increasingly important and necessary.
There are a number of techniques being proposed and developed to automatically
generate summaries from text (Mani & Maybury, 1999), speech (Hori & Furui, 2000) and
video (Gong et al., 2001). Similar to text, speech and video summarization, music
summarization refers to determining the most common and salient themes of a given music
piece that may be used to represent the music and is readily recognizable by a listener.
Automatic music summarization can be applied to music indexing, content-based music
retrieval and web-based music distribution.
A summarization system for MIDI data has been developed (Kraft et al., 2001). It
uses the repetition nature of MIDI compositions to automatically recognize the main
melody theme segment for a given piece of music. A detection engine converts melody
recognition and music summarization to string processing and provides efficient ways
of retrieval and manipulation. The system recognizes maximal length segments that have
nontrivial repetitions in each track of the MIDI data of music pieces. These segments are
treated as basic units in music composition, and are the candidates for the melody in a
music piece. However, the MIDI format is not sampled audio data (i.e., actual audio sound); instead, it contains synthesizer instructions, or MIDI notes, to reproduce audio. Compared with actual audio, MIDI cannot provide a real playback experience or an unlimited sound palette for instruments and sound effects. On the other hand, MIDI is a structured format, so it is easy to create a summary according to its structure; MIDI summarization therefore has little practical significance. In this section, we focus on music summarization for real-world sound recordings, in both the uncompressed domain (e.g., the WAV format) and the compressed domain (e.g., the MP3 format).
Feature Extraction
The commonly used features for music summarization are timbral texture features, including
It considers the spectral peak, spectral valley and their differences in each subband.
Therefore, it can roughly reflect the relative distribution of harmonic and nonharmonic
components in the spectrum, which complements the weak point of MFCC, that is, MFCC
averages the spectral distribution in each subband and thus loses the relative spectral
information.
The pitch can be estimated using autocorrelation of each frame. Although all the
test data in their experiment are polyphonic, the authors believe that this feature is able
to capture much information for music signals with a leading vocal.
Bartsch and Wakefield (2001) used chroma-based features and the similarity matrix
proposed by Foote for music summarization.
Chai and Vercoe (2003) proposed a dynamic programming method to detect the repetitions of a fixed-length excerpt in a song. First, they segmented the whole song into frames and grouped a fixed number of consecutive frames into excerpts. They then computed the repetition property of each excerpt in the song using dynamic programming. Consecutive excerpts with the same repetitive property were merged into sections, and each section was labeled according to its repetitive relation (i.e., each section was given a symbol such as A, B, etc.). The final summary was generated from the most frequently repeated section.
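A greatly simplified stand-in for this repetition analysis can be sketched as follows; it uses brute-force counting rather than dynamic programming, a 1-D toy feature track, and an arbitrary distance threshold, all assumptions of this sketch.

```python
def excerpt_distance(a, b):
    # Mean absolute difference between two equal-length excerpts.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def most_repeated_excerpt(features, length, threshold=0.1):
    """Return (start, count) for the fixed-length excerpt that repeats
    most often at non-overlapping positions elsewhere in the song."""
    best_start, best_count = 0, -1
    for i in range(len(features) - length + 1):
        count = sum(
            1
            for j in range(len(features) - length + 1)
            if abs(i - j) >= length and
            excerpt_distance(features[i:i + length],
                             features[j:j + length]) < threshold)
        if count > best_count:
            best_start, best_count = i, count
    return best_start, best_count

# Toy 1-D "feature" track with an A B A B A structure: the A excerpt
# (sections that would share the same label) repeats most often.
A, B = [0.0, 0.1, 0.2, 0.1], [0.9, 0.8, 0.9, 1.0]
track = A + B + A + B + A
start, count = most_repeated_excerpt(track, length=4)
```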
The underlying assumption is that an ideal summary of longer duration should contain the ideal summary of shorter duration.
Evaluation
An audio-centric summarization selects representative audio segments from the original video based on certain criteria and concatenates them to compose an audio summary. To enforce synchronization, the visual summary then has to be generated by selecting the image segments corresponding to those audio segments which form the
audio summary. Similarly, the image-centric summarization can be created by selecting
representative image segments from the original video to form a visual summary, and then
taking the corresponding audio segments to form the associated audio summary. For
these types of summarizations, either audio or visual contents of the original video will
be sacrificed in the summaries.
However, music video programs do not have a strong synchronization between
their audio and visual contents. Considering a music video program in which an audio
segment presents a song sung by a singer, the corresponding image segment could be
a close-up shot of the singer sitting in front of a piano, or shots of some related interesting
scenes. The audio content does not directly refer to the corresponding visual content.
Since music video programs do not have strong synchronization between the associated
audio and visual contents, we propose to first create an audio and a visual summary
separately, and then integrate the two summaries with partial alignment. With this
approach, we can maximize the coverage for both audio and visual contents without
sacrificing either of them.
Dv(i, j) = Σ_{e=Y,U,V} Σ_{k=1..n} h_i^e(k) · h_j^e(k)    (7)

where h_i^e and h_j^e are the YUV-level histograms of key frames i and j, respectively.
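Equation (7)'s histogram-product similarity can be computed directly. A minimal sketch, assuming key frames arrive as (H, W, 3) YUV arrays with channel values in [0, 1] and a hypothetical 16-bin quantization:

```python
import numpy as np

def yuv_histograms(frame, bins=16):
    """Per-channel histograms of a key frame given as an (H, W, 3) YUV array."""
    return [np.histogram(frame[..., c], bins=bins, range=(0.0, 1.0))[0]
            for c in range(3)]

def visual_similarity(frame_i, frame_j, bins=16):
    """Equation (7): sum over the Y, U, V channels of the bin-wise products
    of the two key frames' histograms."""
    hi = yuv_histograms(frame_i, bins)
    hj = yuv_histograms(frame_j, bins)
    return sum(float(np.dot(h1, h2)) for h1, h2 in zip(hi, hj))

# Identical 4x4 frames: all 16 pixels per channel fall in one bin,
# so the similarity is 3 channels * 16^2.
frame = np.zeros((4, 4, 3))
print(visual_similarity(frame, frame))  # 768.0
```

Identical frames score highest, since each channel's histogram aligns bin-for-bin with itself.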
The total number of clusters in S varies depending on the internal structure of the
original video. When given a shot cluster set S, the video sequence with the minimum
redundancy measure is the one in which all the shot clusters have a uniform occurrence
probability and an equal time length of 1.5 seconds (Gong et al., 2001). Based on these
criteria, the video summaries were created using the following major steps:
1.
2.
3.
4.
5.
and should maximize the coverage for both music and visual contents of the original
music video without sacrificing audio or visual parts.
Assume that the whole time span Lsum of the video summary is divided by the
alignment into P partitions (required clusters), and the time length of partition i is T i.
Because each image segment forming the visual summary must be at least L min seconds
long (a time slot equals one Lmin duration) as shown in Figure 5, partition i will provide
N_i = T_i / L_min    (8)

time slots. The total number of time slots is

N_total = Σ_{i=1..P} N_i    (9)
For each partition, the music subsummary lasts for three to five seconds, and each shot is 1.5 seconds long. Therefore, the alignment problem can be formally described as follows.
Given:
1. An ordered set of representative shots U = {u1, u2, …, um}, m ≤ n, where n is the total number of clusters in cluster set S.
2. P partitions and N_total time slots.

To extract:
P sets of output shots R = {R1, R2, …, RP} which are the best matches between shot set U and the N_total time slots, where:
P = the number of partitions;
Ri = {r_i1, …, r_ij, …, r_iNi} ⊆ U, i = 1, 2, …, P, and N_i = T_i / L_min;
r_i1, …, r_ij, …, r_iNi are the optimal shots selected from shot set U for the i-th partition.
By proper reformulation, this problem can be converted into the Minimum Spanning
Tree (MST) problem (Dale, 2003). Let G = (V, E) represent an undirected graph with a
Figure 5. Alignment operations on image and music (the audio summary's time span is divided into partitions T1, T2, …, Ti, …, TP-1, TP, and each partition's image track into time slots of length Lmin)
finite set of vertices V and a weighted edge set E. The MST of a graph defines the lowest-weight subset of edges that spans the graph in one connected component. To apply the MST to our alignment problem, we use each vertex to represent a representative shot ui, and an edge eij = (ui, uj) to represent the similarity between shots ui and uj. The similarity here is defined as the combination of time similarity and visual similarity, and we give time similarity a higher weight. The similarity is defined as follows:
e_ij = α · D(i, j) + (1 − α) · T(i, j)    (10)

where α (0 ≤ α ≤ 1) is a weight coefficient, and D(i, j) and T(i, j) represent the normalized visual similarity and the time similarity, respectively.

D(i, j) = Dv(i, j) / max_{p,q} Dv(p, q)    (11)

where Dv(i, j) is the visual similarity calculated from Equation (7); after normalization, D(i, j) lies between 0 and 1.

T(i, j) = 1 / (F_j − L_i)    (12)

where L_i is the index of the last frame in the i-th shot, and F_j is the index of the first frame in the j-th shot. With this equation, the closer two shots are in the time domain, the higher the time similarity value they get. T(i, j) varies from 0 to 1 and reaches 1 when shot j immediately follows shot i, with no other frames between the two shots. In order to give the time similarity high priority, we set α to less than 0.5. Thus we can create a similarity matrix for all shots in the representative shot set U, whose (i, j)-th element is e_ij.
For every partition Ri, we generate an MST based on this similarity matrix.
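The MST construction over the shot-similarity matrix can be sketched with Prim's algorithm. Since an MST minimizes total edge weight while the matrix holds similarities, the sketch converts each similarity e_ij into a cost 1 − e_ij; that conversion is our own assumption, as the chapter does not spell out the weighting.

```python
import numpy as np

def prim_mst(weights):
    """Prim's algorithm on a dense weight matrix; returns the MST edges.
    weights[i][j] is the cost of edge (i, j); lower is better."""
    n = len(weights)
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree:
                    if best is None or weights[u][v] < weights[best[0]][best[1]]:
                        best = (u, v)
        edges.append(best)
        in_tree.append(best[1])
    return edges

# Similarity matrix e_ij for four shots; converting to costs 1 - e_ij makes
# high-similarity pairs become low-weight MST edges (our reading of the text).
e = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
cost = 1.0 - e
np.fill_diagonal(cost, np.inf)  # no self-edges
print(prim_mst(cost))           # [(0, 1), (1, 2), (2, 3)]
```

The tree chains the most similar shots together, which is what the per-partition MST traversal exploits.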
In summary, for creating content-rich audio-visual extraction, we propose the
following alignment operations:
1. Summarize the music track of the music video. The music summary consists of several partitions, each of which lasts for three to five seconds. The total duration of the summary is about 30 seconds.
2. Divide each music partition into several time slots, each of which lasts for 1.5 seconds.
3. For each music partition, find the corresponding image segment as follows:
In the first time slot of the partition, find the corresponding image segment in the time domain. If it exists in the representative shot set U, assign it to the first slot and delete it from shot set U; if not, identify it in shot set S and find the most similar shot in shot set U using the similarity measure defined in Equation (7). Take that shot as the root, apply an MST algorithm to it, find other shots in shot set U, and fill them into the subsequent time slots of this partition.
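The slot-filling procedure above can be sketched roughly as follows. `fill_partition`, the 1.5 s concurrence check, and the greedy most-similar chaining (a stand-in for the full MST traversal) are all illustrative assumptions, not the authors' implementation.

```python
def fill_partition(slot_times, U, S, similarity):
    """Fill one partition's time slots (hypothetical helper).
    slot_times: start times of the partition's slots.
    U: pool of (start_time, shot_id) representative shots, consumed as used.
    S: full shot list, used as a fallback reference.
    similarity(a, b): similarity between two shot ids."""
    filled = []
    # First slot: prefer the shot that co-occurs in time with the music slot.
    t0 = slot_times[0]
    root = next((s for s in U if s[0] <= t0 < s[0] + 1.5), None)
    if root is None:
        # Not in U: locate the concurrent shot in S, then take the most
        # similar shot remaining in U as the root.
        ref = next(s for s in S if s[0] <= t0 < s[0] + 1.5)
        root = max(U, key=lambda s: similarity(ref[1], s[1]))
    U.remove(root)
    filled.append(root)
    # Subsequent slots: greedily chain the most similar remaining shots
    # (standing in for the MST expansion described in the text).
    for _ in slot_times[1:]:
        if not U:
            break
        nxt = max(U, key=lambda s: similarity(filled[-1][1], s[1]))
        U.remove(nxt)
        filled.append(nxt)
    return filled
```

With three shots starting at 0.0, 3.0 and 6.0 s and a similarity that favours nearby shot ids, filling three slots yields the shots in order 0, 1, 2.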
Future Work
We believe that there is still a long way to go in automatically generating transcriptions from acoustic music signals; current techniques are not robust and efficient enough. Thus, analyzing acoustic music signals directly, without transcription, is practically important in music summarization. Future work will focus on making summarization more accurate. To achieve this, on the one hand we need to explore more music features that can characterize the music content; on the other hand, we need to investigate more sophisticated music structure analysis methods to create more accurate and acceptable music summaries. In addition, we will investigate human perception of music more deeply; for example, what makes part of a piece of music sound like a complete phrase, and what makes it memorable or distinguishable.
For music video, besides improving the summarization of the audio part, more sophisticated music/video alignment methods will be developed. Furthermore, other information in music video can be integrated into summary generation. For example, some Karaoke music videos have lyric captions, which can be detected and recognized. These captions, together with visual shots and vocal information, can be used to make better music video summaries.
Prescriptive Approach
Aucouturier and Pachet (2003) defined the prescriptive approach as an automatic
process that involves two steps: frame-based feature extraction followed by machine
learning.
Tzanetakis et al. (2001) cited a study indicating that humans are able to classify
genre after hearing only 250 ms of a music signal. The authors concluded from this that
it should be possible to make classification systems that do not consider music form or
structure. This implied that real-time analysis of genre could be easier to implement than
thought.
The ideas were further developed in Tzanetakis and Cook (2002), where a fully
functional system was described in detail. The authors proposed to use features related
to timbral texture, rhythmic content and pitch content to classify pieces, and the
statistical values (such as the mean and the variance) of these features were then
computed.
Several types of statistical pattern recognition (SPR) classifiers are used to identify
genre based on feature data. SPR classifiers attempt to estimate the probability density
function for the feature vectors of each genre. The Gaussian Mixture Model (GMM)
classifier and K-Nearest Neighbor (KNN) classifier were, respectively, trained to distinguish between 20 music genres and three speech genres by feeding them with feature sets
of a number of representative samples of each genre.
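As a toy illustration of the SPR idea, the sketch below trains one Gaussian mixture model per class (assigning a clip to the class whose GMM gives the higher likelihood) alongside a K-Nearest-Neighbor classifier on the same features. The synthetic data, the scikit-learn API choice, and the two-component mixtures are our own assumptions, not the cited experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Toy feature vectors (e.g. means/variances of timbral features) for two
# made-up "genres" with well-separated cluster centres.
X_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
X_b = rng.normal(loc=3.0, scale=0.5, size=(50, 4))
X = np.vstack([X_a, X_b])
y = np.array([0] * 50 + [1] * 50)

# One GMM per genre estimates that genre's feature density; a new clip is
# assigned to the genre whose GMM gives the higher log-likelihood.
gmms = {g: GaussianMixture(n_components=2, random_state=0).fit(X[y == g])
        for g in (0, 1)}

def gmm_predict(x):
    return max(gmms, key=lambda g: gmms[g].score(x.reshape(1, -1)))

# KNN instead classifies directly from labelled neighbours in feature space.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

test_point = np.full(4, 3.1)          # clearly in genre 1 territory
print(gmm_predict(test_point))        # 1
print(knn.predict([test_point])[0])   # 1
```

The contrast mirrors the text: the GMM models each genre's density explicitly, while KNN defers entirely to local neighbourhoods.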
Pye (2000) used MFCCs as the feature vector. Two statistical classifiers, a GMM and a tree-based vector quantization scheme, were used separately to classify music into six types: blues, easy listening, classic, opera, dance and rock.
Grimaldi (Grimaldi et al., 2003) built a system using a discrete wavelet transform to extract time and frequency features, for a total of 64 time features and 79 frequency
features. This is a greater number of features than Tzanetakis and Cook (2002) used,
although few details were given about the specifics of these features. This work used an
ensemble of binary classifiers to perform the classification operation with each trained
on a pair of genres. The final classification is obtained through a vote of the classifiers.
Tzanetakis, in contrast, used single classifiers that processed all features for all genres.
Xu (Xu et al., 2003) proposed a multilayer classifier based on support vector
machines (SVM) to classify music into four genres of pop, classic, rock and jazz. In order
to discriminate different music genres, a set of music features was developed to
characterize music content of different genres and an SVM learning approach was applied
to build a multilayer classifier. For different layers, different features and support vectors
were employed. In the first layer, the music was classified into pop/classic and rock/jazz
using an SVM to obtain the optimal class boundaries. In the second layer, pop/classic
music was further classified into pop and classic music and rock/jazz music was classified
into rock and jazz music. This multilayer classification method can provide a better
classification result than existing methods.
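The two-layer scheme can be sketched as follows: one SVM splits pop/classic from rock/jazz, and a second-layer SVM per branch separates the final genres. The synthetic 2-D features and the scikit-learn `SVC` defaults are illustrative assumptions; the original work used purpose-built music features and support vectors per layer.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic 2-D features for four "genres": pop, classic, rock, jazz.
centres = {"pop": (0, 0), "classic": (0, 4), "rock": (4, 0), "jazz": (4, 4)}
X, y = [], []
for genre, c in centres.items():
    X.append(rng.normal(loc=c, scale=0.3, size=(40, 2)))
    y += [genre] * 40
X = np.vstack(X)
y = np.array(y)

# Layer 1: pop/classic vs. rock/jazz.
coarse = np.where(np.isin(y, ["pop", "classic"]), "pop/classic", "rock/jazz")
svm1 = SVC().fit(X, coarse)

# Layer 2: one SVM per coarse branch.
svm_pc = SVC().fit(X[coarse == "pop/classic"], y[coarse == "pop/classic"])
svm_rj = SVC().fit(X[coarse == "rock/jazz"], y[coarse == "rock/jazz"])

def classify(x):
    x = np.asarray(x, dtype=float).reshape(1, -1)
    branch = svm1.predict(x)[0]
    return (svm_pc if branch == "pop/classic" else svm_rj).predict(x)[0]

print(classify((0.1, 3.9)))  # classic
print(classify((3.9, 0.2)))  # rock
```

Each layer only has to solve a two-way (or small) decision on well-separated data, which is the appeal of the hierarchical design.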
Emergent Approach
There are two challenges in the prescriptive method (see the preceding section):
how to determine features to characterize the music and how to find an appropriate
pattern recognition method to perform classifications. The more fundamental problem,
however, is to determine the structure of the taxonomy in which music pieces will be
classified. Unfortunately, this is not a trivial problem. Different people may classify the
same piece differently. They may also select genres from entirely different domains or
emphasize different features. There is often an overlap between different genres, and the
boundaries of each genre are not clearly defined. The lack of universally agreed upon
definitions of genres and relationships between them makes it difficult to find appropriate
taxonomies for automatic classification systems.
Pachet and Cazaly (2000) attempted to solve this problem. They observed that the
taxonomies currently used by the music industry were inconsistent and therefore
inappropriate for the purpose of developing a global music database. They suggested
building an entirely new classification system. They emphasized the goals of producing
a taxonomy that was objective, consistent, and independent from other metadata
descriptors and that supported searches by similarity. They suggested a tree-based
system organized by genealogical relationships as an implementation, where only leaves
would contain music examples. Each node would contain its parent genre and the
differences between its own genre and that of its parent.
Although the proposed solution has merits, it also has problems of its own. To begin with, defining an objective classification system is easier said than done, and getting everyone to agree on a standardized system would be far from easy, especially considering that new genres are constantly emerging. Furthermore,
this system did not solve the problem of fuzzy boundaries between genres, nor did it deal
with the problem of multiple parents that could compromise the tree structure.
Since there is, as yet, no good solution to the ambiguity and inconsistency of music genre definitions, Pachet et al. (2001) presented the emergent approach as the best approach for achieving automatic genre classification. Rather than using existing taxonomies, as prescriptive systems do, emergent systems let classifications emerge according to some measure of similarity. The authors suggested some
similarity measurements based on audio signals as well as on cultural similarity gleaned
from the application of data mining techniques to text documents. They proposed the use
of both collaborative filtering to search for similarities in the taste profiles of different
individuals and co-occurrence analysis on the play lists of different radio programs and
track listings of CD compilation albums. Although this emergent system has not been
successfully applied to music, the idea of automatically exploiting text documents to
generate genre profiles is an interesting one.
Future Work
There are two directions for prescriptive approaches that need to be investigated in the future. First, more music features need to be explored, because a better feature set can improve performance dramatically. For example, some music genres use the same instrumentation, which implies that timbre features alone are not good enough to separate them; rhythm features can be used in the future. Existing beat-tracking systems are useful in acquiring rhythmic features. However, many of them provide only an estimate of the main beat and its strength. For the purpose of genre classification, more detailed features such as overall meter, syncopation, use of rubato, recurring rhythmic gestures, and the relative strengths of beats and subbeats are all significant. Furthermore, we can consider segmenting the music clip according to its intrinsic rhythmic structure, which captures the natural structure of music genres better than traditional fixed-length window framing. The second direction is to scale up unsupervised classification to music genre classification. Since supervised machine learning is limited by the inconsistency of the built-in taxonomy, we will explore unsupervised machine learning methods that let a classification emerge from the database. We will also investigate the possibility of combining an unsupervised classification method with a supervised one for music genre classification. For example, the unsupervised method could be employed to initially classify music into broad, strongly different categories, and the supervised method could then be employed to classify finely narrowed subcategories.
This would partially solve the problem of fuzzy boundaries between genres and could
lead to better overall results.
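The proposed combination could look roughly like this: an unsupervised step (here k-means, as one possible choice) discovers the broad categories, then a supervised SVM per discovered cluster separates the subcategories. Everything about the data and model choices below is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# Two broad, well-separated "super-genres", each with two subgenres that
# differ only along the second feature axis.
def subgenre(cx, cy, label):
    return rng.normal((cx, cy), 0.2, (30, 2)), [label] * 30

blocks = [subgenre(0, 0, "sub_a1"), subgenre(0, 1, "sub_a2"),
          subgenre(8, 0, "sub_b1"), subgenre(8, 1, "sub_b2")]
X = np.vstack([b[0] for b in blocks])
y = np.array(sum((b[1] for b in blocks), []))

# Unsupervised step: KMeans discovers the two broad categories.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Supervised step: one SVM per discovered cluster separates subgenres.
fine = {c: SVC().fit(X[km.labels_ == c], y[km.labels_ == c])
        for c in (0, 1)}

def classify(x):
    x = np.asarray(x, dtype=float).reshape(1, -1)
    c = km.predict(x)[0]
    return fine[c].predict(x)[0]

print(classify((8.1, 0.9)))   # sub_b2
print(classify((0.1, 0.05)))  # sub_a1
```

The coarse step never sees the labels, so it sidesteps the taxonomy inconsistency; the fine step only has to resolve a narrow boundary within each cluster.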
The emergent approach is able to extract high-level similarity between titles and
artists and is therefore suitable for unsupervised clustering of songs into meaningful
genre-like categories. These techniques suffer from technical problems, such as labelling
clusters. These issues are currently under investigation.
Singing voice identification requires detecting and extracting the vocal regions in the music. Music source separation requires identifying the vocal and instrumental sections. Removing the voice from the music, for applications such as Karaoke and automatic lyrics generation, also requires detecting the vocal sections in the music signal.
The continuous raw music data can be divided into four preliminary classes: pure
instrumental (PI), pure vocal (PV), instrumental mixed vocal (IMV), and silence (S). The
pure instrumental regions contain signal mixture of many types of musical instruments
such as string type, bowing type, blowing type, percussion type, and so forth. The pure
vocal regions are the vocal lines sung without instrumental music. The IMV regions
contain the mixture of both vocals and instrumental music. Although the silence is not
a common section in popular music, it can be found at the beginning, ending and between
chorus verse transitions in the songs.
The singing voice is the oldest musical instrument and the human auditory
physiology and perceptual apparatus have evolved to a high level of sensitivity to the
human voice. After over three decades of extensive research on speech recognition, the
technology has matured to the level of practical applications. However, speech recognition techniques have limitations when applied to singing voice identification because
speech and singing voice differ significantly in terms of their production and perception
by the human ear (Sundberg, 1987). A singing voice has more dynamic and complicated
characteristics than speech (Saitou et al., 2002). The dynamic range of the fundamental
frequency (F0) contours in a singing voice is wider than that in speech, and F0
fluctuations in singing voices are larger and more rapid than those in speech.
The instrumental signals are broadband and harmonically rich signals compared
with singing voice. The harmonic structures of instrumental music are in the frequency
range up to 15 kHz, whereas singing voice is in the range of below 5 kHz.
Thus it is important to revise speech processing techniques according to structural musical knowledge so that they can be applied to music content analysis tasks such as semantic region detection, where the signal complexity differs across regions.
Semantic region detection is an active topic in content-based music analysis. In the following subsections, we summarize related work in this area and introduce our new approach for semantic region detection.
Related Work
Many of the existing approaches for speech, instrument and singing voice detection and identification are based on speech processing techniques.
Vowels and some consonants ([m], [n], [l]) are voiced, while other consonants ([f], [s], [t]) are unvoiced. For unvoiced sounds, the source is no longer the phonation of the vocal folds but the turbulence caused by air being impeded by the vocal tract. Some consonants ([v], [z]) are mixed sounds (a mixture of voiced and unvoiced) that use both phonation and turbulence to produce the overall sound (Kim, 1999).
Speech is a narrowband (<10 kHz) signal, and voiced and unvoiced regions are distinctive in the spectrogram. Voiced fricatives produce quasi-periodic pulses, so harmonically spaced strong frequencies in the lower frequency band (<1 kHz) can be noticed in the spectrogram. Since unvoiced fricatives are produced by exciting the vocal tract with broadband noise, they appear as a broadband frequency beam in the spectrum.
The analysis of formant structures which are the resonant frequencies of the vocal tract
tube has been one of the key techniques for detecting the voiced/unvoiced regions. Pitch
contour and time domain speech modeling using signal energy and average zero crossing
are some of the other speech features inspected for detecting the speech boundaries.
The basic steps for detecting boundaries in speech signals are shown in Figure 6. The signal is first segmented into short windows of 30~40 ms with 50% overlap, and then features are extracted. The shorter window smooths the shape of lower frequencies in the spectrum and highlights the lower-frequency resonances of the vocal tract (formants). Another reason is that a shorter window can detect dynamic changes of the speech and, with reasonable window overlap, can capture these temporal properties of the signal.
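The segmentation step can be sketched as follows; the Hamming taper and 16 kHz sample rate are illustrative assumptions, since the text only fixes the 30~40 ms window and the 50% overlap.

```python
import numpy as np

def frame_signal(x, sr, win_ms=30, overlap=0.5):
    """Split signal x (sample rate sr) into win_ms-long windows with the
    given fractional overlap, as in typical speech front ends."""
    win = int(sr * win_ms / 1000)
    hop = int(win * (1 - overlap))
    n = 1 + max(0, (len(x) - win) // hop)
    frames = np.stack([x[i * hop: i * hop + win] for i in range(n)])
    return frames * np.hamming(win)  # taper each frame before the FFT

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
frames = frame_signal(x, sr)
print(frames.shape)  # (65, 480): 65 frames of 480 samples (30 ms at 16 kHz)
```

Each row is one analysis window, ready for LPC, Cepstral, or MFCC feature extraction.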
The linear predictive coding (LPC) coefficients calculated from the speech model, stationary signal spectrum representations using Cepstral coefficients, and pitch-sensitive Mel-scaled Cepstral coefficients are some of the features extracted from the short-time windowed signals, and they are modeled with statistical learning methods. Most speech recognition systems employ Hidden Markov Models (HMMs) to detect these boundaries, and HMMs have been found efficient in modeling the dynamically changing speech properties in different regions.
Figure 6. Speech → signal segmentation/windowing → feature extraction → classification/learning → boundary detection
fundamental frequency, and the amplitudes of the harmonic partials. For the best recognition score, a genetic algorithm (GA) was used to find an optimized subset of the main set of 352 features. In his experiments, the accuracy was found to vary from 8% to 84%.
Cosi (Cosi et al., 1994) trained a Self-Organizing Map (SOM) with MFCCs extracted from isolated musical tones of 40 different musical instruments for timbre classification. Martin (1999) trained a Bayesian network with different types of features, such as spectral features, pitch, vibrato, tremolo features and note characteristic features, to recognize nonpercussive musical instruments.
Eronen and Klapuri (2000) proposed a system for musical instrument recognition
using a wide set of features to model the temporal and spectral characteristics of sounds.
Kashino and Murase (1997) compared the classification abilities of a feed-forward
neural network with a K-Nearest Neighbor classifier, both trained with features of the
amplitude envelopes for isolated instrument tones.
Brown (1999) trained a GMM with constant-Q Cepstral coefficients for each
instrument (i.e., oboe, saxophone, flute and clarinet), using approximately one minute of
music data each.
Maddage et al. (2002) extracted spectral power coefficients, ZCR, MFCCs and LPC-derived Cepstral coefficients from electric guitar notes of eight pitch classes (C4 to C5, shown in Table 1) and employed a nonparametric learning technique (the nearest neighbour rule) to classify the musical notes. Over 85% correct note classification was reported on a musical note database with 100 samples of each note.
networks to detect the vocal and nonvocal boundaries in a hierarchical fashion. Similar to other methods, our experiments were based on speech-related features, that is, LPC, LPC-derived Cepstral coefficients, MFCCs, ZCR and Spectral Power (SP). After parameter tuning we reached an accuracy of over 80%. In Maddage et al. (2004a), we measured both the harmonic spacing and the harmonic strength of instrumental and vocal spectra in order to detect the instrumental/vocal boundaries in the music.
For singer identification, a vocal and instrumental model combination method has
been proposed in Maddage et al. (2004b). In that method, vocal and instrumental sections
of the songs were characterised using octave scale Cepstral coefficients (OSCC) and LPC
derived Cepstral coefficients (LPCC), respectively, and two GMMs (one for vocal and
the other for instrumental) were trained to highlight the singer characteristics. The experiments performed on a database of 100 songs indicated that the singer identification could
be improved (by 6%) when the instrumental models were combined with the vocal model.
The previous methods have borrowed mature speech processing ideas, such as fixed-frame-size acoustic signal segmentation (usually a 20~100 ms frame size with 50% overlap), speech processing/coding feature extraction, and statistical learning procedures or linear thresholds for segment classification, to detect the vocal/nonvocal boundaries of the music. Although these methods have achieved up to 80% frame-level accuracy, their performance is limited because musical knowledge has not been effectively exploited in these (mostly bottom-up) methods. We believe that a combination of bottom-up and top-down approaches, combining the strength of low-level features and high-level musical knowledge, can provide a powerful tool to improve system performance. In the following subsections, we investigate how well speech processing techniques cope with the semantic boundary detection task, and we propose a novel approach, considering both signal processing and musical knowledge, to detect semantic regions in acoustical music signals.
Song Structure
Popular music structure often contains Intro, Verse, Chorus, Bridge and Outro (Ten Minute Master, 2003). The intro may be two, four or eight bars long (or longer), or a song may have no intro at all. The intro of a pop song is often a flashback of the chorus. Both verse and chorus are eight to sixteen bars long. Typically the verse is not as strongly melodic as the chorus. However, the verse and chorus of some songs, such as Beatles songs, are equally strong, and most people can hum or sing along with either. Usually the gap between verse and chorus is linked by a bridge, which may be only two or four bars long. There are also instrumental sections in a song; they can be instrumental versions of the chorus or verse, or an entirely different tune with an altogether different set of chords. Silence may act as a bridge between the verse and chorus of a song, but such cases are rare.
Since the sung vocal passages follow the changes in the chord pattern, we can apply
the following knowledge of chords (Goto, 2001) to the timing information of vocal
passages:
1. Chords are more likely to change on beat times than at other positions.
2. Chords are more likely to change on half-note times than at other positions of beat times.
Figure 7. Time alignment of the structure in songs with bars and quarter notes (a 4/4 song: Intro, Verse, Chorus, Verse, Bridge, Chorus, Outro aligned to bar boundaries bar 1, bar 2, …, bar i, …, bar l)
3. Chords are more likely to change at the beginning of a measure than at other positions of half-note times.
A typical song structure and its possible time alignment with bars and quarter notes are shown in Figure 7. The duration of the song can be measured by the number of quarter notes in it, where the quarter-note time length is proportional to the interbeat timing. The Intro, Verse, Chorus, Bridge and Outro usually begin and end on quarter notes, as illustrated in Figure 7.
Thus we detect both the onsets of the musical notes and chord pattern changes to
compute the quarter-note time length with high confidence.
1. Careful analysis of the song structure reveals that the time lengths of the semantic regions (PV, IMV, PI and S) are proportional to the interbeat time interval of the music, which corresponds to the quarter-note length (see the preceding and following sections).
2. The dynamic behaviour of a beat-spaced signal section is quasi-stationary (Sundberg, 1987; Rossing et al., 2002). In other words, musically driven signal properties such as octave spectral spacing and musical harmonic structure change in beat-spaced time steps.
(Diagram: musical audio → rhythm extraction → beat-space segmentation → semantic regions)

As the strong beat constantly alternates with the weak beat, the interbeat interval, which is the temporal difference between two successive beats, would correspond to the temporal length of a quarter note.
In our method, the beat corresponds to the sequence of equally spaced phenomenal impulses that define the tempo of the music (Scheirer, 1998). We assume the meter to be 4/4, this being the most frequent meter of popular songs, and the tempo of the input song to be constrained between 30 and 240 M.M. (Mälzel's Metronome: the number of quarter notes per minute) and almost constant (Scheirer, 1998).
Our proposed rhythm tracking and extraction approach is shown in Figure 9. We
employ a discrete wavelet transform technique to decompose the music signal according
to octave scales. The frequency ranges of the octave scales are detailed in Table 1. The
system detects both onsets of musical notes (positions of the musical notes) and the
chord changes in the music signal. Then, based on musical knowledge, the quarter-note
time length is computed.
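An octave-spaced sub-band split can be illustrated with simple FFT band masks, as a stand-in for the discrete wavelet transform used in the chapter; the FFT masking itself is our simplification.

```python
import numpy as np

def octave_subbands(x, sr, n_bands=8):
    """Split a signal into octave-spaced sub-bands (…, 1-2 kHz, 2-4 kHz, …)
    with FFT band masks - a simple stand-in for the wavelet decomposition."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    top = sr / 2
    edges = [top / (2 ** k) for k in range(n_bands, -1, -1)]  # low to high
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(X * mask, n=len(x)))
    return bands  # bands[0] is the lowest octave, bands[-1] the highest

sr = 8192
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 3000 * t)
bands = octave_subbands(x, sr)
energies = [float(np.sum(b ** 2)) for b in bands]
# The 100 Hz and 3000 Hz components land in different octave sub-bands.
print(int(np.argmax(energies[:4])), int(np.argmax(energies[4:])) + 4)  # 2 7
```

Each octave doubles in bandwidth, so a note and its harmonics spread across sub-bands in a musically meaningful way, which is what makes per-band onset detection effective.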
The onsets are detected by computing both frequency transients and energy transients in the octave-scale decomposed signals, as described in Duxbury et al. (2002). In order to detect both hard and soft onsets, we take the weighted summation of the onsets detected in each sub-band, as shown in Equation (13), where Sb_i(t) is the onset computed in the i-th sub-band at time t and On(t) is the weighted sum of the sub-band onsets at time t:

On(t) = Σ_{i=1..8} w_i · Sb_i(t)    (13)
Figure 9. Rhythm tracking and extraction: the audio music signal is decomposed into sub-bands 1~8; onset detection (frequency transients, transient energy, moving thresholds and autocorrelation), guided by rhythm knowledge, yields the interbeat time information, while peak tracking of the sub-band frequency spectrum against a frequency codebook for musical chords (with a distance measure) yields the chord-change time information for musical chord identification.
Silence Detection
Silence is defined as a segment of imperceptible music, including unnoticeable
noise and very short clicks. We use short-time energy to detect silence. The short-time
energy function of a music signal is defined as
(Figure: sub-band signals, results of autocorrelation (energy), and sixteenth-note-level strength plotted against samples; sample rate = 44100 Hz.)
E_n = (1/N) Σ_m [x(m) · w(n − m)]²    (14)

where x(m) is the discrete-time music signal, n is the time index of the short-time energy, and w(m) is a rectangular window whose length N equals the quarter-note time length, that is,

w(n) = 1 for 0 ≤ n ≤ N − 1, and w(n) = 0 otherwise    (15)

Table 1. Musical note frequencies (Hz) and their placement in the octave-scale sub-bands

Note | C2 to B2 (64~128 Hz) | C3 to B3 (128~256) | C4 to B4 (256~512) | C5 to B5 (512~1024) | C6 to B6 (1024~2048) | C7 to B7 (2048~4096) | C8 to B8 (4096~8192)
C    | 65.406  | 130.813 | 261.626 | 523.251 | 1046.502 | 2093.004 | 4186.008
C#   | 69.296  | 138.591 | 277.183 | 554.365 | 1108.730 | 2217.46  | 4434.92
D    | 73.416  | 146.832 | 293.665 | 587.330 | 1174.659 | 2349.318 | 4698.636
D#   | 77.782  | 155.563 | 311.127 | 622.254 | 1244.508 | 2489.016 | 4978.032
E    | 82.407  | 164.814 | 329.628 | 659.255 | 1318.510 | 2637.02  | 5274.04
F    | 87.307  | 174.614 | 349.228 | 698.456 | 1396.913 | 2793.826 | 5587.652
F#   | 92.499  | 184.997 | 369.994 | 739.989 | 1479.978 | 2959.956 | 5919.912
G    | 97.999  | 195.998 | 391.995 | 783.991 | 1567.982 | 3135.964 | 6271.928
G#   | 103.826 | 207.652 | 415.305 | 830.609 | 1661.219 | 3322.438 | 6644.876
A    | 110.000 | 220.000 | 440.000 | 880.000 | 1760.000 | 3520     | 7040
A#   | 116.541 | 233.082 | 466.164 | 932.328 | 1864.655 | 3729.31  | 7458.62
B    | 123.471 | 246.942 | 493.883 | 987.767 | 1975.533 | 3951.066 | 7902.132

Notes up to B1 occupy the 0~64 Hz range, and all higher octave scales fall in the 8192~22050 Hz range (sub-band 08).
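The short-time-energy silence detector of Equations (14) and (15) can be sketched as below; the 0.5 s quarter-note length, the 1e-4 threshold, and the toy signal are illustrative assumptions.

```python
import numpy as np

def short_time_energy(x, win_len, hop):
    """Equation (14)-style short-time energy with a rectangular window of
    win_len samples (the quarter-note length in the chapter)."""
    n_frames = 1 + max(0, (len(x) - win_len) // hop)
    return np.array([np.mean(x[i * hop: i * hop + win_len] ** 2)
                     for i in range(n_frames)])

sr = 8000
quarter = int(0.5 * sr)           # assume a 0.5 s quarter-note length
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
silence = np.zeros(sr)            # 1 s of silence between two tones
x = np.concatenate([tone, silence, tone])

e = short_time_energy(x, quarter, quarter)
is_silent = e < 1e-4              # low-energy frames are flagged as silence
print(is_silent.tolist())         # [False, False, True, True, False, False]
```

The two middle frames, covering the inserted second of silence, fall below the threshold while the tone frames stay well above it.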
Feature Extraction
We propose a new frequency scaling, called the Octave Scale, to calculate Cepstral coefficients instead of the Mel scale (see the section on timbral texture features). The fundamental frequencies (F0) and the harmonic structures of musical notes follow an octave scale, as shown in Table 1. Sung vocal lines follow the instrumental line, so both pitch and harmonic structure variations are also on an octave scale. In our approach we divide the whole frequency band into eight sub-bands corresponding to the octaves in the music. The frequency ranges of the sub-bands are shown in Figure 11.
The useful range of fundamental frequencies of tones produced by musical instruments is considerably narrower than the audible frequency range. The highest tone of the piano has a frequency of 4186 Hz, and this seems to have evolved as a practical upper limit for fundamental frequencies. We consider the entire audible spectrum in order to accommodate the harmonics (overtones) of the high tones. The range of fundamental frequencies demanded of the voice in classical opera is 80~1200 Hz, which corresponds to the low end of the bass voice and the high end of the soprano voice, respectively.
Linearly placing the maximum number of filters in the bands where the majority of
the singing voice is present would give better resolution of the signal in that range. Thus,
we use 6, 8, 12, 12, 8, 8, 6 and 4 filters in sub-bands 1 to 8, respectively. Then the octave
scale Cepstral coefficients are extracted according to Equation (5).
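The octave-scale Cepstral computation described above can be sketched as follows. This is a minimal illustration, assuming rectangular filters placed linearly inside each sub-band of Figure 11 and the standard log-energy-plus-DCT cepstrum recipe; the exact filter shape and the normalization of Equation (5) are not reproduced here.

```python
import numpy as np

# Sub-band edges from Figure 11 and the filter counts per sub-band from the text.
SUBBAND_EDGES = [0, 128, 256, 512, 1024, 2048, 4096, 8192, 22050]
FILTERS_PER_BAND = [6, 8, 12, 12, 8, 8, 6, 4]  # sub-bands 1 to 8

def octave_scale_cepstrum(frame, sr=44100, n_coeffs=10):
    """Cepstral coefficients on the octave scale: log filter-bank energies
    followed by a DCT, mirroring the usual MFCC recipe but with the Mel
    filter bank replaced by octave sub-band filters."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    log_energies = []
    for (lo, hi), n_filt in zip(zip(SUBBAND_EDGES[:-1], SUBBAND_EDGES[1:]),
                                FILTERS_PER_BAND):
        # Linearly place n_filt rectangular filters inside the sub-band.
        edges = np.linspace(lo, hi, n_filt + 1)
        for f_lo, f_hi in zip(edges[:-1], edges[1:]):
            band = spectrum[(freqs >= f_lo) & (freqs < f_hi)]
            log_energies.append(np.log(band.sum() + 1e-10))
    v = np.array(log_energies)
    # DCT-II of the log filter-bank energies -> cepstral coefficients.
    n = len(v)
    basis = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
    return (v @ basis)[:n_coeffs]

# One 30 ms frame, as in the chapter's comparison experiment.
frame = np.random.default_rng(1).standard_normal(int(0.03 * 44100))
coeffs = octave_scale_cepstrum(frame)
```
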
For comparison, we also extract Cepstral coefficients from the Mel scale. Figure 12 illustrates the deviation of the third Cepstral coefficient derived from both scales for the PV, PI and IMV classes. The frame size is 30 ms without overlap. It can be seen that the standard deviation is lower for the coefficients derived from the Octave scale, which makes them more robust in our application.
Figure 11. Filter-band distribution in octave scale for calculating Cepstral coefficients
[Figure 11 content: sub-band 01: ~B1, C2 to B2 (0~128 Hz); 02: C3 to B3 (128~256); 03: C4 to B4 (256~512); 04: C5 to B5 (512~1024); 05: C6 to B6 (1024~2048); 06: C7 to B7 (2048~4096); 07: C8 to B8 (4096~8192); 08: C9 and above (8192~22050).]
Figure 12. Third Cepstral coefficient derived from the Mel scale (frames 1~1000, black lines) and the Octave scale (frames 1001~2000, ash-colored lines)
[Figure 12 panels: (a) Pure Vocal (PV); panels (b) and (c) show the remaining classes; x-axis: frame number (200~2000).]
Statistical Learning
To find the semantic region boundaries, we use a two-layer hierarchical classification method, as shown in Figure 13. This has been proven to be more efficient than single-layer multiclass classification (Xu et al., 2003; Maddage et al., 2003). In initial experiments (Gao et al., 2003; Maddage et al., 2003), it was found that PV can be effectively separated from PI and IMV. Thus we separate PV from the other classes in the first layer. In the second layer, we then distinguish PI from IMV. However, PV regions in popular music are rare compared with both PI and IMV regions. In our experiments, SVM and GMM are used as the classifiers in layers 1 and 2.
When the classifier is an SVM, layers 1 and 2 are modeled with a parameter-optimized radial basis kernel function (Vapnik, 1998). When it is a GMM, the Expectation-Maximization (EM) algorithm (Bilmes, 1998) is used to estimate the parameters for layers 1 and 2. The orders of the Cepstral coefficients used for both the Mel scale and the Octave scale in layers 1 and 2, for both classifiers (SVM and GMM), and the numbers of Gaussian mixtures in each layer are shown in Table 2.
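The two-layer hierarchy can be sketched with off-the-shelf RBF-kernel SVMs. Scikit-learn stands in here for the chapter's parameter-optimized classifiers, and the synthetic feature vectors are placeholders for the actual Cepstral features.

```python
import numpy as np
from sklearn.svm import SVC

class TwoLayerClassifier:
    """Layer 1 separates PV from {PI, IMV}; layer 2 then separates PI
    from IMV, the harder decision according to the chapter."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Layer 1: binary PV-vs-rest SVM with an RBF kernel.
        self.layer1 = SVC(kernel="rbf").fit(X, y == "PV")
        # Layer 2: PI-vs-IMV SVM trained only on the non-PV frames.
        rest = y != "PV"
        self.layer2 = SVC(kernel="rbf").fit(X[rest], y[rest])
        return self

    def predict(self, X):
        X = np.asarray(X)
        is_pv = self.layer1.predict(X).astype(bool)
        out = np.empty(len(X), dtype=object)
        out[is_pv] = "PV"
        if (~is_pv).any():
            out[~is_pv] = self.layer2.predict(X[~is_pv])
        return out

# Synthetic, well-separated "feature vectors" standing in for the
# quarter-note Cepstral features of the three classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (30, 10)) for m in (0.0, 4.0, 8.0)])
y = ["PV"] * 30 + ["PI"] * 30 + ["IMV"] * 30
pred = TwoLayerClassifier().fit(X, y).predict(X)
```
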
Experimental Results
Our experiments are performed using 15 popular English songs (Westlife: Moments, If I Let You Go, Flying Without Wings, My Love and Fragile Heart; Backstreet Boys: Show Me the Meaning of Being Lonely, Quit Playing Games (With My Heart), All I Have to Give, Shape of My Heart and Drowning; Michael Learns to Rock (MLTR): Paint My Love, 25 Minutes, Breaking My Heart, How Many Hours and Someday) and 5 Sri Lankan songs (Maa Baa la Kale, Mee Aba Wanaye, Erata Akeekaru, Wasanthaye Aga and Sehena Lowak). All music data are sampled from
commercial CDs at a 44.1 kHz sample rate and 16 bits per sample in stereo.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
[Figure 13. Two-layer classification hierarchy: layer 1 separates Pure Vocals (PV) from the rest; layer 2 separates Pure Instrumental (PI) from IMV.]

Table 2. Orders of Cepstral coefficients for layers 1 and 2 (recovered fragment)
             | Layer 1 | Layer 2
Mel scale    |   18    |   24
Octave scale |   10    |   12
We first conduct experiments on our proposed rhythm tracking algorithm to find the quarter-note time intervals. We test on 30 s, 60 s and full-length intervals of each song and note the average quarter-note time length calculated by the rhythm tracking algorithm. Our system obtains an average accuracy of 95% in the number of beats detected, with a 20 ms error margin on the quarter-note time intervals.
The music is then framed into quarter-note-spaced segments, and experiments are conducted for the detection of the class boundaries (PV, PI and IMV) of the music. Twenty songs are used with cross-validation, where 3 and 2 songs of each artist are used for training and testing, respectively, in each turn. We perform three types of experiments:
EXP1: training and testing songs are divided into 30 ms frames with 50% overlap.
EXP2: training songs are divided into 30 ms frames with 50% overlap, and testing songs are framed according to the quarter-note time interval.
EXP3: training and testing songs are framed according to quarter-note time
intervals.
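The three framing strategies differ only in how the frame boundaries are generated. A minimal sketch, assuming a fixed tempo in place of the rhythm-tracking output:

```python
def fixed_frames(n_samples, sr=44100, frame_ms=30, overlap=0.5):
    """EXP1-style framing: fixed 30 ms windows with 50% overlap."""
    size = int(sr * frame_ms / 1000)
    hop = int(size * (1 - overlap))
    return [(s, s + size) for s in range(0, n_samples - size + 1, hop)]

def quarter_note_frames(n_samples, sr=44100, tempo_bpm=120):
    """EXP3-style framing: segments spaced by the quarter-note length.
    A fixed tempo stands in for the rhythm-tracking estimate."""
    size = int(sr * 60 / tempo_bpm)
    return [(s, min(s + size, n_samples)) for s in range(0, n_samples, size)]

frames30 = fixed_frames(44100)       # one second of audio
beats = quarter_note_frames(44100)
```

EXP2 simply mixes the two: training frames come from `fixed_frames` and testing frames from `quarter_note_frames`.
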
Experimental results (in % accuracy) using SVM learning are illustrated in Table 3. The Mel scale is optimized by placing more filter positions at lower frequencies (Deller et al., 2000), because the dominant pitches of vocals and musical instruments lie at lower frequencies (< 4000 Hz) (Sundberg, 1987).
Though the training and testing sets for PV are small (not many songs have sections with only the singing voice), in all experiments (EXP1~3) the classification accuracy of PV is higher than that of the other two classes. However, when vocals are mixed with instruments, finding the vocal boundaries is more difficult than for the other two classes.
Table 3. Experimental results (% accuracy) using SVM learning

      | Mel scale             | Octave scale
      | PV    | PI    | IMV   | PV    | PI    | IMV
EXP1  | 72.35 | 68.98 | 64.57 | 73.76 | 73.18 | 68.05
EXP2  | 67.34 | 65.18 | 64.87 | 75.22 | 74.15 | 73.16
EXP3  | 74.96 | 72.38 | 70.17 | 85.73 | 82.96 | 80.36
The results of EXP1 demonstrate the higher performance of the Octave scale compared with the Mel scale for a 30 ms frame size. In EXP2, a slightly better performance than in EXP1 can be seen for the Octave scale, but not for the Mel scale. This demonstrates that Cepstral coefficients are sensitive to the frame length as well as to the position of the filters in the Mel or Octave scales. EXP3 achieves the best performance among EXP1~EXP3, demonstrating the importance of including musical knowledge in this application. Furthermore, the better results obtained with the Octave scale demonstrate its ability to model music signals better than the Mel scale for this application.
The results obtained for EXP3 with SVM and GMM are compared in Figure 14. Table 2 lists the numbers of Gaussian mixtures that were empirically found to work well for the layer 1 and layer 2 classifications. That the number of Gaussian mixtures in layer 2 is higher than in layer 1 reflects that classifying PI from IMV is more difficult than classifying PV from the rest. It can be seen that SVM performs better than GMM in identifying the region boundaries. We can thus infer that this implementation of SVM, a kernel learning machine with a radial basis kernel function, is more efficient than the GMM method, which uses probabilistic modeling via the EM algorithm.
Figure 14. Comparison between SVM and GMM in EXP3

[Bar chart values (% accuracy): PV: SVM 85.73, GMM 84.96; PI: SVM 82.56, GMM 80.96; IMV: SVM 80.36, GMM 79.96.]
Future Work
In addition to semantic boundary detection in music, the beat-space segmentation platform is useful for music structural analysis, content-based music source separation and automatic lyrics generation.
CONCLUSION
This chapter has reviewed past and current technical achievements in content-based music summarization and classification. We summarized the state of the art in music summarization, music genre classification and semantic region detection in music signals. We also introduced our latest work in compressed-domain music summarization, musical video summarization and semantic boundary detection in acoustic music signals.
Although advances have been achieved in many areas of content-based music summarization and classification, many research issues still need to be explored to make content-based music summarization and classification more applicable in practice. We have also identified future research directions in music summarization, music genre classification and semantic region detection in music signals.
REFERENCES
Assfalg, J., Bertini, M., DelBimbo, A., Nunziati, W., & Pala, P. (2002). Soccer highlights
detection and recognition using HMMs. In Proceedings IEEE International
Conference on Multimedia and Expo, 1 (pp. 825-828), Lausanne, Switzerland.
Aucouturier, J. J., & Pachet, F. (2003). Representing musical genre: A state of the art.
Journal of New Music Research, 32(1), 1-12.
Bartsch, M.A., & Wakefield, G.H. (2001). To catch a chorus: Using chroma-based
representations for audio thumbnailing. In Proceedings Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 15-18).
Berenzweig, A. L., & Ellis, D. P. W. (2001). Locating singing voice segments within music
signals. In Proceedings IEEE Workshop on Applications of Signal processing to
Audio and Acoustics (WASPAA), New Paltz, New York (pp. 119-122).
Bilmes, J. (1998). A gentle tutorial on the EM algorithm and its application to parameter
estimation for Gaussian mixture and hidden Markov models. Technical Report
ICSI-TR-97-021, University of California, Berkeley.
Brown, J. C. (1999, March). Computer identification of musical instruments using pattern recognition with Cepstral coefficients as features. Journal of the Acoustical Society of America, 105(3), 1064-1072.
Chai, W., & Vercoe, B. (2003). Music thumbnailing via structural analysis. In Proceedings
ACM International Conference on Multimedia, Berkeley, California, (pp. 223-226).
Cooper, M., & Foote, J. (2002). Automatic music summarization via similarity analysis.
In Proceedings International Conference on Music Information Retrieval, Paris
(pp. 81-85).
Cosi, P., De Poli, G., & Prandoni, P. (1994). Timbre characterization with mel-cepstrum and
neural nets. In Proceedings International Computer Music Conference (pp. 42-45).
Dale, N. B. (2003). C++ plus data structures (3rd ed.). Boston: Jones and Bartlett.
Deller, J.R., Hansen, J.H.L., & Proakis, J.G. (2000). Discrete-time processing of speech
signals. IEEE Press.
DeMenthon, D., Kobla, V., & Maybury, M.T. (1998). Video summarization by curve
simplification. In Proceedings of the ACM international conference on Multimedia, Bristol, UK (pp. 211-218).
Duxbury, C., Sandler, M., & Davies, M. (2002). A hybrid approach to musical note onset
detection. In Proceedings of the International Conference on Digital Audio
Effects, Hamburg, Germany (pp. 33-28).
Ellis, G. M. (1994). Electronic filter analysis and synthesis. Boston: Artech House.
Eronen, A., & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients
and temporal features. In Proceedings of the International Conference on Acoustic, Speech and Signal Processing, Istanbul ,Turkey (Vol. 2, pp. II753 - II756).
Foote, J., Cooper, M., & Girgensohn, A. (2002). Creating music video using automatic
media analysis. In Proceedings ACM international conference on Multimedia,
Juan-les-Pins, France (pp. 553-560).
Fujinaga, I. (1998). Machine recognition of timbre using steady-state tone of acoustic
musical instruments. In Proceedings International Computer Music Conference
(pp. 207-210).
Gao, S., Maddage, N.C., & Lee, C.H. (2003). A hidden markov model based approach to
musical segmentation and identification. In Proceedings IEEE Pacific-Rim Conference on Multimedia (PCM), Singapore (pp. 1576-1580).
Gong, Y., Liu, X., & Hua, W. (2001). Summarizing video by minimizing visual content
redundancies. In Proceedings IEEE International Conference on Multimedia and
Expo, Tokyo (pp. 788-791).
Goto, M. (2001). An audio-based real-time beat tracking system for music with or without
drum-sounds. Journal of New Music Research, 30(2), 159-171.
Goto, M., & Muraoka, Y. (1994). A beat tracking system for acoustic signals of music.
In Proceedings of the Second ACM International Conference on Multimedia (pp.
365-372).
Grimaldi, M., Kokaram, A., & Cunningham, P. (2003). Classifying music by genre using
a discrete wavelet transform and a round-robin ensemble. Work report. Trinity
College, University of Dublin, Ireland.
Gunsel, B., & Tekalp, A. M. (1998). Content-based video abstraction. In Proceedings
IEEE International Conference on Image Processing, Chicago, Illinois (Vol. 3, pp.
128-132).
Hori, C., & Furui, S. (2000). Improvements in automatic speech summarization and
evaluation methods. In Proceedings International Conference on Spoken Language Processing, Beijing, China (Vol. 4, pp. 326-329).
Jiang, D., Lu, L., Zhang, H., Tao, J., & Cai, L. (2002). Music type classification by spectral
contrast feature. In Proceedings IEEE International Conference on Multimedia
and Expo, Lausanne, Switzerland (Vol. 1, pp. 113-116).
Kashino, K., & Murase, H. (1997). Sound source identification for ensemble music based
on the music stream extraction. In Proceedings International Joint Conference on
Artificial Intelligence, Nagoya, Aichi, Japan (pp. 127-134).
Kim, Y. K. (1999). Structured encoding of the singing voice using prior knowledge of the
musical score. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 47-50).
Kim, Y. K., & Brian, W. (2002). Singer identification in popular music recordings using
voice coding feature. In Proceedings International Symposium of Music Information Retrieval (ISMIR), Paris.
Kraft, R., Lu, Q., & Teng, S. (2001). Method and Apparatus for Music Summarization and
Creation of Audio Summaries. U.S. Patent No. 6,225,546. Washington, DC: U.S.
Patent and Trademark Office.
Logan, B., & Chu, S. (2000). Music summarization using key phrases. In Proceedings
IEEE International Conference on Audio, Speech and Signal Processing, Istanbul,
Turkey (Vol. 2, pp. II749 - II752).
Lu, L., & Zhang, H. (2003). Automated extraction of music snippets. Proceedings ACM
International Conference on Multimedia, Berkeley, California (pp. 140-147).
Maddage, N.C., Wan, K., Xu, C., & Wang, Y. (2004a). Singing voice detection using twice-iterated composite Fourier transform. In Proceedings International Conference on Multimedia and Expo, Taipei, Taiwan.
Maddage, N.C., Xu, C., & Wang, Y. (2003). A SVM-based classification approach to
musical audio. In Proceedings International Symposium of Music Information
Retrieval, Baltimore, Maryland (pp. 243-244).
Maddage, N.C., Xu, C., & Wang, Y. (2004b). Singer identification based on vocal and
instrumental models. In Proceedings International Conference on Pattern Recognition, Cambridge, UK.
Maddage, N.C., Xu, C. S., Lee, C. H., Kankanhalli, M.S., & Tian, Q. (2002). Statistical
analysis of musical instruments. In Proceedings IEEE Pacific-Rim Conference on
Multimedia, Taipei, Taiwan (pp. 581-588).
Mani, I., & Maybury, M.T. (Eds.). (1999). Advances in automatic text summarization.
Boston: MIT Press.
Martin, K. D. (1999). Sound-source recognition: A theory and computational model.
PhD thesis, MIT Media Lab.
Nakamura, Y., & Kanade, T. (1997). Semantic analysis for video contents extraction:
Spotting by association in news video. In Proceedings of ACM International
Multimedia Conference, Seattle, Washington (pp. 393-401).
Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genre. In Proceedings Content-Based Multimedia Information Access Conference, Paris.
Pachet, F., Westermann, G., & Laigre, D. (2001). Musical data mining for EMD. In
Proceedings WedelMusic Conference, Italy.
Patel, N. V., & Sethi, I. K. (1996). Audio characterization for video indexing. In Proceedings SPIE Storage and Retrieval for Still Image and Video Databases IV, San Jose,
California, 2670 (pp. 373-384).
Pfeiffer, S., Lienhart, R., Fischer, S., & Effelsberg, W. (1996). Abstracting digital movies
automatically. Journal of Visual Communication and Image Representation, 7(4),
345-353.
Pye, D. (2000). Content-based methods for the management of digital music. In Proceedings IEEE International Conference on Audio, Speech and Signal Processing,
Istanbul, Turkey (Vol. 4, pp. 2437-2440).
Rabiner, L.R., & Juang, B.H. (1993). Fundamentals of speech recognition. New York:
Prentice Hall.
Rabiner, L.R., & Schafer, R.W. (1978). Digital processing of speech signals. New York:
Prentice Hall.
Rossing, T.D., Moore, F.R., & Wheeler, P.A. (2002). Science of sound (3rd ed.). Boston:
Addison-Wesley.
Saitou, T., Unoki, M., & Akagi, M. (2002). Extraction of f0 dynamic characteristics and
developments of control model in singing voice. In Proceedings of the 8th
International Conference on Auditory Display, Kyoto, Japan (pp. 275-278).
Scheirer, E.D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of
the Acoustical Society of America, 103(1), 588-601.
Shao, X., Xu, C., & Kankanhalli, M.S. (2003). Automatically generating summaries for
musical video. In Proceedings IEEE International Conference on Image Processing, Barcelona, Spain (vol. 3, pp. II547-II550).
Shao, X., Xu, C., Wang, Y., & Kankanhalli, M.S. (2004). Automatic music summarization
in compressed domain. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada.
Sundaram, H., Xie, L., & Chang, S.F. (2002). A utility framework for the automatic
generation of audio-visual skims. In Proceedings ACM International Conference
on Multimedia, Juan-les-Pins, France (pp. 189-198).
Sundberg, J. (1987). The science of the singing voice. Northern Illinois University Press.
Ten Minute Master No 18: Song Structure. (2003). MUSIC TECH Magazine, 62-63.
Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multipitch analysis
model. IEEE Transactions on Speech and Audio Processing, 8(6), 708-716.
Tsai, W.H., Wang, H.M., Rodgers, D., Cheng, S.S., & Yu, H.M. (2003). Blind clustering
of popular music recordings based on singer voice characteristics. In Proceedings
International Symposium of Music Information Retrieval, Baltimore, Maryland
(pp. 167-173).
Tzanetakis, G., & Cook, P. (2000). Sound analysis using MPEG compressed audio. In
Proceedings IEEE International Conference on Acoustics, Speech, and Signal
Processing, Istanbul, Turkey (Vol. 2, pp. II761-II764).
Tzanetakis G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE
Transactions on Speech and Audio Processing, 10(5), 293-302.
Tzanetakis, G., Essl, G., & Cook, P. (2001). Automatic musical genre classification of audio
signals. Proceedings International Symposium on Music Information Retrieval,
Bloomington, Indiana (pp. 205-210).
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.
Wang, Y., & Vilermo, M. (2001). A compressed domain beat detector using MP3 audio
bit streams. In Proceedings ACM International Conference on Multimedia,
Ottawa, Ontario, Canada (pp. 194-202).
Wold, E., Blum, T., Keislar, D., & Wheaton, J. (1996). Content-based classification, search
and retrieval of audio. IEEE Multimedia, 3(3), 27-36.
Xu, C., Maddage, N.C., Shao, X.,Cao, F., & Tian, Q. (2003). Musical genre classification
using support vector machines. In Proceedings IEEE International Conference
on Acoustics, Speech, and Signal Processing, Hong Kong, China (pp. V429V432).
Xu, C., Zhu,Y., & Tian, Q. (2002). Automatic music summarization based on temporal,
spectral and Cepstral features. In Proceedings IEEE International Conference on
Multimedia and Expo, Lausanne, Switzerland (pp. 117-120).
Xu, C.S., Maddage, N.C., & Shao, X. (in press). Automatic music classification and
summarization. IEEE Transactions on Speech and Audio Processing.
Yow, D., Yeo, B.L., Yeung, M., & Liu, G. (1995). Analysis and presentation of soccer
highlights from digital video. In Proceedings of Asian Conference on Computer
Vision, Singapore.
Zhang, T. (2003). Automatic singer identification. In Proceedings International Conference on Multimedia and Expo, Baltimore, Maryland (Vol. 1, pp. 33-36).
Zhang, T., & Kuo, C.C.J. (2001). Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4), 441-457.
Chapter 6
A Multidimensional
Approach for Describing
Video Semantics
Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia
ABSTRACT
INTRODUCTION
With the convergence of Internet and Multimedia technologies, video content
holders have new opportunities to provide novel media products and services, by
repurposing the content and delivering it over the Internet. In order to support such
applications, we need video content models that allow video sequences to be represented and managed at several levels of semantic abstraction. Modeling video content to support semantic retrieval is a hard task, because video semantics means different things to different people. The MPEG-7 community (ISO/IEC, 2001) has spent considerable effort and time coming to grips with ways to describe video semantics at several levels, in order to support a variety of video applications. The task of developing content models that show the relationships across several levels of video content descriptions has been left to application developers. Our aim in this chapter is to provide a framework that can be used to develop video semantics for specific applications, without limiting the modeling to any one domain, genre or application.
Webster's Dictionary defines semantics as the study of relationships between signs and symbols and what they represent. In a way, from the perspective of feature analysis work (MPEG, 2000; Rui, 1999; Gu, 1998; Flickner et al., 1995; Chang et al., 1997; Smith & Chang, 1997), low-level audiovisual features can be considered a subset, or a part, of the visual signs and symbols that convey a meaning.
In this context, audio and video analysis techniques have provided a way to model video
content using some form of constrained semantics, so that video content can be retrieved
at some basic level such as shots. In the larger context of video information systems,
it is now clear that feature analyses alone are not adequate to support video applications.
Consequently, research focus has shifted to analysing videos to identify higher-level
semantic content such as objects and events. More recently, video semantic modeling
has been influenced by film theory or semiotics (Hampapur, 1999; Colombo et al., 2001;
Bryan-Kinns, 2000), where a meaning is conveyed through a relationship of signs and
symbols that are manipulated using editing, lighting, camera movements and other
cinematic techniques. Whichever theory or technology one chooses to follow, it is clear
that we need a video model that allows us to specify relationships between signs and
symbols across video sequences at several levels of interpretation (Srinivasan et al.,
2001).
The focus of this chapter is to present an approach to modeling video content, such
that video semantics can be described incrementally, based on the application and the
video genre. For example, while describing a basketball game, we may wish to describe
the game at several levels: the colour and texture of the players' uniforms, the segments that
had the crowd cheering loudly, the goals scored by a player or a team, a specific movement
of a player and so on. In order to facilitate such descriptions, we have developed a
framework that is generic and not definitive, but still supports the development of
application specific semantics.
The next section provides a background survey of some of these approaches used
to model and represent the semantics associated with video content. In the third section,
we present our Video Metamodel Framework (VIMET) that helps to model video
semantics at different levels of abstraction. It allows users to develop and specify their
own semantics, while simultaneously exploiting results of video analysis techniques. In
the fourth section, we present a data model that implements the VIMET metamodel. In
the fifth section we present an example. Finally, the last section provides some conclusions and future directions.
BACKGROUND
In order to highlight the challenges and issues involved in modeling video
semantics, we have organized video semantic modeling approaches into four broad
categories and cite a few related works under each category.
Figure 1. A multidimensional data model for building semantics (the shaded part represents the temporal (dynamic) component)

[Figure 1 elements. Semantic dimension: audio/video content representation; audio/video features (static, motion-based); single-feature relationships; multi-feature, multi-modal relationships; semantic concepts; domain knowledge. Temporal dimension: structure of features; order of events; spatial relationships; temporal relationships. Granularity: object, frame, shot, scene, clip.]
Static features: Represent features extracted from objects and frames. Shape, color and texture are examples of static features extracted from objects. Similarly, the global color histogram and average colors are examples of features extracted from a frame.
Motion-based features: Represent features extracted from video using motion-based information. An example of such a feature is the motion vector.
Spatial relationships: Two types of spatial relationships are possible for videos: topological and directional. Topological relationships include relations such as contains, covered by and disjoint (Egenhofer & Franzosa, 1991). Directional relationships include relations such as right-of, left-of, above and below (Frank, 1996).
Temporal relationships: These include the most commonly used of Allen's 13 temporal relationships (Allen, 1983): before, meets, overlaps, finishes, starts, contains, equals, during, started by, finished by, overlapped by, met by and after.
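Allen's 13 relations can be computed from interval endpoints with simple comparisons. A sketch, assuming closed intervals (t1, t2) with t1 < t2:

```python
def allen_relation(a, b):
    """Classify two intervals a=(a1,a2), b=(b1,b2) into one of Allen's
    13 temporal relations: seven 'forward' relations plus their
    inverses and equals."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished by"
    if a1 > b1 and a2 < b2:
        return "during"
    if a1 < b1 and a2 > b2:
        return "contains"
    if a1 < b1 and b1 < a2 < b2:
        return "overlaps"
    # Remaining cases are the inverses of the above with a and b swapped.
    inverse = {"before": "after", "meets": "met by", "overlaps": "overlapped by"}
    return inverse[allen_relation(b, a)]
```

Such a predicate is enough to evaluate temporal relationships over the time-interval attribute values of two video objects.
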
Moving further up in the semantic dimension, we use multiple features both in the
audio and the visual domain, and establish (spatial and temporal) relationships across
these multimodal features to model higher-level concepts. In the temporal dimension,
this helps us identify semantic constructs based on structure of features and order of
events occurring in a video.
Perceptual level: This is the level at which visual phenomena become perceptually
meaningful, the level at which distinctions are perceived by the viewer. This is the
level that is concerned with features such as colour, loudness and texture.
Cinematic level: This level is concerned with formal film and video editing
techniques that are incorporated to produce expressive artifacts. For example,
arranging a certain rhythmic pattern of shots to produce a climax, or introducing
voice-over to shift the gaze.
Diegetic level: This refers to the four-dimensional spatiotemporal world posited
by a video image or a sequence of video images, including spatiotemporal
descriptions of objects, actions, or events that occur within that world.
Connotative level: This level of video semantics is the level of metaphorical,
analogical and associative meanings that the objects and events in a video may
have. An example of connotative significance is the use of facial expression to
denote some emotion.
Subtextual level: This is the level of more specialized, hidden and suppressed
meanings of symbols and signifiers that are related to special cultural and social
groups.
The main idea of this metamodel framework is to allow users to develop their own
application models, based on their semantic notion and interpretation, by specifying
objects and relationships of interest at any level of granularity.
Next we describe the data model that implements the ideas presented in the VIMET metamodel. The elements of the data model allow application developers to incrementally develop the video semantics that need to be modeled and represented in the context of the application domain.
Figure 2. A typical video object with two sets of attributes: audio and temporal

[Figure 2: a video object (VO) with an audio attribute set of (name, value) pairs, such as loudness histogram: [1,..,5], and a temporal attribute set, such as duration: 20 sec.]
A typical video object (a video shot) is shown in Figure 2. The diagram shows a shot with the temporal attribute duration and the audio attribute loudness histogram. (To keep the diagram simple, we show only a subset of the possible attributes of a video object.) Each attribute has a value domain and is shown as a name-value pair. For example, the audio attribute set has an audio attribute loudness whose value is given by a loudness histogram. Similarly, the temporal attribute set has an attribute duration whose value is given in seconds.
The applications determine the content of the video objects modeled. For example,
a video object could be a frame in the temporal dimension, with an attribute specifying
its location in the video. In the semantic dimension, the same frame could have visual
features such as colour and texture histograms.
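The video object of Figure 2 can be sketched as a record of attribute sets; the attribute names below are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class VideoObject:
    """A video object as in Figure 2: sets of (name, value) attribute
    pairs grouped by attribute type."""
    audio: dict = field(default_factory=dict)     # e.g. loudness histogram
    visual: dict = field(default_factory=dict)    # e.g. colour/texture histograms
    temporal: dict = field(default_factory=dict)  # e.g. duration in seconds

# A shot with one audio attribute and one temporal attribute, as in Figure 2.
shot = VideoObject(
    audio={"loudness_histogram": [1, 2, 3, 4, 5]},
    temporal={"duration": 20},
)
```

The same structure serves a frame-level object: its temporal set would hold its location in the video, and its visual set the colour and texture histograms.
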
VO Attributes
The attributes of a video object are either intensional or extensional. Extensional
attribute values are text data, drawn from the character domain. The possible sources of
extensional attributes are annotation, transcripts, keywords, textual description, terms
from a thesaurus, and so forth. Extensional attributes fall in the feature category in Figure
1. Intensional attributes have specific value domains where the values are computed
using appropriate feature extraction functions. Where the extensional attributes of the
video objects are semantic in nature, relationships across such objects can be expressed
as association, aggregation, and generalisation as per object-oriented or EER modeling
methodologies. For intensional attributes, however, we need to establish specific
relationships for each attribute type. For example, when we consider the temporal
attribute whose value domain is time-interval, we need to specify temporal relationships and corresponding operations that are valid over time intervals.
We define two sets of value domains for intensional attributes: the first is the numerical domain and the second is the linguistic domain. A linguistic attribute is characterised by the following components:
X = the symbolic name of a linguistic variable (an attribute in our case), such as duration.
LX = the set of linguistic terms that X can take, such as short, long, and very long.
QX = the numeric domain in which X can take values, such as time in seconds.
MX = a semantic function that gives meaning to the linguistic terms by mapping X's values between LX and QX. The function MX depends on the type of attribute and the application domain. We defined one such function in Nepal et al. (2001).
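To make the mapping concrete, a semantic function MX for the duration attribute might assign membership degrees over the linguistic terms. The sketch below is our own illustration with hypothetical breakpoints; the function defined in Nepal et al. (2001) may differ.

```python
def duration_membership(seconds: float) -> dict[str, float]:
    """Hypothetical semantic function M_X for the attribute 'duration':
    maps a numeric value from Q_X (seconds) to membership degrees over
    the linguistic terms in L_X. Breakpoints are illustrative only."""
    terms = {"short": 0.0, "long": 0.0, "very long": 0.0}
    if seconds <= 10:                     # clearly short
        terms["short"] = 1.0
    elif seconds < 30:                    # blend of short and long
        terms["short"] = (30 - seconds) / 20
        terms["long"] = (seconds - 10) / 20
    elif seconds < 120:                   # clearly long
        terms["long"] = 1.0
    else:                                 # blending into very long
        terms["long"] = max(0.0, (300 - seconds) / 180)
        terms["very long"] = min(1.0, (seconds - 120) / 180)
    return terms
```

A 20-second shot, for instance, would be partly "short" and partly "long" under these breakpoints.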
VO Attribute-Level Relationships
Relationships across video objects can be defined at two levels: the object and
attribute levels. We will discuss the object level relationships in a later section. Here
we discuss attribute-level relationships to illustrate the capabilities of the data model.
Each intensional attribute value allows us to define relationships that are valid for that attribute type; examples of such relationships and their value sets are listed below.
(Figure: attribute-level fuzzy mapping. The audio attribute value Loudness: [1, .., 5] maps to the linguistic terms {soft, loud, very soft, very loud}; the temporal attribute value Duration: 20 sec maps to {short, long, medium, very short, very long}.)
Relationship Type                      Relationship Value Set
Visual, based on light intensity       {Brighter, Dimmer, Similar}
Visual, based on contrast measures     {Higher, Lower, Similar}
Visual, based on color                 {Same, Different, Resembles}
Audio, based on sound level            {Louder, Softer, Similar}
Audio, based on pitch                  {Higher, Lower, Similar}
Visual, based on size                  {Bigger, Smaller, Similar}
Temporal, based on duration            {Shorter, Longer, Equal}
The difference between a VO and a VCO lies in the domain of the intensional attributes. The attributes of a VO have numerical domains, whereas the attributes of a VCO have a domain that is a set of linguistic terms, forming a semantic value domain for each attribute. The members of this set are controlled vocabulary terms that reflect the (fuzzy) semantic value for each attribute type. The domain set will be different for audio, visual, temporal, and spatial attributes.
Definition: A typical Video Concept Object (VCO) is a five-tuple
VCO = <Xc, Sc, Tc, Ac, Vc>
where
Xc is a set of textual attributes that define a concept; the attribute values are drawn from the character domain.
Sc represents a set of spatial attributes; the value domain is a set of (fuzzy) terms that describe relative positional and directional relationships.
Tc represents a set of temporal attributes whose values are drawn from a set of fuzzy terms used to describe a time interval or duration.
Ac represents a set of audio attributes (an example is a loudness attribute whose values are drawn from a set of fuzzy terms that describe loudness).
Vc represents a set of visual attributes whose values are drawn from sets of fuzzy terms that describe each particular visual attribute.
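The five-tuple can be sketched as a simple data structure. The field names follow the definition above; the dictionary-based representation and the example values are our own assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class VCO:
    """Video Concept Object <Xc, Sc, Tc, Ac, Vc>.
    Each attribute set maps an attribute name to a value drawn from a
    controlled vocabulary of (fuzzy) terms; dicts are illustrative."""
    Xc: dict[str, str] = field(default_factory=dict)  # textual attributes
    Sc: dict[str, str] = field(default_factory=dict)  # spatial terms, e.g. "right-bottom"
    Tc: dict[str, str] = field(default_factory=dict)  # temporal terms, e.g. "short"
    Ac: dict[str, str] = field(default_factory=dict)  # audio terms, e.g. "very loud"
    Vc: dict[str, str] = field(default_factory=dict)  # visual terms, e.g. "bright"

# A hypothetical concept object for a crowd-cheer segment:
cheer = VCO(Ac={"loudness": "very loud"}, Tc={"duration": "short"})
```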
The relationship between a VO and a VCO is established using fuzzy linguistic
mapping as shown in Figure 6.
In a fuzzy linguistic model, an attribute name-relationship pair is equivalent to a fuzzy variable-value pair. In general, a user can query and retrieve video from a database using any primitive VO/VCO. An example is: retrieve all videos where an object A is at the right-bottom of the frame. In this example, the expressive power of the query is based on a single attribute-value pair and is limited. A more interesting query would be: retrieve all video sequences where the loudness value is very high and the duration of the sequence is short. Such queries would include both VCOs and VOs.
(Figure 6 maps the VO's numeric attribute values to linguistic term sets: Loudness: [1, .., 5] to {soft, loud, very soft, very loud}, and Duration: 20 sec to {short, long, medium, very short, very long}.)
Topological Relations: Topological relations are spatial relations that are invariant under bijective, continuous transformations that also have continuous inverses. Topological equivalence does not necessarily preserve distances and directions. Instead, topological notions include continuity, interior, and boundary, which are defined in terms of neighbourhood relations. The eight topological
relationship operators are given as TR = {Disjoint, Meet, Equal, Inside, Contains,
Covered_By, Covers, Overlap} (Egenhofer & Franzosa, 1991) as shown in Figure
7. These relationship operators are binary.
Relative Positional Relations: The relative positional relationship operators are
given by RP = {Right of, Left of, Above, Below, In front of, Behind}. These
operators are used to express the relationships among VCOs. These relative
positional relationship operators are binary.
Sro = TR ∪ RP
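The two operator sets and their union can be written down directly; the sketch below simply mirrors the sets named in the text.

```python
# Topological relationship operators TR (Egenhofer & Franzosa, 1991)
TR = {"Disjoint", "Meet", "Equal", "Inside", "Contains",
      "Covered_By", "Covers", "Overlap"}

# Relative positional relationship operators RP
RP = {"Right of", "Left of", "Above", "Below", "In front of", "Behind"}

# Spatial relationship operators: Sro = TR union RP
S_ro = TR | RP
```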
Multimodal Operators
Traditionally the approach used for audiovisual content retrieval is based on
similarity measures on the extracted features. In such systems, users need to be familiar
with the underlying features and express their queries in terms of these (low-level)
features. In order to allow users to formulate semantically expressive queries, we define
a set of fuzzy terms that can be used to describe a feature using fuzzy attribute
relationships. For example, when we are dealing with the loudness feature, we use a
set of fuzzy terms such as very loud, loud, soft, and very soft to describe the
relative loudness at the attribute level. This fuzzy measure is described at the VCO level.
A query on VSS, however, can be multimodal in nature and involve many VCOs. In order
to have a generic way to combine multimodal relationships, we use simple fuzzy Boolean
operators. The fuzzy Boolean operator set is given by
FR = {AND, OR, NOT}.
(Note: This set FR corresponds to the set of multimodal relationship operators
defined as Fro in VSS.)
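One common realisation of fuzzy Boolean operators (standard in the fuzzy-logic literature, e.g., Driankov et al., 1993, though the chapter does not fix a particular choice) takes AND as minimum, OR as maximum, and NOT as complement over membership degrees in [0, 1]:

```python
def f_and(a: float, b: float) -> float:
    return min(a, b)          # fuzzy conjunction (minimum t-norm)

def f_or(a: float, b: float) -> float:
    return max(a, b)          # fuzzy disjunction (maximum t-conorm)

def f_not(a: float) -> float:
    return 1.0 - a            # fuzzy complement

# Degree to which a segment is both "loud" (0.8) and "short" (0.6):
degree = f_and(0.8, 0.6)
```

Other t-norms (e.g., product) are equally admissible; the choice affects how strictly multimodal conditions combine.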
An example of a multimodal, multifeature query is shown in Figure 8.
Figure 8. An example of multimodal query using fuzzy and temporal operators on VCOs
(<loudness, loud> AND <duration, short>) BEFORE (<loudness, soft> AND <duration, long>)
(In the figure, temporal operators such as BEFORE and AFTER, and spatial operators such as RIGHT-OF, link VCO terms that the fuzzy linguistic model derives from the VO's attribute values.)
Interpretation Models
In the third section we identified five different levels of cinematic codification and
description that help us interpret the meaning of a video. Lindley and Srinivasan (1998) have demonstrated empirically that these descriptions are meaningful in capturing distinctions in the way images are viewed and interpreted by a nonspecialist audience, and differences between this audience's interpretations and the analytical terms used by filmmakers and film critics.
Modeling the meaning of a video, shot, or sequence requires the description of the
video object at any or all of the interpretation levels described in the third section.
Interpretation Models can be based on one or several of these five levels. A model based
on perceptual visual characteristics is the subject of a large amount of current research
on video content-based retrieval (Ahanger & Little, 1996). Models based on cinematic
constructs incorporate the expressive artifacts such as camera operations (Adam et al.,
2000), lighting schemes and optical effects (Truong et al., 2001). Automated detection
of cinematic features is another area of vigorous current research activity (see Ahanger
& Little). When modeling at the diegetic level, the basic perceptual features of an image are organised into the four-dimensional spatiotemporal world posited by a video image or sequence of video images. This includes the spatiotemporal descriptions of agents,
objects, actions and events that take place within that world. The interpretation model
that we illustrate in the fifth section is a diegetic model based on interpretations at the
perceptual level. Examples of connotative meanings are the emotions connoted by
actions or the expressions on the faces of characters. The subtextual level of interpretation involves representing specialised meanings of symbols and signifiers. For both
the connotative and the subtextual levels, definitive representation of the meaning of
a video is in principle impossible. The most that can be expected is an evolving body
of interpretations.
In the next section we show how a user/author can develop application semantics
by explicitly specifying objects and relationships and interpretation models. The ability
to create different models allows users/authors to contextualise the content to be
modeled.
EXAMPLE
Here we illustrate the process of developing a specific application model to describe
video content at different semantic levels. The process is incremental, where the different
elements of the VIMET framework are used to retrieve goal segments from basketball
videos. Here we have used a top-down approach to modeling: we first developed the interpretation models from an empirical investigation of basketball videos (Nepal et al., 2001), and then developed an object model for the application.
Figure 9 shows an instantiation of the metamodel.
The observation of television broadcasts of basketball games gave some insights
into commonly occurring patterns of events perceived during the course of a basketball
game. We have considered a subset of these as key events that occur repeatedly throughout the video of a game to develop a temporal interpretation model. They were: crowd cheer, scorecard display, and change in players' direction. We then identified the audio-video features that correspond to the manually observed key events. The
(Figure 9 layers the interpretation as follows: video and audio content is described by diegetic features, such as loudness values, motion vectors, and embedded text, at the perceptual and cinematic levels; single-feature relationships identify events such as crowd cheer and scorecard display; and multimodal, multifeature relationships, guided by domain knowledge, combine these events toward the connotative interpretation goal.)
features identified include energy levels of audio signals, embedded text regions, and
change in direction of motion. Since the focus of this chapter is on building up video semantics, we will not elaborate on the feature extraction algorithms in detail. We used the MPEG Maaate audio analysis toolkit (MPEG MAAATE, 2000) to identify high-energy segments from loudness values. Appropriate thresholds were applied to these high-energy segments to capture the event crowd cheer; this mapping expresses a simple domain heuristic (a diegetic and connotative interpretation) for sports video content. Visual feature extraction consists of identifying embedded text (Gu, 1998), which is mapped to scorecard displays in the sports domain. Similarly, camera pan motion (Srinivasan et al., 1997) is extracted and used to capture change in players' direction. The next level of semantics is developed by exploring the temporal order of events such as crowd cheer and scorecard displays. Here we use temporal interpretation models that express relationships over multiple features from multiple modalities.
Interpretation Models
We observed basketball videos and developed five different temporal interpretation models, as follows.
Model I
This model is based on the first key event, crowd cheer. Our observation shows that there is a loud cheer within three seconds of scoring a legal goal. Hence, in this model, the basic diegetic interpretation is that a loud cheer follows every legal goal, and a loud cheer only occurs after a legal goal. The model T1 is represented by

T1: Goal →[3 sec] Crowd cheer
Intuitively, one can see that the converse may not always be true, as there may be
other cases where a loud cheer occurs, for example, when a streaker runs across the field.
Such limitations are addressed to some extent in the subsequent models.
Model II
This model is based on the second key event, scoreboard display. Our observation shows that the scoreboard display is updated after each goal. Our cinematic interpretation in this model is that a scoreboard display appears (usually as embedded text) within 10 seconds of scoring a legal goal. This is represented by the model T2.

T2: Goal →[10 sec] Scoreboard
The limitation of this model is that the converse here may not always be true, that
is, a scoreboard display may not always be preceded by a legal goal.
Model III
This model uses a combination of two key events with a view to address the
limitations of T1 and T2. As pointed out earlier, all crowd cheers and scoreboard displays
may not always indicate a legal goal. Ideally when we classify segments that show a
shooter scoring goals, we need to avoid inclusion of events that do not show a goal, even
though there may be a loud cheer. In this model, this is achieved by temporally combining
the scoreboard display with crowd cheer. Here, our diegetic interpretation is that every
goal is followed by crowd cheer within three seconds, and by a scoreboard display within
seven seconds after the crowd cheer. This discards events that have crowd cheer, but
no scoreboard and events that have scoreboards, but no crowd cheer.
T3: Goal →[3 sec] Crowd cheer →[7 sec] Scoreboard
Model IV
This model addresses the strict constraints imposed in Model III. Our observations show that while the pattern in Model III is valid most of the time, there are cases where field goals are accompanied by loud cheers but no scoreboard display. Similarly, there are cases where goals are followed by scoreboard displays but not crowd cheer, as in the case of free throws. In order to capture such scenarios, we have used a combination of Models I and II and propose Model IV.
T4: T1 ∪ T2

where ∪ denotes the union of the results from models T1 and T2.
Model V
While model IV covers most cases of legal goals, due to the inherent limitations
pointed out in models I and II, model IV could potentially classify segments where there
are no goals. Our observations show that Model IV captures the maximum number of
goals, but it also identifies many nongoal segments. In order to retain the number of goal
segments and still remove the nongoal segments, we introduce the third key event
change in direction of players. In this model, if a crowd cheer appears within 10 seconds
of a change in direction, or a scoreboard appears within 10 seconds of a change in
direction, there is likely to be a goal within 10 seconds of the change in direction. This
is represented as follows.
T5: Goal →[10 sec] Change in direction →[10 sec] Crowd cheer
OR
T5: Goal →[10 sec] Change in direction →[10 sec] Scoreboard
Although in all models the time interval between two key events is hardwired, the main idea is to provide a temporal link between key events from which to build up high-level semantics.
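As an illustration, model T3 can be checked over lists of time-stamped events. The event representation and the function below are our own sketch, using the hardwired intervals from the text (3 and 7 seconds).

```python
def detect_goals_t3(cheers, scoreboards, cheer_window=3.0, board_window=7.0):
    """Sketch of model T3: a goal is inferred when a crowd cheer at
    time c is followed by a scoreboard display within board_window
    seconds; the goal itself is assumed to lie up to cheer_window
    seconds before the cheer. Events are lists of start times (sec)."""
    goals = []
    for c in cheers:
        if any(c <= s <= c + board_window for s in scoreboards):
            goals.append((c - cheer_window, c))  # candidate goal interval
    return goals
```

A cheer at 10 s with a scoreboard at 15 s would thus yield a candidate goal interval of (7.0, 10.0), while an unpaired cheer is discarded.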
Object Model
For the interpretation models described above, we combine video objects characterised by different types of intensional attributes to develop new video objects or video concept objects. An instance of the video data model is shown in Figure 10. For example, we use the loudness feature and a temporal feature to develop a new video concept object called crowd cheer, which we first define as follows.
DEFINE CrowdCheer AS
SELECT V1.Start-time, V1.End-time
FROM VCO V1
WHERE <V1.Loudness, Very-high> AND
<V1.Duration, Short>
Here we use diegetic knowledge about the domain to interpret a segment where a very high loudness value appears for a short duration as a crowd cheer. Similarly, we can define a scorecard as follows.
DEFINE ScoreCard AS
SELECT V2.Start-time, V2.End-time
FROM VCO V2
WHERE <V2.Text-Region, Bottom-Right>
Figure 10. An instance of the data model for an example explained in this section.
(The goal semantics combine the concept objects CrowdCheer and ScoreCard with the temporal operator BEFORE. The interpretation models define CrowdCheer as <Loudness, Loud> AND <Duration, Short> (combining operator AND) and ScoreCard as <Text-Region, Bottom-Right>. A fuzzy linguistic model maps VO1 (Loudness: [1, .., 5]; Start-Time: 20 sec; End-Time: 25 sec; Duration: 5 sec) to VCO1 (Loudness: {soft, loud, ...}; Duration: {short, long, ...}; Start-Time: 20 sec; End-Time: 25 sec), and VO2 (Text-Region: [240, 10, 255, 25]; Start-Time: 30 sec; End-Time: 40 sec) to VCO2 (Text-Region: {top-right, bottom-right, ...}).)
We then use the temporal interpretation model to build a query to develop a video
concept object that represents a goal segment. This is defined as follows:
DEFINE Goal AS
SELECT [10] MIN(V1.Start-time, V2.Start-time),
MAX(V1.End-time, V2.End-time)
FROM CrowdCheer V1, ScoreCard V2
WHERE V1 BEFORE V2
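A literal reading of this query is a temporal join. The sketch below is our own simplified interpretation: it pairs CrowdCheer and ScoreCard segments under the BEFORE relation and returns up to 10 merged intervals.

```python
def goal_segments(cheers, scorecards, limit=10):
    """Sketch of the Goal query: join (start, end) segments where the
    cheer ends before the scorecard starts (BEFORE), returning up to
    `limit` merged intervals ordered by start time. Ranking by fit is
    omitted; a real system would score candidates before truncating."""
    results = []
    for c_start, c_end in cheers:
        for s_start, s_end in scorecards:
            if c_end < s_start:                       # V1 BEFORE V2
                results.append((min(c_start, s_start),
                                max(c_end, s_end)))
    return sorted(results)[:limit]
```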
Table 2. A list of data sets used for our experiments and the results of our observations
Game                              Length    Number     Number    Number of    Number of changes in
                                            of cheers  of goals  scoreboard   direction of players'
                                                                 displays     movements
Australia vs Cuba 1996 (Women)    00:42:00  27         31        49           74
Australia vs USA 1994 (Women)     00:30:00  46         30        52           46
Australia vs Cuba 1996 (Women)    00:14:51  16         16        17           51
Australia vs USA 1997 (Men)       00:09:37  13         16        18           48
Here the query returns the 10 best-fit video sequences that satisfy the query criterion: a goal.
We have implemented this application model for basketball videos. Implementing
such a model necessarily involves developing appropriate interpretation models, which
is a labour-intensive task. The VIMET framework helps this process by facilitating
incremental descriptions of video content.
We now present an evaluation of the interpretation models outlined in the previous
section.
Precision is the ratio of the number of relevant clips retrieved to the total number of clips retrieved:

precision = | relevant retrieved | / | retrieved |
Recall is the ratio of the number of relevant clips retrieved to the total number of
relevant clips. We can achieve ideal recall (100%) by retrieving all clips from the data set,
but the corresponding precision will be poor.
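Computed over sets of retrieved and relevant clip identifiers, the two measures can be sketched as follows; the example sets are hypothetical but chosen so the counts match the T1 results for video C (15 retrieved, 17 relevant, 12 correct).

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Standard IR measures over sets of clip identifiers."""
    hits = len(retrieved & relevant)                  # relevant retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative sets: 15 retrieved, 17 relevant, overlap of 12.
p, r = precision_recall(set(range(15)), set(range(3, 20)))
```

This yields a precision of 12/15 (80%) and a recall of 12/17, consistent with the table below.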
recall = | relevant retrieved | / | relevant |

Table 3. Precision and recall of the interpretation models for videos C and D

Video  Total number of     Algorithm  Total        Correct decision      Precision  Recall
       baskets (relevant)             (retrieved)  (relevant retrieved)  (%)        (%)
C      17                  T1         15           12                    80         70.50
C      17                  T2         20           15                    75         88.23
C      17                  T3         11           11                    100        64.70
C      17                  T4         24           16                    66.66      94.1
C      17                  T5         17           15                    88.23      88.23
D      18                  T1         16           11                    68.75      61.11
D      18                  T2         19           17                    89.47      94.44
D      18                  T3         10           10                    100        55.55
D      18                  T4         25           18                    72.0       100
D      18                  T5         22           16                    72.72      88.88
We evaluated the temporal interpretation models on our example data set. The manually identified legal goals (which include field goals and free throws) for the videos in our data set are shown in column 2 of Table 3. The results of the automatic analysis show that the combination of all three key events performs much better, with high recall and precision values (~88%). In model T1, the model correctly identifies 12 goal segments out of 17 for video C and 11 out of 18 for video D.
However, the total number of crowd cheer segments detected by our crowd cheer detection algorithm is 15 and 16 for videos C and D, respectively. That is, there are a few legal goal segments without a crowd cheer, and vice versa. Further analysis shows that crowd cheers resulting from other interesting events, such as a fast break, a clever steal, or a great check or screen, give false positive results. We observed that most field goals are accompanied by crowd cheer; however, many goals scored from free throws are not. We also observed that in certain cases a lack of supporters for a team among the spectators yields false negative results. In model T2, the
model correctly identifies 15 goals out of 17 in video C and 17 out of 18 in video D. Our
further analysis confirmed that legal goals due to free throws are not often accompanied
by scoreboard displays, particularly when the scoreboard is updated at the end of free
throws rather than after each free throw. Similarly, the feature extraction algorithm used for scoreboards detects not only scoreboards but also other textual displays, such as coach names and the number of fouls committed by a player. Such textual features increase the number of false positive results. We plan to use the heuristics developed
here to improve our scoreboard detection algorithm in the future. The above discussion
is valid for algorithms T3, T4 and T5 as well.
CONCLUDING REMARKS
Emerging new technologies in video delivery such as streaming over the Internet
have made video content a significant component of the information space. Video
content holders now have an opportunity to provide new video products and services
by reusing their video collections. This requires more than content-based analysis
systems.
We need effective ways to model, represent and describe video content at several
semantic levels that are meaningful to users of video content. At the lowest level, content
can be described using low-level features such as colour and texture, and at the highest level the same content can be described using high-level concepts. In this chapter, we
have provided a systematic way of developing content descriptions at several semantic
levels. In the example in the fifth section, we have shown how (audio and visual) feature
extraction techniques can be effectively used together with interpretation models, to
develop higher-level semantic descriptions of content. Although the example is specific
to the sports genre, the VIMET framework and the associated data model provide a platform to develop several descriptions of the same video content, depending on the interpretation level of the annotator. The model thus supports one of the mandates of MPEG-7 by accommodating multiple interpretations in content descriptions.
REFERENCES
Adam, B., Dorai, C., & Venkatesh, S. (2000). Study of shot length and motion as
contributing factors to movie tempo. ACM Multimedia, 353-355.
Ahanger, G., & Little, T.D.C. (1996). A survey of technologies for parsing and indexing
digital video. Journal of Visual Communication and Image Representation, 7(1),
28-43.
Allen, J.F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832-843.
Baral, C., Gonzalez, G., & Son, T. (1998). Conceptual modeling and querying in multimedia
databases. Multimedia Tools and Applications, 7, 37-66.
Bryan-Kinns, N. (2000). VCMF: A framework for video content modeling. Multimedia
Tools and Applications, 10(1), 23-45.
Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., & Zhong, D. (1997). VideoQ: An
automated content based video search system using visual cues. ACM Multimedia, 313-324, Seattle, Washington, November.
Colombo, C., Bimbo, A.D., & Pala, P. (2001). Retrieval of commercials by semantic
content: the semiotic perspective. Multimedia Tools and Applications, 13, 93-118.
Driankov, D., Hellendoorn, H., & Reinfrank, M. (1993). An introduction to fuzzy control.
Springer-Verlag.
Egenhofer, M., & Franzosa, R. (1991). Point-set topological spatial relations. International Journal of Geographical Information Systems, 5(2), 161-174.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Proceedings of the
15th ACM Symposium on Principles of Database Systems (pp. 83-99).
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M.,
Hafner, J., Lee, D., Petkovic, D., & Steele, D. (1995). Query by image and video
content: The QBIC system. Computer, 28(9), 23-32.
Smith, J.R., & Chang, S.-F. (1997). Querying by color regions using the VisualSEEk
content-based visual query system. In M. T. Maybury (Ed.), Intelligent multimedia information retrieval. IJCAI.
Srinivasan, U., Gu, L., Tsui, & Simpson-Young, B. (1997). A data model to support
content-based search on digital video libraries. The Australian Computer Journal,
29(4), 141-147.
Srinivasan, U., Lindley, C., & Simpson-Young, B. (1999). A multi-model framework for video information systems. Database Semantics: Semantic Issues in Multimedia Systems, January (pp. 85-108). Kluwer Academic Publishers.
Srinivasan, U., Nepal, S., & Reynolds, G. (2001). Modelling high level semantics for video
data management. Proceedings of ISIMP 2001, Hong Kong, May (pp. 291-295).
Tansley, R., Dobie, M., Lewis, P., & Hall, W. (1999). MAVIS 2: An architecture for content
and concept based multimedia information exploration. ACM Multimedia, 203.
Truong, B.T., Dorai, C., & Venkatesh, S. (2001). Determining dramatic intensification via
flashing lights in movies. International Conference on Multimedia and Expo,
August 22-25, Tokyo (pp. 61-64).
Yap, K., Simpson-Young, B., & Srinivasan, U. (1996). Enhancing video navigation with
existing alternate representations. First International Conference on Image Databases and Multimedia Search, Amsterdam, August.
Chapter 7
ABSTRACT
The Continuous Media Web project has developed a technology that extends the Web to time-continuously sampled data, enabling seamless searching and surfing with existing Web tools. This chapter discusses requirements for such an extension of the Web,
contrasts existing technologies and presents the Annodex technology, which enables
the creation of Webs of audio and video documents. To encourage uptake, the
specifications of the Annodex technology have been submitted to the IETF for
standardisation and open source software is made available freely. The Annodex
technology permits an integrated means of searching, surfing, and managing a World
Wide Web of textual and media resources.
INTRODUCTION
Nowadays, the main source of information is the World Wide Web. Its HTTP (Fielding et al., 1999), HTML (World Wide Web Consortium, 1999B), and URI (Berners-Lee et al., 1998) standards have enabled a scalable, networked repository of any sort of
information that people care to publish in textual form. Web search engines have enabled
humanity to search for any information on any public Web server around the world. URI
hyperlinks in HTML documents have enabled surfing to related information, giving the
Web its full power. Repositories of information within organisations are also building on
these standards for much of their internal and external information dissemination.
While Web searching and surfing have become a natural way of interacting with textual information and accessing its semantic content, no such thing is possible with media. Media on the Web is cumbersome to use: it is handled as "dark matter" that cannot be searched through Web search engines, and once a media document is accessed, only linear viewing is possible, with no browsing or surfing to other semantically related documents.
Multimedia research in recent years has recognised this issue. One means to enable search on media documents is to automate the extraction of content, store it as index information, and provide search facilities over that index. This has led to extensive research on the automated extraction of metadata from binary media data, aiming to bridge the semantic gap between automatically extracted low-level image, video, and audio features and the high-level semantics that humans perceive when viewing such material (see, e.g., Dimitrova et al., 2002).
It is now possible to create and store a large amount of metadata and semantic content from media documents, be it automatically or manually. But how do we exploit such a massive amount of information in a standard way? What framework can we build to satisfy the human need to search for content in media, to quickly find and access it for review, and to manage and reuse it efficiently?
As the Web is the most commonly used means of information access, we decided to develop a technology for time-continuous documents that enables their seamless integration into the Web's searching and surfing. Our research thus extends the World Wide Web, with its familiar information access infrastructure, to time-continuous media such as audio and video, creating a Continuous Media Web.
Particular aims of our research are:
This chapter presents our Annodex (annotation and indexing) technology, the specifications of which have been published at the IETF (Internet Engineering Task Force) as Internet-Drafts for the purposes of international standardisation. Implementations of the technology are available at http://www.annodex.net/. In the next section we present related work and its shortcomings with respect to our aims. We then explain the main principles to which our research and development work adheres. The subsequent section provides a technical description of the Continuous Media Web (CMWeb) project and thus forms the heart of this book chapter. We round it off with a
view of the research opportunities created by the CMWeb, and conclude the chapter with a summary.
BACKGROUND
The World Wide Web was created by three core technologies (Berners-Lee et al.,
1999): HTML, HTTP, and URIs. They respectively enable:
the markup of textual data, integrated with the data itself, giving it structure,
metadata, and outgoing hyperlinks,
the distribution of Web documents over the Internet, and
the hyperlinking to and into Web documents.
One would expect the many existing standardisation efforts in multimedia to
cover these requirements. However, while the required pieces may exist, they are not
packaged and optimised to address these issues in a way that makes use of the existing Web infrastructure with the least necessary adaptation effort.
Here we look at the three most promising standards: SMIL, MPEG-21, and MPEG-7.
SMIL
The W3C's SMIL (World Wide Web Consortium, 2001), short for Synchronized
Multimedia Integration Language, is an XML markup language used for authoring
interactive multimedia presentations. A SMIL document describes the sequence of media
documents to play back, including conditional playback, loops, and automatically
activated hyperlinks. SMIL has outgoing hyperlinks and elements that can be addressed
inside it using XPath (World Wide Web Consortium, 1999A) and XPointer (World Wide
Web Consortium, 2002).
Features of SMIL cover modules including the following:

1. Media Objects: describes media objects that come in the form of hyperlinks to animations, audio, video, images, streaming text, or text. Restrictions of continuous media objects to temporal subparts (clippings) are possible, and short and long descriptions may be attached to a media object.
2. Metainformation: allows description of SMIL documents and attachment of RDF metadata to any part of the SMIL document.
3. Structure: structures a SMIL document into a head and a body part, where the head part contains information that is not related to the temporal behaviour of the presentation and the body tag acts as a root for the timing tree.
4. Timing and Synchronization: provides for different choreographing of multimedia content through timing and synchronization commands.
5. Time Manipulation: allows manipulation of the time behaviour of a presentation, such as control of the speed or rate of time for an element.
6. Transitions: provides for transitions such as fades and wipes.
7. Scalability: provides for the definition of profiles of SMIL modules that meet the needs of a specific class of client devices.
SMIL is designed for creating interactive multimedia presentations, not for setting
up Webs of media documents. A SMIL document may result in a different experience for
every user and therefore is not a single, temporally addressable time-continuous
document. Thus, addressing temporal offsets does not generally make sense on a SMIL
document.
SMIL documents cannot generally be searched for clips of interest as they don't
typically contain the information required by a Web search engine: SMIL does not focus
on including metadata, annotations, and hyperlinks, and thus does not provide the
information necessary to be crawled and indexed by a search engine.
In addition, SMIL does not integrate the media documents required for its presentation in one single file, but instead references them from within the XML file. All media
data is only referenced, and there is no transport format for a presentation that includes
all the relevant metadata, annotations, and hyperlinks interleaved with the media data to
provide a streamable format. This would not make sense anyway, as some media data that
is referenced in a SMIL file may never be viewed by users if they never activate the
appropriate action. A SMIL interaction's media streams will be transported on connections
separate from the initial SMIL file, requiring the client to perform all the media
synchronization tasks, and proxy caching can happen only on each file separately, not
on the complete interaction.
Note, however, that a single SMIL interaction, if recorded during playback, can
become a single time-continuous media document, which can be treated with our
Annodex technology to enable it to be searched and surfed. This may be interesting for
archiving and digital record-keeping.
MPEG-21
ISO/MPEG's MPEG-21 (Burnett et al., 2003) standard builds an open
framework for multimedia delivery and consumption. It thus focuses on addressing how
to generically describe a set of content documents that belong together from a semantic
point of view, including all the information necessary to provide services on these digital
items. This set of documents is called a Digital Item: a structured representation
in XML of a work, including identification and metadata information.
The representation of a Digital Item may be composed of fourteen types of descriptors defined by the standard.
MPEG-21 further provides for the handling of rights associated with Digital Items,
and for the adaptation of Digital Items to usage environments.
As an example of a Digital Item, consider a music CD album. When it is turned into
a digital item, the album is described in an XML document that contains references to the
cover image, the text on the CD cover, the text on an accompanying brochure, references
to a set of audio files that contain the songs on the CD, ratings of the album, rights
associated with the album, information on the different encoding formats in which the
music can be retrieved, different bitrates that can be supported when downloading etc.
This description supports the handling of a digital CD album as an object: it allows you
to manage it as an entity, describe it with metadata, exchange it with others, and collect
it as an entity.
An MPEG-21 document does not typically describe just one time-continuous
document, but rather several. These descriptions are temporally addressable and
hyperlinks can go into and out of them. Metadata can be attached to the descriptions of
the documents making them searchable and indexable for search engines.
As can be seen, MPEG-21 addresses the problem of how to handle groups of files
rather than focusing on the markup of a single media file, and therefore does not address
how to directly link into time-continuous Web resources themselves. There is an important
difference between linking into and out of descriptions of a time-continuous document
and linking into and out of the time-continuous document itself: integrated handling
provides for cacheability and for direct URI access.
The aims of MPEG-21 are orthogonal to the aims that we pursue. While MPEG-21
enables a better handling of collections of Web resources that belong together in a
semantic way, Annodex enables a more detailed handling of time-continuous Web
resources only. Annodex provides a granularity of access into time-continuous resources that an MPEG-21 Digital Item can exploit in its descriptions of collections of
Annodex and other resources.
MPEG-7
ISO/MPEG's MPEG-7 (Martinez et al., 2002) standard is an open framework for
describing multimedia entities, such as image, video, audio, audiovisual, and multimedia
content. It provides a large set of description schemes to create markup in XML format.
MPEG-7 description schemes provide the following features:

1. Specification of links and locators (such as time, media locators, and referencing description tools).
2. Specification of basic information such as people, places, textual annotations, controlled vocabularies, etc.
3. Specification of the spatio-temporal structure of multimedia content.
4. Specification of audio and visual features of multimedia content.
5. Specification of the semantic structure of multimedia content.
6. Specification of the multimedia content type and format for management.
7. Specification of media production information.
8. Specification of media usage (rights, audience, financial) information.
9. Specification of classifications for multimedia content.
10. Specification of user information (user description, user preferences, usage history).
11. Specification of content entities (still regions; video, audio, and audiovisual segments; multimedia segments; ink content; structured collections).
12. Specification of content abstractions (semantic descriptions, media models, media summaries, media views, media variations).
The main intended use of MPEG-7 is for describing multimedia assets such that they
can be queried or filtered. Just like SMIL and MPEG-21, MPEG-7 descriptions are
regarded as completely independent of the content itself.
An MPEG-7 document is an XML file that contains any sort of meta information
related to a media document. While the temporal structure of a media document can be
represented, this is not the main aim of MPEG-7 and not typically the basis for attaching
annotations and hyperlinks. This is the exact opposite approach to ours where the basis
is the media document and its temporal structure. Much MPEG-7 markup is in fact not
time-related and thus does not describe media content at the granularity we focus on.
Also, MPEG-7 does not attempt to create a temporally interleaved document format that
integrates the markup with the media data.
Again, the aims of MPEG-7 and Annodex are orthogonal. MPEG-7 is a format that
focuses on describing collections of media assets and thus takes a primarily database-driven
approach towards the handling of information, while Annodex comes from a Web document handling background.
THE CHALLENGE
The technical challenge for the development of Annodex (Annodex.net, 2004) was
the creation of a solution to the three issues presented earlier:

1. CMML (Pfeiffer, Parker, & Pang, 2003a), the Continuous Media Markup Language, which is based on XML and provides tags to mark up time-continuous data into sets of annotated temporal clips. CMML draws upon many features of HTML.
2. Annodex (Pfeiffer, Parker, & Pang, 2003b), the binary stream format to store and transmit interleaved CMML and media data.
3. Temporal URIs (Pfeiffer, Parker, & Pang, 2003c), which enable hyperlinking to temporally specified sections of an Annodex resource.
Aside from the above technical requirements, the development of these technologies has been guided by several principles and non-technical requirements. It is important
to understand these constraints, as they have strongly influenced the final form of the
solution.
Hook into existing Web Infrastructure: The Annodex technologies have been
designed to hook straight into the existing Web infrastructure with as few
adaptations as possible. Also, the scalability property of the Web must not be
compromised by the solution. Thus, CMML is very similar to HTML, temporal URI
queries are CGI (Common Gateway Interface) style parameters (NCSA HTTPd
Development Team, 1995), temporal URI fragments are like HTML fragments, and
Annodex streams are designed to be cacheable by Web proxies.
Open Standards: The aim of the CMWeb project is to extend the existing World
Wide Web to time-continuous data such as audio and video and create a more
These principles stem from a desire to make simple standards that can be picked up
and integrated quickly into existing Web infrastructure.
THE SOLUTION
Figures 1 and 2 show screen shots of an Annodex Web browser and an Annodex
search engine. The Annodex browser's main window displays the media data, typical
Web browser buttons and fields (at the top), typical media transport buttons (at the
bottom), and a representative image (also called a keyframe) and hyperlink for the
currently displayed media clip. The story board next to the main window displays the list
of clips that the current resource consists of, enabling direct access to any clip in this
table of contents. The separate window on the top right displays the free-text annotation
stored in the description for the current clip, while the one on the lower right displays
the structured metadata stored for the resource or for the current clip. When crawling this
particular resource, a Web search engine can index all this textual information.
The Annodex search engine displayed in Figure 2 is a standard Web search engine
extended with the ability to crawl and index the markup of Annodex resources. It retrieves
clips that are relevant to the users query and presents ranked search results based on
the relevance of the markup of the clips. The keyframe of the clip and its description are
displayed.
Architecture Overview
As the Continuous Media Web technology has to be interoperable with existing
Web technology, its architecture must be the same as that of the World Wide Web (see Figure
3): a Web client issues a URI request over HTTP to a Web server, which
resolves it and serves the requested resource back to the client. In the case where
the client is a Continuous Media Web browser, the request will be for an Annodex file,
which contains all the relevant markup and media data to display the content to the user.
In the case where the client is a Web crawler (e.g. part of a Web search engine), the client
may add an HTTP Accept request header with a preference for receiving only the
CMML markup and not the media data. This is possible because the CMML markup
represents all the textual content of an Annodex file and is thus a thin representation of
the full media data. In addition it is a bandwidth-friendly means of crawling and indexing
media content, which is very important for scalability of the solution.
Annodex is the format in which media with interspersed CMML markup is transferred over the wire. Analogous to a normal Web server offering a collection of HTML
pages to clients, an Annodex server offers a collection of Annodex files. After a Web
client has issued a URI request for an Annodex resource, the Web server delivers the
Annodex resource, or an appropriate subpart of it according to the URI query parameters.
Annodex files conceptually consist of multiple media streams and one CMML
annotation stream, interleaved in a temporally synchronised way. The annotation stream
may contain several sets of clips that provide alternative markup tracks for the Annodex
file. The media streams may be complementary, such as an audio track with a video track,
or alternative, such as two speech tracks in different languages. Figure 4 shows an
example Annodex file with three media tracks (light coloured bars) and an annotation
track with a header describing the complete file (dark bar at the start) and several
interspersed clips.
One way to author Annodex files is by creating a CMML markup file and encoding
the media data together with the markup based on the authoring instructions found in
the CMML file. Figure 5 displays the principle of the Annodex file creation process: the
header information of the CMML file and the media streams are encoded at the start of
the Annodex file, while the clips and the actual encoded media data are appended
thereafter in a temporally interleaved fashion.
The choice of a binary encapsulation format for Annodex files was one of the
challenges of the CMWeb project. We examined several different encapsulation formats
and came up with a list of requirements:
1. the format had to provide framing for binary media data and XML markup,
2. temporal synchronisation between media data and XML markup was necessary,
3. the format had to provide a temporal track paradigm for interleaving,
4. the format had to have streaming capabilities,
5. for fault tolerance, resynchronisation after a parsing error should be simple,
6. seeking landmarks were necessary to allow random access,
7. the framing information should only yield a small overhead, and
8. the format needed to be simple to allow handling on devices with limited capabilities.
Hierarchical formats like MPEG-4 and QuickTime did not qualify due to requirement
2, which also made it hard to provide for requirements 3 and 4. An XML-based format also did
not qualify because binary data cannot be included in XML tags without encoding it in
base64, which inflates the data size by about 33% and creates unnecessary additional
encoding and decoding steps, thus violating requirements 1 and 7. After some discussion,
we adopted the Ogg encapsulation format (Pfeiffer, 2003) developed by Xiphophorus
(Xiphophorus, 2004). That gave us the additional advantage of having open source
libraries available on all major platforms, greatly simplifying the task of rolling out
format support.
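The base64 overhead is easy to verify: base64 maps every 3 input bytes to 4 output characters, so binary data grows by one third (plus padding when the input length is not a multiple of 3). A small sketch:

```python
import base64

raw = bytes(300)                 # 300 bytes standing in for binary media data
encoded = base64.b64encode(raw)  # base64 maps each 3-byte group to 4 characters

print(len(raw), len(encoded))    # 300 400
print(len(encoded) / len(raw))   # 1.333... -> roughly 33% size overhead
```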
Structured annotations are name-value pairs which can follow a new or existing
metadata annotation scheme such as the Dublin Core (Dublin Core Metadata Initiative,
2003).
The markup of a clip tag contains information on the various clips or fragments of
the media:
Anchor points provide entry points into the media document that a URI can refer
to. Anchor points identify the start time and the name (id) of a clip. This enables
URIs to refer to Annodex clips by name.
URI hyperlinks can be attached to a clip, linking out to any other place a URI can
point to, such as clips in other annodexed media or HTML pages. These are given
by the a (anchor) tag with its href attribute. Furthermore, the a tag contains a textual
annotation of the link, the so-called anchor text (in the example above: "Related video
on Detection of Galaxies"), specifying why the clip is linked to a given URI. Note that
this is similar to the a tag in HTML.
An optional keyframe in the img tag provides a representative image for the clip and
enables display of a story board for Annodex files.
Unstructured textual annotations in the desc tags provide for searchability of
Annodex files. Unstructured annotation is free text that describes the clip itself.
Each clip belongs to a specific set of temporally non-overlapping clips that make
up one track of annotations for a time-continuous data file. The track attribute of a clip
provides this attribution; if it is not specified, the clip belongs to the default track.
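Putting these pieces together, a clip of the galaxies sample might look like the following sketch (reconstructed from details mentioned in the text, namely the clip id findingGalaxies and the anchor text quoted above; the target URL, file names, and metadata values are invented):

```xml
<clip id="findingGalaxies" start="npt:15.6" track="default">
  <!-- structured annotation: a Dublin Core name-value pair -->
  <meta name="DC.Subject" content="astronomy"/>
  <!-- outgoing hyperlink with anchor text, as in HTML -->
  <a href="http://www.annodex.net/galaxy_detection.anx?id=overview">
    Related video on Detection of Galaxies
  </a>
  <!-- optional keyframe used for the story board -->
  <img src="findingGalaxies.jpg"/>
  <!-- unstructured free-text annotation, searchable by Web search engines -->
  <desc>How new galaxies are found with radio telescopes.</desc>
</clip>
```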
Using the above sample CMML file for authoring Annodex, the result will be a
galaxies.anx file of the form given in Figure 6.
Linking to Clips
Clips in Annodex files are identified by their id attribute. Thus, accessing a named
clip in an Annodex (and, for that matter, a CMML) file is achieved with the following CGI-conformant query parameter specification:
id=clip_id
Examples for accessing a clip in the above given sample CMML and Annodex files
are:
http://www.annodex.net/galaxies.cmml?id=findingGalaxies
http://www.annodex.net/galaxies.anx?id=findingGalaxies
On the Annodex server, the CMML and Annodex resources will be pre-processed
as a result of this query before being served out: the file header parts will be retained,
the time basis will be adjusted and the queried clip data will be concatenated at the end
to regain conformant file formats.
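This server-side pre-processing can be sketched in a few lines, assuming a simplified in-memory CMML document (the layout below is a hypothetical reduction of real CMML; a real server would also adjust the time basis and handle the binary Annodex framing):

```python
import xml.etree.ElementTree as ET

def extract_clip(cmml_text, clip_id):
    """Keep the CMML header parts; drop every clip except the queried one."""
    root = ET.fromstring(cmml_text)
    for clip in list(root.findall("clip")):
        if clip.get("id") != clip_id:
            root.remove(clip)
    return ET.tostring(root, encoding="unicode")

cmml = """<cmml>
  <head><title>Hunting for Galaxies</title></head>
  <clip id="opening" start="npt:0"><desc>Opening credits</desc></clip>
  <clip id="findingGalaxies" start="npt:15.6"><desc>Finding galaxies</desc></clip>
</cmml>"""

served = extract_clip(cmml, "findingGalaxies")
print('id="opening"' in served)  # False: only the queried clip survives
```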
Available time schemes are npt for normal play time, different smpte specifications
of the Society of Motion Picture and Television Engineers (SMPTE), and clock for a
Coordinated Universal Time (UTC) specification. For more details see the specification
document (Pfeiffer, Parker, & Pang, 2003c).
Examples for requesting one or several time intervals from the above given sample
CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml?t=85.28
http://www.annodex.net/galaxies.anx?t=npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx?t=smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx?t=clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time
interval covered from that time point onwards until the end of the stream.
The same pre-processing as described above will be necessary on the Annodex
server.
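For the npt scheme, parsing the query value is straightforward; a sketch (npt only; the smpte and clock schemes need their own parsers, and this helper is illustrative, not part of the Annodex specification):

```python
def parse_npt_query(value):
    """Parse an npt temporal query such as 'npt:15.6-85.28,100.2' into
    (start, end) pairs. A bare time point gets an open end (None),
    meaning from that point onwards until the end of the stream."""
    spec = value[4:] if value.startswith("npt:") else value  # npt is the default
    intervals = []
    for part in spec.split(","):
        start, dash, end = part.partition("-")
        intervals.append((float(start), float(end) if dash else None))
    return intervals

print(parse_npt_query("npt:15.6-85.28,100.2"))  # [(15.6, 85.28), (100.2, None)]
print(parse_npt_query("85.28"))                 # [(85.28, None)]
```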
Views on Clips
Restricting the view on an Annodex (or CMML) file to a named clip makes use of
the value of the clip's id attribute in a fragment specification:
#clip_id
Examples for local clip views for the above given sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml#findingGalaxies
http://www.annodex.net/galaxies.anx#findingGalaxies
The Web client that is asked for such a resource will request the complete resource
from the Web server and perform its application-specific operation on the clip only. This
may, for example, result in a sound editor downloading a complete sound file and then
selecting the named clip for further editing. An Annodex browser behaves
analogously to an existing Web browser that receives an HTML page with a fragment offset:
it will fast-forward to the named clip as soon as that clip has been received.
Restricting the view to one or several time intervals uses a temporal fragment specification of the form:

#[time-scheme:]time_interval
Examples for restrictions to one or several time intervals from the above given
sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml#85.28
http://www.annodex.net/galaxies.anx#npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx#smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx#clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time
interval covered from that time point onwards until the end of the stream. The same usage
examples as described above apply in this case, too. Specifying several time segments
may make sense only in specific applications, such as an editor, where an unconnected
selection for editing may result.
FEATURES OF ANNODEX
While developing the Annodex technology, we discovered that the Annodex file
format addresses many challenges of media research that were not part of the original
goals of its development but emerged serendipitously. Some of these are discussed
briefly in this section.
For more details refer to the Annodex format specification document (Pfeiffer,
Parker, and Pang, 2003b).
A standardised multitrack media format is currently non-existent; many applications, amongst them multitrack audio editors, will be able to take advantage of one,
especially since the Annodex format also allows inclusion of arbitrary meta information.
Multitrack Annotations
CMML and Annodex have been designed to provide a means of annotating and
indexing time-continuous data files by structuring their time-line into regions of interest
called clips. Each clip may have structured and unstructured annotations, a hyperlink and
a keyframe. A simple partitioning, however, does not allow for several different, potentially
overlapping subdivisions of the time-line into clips. After considering several
candidate solutions, we decided to adopt a multitrack paradigm for
annotations as well:
every clip of an Annodex or CMML file belongs to one specific annotation track,
clips within one annotation track cannot overlap temporally,
clips on different tracks can overlap temporally as needed,
the attribution of a clip to a track is specified through its track attribute; if it is not
given, the clip is attributed to a default track.
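These track rules are easy to validate mechanically. A sketch using a hypothetical in-memory representation of clips as (track, start, end) tuples:

```python
from collections import defaultdict

def tracks_valid(clips):
    """Check the CMML multitrack rules: clips on different tracks may
    overlap temporally, but clips within one track must not.
    Clips are (track, start, end) tuples; track None means the default track."""
    by_track = defaultdict(list)
    for track, start, end in clips:
        by_track[track or "default"].append((start, end))
    for spans in by_track.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:   # temporal overlap within one track
                return False
    return True

clips = [
    ("default", 0.0, 15.6),
    ("default", 15.6, 85.28),  # touching clips within a track are fine
    ("scenes", 10.0, 60.0),    # overlaps the default track: allowed
]
print(tracks_valid(clips))  # True
```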
Internationalisation Support
CMML and Annodex have also been designed to be language-independent and
provide full internationalisation support. There are two issues to consider for text in
CMML elements: different character sets and different languages.
As CMML is an XML markup language, different character sets are supported
through the encoding attribute of the XML processing instruction, which contains a
file-specific character set (World Wide Web Consortium, 2000). A potentially differing
character set for an imported media file is specified in the contenttype attribute of the
source tag as a parameter to the MIME type.
Any tag or attribute that could contain text in a language different from the rest of
the document may specify its own language. This is only necessary for tags that contain
human-readable text. The language and text direction are specified in the lang and dir attributes.
For presenting search results, an Annodex search engine can use the descriptive content
of the title and desc tags, and the representative keyframe given in the img tag, to
provide a visual overview of the retrieved clip (see Figure 2).
For retrieval of the CMML file encapsulated in an Annodex file from an Annodex
server, HTTP's content-type negotiation is used. The search engine only needs to
include in its HTTP request an Accept header with a higher priority on text/x-cmml than
on application/x-annodex, and a conformant Annodex server will provide the extracted
CMML content for the given Annodex resource.
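Such a request can be sketched with Python's standard library (the resource URL is illustrative; the q value expresses the lower priority of the full Annodex stream):

```python
import urllib.request

# Prefer the extracted CMML markup over the full Annodex stream when
# crawling; a conformant server decides based on this Accept header.
req = urllib.request.Request(
    "http://www.annodex.net/galaxies.anx",  # illustrative resource
    headers={"Accept": "text/x-cmml, application/x-annodex;q=0.1"},
)
print(req.get_header("Accept"))
```

Note that constructing the Request object sets up the headers without performing any network traffic; a crawler would pass it to urllib.request.urlopen.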
RESEARCH OPPORTUNITIES
There are a multitude of open research opportunities related to Annodex, some of
which are mentioned in this section.
Further research is necessary for exploring transcoding of metadata. A multitude
of different markup languages for different kinds of time-continuous data already exist.
CMML is a generic means to provide structured and unstructured annotations on clips
and media files. Many of the existing ways to mark up media may be transcoded into CMML
to utilise the power of Annodex. Transcoding is simple to implement for markup that
is also based on XML, because XSLT (World Wide Web Consortium, 1999C) provides
a good tool for implementing such transformations.
Transcoding of metadata directly leads to the question of interoperability with other
standards. MPEG-7 is such a metadata standard for which it is necessary to explore
transcoding; however, MPEG-7 is more than just textual metadata and there may be more
CONCLUSION
This chapter presented the Annodex technology, which brings the familiar searching and surfing capabilities of the World Wide Web to time-continuously sampled data
(Pfeiffer, Parker, & Schremmer, 2003). At the core of the technology are the Continuous
Media Markup Language CMML, the Annodex stream and file format, and clip- and time-referencing URI hyperlinks. These enable the extension of the Web to a Continuous
Media Web with Annodex browsers, Annodex servers, and Annodex search engines.
Annodex is, however, more powerful, as it also represents a standard multitrack media file
format with multitrack annotations, which can be cached on Web proxies and used in Web
server scripts for dynamic content creation. Therefore, Annodex and CMML present a
Web-integrated means for managing multimedia semantics.
ACKNOWLEDGMENT
The authors gratefully acknowledge the comments, contributions, and proofreading
of Claudia Schremmer, who is making use of the Continuous Media Web technology in
her research on metadata extraction from meeting recordings.
REFERENCES
Annodex.net (2004). Open standards for annotating and indexing networked media.
Retrieved January 2004 from http://www.annodex.net
Berners-Lee, T., Fielding, R., & Masinter, L. (1998, August). Uniform resource identifiers
(URI): Generic syntax. Internet Engineering Task Force, RFC 2396. Retrieved
January 2003 from http://www.ietf.org/rfc/rfc2396.txt
Berners-Lee, T., Fischetti, M. & Dertouzos, M.L. (1999). Weaving the Web: The original
design and ultimate destiny of the World Wide Web by its inventor. San Francisco:
Harper.
Burnett, I., Van de Walle, R., Hill, K., Bormans, J., & Pereira, F. (2003). MPEG-21: Goals
and achievements. IEEE Multimedia Magazine, Oct-Dec, 60-70.
Dimitrova, N., Zhang, H.-J., Shahraray, B., Sezan, I., Huang, T., & Zakhor, A. (2002).
Applications of video-content analysis and retrieval. IEEE Multimedia Magazine,
July-Sept, 42-55.
Dublin Core Metadata Initiative. (2003). The Dublin Core Metadata Element Set, v1.1.
February. Retrieved January 2004 from http://dublincore.org/documents/2003/02/04/dces
dvdforum (2000, September). DVD Primer. Retrieved January 2004 from http://www.dvdforum.org/tech-dvdprimer.htm
Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L., Leach, P., & Berners-Lee, T.
(1999, June). Hypertext Transfer Protocol -- HTTP/1.1. Internet Engineering Task
Force, RFC 2616. Retrieved January 2004 from http://www.ietf.org/rfc/rfc2616.txt
Martinez, J.M., Koenen, R., & Pereira, F. (2002). MPEG-7: The generic multimedia content
description standard. IEEE Multimedia Magazine, April-June, 78-87.
MPEG Industry Forum. (2002, February). MPEG-4 users' frequently asked questions.
Retrieved January 2004 from http://www.mpegif.org/resources/mpeg4userfaq.php
NCSA HTTPd Development Team (1995, June). The Common Gateway Interface (CGI).
Retrieved January 2004 from http://hoohoo.ncsa.uiuc.edu/cgi/
Pfeiffer, S. (2003, May). The Ogg encapsulation format version 0. Internet Engineering
Task Force, RFC 3533. Retrieved January 2004 from http://www.ietf.org/rfc/rfc3533.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003a). The Continuous Media Markup Language
(CMML), Version 2.0 (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-cmml-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003b). The Annodex annotation and indexing format
for time-continuous data files, Version 2.0 (work in progress). Internet Engineering
Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-annodex-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003c). Specifying time intervals in URI queries and
fragments of time-based Web resources (BCP) (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-temporal-fragments-02.txt
Pfeiffer, S., Parker, C., & Schremmer, C. (2003). Annodex: A simple architecture to enable
hyperlinking, search & retrieval of time-continuous data on the Web. Proceedings
5th ACM SIGMM International Workshop on Multimedia Information Retrieval
(MIR), Berkeley, California, November (pp. 87-93).
Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (1996, January). RTP: A
transport protocol for real-time applications. Internet Engineering Task Force, RFC
1889. Retrieved January 2004 from http://www.ietf.org/rfc/rfc1889.txt
Schulzrinne, H., Rao, A., & Lanphier, R. (1998, April). Real Time Streaming Protocol
(RTSP). Internet Engineering Task Force, RFC 2326. Retrieved January 2003 from
http://www.ietf.org/rfc/rfc2326.txt
World Wide Web Consortium (1999A). XML Path Language (XPath). W3C XPath,
November 1999. Retrieved January 2004 from http://www.w3.org/TR/xpath/
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
World Wide Web Consortium (1999B). HTML 4.01 Specification. W3C HTML, December
1999. Retrieved January 2004 from http://www.w3.org/TR/html4/
World Wide Web Consortium (1999C). XSL Transformations (XSLT) Version 1.0. W3C
XSLT, November 1999. Retrieved January 2004 from http://www.w3.org/TR/xslt/
World Wide Web Consortium (2000, October). Extensible Markup Language (XML) 1.0.
W3C XML. Retrieved January 2004 from http://www.w3.org/TR/2000/REC-xml20001006
World Wide Web Consortium (2001, August). Synchronized Multimedia Integration
Language (SMIL 2.0). W3C SMIL. Retrieved January 2004 from http://www.w3.org/
TR/smil20/
World Wide Web Consortium (2002, August). XML Pointer Language (XPointer). W3C
XPointer. Retrieved January 2004 from http://www.w3.org/TR/xptr/
Xiphophorus (2004). Building a new era of Open multimedia. Retrieved January 2004 from
http://www.xiph.org/
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Chapter 8
Management of
Multimedia Semantics
Using MPEG-7
Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA
ABSTRACT
This chapter presents the ISO/IEC MPEG-7 Multimedia Content Description Interface standard from the point of view of managing semantics in the context of multimedia applications. We describe the organisation and structure of the MPEG-7 Multimedia Description Schemes, which are metadata structures for describing and annotating
multimedia content at several levels of granularity and abstraction. As we look at
MPEG-7 semantic descriptions, we realise they provide a rich framework for static
descriptions of content semantics. As content semantics evolves with interaction, the
human user will have to compensate for the absence of detailed semantics that cannot
be specified in advance. We explore the practical aspects of using these descriptions
in the context of different applications and present some pros and cons from the point
of view of managing multimedia semantics.
INTRODUCTION
MPEG-7 is an ISO/IEC Standard that aims at providing a standard way to describe
multimedia content, to enable fast and efficient searching and filtering of audiovisual
content. MPEG-7 has a broad scope to facilitate functions such as indexing, management, filtering, authoring, editing, browsing, navigation, and searching of content descriptions. The purpose of the standard is to describe the content in a machine-readable format
for further processing determined by the application requirements.
Multimedia content can be described in many different ways depending on the
context, the user, the purpose of use and the application domain. In order to address the
description requirement of a wide range of applications, MPEG-7 aims to describe content
at several levels of granularity and abstraction to include description of features,
structure, semantics, models, collections and metadata about the content.
Initial research, which focused on feature extraction techniques, influenced the description
of content at the perceptual feature level. Examples of visual features that can be extracted
using image-processing techniques are colour, shape and texture. Accordingly, there are
several MPEG-7 Descriptors (Ds) to describe visual features. Similarly there are a number
of low-level Descriptors to describe audio content at the level of spectral, parametric and
temporal features of an audio signal. While these Descriptors describe objective
measures of audio and visual features, they are inadequate for describing content at a
higher level of semantics to describe relationships among audio and visual descriptors
within an image or over a video segment. This need is addressed through the construct
called Multimedia Description Scheme (MDS), also referred to simply as Description
Scheme (DS). Description schemes are designed to describe higher-level content
features such as regions, segments, objects and events, as well as metadata about the
content, its usage, and so forth. Accordingly, there are several groups or categories of
MDS tools.
An important factor that needs to be considered while describing audiovisual content is the recognition that humans interpret and describe the meaning of content in ways that go far beyond visual features and the cinematic constructs introduced in films. While such meanings and interpretations cannot be extracted automatically, because they are contextual, they can be described using free text descriptions. MPEG-7 handles this aspect through several description schemes that are based on structured free text descriptions.
As our focus is on management of multimedia semantics, we look at MPEG-7 MDS
constructs from two perspectives: (a) the level of granularity offered while describing
content, and (b) the level of abstraction available to describe multimedia semantics. The
second section provides an overview of the MPEG-7 constructs and how they hang
together. The third section looks at MDS tools to manage multimedia semantics at
multiple levels of granularity and abstraction. The fourth section takes a look at the whole
framework from the perspective of different applications. The last section presents some
discussions and conclusions.
The Description Tools provide a set of Descriptors (D) that define the syntax and
the semantics of each feature, and a library of Description Schemes (DS) that specify the
structure and semantics of the relationships between their components, which may be either Descriptors or other Description Schemes. A description of a piece of audiovisual content
is made up of a number of Ds and DSs determined by the application. The description
tools can be used to create such descriptions which form the basis for search and
retrieval. A Description Definition Language (DDL) is used to create and represent the
descriptions. DDL is based on XML and hence allows the processing of descriptions
in a machine-readable format. Content descriptions created using these tools could be
stored in a variety of ways. The descriptions could be physically located with the content
in the same data stream or the same storage system, allowing efficient storage and
retrieval. However, there could be instances where content and its descriptions may not
be colocated. In such cases, we need effective ways to synchronise the content and its
Descriptions. System tools support multiplexing of description, synchronization issues,
transmission mechanisms, file format, and so forth. Figure 1 (Martínez, 2003) shows the
main MPEG-7 elements and their relationships.
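As a concrete, non-normative sketch of what such an XML description looks like, the following Python fragment uses the standard library's ElementTree to assemble a simplified description of one video segment. The element names (VideoSegment, MediaTime, FreeTextAnnotation) follow the spirit of the MPEG-7 schema but are illustrative here, without the namespaces and attributes a schema-valid instance would require.

```python
import xml.etree.ElementTree as ET

# Build a simplified, MPEG-7-flavoured description of one video segment.
# Element names are illustrative; the real schema defines precise types
# (e.g. VideoSegmentType) with namespaces and additional attributes.
desc = ET.Element("Description")
segment = ET.SubElement(desc, "VideoSegment", id="scene5")

# Temporal localisation of the segment: a start point and a duration.
time = ET.SubElement(segment, "MediaTime")
ET.SubElement(time, "MediaTimePoint").text = "T00:01:30"
ET.SubElement(time, "MediaDuration").text = "PT45S"

# A structured free-text annotation attached to the segment.
annotation = ET.SubElement(segment, "TextAnnotation")
ET.SubElement(annotation, "FreeTextAnnotation").text = "Car chase through the city"

xml_text = ET.tostring(desc, encoding="unicode")
print(xml_text)
```

Because the description is plain XML, any XML-aware tool can store, transmit, or query it independently of the media itself, which is what makes the non-colocated case workable.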
MPEG-7 has a broad scope and aims to address the needs of several types of
applications (Vetro, 2001). MPEG-7 descriptions of content could include
[Figure 1: the main MPEG-7 elements — Descriptors and Description Schemes are structured, instantiated into descriptions, and then encoded and delivered.]

[Figure: overview of the MPEG-7 Multimedia Description Schemes, organised into Basic Elements (schema tools, basic datatypes, basic tools), Content Description (structural and conceptual aspects), Content Management (media, creation and production, usage), Content Organization (collections, models), Navigation and Access (summaries, views, variations), and User Interaction (user preferences, usage history).]
• Conceptual information of the reality captured by the content (example: objects and events, interactions among objects).
• Information about how to browse the content in an efficient way (example: summaries, variations, spatial and frequency subbands).
• Information about collections of objects.
• Information about the interaction of the user with the content (user preferences, usage history).
Basic Elements
Basic elements provide the fundamental constructs in defining MPEG-7 DSs. This
includes basic data types and a set of extended data types, such as vectors and matrices, to
describe the features and structural aspects of the content. The basic elements also
include constructs for linking media files, localising specific segments, describing time
and temporal information, place, individual(s), groups, organizations, and other textual
annotations.
Content Description
MPEG-7 DSs for content description are organised into two categories: DSs for
describing structural aspects, and DSs for describing conceptual aspects of the content.
The structural DSs describe audiovisual content at a structural level organised around
a segment. The Segment DS represents the spatial, temporal or spatiotemporal structure
of an audiovisual segment. The Segment DS can be organised into a hierarchical structure
to produce a table of contents for indexing and searching audiovisual content in a
structured way. The segments can be described at different perceptual levels using
Descriptors for colour, texture, shape, motion, and so on. The conceptual aspects are
described using semantic DS, to describe objects, events and abstract concepts. The
structure DSs and semantic DSs are related by a set of links that relate different semantic
concepts to content structure. The links relate semantic concepts to instances within the
content described by the segments. Many of the content description DSs are linked to
Ds which are, in turn, linked to DSs in a content management group.
Content Management
MPEG-7 DSs for content management include tools to describe information pertaining to creation and production, media coding, storage and file formats, and content
usage.
Creation information provides information related to the creators of the content,
creation locations, dates, other related material, and so forth. These could be textual
annotations or other multimedia content such as an image of a logo. This also includes
information related to the classification of the content from a viewer's point of view.
Media information describes information including location, storage and delivery
formats, compression and coding schemes, and version history based on media profiles.
Usage information describes information related to usage rights, usage record, and
related financial information. While rights management is not handled explicitly, the
Rights DS provides references in the form of unique identifiers to external rights owners
and regulatory authorities.
Content Organisation
The DSs under this category facilitate organising and modeling collections of
audiovisual content descriptions. The Collection DS helps to describe collections at the
level of objects, segments, and events, based on common properties of the elements in
the collection.
User Interaction
This set of DSs describes user preferences and usage history, to facilitate personalization of content access, presentation and consumption.
For more details and the full list of DSs, the reader is referred to the MPEG-7 URL at
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm and Manjunath et al.
(2002).
REPRESENTATION OF
MULTIMEDIA SEMANTICS
In the previous section we described the MPEG-7 constructs and the method of
organising the MDS from a functional perspective, as presented in various official MPEG-7 documents. In this section we look at the Ds and DSs from the perspective of addressing
multimedia semantics and its management. We look at the levels of granularity and
abstraction that MPEG-7 Ds and DSs are able to support. The structural aspects of
content description are meant to describe content at different levels of granularity
ranging from visual descriptors to temporal segments. The semantic DSs are developed
for the purpose of describing content at several abstract levels in free text, but in a structured form.
MPEG-7 deals with content semantics by considering narrative worlds. Since
MPEG-7 targets description of multimedia content, which is mostly narrative in nature,
it is reasonable for it to view the participants, background, context, and all the other
constituents of a narrative as the narrative world. Each narrative world can exist as a
distinct semantic description. The components of the semantic descriptions broadly
consist of entities that inhabit the narrative worlds, their attributes, and their relationships with each other.
Levels of Granularity
Let us consider a video of a play that consists of four acts. Then we can segment
the video temporally into four parts corresponding to the acts. Each act can be further
segmented into scenes. Each scene can be segmented into shots, where a shot is defined as a temporally continuous segment of video captured by a single camera. The shots can
in turn be segmented into frames. Finally, each frame can be segmented into spatial
regions. Note that each level of the hierarchy lends itself to meaningful semantic
description. Each level of granularity lends itself to distinctive Ds. For instance, we
could use the texture descriptor to describe the texture of spatial regions. Such a
description is clearly confined to the lowest level of the hierarchy we just described. The
2-D shape descriptors are similarly confined by definition. Each frame can also be
described using the scalable color descriptor, which is essentially a color histogram. A
shot consisting of several frames, however, has to be described using the group of frames
color descriptor, which aggregates the histograms of all the constituent frames using, for
instance, the median. Note that while it is possible to extend the color description to a
video segment of any length of time, it is most meaningful at the shot level and below.
The MotionActivity descriptor can be used to meaningfully describe any length of video,
since it merely captures the pace of the action in the video. Thus, a talking-head segment would be described as "low action" while a car-chase scene would be described as "high action". A one-hour movie that mostly consists of car chases could reasonably be described as "high action". The motion trajectory descriptor, on the other hand, is
meaningful only at the shot level and meaningless at any lower or higher level. In other
words, each level of granularity has its own set of appropriate descriptors that may or
may not be appropriate at all levels of the hierarchy. The aim of such description is to
enable content retrieval at any desired level of granularity.
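The temporal hierarchy and its level-appropriate descriptors can be sketched in Python as follows. The class and descriptor names are our own illustrative shorthand, not MPEG-7 syntax, and the group-of-frames aggregation simply takes the bin-wise median of toy per-frame histograms, as the text suggests.

```python
import statistics

# A sketch (not MPEG-7 syntax) of the temporal hierarchy described above:
# each level of granularity carries only the descriptors meaningful there.
class Segment:
    def __init__(self, level, descriptors=None, children=None):
        self.level = level                    # e.g. "act", "scene", "shot", "frame"
        self.descriptors = descriptors or {}  # level-appropriate descriptors only
        self.children = children or []

def group_of_frames_color(frames):
    """Aggregate per-frame histograms bin-wise with the median,
    in the spirit of the group-of-frames color descriptor."""
    histograms = [f.descriptors["scalable_color"] for f in frames]
    return [statistics.median(bin_vals) for bin_vals in zip(*histograms)]

# Three frames of one shot, each with a (toy) 4-bin color histogram.
frames = [Segment("frame", {"scalable_color": h})
          for h in ([2, 5, 1, 0], [3, 4, 1, 0], [2, 6, 2, 1])]
shot = Segment("shot", {"motion_activity": "high"}, frames)
shot.descriptors["gof_color"] = group_of_frames_color(frames)
print(shot.descriptors["gof_color"])   # bin-wise median -> [2, 5, 1, 0]
```

Descriptors such as motion activity attach at the shot (or higher) level, while per-frame histograms stay at the frame level, which mirrors the point that each granularity has its own appropriate descriptors.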
Levels of Abstraction
Note that in the previous section, the hierarchy stemmed from the temporal and
spatial segmentation, but not from any conceptual point of view. Therefore such a
description does not let us browse the content at varying levels of semantic abstraction
that may exist at a given constant level of temporal granularity. For instance, we may be
only interested in dramatic dialogues between character A and character B in one case,
and in any interactions between character A and character B in another. Note that the
former is an instance of the latter and therefore is at a lower level of abstraction. In the
absence of multilayered abstraction, our content browsing would have to be either
excessively general through restriction to the highest level of abstraction, or excessively
particular through restriction to the lowest level of abstraction. Note that to a human
being, the definition of "too general" and "too specific" depends completely on the need
of the moment, and therefore is subject to wide variation. Any useful representation of
the content semantics has to therefore be at as many levels of abstraction as possible.
Returning to the example of interactions between the characters A and B, we can
see that the semantics consists of the entities A and B, with their names being their
attributes and whose relationship with each other consists of the various interactions
they have with each other. MPEG-7 considers two types of abstraction. The first is media
abstraction, that is, a description that can describe more than one instance of similar
content. We can see that the description "all interactions between characters A and B" is an example of media abstraction since it describes all instances of media in which A
and B interact. The second type of abstraction is formal abstraction, in which the pattern
common to a set of multimedia examples contains placeholders. The description "interaction between any two of the characters in the play" is an example of such formal
abstraction. Since the definition of similarity depends on the level of detail of the
description and the application, we can see that these two forms of abstraction allow us
to accommodate a wide range of abstraction from the highly abstract to the highly
concrete and detailed.
Furthermore, MPEG-7 also provides ways to describe abstract quantities such as
properties, through the Property element, and concepts, through the Concept DS. Such
quantities do not result from an abstraction of an entity, and so are treated separately.
For instance, the beauty of a painting is a property and is not the result of somehow
generalizing its constituents. Concepts are defined as collections of properties that
define a category of entities but do not completely characterize it.
Semantic entities in MPEG-7 mostly consist of narrative worlds, objects, events,
concepts, states, places and times. The objects and events are represented by the Object
and Event DSs respectively. The Object DS and Event DS provide abstraction through
a recursive definition that allows, for example, subcategorization of objects into subobjects.
In that way, an object can be represented at multiple levels of abstraction. For instance,
a continent could be broken down into continent-country-state-district, and so forth, so
that it can be described at varying levels of semantic granularity. Note that the Object
DS accommodates attributes so as to allow for the abstraction we mentioned earlier, that
is, abstraction that is related to properties rather than generalization of constituents such
as districts. The hospitable nature of a continent's inhabitants, for instance, cannot
result from abstraction of districts to states to countries, and so forth.
Semantic entities can be described by labels, by a textual definition, or in terms of
properties or of features of the media or segments in which they occur. The SemanticBase
DS contains such descriptive elements. The AbstractionLevel data type in the
SemanticBase DS describes the kind of abstraction that has been performed in the
description of the entity. If it is not present, then the description is considered concrete.
If the abstraction is a media abstraction, then the dimension of the AbstractionLevel
element is set to zero. If a formal abstraction is present, the dimension of the element is
set to 1 or higher. The higher the value, the higher the abstraction. Thus, a value of 2 would indicate an abstraction of an abstraction.
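A consumer of descriptions might interpret the AbstractionLevel dimension along the lines of this small sketch; the function name is ours, and in the standard the dimension is encoded inside the SemanticBase DS rather than passed around as a bare value.

```python
# A sketch of how a consumer of descriptions might interpret the
# AbstractionLevel dimension described above (names are illustrative).
def kind_of_abstraction(dimension):
    if dimension is None:
        # AbstractionLevel absent: the description is concrete.
        return "concrete description"
    if dimension == 0:
        # Dimension zero: a media abstraction (describes similar instances).
        return "media abstraction"
    # 1 or higher: formal abstraction; 2 means an abstraction of an abstraction.
    return f"formal abstraction (order {dimension})"

print(kind_of_abstraction(None))  # concrete description
print(kind_of_abstraction(0))     # media abstraction
print(kind_of_abstraction(2))     # formal abstraction (order 2)
```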
The Relation DS rounds off the collection of representation tools for content
semantics. Relations capture how semantic entities are connected with each other. Thus,
examples of relations are "doctor-patient", "student-teacher", and so forth. Note that
since each of the entities in the relation lends itself to multiple levels of abstraction and
the relations in turn have properties, there is further abstraction that results from
relations.
APPLICATIONS
As we cover MPEG-7 semantic descriptions, we realize that they provide a rich
framework for static description of content semantics. Such a framework has the inherent
problem of providing an embarrassment of riches, which makes the management of the
browsing very difficult. Since MPEG-7 content semantics is very graph-oriented, it is clear
that it does not scale well as the number of concepts/events/objects goes up. Creation
of a deep hierarchy through very fine semantic subdivision of the objects would result
in the same problem of computational intractability. As the content semantic representation is pushed more and more towards a natural language representation, evidence from
natural language processing research indicates that the computational intractability will
be exacerbated. In our view, therefore, the practical utility of such representation is
restricted to cases in which either the concept hierarchies are not unmanageably broad,
or the concept hierarchies are not unmanageably deep, or both.
Our view is that in interactive systems, the human users will compensate for the
shallowness or narrowness of the concept hierarchies through their domain knowledge.
Since humans are known to be quick at sophisticated processing of data sets of small size,
the semantic descriptions should be at a broad scale to help narrow down the search
space. Thereafter, the human can compensate for the absence of detailed semantics
through use of low-level feature-based video browsing techniques such as video
summarization. Therefore, MPEG-7 semantic representations would be best used in
applications in which a modest hierarchy can help narrow down the search space
considerably. Let us consider some candidate applications.
Educational Applications
At first glance, since education is, after all, intended to be systematic acquisition
of knowledge, a semantics-based description of all the content seems reasonable. Our
experience indicates that restriction of the description to a narrow topic allows for a rich
description within the topic of research and makes for a successful learning experience
for the student. In any application in which the intention is to learn abstract concepts, an overly shallow concept hierarchy will be a hindrance. Hence our preference is to narrow the topic itself, limiting the breadth of the representation so as to buy some space for a deeper representation. The so-called edutainment systems fall in the same general
category with varying degrees of compromise between the richness of the descriptions
and the size of the database. Such applications include tourist information, cultural
services, shopping, social, film and radio archives, and so forth.
that evolves with interaction and the user's context. There is a static aspect to the descriptions, which limits the adaptive flexibility needed for different types of applications.
Nevertheless, a standard way to describe the relatively unambiguous aspects of content
does provide a starting point for many applications where the focus is content management.
The generic nature of MPEG-7 descriptions can be both a strength and a weakness.
The comprehensive library of DSs is aimed at supporting a large number of applications,
and there are several tools to support the development of descriptions required for a
particular application. However, this requires a deep knowledge of MPEG-7, and the large scope becomes a weakness, as it is impossible to pick and choose from a huge
library without understanding the implications of the choices made. As discussed in
section 4, often a modest set of content descriptions, DSs and elements may suffice for
a given application. This requires an application developer to first develop the
descriptions in the context of the application domain, determine the DSs to support the
descriptions, and then identify the required elements in the DSs. This is an involved
process and cannot be viewed in isolation from the domain and application context. As
MPEG-7 compliant applications start to be developed, it is possible that there could be
context-dependent elements and DSs that are essential to the application, but not
described in the standard, because the application context cannot be predetermined
during the definition stage.
In conclusion, these are still early days for MPEG-7 and its deployment in
managing the semantic aspects of multimedia applications. As the saying goes, the
proof of the pudding lies in the eating, and the success of the applications will determine
the success of the standard.
REFERENCES
Manjunath, B.S., Salembier, P., & Sikora, T. (2002). Introduction to MPEG-7: Multimedia content description interface. New York: John Wiley & Sons.
Martínez, J.M. (2003, March). MPEG-7 overview (version 9). ISO/IEC JTC1/SC29/WG11 N5525.
Vetro, A. (2001, January). MPEG-7 applications document version 10. ISO/IEC JTC1/SC29/WG11 N3934.
Section 3
User-Centric Approach
to Manage Semantics
Chapter 9
Visualization, Estimation
and User Modeling for
Interactive Browsing of
Personal Photo Libraries
Qi Tian, University of Texas at San Antonio, USA
Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA
Neal Lesh, Mitsubishi Electric Research Laboratories, USA
Chia Shen, Mitsubishi Electric Research Laboratories, USA
Thomas S. Huang, University of Illinois, USA
ABSTRACT
Recent advances in technology have made it possible to easily amass large collections
of digital media. These media offer new opportunities and place great demands on new digital content user-interface and management systems that can help people construct, organize, navigate, and share digital collections in an interactive, face-to-face social setting. In this chapter, we present a user-centric algorithm for visualization
and layout for content-based image retrieval (CBIR) in large photo libraries. Optimized
layouts reflect mutual similarities as displayed on a two-dimensional (2D) screen,
hence providing a perceptually intuitive visualization as compared to traditional
sequential one-dimensional (1D) content-based image retrieval systems. A framework
for user modeling also allows our system to learn and adapt to a user's preferences. The resulting retrieval, browsing and visualization can adapt to the user's (time-varying) notions of content, context and preferences in style and interactive navigation.
INTRODUCTION
Personal Digital Historian (PDH) Project
Recent advances in digital media technology offer opportunities for new story-sharing experiences beyond the conventional digital photo album (Balabanovic et al.,
2000; Dietz & Leigh, 2001). The Personal Digital Historian (PDH) project is an ongoing
effort to help people construct, organize, navigate and share digital collections in an
interactive multiperson conversational setting (Shen et al., 2001; Shen et al., 2003). The
research in PDH is guided by the following principles:
1. The display device should enable natural face-to-face conversation: not forcing everyone to face in the same direction (desktop) or at their own separate displays (hand-held devices).
2. The physical sharing device must be convenient and customary to use: helping to make the computer disappear.
3. Easy and fun to use across generations of users: minimizing time spent typing or formulating queries.
4. Enabling interactive and exploratory storytelling: blending authoring and presentation.
Current software and hardware do not meet our requirements. Most existing
software in this area provides users with either powerful query methods or authoring
tools. In the former case, the users can repeatedly query their collections of digital
content to retrieve information to show someone (Kang & Shneiderman, 2000). In the
latter case, a user experienced in the use of the authoring tool can carefully craft a story
out of his or her digital content to show or send to someone at a later time. Furthermore,
current hardware is also lacking. Desktop computers are not suitably designed for group,
face-to-face conversation in a social setting, and handheld story-telling devices have
limited screen sizes and can be used only by a small number of people at once. The
objective of the PDH project is to take a step beyond.
The goal of PDH is to provide a new digital content user-interface and management
system enabling face-to-face casual exploration and visualization of digital contents.
Unlike a conventional desktop user interface, PDH is intended for multiuser collaborative applications on single-display groupware. PDH enables casual and exploratory retrieval,
and interaction with and visualization of digital contents.
We design our system to work on a touch-sensitive, circular tabletop display
(Vernier et al., 2002), as shown in Figure 1. The physical PDH table that we use is a
standard tabletop with a top projection (either ceiling mounted or tripod mounted) that
displays on a standard whiteboard as shown in the right image of Figure 1. We use two
Mimio (www.mimio.com/meet/mimiomouse) styluses as the input devices for the first set
Figure 1. The PDH table: (a) an artistic rendering of the PDH table (designed by Ryan Bardsley, Tixel HCI, www.tixel.net) and (b) the physical PDH table
of user experiments. The layout of the entire tabletop display consists of (1) a large story-space area encompassing most of the tabletop up to the perimeter, and (2) one or more narrow arched control panels (Shen et al., 2001). The present PDH table is implemented using our DiamondSpin (www.merl.com/projects/diamondspin) circular-table Java toolkit. DiamondSpin is intended for multiuser collaborative applications
(Shen et al., 2001; Shen et al., 2003; Vernier et al., 2002).
The conceptual model of PDH focuses on developing content organization and retrieval metaphors that can be easily comprehended by users without distracting from
the conversation. We adopt a model of organizing the materials using the four questions
essential to storytelling: "who", "when", "where", and "what" (the four Ws). We do not currently
[Figure 2. An example of navigation by the four-Ws model (Who, When, Where, What)]
support "why", which is also useful for storytelling. Control panels located on the perimeter of the table contain buttons labeled "people", "calendar", "location", and "events", corresponding to these four questions. When a user presses the "location" button, for
example, the display on the table changes to show a map of the world. Every picture in
the database that is annotated with a location will appear as a tiny thumbnail at its
location. The user can pan and zoom in on the map to a region of interest, which increases
the size of the thumbnails. Similarly, by pressing one of the other three buttons, the user
can cause the pictures to be organized by the time they were taken along a linear timeline,
the people they contain, or the event keywords with which the pictures were annotated.
We assume the pictures are partially annotated. Figure 2 shows an example of navigation
of a personal photo album by the four-Ws model. Adopting this model allows users to
think of their documents in terms of how they would like to record them as part of their
history collection, not necessarily in a specific hierarchical structure. The user can make
selections among the four Ws and PDH will automatically combine them to form rich
Boolean queries implicitly for the user (Shen et al., 2001; Shen, Lesh, Vernier, Forlines,
& Frost, 2002; Shen et al., 2003; Vernier et al., 2002).
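A minimal sketch of how such implicit Boolean queries might be combined is shown below. The `Photo` record and its field names are illustrative assumptions, not the chapter's actual data model; the point is only that whichever of the four Ws the user selects are implicitly ANDed together.

```python
from dataclasses import dataclass, field

# Hypothetical annotation record; field names are illustrative only.
@dataclass
class Photo:
    who: set = field(default_factory=set)    # people in the picture
    where: str = ""                          # annotated location
    when: int = 0                            # year, for simplicity
    what: set = field(default_factory=set)   # event keywords

def four_w_query(photos, who=None, where=None, when=None, what=None):
    """Implicitly AND together whichever of the four Ws the user selected."""
    result = []
    for p in photos:
        if who is not None and not (who & p.who):
            continue
        if where is not None and p.where != where:
            continue
        if when is not None and p.when != when:
            continue
        if what is not None and not (what & p.what):
            continue
        result.append(p)
    return result
```

A selection of, say, a person and an event keyword then retrieves only the partially annotated pictures matching both, without the user ever writing a Boolean expression.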
The PDH project combines and extends research in two main areas: (i) human-computer
interaction (HCI) and interface design (the shared-display devices, user interfaces for
storytelling and online authoring, and story-listening) (Shen et al., 2001, 2002,
2003; Vernier et al., 2002); and (ii) content-based information visualization, presentation and
retrieval (user-guided image layout, data mining and summarization) (Moghaddam et al.,
2001, 2002, 2004; Tian et al., 2001, 2002). Our work has been done along these two lines.
The work by Shen et al. (2001, 2002, 2003) and Vernier et al. (2002) focused on the HCI
and interface design issues of the first research area. The work in this chapter is set in the
context of PDH but focuses on the visualization, smart layout, user modeling and retrieval
part. In this chapter, we propose a novel visualization and layout algorithm that can
enhance informal storytelling using personal digital data such as photos, audio and
video in a face-to-face social setting. A framework for user modeling also allows our
system to learn and adapt to a user's preferences. The resulting retrieval, browsing and
visualization can adapt to the user's (time-varying) notions of content, context and
preferences in style and interactive navigation.
Related Work
In content-based image retrieval (CBIR), most current techniques are restricted to
matching image appearance using primitive features such as color, texture, and shape.
Most users wish to retrieve images by semantic content (the objects/events depicted)
rather than by appearance. The resultant semantic gap between user expectations and
the current technology is the prime cause of the poor uptake of CBIR technology. Because
of the semantic gap (Smeulders et al., 2000), visualization becomes very important for the
user to navigate the complex query space. New visualization tools are required to allow for
user-dependent and goal-dependent choices about what to display and how to provide
feedback. The query result has an inherent display dimension that is often ignored. Most
methods display images in a 1D list in order of decreasing similarity to the query images.
Enhancing the visualization of the query results is, however, a valuable tool in helping
the user navigate the query space. Recently, Horoike and Musha (2000), Nakazato and Huang
(2001), Santini and Jain (2000), Santini et al. (2001), and Rubner (1999) have also explored
content-based visualization. A common observation in these works is that the
images are displayed in 2D or 3D space from the projection of the high-dimensional
feature spaces. Images are placed in such a way that distances between images in 2D or
3D reflect their distances in the high-dimensional feature space. In the works of Horoike
and Musha (2000) and Nakazato and Huang (2001), the users can view large sets of images
in 2D or 3D space, and user navigation is allowed. In the works of Nakazato and Huang
(2001) and Santini et al. (2000, 2001), the system allows user interaction on image locations
and the forming of new groups. In the work of Santini et al. (2000, 2001), users can manipulate
the projected distances between images and learn from such a display.
Our work (e.g., Tian et al., 2001, 2002; Moghaddam et al., 2001, 2002, 2004) under the
context of PDH shares many common features with the related work (Horoike & Musha,
2000; Nakazato & Huang, 2001; Santini et al., 2000, 2001; Rubner, 1999). However, a
learning mechanism from the display is not implemented in Horoike and Musha (2000),
and 3D MARS (Nakazato & Huang, 2001) is an extension to our work (Tian et al., 2001;
Moghaddam et al., 2001) from 2D to 3D space. Our system differs from the work of Rubner
(1999) in that we adopted different mapping methods. Our work shares some features with
the work by Santini and Jain (2000) and Santini et al. (2001) except that our PDH system
is currently being incorporated into a much broader system for computer- and human-guided
navigation, browsing, archiving, and interactive storytelling with large photo libraries.
The part of this system described in the remainder of this chapter is, however, specifically
geared towards adaptive user modeling and relevance estimation and based primarily on
visual features as opposed to semantic annotation as in Santini and Jain (2000) and
Santini et al. (2001).
The rest of the chapter is organized as follows. In Content-Based Visualization, we
present designs for uncluttered visualization and layout of images (or iconic data in
general) in a 2D display space for content-based image retrieval (Tian et al., 2001;
Moghaddam et al., 2001). In Context and User Modeling, we further provide a mathematical
framework for user modeling, which adapts to and mimics the user's (possibly changing)
preferences and style of interaction, visualization and navigation (Moghaddam et al.,
2002, 2004; Tian et al., 2002). Monte Carlo simulations in the Statistical Analysis section,
together with the next section on the User Preference Study, demonstrate the ability of our
framework to model or mimic users by automatically generating layouts according to
their preferences. Finally, Discussion and Future Work are given in the final section.
CONTENT-BASED VISUALIZATION
With the advances in technology to capture, generate, transmit and store large
amounts of digital imagery and video, research in content-based image retrieval (CBIR)
has gained increasing attention. In CBIR, images are indexed by their visual contents
such as color, texture, and so forth. Many research efforts have addressed how to extract
these low-level features (Stricker & Orengo, 1995; Smith & Chang, 1994; Zhou et al., 1999),
evaluate distance metrics (Santini & Jain, 1999; Popescu & Gader, 1998) for similarity
measures and look for efficient searching schemes (Squire et al. 1999; Swets & Weng,
1999).
In this section, we present a user-centric algorithm for visualization and layout for
content-based image retrieval. Image features (visual and/or semantic) are used to
display retrievals as thumbnails in a 2D spatial layout or configuration which conveys
pair-wise mutual similarities. A graphical optimization technique is used to provide
maximally uncluttered and informative layouts. We should note that one physical
instantiation of the PDH table is that of a roundtable, for which we have in fact
experimented with polar coordinate conformal mappings for converting traditional
rectangular display screens. However, in the remainder of this chapter, for purposes of
ease of illustration and clarity, all layouts and visualizations are shown on rectangular
displays only.
Traditional Interfaces
The purpose of automatic content-based visualization is to augment the user's
understanding of large information spaces that cannot be perceived by traditional
sequential display (e.g., by rank order of visual similarities). The standard and commercially
prevalent image-management and browsing tools currently available primarily use
tiled sequential displays, that is, essentially a simple 1D similarity-based visualization.
However, the user can quite often benefit from a global view of a working
subset of retrieved images in a way that reflects the relations between all pairs of images
(that is, N² measurements as opposed to only N). Moreover, even a narrow view of one's
immediate surroundings defines context and can offer an indication of how to explore
the dataset. The wider this visible horizon, the more efficient the new query will be.
Visual Features
We first describe the low-level visual feature extraction used in our system.
Three visual features are used: color moments (Stricker & Orengo,
1995), wavelet-based texture (Smith & Chang, 1994), and the water-filling edge-based
structure feature (Zhou et al., 1999).
The color space we use is HSV because of its decorrelated coordinates and its
perceptual uniformity (Stricker & Orengo, 1995). We extract the first three moments
(mean, standard deviation and skewness) from the three color channels and therefore
have a color feature vector of length 3 × 3 = 9.
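A minimal sketch of this color-moment extraction is given below. It assumes the image has already been converted to HSV elsewhere, and the signed cube-root convention for the skewness term is our assumption about the chapter's implementation.

```python
import numpy as np

def color_moments(hsv):
    """First three moments (mean, std, skewness) per HSV channel -> 9-dim vector.
    `hsv` is an H x W x 3 array; RGB-to-HSV conversion is assumed done elsewhere."""
    feats = []
    for c in range(3):
        ch = hsv[..., c].astype(float).ravel()
        mu = ch.mean()
        sigma = ch.std()
        # Skewness taken as the signed cube root of the third central moment
        third = ((ch - mu) ** 3).mean()
        skew = np.sign(third) * abs(third) ** (1.0 / 3.0)
        feats.extend([mu, sigma, skew])
    return np.array(feats)
```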
For wavelet-based texture, the original image is fed into a wavelet filter bank and
is decomposed into 10 decorrelated subbands. Each subband captures the characteristics of a certain scale and orientation of the original image. For each subband, we extract
the standard deviation of the wavelet coefficients and therefore have a texture feature
vector of length 10.
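A sketch of such a texture feature follows, assuming a simple 3-level Haar decomposition (nine detail subbands plus one final approximation, giving the 10 subbands mentioned above); the chapter's actual wavelet filter bank may differ.

```python
import numpy as np

def haar_level(x):
    """One level of a 2D Haar transform -> (LL, LH, HL, HH) subbands."""
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]  # trim to even size
    a = (x[0::2] + x[1::2]) / 2.0    # row averages
    d = (x[0::2] - x[1::2]) / 2.0    # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def wavelet_texture(img, levels=3):
    """Std of the coefficients in each of the 10 subbands of a 3-level decomposition
    (3 detail subbands per level, plus the final approximation)."""
    feats = []
    approx = img.astype(float)
    for _ in range(levels):
        approx, lh, hl, hh = haar_level(approx)
        feats.extend([lh.std(), hl.std(), hh.std()])
    feats.append(approx.std())
    return np.array(feats)
```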
For the water-filling edge-based structure feature vector, we first pass the original
images through an edge detector to generate their corresponding edge maps. We extract
eighteen (18) elements from the edge maps, including max fill time, max fork count, and
so forth. For a complete description of this edge feature vector, interested readers are
referred to Zhou et al. (1999).
Figure 3. Top 20 retrieved images (ranked top to bottom and left to right; query is shown
first in the list)
Figure 4 shows an example of a PCA Splat for the top 20 retrieved images shown in
Figure 3. In addition to visualization by layout, in this particular example, the sizes
(alternatively contrast) of the images are determined by their visual similarity to the
query. The higher the rank, the larger the size (or the higher the contrast). There is also
a number next to each image in Figure 4 indicating its corresponding rank in Figure 3. The
view of the query image, that is, the top-left one in Figure 3, is blocked by the images ranked
19th, fourth, and 17th in Figure 4. A better view is achieved in Figure 7 after display
optimization.
Clearly the relevant images are now better clustered in this new layout as opposed
to being dispersed along the tiled 1D display in Figure 3. Additionally, PCA Splats
convey N² mutual distance measures relating all pair-wise similarities between images,
whereas the ranked 1D display in Figure 3 provides only N.
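A minimal sketch of a PCA Splat layout, under the assumption that it projects the high-dimensional feature vectors to 2D via PCA and scales each thumbnail by its retrieval rank (the specific size formula here is our illustrative choice, not the chapter's):

```python
import numpy as np

def pca_splat(X, ranks):
    """Project an N x L feature matrix to 2D with PCA and assign a display size
    that shrinks with retrieval rank (rank 0 = best match)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                       # N x 2 layout coordinates
    ranks = np.asarray(list(ranks), dtype=float)
    sizes = 1.0 - 0.8 * ranks / max(len(ranks) - 1, 1)   # sizes in [0.2, 1]
    return coords, sizes
```

Nearby points in the 2D layout then correspond to feature vectors that are close in the original space, which is what conveys the N² pairwise similarities at a glance.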
Display Optimization
However, one drawback of the PCA Splat is that some images can be partially or totally
overlapped, which makes it difficult to view all the images at the same time. The overlap
becomes even worse as the number of retrieved images grows, for example beyond 50.
To solve the overlap problem between the retrieved images, a novel optimization
technique is proposed in this section.
Given a set of retrieved images and their corresponding sizes and positions, our
optimizer tries to find a solution that places the images at appropriate positions while
deviating as little as possible from their initial PCA Splat positions. Assume the number
of images is N. The image positions are represented by their center coordinates (x_i, y_i),
i = 1, ..., N, and the initial image positions are denoted (x_i^o, y_i^o), i = 1, ..., N. The minimum
and maximum coordinates of the 2D screen are [x_min, x_max, y_min, y_max]. The image size is
represented for simplicity by its radius r_i, i = 1, ..., N, and the maximum and minimum
image sizes are r_max and r_min, respectively. The initial image size is r_i^o, i = 1, ..., N.
To minimize the overlap, the images can be automatically moved away from each
other to decrease the overlap between images, but this will increase the deviation of the
images from their initial positions. A large deviation is certainly undesirable because the
initial positions provide important information about the mutual similarities between images.
So there is a trade-off between minimizing overlap and minimizing deviation.
Without increasing the overall deviation, an alternative way to minimize the overlap is
simply to shrink the image sizes as needed, down to a minimum size limit. The image size
will not be increased in the optimization process because this would always increase the
overlap. For this reason, the initial image size r_i^o is assumed to be r_max.
The total cost function is designed as a linear combination of the individual cost
functions taking into account two factors. The first factor is to keep the overall overlap
between the images on the screen as small as possible. The second factor is to keep the
overall deviation from the initial position as small as possible.
$$J = F(p) + \lambda\, S\, G(p) \qquad (1)$$

where F(p) is the cost function of the overall overlap, G(p) is the cost function of the
overall deviation from the initial image positions, and S is a scaling factor that brings the
range of G(p) to the same range as F(p); S is chosen to be (N−1)/2. λ is a weight with
λ ≥ 0. When λ is zero, the deviation of images is not considered in overlap
minimization. When λ is less than one, minimizing overall overlap is more important than
minimizing overall deviation, and vice versa when λ is greater than one.
$$F(p) = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} f(p) \qquad (2)$$

$$f(p) = \begin{cases} 1 - e^{-u^2/\sigma_f^2}, & u > 0 \\ 0, & u \le 0 \end{cases} \qquad (3)$$

where u is the amount of overlap between a pair of images and σ_f is a curvature-controlling
factor, computed by setting f(p) = T at the maximum possible overlap u = 2r_max:

$$\sigma_f^2 = \left.\frac{-u^2}{\ln(1-T)}\right|_{u=2r_{\max}} \qquad (4)$$
(4)
$$G(p) = \sum_{i=1}^{N} g(p) \qquad (5)$$

$$g(p) = 1 - e^{-v^2/\sigma_g^2} \qquad (6)$$

where $v = \sqrt{(x_i - x_i^o)^2 + (y_i - y_i^o)^2}$ is the measure of deviation of the ith image
from its initial position, σ_g is a curvature-controlling factor, and (x_i, y_i) and (x_i^o, y_i^o)
are the optimized and initial center coordinates of the ith image, respectively, i = 1, ..., N.
Figure 6 shows the plot of g(p): as v increases, the cost of deviation also increases.
From Figure 6, σ_g in Equation (6) is calculated by setting g(p) = T = 0.95 when v = maxsep.
In our work, maxsep is set to 2r_max.

$$\sigma_g^2 = \left.\frac{-v^2}{\ln(1-T)}\right|_{v=\mathrm{maxsep}} \qquad (7)$$
The optimization process minimizes the total cost J by finding a (locally) optimal
set of sizes and image positions. The nonlinear optimization was implemented with an
iterative gradient-descent method (with line search). Once converged, the images are
redisplayed based on the new optimized sizes and positions.
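The optimization above can be sketched as follows. This is a simplified stand-in: it holds the image sizes fixed (the chapter also shrinks sizes toward r_min) and uses numerical gradients instead of the analytic gradient descent with line search.

```python
import numpy as np

def layout_cost(pos, pos0, r, sigma_f, sigma_g, lam):
    """Total cost J = F(p) + lam * S * G(p), S = (N-1)/2, per Equations (1)-(6).
    pos, pos0: N x 2 current and initial centers; r: image radii (held fixed here)."""
    n = len(pos)
    F = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(pos[i] - pos[j])
            u = r[i] + r[j] - d              # positive when images i and j overlap
            if u > 0:
                F += 1.0 - np.exp(-u ** 2 / sigma_f ** 2)
    v = np.linalg.norm(pos - pos0, axis=1)   # deviation of each image from its start
    G = np.sum(1.0 - np.exp(-v ** 2 / sigma_g ** 2))
    return F + lam * ((n - 1) / 2.0) * G

def optimize_layout(pos0, r, T=0.95, lam=0.5, steps=200, lr=0.05):
    """Minimize J by plain gradient descent with finite-difference gradients."""
    maxsep = 2 * r.max()
    sigma_f = np.sqrt(-(2 * r.max()) ** 2 / np.log(1 - T))  # from Eq. (4)
    sigma_g = np.sqrt(-maxsep ** 2 / np.log(1 - T))         # from Eq. (7)
    pos, eps = pos0.copy(), 1e-4
    for _ in range(steps):
        base = layout_cost(pos, pos0, r, sigma_f, sigma_g, lam)
        grad = np.zeros_like(pos)
        for idx in np.ndindex(*pos.shape):
            p2 = pos.copy()
            p2[idx] += eps
            grad[idx] = (layout_cost(p2, pos0, r, sigma_f, sigma_g, lam) - base) / eps
        pos = pos - lr * grad
    return pos
```

Overlapping images are pushed apart until the deviation penalty G balances the overlap penalty F, while non-overlapping images barely move.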
Figure 7 shows the optimized PCA Splats for Figure 3. The image with a yellow frame
is the query image in Figure 3. Clearly, the overlap is minimized while the relevant images
are still close to each other to allow a global view. With such a display, the user can see
the relations between the images, better understand how the query performed, and
subsequently formulate future queries more naturally. Additionally, attributes such as
contrast and brightness can be used to convey rank. We note that this additional visual
aid is essentially a third dimension of information display. For example, images with
higher rank could be displayed with larger size or increased brightness to make them
stand out from the rest of the layout. An interesting example is to display time or
timeliness by associating the size or brightness with how long ago the picture was
taken, thus images from the past would appear smaller or dimmer than those taken
recently. A full discussion of the resulting enhanced layouts is deferred to future work.
Also, we should point out that despite our ability to clean up layouts for maximal
visibility with the optimizer we have designed, all subsequent figures in this chapter show
Splats without any overlap minimization, because, for illustrating (as well as comparing)
the accuracy of the estimation results in subsequent sections, the absolute positions were
necessary and important.
$$J = \sum_{i=1}^{N}\sum_{j=1}^{N}\left(d_{ij}^{\,p} - \sum_{k=1}^{L}\omega_k^{\,p}\,|X_i(k) - X_j(k)|^p\right)^2 \qquad (8)$$
The global minimum of this cost function, corresponding to the optimal weight
parameters ω, is easily obtained using constrained (nonnegative) least squares. To
minimize J, we take the partial derivative of J with respect to each ω_l^p for l = 1, ..., L
and set it to zero:

$$\frac{\partial J}{\partial \omega_l^{\,p}} = 0, \qquad l = 1, \ldots, L \qquad (9)$$
We thus have

$$\sum_{k=1}^{L}\omega_k^{\,p}\sum_{i=1}^{N}\sum_{j=1}^{N}|X_i(l)-X_j(l)|^p\,|X_i(k)-X_j(k)|^p = \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}^{\,p}\,|X_i(l)-X_j(l)|^p, \qquad l = 1, \ldots, L \qquad (10)$$
Define

$$R(l,k) = \sum_{i=1}^{N}\sum_{j=1}^{N}|X_i(l)-X_j(l)|^p\,|X_i(k)-X_j(k)|^p \qquad (11)$$

$$r(l) = \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}^{\,p}\,|X_i(l)-X_j(l)|^p \qquad (12)$$

so that Equation (10) becomes

$$\sum_{k=1}^{L} R(l,k)\,\omega_k^{\,p} = r(l), \qquad l = 1, \ldots, L \qquad (13)$$

or, in matrix form, R ω = r, where

$$R = \begin{bmatrix} R(1,1) & R(1,2) & \cdots & R(1,L) \\ R(2,1) & R(2,2) & \cdots & R(2,L) \\ \vdots & \vdots & \ddots & \vdots \\ R(L,1) & R(L,2) & \cdots & R(L,L) \end{bmatrix}, \qquad \omega = \begin{bmatrix} \omega_1^{\,p} \\ \omega_2^{\,p} \\ \vdots \\ \omega_L^{\,p} \end{bmatrix}, \qquad r = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(L) \end{bmatrix} \qquad (14)$$
$$J = \sum_{i=1}^{N}\left|\,p^{(i)}(x,y) - \hat{p}^{(i)}(x,y)\,\right| \qquad (15)$$
where p^(i)(x,y) and p̂^(i)(x,y) are the original and projected 2D locations of the ith image,
respectively. This formulation is a more direct approach to estimation since it deals with
the final position of the images in the layout. Unfortunately, however, this approach
requires the simultaneous estimation of both the weight vectors as well as the projection
basis and consequently requires less-accurate iterative re-estimation techniques (as
opposed to more robust closed-form solutions possible with Equation (8)). A full
derivation of the solution for our deviation-based estimation is shown in Appendix A.
We now compare the two estimation methods: stress-based and deviation-based. The
former is the more useful and robust, and is identifiable in the control-theory sense of the
word. The latter uses a somewhat unstable re-estimation framework and does not always
give satisfactory results; however, we still provide a detailed description for the sake of
completeness. The shortcomings of this latter method are immediately apparent from the
solution requirements. This discussion can be found in Appendix A.
For the reasons mentioned above, in all the experiments reported in this chapter, we
use only the stress-based method of Equation (8) for estimation.
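A sketch of this stress-based estimation for p = 1 is given below. It builds R and r exactly as in Equations (11)-(12), but substitutes a simple projected-gradient loop for a full nonnegative least-squares solver; since R is symmetric positive semidefinite, the loop converges to the nonnegative solution of R ω = r.

```python
import numpy as np

def estimate_weights(X, D, p=1):
    """Estimate the nonnegative weights w_k^p of Equations (10)-(14) from a layout.
    X: N x L feature matrix; D: N x N matrix of target (layout) distances."""
    n, L = X.shape
    A = np.abs(X[:, None, :] - X[None, :, :]) ** p   # |X_i(k) - X_j(k)|^p, N x N x L
    R = np.einsum('ijl,ijk->lk', A, A)               # Eq. (11)
    r = np.einsum('ij,ijl->l', D ** p, A)            # Eq. (12)
    # Projected gradient descent on (1/2) w^T R w - r^T w with w >= 0,
    # whose unconstrained minimizer solves R w = r (R is symmetric PSD).
    w = np.zeros(L)
    step = 1.0 / np.linalg.norm(R, 2)                # 1 / largest eigenvalue of R
    for _ in range(5000):
        w = np.maximum(w - step * (R @ w - r), 0.0)
    return w
```

When the layout distances were actually generated by some hidden weight vector, this recovers it (up to the information lost in the layout itself).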
We note that in principle it is possible to use a single weight for each dimension of
the feature vector. However, this would lead to a poorly determined estimation problem
since it is unlikely (and/or undesirable) to have that many sample images from which to
estimate all individual weights. Even with plenty of examples (an over-determined
system), chances are that the estimated weights would generalize poorly to a new set of
images; this is the same principle at work in a modeling or regression problem, where the
order of the model or the number of free parameters should be less than the number of
available observations.
Therefore, in order to avoid the problem of over-fitting and the subsequent poor
generalization on new data, it is ideal to use fewer weights. In this respect, the fewer the
weights (or the more subspace groupings) there are, the better the generalization
performance. Since all of the visual features, that is, 37 features, derive from three
different (independent) visual attributes: color, texture and structure, it seems prudent
to use three weights corresponding to these three subspaces. Furthermore, this number
is sufficiently small to almost guarantee that we will always have enough images in one
layout from which to estimate these three weights. Therefore, in the remaining portion
of the chapter, we only estimate a weighting vector ω = {ω_c, ω_t, ω_s}^T, where ω_c is the
weight for the color feature of length L_c, ω_t is the weight for the texture feature of length
L_t, and ω_s is the weight for the structure feature of length L_s. These weights ω_c, ω_t, ω_s
are constrained such that they always sum to 1, and L = L_c + L_t + L_s.
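A sketch of producing a PCA Splat from the three weighted subspaces follows. The three feature matrices and the scalar-multiplication weighting scheme (each subspace scaled by its weight before concatenation and PCA) are our assumptions about how the weights enter the projection.

```python
import numpy as np

def weighted_pca_splat(Xc, Xt, Xs, w):
    """Weight the color/texture/structure subspaces by w = (w_c, w_t, w_s),
    enforce the sum-to-1 constraint, then project to 2D with PCA."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                          # weights constrained to sum to 1
    X = np.hstack([w[0] * Xc, w[1] * Xt, w[2] * Xs])
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                      # N x 2 layout coordinates
```

With the learned weights of Figure 8 (ω_c = 0.3729, ω_t = 0.5269, ω_s = 0.1002), texture dominates the projection, so images with similar texture cluster together regardless of color.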
Figure 8 shows a simple user layout where three car images are clustered together
despite their different colors. The same is performed with three flower images (despite
their texture/structure). These two clusters maintain a sizeable separation, thus suggesting
two separate concept classes implicit in the user's placement. Specifically, in this
layout the user is clearly concerned with the distinction between car and flower
regardless of color or other possible visual attributes.
Applying the ω-estimation algorithm to Figure 8, the feature weights learned from
this layout are ω_c = 0.3729, ω_t = 0.5269 and ω_s = 0.1002. This shows that the most
important feature in this case is texture and not color, which is in accord with the concepts
of car versus flower as graphically indicated by the user in Figure 8.
Figure 9. PCA Splat on a larger set of images using (a) estimated weights and (b)
arbitrary weights
Now that we have the learned feature weights (or modeled the user) what can we
do with them? Figure 9 shows an example of a typical application: automatic layout of a
larger (more complete data set) set of images in the style indicated by the user. Figure
9(a) shows the PCA Splat using the learned feature weight for 18 cars and 19 flowers. It
is obvious that the PCA Splat using the estimated weights captures the essence of the
configuration layout in Figure 8. Figure 9(b) shows a PCA Splat of the same images but
with a randomly generated ω, denoting an arbitrary but coherent 2D layout, which in this
case favors color (ω_c = 0.7629). This comparison reveals that proper feature weighting
is an important factor in generating user-desired and sensible layouts. We should
point out that a random ω does not generate a random layout, but rather one that is still
coherent, displaying consistent groupings or clustering. Here we have used such
random layouts as substitutes for alternative (arbitrary) layouts that are nevertheless
valid (differing only in the relative contribution of the three features to the final design
of the layout). Given the difficulty of obtaining hundreds (let alone thousands) of real
user layouts that are needed for more complete statistical tests (such as those in the next
section), random layouts are the only conceivable way of simulating a layout by a
real user in accordance with familiar visual criteria such as color, texture or structure.
Figure 10(a) shows an example of another layout. Figure 10(b) shows the
corresponding computer-generated layout of the same images with their high-dimensional
feature vectors weighted by the estimated ω, which is recovered solely from the 2D
configuration of Figure 10(a). In this instance the reconstruction of the layout is near
perfect, thus demonstrating that our high-dimensional subspace feature weights can in
fact be recovered from pure 2D information. For comparison, Figure 10(c) shows the PCA
Splat of the same images with their high-dimensional feature vectors weighted by a
random ω.
Figure 11 shows another example of user-guided layout. Assume that the user is
describing her family story to a friend. In order not to disrupt the conversational flow,
she only lays out a few photos from her personal photo collections and expects the
computer to generate a similar and consistent layout for a larger set of images from the
same collection. Figure 11(b) shows the computer-generated layout based on the learned
feature weights from the configuration of Figure 11(a). The computer-generated layout
is achieved using the ω-estimation scheme followed by a post-processing linear (e.g.,
affine) or nonlinear transformation. Only the 37 visual features (nine color moments
(Stricker & Orengo, 1995), 10 wavelet moments (Smith & Chang, 1994) and 18 water-filling
features (Zhou et al., 1999)) were used for this PCA Splat. Clearly the computer-generated layout
Figure 11. User modeling for automatic layout: (a) a user-guided layout; (b) computer
layout for a larger set of photos (four classes and two photos from each class)

is similar to the user layout, with the visually similar images positioned at the
user-indicated locations. We should add that in this example no semantic features (keywords)
were used, but it is clear that their addition would only enhance such a layout.
STATISTICAL ANALYSIS
Given the lack of a sufficiently large pool of (willing) human subjects, we undertook a
Monte Carlo approach to testing our user-modeling and estimation method. Monte Carlo
simulation (Metropolis & Ulam, 1949) repeatedly generates random values for uncertain
variables in order to simulate a model. We thereby simulated 1,000 computer-generated
layouts (representing ground-truth values of ω), which were meant to emulate 1,000 actual
user layouts or preferences. In each case, estimation was performed to recover the
original values as well as possible. Note that this recovery is only partially effective due
to the information loss in projecting down to a 2D space. As a control, 1,000 randomly
generated feature weights were used to see how well they could match the user layouts
(i.e., by chance alone).
Our primary test database consists of 142 images from the COREL database. It has
7 categories of car, bird, tiger, mountain, flower, church and airplane. Each class has about
20 images. Feature extraction based on color, texture and structure has been done
off-line and prestored. Although we report on this test data set due to its common use
and familiarity in the CBIR community, we should emphasize that we have also
successfully tested our methodology on larger and much more heterogeneous image
libraries (for example, real personal photo collections of 500+ images, including family,
friends, vacations, etc.). Depending on the particular domain, one can obtain different
degrees of performance, but one thing is certain: for a narrow application domain (for
example, medical images, logos, trademarks, etc.) it is quite easy to construct systems
that work extremely well by taking advantage of the limiting constraints in the imagery.
The following is the Monte Carlo procedure that was used for testing the significance
and validity of user modeling with ω estimation:
Figure 12. Scatter plot of ω estimation: estimated weights versus original weights

1. Randomly select M images from the database. Generate arbitrary (random) feature
   weights ω in order to simulate a user layout.
2. Do a PCA Splat using this ground-truth ω.
3. From the resulting 2D layout, estimate ω and denote the estimate ω̂.
4. Select a new, distinct (nonoverlapping) set of M images from the database.
5. Do PCA Splats on the second set using the original ω, the estimated ω̂, and a third,
   random ω' (as control).
6. Calculate the resulting stress of Equation (8) and the layout deviation (2D position
   error) of Equation (15) for the original, estimated and random (control) values ω,
   ω̂, and ω', respectively.
7. Repeat 1,000 times.
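The steps above can be sketched on synthetic stand-in features as below. One deliberate simplification: the target distances here are the exact weighted feature distances rather than distances measured in a projected 2D splat, so weight recovery is near-perfect; the chapter's pipeline projects through 2D first, which loses information and is why its win rates are roughly 73-78% rather than near 100%.

```python
import numpy as np

def estimate_w(X, D):
    """Recover nonnegative weights from layout distances D (Eqs. 8-14, p = 1)."""
    A = np.abs(X[:, None, :] - X[None, :, :])       # |X_i(k) - X_j(k)|
    R = np.einsum('ijl,ijk->lk', A, A)              # Eq. (11)
    r = np.einsum('ij,ijl->l', D, A)                # Eq. (12)
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(R, 2)
    for _ in range(2000):                           # projected gradient: R w = r, w >= 0
        w = np.maximum(w - step * (R @ w - r), 0.0)
    return w

def stress(X, D, w):
    """Stress of Equation (8) for a candidate weight vector (p = 1)."""
    A = np.abs(X[:, None, :] - X[None, :, :])
    return float(np.sum((D - A @ w) ** 2))

rng = np.random.default_rng(1)
trials, wins = 50, 0                                # the chapter ran 1,000 trials
for _ in range(trials):
    X1 = rng.random((20, 3))                        # stand-in features, first image set
    w0 = rng.dirichlet(np.ones(3))                  # ground-truth "user" weights
    D1 = np.abs(X1[:, None] - X1[None]) @ w0        # layout distances under w0
    w_est = estimate_w(X1, D1)                      # recover weights from the layout
    X2 = rng.random((20, 3))                        # new, distinct image set
    D2 = np.abs(X2[:, None] - X2[None]) @ w0        # its layout under the same user model
    w_rand = rng.dirichlet(np.ones(3))              # random control weights
    if stress(X2, D2, w_est) <= stress(X2, D2, w_rand):
        wins += 1
print(f"estimated weights win {100.0 * wins / trials:.0f}% of trials")
```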
The scatter plot of the ω estimation is shown in Figure 12. Clearly there is a direct linear
relationship between the original weights ω and the estimated weights ω̂. Note that when
the original weight is very small (<0.1) or very large (>0.9), the estimated weight is
correspondingly zero or one. This means that when one particular feature weight is very
large (or very small), the corresponding feature becomes the most dominant (or least
dominant) feature in the PCA, and therefore the estimated weight for this feature will be
either one or zero. This saturation phenomenon in Figure 12 occurs most prominently
for structure (lower left of the rightmost panel), possibly because the structure feature
vector is so (relatively) high-dimensional. Additionally, structure features are not as well
defined as color and texture (e.g., they have less discriminating power).
In terms of actual measures of stress and deviation, we found that the ω-estimation
scheme yielded the smaller deviation 78.4% of the time and the smaller stress 72.9% of
the time. The main reason these values are less than 100% is the nature of the Monte
Carlo testing and the fact that, working in low-dimensional (2D) spaces, random weights can be close
Figure 13. Scatter plots of deviation scores: (a) equal weights (y-axis) versus estimated
weights (x-axis); (b) random weights (y-axis) versus estimated weights (x-axis)
to the original weights and hence can often generate similar user layouts (in this case
apparently about 25% of the time).
We should add that an alternative control, or null hypothesis, to that of random
weights is that of fixed equal weights ω = {1/3, 1/3, 1/3}^T. This weighting scheme
corresponds to the assumption that there are to be no preferential biases in the subspaces
of the features: that they should all count equally in forming the final layout (the default
PCA). But the fundamental premise behind this chapter is that there is a changing or
variable bias in the relative importance of the different features, as manifested by different
user layouts and styles. In fact, if there were no bias in the weights (i.e., they were set
equal), then there would be no user modeling or adaptation necessary, since there would
always be just one type or style of layout (the one resulting from equal weights). In order
to understand this question fully, we compare the results of random weights versus equal
weights (against the estimation framework advocated).
In an identical set of experiments, replacing the random comparison weights
with equal weights ω = {1/3, 1/3, 1/3}^T, we found a similar distribution of similarity
scores. In particular, since the goal is obtaining accurate 2D layouts where positional
accuracy is critical, we examine the resulting deviation for both random weights and
equal weights versus estimated weights. We carried out a large Monte Carlo experiment
(10,000 trials), and Figure 13 shows the scatter plots of the deviation scores. Points above
the diagonal (not shown) indicate deviation performance worse than that of weight
estimation. As can be seen here, the test results are roughly comparable for equal and
random weights.
In Figure 13, we noted that the -estimation scheme yielded the smaller deviation
72.6% of the time compared to equal weights (as opposed to the 78.4% compared with
random weights). We therefore note that the results and conclusions of these experiments are consistent despite the choice of equal or random controls, and ultimately direct
estimation of a user layout is best.
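The Monte Carlo protocol above can be sketched in code. This is a minimal illustration only, not the chapter's actual experiment: the feature-group dimensions, the stand-in "estimated" weights (a noisy copy of the hidden true weights), the decaying variance spectrum, and the sign-alignment step are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = (10, 20, 7)                  # assumed color/texture/structure dims (sum = 37)
N, D = 50, sum(lengths)

def weighted_pca_layout(X, alpha, lengths):
    """2D layout: weight each feature group by alpha, then project onto
    the top-2 principal directions of the weighted data."""
    w = np.repeat(np.asarray(alpha, float), lengths)   # per-dimension weights
    Xc = X * w
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                               # (N, 2) layout

def deviation(P, Q):
    """Mean positional deviation between two layouts (PCA axes are
    sign-ambiguous, so align signs before comparing)."""
    Q = Q.copy()
    for k in range(2):
        if P[:, k] @ Q[:, k] < 0:
            Q[:, k] = -Q[:, k]
    return np.mean(np.linalg.norm(P - Q, axis=1))

scale = np.linspace(2.0, 0.5, D)       # decaying spectrum keeps top axes stable
wins, trials = 0, 200
for _ in range(trials):
    X = rng.normal(size=(N, D)) * scale
    true_alpha = rng.dirichlet(np.ones(3))             # the user's hidden bias
    est_alpha = np.abs(true_alpha + rng.normal(scale=0.05, size=3))  # stand-in estimate
    P_user = weighted_pca_layout(X, true_alpha, lengths)
    if deviation(P_user, weighted_pca_layout(X, est_alpha, lengths)) < \
       deviation(P_user, weighted_pca_layout(X, np.full(3, 1 / 3), lengths)):
        wins += 1
print(f"estimated weights beat equal weights in {wins}/{trials} trials")
```

The point mirrored by the sketch is that weights close to the user's hidden bias reproduce the user's layout far more often than the equal-weight (default PCA) control.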
Finally, we should note that all weighting schemes (random or not) define sensible
or coherent layouts; the only difference is in the amount by which color, texture, and
structure are emphasized. Therefore even random weights generate pleasing
layouts; that is, random weights do not generate random layouts.
Another control, besides random (or equal) weights, is to compare the deviation
of an α-estimation layout generator to a simple scheme which assigns each new image
to the 2D location of its (unweighted, or equally weighted) 37-dimensional nearest
neighbor (NN) from the set of images previously laid out by the user. This control
scheme essentially operates on the principle that new images should be positioned on
screen at the same location as their nearest neighbors in the original 37-dimensional
feature space (the default similarity measure in the absence of any prior bias) and thus
essentially ignores the operating subspace defined by the user in a 2D layout. The NN
placement scheme would place the test picture directly on top of whichever image
currently on the table it is closest to, regardless of the similarity score. To do otherwise,
for example to place it slightly shifted away, would simply imply the
existence of a nondefault "smart" projection function, which defeats the purpose of this
control. The point of this particular experiment is to compare our "smart" scheme with
one which has no knowledge of preferential subspace weightings, and to see how this would
subsequently map to (relative) position on the display. The idea is that a
dynamic user-centric display should adapt to varying levels of emphasis on color, texture,
and structure.
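A minimal sketch of this NN control scheme, assuming features are plain NumPy arrays (the function and variable names are ours, not the chapter's):

```python
import numpy as np

def nn_placement(new_feats, laid_out_feats, laid_out_xy):
    """Place each new image at the 2D screen position of its nearest
    neighbor in the unweighted 37-D feature space.

    new_feats: (M, 37); laid_out_feats: (N, 37); laid_out_xy: (N, 2).
    """
    # squared Euclidean distances between every new and laid-out image, (M, N)
    d2 = ((new_feats[:, None, :] - laid_out_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)          # index of the closest laid-out image
    return laid_out_xy[nearest]          # new image sits on top of its neighbor

# toy usage: 5 images already on the table, 3 new ones to place
rng = np.random.default_rng(1)
table_feats = rng.normal(size=(5, 37))
table_xy = rng.uniform(-1, 1, size=(5, 2))
placed = nn_placement(rng.normal(size=(3, 37)), table_feats, table_xy)
print(placed.shape)  # (3, 2)
```

Note that, by construction, every placed position coincides exactly with an existing image's position, which is precisely the "no smart projection" behaviour the control is meant to exhibit.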
The distributions of the outcomes of this Monte Carlo simulation are shown in
Figure 14, where we see that the layout deviation using α-estimation (red: μ = 0.9691,
σ = 0.7776) was consistently lower, by almost an order of magnitude, than that of the nearest
neighbor layout approach (blue: μ = 7.5921, σ = 2.6410). We note that despite the
noncoincident overlap of the distributions' tails in Figure 14, in every one of the 1,000
random trials the α-estimation deviation score was found to be smaller than that of
nearest-neighbour (a key fact not visible in such a plot).
Table 1. User preferences for α-estimated versus random weights

           Preference for α-estimates    Preference for random weights
User 1     90%                           10%
User 2     98%                            2%
User 3     98%                            2%
User 4     95%                            5%
User 5     98%                            2%
User 6     98%                            2%
Average    96%                            4%
In fact, the naïve test subjects were told nothing at all about the three feature types
(color, texture, structure), the associated weights, or the α-estimation technique. In this
regard, the paucity of the instructions was entirely intentional: whatever mental grouping
seemed valid to them was the key. In fact, this very same flexible association by the
user is what was specifically tested for in the consistency part of the study.
Table 1 shows the results of this user study. The average preference indicated for
the α-estimation-based layout was found to be 96%, and the average consistency rate of
a user was 97%. We note that the α-estimation method of generating a layout in a style
similar to the reference was consistently favored by the users. A similar experimental
study has shown this to also be true even if the test layouts consist of different images
than those used in the reference layout (i.e., similar but not identical images from the same
categories or classes).
screens, or, for example, on embedded tabletop devices (Shen et al., 2001; Shen et al.,
2003) designed specifically for purposes of storytelling or multiperson collaborative
exploration of large image libraries.
Many interesting questions still remain for our future research in the area of content-based
information visualization and retrieval. The next task is to carry out an extended
user-modeling study by having our system learn the feature weights from various sample
layouts provided by the user. We have already developed a framework to incorporate
visual features with semantic labels for both retrieval and layout.
Another challenging area is automatic summarization and display of large image
collections. Since summarization is implicitly defined by user preference, α-estimation for
user modeling will play a key role in this and other high-level tasks where context is
defined by the user.
Finally, incorporation of relevance feedback for content-based image retrieval
based on the visualization of the optimized PCA Splat seems very intuitive and is
currently being explored. By manually grouping the relevant images together at each
relevance feedback step, a dynamic user-modeling technique will be proposed.
ACKNOWLEDGMENTS
This work was supported in part by Mitsubishi Electric Research Laboratories
(MERL), Cambridge, MA, and National Science Foundation Grant EIA 99-75019.
REFERENCES
Balabanovic, M., Chu, L., & Wolff, G. (2000). Storytelling with digital photographs.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
The Hague, The Netherlands (pp. 564-571).
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373-1396.
Brand, M. (2003). Charting a manifold. Mitsubishi Electric Research Laboratories
(MERL), TR2003-13.
Dietz, P., & Leigh, D. (2001). DiamondTouch: A multi-user touch technology. Proceedings
of the 14th ACM Symposium on User Interface Software and Technology, Orlando, Florida (pp. 219-226).
Horoike, A., & Musha, Y. (2000). Similarity-based image retrieval system with 3D
visualization. Proceedings of IEEE International Conference on Multimedia and
Expo, New York, New York (Vol. 2, pp. 769-772).
Jolliffe, I. T. (1996). Principal component analysis. New York: Springer-Verlag.
Kang, H., & Shneiderman, B. (2000). Visualization methods for personal photo collections: Browsing and searching in the PhotoFinder. Proceedings of IEEE International Conference on Multimedia and Expo, New York, New York.
Metropolis, N. & Ulam, S. (1949). The Monte Carlo method. Journal of the American
Statistical Association, 44(247), 335-341.
Moghaddam, B., Tian, Q., & Huang, T. S. (2001). Spatial visualization for content-based
image retrieval. Proceedings of IEEE International Conference on Multimedia
and Expo, Tokyo, Japan.
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T.S. (2002). PDH: A human-centric
interface for image libraries. Proceedings of IEEE International Conference on
Multimedia and Expo, Lausanne, Switzerland (Vol. 1, pp. 901-904).
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T. S. (2004). Visualization and user-modeling for browsing personal photo libraries. International Journal of Computer Vision, Special Issue on Content-Based Image Retrieval, 56(1-2), 109-130.
Nakazato, M., & Huang, T. S. (2001). 3D MARS: Immersive virtual reality for content-based image retrieval. Proceedings of IEEE International Conference on Multimedia and Expo, Tokyo, Japan.
Popescu, M., & Gader, P. (1998). Image content retrieval from image databases using
feature integration by Choquet integral. Proceedings of SPIE Conference on
Storage and Retrieval for Image and Video Databases VII, San Jose, California.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500), 2323-2326.
Rubner, Y. (1999). Perceptual metrics for image database navigation. Doctoral dissertation, Stanford University.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 21(9), 871-883.
Santini, S., & Jain, R. (2000, July-December). Integrated browsing and querying for image
databases. IEEE Multimedia Magazine, 26-39.
Shen, C., Lesh, N., & Vernier, F. (2003). Personal digital historian: Story sharing around
the table. ACM Interactions, March/April (also MERL TR2003-04).
Shen, C., Lesh, N., Moghaddam, B., Beardsley, P., & Bardsley, R. (2001). Personal digital
historian: User interface design. Proceedings of Extended Abstract of SIGCHI
Conference on Human Factors in Computing Systems, Seattle, Washington (pp.
29-30).
Shen, C., Lesh, N., Vernier, F., Forlines, C., & Frost, J. (2002). Sharing and building
digital group histories. ACM Conference on Computer Supported Cooperative
Work.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based
image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(12), 1349-1380.
Smith, J. R., & Chang, S. F. (1994). Transform features for texture classification and
discrimination in large image database. Proceedings of IEEE International Conference on Image Processing, Austin, TX.
Squire, D. M., Müller, H., & Müller, W. (1999). Improving response time by search pruning
in a content-based image retrieval system using inverted file techniques. Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries
(CBAIVL), Fort Collins, CO.
Stricker, M., & Orengo, M. (1995). Similarity of color images. Proceedings of SPIE
Storage and Retrieval for Image and Video Databases, San Diego, CA.
Swets, D., & Weng, J. (1999). Hierarchical discriminant analysis for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 21(5), 396-401.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290, 2319-2323.
Tian, Q., Moghaddam, B., & Huang, T. S. (2001). Display optimization for image browsing.
The Second International Workshop on Multimedia Databases and Image Communications, Amalfi, Italy (pp. 167-173).
Tian, Q., Moghaddam, B., & Huang, T. S. (2002). Visualization, estimation and user-modeling for interactive browsing of image libraries. International Conference on
Image and Video Retrieval, London (pp. 7-16).
Torgerson, W. S. (1998). Theory and methods of scaling. New York: John Wiley & Sons.
Vernier, F., Lesh, N., & Shen, C. (2002). Visualization techniques for circular tabletop
interface. Proceedings of Advanced Visual Interfaces (AVI), Trento, Italy (pp. 257-266).
Zhou, S. X., Rui, Y., & Huang, T. S. (1999). Water-filling algorithm: A novel way for image
feature extraction based on edge maps. Proceedings of IEEE International
Conference on Image Processing, Kobe, Japan.
Zwillinger, D. (Ed.) (1995). Affine transformations. §4.3.2 in CRC Standard Mathematical Tables and Formulae (pp. 265-266). Boca Raton, FL: CRC Press.
APPENDIX A

Let

$$P_i = p^{(i)}(x, y) = \begin{pmatrix} x_i \\ y_i \end{pmatrix} \quad \text{and} \quad \hat{P}_i = \hat{p}^{(i)}(x, y) = \begin{pmatrix} \hat{x}_i \\ \hat{y}_i \end{pmatrix},$$

and Equation (15) is rewritten as

$$J = \sum_{i=1}^{N} \| P_i - \hat{P}_i \|^2 \qquad \text{(A.1)}$$

Let $X_i$ be the column feature vector of the $i$th image, where

$$X_i = \begin{pmatrix} X_c^{(i)} \\ X_t^{(i)} \\ X_s^{(i)} \end{pmatrix}, \quad i = 1, \ldots, N.$$

$X_c^{(i)}$, $X_t^{(i)}$ and $X_s^{(i)}$ are the corresponding color, texture, and structure feature vectors of the $i$th image, and their lengths are $L_c$, $L_t$, and $L_s$, respectively. Let

$$\tilde{X}_i = \begin{pmatrix} \alpha_c X_c^{(i)} \\ \alpha_t X_t^{(i)} \\ \alpha_s X_s^{(i)} \end{pmatrix}$$

be the weighted feature vector, so that

$$\hat{P}_i = U^T (\tilde{X}_i - X_m), \quad i = 1, \ldots, N \qquad \text{(A.2)}$$

The problem is then one of seeking the optimal feature weights $\alpha = (\alpha_c, \alpha_t, \alpha_s)^T$, projection matrix $U$, and column vector $X_m$ such that $J$ in Equation (A.3) is minimized, given $X_i$, $P_i$, $i = 1, \ldots, N$:

$$J = \sum_{i=1}^{N} \| U^T (\tilde{X}_i - X_m) - P_i \|^2 \qquad \text{(A.3)}$$

The approach is to first compute the projection matrix $U$ and column vector $X_m$, and then estimate the feature weight vector $\alpha$ based on the computed $U$ and $X_m$, and iterate until convergence.

Let $U^{(0)}$ be the matrix of eigenvectors corresponding to the largest two eigenvalues of the covariance matrix of $X$, where $X = [X_1, X_2, \ldots, X_N]$, and let $X_m^{(0)}$ be the mean vector of $X$. We have

$$P_i^{(0)} = U^{(0)T} (X_i - X_m^{(0)}) \qquad \text{(A.4)}$$

$$\hat{P}_i^{(0)} = A\, P_i^{(0)} + T \qquad \text{(A.5)}$$

$$J = \sum_{i=1}^{N} \| A\, U^{(0)T} (X_i - X_m^{(0)}) - (P_i - T) \|^2 \qquad \text{(A.6)}$$

Let us rewrite

$$U^T = \begin{pmatrix} U_{11} & \cdots & U_{1(L_c + L_t + L_s)} \\ U_{21} & \cdots & U_{2(L_c + L_t + L_s)} \end{pmatrix}$$

After some simplification of Equation (A.3), we have

$$J = \sum_{i=1}^{N} \| \alpha_c A_i + \alpha_t B_i + \alpha_s C_i - D_i \|^2 \qquad \text{(A.7)}$$

where

$$A_i = \begin{pmatrix} \sum_{k=1}^{L_c} U_{1k}\, X_c^{(k)}(i) \\ \sum_{k=1}^{L_c} U_{2k}\, X_c^{(k)}(i) \end{pmatrix}, \quad
B_i = \begin{pmatrix} \sum_{k=1}^{L_t} U_{1(k+L_c)}\, X_t^{(k)}(i) \\ \sum_{k=1}^{L_t} U_{2(k+L_c)}\, X_t^{(k)}(i) \end{pmatrix}, \quad
C_i = \begin{pmatrix} \sum_{k=1}^{L_s} U_{1(k+L_c+L_t)}\, X_s^{(k)}(i) \\ \sum_{k=1}^{L_s} U_{2(k+L_c+L_t)}\, X_s^{(k)}(i) \end{pmatrix},$$

and $D_i = U^T X_m + P_i$.

Setting the partial derivatives of $J$ with respect to the weights to zero,

$$\frac{\partial J}{\partial \alpha_c} = 0, \quad \frac{\partial J}{\partial \alpha_t} = 0, \quad \frac{\partial J}{\partial \alpha_s} = 0 \qquad \text{(A.8)}$$

we thus have

$$E\, \alpha = f \qquad \text{(A.9)}$$

where

$$E = \begin{pmatrix}
\sum_{i=1}^{N} A_i^T A_i & \sum_{i=1}^{N} A_i^T B_i & \sum_{i=1}^{N} A_i^T C_i \\
\sum_{i=1}^{N} B_i^T A_i & \sum_{i=1}^{N} B_i^T B_i & \sum_{i=1}^{N} B_i^T C_i \\
\sum_{i=1}^{N} C_i^T A_i & \sum_{i=1}^{N} C_i^T B_i & \sum_{i=1}^{N} C_i^T C_i
\end{pmatrix}, \qquad
f = \begin{pmatrix} \sum_{i=1}^{N} A_i^T D_i \\ \sum_{i=1}^{N} B_i^T D_i \\ \sum_{i=1}^{N} C_i^T D_i \end{pmatrix}$$
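The final 3×3 linear system lends itself to a direct implementation. The following sketch (function and variable names are ours, hypothetical) builds E and f from given per-image 2-vectors A_i, B_i, C_i, D_i and solves for α = (α_c, α_t, α_s) as in Equation (A.9):

```python
import numpy as np

def solve_alpha(A, B, C, D):
    """Solve the 3x3 linear system E @ alpha = f of Equation (A.9).

    A, B, C, D: arrays of shape (N, 2) holding the per-image 2-vectors
    A_i, B_i, C_i, D_i defined after Equation (A.7).
    """
    cols = (A, B, C)
    # E[j, k] = sum_i u_i^T v_i; summing the elementwise product over both
    # axes gives exactly that inner-product sum
    E = np.array([[np.sum(u * v) for v in cols] for u in cols])
    f = np.array([np.sum(u * D) for u in cols])
    return np.linalg.solve(E, f)          # alpha = (alpha_c, alpha_t, alpha_s)

# sanity check: if D_i was built from known weights, we should recover them
rng = np.random.default_rng(2)
N = 100
A, B, C = rng.normal(size=(3, N, 2))
true = np.array([0.5, 0.3, 0.2])
D = true[0] * A + true[1] * B + true[2] * C
alpha = solve_alpha(A, B, C, D)
print(np.round(alpha, 6))  # recovers (0.5, 0.3, 0.2)
```

Since J in (A.7) is quadratic in α, this single solve is exact given U and X_m; in the iterative scheme it would alternate with recomputing U and X_m until convergence.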
Chapter 10
Multimedia Authoring:
Human-Computer Partnership
for Harvesting Metadata
from the Right Sources
Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia
ABSTRACT
This chapter takes a look at the task of creating multimedia authoring tools for the
amateur media creator, and the problems unique to the undertaking. It argues that a
deep understanding of both the media creation process, together with insight into the
precise nature of the relative strengths of computers and users, given the domain of
application, is needed before this gap can be bridged by software technology. These
issues are further demonstrated within the context of a novel media collection
environment, including a real- world example of an occasion filmed in order to
automatically create two movies of distinctly different styles. The authors hope that
such tools will enable amateur videographers to produce technically polished and
aesthetically effective media, regardless of their level of expertise.
INTRODUCTION
Accessibility to the means of authoring multimedia artifacts has expanded to
envelop the majority of desktops and homes of the industrialized world in the last decade.
Forrester Research predicts that by 2005, 92% of online consumers will create personal
multimedia content at least once a month (Casares et al., 2002). To take the crude analogy
of the written word, we all now have the paper and pencils at hand, the means to author
our masterpieces or simply to communicate. Or rather, we would, if only we knew how to
write. Simply adding erasers, coloured pencils, sharpeners, scissors, and other such
items to the writing desk doesn't help us write War and Peace, or a Goosebumps novel,
or even a friendly epistle. Similarly, software tools that allow us to cut, copy, and paste
video do not address the overarching difficulty of forming multimedia artifacts that
effectively (and affectively) achieve the desired communication or aesthetic integrity.
There is a large and diverse research community that has grown up around this problem,
utilizing techniques from a far-flung variety of fields, including human-computer interaction, signal processing, linguistic analysis, computer graphics, video and image databases, information sciences and
knowledge representation, computational media aesthetics, and so forth, and
offering solutions and identifying problems that are similarly
varied. Issues and questions pertinent to the problem include, but are not limited to
The objective of this chapter is to emphasize the importance of clearly defining the
domain and nature of the creative/authoring activities of the user whom we are seeking to
support with technology. The flow-on effects of this decision are vitally important to the
whole authoring endeavour and impact all consequent stages of the process. Simply put,
definition of the domain of application for our technology (the user, audience, means,
mood, and so forth of the authoring situation) enables us to more precisely define the
lack our technology is seeking to supply, and in turn the nature and extent of the metadata
or semantic information necessary to achieve the result, as well as the best way of going
about getting it. We will use our existing media creation framework, aimed at amateur
videographers, to help demonstrate the principles and possible implementations.
The structure of the remainder of the chapter is as follows: We first explore related
work with reference to the traditional three-part media creation process. This alerts us
to the relative density and location of research efforts, notes the importance of a holistic
approach to the media creation process, and helps define the questions we need to
answer when building our own authoring systems. We next examine those questions
in more detail. In particular, we note the importance of defining the domain of the
technology, which has flow-on implications regarding the nature of the gap that our
technology is trying to close and the best way of closing it. Finally, we present an example
system in order to offer further insight into the issues discussed.
BACKGROUND
Computer technology, like any other technology, is applied to a problem in order
to make the solution easier or possible. Questions that we should ask ourselves include:
What is the potential user trying to do? What is the lack that we need to supply? What
materials or information do we need to achieve this?
Let us consider solutions and ongoing research in strictly video-related endeavours,
a subset of all multimedia authoring-related research, for the purpose of introducing some
of the terms and issues pertinent to the more detailed discussion that follows. We will,
at times, stray outside this to a consideration of related domains (e.g., authoring of
computer-generated video) for the purpose of illuminating ideas and implications with
the potential for beneficial cross-pollination. We will not be considering research
aimed at abstracting media as opposed to authoring it (e.g., Informedia Video Skims
(Wactlar et al., 1999) or the MoCA group's Video Abstracting (Pfeiffer & Effelsberg,
1997)), although it could be considered authoring in one sense.
Traditionally, the process of creating a finished video presentation contains something like the following three phases: preproduction, production, and postproduction.
For real video (as opposed to entirely computer-generated media), we could be more
specific and label the phases Scripting/Storyboarding (both aural and visual), Capture,
and Editing; that is, decide what footage to capture and how, capture the footage, and
finally compose by selection, ordering, and fitting together of footage, with all of the major
and minor crosstalk and feedback present in any imperfect process.
In abstract terms, the three phases partition the process into Planning, Execution, and
Polishing, and from this we can see that this partitioning is not the exclusive domain of
multimedia authoring but is indeed appropriate, even necessary, for any creative or
communicative endeavour. Of course, the relative size of each stage, and the amount and
nature of feedback or revisitation for each stage, will differ depending on the
particular kind of multimedia being authored and the environment in which it is being
created, but the phases nevertheless remain a useful partitioning of this process.1 Figure 1 is
a depiction of the media creation workflow with the three phases noted.
One way of classifying work in the area of multimedia authoring technology is to
consider which stage(s) of this authoring process the work particularly targets or
emphasizes. Irrespective of the precise domain of application, where do they perceive the
most problematic lack to be?
Figure 1. Three authoring phases in relation to the professional film creation workflow. [The figure maps the phases Planning, Execution, and Polishing onto the roles Author/Screenwriter, Director/Cameraman, Editor, and Viewer, whose successive products are novel, raw footage, movie, and experience, each shaped by constraints such as language rules, adaptation constraints, screenplay rules, genre conventions, perception, other movies, and the intents of author, screenwriter, and director.]
Edit
We will start from the end, the editing phase, and work back toward the beginning.
Historically, this seems to have received the most attention in the development of
software solutions.
Technologies to aid the low-level operations of movie editing are now so firmly
established as to be the staples of commercial software. Software including iMovie,
Pinnacle Studio, Power Director, Video Wave, and a host of others provides the ability to
cut and paste or drag and drop footage and align video and audio tracks, usually by
means of some flavour of timeline representation for the video under construction.
Additionally, the user is able to create complex transitions between shots and add titles
and credits. The claims about iMovie even run to it having "single-handedly made
cinematographers out of parents, grandparents, students" (iMovie, 2003). Although the
materials have changed somewhat (scissors and film to mouse and bytes), conceptually the film is still treated as an opaque entity, the operations blind to any aesthetic
or content attributes intrinsic to the material under the scalpel. All essentially provide
help with a mechanical process. They are the hammer and drill of the garage. As with
the professional version, what makes for the quality of the final product is the skill of the
hand guiding the scissors.
The next rung up the ladder of solution power sees a more complex approach to
adding "smarts" to the editing phase. It is here that we still find a lot of interest in the
research community.
Girgensohn et al. (2000) describe their semiautomatic video editor, Hitchcock, which
uses an automatic suitability metric based upon how erratic the camera work is deemed to be for
a given piece of footage. They also employ a spring model, based upon the unsuitability metric, in order to provide a more global mechanism for manipulating the total output
video length. There are a number of similar applications, both commercial and research,
that follow this approach of bringing such "smarts" to the editing phase, including SILVER
(Casares et al., 2002), muvee autoProducer, and ACD Video Magic.
Muvee autoProducer is particularly interesting in that it allows the user to specify
a style, with options such as "Chaplinesque," "Fifties TV," and "Cinema." These
presumably translate to allowable clip lengths, image filters, and transitions: an example
of broad genre conventions influencing automatic authoring, conventions which are themselves
the end product of cinesthetic considerations. It provides an extreme example of low user
burden: all that is required of the user is a video file, a music selection, and a choice of style.
Roughly, these approaches all infer something about the suitability or desirability
(or otherwise) of a given piece of film based on a mapping between that property and a low-level feature of the video signal. This knowledge is then used to relieve the user of the
burden of actually having to carry out the mechanical operations of footage cut-and-paste
described above. The foundations of these approaches are as strong or as weak as the link
between the low-level feature and the inferred cinematic property. They offer the equivalent
of a spellchecker for our text. These approaches attempt to automatically generate simple
metadata about video footage, a term which is becoming increasingly prominent.
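As an illustration of such a feature-to-property mapping, here is a hypothetical stand-in (not Hitchcock's actual metric) that scores footage steadiness from raw inter-frame differences; real systems would typically use estimated camera-motion parameters instead:

```python
import numpy as np

def suitability(frames):
    """Score a clip by how steady its camera work appears.

    frames: (T, H, W) grayscale frames. Uses mean absolute inter-frame
    difference as a crude proxy for camera erraticness (an assumption
    made for this sketch, not a published metric).
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    erraticness = diffs.mean()
    return 1.0 / (1.0 + erraticness)      # steadier footage scores higher

# a static clip should outscore a jittery one
rng = np.random.default_rng(3)
still = np.repeat(rng.integers(0, 256, size=(1, 48, 64)), 10, axis=0)
shaky = rng.integers(0, 256, size=(10, 48, 64))
print(suitability(still) > suitability(shaky))  # True
```

The weakness the text points out is visible here: the score knows nothing about what is in the frame, only how much the signal changes between frames.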
Up to this point we can observe that none of these approaches either demand or
make use of information about footage related to its meaning, its semantics. They might
be able to gauge that a number of frames are the result of erratic camera work, but they
are not able to tell us that the object being filmed so poorly is the user's daughter, who
also happened to be the subject of the preceding three shots.
Lindley et al. (2001) present work on interactive and adaptive generation of news
video presentations, using Rhetorical Structure Theory (RST) as an aid to construction.
Video segments, if labeled with a rhetorical functional role, such as elaboration or
motivation, may be combined automatically into a hierarchical structure of RST relations,
and thence into one or more coherent linear or interactive video presentations by
traversing that structure. Here the semantic information relates to rhetorical role and the
emphasis is on the coherency of the presentation.
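The linearization of such an RST structure can be sketched as a tree traversal; the tree shape, role names, and segment ids below are illustrative, not Lindley et al.'s actual schema:

```python
# Hypothetical sketch: an RST tree whose leaves are labeled video segments,
# linearized by a depth-first traversal that places each nucleus before
# its satellites.

def linearize(node):
    """Flatten an RST node into an ordered list of segment ids."""
    if "segment" in node:                 # leaf: a single video segment
        return [node["segment"]]
    order = [node["nucleus"]] + node.get("satellites", [])
    return [seg for child in order for seg in linearize(child)]

presentation = {
    "relation": "elaboration",
    "nucleus": {"segment": "anchor_intro"},
    "satellites": [
        {"relation": "motivation",
         "nucleus": {"segment": "field_report"},
         "satellites": [{"segment": "expert_interview"}]},
        {"segment": "closing_summary"},
    ],
}
print(linearize(presentation))
# ['anchor_intro', 'field_report', 'expert_interview', 'closing_summary']
```

Different traversal policies over the same relation tree would yield different, but still coherent, linear or interactive presentations, which is the point of the approach.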
Lindley et al. (2001, p. 8) note that a "content representation scheme[s] that can
indicate subject matter and bibliographical material such as the sources and originating
dates of the video contents of the database" is necessary supplemental metadata for their
chosen genre of news presentation generation. Additionally, they suggest that narrative or associative/categorical techniques may help provide an algorithmic basis for
sequencing material so as to avoid problems such as continuity within subtopics. This
introduces the potential need for narrative theory or, more broadly, some sort of
discursive theory in addition to the necessary intrinsic (denotative and connotative)
content semantic information.
Of interest is the work of Nack and Parkes (1995) and Nack (1996), who propose a
film editing model and attempt to automatically compile humorous scenes from existing
footage. Their system, AUTEUR, aims to achieve a video sequence that realises an
overall thematic specification. They define the two prime problems of the video editing
process as composing the film such that it is perceptible in its entirety, and in a manner
that engages the viewer emotionally and intellectually. The presentation should be
understandable and enthralling, or at least kind of interesting.
In addition to its own logic of story generation, the editor draws upon a knowledge
base labeled as containing World, Common sense, Codes, Filmic representations, and Individual knowledge. Such knowledge is obviously hard to come by, but it illustrates the type
of metadata that needs to be brought to bear upon the undertaking.
An interesting tangent to this work is the search for systems that resolve the age-old dilemma of the interactive narrative. For example, see Skov and Andersen (2001), Lang
(1999), and Sack and Davis (1994).
It is apparent that these last few approaches, which we casually lump together, revolve
around (a) some theory of presentation, be it narrative, rhetorical, or
whatever is appropriate for the particular domain of application, coupled with (b) some
knowledge about the available or desirable raw material by which inferences relating to
its function in terms of that theory may be made. It would also be fair to say that the
tacit promise is: the more you tell me about the footage, the more I can do for you.
Capture
Capture refers to the process of actually creating the raw media. For a home movie,
it means the time when the camera is being toted, gathering all of those waves and
embarrassed smiles. Unlike the process of recording incident sound, where the creative
input into the capture perhaps runs to adjusting the overall volume or frequency filter,
there is a whole host of parameters that come into play, which the average home
videographer is unaware of (due in part to the skill of the professional filmmaker, no
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
doubt; Walter Murch, a professional film editor, states that the best cut is an invisible
one). Any given shot capture attempt, where shot is defined to be the contiguous video
captured between when the start and stop buttons are pressed for a recording, or its
analog in simulated camera footage, be it computer generated or traditionally animated,
veritably bristles with parameters; light, composition, camera and object motion, camera
mounting, z-axis blocking, duration, focal length, and so forth, all impact greatly on the
final captured footage, its content and aesthetic potential.
So, given this difficulty for the amateur (or even professional) camera operator, what
can we do? To belabour our writing analogy, what do we do when we recognize that the
one holding the pen has little idea about what is required of them? We give them a form.
Or we provide friendly staff to answer their queries or offer suggestions.
There is research which focuses on the difficulties of this stage of the multimedia
authoring process. Bobick and Pinhanez (1995) describe work focusing on "smart
cameras" able to follow simple framing requests originating with the director of a
production within a constrained environment (their example is a cooking show). This
handles the translation of cinematic directives to physical camera parameters, no mean
contribution on its own, but it is reliant on the presence of a knowledgeable director.
There also exists a large literature that addresses the problem of capturing shots in
a virtual environment, all the more applicable in these days of purely computer-generated media. The system described by He et al. (1996) takes a description of the
events taking place in the virtual world and uses simple film idioms, where an idiom might
be an encoded rule for how to capture a two-way conversation, to produce camera
placement specifications in order to capture the action in a cinematically pleasing manner.
See also Tomlinson et al. (2000), who attempt automated cinematography via a camera-creature that maps viewed agent emotions to cinematic techniques that best express
those emotions.
These approaches, however, are limited to highly constrained environments, real
or otherwise, which rules out the entire domain of the home movie.
Barry and Davenport (2003) describe interesting work aimed at transforming the role
of the camera from tool to creative partner. Their approach aims to merge subject sense
knowledge, everyday common sense knowledge stored in the Openmind Commonsense
database, and formal sense knowledge, the sort of knowledge gleaned from practiced
videographers, in order to provide on-the-spot shot suggestions. The aim is to help
during the capture process such that the resulting raw footage has the potential to be
sculpted into an engaging narrative come composition time. If taken, shot suggestions
retain their own metadata about the given shot.
The idea here is to influence the quality of the footage that will be presented to the
editing phase for those tools to take advantage of, because, from the point of view of the
editor, "garbage in, garbage out."
Planning
Well, if that old adage is just as applicable here, why not shift the quality control
even farther upstream? Instead of the just-in-time (JIT) approach, why not plan for a good
harvest of footage right from the beginning? There are some who take this approach.
There are also tools for mocking up visualizations of the presentation in a variety
of manifestations. The storyboard is a popular representation, a series of panels on
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
which sketches depicting important shots or scenes are arranged, and it has been a staple
of professional filmmaking for a long time. Productions today often take advantage of
three-dimensional (3D) modeling techniques to get an initial feel for appropriate shots
and possible difficulties. Baecker et al. (1996) provide an example in their Movie [and
lecture presentation] Authoring and Design (MAD) system. Aimed at a broad range of
users, it uses a variety of metaphors and views, allowing top-down and bottom-up
structuring of ideas, and even the ability to preview the production as it begins to take
shape. They note that an 11-year-old girl was able to create a two-minute film about
herself using a rough template for an autobiography provided by the system designers,
and this without prior experience using the software.
Bailey et al. (2001) present a multimedia storyboarding tool targeted at exploring
numerous behavioral design ideas early in the development of an interactive multimedia
application. One goal of the editor is to help authors determine narration and visual
content length and synchronization in order to achieve a desirable pace to the presentation.
In the field of (semi)automated media production, Kennedy and Mercer (2001) state
that "there is a rich environment for automated reasoning and planning about
cinematographic knowledge" (p. 1), referring to the multitude of possibilities available
to the cinematographer for mapping high-level concepts, such as mood, into decisions
regarding which cinematic techniques to use. They present a semiautomated planning system
that aids animators in presenting intentions via cinematographic techniques. Instead of
limiting themselves to a specific cinematic technique, they operate at the meta level,
focusing on animator intentions for each shot. The knowledge base that they refer to is
part of the conventions of film making (e.g., see Arijon, 1976, or Monaco, 1981), including
lighting, colour choice, framing, and pacing to enhance expressive power. This is an
example of a planning tool that leverages a little knowledge about the content of the
production.
This technology, where applicable, moves the problem of getting decent footage
to the editing phase one step earlier: it is no longer simply impromptu help at capture
time, but involves prior cognition.
clips, which in turn support the editing model. Editing models are constituted by rules
written in a computer language specifically for the production at hand. They leverage
designer-declared, viewer-settable response variables: parameterizations that dictate
the allowable final configurations of the presented video. An example might be a "tell
me the same story faster" button.
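A response variable of this kind can be sketched as a parameter consulted by the assembly rules; the clip data and the pace rule below are invented for illustration, not the actual rule language of such systems:

```python
# Hypothetical viewer-settable "response variable" driving rule-based
# assembly: a pace setting keeps only the most important clips, so the
# same story can be told faster.

clips = [  # (scene, importance 0-1, duration in seconds)
    ("arrival", 0.9, 12), ("arrival", 0.3, 20),
    ("ceremony", 1.0, 45), ("ceremony", 0.5, 30),
    ("party", 0.8, 25), ("party", 0.2, 40),
]

def assemble(pace):
    """pace in (0, 1]: 1.0 keeps everything; smaller values keep only
    clips important enough (the 'tell me the same story faster' button)."""
    return [c for c in clips if c[1] >= 1.0 - pace]

full = assemble(1.0)   # every clip survives
fast = assemble(0.4)   # only clips with importance >= 0.6 survive
```

The point is that the designer declares the variable once; the viewer sets it at watch time, and the rules recompute the cut.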
Davis (2003) frames the major problems of multimedia creation to be solved as
enabling mass customization of media presentations and making media creation more
accessible for the average home user. He calls into question the tripartite media
production process outlined above as being inappropriate to the defined goals. That
issue aside for now, it is interesting to note the specific deficiencies that his Media
Streams system seeks to address. In calling for a new paradigm for media creation, the
needs illuminated include: (1) capture of guaranteed quality reusable assets and rich
metadata by means of an Active Capture model, and (2) the redeploying of (a) domain
knowledge to the representation in software of media content and structure and the
software functions that dictate recombination and adaptation of those artifacts, and (b)
the creative roles to the designers of adaptive media templates, which are built of those
functions in order to achieve a purpose whilst allowing the desired personalization of that
message. He uses two analogies to illustrate the way the structure is fixed
in one sense whilst customisable in another: Lego provides a building block, a fixed
interface, simple parts from which countless wholes may be made; this is analogous
to syntagmatic substitution. Mad Libs, a game involving blind word substitution into
an existing sentence template, is an example of paradigmatic substitution, where the user
shapes the (probably nonsensical, but nevertheless amusing) meaning of the sentence
within the existing syntactical bounds of the sentence. The title, "Editing out editing,"
alludes to the desire to enable the provider or user to push a button and have the media
automatically assembled in a well-formed manner: movies as programs.
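The Mad Libs analogy can be made concrete with a toy template whose slot structure is fixed while the fillers vary; the slot names and fillers here are invented, not Davis's actual template representation:

```python
# Toy "adaptive media template" in the Mad Libs sense: the structure
# (slot order) is fixed; the user's substitutions shape the meaning.

template = ["establishing:{place}", "action:{subject}", "closing:{reaction}"]

def instantiate(template, fillers):
    # Paradigmatic substitution: swap content into fixed syntactic slots.
    return [slot.format(**fillers) for slot in template]

movie = instantiate(template, {"place": "beach",
                               "subject": "kids_swimming",
                               "reaction": "parents_laughing"})
# movie == ["establishing:beach", "action:kids_swimming",
#           "closing:parents_laughing"]
```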
These approaches rely on the presence of strong threads running through the
production process from beginning to end. The precise nature of those threads varies.
It may be a clearly defined production purpose ("this will be an educational presentation,
describing the horrors of X"), or a guaranteed chain of consistent annotation (e.g., content
expressed reliably in terms of the ontology in play: "this family member is present in
this shot"), or assumptions about the context of captured material and so forth, or a
combination of these, but it is this type of long-range coherency and consistency of
assumptions and information transmission that enables a quality media artifact in the
final analysis, by whatever criteria quality is judged in the particular instance.
There is another aspect to the multimedia creation process, which we have
neglected thus far. It has to do with reusing or repurposing media artifacts, and the
corresponding phase might be called "packaging for further use." The term "reuse" is
we will not deal with it here except to say that some of the issues that come into play have
already cropped up in the preceding discussion. For example, those systems that
automatically allow for multiple versions of the same presentations contain a sort of
reuse. In order to do that our system needs to know something about the data and about
the context into which we are seeking to insert it.
ISSUES IN DESIGNING
MULTIMEDIA AUTHORING TOOLS
Defining Your Domain
The preceding, somewhat loosely grouped, treatment of research related to multimedia
authoring serves to highlight an important issue, one so obviously necessary as
to be often neglected: namely, definition of the domain of application of the ideas or
system being propounded (do we assume our audience is well aware of it? Ours is the most
interesting, after all, isn't it?). By domain is meant the complex of assumed context (target
user, physical limitations, target audience(s), etc.) and the rules, conventions or
proprieties in operation (including all levels of genre as commonly understood, and possibly
even intrinsic genre à la Hirsch (1967)), which together constitute the air in which the
solution lives and becomes efficient in achieving its stated goals.
Consider, just briefly, a few of the possibilities: Intended users can be reluctantly
involved hobbyists, amateurs or professionals. Grasp of, and access to, technology may
range from knowing where the record button is on the old low-resolution, mono camera,
to being completely comfortable with multiple handheld devices and networks of remote
computing power. Some things might come easily to the user, while others are a struggle.
The setting might be business or pleasure, the environment solo or collaborative. The
user might have a day to produce the multimedia artifact, or a year. The intended audience
may be self, family and friends, the unknown interested, the unknown uninterested, or
all of the above, each with constraints of his or her own.
That is not to say that each and every aspect must be explicitly enumerated and
instantiated, nor that they should all be specified to a fine point. Genres and abstractions,
catchalls, exist precisely because of their ability to specify ranges or sets that are more
easily handled. In one sense, they are the result of a consideration of many of the above
factors and serve to funnel all of those concerns into a set of manageable conventions.
What is important, though, is that the bounds you are assuming are made explicit.
vacation movie from assembled clips. How much of what and which should go where?
The problem in this case is which sequence makes for an interesting progression, and
the corresponding gap is an understanding of film grammar and narrative principles (or
some other theory of discourse structure). The difference is between finding a citation
and knowing the principles of essay writing.
might sense intuitively where we want to end up (an enjoyable home movie), but we don't
know how to get there. Those very same pieces of knowledge (sons and fathers) do not
tell us anything about how they should be combined when plucked from their semantic
webs and put to the purpose of serving the argument.
But this is the sort of thing we have had success at getting computers to do.
Enumerating possibilities from a given set of rules and constraints is something that we
find easy to express algorithmically.
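As a minimal illustration of that claim, enumerating all orderings of a handful of shots and filtering them by a constraint is a few lines. The shot labels and the single continuity rule below are invented; real film grammar would supply many such constraints:

```python
from itertools import permutations

shots = ["wide", "medium", "closeup", "cutaway"]

def obeys_continuity(seq):
    # Made-up rule: never cut straight from "wide" to "closeup".
    return all(not (a == "wide" and b == "closeup")
               for a, b in zip(seq, seq[1:]))

# Enumerate the 4! = 24 orderings and keep those satisfying the rule.
valid = [seq for seq in permutations(shots) if obeys_continuity(seq)]
```

The computer happily enumerates and filters; deciding which of the surviving orderings is an *interesting* progression is the part it cannot do unaided.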
The particular model that helps us generate our discourse, its level of complexity
and emphasis, will vary. But deciding upon an appropriate model is the necessary first
step. If the target is an entertaining home movie, one choice is some form of narrative
model: it may be simple (resolutions follow climaxes, and so forth) or more involved,
like Dramatica (www.dramatica.com), involving a highly developed view of story as
argument, with its "Story Mind" and four throughlines (e.g., the "impact character"
throughline, representing an alternate approach to that of the main character). Following
Dramatica's cues is meant to help develop stories without holes in the argument.
If the target is verity, something like RST may be more appropriate, given its emphasis
on determining an objective basis for the coherency of a document. The nature of its
relations seems appropriate for a genre like news, which purports to deal in fact, and
we have already seen Lindley et al. (2001) use it for this purpose.
Given that we have a discourse laid out for us in elements, how do we fill them in
with content? Lang (1999) uses a generative grammar to produce terminals that are
first-order predicate calculus schemas about events, states, goals and beliefs. The story
logic has helped us to define what is needed at this abstract level, but how do we
instantiate the terminal with a piece of living, breathing content that matches the
abstract contract but is nuanced by the superior human world model?
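A toy grammar in this spirit, with productions and predicates invented for illustration (not Lang's actual grammar), expands a story nonterminal into first-order-style terminal schemas:

```python
import random

# Toy generative grammar whose terminals are predicate-calculus-style
# schemas about states, events, and goals.
grammar = {
    "STORY":        [["SETUP", "COMPLICATION", "RESOLUTION"]],
    "SETUP":        [["state(character, at(home))"]],
    "COMPLICATION": [["event(character, loses(object))"],
                     ["goal(character, find(object))"]],
    "RESOLUTION":   [["event(character, finds(object))"]],
}

def generate(symbol, rng):
    if symbol not in grammar:          # terminal: a schema string
        return [symbol]
    production = rng.choice(grammar[symbol])
    return [t for s in production for t in generate(s, rng)]

story = generate("STORY", random.Random(0))
# A three-schema story skeleton; the COMPLICATION schema varies by choice.
```

Each emitted schema is the "abstract contract" the text speaks of; instantiating it with actual footage is the step the grammar cannot perform.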
[Figure: division of labor in manifestation. Computer-issued manifestation directives
are instantiated with content by the human; the final (re-)manifestation by the computer
is constrained by the success of instantiation (i.e., the captured footage).]
Where would this leave us in relation to respective roles for human and computer
in the generation of quality media which communicates or entertains and does so well?
An important point is that the degree of freedom of the manifestation parameterization
asked for (for example, "give me a medium shot of your son next to his friend") is not
unlimited: the context is impromptu and the videographer is often purely an observer
who cannot affect the environment being filmed, conditions that attend the home movie
maker (unlike, for example, automated text generation).
EXAMPLE SOLUTION TO
THE RAISED ISSUES
In this section, in order to further concretize the issues raised, we will consider a
specific example of a multimedia authoring system which endeavours to address them.
Our Domain
The domain of this system is the amateur/home user seeking to make media to
be shared with friends and family, with potentially enough inherent interest to be sharable
with a wider audience for pleasure. The planning level is assumed to be anything up to last
minute, with obvious benefits accruing from more prior notice. Some mechanism for adapting
to time constraints, such as "I only have an hour of filming left," is desirable. At present
we assume only a single camera, and ideally a handheld personal computer (PC). The
assumed cinematographic skill of the user is from "point and click" upward. Scope for
personal reuse, and multiple views of the same discourse parameterized by genre, is a
desired goal. A generally impromptu context is assumed, with some ability to manipulate
scene existents possible although not necessary.
The user in this context generally wants to communicate better with their home
movies, and implicitly wants something a bit more like what can be seen on TV or at the
movies. We can visualize what amateur videographers produce on a scale from a largely
unmediated record, easily obtained by simply pointing the camera at anything vaguely
interesting to the user (what they have), to a movie, edited well, expressing a
level of continuity and coherency, with a sense of narrative (implicit or even overt), and
using the particulars of the film medium (camera angle, motion, focal distance, etc.) to
heighten expression of content (what they want).
It is instructive to consider the genres on this scale more closely.
Record: This is what we are calling the most unmediated of all footage which the
home user is likely to collect. It may be a stationary camera capturing anything
happening within its field of view (golf swing, party, etc.). We note that many of
the low-level issues, such as adjustment for lighting conditions and focus, are dealt
with automatically by the hardware. But that is all you get. That means the resulting
footage, left as is, is only effective for a limited range of purposes, mainly as
information, as hinted at by the label "record." In other words, "this is what my golf
swing looks like," where the purpose might be to locate problems. In the case of
the party, the question might be "who was there?"
Moving photo-album: This is where the video camera is being used something like
a still camera but with the added advantage of movement and sound. The user
typically walks around and snaps content of interest. As with still images, he may
more or less intuitively compose the scene for heightened mediation (for example,
close-ups of the faces of kids having fun at the sea), but there is little thought of
clip-to-clip continuity. The unit of coherence is the clip.
Thematic or revelatory narrative: Here is where we first find loose threads of
coherence between clips/shots, and that coherence lies in an overarching theme
or subject. In the case of thematic continuity, an example might be a montage
sequence of the baby. By revelatory narrative, we mean one concentrated on simply
observing the different aspects and nuances of a situation as it stands, unconcerned
with a logical or emotional progression of any sort. There are ties that bind
Traditional narrative home movie: By traditional narrative we mean a story in the
generally conceived sense, displaying movement toward some sort of resolution, where
"there is a sense of problem-solving, of things being worked out in some way, of a kind
of ratiocinative or emotional teleology" (Chatman, 1978, p. 48). The units of semantics
are larger than shots, which are subordinated to the larger structures of scenes and
sequences. Greater demands are placed upon the humble shot, as it must now snap into
scenes. The bristle of shot parameters (framing type, motion, angle) now feeds into
threads, such as continuity, running between shots, which must be observed and which
make shots more difficult to place. Therefore we need more forethought, so that when
we get to the editing stage we have the material we need.
System Overview
We have implemented a complete video production system that attempts to achieve
these goals, constituted by a "storyboard, direct, and edit" life cycle analogous to the
professional film production model. Space permits only a high-level treatment of the
system. The salient components of the framework are depicted in Figure 3 to Figure 5,
which present different views of the media creation process. We will now discuss what
happens at each stage of the workflow depicted in Figure 5. Note that, while similar to
Figure 1, Figure 5 shows the amateur workflow in this case.
[Figures 3-5. Views of the media creation process: the Author (stage) produces a
narrative template from the purpose; the Screenwriter (mediate) and Director/Cameraman
(affect, capture) turn story into shot directives, a storyboard, and raw footage via
the shooting-scripter and aesthetic structuralizers; the Editor (align, redress) uses
the shot-to-footage map, capture record, and film assembler to move from "rough cut"
to finished movie.]
Author: The purpose of the first stage is to create the abstract, media-nonspecific
story for the occasion (wedding, party, anything) that is to be the object of the home
movie. That is to say, the given occasion is separated into parts, selected and ordered,
and thus made to form the content of a narrative. The stage culminates in a narrative
template, which is the deliverable passed to the next stage. Events may have a
relative importance attached to them, which may be used when time constraints
come into play. Templates may be created by the user, either from scratch or
through composition, specialization or generalization of existing templates, built
with the aid of a wizard using a rule-based engine or generative grammar seeded
with user input or else simply selected as is from a library. The user is offered
differing levels of input allowing for his creativity or lack thereof. We have our
discourse.
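The relative importance attached to events supports adaptation to time constraints; a minimal sketch (event names, importance weights, and durations are invented, not the system's actual template format):

```python
# Sketch of a narrative template whose events carry relative importance,
# used to adapt the plan when capture time is short.

events = [  # (name, importance 0-1, estimated capture minutes)
    ("vows", 1.0, 20), ("cake", 0.7, 10),
    ("speeches", 0.6, 25), ("dancing", 0.4, 30),
]

def adapt(events, minutes_left):
    """Greedily keep the most important events that fit the time budget,
    preserving their original (narrative) order."""
    keep, used = set(), 0
    for name, imp, mins in sorted(events, key=lambda e: -e[1]):
        if used + mins <= minutes_left:
            keep.add(name)
            used += mins
    return [e for e in events if e[0] in keep]

plan = adapt(events, minutes_left=35)
# Keeps "vows" (20 min) and "cake" (10 min); "speeches" no longer fits.
```

This is the sense in which "I only have an hour of filming left" can reshape the narrative template rather than merely truncate it.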
Mediate: The purpose of this stage is to apply or specialize the narrative
template obtained in the author stage to our specific media and domain of the home
movie. This stage encapsulates the knowledge required to manifest the abstract
events of the discourse in a concrete surface manifestation, in this case the pixels
and sound waves of video.
Affect: The purpose of this stage is to transform the initial media-specific directives
produced by the mediate stage into directives that maintain correct or well-formed
use of the film medium (such as observing good film conventions like continuity) and
that better utilize the particular expressive properties of film in relation to the
story (such as raising the tempo toward a climax), with reference to the style or genre
chosen by the user; for example, a higher tempo is allowed if the user wants an "action
flick." The end result is a storyboard of shot directives for the user to attempt to
capture. Shot directives are the small circles below the larger scene squares of the
storyboard in Figure 3. They can also be seen as small circles in the rectangular
storyboards of Figure 4. The affect and mediate stages taken together achieve the
transformation from story structure to surface manifestation.
Capture: The purpose of this stage is simply to realize all shot directives with actual
footage. The user attempts to capture shots in the storyboard. He is allowed to do
this in any order, and may attempt a given shot directive any number of times if
unhappy with it. A capture is deemed a success or failure with respect to the shot
directive, consisting of all of its cinematic parameters. For example, a shot directive
might require the user to capture the bride and groom in medium shot, at a different
angle than the previous shot. Some parameters are harder to capture than others,
and the level of difficulty may be thresholded by the user. But this does affect which
metadata are attached to the footage when the user verifies it as a success. For
example, if the user is currently not viewing the camera angle parameter of the shot
directive, it is not marked as having been achieved in the given footage. This simple
Misunderstanding of Directives
The unit of instruction for the user is the shot directive. Each shot directive consists
of a number of cinematic primitives which the user is to attempt to achieve in the shot.
The problem lies in the fact that even these terms are subject to misinterpretation by our
average user, remembering that the user may have little grasp of cinematic concepts.
The system has allowances for differing degrees of user comfort with cinematic
directives, a requirement stemming from the definition of our target user as "average,"
that is, variable in skill. Shot directive parameters may be thresholded in number and
difficulty.
For example, a novice might only want to know what to shoot and whether to use camera
motion, whereas someone more comfortable with the camera might additionally want
directives concerning angle and aspect, and perhaps even the motivation for the given
shot in terms of high-level movie elements, such as tempo.
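Thresholding directive parameters by comfort level might be sketched as follows; the parameter names, difficulty scores, and skill cutoffs are assumptions for illustration, not the system's actual vocabulary:

```python
# Hypothetical shot directive whose cinematic parameters are filtered
# ("thresholded") according to the user's self-declared comfort level.

directive = {            # parameter -> (value, difficulty 1-5)
    "subject": ("bride and groom", 1),
    "motion":  ("static", 2),
    "framing": ("medium shot", 3),
    "angle":   ("high angle", 4),
    "tempo":   ("slow, building", 5),   # high-level motivation
}

def visible_parameters(directive, max_difficulty):
    # A novice (max_difficulty=2) sees only subject and motion; an
    # expert (max_difficulty=5) also sees framing, angle, and tempo.
    return {p: v for p, (v, d) in directive.items() if d <= max_difficulty}

novice = visible_parameters(directive, 2)
expert = visible_parameters(directive, 5)
```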
But this still doesn't solve the problem of a user who thinks he understands a shot
directive parameter, but in actual fact does not.
In the thumbnails of Figure 6, taken from a recent home movie of a holiday in India
built with the authoring system, we see that the same shot directive parameter, framing
type (in this case calling for a close-up in two shots), has been realized once correctly
as a close-up and once as something more like a medium shot. The incorrectly filmed shot
will undoubtedly impact negatively on whatever aesthetic or larger scene orchestration
goals it was intended to serve.
One possible solution to this problem would be to use a face detection algorithm
to cross-check the user's understanding. The algorithm may be improved with reference
to the other cinematic primitives of the shot directive in question (e.g., the subject is at an
oblique angle) as well as footage resulting from similar shot directives. Obviously this
solution is only applicable to shots where the subject is human.
The goal is to detect this inconsistency and alert the user to the possibility that they
have a misunderstanding about the specific parameter request. Of course, good visualization, 3D or otherwise, of what the system is requesting goes a long way in helping the
user understand.
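A minimal sketch of the suggested cross-check, using face size relative to frame height as a rough proxy for framing type; the thresholds are invented, and a real implementation would use an actual face detector rather than a given bounding box:

```python
# Sketch of cross-checking a framing directive against detected face size.
# face_height / frame_height is a common rough proxy for framing type;
# the cutoffs below are illustrative guesses.

def classify_framing(face_height_px, frame_height_px):
    ratio = face_height_px / frame_height_px
    if ratio > 0.4:
        return "close-up"
    if ratio > 0.15:
        return "medium shot"
    return "long shot"

def check_directive(requested, face_height_px, frame_height_px):
    got = classify_framing(face_height_px, frame_height_px)
    # Mismatch -> alert the user to a possible misunderstanding.
    return got == requested, got

ok, got = check_directive("close-up", face_height_px=120, frame_height_px=480)
# ratio 0.25 classifies as "medium shot": flag a likely misunderstanding.
```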
Figure 7. Thumbnails from three shots intended to serve the scene function of Familiar
Image
after the new information of intervening shots. It cues the viewer back into the spatial
layout of the scene, and provides something similar to the reiteration of points covered
so far of an effective lecturing style.
The two shots on the bottom are shot in a way that the intended familiar image
function is achieved; they are similar enough visually. The first shot, however, is not.
It was deemed close enough by the user, with reference to the shot directive, but in terms
of achieving the familiar image function for which the footage is intended, it fails. In actual
fact, the reason for the difference stemmed from an uncontrollable element: a crowd
gathered at the first stall.
Here, given the impromptu context of the amateur videographer, part of the solution
lies in stressing the importance of the different shot directive parameters. There is the
provision in the system for ascribing differing levels of importance to shot parameters
on a shot-by-shot basis. For a shot ultimately supposed to provide the familiar image
function, this would amount to raising the importance of all parameters related to
achieving a visual shot composition similar to another shot (aspect, angle, etc.,
included). This in turn requires that the system prioritize from the discourse level
down: at this point, is the stabilizing function of the familiar image more important,
impacting on the clarity of the presentation, or is precise subject matter more needful?
CONCLUSION
We have considered the problem of creating effective multimedia authoring tools.
This need has been created by the increased ease with which we generate raw media
artifacts (images, sound, video, text), a situation resulting from the growing power
of multimedia enabling hardware and software, coupled with an increasing desire by
would-be authors to create and share their masterpieces.
In surveying existing approaches to the problem, we considered the emphases of
each in relation to the traditional process of media creation: planning, execution, and
polishing. We finished with a treatment of some examples of research with a particular
eye to the whole process.
We then identified some of the key issues to be addressed in developing multimedia
authoring tools. They include definition of the target domain, recognition of the nature
of the gap our technology is trying to bridge, and the importance of considering both the
deeper structures relating to content and how it is sequenced and the surface manifestations in media to which they give rise. Additionally, we highlighted the issue of
deciding upon the scope and nature of metadata, and the question of how it is
instantiated.
We then presented an implementation of a multimedia authoring system for building
home movies, in order to demonstrate the issues raised. In simple terms, we found it
effective to algorithmically construct the underlying discourse, humanly fill in metadata,
and algorithmically shape the raw media to the underlying discourse and given genre
by means of that same metadata.
REFERENCES
Adams, B., Dorai, C., & Venkatesh, S. (2002). Towards automatic extraction of expressive
elements from motion pictures: Tempo. IEEE Transactions on Multimedia, 4(4),
472-481.
Agamanolis, S., & Bove, V., Jr. (2003). Viper: A framework for responsive television.
IEEE Multimedia, 10(1), 88-98.
Arijon, D. (1976). Grammar of the film language. Silman-James Press.
Baecker, R., Rosenthal, A., Friedlander, N., Smith, E., & Cohen, A. (1996). A multimedia
system for authoring motion pictures. ACM Multimedia, 31-42.
Bailey, B., Konstan, J., & Carlis, J. (2001). DEMAIS: Designing multimedia applications
with interactive storyboards. In the Ninth ACM International Conference on
Multimedia (pp. 241-250).
Barry, B., & Davenport, G. (2003). Documenting life: Videography and common sense.
In the 2003 International Conference on Multimedia and Expo, Baltimore, MD.
Beal, J. (1974). Cine craft. London: Focal Press.
Bobick, A., & Pinhanez, C. (1995). Using approximate models as source of contextual
information for vision processing. In Proceedings of the ICCV95 Workshop on
Context-Based Vision (pp. 13-21).
Casares, J., Myers, B., Long, A., Bhatnagar, R., Stevens, S., Dabbish, L., et al. (2002).
Simplifying video editing using metadata. In Proceedings of Designing Interactive
Systems (DIS 2002) (pp. 157-166).
Chatman, S. (1978). Story and discourse: Narrative structure in fiction and film. Ithaca,
NY: Cornell University Press.
Davis, M. (2003). Editing out editing. IEEE Multimedia (Special Edition on Computational
Media Aesthetics), 54-64.
Girgensohn, A., Boreczky, J., Chiu, P., Doherty, J., Foote, J., Golovchinsky, G., et al.
(2000). A semi-automatic approach to home video editing. In Proceedings of the
13th Annual ACM Symposium on User Interface Software and Technology (pp. 81-89).
He, L.-W., Cohen, M., & Salesin, D. (1996). The virtual cinematographer: A paradigm for
automatic real-time camera control and directing. Computer Graphics, 30(Annual
Conference Series), 217-224.
Hirsch Jr., E. (1967). Validity in interpretation. New Haven, CT: Yale University Press.
iMovie. (2003). The new iMovie: Video and audio snap into place [Brochure].
Kennedy, K., & Mercer, R. E. (2001). Using cinematography knowledge to communicate
animator intentions. In Proceedings of the First International Symposium on
Smart Graphics, Hawthorne, New York (pp. 47-52).
Lang, R. (1999). A declarative model for simple narratives. In AAAI Fall Symposium on
Narrative Intelligence (pp. 134-141).
Lindley, C. A., Davis, J., Nack, F., & Rutledge, L. (2001). The application of rhetorical
structure theory to interactive news program generation from digital archives. CWI
Technical Report INS-R0101.
Mann, B. (1999). An introduction to rhetorical structure theory (RST). Retrieved from
http://www.sil.org/linguistics/rst/rintro99.htm
Monaco, J. (1981). How to read a film: The art, technology, language, history and theory
of film and media. Oxford, UK: Oxford University Press.
Nack, F. (1996). AUTEUR: The application of video semantics and theme representation
for automated film editing. Doctoral dissertation, Lancaster University, UK.
Nack, F., & Parkes, A. (1995). AUTEUR: The creation of humorous scenes using automated
video editing. In IJCAI-95 Workshop on AI Entertainment and AI/Alife.
Pfeiffer, S., Lienhart, R., Fischer, S., & Effelsberg, W. (1997). Video abstracting.
Communications of the ACM, 40(12), 54-63.
Sack, W., & Davis, M. (1994). IDIC: Assembling video sequences from story plans and
content annotations. Proceedings of IEEE International Conference on Multimedia Computing and Systems, (pp. 30-36).
Schultz, E. & Schultz, D. (1972). How to make exciting home movies and stop boring your
friends and relatives. London: Robert Hale.
Skov, M., & Andersen, P. (2001). Designing interactive narratives. In COSIGN 2001 (pp.
59-66).
Tomlinson, B., Blumberg, B., & Nain, D. (2000). Expressive autonomous cinematography
for interactive virtual environments. In Proceedings of the Fourth International
Conference on Autonomous Agents (AGENTS 2000) (pp. 317-324), Barcelona, Spain.
Wactlar, H., Christel, M., Gong, Y., & Hauptmann, A. (1999). Lessons learned from
building a terabyte digital video library. IEEE Computer Magazine, 32, 66-73.
ENDNOTES
1
In the literary world, surveys turn up a remarkable variety of authoring styles and
an interesting analogy for us, in that they, nevertheless, generally still evince these
three distinct creative phases.
2
The reader might note the parallels in this view to the text planning stage and
surface realisation stage of a natural language generator.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Chapter 11
MM4U:
A Framework for
Creating Personalized
Multimedia Content
Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany
ABSTRACT
In the Internet age and with the advent of digital multimedia information, we succumb to the possibilities that the enchanting multimedia information seems to offer, but end up almost drowning in it: too much information at the same time, information that is not suitable for the current situation of the user, too much time needed to find information that is really helpful. The multimedia material is there, but the question of how the multimedia content is found, selected, assembled, and delivered such that it is most suitable for the user's interest and background, the user's preferred device, network connection, location, and many other settings is far from being solved. In this chapter, we focus on the aspect of how to assemble and deliver personalized multimedia content to the users. We present the requirements and solutions of multimedia content modeling and multimedia content authoring as we find them today. Looking at the specific demands of creating personalized multimedia content, we come to the conclusion that a dynamic authoring process is needed in which the individual multimedia content is created just in time for a specific user or user group. We designed and implemented an extensible software framework, MM4U (short for MultiMedia for you), which provides generic functionality for typical tasks of a
INTRODUCTION
presentation formats. We identify the different tasks that arise in the context of creating
personalized multimedia content. The different components of the framework support
these different tasks for creating user-centric multimedia content: They integrate the
generic access to user profiles, media data, and associated meta data, provide support
for personalized multimedia composition and layout, as well as create the context-aware
multimedia presentations. With such a framework, the development of multimedia
applications becomes easier and much more efficient for different users with their
different (semantic) contexts. On the basis of the MM4U framework, we are currently
developing two sample applications: a personalized multimedia sightseeing tour and a
personalized multimedia sports news ticker. The experiences we gain from the development of these applications give us important feedback on the evaluation and continuous
redesign of the framework.
The remainder of this chapter is organized as follows: To review the notion of
multimedia content authoring, in Multimedia Content Authoring Today we present the
requirements of multimedia content modeling and the authoring support we find today.
Setting off from this, Dynamic Authoring of Personalized Content introduces the reader
to the tasks of creating personalized multimedia content and why such content can be
created only in a dynamic fashion. In Related Approaches, we address the related
approaches we find in the field before we present the design of our MM4U framework
in The Multimedia Personalization Framework section. As the personalized creation of
multimedia content is a central aspect of the framework, Creating Personalized Multimedia Content presents in detail the multimedia personalization features of the framework.
Impact of Personalization on the Development of Multimedia Applications shows how
the framework supports application developers and multimedia authors in their effort to
create personalized multimedia content. The implementation and first prototypes are
presented in Implementation and Prototypical Applications before we come to our
summary and conclusion in the final section.
MULTIMEDIA CONTENT
AUTHORING TODAY
In this section, we introduce the reader to current notions and techniques of
multimedia content modeling and multimedia content authoring. An understanding of
requirements and approaches in modeling and authoring of multimedia content is a
helpful prerequisite to our goal, the dynamic creation of multimedia content. For the
modeling of multimedia content we present our notion of multimedia content, documents,
and presentation and describe the central characteristics of typical multimedia document
models in the first subsection. For the creation of multimedia content, we give a short
overview of directions in multimedia content authoring today in the second subsection.
Multimedia Content
Multimedia content today is seen as the result of a composition of different media
elements (media content) in a continuous and interactive multimedia presentation.
Multimedia content builds on the modeling and representation of the different media
elements that form the building bricks of the composition. A multimedia document
represents the composition of continuous and discrete media elements into a logically
coherent multimedia unit. A multimedia document that is composed in advance of its rendering is called preorchestrated, in contrast to compositions that take place just before
rendering that are called live or on-the-fly. A multimedia document is an instantiation of
a multimedia document model that provides the primitives to capture all aspects of a
multimedia document. The power of the multimedia document model determines the
degree of the multimedia functionality that documents following the model can provide.
Representatives of (abstract) multimedia document models in research can be found with
CMIF (Bulterman et al., 1991), Madeus (Jourdan et al., 1998), Amsterdam Hypermedia
Model (Hardman, 1998; Hardman et al., 1994a), and ZYX (Boll & Klas, 2001). A multimedia
document format or multimedia presentation format determines the representation of a
multimedia document for the document's exchange and rendering. Since every multimedia presentation format implicitly or explicitly follows a multimedia document model, it can also be seen as a proper means to serialize the multimedia document's representation for the purpose of exchange. Multimedia presentation formats can either be
standardized, such as the W3C standard SMIL (Ayars et al., 2001), or proprietary such
as the widespread Shockwave file format (SWF) of Macromedia (Macromedia, 2004). A
multimedia presentation is the rendering of a multimedia document. It comprises the
continuous rendering of the document in the target environment, the (pre)loading of
media data, realizing the temporal course, the temporal synchronization between continuous media streams, the adaptation to different or changing presentation conditions and
the interaction with the user.
Looking at the different models and formats we find, and also the terminology in the
related work, there is not necessarily a clear distinction between multimedia document
models and multimedia presentation formats, and also between multimedia documents
and multimedia presentations. In this chapter, we distinguish the notion of multimedia
document models as the definition of the abstract composition capabilities of the model;
a multimedia document is an instance of this model. The term multimedia content or
content representation is used to abstract from existing formats and models, and
generally addresses the composition of different media elements into a coherent multimedia presentation. Independent of the actual document model or format chosen for the
content, one can say that a multimedia content representation has to realize at least three
central aspects: the temporal, spatial, and interactive characteristics of a multimedia
presentation (Boll et al., 2000). However, as many of today's concrete multimedia
presentation formats can be seen as representing both a document model and an
exchange format for the final rendering of the document, we use these as an illustration
of the central aspects of multimedia documents. We present an overview of these
characteristics in the following listing; for a more detailed discussion on the characteristics of multimedia document models we refer the reader to (Boll et al., 2000; Boll & Klas,
2001).
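The three central aspects can be made concrete with a toy document model. The following sketch is purely illustrative: the class names are hypothetical and do not correspond to any of the cited models; it shows only how temporal composition operators (sequential and parallel presentation) determine a document's overall duration.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class MediaElement:
    """A discrete or continuous media element (image, text, audio, video)."""
    uri: str
    duration: float  # seconds; discrete media get an assigned display time

@dataclass
class Sequential:
    """Temporal composition: children are presented one after another."""
    children: List["Node"] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return sum(c.duration for c in self.children)

@dataclass
class Parallel:
    """Temporal composition: children are presented at the same time."""
    children: List["Node"] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return max((c.duration for c in self.children), default=0.0)

Node = Union[MediaElement, Sequential, Parallel]

# A slideshow with background music: two 5-second images in sequence,
# presented in parallel with an 8-second audio clip.
slideshow = Sequential([MediaElement("a.jpg", 5.0), MediaElement("b.jpg", 5.0)])
document = Parallel([slideshow, MediaElement("music.mp3", 8.0)])
```

A presentation format such as SMIL expresses exactly these sequential/parallel compositions declaratively with its seq and par elements.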
1993; Allen, 1983), enhanced interval-based temporal models that can handle time
intervals of unknown duration (Duda & Keramane, 1995; Hirzalla et al., 1995; Wahl
& Rothermel, 1994), event-based temporal models, and script-based realization of
temporal relations. The multimedia presentation formats we find today realize
different temporal models, for example, SMIL 1.0 (Bugaj et al., 1998) provides an
interval-based temporal model only, while SMIL 2.0 (Ayars et al., 2001) also
supports an event-based model.
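As a hedged illustration of an interval-based temporal model, the following sketch classifies the relation between two presentation intervals into Allen's (1983) thirteen relations; the function name and the interval encoding as (start, end) pairs are our own.

```python
def allen_relation(a, b):
    """Classify interval a against interval b into one of Allen's (1983)
    thirteen relations. Intervals are (start, end) pairs with start < end."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if (a1, a2) == (b1, b2):
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # the remaining cases are the two partial overlaps
    return "overlaps" if a1 < b1 else "overlapped-by"
```

A temporal layouter could, for example, use `allen_relation(video, caption) == "during"` to check that a caption is shown only while its video is playing.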
For a multimedia document not only the temporal synchronization of these
elements is of interest but also their spatial positioning on the presentation media,
for example, a window, and possibly the spatial relationship to other visual media
elements. The positioning of a visual media element in the multimedia presentation
can be expressed by the use of a spatial model. With it one can, for example, place one image above a caption or define the overlapping of two visual media. Besides the arrangement of media elements, the presentation also defines the visual layout or design. This can range from a simple setting for background colors and fonts up to complex visual designs and effects. In general,
three approaches to spatial models can be distinguished: absolute positioning,
directional relations (Papadias et al., 1995; Papadias & Sellis, 1994), and topological relations (Egenhofer & Franzosa, 1991). With absolute positioning we subsume both the placement of a media element at an absolute position with respect
to the origin of the coordinate system and the placement at an absolute position
relative to another media element. The absolute positioning of media elements can
be found, for example, with Flash (Macromedia, 2004) and the Basic Language
Profile of SMIL 2.0, whereas the relative positioning is realized, for example, by
SMIL 2.0 and SVG 1.2 (Andersson et al., 2004b).
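For the topological relations mentioned above, a simplified sketch (after Egenhofer &amp; Franzosa, 1991, but reduced to axis-aligned rectangles and a coarse set of relations; the function name and rectangle encoding are our own) could look as follows:

```python
def topological_relation(r, s):
    """Coarse topological relation between two axis-aligned rectangles,
    each encoded as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    Simplified after Egenhofer & Franzosa (1991)."""
    if r == s:
        return "equal"
    # bounding boxes strictly apart in x or y: no common point
    if r[2] < s[0] or s[2] < r[0] or r[3] < s[1] or s[3] < r[1]:
        return "disjoint"
    # boundaries touch but interiors do not intersect
    if r[2] == s[0] or s[2] == r[0] or r[3] == s[1] or s[3] == r[1]:
        return "meet"
    if r[0] <= s[0] and r[1] <= s[1] and s[2] <= r[2] and s[3] <= r[3]:
        return "contains"
    if s[0] <= r[0] and s[1] <= r[1] and r[2] <= s[2] and r[3] <= s[3]:
        return "inside"
    return "overlap"
```

A layout engine could use such a predicate, for example, to verify that a caption region does not overlap the video region it annotates.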
A very distinct feature of a multimedia document model is the ability to specify user
interaction in order to let a user choose between different presentation paths.
Multimedia documents without user interaction are not very interesting as the
course of their presentation is exactly known in advance and, hence, could be
recorded as a movie. With interaction models a user can, for example, select or
repeat parts of presentations, speed up a movie presentation, or change the visual
appearance. For the modeling of user interaction, one can identify at least three
basic types of interaction: navigational interactions, design interactions, and
movie interactions. Navigational interaction allows the selection of one out of
many presentation paths and is supported by all the considered multimedia
document models and presentation formats.
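Navigational interaction can be sketched as a branching point in the document tree; the class and method names below are illustrative and not taken from any of the cited document models.

```python
class Interaction:
    """Navigational interaction point: the user selects one of several
    labeled alternative presentation paths (illustrative sketch only)."""

    def __init__(self, **alternatives):
        self.alternatives = alternatives

    def choices(self):
        """Labels the presentation can offer the user, e.g., as a menu."""
        return sorted(self.alternatives)

    def choose(self, label):
        """Return the presentation path the user selected."""
        return self.alternatives[label]

# A menu that branches a city tour into three alternative paths.
menu = Interaction(churches="churches_path",
                   museums="museums_path",
                   palaces="palaces_path")
```

During rendering, the player would present `menu.choices()` to the user and continue with the subtree returned by `choose`.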
needed multimedia content authoring. We will look at the approaches we find in the
field of multimedia content authoring in the next section.
Multimedia Authoring
While multimedia content represents the composition of different media elements
into a coherent multimedia presentation, multimedia content authoring is the process
in which this presentation is actually created. This process involves parties from
different fields including media designers, computer scientists, and domain experts:
Experts from the domain provide their knowledge in the field; this knowledge forms the
input for the creation of a storyboard for the intended presentation. Such a storyboard often forms the basis on which creators and directors plan the implementation of the story
with the respective media and with which writers, photographers, and camerapersons
acquire the digital media content. Media designers edit and process the content for the
targeted presentation. Finally, multimedia authors compose the preprocessed and
prepared material into the final multimedia presentation. Even though we described this
as a sequence of steps, the authoring process typically includes cycles. In addition, the
expertise for some of the different tasks in the process can also be held by one single
person. In this chapter, we are focusing on the part of the multimedia content creation
process in which the prepared material is actually assembled into the final multimedia
presentation.
This part is typically supported by professional multimedia development programs,
so-called authoring tools or authoring software. Such tools allow the composition of
media elements into an interactive multimedia presentation via a graphical user interface.
The authoring tools we find here range from domain expert tools to general purpose
authoring tools.
Domain expert tools hide as much as possible the technical details of content
authoring from the authors and let them concentrate on the actual creation of the
multimedia content. The tools we find here are typically very specialized and
targeted at a very specific domain. An example for such a tool has been developed
in the context of our previous research project Cardio-OP (Klas et al., 1999) in the
domain of cardiac surgery. The content created in this project is an interactive
multimedia book about topics in the specialized domain of cardiac surgery. Within
the project context, an easy-to-use authoring wizard was developed to allow
medical doctors to easily create pages of a multimedia book in cardiac surgery.
The Cardio-OP-Wizard guides the domain experts through the authoring process
by a digital storyboard for a multimedia book on cardiac surgery. The wizard hides
as much technical detail as possible.
On the other end of the spectrum of authoring tools we find highly generalized tools
such as Macromedia Director (Macromedia, 2004). These tools are independent of
the domain of the intended presentation and let the authors create very sophisticated multimedia presentations. However, the authors typically need to have high
expertise in using the tool. Very often programming in an integrated programming
language is needed to achieve special effects or interaction patterns. Consequently, the multimedia authors need programming skills and along with this some
experience in software development and software engineering.
DYNAMIC AUTHORING
OF PERSONALIZED CONTENT
The authoring process described so far represents a manual authoring of multimedia content, often with high effort and cost involved. Typically, the result is a multimedia presentation targeted at a certain user group in a special technical context. However, the one-size-fits-all fashion of the multimedia content created does not necessarily satisfy different users' needs. Different users may have different preferences concerning the content and may also access the content in different networks and on different end devices. For a wider applicability, the authored multimedia content needs to carry some
alternatives that can be exploited to adapt the presentation to the specific preferences
of the users and their technical settings. Figure 1 shows an illustration of the variation
possibilities that a simple personalized city guide application can possess. The root of
the tree represents the multimedia presentation for the personalized city tour. If this
presentation was intended for both Desktop PC and PDA, this results in two variants of
the presentation. If then some tourists are interested only in churches, museums, or
palaces and would like to receive the content in either English or German, this already
sums up to 12 variants. If then the multimedia content should be available in different
presentation formats, the number of variation possibilities within a personalized city tour
increases again. Even though different variants are not necessarily entirely different and
may have overlapping content, the example is intended to illustrate that the flexibility of
multimedia content to personalize to different user contexts quickly leads to an explosion
of different options. And still the content can only be personalized within the flexibility
range that has been anchored in the content.
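The combinatorial explosion of the city guide example can be verified in a few lines; the dimensions are taken from the text, while the concrete list of three presentation formats is illustrative.

```python
from itertools import product

# Variation dimensions of the example city guide (from the text);
# the concrete format list is an illustrative assumption.
devices = ["Desktop PC", "PDA"]
interests = ["churches", "museums", "palaces"]
languages = ["English", "German"]
formats = ["SMIL", "SVG", "HTML"]

variants = list(product(devices, interests, languages))
print(len(variants))  # 2 * 3 * 2 = 12 variants

with_formats = list(product(devices, interests, languages, formats))
print(len(with_formats))  # one more dimension multiplies again: 36
```

Every additional variation dimension multiplies, rather than adds to, the number of variants, which is why pre-authoring all of them by hand does not scale.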
From our point of view, an efficient and competitive creation of personalized
multimedia content can only come from a system approach that supports the dynamic
authoring of personalized multimedia content. A dynamic creation of such content allows
for a selection and composition of just those media elements that are targeted at the users
specific interest and preferences. Generally, the dynamic authoring comprises the steps
and tasks that occur also with static authoring, but with the difference that the creation
process is postponed to the time when the targeted user context and the presentation
is created for this specific context. To be able to efficiently create presentations for
(m)any given contexts, a manual authoring of a presentation meeting the user needs is
not an option; instead, a dynamic content creation is needed.
As we look into the process of dynamic authoring of personalized multimedia
content, it is apparent that this process involves different phases and tasks. We identify
the central tasks in this process that need to be supported by a suitable solution for
personalized content creation.
Figure 1. Variation possibilities of a personalized city tour: two end devices (desktop PC, Pocket PC), three interests (churches, museums, palaces), and two languages (English, German) yield 12 variants; offering the content additionally in different presentation formats (SMIL, SMIL 2.0 BLP, SVG, Mobile SVG, HTML) multiplies the number of variants further, for example, to 36.
Figure 2. Dynamic creation of personalized multimedia content: a "personalization engine" selects media elements using the user profile, the technical environment, and associated meta data; assembles a context-dependent composition of the multimedia content in an internal format, guided by the document structure, layout and style, and rules and constraints; transforms the internal format to a concrete presentation format such as SMIL, SVG, or Flash; and finally presents the result.
in an internal document model (Scherp & Boll, 2004b). This internal document model
abstracts from the different characteristics of today's multimedia presentation formats and, hence, forms the greatest common denominator of these formats. Even though our abstract model does not reflect the fancy features of some of today's multimedia
presentation formats, it supports the very central multimedia features of modeling time,
space, and interaction. It is designed to be efficiently transformed to the concrete syntax
of the different presentation formats. For the assembly, the personalization engine uses
the parameters for document structure, the layout and style parameters, and other rules
and constraints that describe the structure of the personalized multimedia presentation,
to determine among others the temporal course and spatial layout of the presentation.
The center of Figure 2 sketches this temporal and spatial arrangement of selected media
elements over time in a spatial layout following the document structure and other
preferences. Only then, in the transformation phase, is the multimedia content in the internal document model transformed to a concrete presentation format. Finally, the just
generated personalized multimedia presentation is rendered and displayed by the actual
end device.
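The select-assemble-transform chain just described can be sketched end to end. This is a hypothetical illustration, not MM4U's actual API: the function names, the in-memory media list, and the media file names (including the Horst-Janssen-Museum image taken from the figure) are invented for the example, and the "transformation" emits only a skeletal SMIL-like string.

```python
# Toy media store with meta data; contents are invented for illustration.
MEDIA = [
    {"uri": "st_lamberti.jpg", "topic": "churches", "lang": "en"},
    {"uri": "horst_janssen_museum.jpg", "topic": "museums", "lang": "en"},
    {"uri": "schloss.jpg", "topic": "palaces", "lang": "de"},
]

def select(profile):
    """Select media elements matching the user's interests and language."""
    return [m for m in MEDIA
            if m["topic"] in profile["interests"] and m["lang"] == profile["lang"]]

def assemble(elements):
    """Compose the selected media in an internal, format-neutral structure."""
    return {"type": "sequential", "children": elements}

def transform(document, fmt):
    """Serialize the internal document to a concrete presentation format."""
    if fmt == "SMIL":
        items = "".join(f'<img src="{c["uri"]}"/>' for c in document["children"])
        return f"<smil><body><seq>{items}</seq></body></smil>"
    raise NotImplementedError(fmt)

profile = {"interests": {"churches", "museums"}, "lang": "en"}
presentation = transform(assemble(select(profile)), "SMIL")
```

The final present step would hand `presentation` to a player on the end device; keeping the internal structure format-neutral is what allows `transform` to target SMIL, SVG, or another format from the same composition.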
RELATED APPROACHES
In this section we present the related approaches in the field of personalized
multimedia content creation. We first discuss the creation of personalizable multimedia
content with today's authoring environments before we come to research approaches
that address a dynamic composition of adapted or personalized multimedia content.
Multimedia authoring tools like Macromedia Director (Macromedia, 2004) today
require high expertise from their users and create multimedia presentations that are
targeted only at a specific user or user group. Everything personalizable needs to be
programmed or scripted within the tool's programming language. Early work in the field
of creating advanced hypermedia and multimedia documents can be found, for example,
with the Amsterdam Hypermedia Model (Hardman, 1998; Hardman et al., 1994b) and the
authoring system CMIFed (van Rossum, 1993; Hardman et al., 1994a) as well as with the
ZYX (Boll & Klas, 2001) multimedia document model and a domain-specific authoring
wizard (Klas et al., 1999). In the field of standardized models, the declarative description
of multimedia documents with SMIL allows for the specification of adaptive multimedia
presentations by defining presentation alternatives using the switch element. A manual authoring of such documents that are adaptable to many different contexts is too complex; also, the existing authoring tools such as the GRiNS editor for SMIL from Oratrix (Oratrix, 2004) are still tedious to handle. Some SMIL tools provide support for the
switch element to define presentation alternatives; a comfortable interface for editing
the different alternatives for many different contexts, however, is not provided. Consequently, we have been working on the approach in which a multimedia document is
authored for one general context and is then automatically enriched by the different
presentation alternatives needed for the expected user contexts in which the document
is to be viewed (Boll et al., 1999). However, this approach is reasonable only for a limited
number of presentation alternatives and limited presentation complexity in general.
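A minimal SMIL fragment illustrates the switch mechanism discussed above; the media file names are invented for the example. The player evaluates the test attributes (here `systemLanguage`) in document order and renders the first alternative that matches, falling through to the last, unconditioned child otherwise:

```xml
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <body>
    <switch>
      <!-- first matching alternative wins -->
      <audio src="tour_en.mp3" systemLanguage="en"/>
      <audio src="tour_de.mp3" systemLanguage="de"/>
      <!-- fallback when no test attribute matches -->
      <audio src="tour_en.mp3"/>
    </switch>
  </body>
</smil>
```

Authoring one such switch is easy; authoring them consistently for every combination of device, language, and interest is precisely the variant explosion described above.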
Approaches that dynamically create personalized content are typically found on
the Web, for example, Amazon.com (Amazon, 1996-2004) or MyYahoo (Yahoo!, 2002).
However, these systems remain text-centric and are not occupied with the complex
composition of media data in time and space into real multimedia presentations. On the
pathway to an automatic generation of personalized multimedia presentations, we
primarily find research approaches that address personalized media presentations only:
For example, the home-video editor Hyper-Hitchcock (Girgensohn et al., 2003; Girgensohn
et al., 2001) provides a preprocessing of a video such that users can interactively select
clips to create their personal video summary. Other approaches create summaries of
music or video (Kopf et al., 2004; Agnihotri et al., 2003). However, the systems provide
an intelligent and intuitive access to large sets of (continuous) media rather than a
dynamic creation of individualized content. An approach that addresses personalization
for videos can be found, for example, with IBM's Video Semantic Summarization System (IBM Corporation, 2004a), which, however, still concentrates on one single media type.
Towards personalized multimedia we find interesting work in the area of adaptive
hypermedia systems which has been going on for quite some years now (Brusilovsky
1996; Wu et al., 2001; De Bra et al., 1999a, 2000, 2002b; De Carolis et al., 1998, 1999). The
adaptive hypermedia system AHA! (De Bra et al., 1999b, 2002a, 2003) is a prominent
example here which also addresses the authoring aspect (Stash & De Bra, 2003), for
example, in adaptive educational hypermedia applications (Stash et al., 2004). However,
though these and further approaches integrate media elements in their adaptive
hypermedia presentations, synchronized multimedia presentations are not in their focus.
Personalized or adaptive user interfaces allow the navigation and access of
information and services in a customized or personalized fashion. For example, work done
in the area of personalized agents and avatars considers presentation generation
exploiting natural language generation and visual media elements to animate the agents
and avatars (de Rosis et al., 1999). These approaches address the human computer
interface; the general issue of dynamically creating arbitrary personalized multimedia
content that meets the users information needs is not in their research focus.
A very early approach towards the dynamic creation of multimedia content is the
Coordinated Multimedia Explanation Testbed (COMET), which is based on an expert system and different knowledge databases and uses constraints and plans to actually
generate the multimedia presentations (Elhadad et al., 1991; McKeown et al., 1993).
Another interesting approach to automate the multimedia authoring process has been
developed at the DFKI in Germany by the two knowledge-based systems, WIP (Knowledge-based Presentation of Information) and PPP (Personalized Plan-based Presenter).
WIP is a knowledge-based presentation system that automatically generates instructions for the maintenance of technical devices by plan generation and constraint solving.
PPP enhances this system by providing a lifelike character to present the multimedia
content and by considering the temporal order in which a user processes a presentation
(André, 1996; André & Rist, 1995, 1996). Another very interesting research approach towards the dynamic generation of multimedia presentations is the Cuypers system (van
Ossenbruggen et al., 2000) developed at the CWI. This system employs constraints for
the description of the intended multimedia presentation and logic programming for the
generation of a multimedia document (CWI, 2004). The multimedia document group at
INRIA in France developed within the Opéra project a generic architecture for the
automated construction of multimedia presentations based on transformation sheets and
constraints (Villard, 2001). This work is continued within the succeeding project Web,
Accessibility, and Multimedia (WAM) with the focus on a negotiation and adaptation
architecture for multimedia services for mobile devices (Lemlouma & Layaïda, 2003,
2004).
However, we find limitations with existing systems when it comes to their expressiveness and flexible personalized content creation support. Many approaches for
personalization are targeted at a specific application domain in which they provide a very
specific content personalization task. The existing research solutions typically use a
declarative description like rules, constraints, style sheets, configuration files, and the
like to express the dynamic, personalized multimedia content creation. However, they can
solve only those presentation generation problems that can be covered by such a
declarative approach; whenever a complex and application-specific personalization
generation task is required, the systems find their limit and need additional programming
to solve the problem. Additionally, the approaches we find usually rely on fixed data
models for describing user profiles, structural presentation constraints, technical infrastructure, rhetorical structure, and so forth, and use these data models as an input to their
personalization engine. The latter evaluates the input data, retrieves the most suitable
content, and tries to most intelligently compose the media into a coherent aesthetic
multimedia presentation. A change of the input data models as well as an adaptation of
the presentation generator to more complex presentation generation tasks is difficult if
not unfeasible. Additionally, for these approaches the border between the declarative
descriptions for describing content personalization constraints and the additional
programming needed is not clear and differs from solution to solution. This leads us to
the development of a software framework that supports the development of personalized
multimedia applications.
MULTIMEDIA
PERSONALIZATION FRAMEWORK
Most of the research approaches presented above apply to text-centered information only, are limited with regard to personalizability, or are targeted at very specific
application domains. As mentioned above, we find that existing research solutions in the
field of multimedia content personalization provide interesting solutions. They typically
use a declarative description like style sheets, transformation rules, presentation
constraints, configuration files, and the like to express the dynamic, personalized
multimedia content creation. However, they can solve only those presentation generation problems that can be covered by such a declarative approach; whenever a complex
and application-specific personalization generation task is required, the systems find
their limit and need additional programming to solve the problem. To provide application
developers with a general, domain independent support for the creation of personalized
multimedia content we pursue a software engineering approach: the MM4U framework.
With this framework, we propose a component-based object-oriented software framework that relieves application developers from general tasks in the context of multimedia
content personalization and lets them concentrate on the application domain-specific
tasks. It supports the dynamic generation of arbitrary personalized multimedia presentations and therewith provides substantial support for the development of personalized
multimedia applications. The framework does not reinvent multimedia content creation
but incorporates existing research in the field and also can be extended by domain- and application-specific solutions. In the following subsection, we identify the general design goals of this framework, based on an extensive study of related work and our own experiences.
In the next subsection, we present the general design of the MM4U framework, and then
we present a detailed insight into the frameworks layered architecture in the last
subsection.
Objects (Hunter, 1999), Resource Description Framework (Beckett & McBride, 2003), and
the MPEG-7 Multimedia content description standard (ISO/IEC JTC 1/SC 29/WG 11, 1999,
2001a-e). For multimedia composition we analyzed the features of multimedia document
models, including SMIL (Ayars et al., 2001), SVG (Andersson et al., 2004b), Macromedia
Flash (Macromedia, 2004), Madeus (Jourdan et al., 1998), and ZYX (Boll & Klas, 2001).
For the presentation of multimedia content, respective multimedia presentation frameworks were considered, including the Java Media Framework (Sun Microsystems, 2004),
MET++ (Ackermann 1996), and PREMO (Duke et al., 1999). Furthermore, other existing
systems and general approaches for creating personalized multimedia content were considered, including the Cuypers engine (van Ossenbruggen et al., 2000) and the
Standard Reference Model for Intelligent Multimedia Presentation Systems (Bordegoni
et al., 1997).
We also derived design requirements for the framework from the first prototypes of personalized multimedia applications we developed in different fields, such as a personalized sightseeing tour through Vienna (Boll, 2003), a personalized mobile paper chase game (Boll et al., 2003), and a personalized multimedia music newsletter.
From the extensive study of related work and the first experiences and requirements we gained from our prototypical applications, we developed the individual layers of the framework. We also derived three general design goals for MM4U. These general design goals have a crucial impact on the structure of the multimedia personalization framework, which we present in the following section.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Modules supporting the personalization task can be plugged in to provide the required functionality.
As depicted in Figure 3, the MM4U framework provides four types of such hot
spots, where different types of modules can be plugged in. Each hot spot represents a
particular task of the personalization process. The hot spots can be realized by plugging in a module that implements the hot spot's functionality for a concrete personalized multimedia application. These modules can be either application-dependent or application-independent. For example, the access to media data and associated meta data is not necessarily application-dependent, whereas the composition of personalized multimedia content can depend heavily on the concrete application.
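The hot-spot mechanism can be illustrated with a small sketch. The interfaces and names below are ours for illustration only, not the actual MM4U API: each hot spot is an interface behind which a concrete, possibly application-specific module is registered.

```python
from abc import ABC, abstractmethod

# Hypothetical hot-spot interfaces: one per task of the personalization process.
class MediaAccessModule(ABC):
    @abstractmethod
    def find_media(self, keywords, profile):
        """Select media elements matching keywords and the user profile."""

class CompositionModule(ABC):
    @abstractmethod
    def compose(self, media_elements, profile):
        """Assemble the selected media into an internal document tree."""

class Framework:
    """Minimal stand-in for a component framework with pluggable hot spots."""
    def __init__(self):
        self._modules = {}

    def plug_in(self, hot_spot, module):
        self._modules[hot_spot] = module  # register a module for this hot spot

    def module(self, hot_spot):
        return self._modules[hot_spot]

# An application-independent media access module can be reused across domains,
# while the composition module is typically tailored to the application.
class KeywordMediaAccess(MediaAccessModule):
    def __init__(self, index):
        self.index = index  # keyword -> list of media URIs

    def find_media(self, keywords, profile):
        return [uri for kw in keywords for uri in self.index.get(kw, [])]

class SlideshowComposition(CompositionModule):
    def compose(self, media_elements, profile):
        return ("Sequential", media_elements)

framework = Framework()
framework.plug_in("media_access", KeywordMediaAccess({"castle": ["castle.jpg"]}))
framework.plug_in("composition", SlideshowComposition())

media = framework.module("media_access").find_media(["castle"], profile={})
tree = framework.module("composition").compose(media, profile={})
```

The application code only talks to the hot-spot interfaces, so a different module can be plugged in without touching the rest of the application.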
After the general design of the framework, we take a closer look at the concrete
architecture of MM4U and its components in the next section.
Figure 4. The layered architecture of the MM4U framework: User Profile Connectors (e.g., URI and CC/PP profile storage) and Media Data Connectors (e.g., URI and IR media system), the Accessors, the Multimedia Composition layer (e.g., Sequential, Parallel, Slideshow, and Citytour operators), the Presentation Format Generators (e.g., SMIL 2.0, SMIL 2.0 BLP, SVG 1.2, and Mobile SVG), and the Multimedia Presentation layer
Connectors: The User Profile Connectors and the Media Data Connectors bring the
user profile data and media data into the framework. They integrate existing
systems for user profile stores, media storage, and retrieval solutions. As there are
many different systems and formats available for user profile information, the User
Profile Connectors abstract from the actual access to and retrieval of user profile
information and provide a unified interface to the profile information. With this
component, the different formats and structures of user profile models can be made
accessible via a unified interface. For example, a flexible URIProfileConnector we
developed for our demonstrator applications gains access to user profiles over the
Internet. These user profiles are described as hierarchically ordered key-value pairs. This is a quite simple model, but it is already powerful enough to allow effective pattern-matching queries on the user profiles (Chen & Kotz, 2000). However, as shown in Figure 4, a User Profile Connector for access to, for example, a Composite Capability/Preference Profile (CC/PP) server could also be plugged into the framework.
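Such a connector can be sketched as follows; the class names and profile structure are illustrative assumptions, not the actual MM4U classes:

```python
class UserProfileConnector:
    """Unified interface; concrete connectors map their source format to it."""
    def get(self, path):
        raise NotImplementedError

class DictProfileConnector(UserProfileConnector):
    """Backs the unified interface with hierarchically ordered key-value pairs,
    e.g. parsed from a profile file retrieved over the Internet."""
    def __init__(self, profile):
        self.profile = profile

    def get(self, path):
        node = self.profile
        for key in path.split("/"):  # hierarchical keys, e.g. "interests/culture"
            node = node[key]
        return node

    def matches(self, path, predicate):
        """Simple pattern matching on a single profile value."""
        try:
            return predicate(self.get(path))
        except KeyError:
            return False

connector = DictProfileConnector({
    "device": {"display": {"width": 176, "height": 208}},
    "interests": {"culture": 0.9, "sports": 0.1},
})
culture_interest = connector.get("interests/culture")                    # 0.9
small_display = connector.matches("device/display/width", lambda w: w < 320)
```

A CC/PP-backed connector would implement the same `get`/`matches` interface while translating the paths to CC/PP attribute lookups.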
On the same level, the Media Data Connectors abstract from the access to media
elements in different media storage and retrieval solutions that are available today
with a unified interface. The different systems for storage and content-based
retrieval of media data are interfaced by this component. For example, the URIMediaConnector we developed for our demonstrator applications provides flexible access to media objects and their associated meta data from the Internet via the HTTP or FTP protocols. The meta data is stored in a single index file that not only describes the technical characteristics of the media elements and the locations where they can be found on the Internet, but also comprises additional information about them, for example, a short description of what is shown in a picture or keywords for which one can search. By analogy with the access to user profile information, another Media Data Connector plugged into the framework could provide access to other media and meta data sources, for example, an image retrieval (IR) system like IBM's QBIC (IBM Corporation, 2004b).
The Media Data Connector supports queries for media elements by the client application (client-pull) as well as the automatic notification of the personalized application when a new media object arrives in the media database (server-push). The latter is required, for example, by the personalized multimedia sports news ticker (see the section about Sports4U), which is based on a multimedia event space (Boll & Westermann, 2003).
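Client-pull and server-push can be sketched with a connector that both answers queries against a meta data index and notifies registered listeners about newly arriving media objects (all names are illustrative):

```python
class MediaDataConnector:
    def __init__(self):
        self.index = {}       # media URI -> meta data (keywords, description, ...)
        self.listeners = []   # applications to notify about new media (server-push)

    # --- client-pull: the application queries for media elements ---
    def query(self, keyword):
        return [uri for uri, meta in self.index.items()
                if keyword in meta.get("keywords", [])]

    # --- server-push: the connector notifies on newly arriving media ---
    def subscribe(self, listener):
        self.listeners.append(listener)

    def add_media(self, uri, meta):
        self.index[uri] = meta
        for notify in self.listeners:
            notify(uri, meta)

received = []
connector = MediaDataConnector()
connector.subscribe(lambda uri, meta: received.append(uri))  # e.g. a news ticker
connector.add_media("goal.mpg", {"keywords": ["soccer", "goal"]})

pulled = connector.query("soccer")   # client-pull
# 'received' now also holds "goal.mpg" via the server-push notification
```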
Accessors: The User Profile Accessor and the Media Pool Accessor provide the
internal data model of the user profiles and media data information within the
system. Via this layer the user profile information and media data needed for the
desired content personalization are accessible and processable for the application.
The Connectors and Accessors are designed such that they do not reinvent existing systems for user modeling or multimedia content management. Rather, they provide a seamless integration of these systems through distinct interfaces and comprehensive data models. In addition, when a personalized multimedia application uses more than one user profile database or media database, the Accessor layer encapsulates the resources so that access to them is transparent to the client application.
While the following layers (3) to (5) each constitute single components within the MM4U framework, the Accessor layer and the Connectors layer do not. Instead, the left and right sides of layers (1) and (2), i.e., the User Profile Accessor and User Profile Connectors as well as the Media Pool Accessor and Media Data Connectors, each form one component in MM4U.
Presentation Format Generators: This component currently provides generators for SMIL 2.0, the Basic Language Profile (BLP) of SMIL 2.0 for mobile devices (Ayars et al., 2001), SVG 1.2, Mobile SVG 1.2 (Andersson et al., 2004a), comprising SVG Tiny for multimedia-ready mobile phones and SVG Basic for pocket computers like Personal Digital Assistants (PDAs) and Handheld Computers (HHCs), and HTML (Raggett et al., 1998). We are currently working on Presentation Format Generators for Macromedia Flash (Macromedia, 2004) and other multimedia document model formats, including HTML+TIME, the 3GPP SMIL Language Profile (3rd Generation Partnership Project, 2003b), which is a subset of SMIL used for scene description within the Multimedia Messaging Service (MMS) interchange format (3rd Generation Partnership Project, 2003a), and XMT-Omega, a high-level abstraction of MPEG-4 based on SMIL (Kim et al., 2000).
Multimedia Presentation: The Multimedia Presentation component on top of the framework realizes the interface for applications to actually play presentations in the different multimedia presentation formats. The goal here is to integrate existing presentation components for the common multimedia presentation formats like SMIL, SVG, or HTML+TIME, which the underlying Presentation Format Generators produce. The developers thus benefit from the fact that only players for standardized multimedia formats need to be installed on the user's end device and that they need not spend any time and resources on developing their own rendering and display engine for their personalized multimedia application.
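The step from the internal document model to an output format can be sketched as a small generator that serializes a composition tree into simplified, SMIL-like body markup. Real SMIL documents require a proper header and layout section, so this only illustrates the traversal; the tuple-based tree encoding is our own assumption:

```python
def to_smil_body(node):
    """Serialize an internal composition tree into simplified SMIL body markup.
    A node is either a media URI (str) or a pair (operator, [children])."""
    if isinstance(node, str):
        tag = "audio" if node.endswith(".mp3") else "img"
        return f'<{tag} src="{node}"/>'
    operator, children = node
    tag = {"Sequential": "seq", "Parallel": "par"}[operator]
    inner = "".join(to_smil_body(child) for child in children)
    return f"<{tag}>{inner}</{tag}>"

slideshow = ("Parallel", [
    "music.mp3",
    ("Sequential", ["slide1.jpg", "slide2.jpg"]),
])
body = to_smil_body(slideshow)
# <par><audio src="music.mp3"/><seq><img src="slide1.jpg"/><img src="slide2.jpg"/></seq></par>
```

A generator for another target format (e.g., SVG with animation elements) would walk the same tree but emit that format's syntax, which is exactly the separation the Presentation Format Generators layer provides.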
The layered architecture of MM4U permits easy adaptation to the particular requirements that can occur in the development of personalized multimedia applications. Special user profile connectors as well as media database connectors can be embedded into the Connectors layer of the MM4U framework to integrate the most diverse and individual solutions for the storage, retrieval, and gathering of user profile information and media data. With the ability to extend the Multimedia Composition layer with complex and sophisticated composition operators, arbitrary personalization functionality can be added to the framework. The Presentation Format Generators component allows any output format to be integrated into the framework to support the most diverse multimedia players available for the different end devices.
The personalized selection and composition of media elements and operators into
a coherent multimedia presentation is the central task of the multimedia content creation
process which we present in more detail in the following section.
CREATING PERSONALIZED
MULTIMEDIA CONTENT
The MM4U framework provides the general functionality for the dynamic composition of media elements and composition operators into a coherent personalized
multimedia presentation. Having presented the framework layers in the previous section,
we now look in more detail at how the layers contribute to the different tasks in the general personalization process as shown in Figure 2. The Media Data Accessor layer provides the personalized selection of media elements by their associated meta data and is described in the next subsection. The Multimedia Composition layer supports the composition of media elements in time and space in the internal multimedia representation format in three different manners, which are presented in detail in the next three subsections. The final subsection describes the last step, the transformation of the multimedia content from the internal document model into an output format that is actually delivered to and rendered by the client devices; this is supported by the Presentation Format Generators layer.
The spatial layout expresses the arrangement and style of the visual media elements in
the multimedia presentation. Finally, the interaction model determines the user interaction of the multimedia presentation, in order to let the user choose between different paths of a presentation. For the temporal model, we selected an interval-based
approach as found in Duda & Keramane (1995). The spatial layout is realized by a
hierarchical model for media positioning (Boll & Klas, 2001). For interaction with the user, navigational and decision interactions are supported, as can be found in SMIL (Ayars et al., 2001) and MHEG-5 (Echiffre et al., 1998; International Organisation for Standardization, 1996).
A basic composition operator or basic operator can be regarded as an atomic unit
for multimedia composition, which cannot be further broken down. Basic operators are
quite simple but applicable to any application area and therefore most flexible. Basic temporal operators realize the temporal model, and basic interaction operators realize the interaction possibilities of the multimedia presentation, as specified above. The two basic temporal operators Sequential and Parallel, for example, can be used to present media elements one after the other in a sequence or in parallel at the same time, respectively. With basic temporal operators and media elements, the temporal course of a presentation such as a slideshow can be determined, as depicted in Figure 5. The operators are represented by white rectangles and the media elements by gray ones. The relation between the media elements and the basic operators is shown by the edges, each beginning with a filled circle at an operator and ending with a filled rhombus (diamond) at a media element or another operator. The semantics of the slideshow shown
in Figure 5 are that it starts with the presentation of the root element, which is the Parallel
operator. The semantics of the Parallel operator are that it shows the operators and media
elements that are attached to it at the same time. This means that the audio file starts to
play while simultaneously the Sequential operator is presented. The semantics of the
Sequential operator are to show the attached media elements one after another, so while
the audio file is played in the background, the four slides are presented in sequence.
Besides the basic composition operators, the so-called projectors are part of the
Multimedia Composition layer. Projectors can be attached to operators and media
elements to define, for example, the visual and acoustical layout of the multimedia
presentation. Figure 6 shows the slideshow example from above with projectors attached.
The spatial position as well as the width and height of the single slide media elements
are determined by the corresponding SpatialProjectors. The volume, treble, bass, and
balance of the audio medium are determined by the attached AcousticProjector.
Figure 5. Slideshow as an example of assembled multimedia content
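The slideshow of Figure 5, together with the projectors of Figure 6, can be sketched as a small tree of operators, media elements, and projectors; the class names and attributes are illustrative, not the actual MM4U implementation:

```python
class Media:
    def __init__(self, uri, projector=None):
        self.uri, self.projector = uri, projector

class Operator:
    def __init__(self, name, children, projector=None):
        self.name, self.children, self.projector = name, children, projector

def render_order(node):
    """Flatten the temporal course by a depth-first walk: children of Parallel
    start together, children of Sequential follow one after another."""
    if isinstance(node, Media):
        return [node.uri]
    order = []
    for child in node.children:
        order.extend(render_order(child))
    return order

# SpatialProjectors position the slides; an AcousticProjector sets the audio.
slides = [Media(f"slide{i}.jpg", projector={"x": 10, "y": 10, "w": 320, "h": 240})
          for i in range(1, 5)]
audio = Media("background.mp3", projector={"volume": 0.8, "balance": 0.0})

# Root as in Figure 5: Parallel(audio, Sequential(slide1..slide4))
slideshow = Operator("Parallel", [audio, Operator("Sequential", slides)])
course = render_order(slideshow)
```

Walking the tree yields the audio element alongside the four slides in sequence, matching the semantics described above.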
operator can be regarded as filling in fixed composition templates with suitable media elements. Personalization can then only take place in selecting those media elements that best fit the user profile information. For the dynamic creation of personalized multimedia content, even more sophisticated composition functionality is needed that allows the composition operators to change the structure of the generated multimedia content at runtime. To realize such sophisticated composition functionality, additional composition logic needs to be included in the composition operators, which can no longer be expressed even by the advanced document models mentioned above.
The multimedia document tree generated by the CityMap operator is shown in the
bottom part of Figure 9. Its root element constitutes the Parallel operator. Attached to
it are the image of the city map and a set of InteractiveLink operators. Each InteractiveLink
represents a spot on the city map, instantiated by the spot image. The user can click on
the spots to receive multimedia presentations with further information about the sights.
The positions of the spot images on the city map are determined by the SpatialProjectors.
The personalized multimedia presentations about the sights are represented by the
sophisticated operators Target 1 to Target N.
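The runtime behavior of such a sophisticated operator can be sketched as a function that generates the document tree of Figure 9 from the user profile; the data structures and names are hypothetical, not the actual CityMap code:

```python
def city_map_operator(map_image, sights, profile):
    """Generate a Parallel root with the city map and one InteractiveLink per
    sight that matches the user's interests; the tree structure thus varies
    with the profile at runtime."""
    links = []
    for sight in sights:
        if sight["category"] in profile["interests"]:
            links.append(("InteractiveLink", {
                "spot": "spot.png",                   # clickable spot image
                "position": sight["position"],        # set via a SpatialProjector
                "target": ("Target", sight["name"]),  # further sophisticated operator
            }))
    return ("Parallel", [("Image", map_image)] + links)

sights = [
    {"name": "Palace", "category": "culture", "position": (40, 60)},
    {"name": "Stadium", "category": "sports", "position": (120, 30)},
]
tree = city_map_operator("oldenburg.png", sights, {"interests": {"culture"}})
# For this user only the culture sight appears as an InteractiveLink.
```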
The CityMap operator is one example of extending the personalization functionality
of the MM4U framework by a sophisticated application-specific multimedia composition
operator, here in the area of (mobile) tourism applications. This operator was developed by programming the required dynamic multimedia composition functionality. However, the realization of the internal composition logic of sophisticated operators is independent of the technology and programming language used. The same composition logic could also be realized using a different technology, for example, a constraint-based approach. Though the actual realization of the personalized multimedia composition functionality would differ, the multimedia document tree generated by such a rule-based sophisticated operator would be the same as depicted in Figure 9.
spatial model is actually performed and how the temporal model and interaction possibilities of the internal document model are transformed into the characteristics and syntax
of the concrete presentation formats is intentionally omitted in this book chapter due to
its focus on the composition and assembly of the personalized multimedia content and
is described in Scherp and Boll (2005).
IMPACT OF PERSONALIZATION
ON THE DEVELOPMENT OF
MULTIMEDIA APPLICATIONS
The multimedia personalization framework MM4U presented so far provides support for developing sophisticated personalized multimedia applications. The parties involved in the development of such applications typically form a heterogeneous team of developers from different fields, including media designers, computer scientists, and domain experts. In this section, we describe what challenges personalization brings to the development of personalized multimedia applications and how and where the MM4U framework can support the developer team in accomplishing their job.
In the next subsection, the general software engineering issues with regard to personalization are discussed. We describe how personalization affects the individual members of the heterogeneous developer team and how the MM4U framework supports the development of personalized multimedia applications. The challenges that arise when domain experts create personalized multimedia content using an authoring tool are presented in the following subsection. We also introduce how the MM4U framework can be used to develop a domain-specific authoring tool in the field of e-learning content, which aims to hide the technical details of content authoring from the authors and lets them concentrate on the actual creation of the personalized multimedia content.
a new domain. Rossi et al. (2001) claim that personalization should be considered directly
from the beginning when a project is conceived. Therefore, the first activity when
developing a personalized multimedia application is to determine the personalization
requirements, that is, which aspects of personalization should be supported by the actual
application. For example, in the case of an e-learning application, the personalization aspects concern the automatic adaptation to the different learning styles of the students and their prior knowledge about the topic. In addition, different degrees of difficulty
should be supported by a personalized e-learning application. In the case of a personalized mobile tourism application, however, the user's location and his or her surroundings would be of interest for personalization instead. These personalization aspects must be kept in mind during every activity throughout the whole development process. The decision regarding which personalization aspects are to be supported has to be incorporated into the analysis and design of the personalized application and will hopefully entail a flexible and extensible software design. However, this increases the overall complexity of the application to be developed and automatically leads to higher development effort, including a longer development duration and higher costs. Therefore, a good requirements analysis is crucial when developing personalized applications, lest one dissipate one's energies on a bad software design with respect to the personalization aspects.
When transferring the requirements for developing personalized software to the specific requirements of personalized multimedia applications, one can say that personalization affects all members of the developer team, the domain expert, the media designers, and the computer scientists, and places higher demands on them.
The domain expert normally contributes to the development of multimedia applications by providing input for drawing storyboards of the specific application's domain. These storyboards are normally drawn by media designers and are the most important means of communicating the later application's functionality within the developer team. When personalization comes into play, it is difficult to draw such storyboards because of the many possible alternatives and different paths in the application that personalization implies. Consequently, the storyboards change with regard to, for example, the individual user profiles and the end devices that are used. When drawing storyboards
for a personalized multimedia application, those points in the storyboard where personalization is required have to be identified and visualized. Storyboards
have to be drawn for every typical personalization scenario concerning the concrete
application. This drawing task should be supported by interactive graphical tools to
create personalized storyboards and to identify reusable parts and modules of the
content.
It is the task of the media designer in the development of multimedia applications
to plan, acquire, and create media elements. With personalization, media designers have
to think additionally about the usage of media elements for personalization purposes,
that is, the media elements have to be created and prepared for different contexts. When
acquiring media elements, the media designers must consider for which user context the
media elements are created and what aspects of personalization are to be supported, for
example, different styles, colours, and spatial dimensions. Possibly, a set of quite similar media assets has to be developed that differ only in certain aspects. For example, an image or video has to be transformed for different end device resolutions, colour depth,
process. Rather, the domain experts would like to control the assembly of the content because they are responsible for the content conveyed. The smart authoring tool guides
the domain experts through the composition process and supports them in creating
presentations that still provide flexibility for the targeted user context. In the e-learning context, we can expect domain experts such as lecturers who want to create a new e-learning unit but do not want to be bothered with the technical details of (multimedia) authoring.
We use the MM4U framework to build the multimedia composition and personalization functionality of this smart authoring tool. For this, the Multimedia Composition
component supports the creation and processing of arbitrary document structures and
templates. The authoring tool exploits this functionality for composition to achieve a
document structure that is suitable just for that content domain and the targeted
audience. The Media Data Accessor supports the authoring tool in those parts in which
it lets the author choose from only those media elements that are suitable for the intended
user contexts and that can be adapted to the user's infrastructure. Using the Presentation
Format Generators, the authoring tool finally generates the presentations for the different
end devices of the targeted users. Thus the authoring process is guided and specialized
with regard to selecting and composing personalized multimedia content. For the
development of this authoring tool, the framework fulfils the same function in the process
of creating personalized multimedia content in a multimedia application as described in
the previous section on the framework. However, the creation of personalized content
is not achieved at once but step by step during the authoring process.
IMPLEMENTATION AND
PROTOTYPICAL APPLICATIONS
The framework, its components, classes, and interfaces are specified using the Unified Modeling Language (UML) and have been implemented in Java. The development of the framework follows an iterative software development process with stepwise refinement and enhancement of the framework's components. The redesign phases are triggered by the actual experience of implementing the framework, but also by employing the framework in several application scenarios. In addition, we are planning to provide a beta version of the MM4U framework to other developers for testing the framework and developing their own personalized multimedia applications with MM4U.
Currently, we are implementing several application scenarios to prove the applicability of MM4U in different application domains. These prototypes are the first stress test for the framework. At the same time, the development of the sample applications gives us important feedback about the comprehensiveness and applicability of the framework. In the following sections, two of our prototypes that are based on the MM4U framework are introduced: in the Sightseeing4U subsection, a prototype of a personalized city guide is presented, and in the Sports4U subsection, a prototype of a personalized multimedia sports news ticker is described.
Figure 10. Screenshots of the city guide application for a user interested in culture
(presentation generated in SMIL 2.0 and SMIL 2.0 BLP format, respectively)
Figure 11. Screenshots of the Sightseeing4U prototype for a user searching for a good
restaurant (output generated in SVG 1.2 and Mobile SVG format, respectively)
tation are automatically selected to best fit the end device's characteristics. For example,
a user sitting at a desktop PC receives a high-quality video about the palace of Oldenburg
as depicted in Figure 10a, while a mobile user gets a smaller video of lower quality in Figure
10b. In the same way, the user searching for a good restaurant in Oldenburg receives
either a high-quality video when using a Tablet PC as depicted in Figure 11a, or a smaller
one that meets the limitations of the mobile device as shown in Figure 11b. If there is no
video of a particular sight available at all, the personalized tourist guide automatically
selects images instead and generates a slideshow for the user.
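The media selection behind Figures 10 and 11 can be sketched as a simple rule: pick the video variant that best matches the device's capabilities, and fall back to an image slideshow when no suitable video exists (the data and names are illustrative):

```python
def select_media(available, device):
    """Choose the best-fitting media variant for a sight's presentation."""
    videos = [m for m in available if m["type"] == "video"
              and m["width"] <= device["max_width"]]
    if videos:
        # highest resolution the device can still display
        return max(videos, key=lambda m: m["width"])
    images = [m for m in available if m["type"] == "image"]
    return {"type": "slideshow", "items": images}  # fallback: generated slideshow

palace = [
    {"type": "video", "uri": "palace_hq.mpg", "width": 640},
    {"type": "video", "uri": "palace_lq.3gp", "width": 176},
    {"type": "image", "uri": "palace1.jpg"},
]
desktop_choice = select_media(palace, {"max_width": 800})  # high-quality video
mobile_choice = select_media(palace, {"max_width": 176})   # smaller video
```

If the list contains no videos at all, the function returns a slideshow of the available images, mirroring the fallback behavior described above.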
sports news to a coherent presentation. It takes into account possible constraints, such as a running-time limit, and particular characteristics of the end device, such as the limited display size of a mobile device. The result is a sports news presentation that can, for example, be viewed with a SMIL player over the Web, as shown in Figure 12. With a suitable Media Data Connector, the Medither is connected to the MM4U framework. This connector not only allows querying for media elements like the URIMediaConnector, but also provides notification of incoming multimedia events to the actual personalized application. Depending on the user context, the Sports4U prototype receives from the pool of sports events in the Medither the sports news that match the user's profile. The Sports4U application relieves the user of the time-consuming task of searching for sports news he or she might be interested in.
CONCLUSION
In this chapter, we presented an approach for supporting the creation of personalized multimedia content. We motivated the need for technology to handle the flood of multimedia information that allows for a much more targeted, individual management of and access to multimedia content. To give a better understanding of the content creation process, we introduced the general approaches in multimedia data modeling and multimedia authoring as we find them today. We presented how the need for personalization of multimedia content heavily affects the multimedia content creation process and can only be met by dynamic, (semi-)automatic support for the personalized assembly of multimedia content. We looked into existing related approaches, ranging from personalization in the text-centric Web context through single-media personalization to the personalization of multimedia content. Especially for complex personalization tasks, we observe that (additional) programming is needed, and we propose software engineering support with our Multimedia for you Framework (MM4U).
We presented the MM4U framework concept in general and, in more detail, the
single layers of the MM4U framework: access to user profile information, personalized
REFERENCES
3rd Generation Partnership Project (2003a). TS 26.234; Transparent end-to-end packet-switched streaming service; protocols and codecs (Release 5). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26234.htm
3rd Generation Partnership Project (2003b). TS 26.246; Transparent end-to-end packet-switched streaming service: 3GPP SMIL language profile (Release 6). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26246.htm
Ackermann, P. (1996). Developing object oriented multimedia software: Based on
MET++ application framework. Heidelberg, Germany: dpunkt.
Adobe Systems, Inc., USA (2001). Adobe SVG Viewer. Retrieved February 25, 2004, from
http://www.adobe.com/svg/
Agnihotri, L., Dimitrova, N., Kender, J., & Zimmerman, J. (2003). Music videos miner. In
ACM Multimedia.
Allen, J. F. (1983, November). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11).
Amazon, Inc., USA (1996-2004). Amazon.com. Retrieved February 20, 2004, from http://www.amazon.com/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004a). Mobile SVG
profiles: SVG Tiny and SVG Basic. W3C recommendation 25/03/2004. Retrieved
June 10, 2004, from http://www.w3.org/TR/SVGMobile12/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004b). Scalable vector
graphics (SVG) 1.2 specification. W3C working draft 05/10/2004. Retrieved June
10, 2004, from http://www.w3c.org/Graphics/SVG/
André, E. (1996). WIP/PPP: Knowledge-based methods for fully automated multimedia authoring. In Proceedings of EUROMEDIA 96, London.
André, E., & Rist, T. (1995). Generating coherent presentations employing textual and visual material. Artificial Intelligence Review, 9(2-3). Kluwer Academic Publishers.
André, E., & Rist, T. (1996, August). Coping with temporal constraints in multimedia presentation planning. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon.
Arndt, T. (1999, June). The evolving role of software engineering in the production of multimedia applications. In IEEE International Conference on Multimedia Computing and Systems, Volume 1, Florence, Italy.
Ayars, J., Bulterman, D., Cohen, A., Day, K., Hodge, E., Hoschka, P., et al. (2001).
Synchronized multimedia integration language (SMIL 2.0) specification. W3C
Recommendation 08/07/2001. Retrieved February 23, 2004, from http://www.w3c.org/AudioVideo/
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Beckett, D., & McBride, B. (2003). RDF/XML syntax specification (revised). W3C recommendation 15/12/2003. Retrieved February 23, 2004, from http://www.w3c.org/RDF/
Bohrer, K., & Holland, B. (2004). Customer profile exchange (cpexchange) specification version 1.0, 20/10/2000. Retrieved January 27, 2004, from http://www.cpexchange.org/standard/cpexchangev1_0F.zip
Boll, S. (2003, July). Vienna 4 U - What Web services can do for personalized multimedia
applications. In Proceedings of the Seventh Multi-Conference on Systemics
Cybernetics and Informatics (SCI 2003), Orlando, Florida, USA.
Boll, S., & Klas, W. (2001). ZYX - A multimedia document model for reuse and adaptation.
In IEEE Transactions on Knowledge and Data Engineering, 13(3).
Boll, S., Klas, W., & Wandel, J. (1999, November). A cross-media adaptation strategy for
multimedia presentations. Proc. of the ACM Multimedia Conf. 99, Part 1, Orlando,
Florida, USA.
Boll, S., Klas, W., Heinlein, C., & Westermann, U. (2001, August). Cardio-OP - Anatomy
of a multimedia repository for cardiac surgery. Technical Report TR-2001301,
University of Vienna, Austria.
Boll, S., Klas, W., & Westermann, U. (2000, August). Multimedia Document Formats Sealed Fate or Setting Out for New Shores? In Multimedia - Tools and Applications, 11(3).
Boll, S., Krsche, J., & Scherp, A. (2004, September). Personalized multimedia meets
location-based services. In Proceedings of the Multimedia-Informationssysteme
Workshop associated with the 34th annual meeting of the German Society of
Computing Science, Ulm, Germany.
Boll, S., Krsche, J., & Wegener, C. (2003, August). Paper chase revisited - A real world
game meet hypermedia (short paper). In Proc. of the Intl. Conference on Hypertext
(HT03), Nottingham, UK.
Boll, S., & Westermann, U. (2003, November). Medither - An event space for contextaware multimedia experiences. In Proc. of International ACM SIGMM Workshop
on Experiential Telepresence, Berkeley, CA., USA.
Bordegoni, M., Faconti, G., Feiner, S., Maybury, M. T., Rist, T., Ruggieri, S., et al. (1997,
December). A standard reference model for intelligent multimedia presentation
systems. In ACM Computer Standards & Interfaces, 18(6-7).
Brusilovsky, P. (1996). Methods and techniques of adaptive hypermedia. User Modeling
and User Adapted Interaction, 6(2-3).
Bugaj, S., Bulterman, D., Butterfield, B., Chang, W., Fouquet, G., Gran, C., et al. (1998).
Synchronized multimedia integration language (SMIL 1.0) specification. W3C
Recommendation 06/15/1998. Retrieved June 10, 2004, from http://www.w3.org/
TR/REC-smil/
Bulterman, D. C. A., van Rossum, G., & van Liere, R. (1991). A structure of transportable,
dynamic multimedia documents. In Proceedings of the Summer 1991 USENIX
Conf., Nashville, TN, USA.
Chen, G., & Kotz, D. (2000). A survey of context-aware mobile computing research.
Technical Report TR2000-381. Dartmouth University, Department of Computer
Science,
Click2learn, Inc, USA (2001-2002). Toolbook Standards-based content authoring.
Retrieved February 6, 2004 from http://home.click2learn.com/en/toolbook/
index.asp
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
CWI (2004). The cuypers multimedia transformation engine. Amsterdam, The Netherlands. Retrieved February 25, 2004, on http://media.cwi.nl:8080/demo/
De Bra, P., Aerts, A., Berden, B., De Lange, B., Rousseau, B., Santic, T., Smits, D., & Stash,
N. (2003, August). AHA! The Adaptive Hypermedia Architecture. Proceedings of
the ACM Hypertext Conference, Nottingham, UK.
De Bra, P., Aerts, A., Houben, G.-J., & Wu, H. (2000). Making general-purpose adaptive
hypermedia work. In Proc. of the AACE WebNet Conference, San Antonio, Texas.
De Bra, P., Aerts, A., Smits, D., & Stash, N. (2002a, October). AHA! version 2.0: More
adaptation flexibility for authors. In Proc. of the AACE ELearn2002 Conf.
De Bra, P., Brusilovsky, P., & Conejo, R. (2002b, May). Proc. of the Second Intl. Conf.
for Adaptive Hypermedia and Adaptive Web-Based Systems, Malaga, Spain,
Springer LNCS 2347.
De Bra, P., Brusilovsky, P., & Houben, G.-J. (1999a, December). Adaptive hypermedia:
From systems to framework. ACM Computing Surveys, 31(4).
De Bra, P., Houben, G.-J., & Wu, H. (1999b). AHAM: A dexter-based reference model for
adaptive hypermedia. In Proceedings of the 10th ACM Conf. on Hypertext and
hypermedia: returning to our diverse roots, Darmstadt, Germany.
De Carolis, B., de Rosis, F., Andreoli, C., Cavallo, V., De Cicco, M L (1998). The Dynamic
Generation of Hypertext Presentations of Medical Guidelines. The New Review of
Hypermedia and Multimedia, 4.
De Carolis, B., de Rosis, F., Berry, D., & Michas, I. (1999). Evaluating plan-based
hypermedia generation. In Proc. of European Workshop on Natural Language
Generation, Toulouse, France.
de Rosis, F., De Carolis, B., & Pizzutilo, S. (1999). Software documentation with animated
agents. In Proc. of the 5th ERCIM Workshop on User Interfaces For All, Dagstuhl,
Germanny.
Dublin Core Metadata Initiative (1995-2003). Expressing simple Dublin Core in RDF/
XML,1995-2003. Retrieved February 2, 2004, from http://dublincore.org/documents/2002/07/31/dcmes-xml/
Duda, A., & Keramane, C. (1995). Structured temporal composition of multimedia data.
Proceedings of the IEEE International Workshop Multimedia-Database-Management Systems.
Duke, D. J., Herman, I., & Marshall, M. S. (1999). PREMO: A framework for multimedia
middleware: Specification, rationale, and java binding. New York: Springer.
Echiffre, M., Marchisio, C., Marchisio, P., Panicciari, P., & Del Rossi, S. (1998, JanuaryMarch). MHEG-5 Aims, concepts, and implementation issues. In IEEE Multimedia.
Egenhofer, M. J., & Franzosa, R. (1991, March). Point-Set Topological Spatial Relations.
Int. Journal of Geographic Information Systems, 5(2).
Elhadad, M., Feiner, S., McKeown, K., & Seligmann, D. (1991). Generating customized
text and graphics in the COMET explanation testbed. In Proc. of the 23rd
Conference on Winter Simulation. IEEE Computer Society, Phoenix, Arizona,
USA.
Engels, G., Sauer, S., & Neu, B. (2003, October). Integrating software engineering and
user-centred design for multimedia software developments. In Proc. IEEE Symposia on Human-Centric Computing Languages and Environments - Symposium on
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Visual/Multimedia Software Engineering, Auckland, New Zealand. IEEE Computer Society Press.
Exor International Inc. (2001-2004). eSVG: Embedded SVG. Retrieved February 12, 2004,
from http://www.embedding.net/eSVG/english/overview/overview_frame.html
Fink, J., Kobsa, A., & Schreck, J. (1997). Personalized hypermedia information through
adaptive and adaptable system features: User modeling, privacy and security
issues. In A. Mullery, M. Besson, M. Campolargo, R. Gobbi, & R. Reed (Eds.),
Intelligence in services and networks: Technology for cooperative competition.
Berlin: Springer.
Foundation for Intelligent Physical Agents (2002). FIPA device ontology specification,
2002. Retrieved January 23, 2004, from http://www.fipa.org/specs/fipa00091/
Gaggi, O., & Celentano, A. (2002). A visual authoring environment for prototyping
multimedia presentations. In Proceedings of the IEEE Fourth International
Symposium on Multimedia Software Engineering.
Girgensohn, A., Bly, S., Shipman, F., Boreczky, J., & Wilcox, L. (2001). Home video editing
made easy Balancing automation and user control. In Proc. of the HumanComputer Interaction, Tokyo, Japan.
Girgensohn, A., Shipman, F., & Wilcox, L. (2003, November). Hyper-Hitchcock: Authoring
Interactive Videos and Generating Interactive Summaries. In Proc. ACM Multimedia.
Greiner, C., & Rose, T. (1998, November). A Web based training system for cardiac
surgery: The role of knowledge management for interlinking information items. In
Proc. The World Congress on the Internet in Medicine, London.
Hardman, L. (1998, March). Modeling and Authoring Hypermedia Documents. Doctoral
dissertation, University of Amsterdam, The Netherlands.
Hardman, L., Bulterman, D. C. A., & van Rossum, G. (1994b, February). The Amsterdam
Hypermedia Model: Adding time and context to the Dexter Model. In Comm. of the
ACM, 37(2).
Hardman, L., van Rossum, G., Jansen, J., & Mullender, S. (1994a). CMIFed: A transportable hypermedia authoring system. In Proc. of the Second ACM International
Conference on Multimedia, San Francisco.
Hirzalla, N., Falchuk, B., & Karmouch, A. (1995). A temporal model for interactive
multimedia scenarios. In IEEE Multimedia, 2(3).
Hunter, J. (1999, October). Multimedia metadata schemas. Retrieved June 16, 2004, from
http://www2.lib.unb.ca/ Imaging_docs/IC/schemas.html
IBM Corporation, USA. (2004a). IBM research Video semantic summarization systems.
Retrieved June 15, 2004, from http://www.research.ibm.com/MediaStar/
VideoSystem.html#Summarization%20Techniques
IBM Corporation, USA. (2004b). QBIC home page. Retrieved June 16, 2004, from http:/
/wwwqbic.almaden.ibm.com/
INRIA (2003). PocketSMIL 2.0. Retrieved February 24, 2004, from http://
opera.inrialpes.fr/pocketsmil/
International Organisation for Standardization (1996). ISO 13522-5, information technology Coding of multimedia and hypermedia information, Part 5: Support for
base-level interactive applications. Geneva, Switzerland: International
Organisation for Standardization.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
ISO/IEC (1999, July). JTC 1/SC 29/WG 11, MPEG-7: Context, Objectives and Technical
Roadmap, V.12. ISO/IEC Document N2861. Geneva, Switzerland: Int. Organisation
for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001a, November). JTC 1/SC 29/WG 11. InformationtechnologyMultimedia
content description interfacePart 1: Systems. ISO/IEC Final Draft International
Standard 15938-1:2001. Geneva, Switzerland: Int. Organisation for Standardization/
Int. Electrotechnical Commission.
ISO/IEC (2001b, September). JTC 1/SC 29/WG 11. Information technologyMultimedia
content description interfacePart 2: Description definition language. ISO/IEC
Final Draft Int. Standard 15938-2:2001. Geneva, Switzerland: Int. Organisation for
Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001c, July). JTC 1/SC 29/WG 11. Information technologyMultimedia content description interfacePart 3: Visual. ISO/IEC Final Draft Int. Standard
15938-3:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int.
Electrotechnical Commission.
ISO/IEC (2001d, June). JTC 1/SC 29/WG 11. Information technologyMultimedia
content description interfacePart 4: Audio. ISO/IEC Final Draft Int. Standard
15938-4:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int.
Electrotechnical Commission.
ISO/IEC (2001e, October). JTC 1/SC 29/WG 11. Information technologyMultimedia
content description interfacePart 5: Multimedia description schemes. ISO/IEC
Final Draft Int. Standard 15938-5:2001. Geneva, Switzerland: Int. Organisation for
Standardization/Int. Electrotechnical Commission.
Jourdan, M., Layada, N., Roisin, C., Sabry-Ismal, L., & Tardif, L. (1998). Madeus, and
authoring environment for interactive multimedia documents. ACM Multimedia.
Kim, M., Wood, S., & Cheok, L.-T. (2000, November). Extensible MPEG-4 textual format
(XMT). In Proc. of the 8th ACM Multimedia Conf., Los Angeles.
Klas, W., Greiner, C., & Friedl, R. (1999, July). Cardio-OP: Gallery of cardiac surgery. IEEE
International Conference on Multimedia Computing and Systems (ICMS 99).
Florence, July.
Klyne, G., Reynolds, F., Woodrow, C., Ohto, H., Hjelm, J., Butler, M. H., & Tran, L. (2003).
Composite capability/preference profile (CC/PP): Structure and vocabularies W3C Working Draft 25/03/2003.
Kopf, S., Haenselmann, T., Farin, D., & Effelsberg, W. (2004). Automatic generation of
summaries for the Web. In Proceedings Electronic Imaging 2004.
Lemlouma, T., & Layada, N. (2003, June). Media resources adaptation for limited devices.
In Proc. of the Sevent ICCC/IFIP International Conference on Electronic Publishing ELPUB 2003, Universidade deo Minho, Portugal.
Lemlouma, T., & Layada, N. (2004, January). Context-aware adaptation for mobile
devices. IEEE International Conference on Mobile Data Management, Berkeley,
California, USA.
Little, T. D. C., & Ghafoor, A. (1993). Interval-based conceptual models for timedependent multimedia data. In IEEE Transactions on Knowledge and Data
Engineering, 5(4).
Macromedia, Inc., USA (2003, January). Using Authorware 7. [Computer manual].
Available from http://www.macromedia.com/software/authorware/
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Macromedia, Inc., USA (2004). Macromedia. Retrieved June 15, 2004, from http://
www.macromedia.com/
McKeown, K., Robin, J., & Tanenblatt, M. (1993). Tailoring lexical choice to the users
vocabulary in multimedia explanation generation. In Proc. of the 31st conference
on Association for Computational Linguistics, Columbus, Ohio.
Oldenettel, F., & Malachinski, M. (2003, May). The LEBONED metadata architecture. In
Proc. of the 12th International World Wide Web Conference, Budapest, Hungary
(pp. S.207-216). ACM Press, Special Track on Education.
Open Mobile Alliance (2003). User agent profile (UA Prof). 20/05/2003. Retrieved
February 10, 2004, from http://www.openmobilealliance.org/
Oratrix (2004). GRiNS for SMIL Homepage. Retrieved February 23, 2004, from http://
www.oratrix.com/GRiNS
Papadias, D., & Sellis, T. (1994, October). Qualitative representation of spatial knowledge
in two-dimensional space. In VLDB Journal, 3(4).
Papadias, D., Theodoridis, Y., Sellis, T., & Egenhofer, M. J. (1995, March). Topological
relations in the world of minimum bounding rectangles: A study with R-Trees. In
Proc. of the ACM SIGMOD Conf. on Management of Data, San Jose, California.
Pree, W. (1995). Design patterns for object-oriented software development. Boston:
Addison-Wesley.
Rabin, M. D., & Burns, M. J. (1996). Multimedia authoring tools. In Conference Companion on Human Factors in Computing Systems, Vancouver, British Columbia,
Canada, ACM Press.
Raggett, D., Le Hors, A., & Jacobs, I. (1998). HyperText markup language (HTML)
version 4.0. W3C Recommendation, revised on 04/24/1998. Retrieved February 20,
2004, from http://www.w3c.org/MarkUp/
RealNetworks (2003). RealOne Player. Retrieved February 25, 2004, from http://
www.real.com/
Rossi, G., Schwabe, D., & Guimares, R. (2001, May). Designing personalized Web
applications. In Proceedings of the tenth World Wide Web (WWW) Conference,
Hong Kong. ACM.
Rout, T. P., & Sherwood, C. (1999, May). Software engineering standards and the
development of multimedia-based systems. In Fourth IEEE International Symposium and Forum on Software Engineering Standards. Curitiba, Brazil.
Scherp, A., & Boll, S. (2004a, March). MobileMM4U - Framework support for dynamic
personalized multimedia content on mobile systems. In Multikonferenz
Wirtschaftsinformatik 2004, special track on Technologies and Applications for
Mobile Commerce.
Scherp, A., & Boll, S. (2004b, October). Generic support for personalized mobile multimedia tourist applications. Technical demonstration for the ACM Multimedia
Conference, New York, USA.
Scherp, A., & Boll, S. (2005, January). Paving the last mile for multi-channel multimedia
presentation generation. In Proceedings of the 11th International Conference on
Multimedia Modeling, Melbourne, Australia.
Schmitz, P., Yu, J., & Santangeli, P. (1998). Timed interactive multimedia extensions for
HTML (HTML+TIME). W3C, version 09/18/1998. Retrieved February 20, 2004, from
http://www.w3.org/TR/NOTE-HTMLplusTIME
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Stash, N., Cristea, A., & De Bra, P. (2004). Authoring of learning styles in adaptive
hypermedia. In WWW04 Education Track. New York: ACM.
Stash, N., & De Bra, P. (2003, June). Building Adaptive Presentations with AHA! 2.0.
Proceedings of the PEG Conference, Sint Petersburg, Russia.
Sun Microsystems, Inc. (2004). Java media framework API. Retrieved February 15, 2004,
from http://java.sun.com/products/java-media/jmf/index.jsp
Szyperski, C., Gruntz, D., & Murer, S. (2002). Component software: Beyond objectoriented programming (2nd ed.). Boston: Addison-Wesley.
van Ossenbruggen, J.R., Cornelissen, F.J., Geurts, J.P.T.M., Rutledge, L.W., & Hardman,
H.L. (2000, December). Cuypers: A semiautomatic hypermedia presentation system. Technical Report INS-R0025. CWI, The Netherlands.
van Rossum, G., Jansen, J., Mullender, S., & Bulterman, D. C. A. (1993). CMIFed: A
presentation environment for portable hypermedia documents. In Proc. of the First
ACM International Conference on Multimedia, Anaheim, California.
Villard, L. (2001, November). Authoring transformations by direct manipulation for
adaptable multimedia presentations. In Proceeding of the ACM Symposium on
Document Engineering, Atlanta, Georgia.
Wahl, T., & Rothermel, K. (1994, May). Representing time in multimedia systems. In Proc.
IEEE Int. Conf. on Multimedia Computing and Systems, Boston.
Wu, H., de Kort, E., & De Bra, P. (2001). Design issues for general-purpose adaptive
hypermedia systems. In Proc. of the 12th ACM Conf. on Hypertext and Hypermedia,
rhus, none, Denmark.
Yahoo!, Inc. (2002). MyYahoo!. Retrieved February 17, 2004, from http://my.yahoo.com/
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Chapter 12
ABSTRACT
Relevance feedback is a mature technique that has been used to take user subjectivity
into account in multimedia retrieval. It can be seen as an attempt to bridge the semantic
gap by keeping a human in the loop. A variety of techniques have been used to
implement relevance feedback in existing retrieval systems. An analysis of these
techniques is used to develop the requirements of a relevance feedback technique that
aims to be capable of managing semantics in multimedia retrieval. It is argued that
these requirements suggest a case for a user-centric framework for relevance feedback
with low coupling to the retrieval engine.
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
INTRODUCTION
A key challenge in multimedia retrieval remains the issue often referred to as the
semantic gap. Similarity measures computed on low-level features may not correspond
well with human perceptions of similarity (Zhang et al., 2003). Human perceptions of
similarity of multimedia objects such as images or video clips tend to be semantically
based, that is, the perception that two multimedia objects are similar arises from these two
objects evoking similar or overlapping concepts in the users mind. Therefore different
users posing the same query may have very different expectations of what they are
looking for. On the other hand, existing retrieval systems tend to return the same results
for a given query. In order to cater to user subjectivity and to allow for the fact that the
users perception of similarity may be different from the systems similarity measure, the
users need to be kept in the loop.
Relevance feedback is a mature and widely recognised technique for making
retrieval systems better satisfy users' information needs (Rui et al., 1997). Informally,
relevance feedback can be interpreted as a technique that should be able to understand
the user's perception of semantic similarity and incorporate it into subsequent retrieval
iterations.
This chapter aims to provide an overview of the rich variety of relevance feedback
techniques described in the literature while examining issues related to the semantic
implications of these techniques. Section 2 presents a discussion of the existing literature
on relevance feedback and highlights certain advantages and disadvantages of the
reviewed approaches. This analysis is used to develop the requirements of a relevance
feedback technique that would be an aid in managing multimedia semantics (Section 3).
A high-level framework for such a technique is outlined in Section 4.
RELEVANCE FEEDBACK IN
CONTENT-BASED MULTIMEDIA RETRIEVAL
Broadly speaking, anything the user does or says can be used to interpret
something about their view of a computer system: for example, the time spent at a Web
page, or the motion of their eyes while viewing an electronic document. In the
context of content-based multimedia retrieval we use the term "relevance feedback" in the
conventional sense, whereby users are allowed to indicate their opinion of the results
returned by a retrieval system. This can be done in a number of ways: the user
selects only the results they consider relevant to their query, the user provides positive
as well as negative examples, or the user is left to provide some sort of ranking of the
results. In general terms, the user classifies the result set into a number of categories.
The relevance feedback module should be able to use this classification to improve
subsequent retrieval. It is expected that several successive iterations will further refine
the result set, thus converging to an acceptable result. What makes a result acceptable
depends on the user: it may be a single result (the so-called "target searching" of
Cox et al. (1996)), or "acceptable" may mean a sufficient number of
relevant results (as when the user is performing a category search).
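The iterate-and-refine loop just described can be sketched as follows. This is an illustrative skeleton rather than any particular system from the literature: nearest-neighbour retrieval over a NumPy feature matrix is assumed, and the `judge` and `refine` callbacks are hypothetical stand-ins for the user's marking of results and the system's refinement strategy.

```python
import numpy as np

def retrieve(query_vec, features, k):
    """Return the indices of the k database items nearest to the query."""
    dists = np.linalg.norm(features - query_vec, axis=1)
    return np.argsort(dists)[:k]

def feedback_session(query_vec, features, judge, refine, k=10, max_iters=5):
    """Iterate retrieval -> user judgement -> query refinement.

    `judge` classifies a result list into (relevant, nonrelevant) indices;
    `refine` produces the next query vector from that classification.
    """
    results = retrieve(query_vec, features, k)
    for _ in range(max_iters):
        relevant, nonrelevant = judge(results)
        if not relevant and not nonrelevant:  # user accepts the result set
            break
        query_vec = refine(query_vec, features, relevant, nonrelevant)
        results = retrieve(query_vec, features, k)
    return results
```

The loop terminates either when the user stops marking results or after a fixed number of iterations, mirroring the convergence-to-an-acceptable-result behaviour described above.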
Intuitively, for results to approximate semantic retrieval, the relevance feedback
mechanism should understand why users mark the results the way they do. It should
ideally be able to identify not only what is common to the results belonging to a
particular category, but also to take into account subtleties such as the difference
between an example being marked nonrelevant and one being left unmarked.
There is a rich body of literature on the subject of relevance feedback, particularly
in the context of content-based image retrieval (CBIR). For instance, Zhang et al. (2003)
interpret relevance feedback as a machine-learning problem and present a comprehensive
review of relevance feedback algorithms in CBIR based on the learning and searching
natures of the algorithms. Zhou and Huang (2001b) discuss a selection of relevance
feedback variants in the context of multimedia retrieval, examined under seven
conceptual dimensions ranging from the simple, such as "What is the user looking
for?", to the highly diverse, such as "The goals and the learning machines" (Zhou &
Huang, 2001b). We examine a variety of existing relevance feedback methods, many from
CBIR since this area has received much attention in the literature, and classify them into
one of five broad approaches. While there are degrees of overlap, this taxonomy focuses
on the main conceptual thrust of the techniques and should be taken as representative
rather than exhaustive.
In this approach, documents and queries are represented by vectors in an
n-dimensional feature space. The feedback is implemented through Query Point Movement
(QPM) and/or Query ReWeighting (QRW). QPM aims to estimate an ideal query point
by moving the query closer to positive example points and away from negative example
points. QRW tries to give higher importance to the dimensions that help in retrieving
relevant images and to reduce the importance of those that do not. The classical approach
is arguably the most mature: although elements of it can be found in almost all
existing techniques, certain techniques are explicitly or very strongly of this type.
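The two mechanisms can be made concrete with a minimal sketch. This is illustrative only: the alpha, beta and gamma constants below are exactly the kind of ad hoc parameters that later refinements of the classical approach try to eliminate, and the inverse-standard-deviation reweighting is one common choice rather than the only one.

```python
import numpy as np

def move_query_point(query, relevant, nonrelevant,
                     alpha=1.0, beta=0.75, gamma=0.25):
    """Query Point Movement (Rocchio-style): pull the estimated ideal
    query toward the centroid of the positive examples and push it away
    from the centroid of the negative ones."""
    q = alpha * query.astype(float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

def reweight_dimensions(relevant, eps=1e-6):
    """Query ReWeighting: dimensions on which the relevant examples agree
    (low spread) receive high weight; noisy dimensions are de-emphasised."""
    weights = 1.0 / (np.std(relevant, axis=0) + eps)
    return weights / weights.sum()  # normalise so the weights sum to 1
```

In a session, the reweighted dimensions would feed a weighted distance function, so that retrieval emphasises the feature dimensions on which the user's positive examples cluster tightly.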
The MARS system (Rui et al., 1997) uses a reweighting technique based on a
refinement of the text retrieval approach. Later work by Rui and Huang (1999) has been
adapted by Kang (2003) to use relevance feedback to detect emotional events in video.
To overcome certain disadvantages of these approaches, such as the need for ad hoc
constants, Ishikawa et al. (1998) formalise the relevance feedback task as a minimisation
problem.
A novel framework is presented in Rui and Huang (1999), based on a two-level image
model in which features such as colour, texture and shape occupy the higher level while
the lower level contains the feature vector for each feature. The overall distance between
a training sample and the query is defined in terms of both levels. The performance of the
MARS and MindReader systems is compared against the novel model in terms of the
percentage of relevant images returned. While relevance feedback boosts the retrieval
performance of all the techniques, the novel framework consistently performs
better than MARS and MindReader (Rui & Huang, 1999).
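The spirit of this two-level distance computation can be sketched as follows. This is a simplified illustration using weighted Euclidean distances at both levels, not the exact formulation of Rui and Huang (1999); the feature names are placeholders.

```python
import numpy as np

def feature_distance(x, y, intra_weights):
    """Lower level: weighted distance within a single feature vector
    (e.g., a colour histogram or a texture descriptor)."""
    return float(np.sqrt(np.sum(intra_weights * (x - y) ** 2)))

def overall_distance(query, item, intra_weights, inter_weights):
    """Higher level: combine the per-feature distances (colour, texture,
    shape, ...) using a second set of weights. Relevance feedback can, in
    principle, adjust both the inter- and the intra-feature weights."""
    return sum(inter_weights[f] *
               feature_distance(query[f], item[f], intra_weights[f])
               for f in query)
```

Separating the two levels is what lets feedback say both "colour matters more than texture" (inter-feature weights) and "within colour, these histogram bins matter most" (intra-feature weights).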
The vector space model is not restricted to CBIR. Liu and Wan (2003) have
developed two relevance feedback algorithms for use in audio retrieval, employing
time-domain as well as frequency-domain features. The first is a standard-deviation-based
reweighting, while the second relies on minimising the weighted distance between the
relevant examples and the query. They demonstrate an average precision increase as a
result of the use of relevance feedback and claim that "through the relevance feedback
some [...] semantics can be added to the retrieval system", although the claim is not very
well supported.
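For reference, the average precision measure used in evaluations such as this can be computed as below. This is the standard single-query definition, not a detail specific to Liu and Wan (2003).

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant item
    appears; relevant items missing from the ranking count as misses."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0
```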
The recent techniques in this area are characterised by their robustness, solid
mathematical formulation and efficient implementations. However, their semantic
underpinnings may be seen as somewhat simplistic, owing to the underlying assumptions that (a)
a single query vector is able to represent the user's query needs and (b) visually similar
images are close together in the feature space. There is also often an underlying
assumption in methods of this kind that the user's ideal query point remains static
throughout a query session. While this assumption may be justifiable when performing
automated tests to compute performance benchmarks, it clearly does not capture
users' information needs at a semantic level.
ing characteristic rather than their use of such techniques. These techniques enable
so-called cross-modal queries, which allow users to express queries in terms of keywords
and yet have low-level features taken into account, and vice versa. Further, Zhang et al.
(2003) are able to utilise a learning process to propagate keywords to unlabelled images
based on the user's interaction. The motivation is that in some cases the user's
information need is better expressed in terms of keywords (possibly combined with
low-level features) than in terms of low-level features alone. It seems to be taken for
granted in these techniques that, in semantic terms, keywords represent a higher level
of information about the images than the visual features. We elaborate on these
assumptions in Section 3.
The advantages and disadvantages of the broad approaches can be summarised as
follows:

- Classical (vector space). Advantages: computationally more efficient than the
  other approaches; mature. Disadvantages: simplistic semantic interpretation of the
  user's information need (a single ideal query vector).
- Probabilistic/Statistical. Disadvantages: computational costs can become
  prohibitive with complex modeling.
- Machine Learning. Disadvantages: the training sample can take time to accumulate;
  in some techniques user subjectivity is not explicitly catered for.
- Keyword Integration. Disadvantages: possible impedance mismatch between the
  different techniques used for visual and keyword features; difficulty in obtaining
  keyword annotations.
The review of the literature above suggests certain trends. There is a movement
toward more general and comprehensive approaches. In the vector-space arena this can
be seen in the progression from MARS (Rui et al., 1997) to MindReader (Ishikawa et al.,
1998) to the novel framework of Rui and Huang (1999). The vector-space interpretation
of relevance feedback has become increasingly general, increasingly complex and
reliant on fewer artificial parameters in its evolution, if we may term it thus. The spirit of
recent generalised methods, such as Rui and Huang (1999) and Geman and Moquet's
(1999) stochastic model (an extension of the work of Cox et al., 1996), acknowledges a
much higher level of complexity than was previously recognised. The application of
machine learning techniques and the augmentation of relevance feedback methods with
historical information also present a promising avenue. Extending these approaches to
incorporate per-user relevance feedback, as well as the deduction of implicit truths that
may apply across a large number of users, will probably continue. The use of keyword
features to augment visual features is also a step forward; however, the success of its
adoption in applications other than Web-based ones may depend on the feasibility of
obtaining keyword annotations for large multimedia collections. Such keyword
annotations could perhaps be accumulated manually during user feedback.
CONSIDERATIONS FOR A
RELEVANCE FEEDBACK TECHNIQUE TO
HANDLE MULTIMEDIA SEMANTICS
While considering existing techniques, whether from the perspective of selecting
appropriate strategies for a proposed multimedia system or of developing a new
technique, several factors need to be taken into account. Zhou and Huang (2001a) have
outlined several such factors while interpreting relevance feedback as a classification
problem. Among these is the "small sample" issue: in a given session the user provides
only a small number of training examples. Another key issue is the inherent asymmetry
involved: the training is in the form of a classification, while the desired output of
retrieval is typically a rank-ordered top-k return. They proceed to present a list of
critical issues to consider while designing a relevance feedback algorithm (Zhou &
Huang, 2001b). While their analysis continues to be relevant and the issues presented
remain pertinent, they do not explicitly take into account the added complexities of
certain semantic implications. We attempt to outline critical considerations to be taken
into account when selecting an existing relevance feedback technique (or developing a
new one) for use with a semantically focussed multimedia retrieval system. These
requirements are based on the analysis and review of the literature in the second section.
the result marked relevant is a close approximation? Further, when negative
examples are employed (i.e., the user is allowed to nominate results as either
relevant or nonrelevant), what does the fact that the user left a result unmarked signify?
For instance, Cox et al. (1998) make a distinction between "absolute judgement", where
images marked by the user are nominated as relevant while unmarked ones are
considered nonrelevant, and "relative judgement", which interprets the user feedback
to mean that the marked images are more relevant than the unmarked ones. Geman and
Moquet (1999) observe that their most general model, with the fewest simplifying
assumptions, may be the most suitable for modeling the human response; this, however,
comes with a penalty in terms of computational complexity. Again, a general model that
incorporates a "soft" approach could be seen as more conceptually robust than a
"harder" one.
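The distinction can be made concrete with a small sketch. This is a hypothetical illustration (Cox et al. do not present their formulation this way): under the absolute interpretation the user's marks yield hard labels, while under the relative interpretation they yield only pairwise preference constraints.

```python
def absolute_judgement(marked, results):
    """Absolute interpretation: marked results are relevant and every
    unmarked result is labelled nonrelevant."""
    relevant = [r for r in results if r in marked]
    nonrelevant = [r for r in results if r not in marked]
    return relevant, nonrelevant

def relative_judgement(marked, results):
    """Relative interpretation: marked results are merely preferred to
    unmarked ones, so emit (preferred, other) pairs instead of labels."""
    unmarked = [r for r in results if r not in marked]
    return [(m, u) for m in marked for u in unmarked]
```

A learner consuming hard labels and one consuming preference pairs can draw quite different conclusions from the same user session, which is precisely why the interpretation chosen matters for conceptual robustness.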
Another important criterion for conceptual robustness is the compatibility of the
tools used while developing a relevance feedback strategy. For instance, in the
keyword-integration techniques a semantic web based on keywords is integrated with a
vector-space style representation of low-level features. In cases like these, where multiple
tools are combined, care must be taken to ensure that conceptually compatible approaches
are used rather than fusing incompatible tools in an ad hoc manner.
Interestingly, Cox et al. (1996) note that PicHunter seems able to place together
images that are apparently semantically similar. This remains hard to explain; they
conjecture that their algorithms produce probability distributions that are very complex
functions of the feature sets used. This phenomenon, in which feature sets combined
with user feedback produce seemingly semantic results, does not appear to have been
explored in great depth in the relevance feedback literature and possibly merits further
investigation.
computationally feasible. Ideally, the test of whether enough detail has been captured
could be based on the notion of emergent meaning (Santini et al., 2001). If a relevance
feedback technique can facilitate sufficiently complex interaction between the user and
the retrieval system for semantic behaviour to emerge as a result of the interaction, the
technique has effectively represented all aspects of the problem. The computational
issues are related: if the user's interaction is to be useful in the emergence of semantic
behaviour by the system, extended waiting times will influence their state of mind
(frustration, disinterest) and possibly impair the semantic performance of the relevance
feedback technique.
information to mimic the user's semantic perception. A future direction along these lines
may be to combine the work of Zhang et al. (2003) with the work of Rui and Huang (1999),
which incorporates the two-layer model for visual information. It may then be possible
to build parallel or even interconnecting networks of visual and textual features in a step
towards truly capturing the semantic content of multimedia objects.
USER-CENTRIC MODELING
OF RELEVANCE FEEDBACK
From the review of various feedback techniques in the second section and the
requirements of relevance feedback outlined in the third section, it can be seen that much
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
of the current literature suggests a strong focus on the underlying retrieval engine.
Indeed, the existing literature often tends to be constrained by the value set of the
features, for example, the separate handling of numerical and keyword features (Lu et al.,
2002; Zhang et al., 2003). In this section we make a case for and present an alternative
perspective.
Nastar et al. (1998) point out that image databases can be classified into two types.
The first type is the homogeneous collection where there is a ground truth regarding
perceptual similarity. This may be implicitly obvious to human users, such as in
photographs of people where two images are either of the same person or not. Otherwise,
experts may largely agree upon similarity, for example, two images either represent the
same species of flower, or they do not. However, in the second type, the heterogeneous
collection, no ground truth may be available. A characteristic example of this type is a
stock collection of photographs. Systems designed to perform retrieval on this second
category of collection should therefore be as flexible as possible, adapting to and
learning from each user in order to satisfy their goal (Nastar et al., 1998). This
classification has significance in the context of databases of other multimedia objects
such as video and audio clips.
In homogeneous collections, retrieval strategies may perform adequately even if
they are not augmented by relevance feedback. This is because the feature set and
similarity measure can be chosen in order to best reflect the accepted ground truths. This
would not be true of heterogeneous collections. Relevance feedback becomes increasingly necessary with heterogeneity in large multimedia collections. When there is no
immediately obvious ground truth, the semantic considerations can become extremely
complex. We try to illustrate this complexity in Figure 1, which uses notation similar to
that of ER diagrams, while eschewing the use of cardinality. The user's information need
may be specific to any of several relevance entities: the users themselves (subjectivity);
the context of their situation (e.g., in the context of fashion, the keyword "model" would
map to a different semantic concept than if cars were being considered); or the semantic
concept(s) associated with their information need (which may or may not be expressible
in terms of keywords and visual features). There would then be zero or more multimedia
objects associated with the user's information need. The key point is that the
information need is potentially specific to all four entities in the diagram.
Figure 1. Complex information need in a heterogeneous collection (the Information Need is linked, ER-diagram style, to the entities User, Context, Semantic Concept, and Multimedia Object)
It can be seen that, while considering whether a multimedia object satisfies a user's
information need, all the factors to be taken into account are based on the user.
This clearly highlights the importance of relevance feedback for semantic multimedia
retrieval: feedback provided by the user can be used to ensure that the concept
symbolising their information need is not neglected. It follows that the underlying model
of relevance feedback must be explicitly user-centric and semantically based to better
meet the user's information needs.
In order to identify a way to address this challenge, the overall multimedia process
can be reviewed. A high-level overview of the information flow in the relevance feedback
and multimedia retrieval process is presented in Figure 2. In existing approaches the
retrieval mechanism and the relevance feedback module have often been formulated as
closely integrated with each other, as in the classical approach. In the extreme case,
the initial query can be interpreted as the initial iteration of user feedback. For example,
in PicHunter (Cox et al., 1996; Cox et al., 1998) and MindReader (Ishikawa et al., 1998), there
is no real distinction between the relevance feedback and the retrieval.
As an alternative approach, it is possible to interpret the task of relevance feedback
as being distinct from retrieval, by identifying and separating their roles. Retrieval tries
to answer the question: which objects are similar to the query specification? Feedback
deals with the questions: what does the fact that objects were marked in this fashion tell
us about what the user is interested in, and how can this information be best conveyed
to the retrieval engine? As reflected in Figure 2, the task of relevance feedback can then
be considered a module in the overall multimedia retrieval process.
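One way to make this separation concrete is to give the two roles distinct interfaces, so that a feedback module only interprets markings and hands a refined query back to the engine. The following is a sketch with hypothetical names; the query-point mover is a simple illustrative strategy in the spirit of classical relevance feedback, not a specific published algorithm.

```python
from abc import ABC, abstractmethod

class RetrievalEngine(ABC):
    """Answers: which objects are similar to the query specification?"""
    @abstractmethod
    def retrieve(self, query, k):
        ...

class FeedbackModule(ABC):
    """Answers: what does the marking of results tell us about the
    user's interest, and how is that conveyed to the engine?"""
    @abstractmethod
    def refine(self, query, marked_results):
        ...

class QueryPointMover(FeedbackModule):
    """Toy strategy: move the query vector toward the centroid of
    results the user marked as relevant."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
    def refine(self, query, marked_results):
        if not marked_results:
            return query
        centroid = [sum(xs) / len(marked_results)
                    for xs in zip(*marked_results)]
        return [(1 - self.alpha) * q + self.alpha * c
                for q, c in zip(query, centroid)]
```

Because the engine only ever sees a refined query, the feedback module can be swapped without touching the retrieval mechanism.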
If the relevance feedback technique is modeled in a general fashion, it can focus on
interpreting user input and be loosely coupled to the retrieval engine and the feature set.
By reducing the dependency of the relevance feedback module on the feature set and its
value set, the incorporation of novel visual and semantic features as they become
Figure 2. Information flow in the relevance feedback and multimedia retrieval process (components include a Feature Extractor producing Feature Descriptions, a Retrieval Engine taking a Query Specification and returning a Result Set, Classified Results fed back by the user, and a Log)
available would be possible. A highly general relevance feedback model would also
support user feedback in terms of an arbitrary number of classes; that is, it would support
the classification of the result set into the categories "relevant" and "unmarked"; "relevant",
"nonrelevant", and "unmarked"; or even, say, results given an integral ranking between one
and five. The interpretation of each case would, of course, have to be consistent to
maintain semantic integrity.
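A generalized model of this kind might first normalize whatever marking scheme the interface offers into a common weight range, so that downstream processing is independent of the number of classes. This is an illustrative sketch; the function name and the linear mapping are assumptions.

```python
def normalize_feedback(markings, scheme):
    """Map raw user markings onto weights in [-1, 1].

    markings: dict of result id -> raw label or score
    scheme:   'binary'  (relevant / unmarked),
              'ternary' (relevant / nonrelevant / unmarked),
              'rank5'   (integer ranking 1..5)
    """
    if scheme == "binary":
        return {r: (1.0 if v == "relevant" else 0.0)
                for r, v in markings.items()}
    if scheme == "ternary":
        table = {"relevant": 1.0, "unmarked": 0.0, "nonrelevant": -1.0}
        return {r: table[v] for r, v in markings.items()}
    if scheme == "rank5":
        # Map rankings 1..5 linearly onto [-1, 1], with 3 as neutral.
        return {r: (v - 3) / 2.0 for r, v in markings.items()}
    raise ValueError(f"unknown scheme: {scheme}")
```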
An interesting effect of modeling relevance feedback in this modular fashion is the
possibility of using multiple relevance feedback modules with the same retrieval engine.
A relevance feedback controller could then potentially be developed to apply the most
suitable relevance feedback algorithm depending on a specific user's requirements in the
context of the collection. The controller would then have the task of identifying which
relevance feedback technique would best approximate the user's information need at the
semantic level.
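Such a controller might, in the simplest case, select a module by matching a user or collection profile against registered suitability predicates. This is an entirely hypothetical sketch; the semantic-level matching envisioned in the text would be far more sophisticated than a first-match lookup.

```python
class FeedbackController:
    """Chooses among registered feedback modules per user/collection."""
    def __init__(self):
        self.modules = {}
    def register(self, name, module, suits):
        # 'suits' is a predicate over a user/collection profile.
        self.modules[name] = (module, suits)
    def select(self, profile):
        # Return the first registered module whose predicate accepts
        # the profile; a real controller would rank candidates.
        for name, (module, suits) in self.modules.items():
            if suits(profile):
                return name, module
        raise LookupError("no suitable feedback module")
```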
CONCLUSION
There is a rich and diverse collection of relevance feedback techniques described
in the literature. Any multimedia retrieval system aiming to cater to users' needs at a
semantic, or close to semantic, level of performance should certainly incorporate a
relevance feedback strategy as a measure towards narrowing the semantic gap. An
understanding of the generic types of techniques that are available, and what bearing
they have on the specific goals of the multimedia retrieval system in question, is essential.
The situation is complicated by the fact that in the literature such techniques are often
supported by precision and recall figures and other such objective measures. While
these figures can reveal certain characteristics about the relevance feedback and its
enhancement of the retrieval process, it is safe to say that such figures alone are not
sufficient to compare alternatives, especially in semantic terms. It would appear that at
the moment there are no definitive comparison criteria. However, a subjective consideration
of available techniques can be made in accordance with the guidelines outlined and
the conceptual alignment of a relevance feedback technique with the goals of the given
retrieval system. Such an approach has the benefit of being able to eliminate techniques
should they not meet the requirements envisioned for the retrieval system. Should
existing trends continue, and relevance feedback systems continue to become more
complex and general, it may be possible to implement a relevance feedback controller, or
a meta-relevance feedback module, that can use information gathered from the user's
feedback to select and refine the system's relevance feedback strategy to better
meet the user's specific information needs at a semantic level.
REFERENCES
Cox, I. J., Miller, M. L., Minka, T. P., & Yianilos, P. N. (1998). An optimized interaction
strategy for Bayesian relevance feedback. In IEEE Conference on Computer Vision and
Pattern Recognition, Santa Barbara, California (pp. 553-558).
Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T., & Zettl, H. (2002). Media
semantics: Who needs it and why? In Proceedings of the 10th ACM International
Conference on Multimedia (pp. 580 -583).
Geman, D., & Moquet, R. (1999). A stochastic feedback model for image retrieval.
Technical report, Ecole Polytechnique, 91128 Palaiseau Cedex, France.
Grootjen, F. & van der Weide, T. P. (2002). Conceptual relevance feedback. In IEEE
International Conference on Systems, Man and Cybernetics (Vol. 2, pp. 471-476).
Hersh, W. (1994). Relevance and retrieval evaluation: Perspectives from medicine.
Journal of the American Society for Information Science, 45(3), 201-206.
Ishikawa, Y., Subramanya, R., & Faloutsos, C. (1998). MindReader: Querying databases
through multiple examples. In Proceedings of the 24th International Conference
on Very Large Data Bases, VLDB (pp. 218-227).
Kang, H.-B. (2003). Emotional event detection using relevance feedback. In Proceedings
of the International Conference on Image Processing (Vol. 1, pp. 721-724).
Koskela, M., Laaksonen, J., & Oja, E. (2002). Implementing relevance feedback as
convolutions of local neighborhoods on self-organizing maps. In Proceedings of the
International Conference on Artificial Neural Networks (pp. 981-986), Madrid, Spain.
Laaksonen, J., Koskela, M., Laakso, S. P., & Oja, E. (2000). PicSOM: Content-based image
retrieval with self-organizing maps. Pattern Recognition Letters, 21, 1199-1207.
Liu, M., & Wan, C. (2003). Weight updating for relevance feedback in audio retrieval. In
Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 03) (Vol. 5, pp. 644-647).
Lu, Y., Hu, C., Zhu, X., Zhang, H., & Yang, Q. (2000). A unified framework for semantics
and feature based relevance feedback in image retrieval systems. In Proceedings
of the eighth ACM international conference on Multimedia, Marina del Rey,
California (pp. 31-37).
MacArthur, S., Brodley, C., & Shyu, C.-R. (2000). Relevance feedback decision trees in
content-based image retrieval. In Proceedings of the IEEE Workshop on Content-Based
Access of Image and Video Libraries (pp. 68-72).
Mizzaro, S. (1998). How many relevances in information retrieval? Interacting with
Computers, 10(3), 305-322.
Müller, H., Squire, D., & Pun, T. (2004). Learning from user behaviour in image retrieval:
Application of market basket analysis. International Journal of Computer Vision,
56(1-2), 65-77.
Muneesawang, P., & Guan, L. (2002). Video retrieval using an adaptive video indexing
technique and automatic relevance feedback. In Proceedings of the
IEEE Workshop on Multimedia Signal Processing (pp. 220-223).
Nastar, C., Mitschke, M., & Meilhac, C. (1998). Efficient query refinement for image
retrieval. In Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, Santa Barbara, California (pp. 547-552).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information
Sciences, 11, 341-356.
Rui, Y., & Huang, T. S. (1999). A novel relevance feedback technique in image retrieval.
In Proceedings of the Seventh ACM International Conference on Multimedia
(Part 2), Orlando, Florida (pp. 67-70).
Rui, Y., Huang, T., & Mehrotra, S. (1997). Content-based image retrieval with relevance
feedback in MARS. In Proceedings of the IEEE International Conference on
Image Processing (pp. 815-818).
Ruthven, I., & van Rijsbergen, C. J. (1996). Context generation in information retrieval.
In Proceedings of the Florida Artificial Intelligence Research Symposium.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Su, L. T. (1994). The relevance of recall and precision in user evaluation. Journal of the
American Society for Information Science, 45(3), 207-217.
Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of
concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). D. Reidel Publishing Company.
Wilson, C., & Srinivasan, B. (2002). Multiple feature relevance feedback in content based
image retrieval using probabilistic inference networks. In Proceedings of the First
International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD02a), Singapore (pp. 651-655).
Wilson, C., Srinivasan, B., & Indrawan, M. (2001). BIR: The Bayesian network image
retrieval system. In Proceedings of the IEEE International Symposium on Intelligent
Multimedia, Video and Speech Processing (ISIMP 2001), Hong Kong SAR, China (pp. 304-307).
Wilson, P. (1973). Situational relevance. Information Storage and Retrieval, 9, 457-471.
Zhang, H., Chen, Z., Li, M., & Su, Z. (2003). Relevance feedback and learning in
content-based image search. World Wide Web: Internet and Web Information Systems,
6, 131-155. The Netherlands: Kluwer Academic Publishers.
Zhou, X. S., & Huang, T. S. (2001a). Comparing discriminating transformations and SVM
for learning during multimedia retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia (pp. 137-146). Ottawa, Canada.
Zhou, X. S., & Huang, T. S. (2001b). Exploring the nature and variants of relevance
feedback. In IEEE Workshop on Content-Based Access of Image and Video
Libraries (CBAIVL 2001), 94-101.
Zhuang, Y., Yang, J., Li, Q., & Pan, Y. (2002). A graph-theoretic model for incremental
relevance feedback in image retrieval. In International Conference on Image
Processing, 1, 413-416.
Zutshi, S., Wilson, C., Krishnaswamy, S., & Srinivasan, B. (2003). Modelling relevance
feedback using rough sets. In Proceedings of the Fifth International Conference
on Advances in Pattern Recognition (ICAPR 2003) (pp. 495-500).
Section 4
Managing Distributed
Multimedia
Chapter 13
EMMO:
ABSTRACT
INTRODUCTION
Today's multimedia content formats, such as HTML (Raggett et al., 1999), SMIL
(Ayars et al., 2001), or SVG (Ferraiolo et al., 2003), primarily encode the presentation of
content but not the information the content conveys. This presentation-oriented
modeling only permits the hard-wired presentation of multimedia content exactly in the
way specified; for advanced operations like retrieval and reuse, automatic composition,
recommendation, and adaptation of content according to user interests, information
needs, and technical infrastructure, valuable information about the semantics of content
is lacking.
In parallel to research on the Semantic Web (Berners-Lee et al., 2001; Fensel, 2001),
one can therefore observe a shift in paradigm towards a semantic modeling of multimedia
content. The basic media of which multimedia content consists are supplemented with
metadata describing these media and their semantic interrelationships. These media and
descriptions are processed by stylesheets, search engines, or user agents providing
advanced functionality on the content that can exceed mere hard-wired playback.
Current semantic multimedia modeling approaches, however, largely treat the
content's basic media, the semantic description, and the functionality offered on the
content as separate entities: the basic media of which multimedia content consists are
typically stored on web or media servers; the semantic descriptions of these media are
usually stored in databases or in dedicated files on web servers using formats like RDF
(Lassila & Swick, 1999) or Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000); the functionality
on the content is normally realized as servlets or stylesheets running in application
servers, or as dedicated software, such as user agents, running at the clients.
This inherent separation of media, semantic description, and functionality in
semantic multimedia content modeling, however, hinders the realization of multimedia
content sharing as well as collaborative applications which are gaining more and more
importance, such as the sharing of MP3 music files (Gnutella, n.d.) or learning materials
(Nejdl et al., 2002) or the collaborative authoring and annotation of multimedia patient
records (Grimson et al., 2001). The problem is that exchanging content today in such
applications simply means exchanging single media files. An analogous exchange of
semantically modeled multimedia content would have to include content descriptions
and associated functionality, which are only coupled loosely to the media and usually
exist on different kinds of servers potentially under control of different authorities, and
which are thus not easily moveable.
In this chapter, we give an illustrated introduction to Enhanced Multimedia
MetaObjects (Emmo), a semantic multimedia content modeling approach developed with
collaborative and content sharing applications in mind. Essentially, an Emmo constitutes
a self-contained piece of multimedia content that merges three of the content's aspects
into a single object: the media aspect, that is, the media which make up the multimedia
content; the semantic aspect, which describes the content; and the functional aspect, by
which an Emmo can offer meaningful operations on the content and its description that
can be invoked and shared by applications. Emmos in their entirety, including media,
content description, and functionality, can be serialized into bundles and are
versionable: essential characteristics that enable their exchangeability in content
sharing applications as well as the distributed construction and modification of Emmos
in collaborative scenarios.
Furthermore, this chapter illustrates how we employed Emmos for two concrete
collaborative and content sharing applications in the domains of cultural heritage and
digital music archives.
The chapter is organized as follows: we begin with an overview of Emmos and show
their difference to existing approaches for multimedia content modeling. We then
introduce the conceptual model behind Emmos and outline a distributed Emmo container
infrastructure for the storage, exchange, and collaborative construction of Emmos. We
then apply Emmos for the representation of multimedia content in two application
scenarios. We conclude this chapter with a summary and give an outlook on our current
and future work.
BACKGROUND
In this section, we provide a basic understanding of the Emmo idea by means of an
illustrative example. We show the uniqueness of this idea by relating Emmos to other
approaches to multimedia content modeling in the field.
Figure 1. Example Emmo representing a photo album: media nodes Picture 1 to Picture 4 (each subsuming a photograph such as picture001.jpg and a thumbnail such as thumbnail001.jpg, with shot dates between 07/21/2003 and 08/1/2003) are linked via depicts and location associations to concepts such as Mary, Paul, Peter, Paris, Vienna, and Salzburg; domain knowledge relates these to Person, Location, Family Member, Friend, France, and Austria via is-a, instance-of, and part-of associations; the functionality side offers renderAsSlideShow(persons, locations, dates) and renderAsMap(persons, locations, dates)
For content description, Emmos apply an expressive concept graph-like data model
similar to RDF and Topic Maps. In this graph model, the description of the content
represented by an Emmo is not performed directly on the media that are contained in the
Emmo; instead, the model abstracts from physical media, making it possible to subsume
several media objects, which constitute only different physical manifestations of
logically one and the same medium, under a single media node. This is a convenient way to
capture alternative media. In Figure 1, for example, each media node Picture 1 to Picture
4 subsumes not only a photo but also its corresponding thumbnail image.
Apart from media, nodes can also represent abstract concepts. By associating an
Emmo's media objects with such concepts, it is possible to create semantically rich
descriptions of the multimedia content the Emmo represents. In Figure 1, for instance,
it is expressed that the logical media nodes Picture 1 to Picture 4 constitute photos
taken in Paris, Vienna, and Salzburg showing Peter and Paul, Paul and Mary, and Mary,
respectively. The figure further indicates that nodes can be augmented with primitive
attribute values for closer description: the pictures of the photo album are furnished with
the dates on which they were shot.
By associating concepts with each other, it is also possible to express domain
knowledge within an Emmo. It is stated in our example that Peter, Paul, and Mary are
Persons, that Paul and Mary are family members, that Peter is a friend, that Paris is located
in France, and that Vienna and Salzburg are parts of Austria.
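The flavour of this graph model can be conveyed with a minimal sketch (hypothetical class and method names; the actual Emmo model is considerably richer, with typed associations as first-class entities and ontology support):

```python
class EmmoGraph:
    """Minimal concept graph: nodes with kinds and attributes,
    plus typed associations between nodes."""
    def __init__(self):
        self.nodes = {}    # name -> {"kind": ..., "attrs": {...}}
        self.edges = []    # (source, association_type, target)
    def add_node(self, name, kind, **attrs):
        self.nodes[name] = {"kind": kind, "attrs": attrs}
    def associate(self, source, assoc_type, target):
        self.edges.append((source, assoc_type, target))
    def targets(self, source, assoc_type):
        return [t for s, a, t in self.edges
                if s == source and a == assoc_type]

# Fragment of the photo-album example from Figure 1.
g = EmmoGraph()
g.add_node("Picture 1", "media", shot_at="07/21/2003")
g.add_node("Paris", "concept")
g.add_node("France", "concept")
g.associate("Picture 1", "location", "Paris")
g.associate("Paris", "part-of", "France")
```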
The Emmo model does not predefine the concepts, association types, and primitive
attributes available for media description; these can be taken from arbitrary,
domain-specific ontologies. While Emmos thus constitute a very generic, flexible, and expressive
approach to multimedia content modeling, they are not a ready-to-use formalism but
require an agreed common ontology before they can be employed in an application.
Finally, Emmos also address the functional aspect of content. An Emmo can offer
operations that can be invoked by applications in order to work with the content the
Emmo represents in a meaningful manner. As shown at the top right of Figure 1, our
example Emmo provides two operations supporting two different rendition options for
the photo album, which are illustrated by the screenshots of Figure 2. As indicated by
the left screenshot, the operation renderAsSlideshow() might know how to render the
photo album as a classic slideshow, given a set of persons, locations, and time periods
of interest, on the basis of the contained pictures and their semantic description, by
generating an appropriate SMIL presentation. As indicated by the right screenshot, the
operation renderAsMap() might, given the same data, render the photo album as a map
with thumbnails pointing to the locations where photographs have been taken, by
constructing an SVG graph.
One may think of many further uses of operations. For example, operations could
also be offered for rights clearance, displaying terms of usage, and so forth.
Emmos have further properties: an Emmo can be serialized and shared in its entirety
in a distributed content sharing scenario, including its contained media, the semantic
description of these media, and its operations. In our example, this means that Paul can
pass the photo album Emmo as a whole to Peter, for instance via email or a file-sharing
peer-to-peer infrastructure, and Peter can do anything with the Emmo that Paul can also
do, including invoking its operations.
Emmos also support versioning. Every constituent of an Emmo is versionable, an
essential prerequisite for applications requiring the distributed and collaborative
authoring of multimedia content. This means that Peter, having received the Emmo from
Paul, can add his own pictures to the photo album while Paul can still modify his local
copy. Thereby, two concurrent versions of the Emmo are created. As the Emmo model
is able to distinguish both versions, Paul can merge them into a final one when he receives
Peter's changes.
Related Approaches
The fundamental idea underlying the concept of Emmos presented above is
that an Emmo constitutes an object unifying three different aspects of multimedia
content, namely the media aspect, the semantic aspect, and the functional aspect. In the
following, we fortify our claim that this idea is unique.
Interrelating basic media like single images and videos to form multimedia content
is the task of multimedia document models. Recently, several standards for multimedia
document models have emerged (Boll et al., 2000), such as HTML (Raggett et al., 1999),
XHTML+SMIL (Newman et al., 2002), HyTime (ISO/IEC JTC 1/SC 34/WG 3, 1997),
MHEG-5 (ISO/IEC JTC 1/SC 29, 1997), MPEG-4 BIFS and XMT (Pereira & Ebrahimi, 2002),
SMIL (Ayars et al., 2001), and SVG (Ferraiolo et al., 2003). Multimedia document models
can be regarded as composite media formats that model the presentation of multimedia
content by arranging basic media according to temporal, spatial, and interaction relationships. They thus mainly address the media aspect of multimedia content. Compared to
Emmos, however, multimedia document models neither interrelate multimedia content
according to semantic aspects nor do they allow providing functionality on the content.
They rely on external applications like presentation engines for content processing.
As a result of research concerning the Semantic Web, a variety of standards have
appeared that can be used to model multimedia content by describing the information it
conveys on a semantic level, such as RDF (Lassila & Swick, 1999; Brickley & Guha, 2002),
Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000), MPEG-7 (especially MPEG-7's graph
tools for the description of content semantics (ISO/IEC JTC 1/SC 29/WG 11, 2001)), and
Conceptual Graphs (ISO/JTC1/SC 32/WG 2, 2001). These standards clearly cover the
semantic aspect of multimedia content. As they also offer means to address media within
a description, they undoubtedly refer to the media aspect of multimedia content as well.
Compared to Emmos, however, these approaches do not provide functionality on
multimedia content. They rely on external software like database and knowledge base
technology, search engines, user agents, and so forth, for the processing of content
descriptions. Furthermore, media descriptions and the media described are separate
entities, potentially scattered around different places on the Internet, created and
maintained by different and unrelated authorities not necessarily aware of each other and
not necessarily synchronized, whereas Emmos combine media and their semantic
relationships into a single indivisible unit.
There exist several approaches that represent multimedia content by means of
objects. Enterprise Media Beans (EMBs) (Baumeister, 2002) extend the Enterprise Java
Beans (EJBs) architecture (Matena & Hapner, 1998) with predefined entity beans for the
representation of basic media within enterprise applications. These come with rudimental
access functionality but can be extended with arbitrary functionality using the inheritance mechanisms available to all EJBs. Though addressing the media and functional
aspects of content, EMBs in comparison to Emmo are mainly concerned with single media
content and not with multimedia content. Furthermore, EMBs do not offer any dedicated
support for the semantic aspect of content.
Adlets (Chang & Znati, 2001) are objects that represent individual (not necessarily
multimedia) documents. Adlets support a fixed set of predefined functionality which
enables them to advertise themselves to other Adlets. They are thus content representations that address the media as well as the functional aspect. Different from Emmos,
however, the functionality supported by Adlets is limited to advertisement and there is
no explicit modeling of the semantic aspect.
Tele-Action Objects (TAOs) (Chang et al., 1995) are object representations of
multimedia content that encapsulate the basic media of which the content consists and
interlink them with associations. Though TAOs thus address the media aspect of
multimedia content in a way similar to Emmos, they do not adequately cover the semantic
aspect of multimedia content: only a fixed set of five association types is supported,
mainly concerned with temporal and spatial relationships for presentation purposes.
TAOs can further be augmented with functionality. Such functionality is, in contrast to
the functionality of Emmos, automatically invoked as the result of system events rather
than explicitly invoked by applications.
Distributed Active Relationships (Daniel et al., 1998) define an object model based
on the Warwick Framework (Lagoze et al., 1996). In the model, Digital Objects (DOs),
which are interlinked with each other by semantic relationships, act as containers of
metadata describing multimedia content. DOs thus do not address the media aspect of
multimedia content but focus on the semantic aspect. The links between containers can
be supplemented with arbitrary functionality. As a consequence, DOs take account of
the functional aspect as well. Different from Emmos, however, the functionality is not
explicitly invoked by applications but implicitly whenever an application traverses a link
between two DOs.
[Figure 3. The media aspect of the Emmo model: the class MediaProfile (with low-level attributes such as audioChannels, bandWidth, bitRate, colorDomain, contentType, duration, fileFormat, fileSize, frameRate, height, resolution, samplingRate, and width) is linked to one or more MediaInstances (inlineMedia, locationDescription, mediaURL) and addressed via a MediaSelector hierarchy comprising FullSelector, TemporalSelector (beginMs, durationMs), SpatialSelector (startX, startY, endX, endY), TextualSelector (beginChar, endChar), and CompositeSelector (compositionType)]
Media Aspect
Addressing the media aspect of multimedia content, an Emmo encapsulates the
basic media of which the content it represents is composed. Figure 3 presents the excerpt
of the conceptual model which is responsible for this.
Closely following the MPEG-7 standard and its multimedia description tools (ISO/
IEC JTC 1/SC 29/WG 11, 2001), basic media are modeled by media profiles (represented
by the class MediaProfile in Figure 3) along with associated media instances (represented by the class MediaInstance). Media profiles hold low-level metadata describing physical characteristics of the media, such as the storage format, file size, and so forth; the media data itself is represented by media instances, each of which may directly embed the data in the form of a byte array or, if that is not possible or feasible, address its storage location by means of a URI. Moreover, if a digital representation is not available, a textual location description can be specified, for example the location of analog tapes in some tape archive. Figure 3 further shows that a media profile can have more than one media instance. In this way, an Emmo can be provided with information about alternative storage locations of media.
Basic media represented by media profiles and media instances are attached to an
Emmo by means of a connector (see class Connector in Figure 3). A connector does not
just address a basic medium via a media profile; it may also refer to a media selector (see
base class MediaSelector) to address only a part of the medium. As indicated by the
various subclasses of MediaSelector, it is possible to select media parts according to temporal, spatial, and textual criteria, as well as an arbitrary combination of these criteria (see class CompositeSelector). It is thus possible to address the upper
right part of a scene in a digital video starting from second 10 and lasting until second
30 within an Emmo without having to extract that scene and put it into a separate media
file using a video editing tool.
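To make the selector mechanism concrete, here is a minimal sketch in Java; the class and attribute names follow the diagram in Figure 3, but the describe() method and the composition logic are illustrative assumptions, not the actual Emmo API:

```java
import java.util.List;

// Sketch of the selector hierarchy of Figure 3; names follow the class
// diagram, the methods and composition logic are assumptions.
interface MediaSelector {
    String describe();
}

class TemporalSelector implements MediaSelector {
    final int beginMs, durationMs;
    TemporalSelector(int beginMs, int durationMs) {
        this.beginMs = beginMs;
        this.durationMs = durationMs;
    }
    public String describe() {
        return "time " + beginMs + "ms +" + durationMs + "ms";
    }
}

class SpatialSelector implements MediaSelector {
    final int startX, startY, endX, endY;
    SpatialSelector(int startX, int startY, int endX, int endY) {
        this.startX = startX; this.startY = startY;
        this.endX = endX; this.endY = endY;
    }
    public String describe() {
        return "region (" + startX + "," + startY + ")-(" + endX + "," + endY + ")";
    }
}

// A composite selector combines several criteria, e.g. a spatial region
// of a video scene during a given time interval.
class CompositeSelector implements MediaSelector {
    final List<MediaSelector> parts;
    CompositeSelector(List<MediaSelector> parts) { this.parts = parts; }
    public String describe() {
        StringBuilder sb = new StringBuilder();
        for (MediaSelector s : parts) {
            if (sb.length() > 0) sb.append(" and ");
            sb.append(s.describe());
        }
        return sb.toString();
    }
}
```

A CompositeSelector combining a TemporalSelector and a SpatialSelector then describes exactly the kind of selection mentioned above: a part of a video scene from second 10 to second 30, restricted to a rectangular region.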
Semantic Aspect
Out of the basic media which it contains, an Emmo forges a piece of semantically
modeled multimedia content by describing these media and their semantic interrelationships. The class diagram of Figure 4 gives an overview of the part of the Emmo model
that provides these semantic descriptions. As one can see, the basic building blocks of
the semantic descriptions, the so-called entities, are subsumed under the common base
class Entity. The Emmo model distinguishes four kinds of entities: namely, logical media parts, associations, ontology objects, and Emmos themselves, represented by assigned subclasses. These four kinds of entities have a common nature but each extends the abstract notion of an entity with additional characteristic features.

[Figure 4. Overview of the semantic part of the Emmo model: LogicalMediaPart, Association, OntologyObject, and Emmo as subclasses of the common base class Entity]

[Figure 5. Characteristics common to all entities: the attributes OID, name, description, creationDate, modifiedDate, and creator, plus attribute values (AttributeValue with a value and an attribute ontology object), types (ontology objects), and predecessor/successor versions]
Figure 5 depicts the characteristics that are common to all kinds of entities. Each
entity is globally and uniquely identified by its OID, realized by means of a universal
unique identifier (UUID) (Leach, 1998) which can be easily created even in distributed
scenarios. To enhance human readability and usability, each entity is further augmented
with additional attributes like a name and a textual description. Moreover, each entity
holds information about its creator and its creation and modification date.
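A minimal sketch of these common entity characteristics; the field names follow Figure 5, while the constructor and the rename() method are assumptions for illustration:

```java
import java.util.UUID;

// Sketch of the common entity characteristics of Figure 5.
class Entity {
    final String oid;          // globally unique, UUID-based (Leach, 1998)
    String name;               // human-readable name
    String description;        // textual description
    final long creationDate;
    long modifiedDate;
    final String creator;

    Entity(String name, String creator) {
        // a UUID can be created safely even at distributed sites
        this.oid = UUID.randomUUID().toString();
        this.name = name;
        this.creator = creator;
        this.creationDate = System.currentTimeMillis();
        this.modifiedDate = this.creationDate;
    }

    void rename(String newName) {
        this.name = newName;
        this.modifiedDate = System.currentTimeMillis();  // track modification
    }
}
```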
Figure 5 further expresses that entities may receive an arbitrary number of types. A
type is a concept taken from an ontology and represented by an ontology object in the
model. Types thus constitute entities themselves. By attaching types, an entity gets
meaning and is classified in an application-dependent ontology. As mentioned before,
the Emmo model does not come with a predefined set of ontology objects but instead relies on applications to agree on a common ontology before the Emmo model can be used.
In the example of Figure 6, the entity Picture 3 of kind logical media part (depicted
as a rectangle), which represents the third picture of our example photo album of the
[Figure 6. The logical media part Picture 3, depicted as a rectangle, with the types digital image and photograph]

[Figure. Picture 3 augmented with a date attribute value of 07/28/2003]

[Figure. Versioning of Picture 3: one version (pred/succ) adds a date attribute value, another adds an aperture attribute value, and the two versions are later merged into a version holding both]
locations. One version augments the logical media part with a date attribute value to
denote the creation date of the picture whereas the other provides an attribute value
describing the aperture with which the picture was taken. Finally, as shown by the logical
media part at the right side of the figure, these two versions were merged again into a
fourth that now holds both attribute values.
Having explained the common characteristics shared by all entities, we are now able
to introduce the peculiarities of the four concrete kinds of entities: logical media parts,
ontology objects, associations, and Emmos.
Ontology Objects
Ontology objects are entities that represent concepts of an ontology. We have
already described how ontology objects are used to define entity types and to augment
entities with attribute values. By relating entities such as logical media parts to ontology
objects, they can be given a meaning. As can be seen from the class diagram of Figure 10, the Emmo model distinguishes two kinds of ontology objects represented by two
subclasses of OntologyObject: Concept and ConceptRef. Whereas an instance of
Concept serves to represent a concept of an ontology that is fully captured within the
Emmo model, ConceptRef allows one to reference concepts of ontologies specified in
external ontology languages such as RDF Schema (Brickley & Guha, 2002). The latter is
a pragmatic tribute to the fact that we have not developed an ontology language for
Emmos yet and therefore rely on external languages for this purpose. References to concepts of external ontologies additionally need a special ID (objOID) uniquely identifying the external concept referenced and a label indicating the format of the ontology (ontStandard); for example, RDF Schema.

[Figure 10. The two kinds of ontology objects: Concept and ConceptRef (with objOID and ontStandard attributes) as subclasses of OntologyObject; LogicalMediaPart carries ontType and masterProfileID attributes]

[Figure 11. Connector and Association: an association has exactly one source and one target entity]
Associations
Associations are entities that establish binary directed relationships between
entities, allowing the creation of complex and detailed descriptions of the multimedia
content represented by the Emmo. As one can see from Figure 11, each association has
exactly one source entity and one target entity. The kind of semantic relationship
represented by an association is defined by the association's type, which is, like the types of other entities, an ontology object representing the concept that captures the type in an ontology. Different from other entities, however, an association is only permitted to have one type, as it can express only a single kind of relationship.
Since associations are first-class entities, they can take part as sources or targets
in other associations like any other entities. This feature permits the creation of very
complex content descriptions, as it facilitates the reification of statements (statements
about statements) within the Emmo model.
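The following sketch, with minimal invented classes rather than the actual Emmo API, illustrates why first-class associations enable reification, such as the statement that Peter thinks that Mary fancies Picture 3:

```java
// Minimal sketch: associations as first-class entities. Because an
// Association is itself an Entity, it can be the source or target of
// another association, which reifies the statement it makes.
class Entity {
    final String name;
    Entity(String name) { this.name = name; }
}

class Association extends Entity {
    final Entity source, target;
    // here the type is a plain string; in the real model it is an
    // ontology object
    Association(String type, Entity source, Entity target) {
        super(type);
        this.source = source;
        this.target = target;
    }
}
```

With these classes, "Mary fancies Picture 3" is an association, and "Peter thinks (Mary fancies Picture 3)" is an association whose target is that association, a statement about a statement.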
[Figure 12. Reification example: an association states that Peter thinks that Mary fancies Picture 3]
Emmos
Emmos themselves, finally, constitute the fourth kind of entities. An Emmo is
basically a container that encapsulates arbitrary entities to form a semantically modeled
piece of multimedia content (see the aggregation between the classes Emmo and Entity
in the introductory outline of the model in Figure 4). As one and the same entity can be
contained in more than one Emmo, it is possible to encapsulate different, context-dependent, and even contradicting views onto the same content within different Emmos; as Emmos are first-class entities, they can be contained within other Emmos and take part
in associations therein, allowing one to build arbitrarily nested Emmo structures for the
logical organization of multimedia content. These are important characteristics especially useful for the authoring process, as they facilitate reuse of existing Emmos and the
content they represent.
Figure 13 shows an example where a particular Emmo encapsulates another. In the
figure, Emmos are graphically shown as ellipses. The example depicts an Emmo modeling
a private photo gallery that up to the moment holds only a single photo album (again
modeled by an Emmo): namely, the photo album of the journey to Europe we used as a motivating example in the section illustrating the Emmo idea. Via an association, this album is classified as vacation within the photo gallery. In the course of time, the photo gallery might become filled with additional Emmos representing further photo albums; for example, one that keeps the photos of a summer vacation in Spain. These Emmos can be related to each other. For example, an association might express that the journey to Europe took place before the summer vacation in Spain.

[Figure 13. A private photo gallery Emmo (ellipse) encapsulating the Journey to Europe Emmo, which contains Pictures 1 to 3 (with types such as digital image and photograph) and associations such as depicts and location relating them to Paris, Vienna, Paul, and Mary; within the gallery, the album is classified as vacation]
Functional Aspect
Emmos also address the functional aspect of multimedia content. Emmos may offer
operations that realize arbitrary content-specific functionality which makes use of the
media and descriptions provided with the media and semantic aspects of an Emmo and
which can be invoked by applications working with content. The class diagram of Figure
14 shows how this is realized in the model. As expressed in the diagram, an Emmo may
aggregate an arbitrary number of operations represented by the class of the same name.
Each operation has a designator, that is, a name that describes its functionality, which
is represented by an ontology object. Similar to attributes, the motivation behind using
concepts of an ontology as operation designators instead of simple string identifiers is
that this allows one to express restrictions on the usage of operations within an ontology;
for example, the types of Emmo for which an operation is available, the types of the
expected input parameters, and so forth.
The functionality of an operation is provided by a dedicated implementation class whose name is captured by an operation's implClassName attribute to permit the dynamic instantiation of the implementation class at runtime. There are not many
restrictions for such an implementation class: the Emmo model merely demands that an
implementation class realizes the OperationImpl interface. OperationImpl enforces the
implementation of a single method only: namely, the method execute() which expects
the Emmo on which an operation is executed as its first parameter, followed by a vector of arbitrary operation-dependent parameter objects. Execute() performs the desired functionality and, as a result, may return an arbitrary object.

[Figure 14. The functional aspect: an Emmo aggregates Operations (with an implClassName attribute) whose designators are ontology objects; each Operation instantiates an implementation class realizing the interface OperationImpl with the method execute(in e : Emmo, in args : Object[]) : Object]

[Figure 15. The Journey to Europe Emmo with attached rendering operations such as RenderAsMap (an OperationImpl implementation)]
Figure 15 once more depicts the Emmo modeling the photo album of the journey to
Europe that we already know from Figure 13, but this time enriched with the two
operations already envisioned in the second section: one that traverses the semantic description of the album and returns an SMIL presentation rendering the album as a slide show, and another that returns an SVG presentation rendering the same album as a map.
For both operations, two implementation classes are provided that are attached to the
Emmo and differentiated via their designators renderAsSlideShow and renderAsMap.
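A sketch of how such an operation might look in Java; the OperationImpl interface follows the signature given in Figure 14, while the Emmo stub, the RenderAsSlideShow class, and the reflective instantiation via implClassName are illustrative assumptions:

```java
// The OperationImpl interface as given in Figure 14; everything around
// it (the Emmo stub, the sample implementation, and the runner) is an
// illustrative sketch, not the actual container code.
interface OperationImpl {
    Object execute(Emmo e, Object[] args);
}

class Emmo {
    final String name;
    Emmo(String name) { this.name = name; }
}

// A hypothetical implementation class behind the renderAsSlideShow
// designator.
class RenderAsSlideShow implements OperationImpl {
    public Object execute(Emmo e, Object[] args) {
        // a real implementation would traverse the Emmo's semantic
        // description and emit a full SMIL presentation
        return "<smil><!-- slide show for " + e.name + " --></smil>";
    }
}

class OperationRunner {
    // Dynamically instantiate the implementation class named by the
    // operation's implClassName attribute and invoke it.
    static Object invoke(String implClassName, Emmo e, Object[] args)
            throws Exception {
        Class<?> cls = Class.forName(implClassName);
        OperationImpl impl = (OperationImpl) cls.getDeclaredConstructor().newInstance();
        return impl.execute(e, args);
    }
}
```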
[Figure 16. Distributed Emmo container infrastructure: applications access, manipulate, traverse, and query Emmos (identified by OIDs such as EMMO_1284, e98ea567ea, d456778872) in Emmo containers that persistently store media and semantic relationships and can import/export Emmos between each other]
relational DBMS for persistent storage as well; we opted for an object-oriented DBMS, however, because of these systems' suitability for handling complex graph structures like Emmos.
The second implication of a decentralized infrastructure is that Emmos must be
transferable between the different Emmo containers operated by users that want to share
or collaboratively work on content. This requires Emmo containers to be able to
completely export Emmos into bundles encompassing their media, semantic, and functional aspects, and to import Emmos from such bundles, which is explained in more detail
in the following two subsections.
In the current state of implementation, Emmo containers are rather isolated components, requiring applications to explicitly initiate the import and export of Emmos and to
manually transport Emmo bundles between different Emmo containers themselves. We
are building a peer-to-peer infrastructure around Emmo containers that permits the
transparent search for and transfer of Emmos across different containers.
Exporting Emmos
An Emmo container can export an Emmo into a bundle whose overall structure is
illustrated by Figure 17.
The bundle is basically a ZIP archive which captures all three aspects of an Emmo. The media aspect is captured by the bundle's media folder: the basic media files of which the multimedia content modeled by the Emmo consists are stored in this folder.
[Figure 17. Structure of an Emmo bundle: a ZIP archive containing a media folder with the basic media files, an XML file named after the Emmo's OID capturing its semantic structure, and an operations folder holding the implementation code of the Emmo's operations (e.g., a JAR file)]
The semantic aspect is captured by a central XML file whose name is given by the OID of the bundled Emmo. This XML file captures the semantic structure of the Emmo, thus describing all of the Emmo's entities, the associations between them, the versioning relationships, and so forth.
Figure 18 shows a fragment of such an XML file. It is divided into a <components>
section declaring all entities and media profiles relevant for the current Emmo and a
<links> section capturing all kinds of relationships between these entities and media
profiles, such as types, associations, and so forth.
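A minimal sketch of how a container might write such a bundle with java.util.zip; the folder layout follows the description above (a media folder, an OID-named XML file, and an operations folder), while the method signature and the file names are assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative sketch of exporting an Emmo bundle as a ZIP archive.
class BundleExporter {
    static byte[] export(String oid, Map<String, byte[]> mediaFiles,
                         String semanticXml, Map<String, byte[]> operationJars)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            // media aspect: basic media files in the media folder
            for (Map.Entry<String, byte[]> m : mediaFiles.entrySet()) {
                zip.putNextEntry(new ZipEntry("media/" + m.getKey()));
                zip.write(m.getValue());
                zip.closeEntry();
            }
            // semantic aspect: central XML file named after the Emmo's OID
            zip.putNextEntry(new ZipEntry(oid + ".xml"));
            zip.write(semanticXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            // functional aspect: operation code in the operations folder
            for (Map.Entry<String, byte[]> op : operationJars.entrySet()) {
                zip.putNextEntry(new ZipEntry("operations/" + op.getKey()));
                zip.write(op.getValue());
                zip.closeEntry();
            }
        }
        return buf.toByteArray();
    }
}
```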
The functional aspect of an Emmo is captured by the bundle's operations folder in which the binary code of the Emmo's operations is stored. Here, our choice of Java as the implementation language for Emmo containers comes in handy again, as it allows
The strong mode is the normal mode for an entity. The bundle holds all information
about an entity including its types, attribute values, immediate predecessor and
successor versions, media profiles (in case of a logical media part), contained
entities (in case of an Emmo), and so forth.
The hollow mode is applicable to Emmos only. The hollow mode indicates that the
bundle holds all information about an Emmo except the entities it contains. The
hollow mode appears in bundles where it was chosen not to recursively export encapsulated Emmos. In this case, encapsulated Emmos receive the hollow mode;
the entities encapsulated by those Emmos are excluded from the export.
The weak mode indicates that the bundle contains only basic information about
an entity, such as its OID, name, and description but no types, attribute values, and
so forth. Weak mode entities appear in bundles that have been exported without
versioning information. In this case, the immediate predecessor and successor
versions of exported entities are placed into the bundle in weak mode; indirect
predecessor and successor versions are excluded from the export.
The particular mode of an entity within a bundle is marked with the mode attribute in the entity's declaration in the bundle's XML file (see again Figure 18).
Importing Emmos
When importing an Emmo bundle exported in the way described in the previous
subsection, an Emmo container essentially inserts all media files, entities, and operations
included in the bundle into its local database. In order to avoid duplicates, the container
checks whether an entity with the same OID or whether a media file or JAR file already
exists in the local database before insertion. If a file already exists, the basic strategy of
the importing container is that the local copy prevails.
However, the different export variants for Emmos and the different modes in which entities might occur in a bundle, as well as the fact that in a collaborative scenario Emmos might have been concurrently modified without creating new versions of entities, demand a more sophisticated handling of duplicate entities on the basis of a timestamp protocol. Depending on the modes of the two entities with the same OID in the bundle and in the local database, and on the timestamps of both entities, essentially the following treatment is applied:
A greater mode (weak < hollow < strong) in combination with a more recent timestamp always wins. Thus, if the local entity has a greater mode and a newer timestamp, it prevails, and the entity in the bundle is ignored. Similarly, if the local entity has a lesser mode and an older timestamp, the entity in the bundle completely replaces the local entity in the database.
If the local entity has a more recent timestamp but a lesser mode, additional data
available for the entity in the bundle (entity types, attribute values, predecessor
or successor versions, encapsulated entities in case of Emmos, or media profiles
in case of logical media parts) complements the data of the local entity, thereby
raising its mode.
In case of same modes but a more recent timestamp of the entity in the bundle, the
entity in the bundle completely replaces the local entity in the database.
In case of same modes but a more recent timestamp of the entity in the local
database, the entity in the database prevails and the entity in the bundle is ignored.
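The decision rules above can be sketched as follows; the mode ordering (weak < hollow < strong) and the outcomes follow the text, the enum and method names are assumptions, and the case of a greater local mode combined with a newer bundle timestamp, which the text does not cover explicitly, is treated symmetrically:

```java
// weak < hollow < strong, encoded by enum ordinal
enum Mode { WEAK, HOLLOW, STRONG }

enum Resolution { KEEP_LOCAL, REPLACE_WITH_BUNDLE, COMPLEMENT_LOCAL }

// Sketch of the duplicate handling when an imported entity's OID
// already exists in the local database.
class ImportProtocol {
    static Resolution resolve(Mode local, long localTs, Mode bundle, long bundleTs) {
        int modeCmp = local.compareTo(bundle);   // > 0: local mode is greater
        if (modeCmp == 0) {
            // same modes: the entity with the more recent timestamp prevails
            return bundleTs > localTs ? Resolution.REPLACE_WITH_BUNDLE
                                      : Resolution.KEEP_LOCAL;
        }
        if (modeCmp > 0 && localTs >= bundleTs) {
            // local entity: greater mode and newer timestamp -> it prevails
            return Resolution.KEEP_LOCAL;
        }
        if (modeCmp < 0 && localTs <= bundleTs) {
            // bundle entity: greater mode and newer timestamp -> it replaces
            return Resolution.REPLACE_WITH_BUNDLE;
        }
        // one side has the newer timestamp but the lesser mode: the extra
        // data of the higher-mode entity complements the more recent one
        // (assumed symmetric for the case the text leaves open)
        return Resolution.COMPLEMENT_LOCAL;
    }
}
```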
APPLICATIONS
Having introduced and described the Emmo approach to semantic multimedia
content modeling and the Emmo container infrastructure, this section illustrates how
these concepts have been practically applied in two concrete multimedia content sharing
and collaborative applications. The first application, named CULTOS, is in the domain of cultural heritage; the second introduces a semantic jukebox.
CULTOS
CULTOS is a European Union (EU)-funded project carried out from 2001 to 2003 with 11 partners from EU countries and Israel1. It has been the task of CULTOS to develop a multimedia collaboration platform for authoring, managing, retrieving, and exchanging Intertextual Threads (ITTs) (Benari et al., 2002; Schellner et al., 2003): knowledge structures that semantically interrelate and compare cultural artifacts such as literature, movies, artworks, and so forth. This platform enables the community of intertextual studies to create and exchange multimedia-enriched pieces of cultural knowledge that incorporate the community's different cultural backgrounds, an important contribution to the preservation of European cultural heritage.
ITTs are basically graph structures that describe semantic relationships between
cultural artifacts. They can take a variety of forms, ranging from spiders through centipedes to associative maps, like the one shown in Figure 19.
The example ITT depicted in the figure highlights several relationships of the poem
The Fall by Tuvia Ribner to other works of art. It states that the poem makes reference to the third book of Ovid's Metamorphoses and that the poem is an ekphrasis of the painting Icarus's Fall by the famous Dutch painter Breugel.
The graphical representation of an ITT bears strong resemblance to well-known
techniques for knowledge representation such as concept graphs or semantic nets,
although it lacks their formal rigidity. ITTs nevertheless get very complex, as they
commonly make use of constructs such as encapsulation and reification of statements
that are challenging from the perspective of knowledge representation.
[Figure 19. Example ITT: the poem The Fall by Ribner (Text) references Book 3 of Ovid's Metamorphoses (Text) and is an ekphrasis of the painting Icarus's Fall by Breugel]

[Figure 20. A more complex ITT around The Fall by Ribner, adding entities such as The Fall of Adam&Eve (Genesis, ch. II), relationship types such as Opposed Representation and Cultural Concept, and a reified statement expressing what B. Zoa believes]
[Figure 21. The ITT of Figure 20 represented with Emmos: nested Emmos (Emmo1, Emmo2, Emmo3) encapsulate entities such as The Fall by Ribner, Icarus Fall by Breugel, Metamorphoses by Ovid, and The Fall of Adam&Eve; connectors attach media such as http://.../TheFall.doc, http://.../IcarusFall.jpg, and http://.../Metamorphoses.pdf, and a rendering operation (RenderingImplementation) is attached]
expressiveness to capture ITTs. Figure 21 shows how the complex ITT of Figure 20 could
be represented using Emmos. Due to the fact that associations as well as Emmos
themselves are first-class entities, it is even possible to cope with reification of
statements as well as with encapsulation of ITTs.
Secondly, the media aspect of Emmos allows researchers to enrich ITTs, which so far expressed interrelationships between cultural artefacts on an abstract level, with digital media about these artefacts, such as a JPEG image showing Breugel's painting Icarus's Fall. The ability to consume these media while browsing an ITT certainly enhances the
comprehension of the ITT and the relationships described therein.
Thirdly, with the functional aspect of Emmos, functionality can be attached to ITTs.
For instance, an Emmo representing an ITT in CULTOS offers operations to render itself
in an HTML-based hypermedia view.
Additionally, our Emmo container infrastructure outlined in the previous section
provides a suitable foundation for the realization of the CULTOS platform. The containers' ability to persistently store Emmos, as well as their interfaces, which enable applications to traverse and manipulate the stored Emmos at a fine granularity and invoke their operations, make Emmo containers an ideal ground for the authoring and browsing applications for ITTs
that had to be implemented in the CULTOS project. Figure 22 gives a screenshot of the
authoring tool for ITTs that has been developed in the CULTOS project which runs on
top of an Emmo container.
Moreover, their decentralized approach allows the setup of independent Emmo
containers at the sites of different researchers; their ability to import and export Emmos
with all the aspects they cover facilitates the exchange of ITTs, including the media by
which they are enriched as well as the functionality they offer. This enables researchers
to share and collaboratively work on ITTs in order to discover and establish new links
between artworks as well as different personal and cultural viewpoints, thereby paving the way to novel insights into a subject. The profound versioning support within the Emmo model further enhances this kind of collaboration, allowing researchers to concurrently create
different versions of an ITT at different sites, to merge these versions, and to highlight
differences between these versions.
Semantic Jukebox
One of the most prominent (albeit legally disputed) multimedia content sharing
applications is the sharing of MP3 music files. Using peer-to-peer file sharing infrastructures such as Gnutella, many users gather large song libraries on their home PCs which
they typically manage with one of the many jukebox programs available, such as Apple's iTunes (Apple Computer, n.d.). The increasing use of ID3 tags (ID3v2, n.d.), optional free-text attributes within MP3 files capturing metadata like the interpreter, title, and genre of a song, alleviates the management of such libraries.
Nevertheless, ID3-based song management quickly reaches its limitations. While
ID3 tags enable jukeboxes to offer reasonably effective search functionality for songs (provided the authors of ID3 descriptions spell the names of interpreters, albums, and genres consistently), more advanced access paths to song libraries are difficult to realize.
Apart from other songs of the same band or genre, for instance, it is difficult to find songs
similar to the one that is currently playing. In this regard, it would also be interesting to
be able to navigate to other bands in which artists of the current band played as well or
with which the current band appeared on stage together. But such background knowledge cannot be captured with ID3 tags.
Using Emmos and the Emmo container infrastructure, we have implemented a
prototype of a semantic jukebox that considers background knowledge about music. The
experience we have gained from this prototype shows that the Emmo model is well-suited
to represent knowledge-enriched pieces of music in a music sharing scenario. Figure 23
gives a sketch of such a music Emmo which holds some knowledge about the song
Round Midnight.
Figure 23. Knowledge about the song Round Midnight represented by an Emmo
[The figure shows the logical media part Round Midnight, connected via Connector 1 to a media profile with the file http://.../roundmid.mp3; associations typed by the ontology objects Composition, Artist, Performance, and Record express that it was composed by Thelonious Monk, played by Miles Davis, and has a manifestation on the record Round about Midnight with a date of issue of 10/26/1955; a rendering operation (RenderAsTimelineinSVG) is attached]
Its media aspect enables the depicted Emmo to act as a container of MP3 music files.
In our example, this is a single MP3 file with the song Round Midnight that is connected
as a media profile to the logical media part Round Midnight in the center of the figure.
The Emmo's semantic aspect allows us to express rich background knowledge
about music files. For this purpose, we have developed a basic ontology for the music
domain featuring concepts such as Artist, Performance, Composition, and Record
that all appear as ontology objects in the figure. The ontology also features various
association types which allow us to express that Round Midnight was composed by Thelonious Monk and that the particular performance by Miles Davis can be found on the record Round about Midnight.
The ontology also defines attributes for expressing temporal information like the
issue date of a record.
The functional aspect, finally, enables the Emmo to support different renditions of
the knowledge it contains. To demonstrate this, we have realized an operation that, being
passed a time interval as its parameter, produces an SVG timeline rendition (see
screenshot of Figure 24) arranging important events, like the founding of bands and the dates of birth and death of artists, around a timeline. More detailed
information for each event can be gained by clicking on the particular icons on the
timeline.
Further operations could be imagined; for example, operations that provide rights
clearance functionality for the music files contained in the Emmo, which is a crucial issue
in music sharing scenarios.
CONCLUSION
Current approaches to semantic multimedia content modeling typically regard the
basic media which the content comprises, the description of these media, and the
functionality of the content as conceptually separate entities. This leads to difficulties
with multimedia content sharing and collaborative applications. In reply to these
difficulties, we have proposed Enhanced Multimedia Meta Objects (Emmos) as a novel
approach to semantic multimedia content modeling. Emmos coalesce the media of which
multimedia content consists, their semantic descriptions, and the functionality of the content into single, indivisible objects. Emmos in their entirety are serializable and
versionable, making them a suitable foundation for multimedia content sharing and
collaborative applications. We have outlined a distributed container infrastructure for
the persistent storage and exchange of Emmos. We have illustrated how Emmos and the
container infrastructure were successfully applied for the sharing and collaborative
authoring of multimedia-enhanced intertextual threads in the CULTOS project and for the
realization of a semantic jukebox.
We strive to extend the technological basis of Emmos. We are currently developing
a query algebra, which permits declarative querying of all the aspects of multimedia
content captured by Emmos, and integrating this algebra within our Emmo container
implementation. Furthermore, we are wrapping the Emmo containers as services in a peer-to-peer network in order to provide seamless search for and exchange of Emmos in a
distributed scenario. We also plan to develop a language for the definition of ontologies
that is adequate for use with Emmos. Finally, we are exploring the handling of copyright
and security within the Emmo model. This is certainly necessary as Emmos might not just
contain copyrighted media material but also carry executable code with them.
REFERENCES
Benari, M., Ben-Porat, Z., Behrendt, W., Reich, S., Schellner, K., & Stoye, S. (2002).
Organizing the knowledge of arts and experts for hypermedia presentation. Proceedings of the Conference of Electronic Imaging and the Visual Arts, Florence,
Italy.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American.
Boll, S., Klas, W., & Westermann, U. (2000). Multimedia document formats - sealed fate
or setting out for new shores? Multimedia - Tools and Applications, 11(3).
Brickley, D., & Guha, R.V. (2002). Resource description framework (RDF) vocabulary
description language 1.0: RDF Schema. W3C Working Draft, World Wide Web
Consortium (W3C).
Chang, H., Hou, T., Hsu, A., & Chang, S. (1995). Tele-Action objects for an active
multimedia system. Proceedings of the International Conference on Multimedia
Computing and Systems (ICMCS 1995), Ottawa, Canada.
Chang, S., & Znati, T. (2001). Adlet: An active document abstraction for multimedia
information fusion. IEEE Transactions on Knowledge and Data Engineering,
13(1).
Daniel, R., Lagoze, D., & Payette, S. (1998). A metadata architecture for digital libraries.
Proceedings of the Advances in Digital Libraries Conference, Santa Barbara,
California.
Fensel, D. (2001). Ontologies: A silver bullet for knowledge management and electronic
commerce. Heidelberg: Springer.
Ferraiolo, J., Jun, F., & Jackson, D. (2003). Scalable vector graphics (SVG) 1.1. W3C
Recommendation, World Wide Web Consortium (W3C).
Gnutella (n.d.). Retrieved 2003 from http://www.gnutella.com
Grimson, J., Stephens, G., Jung, B., et al. (2001). Sharing health-care records over the
internet. IEEE Internet Computing, 5(3).
ID3v2 (n.d.). [Computer software]. Retrieved 2004 from http://www.id3.org
ISO/IEC JTC 1/SC 29 (1997). Information technology - Coding of hypermedia information - part 5: Support for base-level interactive applications. ISO/IEC International Standard 13522-5:1997, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 29/WG 11 (2001). Information technology - Multimedia content
description interface - part 5: Multimedia description schemes. ISO/IEC Final
Draft International Standard 15938-5:2001, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 34/WG 3 (1997). Information technology - Hypermedia/time-based
structuring language (HyTime). ISO/IEC International Standard 10744:1997,
International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 34/WG 3 (2000). Information technology - SGML applications - topic
maps. ISO/IEC International Standard 13250:2000, International Organization for
Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/JTC1/SC 32/WG 2 (2001). Conceptual graphs. ISO/IEC International Standard,
International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
Lagoze, C., Lynch, C., & Daniel, R. (1996). The Warwick Framework: A container
architecture for aggregating sets of metadata. Technical Report TR 96-1593,
Cornell University, Ithaca, New York.
Lassila, O., & Swick, R.R. (1999). Resource description framework (RDF) model and
syntax specification. W3C Recommendation, World Wide Web Consortium (W3C).
Leach, P. J. (1998, February). UUIDs and GUIDs. Network Working Group Internet-Draft,
The Internet Engineering Task Force (IETF).
Matena, V., & Hapner, M. (1998). Enterprise Java Beans TM. Specification Version 1.0,
Sun Microsystems Inc.
Nejdl, W., Wolf, B., Qu, C., et al. (2002). EDUTELLA: A P2P networking infrastructure
based on RDF. Proceedings of the Eleventh International World Wide Web
Conference (WWW 2002), Honolulu, Hawaii.
Newman, D., Patterson, A., & Schmitz, P. (2002). XHTML+SMIL profile. W3C Note,
World Wide Web Consortium (W3C).
Pereira, F., & Ebrahimi, T. (Eds.) (2002). The MPEG-4 book. CA: Pearson Education.
Reich, S., Behrendt, W., & Eichinger, C. (2000). Document models for navigating digital
libraries. Proceedings of the Kyoto International Conference on Digital Libraries, Kyoto, Japan.
Raggett, D., Le Hors, A., & Jacobs, I. (1999). HTML 4.01 specification. W3C Recommendation, World Wide Web Consortium (W3C).
Chapter 14
Semantically Driven
Multimedia Querying
and Presentation
Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA
ABSTRACT
Semantics can play an important role in multimedia content retrieval and presentation.
Although a complete semantic description of a multimedia object may be difficult to
generate, we show that even a limited description can be explored so as to provide
significant added functionality in the retrieval and presentation of multimedia. In this
chapter we describe the DelaunayView that supports distributed and heterogeneous
multimedia sources and proposes a flexible semantically driven approach to the
selection and display of multimedia content.
INTRODUCTION
The goal of a semantically driven multimedia retrieval and presentation system is
to explore the semantics of the data so as to provide the user with rich selection criteria
and an expressive set of relationships among the data, which will enable the meaningful
extraction and display of the multimedia objects. The major obstacle in developing such
a system is the lack of an accurate and simple way of extracting the semantic content that
is encapsulated in multimedia objects and in their inter-relationships. However, metadata
that reflect multimedia semantics may be associated with multimedia content.
BACKGROUND
A multimedia presentation system relies on a number of technologies for describing, retrieving and presenting multimedia content. XML (Bray et al., 2000) is a widely
accepted standard for interoperable information exchange. MPEG-7 (Martinez, 2003;
Chang et al., 2001) makes use of XML to create rich and flexible descriptions of multimedia
content. DelaunayView relies on multimedia content descriptions for the retrieval and
presentation of content, but it uses RDF (Klyne & Carroll, 2004) rather than XML. We
chose RDF over XML because of its richer modeling capabilities, whereas in other components of the DelaunayView system we have used XML (Cruz & Huang, 2004).
XML specifies a way to create structured documents that can be easily exchanged
over the Web. An XML document contains elements that encapsulate data. Attributes
may be used to describe certain properties of the elements. Elements participate in
hierarchical relationships that determine the document structure. XML Schema (Fallside,
2001) provides tools for defining elements, attributes, and document structure. One can
define typed elements that act as building blocks for a particular schema. XML Schema
also supports inheritance, namespaces, and uniqueness.
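The points above can be illustrated with a small sketch in Python's standard `xml.etree.ElementTree` module; the document, its element names, and its values are hypothetical, not drawn from any schema in this chapter:

```python
import xml.etree.ElementTree as ET

# A small, hypothetical XML document: elements encapsulate data,
# attributes describe properties of elements, and nesting defines
# the hierarchical document structure.
doc = """
<airplanes>
  <airplane type="commercial">
    <model>B747</model>
    <power-plant>PW4062</power-plant>
  </airplane>
</airplanes>
"""

root = ET.fromstring(doc)
plane = root.find("airplane")
kind = plane.get("type")             # attribute of an element
model = plane.find("model").text     # data held by a child element
```

Here the `airplane` element is a child of `airplanes`, carries a `type` attribute, and encapsulates its data in the `model` and `power-plant` subelements.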
MPEG-7 (Martinez, 2003) defines a set of tools for creating rich descriptions of
multimedia content. These tools include Descriptors, Description Schemes (DS)
(Salembier & Smith, 2001) and the Description Definition Language (DDL) (Hunter,
2001). MPEG-7 descriptions can be expressed in XML or in binary format. Descriptors
represent low-level features such as texture and color that can be extracted automatically.
Description Schemes are composed of multiple Descriptors and Description Schemes to
create more complex descriptions of the content. For example, the MediaLocator DS
describes the location of a multimedia item. The MediaLocator is composed of the
MediaURL descriptor and an optional MediaTime DS: the former contains the URL that
points to the multimedia item, while the latter is meaningful in the case where the
MediaLocator describes an audio or a video segment. Figure 1 shows an example of a
MediaLocator DS and its descriptors RelTime and Duration that, respectively, describe
the start time of a segment relative to the beginning of the entire piece and the segment
duration.
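A MediaLocator description of this kind might be rendered and read as follows; the element names (MediaURL, MediaTime, RelTime, Duration) follow the chapter's description rather than a normative MPEG-7 schema, and the URL and time values are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the MediaLocator DS described in the
# text: a MediaURL descriptor plus an optional MediaTime DS whose
# RelTime and Duration describe an audio or video segment.
fragment = """
<MediaLocator>
  <MediaURL>http://example.org/media/talk.mpg</MediaURL>
  <MediaTime>
    <RelTime>PT1M30S</RelTime>
    <Duration>PT45S</Duration>
  </MediaTime>
</MediaLocator>
"""

loc = ET.fromstring(fragment)
url = loc.findtext("MediaURL")                 # where the item lives
start = loc.findtext("MediaTime/RelTime")      # start relative to the whole piece
length = loc.findtext("MediaTime/Duration")    # duration of the segment
```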
The Resource Description Framework (RDF) offers an alternative approach to
describing multimedia content. An RDF description consists of statements about
resources and their properties. An RDF resource is any entity identifiable by a URI. An
RDF statement is a triple consisting of subject, predicate, and object. The subject is the
resource about which the statement is being made. The predicate is the property being
described. The object is the value of this property. RDF Schema (RDFS) (Brickley &
Guha, 2001) provides mechanisms for defining resource classes and their properties. If
an RDF document conforms to an RDF schema (expressed in RDFS), resources in the
document belong to classes defined in the schema. A class definition includes the class
name and a list of class properties. A property definition includes a domain (the subject of the corresponding RDF triple) and a range (the object). The RDF Query Language
(RQL) (Karvounarakis et al., 2002) is a query language for RDF and RDFS documents. It
supports a select-from-where structure, basic queries and iterators that combine the
basic queries into nested and aggregate queries, and generalized path expressions.
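The triple model can be made concrete with a toy sketch: each RDF statement is a (subject, predicate, object) tuple, and a helper mimics a single RQL-style path expression such as {A}type{B}. The resource URIs and property names below are illustrative only:

```python
# A toy triple store: each RDF statement is a (subject, predicate,
# object) tuple, as in the definition above.
triples = [
    ("urn:plane/1", "type", "B747"),
    ("urn:plane/1", "power-plant", "urn:engine/7"),
    ("urn:engine/7", "thrust", "115000 lbs"),
]

def match(triples, predicate):
    """Return (subject, object) bindings for one predicate,
    analogous to the RQL path expression {A}predicate{B}."""
    return [(s, o) for s, p, o in triples if p == predicate]

bindings = match(triples, "thrust")
```

A real RQL engine combines several such path expressions, joining them on shared variables; this sketch shows only the binding step for one expression.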
MPEG-7 Description Schemes and DelaunayView differ in their approach to multimedia semantics. In MPEG-7, semantics are represented as a distinct description scheme
that is narrowly aimed at narrative media. The Semantic DS includes description schemes
for places, objects, events, and agents (people, groups, or other active entities)
that operate within a narrative world associated with a multimedia item. In DelaunayView
any description of multimedia content can be considered a semantic description for the
purposes of multimedia retrieval. DelaunayView recognizes that, depending on the
application, almost any description may be semantically valuable. For example, an aerial
photo of the arctic ice sheet depicts some areas of intact ice sheet and others of open
water. Traditional image processing techniques can be applied to the photo to extract
light and dark regions that represent ice and open water respectively. In the climate
research domain, the size, shape, and locations of these regions constitute the semantic
description of the image. In MPEG-7, however, this information will be described with the
StillRegion DS, which does not carry semantic significance.
Beyond their diverse perspectives on the nature of the semantics that they
incorporate, MPEG-7 and DelaunayView use different approaches to representing semantic descriptions: MPEG-7 uses XML, while DelaunayView uses RDF. An XML document
is structured according to the tree paradigm: each element is a node and its children are
the nodes that represent its subelements. An RDF document is structured according to
the directed graph paradigm: each resource is a node and each property is a labeled
directed edge from the subject to the object of the RDF statement. Unlike XML, where
schema and documents are separate trees, an RDF document and its schema can be
thought of as a single connected graph. This property of RDF enables straightforward
implementation of more powerful keyword searches as a means of selecting multimedia
for presentation. Thus using RDF as an underlying description format gives users more
flexibility in selecting content for presentation.
Another distinctive feature between MPEG-7 and DelaunayView is the focus of the
latter on multimedia presentation. A reference model for intelligent multimedia presentation systems encompasses an architecture consisting of control, content, design,
realization, and presentation display layers (Bordegoni et al., 1997). The user interacts
with the control layer to direct the process of generating the presentation. The content
layer includes the content selection component that retrieves the content, the media
allocation component that determines in what form content will be presented, and
ordering components. The design layer produces the presentation layout and further
defines how individual multimedia objects will be displayed. The realization layer
produces the presentation from the layout information provided by the design layer. The
presentation display layer displays the presentation. Individual layers interact with a
knowledge server that maintains information about customization.
LayLab demonstrates an approach to multimedia presentation that makes use of
constraint solving (Graf, 1995). This approach is based on primitive graphical constraints
such as under or beside that can be aggregated into complex visual techniques (e.g.,
alignment, ordering, grouping, and balance). Constraint hierarchies can be defined to
specify design alternatives and to resolve overconstrained states. Geometrical placement heuristics are constructs that combine constraints with control knowledge.
Additional work in multimedia presentation and information visualization can be
found in Baral et al. (1998), Bes et al. (2001), Cruz and Lucas (1997), Pattison and Phillips
(2001), Ram et al. (1999), Roth et al. (1996), Shih and Davis (1997), and Weitzman and
Wittenburg (1994).
A PRAGMATIC APPROACH TO
MULTIMEDIA PRESENTATION
In our approach to the design of a multimedia presentation system, we address the
following challenges. Multimedia content is resident in distributed, heterogeneous, and
autonomous sources. However, it is often necessary to access content from multiple
sources. The data models and the design of the sources vary widely and are decided upon
autonomously by the various entities that maintain them. Our approach accommodates
this diversity by using RDFS to describe the multimedia sources in a simple and flexible
way. The schemata are integrated into a single global schema that enables users to
access the distributed and autonomous multimedia sources as if they were a single
source. Another challenge is that the large volume of multimedia objects presented to
the user makes it difficult to perceive and understand the relationships among them. Our
system gives users the ability to construct customized layouts, thus making the semantic
relationships among multimedia objects more obvious.
Case Study
This case study illustrates how multimedia can be retrieved and presented in an
integrated view workspace, using as an example a bill of materials for the aircraft industry.
A bill of materials is a list of parts or components required to build a product. In Figure
2, the manufacturing of commercial airplanes is being planned using a coordinated
visualization composed of three views: a bipartite graph, a bar chart, and a slide sorter.
The bipartite graph illustrates the part-subpart relationship between commercial aircraft
and their engines, the bar chart displays the number of engines currently available in the inventory of a plant or plants, and the slide sorter shows the maps associated with the manufacturing plants.
Figure 2. A coordinated integrated visualization
First, the user constructs a keyword query using the Search Workspace to obtain
a data set. This process may be repeated several times to get data sets related to airplanes,
engines, and plants. The user can preview the data retrieved from the query, further refine
the query, and name the data set for future use.
Then, relationships are selected (if previously defined) or defined among the data
sets, using metadata, a query, or user annotations. In the first two cases, the user selects
a relationship that was provided by the integration layer. An example of such a
relationship would be the connection that is established between the attribute engine of
the airplane data set (containing one engine used in that airplane) and the engine data
set. Other more complex relationships can be established using an RQL query.
Yet another type of relationship can be a connection that is established by the user.
This interface is shown in Figure 3. In this figure and those that follow, the left panel
contains the overall navigation mechanism associated with the interface, allowing for
any other step of the querying or visualization process to be undertaken. Note that we
chose the bipartite component to provide visual feedback when defining binary relationships. This is the same component that is used for the display of bipartite graphs.
The next step involves creating the views, which are built using templates. A data
set can be applied to different templates to form different views. The interface of Figure
4 illustrates a slide sorter of the maps where the manufacturers of aircraft engines are
located. In this process, data attributes of the data set are bound to visual attributes of
the visual template. For example, the passenger capacity of a plane can be applied to the
height of a bar chart. The users also can further change the view to conform to their
preferences, for example, by changing the orientation of a bar chart from vertical to
horizontal. The sorter allows the thumbnails to be sorted by the values of any of the
attributes of the objects that are depicted by the thumbnails. Individual views can be laid
out anywhere on the panel as shown in Figure 5. The user selects the kind of dynamic
interaction between every pair of views by using a simple customization panel.
Figure 3. Relation workspace
In the integrated view, the coordination between individual views has been
established. By selecting a manufacturing plant in the slide sorter, the bar displays the
inventory situation of the selected plant; for example, the availability of each type of
airplane engine. By selecting more plants in the sorter, the bar chart can display the
aggregate number of available engines over several plants for each type of airplane
engine.
There are two ways of displaying relationships: they can be either represented
within the same visualization (as in the bipartite graph of Figure 2) or as a dynamic
relationship between two different views, as in the interaction between the bar chart and
sorter views. Other interactions are possible in our case study. For example, the bipartite
graph can also react to the user selections on the sorter. As more selections of plants
are performed on the sorter, different types of engines produced by the selected
manufacturer(s) appear highlighted. Moreover, the bipartite graph view can be
refreshed to display only the relationship between the corresponding selected items in
the two data sets.
System Architecture
Data Layer
Integration Layer
The integration layer combines all multimedia sources into a single integrated
virtual source. In the context of this layer, a multimedia source is a local source and its
source schema is a local schema. The integrated virtual source is described by the global
schema, which is obtained as a result of the integration of the sources. DelaunayView uses
foreign key relationships to connect individual sources into the integrated virtual source.
Implicit foreign key relationships exist between local sources, but they only become
apparent when all local sources are considered as a whole. The global schema is built
by explicitly defining foreign key relationships. A sequence of foreign key definitions
yields a graph where the local schemata are the subgraphs and the foreign key
relationships are the edges that connect them.
The foreign key relationships are defined with the help of the graphical integration
tool of Figure 9. This tool provides a simple graphical representation of the schemata that
are present in the system and enables the user to specify foreign key relationships
between them. When the user imports a source into the system, its schema is represented
on the left-hand side panel as a box. Individual schemata are displayed on the right-hand
side pane as trees. The user defines a foreign key by selecting a node in each schema that
participates in the relationship, and connecting them by an edge. Figure 9 shows how
a foreign key relationship is defined between airplane and engine schemata. The edge
between engine and name represents that relationship. The graphical integration tool
generates an RDF document that describes all the foreign key relationships defined by
the user.
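The integration step can be sketched as follows: local schemata become subgraphs and each user-drawn foreign key becomes an edge joining them. The schema and field names are taken from the running airplane/engine example; the data structure itself is an assumption, not the tool's actual representation:

```python
# Each local schema contributes its nodes; each foreign key drawn by
# the user contributes an edge connecting two schemata.
local_schemata = {
    "S1": ["airplane", "type", "power-plant"],   # airplane source
    "S2": ["engine", "name", "thrust"],          # engine source
}
foreign_keys = [(("S1", "power-plant"), ("S2", "name"))]

def global_schema(local_schemata, foreign_keys):
    """Build the global schema graph: local schemata are subgraphs,
    foreign key relationships are the edges that connect them."""
    nodes = {(s, n) for s, names in local_schemata.items() for n in names}
    edges = set(foreign_keys)
    return nodes, edges

nodes, edges = global_schema(local_schemata, foreign_keys)
```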
The integration layer contains the mediator engine and the schema repository. The
mediator engine receives queries from the presentation layer, issues queries to the data
sources, and passes results back to the presentation layer. The schema repository
contains the description of the global schema and the mappings from global to local
schemata. The mediator engine receives queries in terms of the global schema (global queries) and translates them into queries in terms of the local schemata of the individual sources (local queries), using the information available from the schema repository. We
demonstrate how local queries are obtained by the following example.
Example 2: The engine database and the airplane database are two local sources
and engine name connects the local schemata. In the airplane schema (Figure 10), engine
name is a foreign key and is represented by the property power-plant and in the engine
schema (Figure 11) it is the key and is represented by the property name. Mappings from
the global schema (Figure 12) to the local schema have the form ([global name], ([local
name], [local schema])). We say that a class or a property in the global schema, x, maps
to a local schema S when (x, (y, S)) is in the set of the mappings. For this example, this
set is:
(airplane, (airplane, S1)),
(type, (type, S1)),
(power-plant, (power-plant, S1)),
(power-plant, (name, S2)),
(engine, (engine, S2)),
(thrust, (thrust, S2)),
(name, (name, S2))
All the mappings are one-to-one, except for the power-plant property that connects
the two schemata; power-plant belongs to a set of foreign key constraints maintained
by the schema repository. These constraints are used to connect the set of results from
the local queries.
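The mapping set above can be written directly as pairs of the form ([global name], ([local name], [local schema])); the helper below checks whether a global name maps to a given local schema, as the "maps to" relation is defined in the text:

```python
# The mapping set of Example 2, verbatim from the text.
mappings = [
    ("airplane", ("airplane", "S1")),
    ("type", ("type", "S1")),
    ("power-plant", ("power-plant", "S1")),
    ("power-plant", ("name", "S2")),
    ("engine", ("engine", "S2")),
    ("thrust", ("thrust", "S2")),
    ("name", ("name", "S2")),
]

def maps_to(x, schema):
    """True when (x, (y, schema)) is in the mapping set for some y."""
    return any(g == x and s == schema for g, (y, s) in mappings)
```

Note that power-plant is the one name that maps to both schemata, which is exactly what makes it the foreign key connecting them.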
The global query QG returns the types of airplanes that have engines with thrust
of 115,000 lb:
select B
from {A}type{B}, {A}power-plant{C}, {C}thrust{D}
where D = 115000 lbs
The mediator engine translates QG into QL1, which is a query over the local schema
S1, and QL2, which is a query over the local schema S2. The from clause of QG contains
three path expressions: {A}type{B}, which contains property type that maps to S1; {A}power-plant{C}, which contains property power-plant that maps both to S1 and to S2; and {C}thrust{D}, which contains property thrust that maps to S2.
To obtain the from clause of a local query, the mediator engine selects those path
expressions that contain classes or properties that map to the local schema. The from
clause of QL1 is: {A}type{B}, {A}power-plant{C}. Similarly, the where clause of a local
query contains only those variables of the global where clause that appear in the local
from clause. D, which is the only variable in the global where clause, does not appear
in the from clause of QL1, so the where clause of QL1 is absent.
The select clause of a local query includes variables that appear in the global select
clause and in the local from clause. B is a part of the global select clause and it appears
in the from clause of QL1, so it will appear in the select clause as well. In addition to the
variables from the global select clause, a local select clause contains variables that are
necessary to perform a join of the local results in order to obtain the global result. These
are the variables in the local from clause that refer to elements of the foreign key
constraint set. C refers to the value of power-plant, which is the only foreign key
constraint, so C is included in the local select clause. Therefore, QL1 is as follows:
select B, C
from {A}type{B}, {A}power-plant{C}
The from clause of QL2 should include {A}power-plant{C} and {C}thrust{D};
power-plant maps to name and thrust maps to thrust in S2. Since D in the global where
clause maps to S2, the local where clause contains D and the associated constraint: D
= 115000 lbs. The global select clause does not contain any variables that map to S2,
so the local select clause contains only the foreign key constraint variable C. The
intermediate version of QL2 is:
select C
from {C}thrust{D}, {A}power-plant{C}
where D = 115000 lbs
The intermediate version of QL2 contains {A}power-plant{C} because power-plant
maps to name in S2. However, {A}power-plant{C} is a special case because it is a foreign
key constraint: we must check whether variables A and C refer to resources that map to
S2. A refers to airplane, therefore it does not map to S2 and {A}power-plant{C} should
be removed from QL2. The final version of QL2 is:
select C
from {C}thrust{D}
where D = 115000 lbs
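The translation rules of this example can be sketched in Python. The mapping sets, variable classes, and query shape come from the running example; the code is a simplification of the mediator's behavior, not its actual implementation:

```python
# Which global names map to each local schema (from Example 2), which
# properties are foreign keys, and which class each variable ranges over.
mappings = {
    "S1": {"airplane", "type", "power-plant"},
    "S2": {"engine", "name", "thrust", "power-plant"},
}
foreign_keys = {"power-plant"}
var_class = {"A": "airplane", "C": "engine"}

def translate(select, paths, where, schema):
    """Derive a local query (select, from, where) from the global one."""
    local_from = []
    for s, p, o in paths:
        if p not in mappings[schema]:
            continue
        # a foreign-key path survives only if its subject's class maps here
        # (this removes {A}power-plant{C} from the final version of QL2)
        if p in foreign_keys and var_class[s] not in mappings[schema]:
            continue
        local_from.append((s, p, o))
    in_from = {v for s, p, o in local_from for v in (s, o)}
    # local where: global constraints whose variable appears locally
    local_where = {v: c for v, c in where.items() if v in in_from}
    # local select: global select vars plus foreign-key join vars
    fk_vars = {o for s, p, o in paths if p in foreign_keys}
    local_select = [v for v in select if v in in_from]
    local_select += [v for v in sorted(fk_vars)
                     if v in in_from and v not in local_select]
    return local_select, local_from, local_where

QG = (["B"],
      [("A", "type", "B"), ("A", "power-plant", "C"), ("C", "thrust", "D")],
      {"D": "115000 lbs"})
QL1 = translate(*QG, "S1")
QL2 = translate(*QG, "S2")
```

Running this reproduces the worked example: QL1 selects B and C over {A}type{B}, {A}power-plant{C}, while the final QL2 selects C over {C}thrust{D} with the thrust constraint.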
In summary, the integration layer connects the local sources into the integrated
virtual source and makes it available to the presentation layer. The interface between the
integration and presentation layers includes the global schema provided by the integration layer, the queries issued by the presentation layer, and the results returned by the
integration layer.
Presentation Layer
The presentation layer enables the user to query the distributed multimedia
sources and to create complex multicomponent coordinated layouts to display the query
results. The presentation layer sends user queries to the integration layer and receives
the data sets, which are the query results. A view is created when a data set is attached
to a presentation template that determines how the images in the data set are to be
displayed. The user specifies the position and the orientation of the view and the
dynamic interaction properties of views in the integrated layout.
Images and metadata are retrieved from the multimedia sources by means of RQL
queries to the RDF multimedia annotations stored at the local sources. In addition to RQL
queries, the user may issue keyword searches. A keyword search has three components:
the keyword, the criteria, and the source. Any of the components is optional. A keyword
will match the class or property names in the schema. The criteria match the values of
properties in the metadata RDF document. The source restricts the results of the query
to that multimedia source. The data sets returned by the integration layer are encapsulated in the data descriptors that associate the query, layout, and view coordination
information with the data set. The following example illustrates how a keyword query gets
translated into an RQL query:
Example 3: The keyword search where keyword = airplane, criteria = Boeing,
and source = aircraftDataSource returns resources that are of class airplane or have
a property airplane, have the property value Boeing, and are located in the source
aircraftDataSource. This search is translated into the RQL query of Figure 13 and sent
to source aircraftDataSource by the integration layer.
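The three-component matching rules of Example 3 can be sketched as a filter over a source's triples; the source name follows the example, while the data and the dictionary-of-sources structure are illustrative assumptions:

```python
# Hypothetical per-source triple data; aircraftDataSource is the
# source named in Example 3.
sources = {
    "aircraftDataSource": [
        ("urn:plane/1", "airplane", "true"),
        ("urn:plane/1", "manufacturer", "Boeing"),
        ("urn:plane/2", "manufacturer", "Airbus"),
    ],
}

def keyword_search(keyword=None, criteria=None, source=None):
    """All three components are optional: the keyword matches class or
    property names, the criteria matches property values, and the
    source restricts which multimedia source is searched."""
    hits = set()
    for name, triples in sources.items():
        if source and name != source:
            continue
        by_keyword = {s for s, p, o in triples if keyword in (None, p)}
        by_criteria = {s for s, p, o in triples if criteria in (None, o)}
        hits |= by_keyword & by_criteria
    return hits
```

With keyword = airplane, criteria = Boeing, and source = aircraftDataSource, only the resource satisfying all three components is returned.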
DelaunayView includes predefined presentation templates that allow the user to build
customized views. The user chooses attributes of the data set that correspond to
template visual attributes. For example, a view can be defined by attaching the arctic
photo dataset (see Example 1) to the slide sorter template, setting the order-by property
of the view to the timestamp attribute of the data set, and setting the image source
property of the view to the reference attribute of the data set. When a tuple in the data
set is to be displayed, image references embedded in it are resolved and images are
retrieved from multimedia sources.
The user may further customize views by specifying their orientation, position
relative to each other, and coordination behavior. Views are coordinated by specifying
a relationship between the initiating view and the destination view. The initiating view
notifies the destination view of initiating events. An initiating event is the change of
view state caused by a user action; selecting an image in a slide sorter, for example.
The destination view responds to initiation events by changing its own state
according to the reaction model selected by the user. Each template defines a set of
initiating events and reaction models.
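The coordination mechanism just described resembles an observer pattern, which can be sketched as follows; the class, method, and data names are hypothetical, not DelaunayView's API:

```python
class View:
    """A view that can both initiate events and react to them."""

    def __init__(self, name):
        self.name = name
        self.destinations = []   # (destination view, reaction model) pairs
        self.state = None

    def coordinate(self, destination, reaction):
        """Register a destination view and its selected reaction model."""
        self.destinations.append((destination, reaction))

    def select(self, item):
        """A user action (e.g., selecting an image in a slide sorter)
        changes this view's state and raises an initiating event."""
        self.state = item
        for view, reaction in self.destinations:
            view.react(reaction, item)

    def react(self, reaction, item):
        """Respond to an initiating event per the chosen reaction model."""
        self.state = reaction(item)

sorter = View("slide sorter")
chart = View("bar chart")
# Reaction model: show the selected plant's engine inventory in the chart.
sorter.coordinate(chart, lambda plant: f"inventory of {plant}")
sorter.select("Everett plant")
```

Selecting a plant in the sorter (the initiating view) changes the bar chart (the destination view) according to the registered reaction model, mirroring the slide sorter/bar chart coordination in the case study.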
In summary, semantics play a central role in DelaunayView architecture. The data
layer makes semantics available as the metadata descriptions and the local schemata. The
integration layer enables the user to define the global schema that adds to the semantics
provided by the data layer. The presentation layer uses semantics provided by the data
and integration layers for source querying, view definition, and view coordination.
FUTURE WORK
Our future work will further address the decentralized nature of the data layer.
DelaunayView can be viewed as a single node in a network of multimedia sources. This
network can be considered from two different points of view. From a centralized
perspective, the goal is to create a single consistent global schema with which queries
can be issued to the entire network as if it were formed by a single database. From a
decentralized data acquisition point of view, the goal is to answer a query submitted at
one of the nodes. The network becomes relevant when data are required that are not
present at the local node.
In the centralized approach, knowledge of the entire global schema is required to
answer a query while in the decentralized approach only the knowledge of paths to the
required information is necessary. In the centralized approach, the global schema is
static. Local sources are connected to each other one by one, resulting in a global
schema that must be modified whenever a local schema changes. Under the decentralized
approach, the integration process can be performed at the time the query is created
(automatically in an ideal system) by discovering the data available at the other nodes.
A centralized global schema must resolve inconsistencies in schema and data in a
globally optimal manner. Under the decentralized approach, inconsistencies have to be
resolved only at the level of the querying node.
The goal of our future work will be to extend DelaunayView to a decentralized peer-to-peer network. Under this architecture, the schema repository will connect to its
neighbors to provide schema information to the mediator engine. Conceptually, a request
for schema information will be recursively transmitted throughout the network to retrieve
the current state of the distributed global schema, but our implementation will adapt
optimization techniques from the peer-to-peer community to make schema retrieval
efficient. The implementation of the mediator engine and the graphical integration tool
will be modified to accommodate the new architecture.
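The recursive schema-retrieval idea described above can be sketched in a few lines. This is a hypothetical illustration, not the DelaunayView implementation: the node names, the neighbour graph, and the schema-element names are all invented, and a real system would add the peer-to-peer optimizations mentioned above. The essential point is the visited set, which stops a request from cycling through the network forever.

```python
# Hypothetical sketch: a node gathers the distributed global schema by asking
# its neighbours recursively; a visited set prevents cycles. Names invented.

def collect_schemas(node, graph, schemas, visited=None):
    """Gather the schema elements reachable from `node`.

    graph   : dict mapping node -> list of neighbour nodes
    schemas : dict mapping node -> set of local schema element names
    """
    if visited is None:
        visited = set()
    if node in visited:
        return set()
    visited.add(node)
    collected = set(schemas.get(node, ()))
    for neighbour in graph.get(node, ()):   # forward the request to neighbours
        collected |= collect_schemas(neighbour, graph, schemas, visited)
    return collected

# Example: three nodes forming a cycle; the visited set stops the recursion.
graph = {"A": ["B"], "B": ["C"], "C": ["A"]}
schemas = {"A": {"image.title"}, "B": {"video.length"}, "C": {"audio.bitrate"}}
print(sorted(collect_schemas("A", graph, schemas)))
# ['audio.bitrate', 'image.title', 'video.length']
```

In a real peer-to-peer deployment the recursive call would be a network request, and caching or gossip protocols would replace the naive full traversal.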
Another goal is to incorporate MPEG-7 feature extraction tools into the framework.
Feature extraction can be built into the graphical integration tool so that it runs
automatically on the content of new sources as they are added to the system. This
capability will add another layer of metadata
information that will enable users to search for content by specifying low-level features.
CONCLUSIONS
We have discussed our approach to multimedia presentation and querying from a
semantic point of view, as implemented by our DelaunayView system. Our paper describes
how multimedia semantics can be used to enable access to distributed multimedia
sources and to facilitate construction of coordinated views. Semantics are derived from
the metadata descriptions of multimedia objects in the data layer. In the integration layer,
schemata that describe the metadata are integrated into a single global schema that
enables users to view a set of distributed multimedia sources as a single unified source.
In the presentation layer, the system provides a framework for creating customizable
integrated layouts that highlight semantic relationships between the multimedia objects.
The user can retrieve multimedia data sets by issuing RQL queries or keyword searches.
The datasets thus obtained are mapped to presentation templates to create views. The
position, the orientation, and the dynamic interaction of views can be interactively
specified by the user. The view definition process involves the mapping of metadata
attributes to the graphical attributes of a template. The view coordination process
involves the association of metadata attributes from two datasets and the specification
of how the corresponding views interact. By using the metadata attributes, both the view
definition and the view coordination processes take advantage of the multimedia
semantics.
ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation under
Awards ITR-0326284 and EIA-0091489.
We are grateful to Yuan Feng Huang and to Vinay Bhat for their help in implementing
the system, and to Sofia Alexaki, Vassilis Christophides, and Gregory Karvounarakis
from the University of Crete for providing timely technical support of the RDFSuite.
REFERENCES
Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., & Tolle, K. (2000). The
RDFSuite: Managing voluminous RDF description bases. Technical report,
Institute of Computer Science, FORTH, Heraklion, Greece. Online at http://www.ics.forth.gr/proj/isst/RDF/RSSDB/rdfsuite.pdf
Baral, C., Gonzalez, G., & Son, T. C. (1998). Design and implementation of display
specifications for multimedia answers. In Proceedings of the 14th International
Conference on Data Engineering, (pp. 558-565). IEEE Computer Society.
Bes, F., Jourdan, M., & Khantache, F. (2001). A generic architecture for automated
construction of multimedia presentations. In the Eighth International Conference
on Multimedia Modeling.
Bordegoni, M., Faconti, G., Feiner, S., Maybury, M., Rist, T., Ruggieri, S., et al. (1997).
A standard reference model for intelligent multimedia presentation systems.
Computer Standards and Interfaces, 18(6-7), 477-496.
Bray, T., Paoli, J., Sperberg-McQueen, C., & Maler, E. (2000). Extensible markup
language (XML) 1.0 (second edition). W3C Recommendation 6 October 2000.
Online at http://www.w3.org/TR/2000/REC-xml-20001006
Brickley, D., & Guha, R. (2001). RDF vocabulary description language 1.0: RDF schema.
W3C Recommendation 10 February 2004. Online at http://www.w3.org/TR/2004/
REC-rdf-schema-20040210
Cruz, I. F., & Huang, Y. F. (2004). A layered architecture for the exploration of
heterogeneous information using coordinated views. In Proceedings of the IEEE
Symposium on Visual Languages and Human-Centric Computing (to appear).
Cruz, I. F., & James, K. M. (1999). User interface for distributed multimedia database
querying with mediator supported refinement. In International Database Engineering and Application Symposium (pp. 433-441).
Section 5
Emergent Semantics
Chapter 15
Emergent Semantics:
An Overview
ABSTRACT
The semantic gap is recognized as one of the major problems in managing multimedia
semantics. It is the gap between sensory data and semantic models. Often the sensory
data and associated context compose situations which have not been anticipated by
system architects. Emergence is a phenomenon that can be employed to deal with such
unanticipated situations. In the past, researchers and practitioners paid little attention
to applying the concepts of emergence to multimedia information retrieval. Recently,
there have been attempts to use emergent semantics as a way of dealing with the
semantic gap. This chapter aims to provide an overview of the field as it applies to
multimedia. We begin with the concepts behind emergence, cover the requirements of
emergent systems, and survey the existing body of research.
INTRODUCTION
Managing media semantics should not necessarily involve semantic descriptions
or classifications of media objects for future use. Information needs, for a user, can be
task dependent, with the task itself evolving and not known beforehand. In such
situations, the semantics and structure will also evolve, as the user interacts with the
content, based on an abstract notion of the information required for the task. That is,
users can interpret multimedia content, in context, at the time of information need. One
way to achieve this is through a field of study known as emergent semantics.
EMERGENT SYSTEMS
Both complete order (regularity) and complete chaos (randomness) are very simple.
Complexity occurs between the two, at a place known as the edge of chaos (Langton,
1990). Emergence results in complex systems, forming spontaneously from the interactions of many simple units. In nature, emergence is typically expressed in self-assembly,
such as (micro-level) crystal formation and (macro-level) weather systems. These
systems form naturally without centralized control. Similarly, emergence is useful in
computer systems, when centralized control is impractical. The resources needed in these
systems are primarily simple building blocks capable of interacting with each other and
their environment (Holland, 2000). However, we are not interested in all possible complex
systems that may form. We are interested in systems that might form useful semantic
structures. We need to set up environments where the emergence is likely to result in
complex semantic representation or expression (Whitesides & Grzybowski, 2003;
Crutchfield, 1993; Potgeiter & Bishop, 2002). It is therefore necessary to understand the
characteristics and issues involved in emergent information systems.
We lead our discussion through the example of an ant colony. An ant colony is
comprised, primarily, of many small units known as ants. Each ant can only do simple
tasks; for example, walk, carry, lay a pheromone trail, follow a trail, and so forth. However,
the colony is sophisticated enough to thoroughly explore and manage its environment.
Several characteristics of emergent systems are demonstrated in the ant colony
metaphor: interaction, synthesis and self-organization. The main emergent phenomenon is self-organization, expressed in specialized ants being where the colony needs
them, when appropriate. These ants and others, the ant interactions, the synthesis and
self-organization, compose the ant colony. See Bonabeau and Theraulaz (2000) for more
details.
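The pheromone feedback at the heart of the ant-colony metaphor can be illustrated with a small sketch. This is a deterministic mean-field simplification invented for illustration, not a model from the cited literature: an expected number of ants chooses between a short and a long path in proportion to the pheromone on each, shorter round trips lay pheromone at a higher rate, and trails evaporate. All constants are arbitrary.

```python
# Mean-field sketch of stigmergy on a "double bridge": without any central
# control, positive feedback concentrates pheromone on the short path.

def double_bridge(steps=300, ants=50.0, evaporation=0.95):
    pheromone = {"short": 1.0, "long": 1.0}   # both trails start equal
    length = {"short": 1.0, "long": 2.0}      # the long path takes twice as long
    for _ in range(steps):
        total = pheromone["short"] + pheromone["long"]
        for path in pheromone:
            # expected number of ants choosing this path, proportional
            # to its current pheromone level
            choosing = ants * pheromone[path] / total
            # trails evaporate; shorter trips deposit at a higher rate
            pheromone[path] = pheromone[path] * evaporation + choosing / length[path]
    return pheromone

p = double_bridge()
print(p["short"] > p["long"])  # True: the colony has settled on the short path
```

Replacing the expected-value update with individual stochastic ants gives the same qualitative behaviour, only noisier; the point is that the structure emerges from local deposits and evaporation alone.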
This section describes the characteristics and practical issues of emergent systems.
They constitute our requirements. These include, but are not limited to, interaction,
synthesis, self-organization, knowledge representation, context and evaluation.
Information
If humans are to evaluate the emergence, they must either observe system behaviour
(phenotype) or a knowledge representation (genotype). The system representation must
be translatable to terms a human can understand, or to an intermediate representation that
can provide interaction. Typically, for this to be possible, the domain needs to be well
known. Unanticipated events might not be translated well. Though we deal with the
unanticipated, we must communicate in terms of the familiar. Emergence must be in terms
of the system being interpreted. Otherwise we run the risk of infinite regression
(Crutchfield, 1993). The environment, context and user should be included as part of the
system. We need semantic structures, which contain the result of emergence, to be part
of the system.
Context will either determine which of the many interpretations are appropriate or
constrain the interpretation formation. Context is taken mainly from the user or from the
application domain. Spatial and temporal positioning of features can also provide
context, depending on the domain. The significance of specialized information, such as
geographical position or time point, would be part of application domains such as fire
fighting or astronomy. In film theory, the dependence of meaning on ordering is known
as the Kuleshov effect: reordering shots in a scene affects their interpretation
(Davis, Dorai, & Nack, 2003). Context supplies the
system with constraints on relationships between entities. It can also affect the
granularity and form of semantic output: classification, labelled multimedia objects,
metadata, semantic networks, natural language description, or system behaviour. Different people will want to know different things.
Mechanism
The defining characteristic of useful emergent systems is that simple units can
interact to provide complex and useful structures. Interaction is the notion that units
in the system will interact to form a combined entity, which has properties that no unit
has separately. The interaction is significant. Examining the units in isolation will not
completely explain the properties of the whole. For two or more units to interact some
mechanism must exist which enables them to interact. The mere presence of two salient
units doesn't mean that they are able to interact.
Before we can reap the benefits of units interacting, we need units. These units
might be implied by the data. Explicit selection of units by a central controller would not
be part of an emergent process. Emergence involves implicit selection of the right units
to interact. The environment should make it likely for salient units to interact. Possibly
all units interact, with the salient units interacting more. In different contexts, different
units will be the salient units. The context should change which units are more likely to
interact, or the significance of their interaction.
Formation
Bridge laws, linking micro and macro properties, are emergent laws if they are not
semantically implied by initial micro conditions and micro laws (McLaughlin, 2001).
Synthesis involves a group of units composing a recognisable whole. Most
systems instead perform analysis: top-down reduction driven by a control structure.
Synthesis is essentially the interaction mechanism seen at another
level or from a different perspective. A benefit of emergence is that the system designer
is freed from having to anticipate everything. Synthesis involves bottom-up emergence,
which results in a complex structure. The unanticipated interaction of simple units might
carry out an unanticipated and complex task. We lessen the need for a high-level control
structure that tries to anticipate all possible future scenarios. Boids (Reynolds, 1987)
synthesizes flocking behaviour in a population of simple units. Each unit in the flock
follows simple laws, knowing only how to interact with its closest neighbours. The
knowledge of forming a flock isn't stored in any unit in the flock. Unanticipated obstacles
are avoided by the whole flock, which reforms if split up.
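A minimal two-dimensional sketch of the three local rules a boid follows (separation, alignment, cohesion) makes the point concrete: no unit stores the flock, and each boid reads only its neighbourhood. The weights, radius, and starting positions below are illustrative choices, not Reynolds' original parameters.

```python
# Sketch of the boids update: each boid adjusts its velocity using only
# boids within a local radius. Weights are illustrative.

def step(boids, radius=5.0, w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """Advance every (x, y, vx, vy) boid by one tick."""
    new = []
    for i, (x, y, vx, vy) in enumerate(boids):
        sx = sy = ax = ay = cx = cy = 0.0
        n = 0
        for j, (ox, oy, ovx, ovy) in enumerate(boids):
            if i == j:
                continue
            dx, dy = ox - x, oy - y
            if dx * dx + dy * dy < radius * radius:  # local neighbourhood only
                n += 1
                sx -= dx; sy -= dy                   # separation: move apart
                ax += ovx; ay += ovy                 # alignment: match velocity
                cx += ox; cy += oy                   # cohesion: head for centre
        if n:
            vx += w_sep * sx + w_ali * (ax / n - vx) + w_coh * (cx / n - x)
            vy += w_sep * sy + w_ali * (ay / n - vy) + w_coh * (cy / n - y)
        new.append((x + vx, y + vy, vx, vy))
    return new

# four boids with clashing velocities; repeated steps pull them into a flock
flock = [(0, 0, 1, 0), (1, 0, 0, 1), (0, 1, -1, 0), (1, 1, 0, -1)]
for _ in range(50):
    flock = step(flock)
```

Even after one step the spread of velocities narrows, though no boid knows the flock's mean velocity; that is the emergent property.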
Self-Organization involves a population of units which appear to determine their
own collective form and processes. Self-assembly is the autonomous organization of
components into patterns or structures without human intervention (Whitaker, 2003;
Whitesides & Grzybowski, 2003). A similar, though less complex, form of self-organization
occurs in artificial life (Waldrop, 1992), which attempts to mimic biological systems by capturing
an abstract model of evolution. Organisms have genes, which specify simple attributes
or behaviours. Populations of organisms interact to produce complex systems.
ISSUES
Evaluation
The Semantic Gap is industry jargon for the gap between sensory information and
the complex model in a human's mind. The same sensory information provides some of
the units which participate in computational emergence. The semantic structures which
are formed are the system's complex model. Since emergence is not something
controlled, we cannot make sure that the system's complex model will be the same as the
human's complex model. The ant colony is not controlled, though we consider it
successful. If the ant colony self-organized in a different way, we might consider that
structure successful as well. There may be many acceptable, emergent semantic
structures. We need to know whether the semantic emergence is appropriate, to either
the user or task. Therefore, we need to evaluate the emergence, either through direct
communication of the semantic structure or through system behaviour.
Scalability
This notion of scale differs slightly from traditional notions. We can scale with
respect to domain and richness of data. Most approaches for semantics constrain the
domain of knowledge, such that the constraints themselves provide ground truths. If a
system tries to cater for more domains, it loses some of the ground truths. Richness of
data refers to numbers of units and types of units available. If the amount of data is too
small, we might not have enough interaction to create a meaningful structure. A higher
amount of data increases the number of units and types available. Unit pairings increase
exponentially with increasing units. A system where all units try to interact with all other
units might stress the process power of the system. A system without all possible
interactions might miss the salient interactions. It is also uncertain whether increasing
data richness will lead to finer granularity of semantic structure or lesser ability to settle
on a stable structure.
Augmentation
Especially for iterative processes, it might be useful to incrementally add knowledge
back to the system. The danger here is that, in order to reapply what has been learned,
the system will have to recognise situations which have occurred before with different
sensory characteristics; for example, two pictures of the same situation taken from
different angles.
CURRENT RESEARCH
Having described the requirements in abstract, we will now describe the tools and
techniques which address the requirements.
Information
Knowledge representation, for emergence, includes ontology, metadata and genetic-algorithm strings. The use of templates and grammars, which can communicate
semantics in terms of the media, aren't emergent techniques, as their semantic structure
is predefined and the multimedia content anticipated.
Metadata (data about data) can be used as an alternative semantic description of
multimedia content. MPEG-7 has a description stream, which is associated with the
multimedia stream by using temporal operators. The description resides with the data.
However, it is difficult to provide metadata in advance for every possible future
interpretation of an event. The metadata can instead be derived from emergent semantic
structures. If descriptions are needed, natural language can be derived from predicates
associated with modeled concepts (Kojima, Tamura, & Fukunaga, 2002).
Classically, ontology is the study of being. The computer industry uses the term
to refer to fact bases, repositories of properties and relations between objects, and
semantic networks (such as Princeton's WordNet). Some ontology is used for reference,
with multimedia objects or direct sensory inputs being used to index the ontology
(Hoogs, 2001; Kuipers, 2000). Other ontology attempts to capture how humans communicate their own cognitive structures. The Semantic Web attempts to use ontology to
access the semantics implicit in human communication (Maedche, 2002). Semantic
networks consist of a skeleton of low-level data which can be augmented by adding
semantic annotation nodes (Nack, 2002). The low-level data consists of the multimedia
or ground truths, which can act as units in an emergent system. The annotation nodes
can contain the results of emergence, and they are not permanent. This has the
advantage of providing metadata-like properties, which can also be changed for
different contexts.
In genetic algorithms, knowledge representation (genotype) lies in evolving strings.
The strings can contain units and the operators that act on them (Gero & Ding, 1997). The
genotypes evolve over several generations, with the successful genes selected to
generate the next generation. Knowledge and context are acquired across generations.
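The genotype idea can be sketched with a minimal genetic algorithm: strings evolve, and the fitter half of each generation parents the next through crossover and occasional mutation. The "count the ones" fitness used here is an invented stand-in for a domain-specific evaluation; every parameter is illustrative.

```python
import random

# Minimal GA sketch: knowledge lives in evolving bit strings (genotypes);
# selection across generations concentrates the "successful" genes.

def evolve(length=20, pop_size=30, generations=60, seed=0):
    random.seed(seed)
    fitness = sum                                 # phenotype score: count the 1s
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the fitter half
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)     # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:             # occasional point mutation
                i = random.randrange(length)
                child[i] ^= 1
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve()
print(sum(best))  # near the maximum of 20: the fit genotype has emerged
```

No individual operation "knows" the target string; it emerges from selection pressure alone, which is why the same loop works when the fitness function encodes a richer semantic judgement.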
Context is taken from the domain, the data instances, or the user. A user's context
is mainly taken from their interaction history. Their personal history and current mental
state are harder to measure. The user's role during context gathering can be active (direct
manipulation) or passive (observation).
Direct Manipulation
The user can actively communicate context to the system. Santini, Gupta, and Jain
(2001) ask their users to organize images in a database. They use the example of a portrait.
If the portrait is in a cluster of paintings, then the semantic is "painting." If it is in a cluster
of people, the semantic is "people" or "faces." The same image can be a reference to
different referents, which can be intangible ideas as well as tangible objects.
CollageMachine (Kerne, 2002) is a Web browsing tool which tries to predict user
browsing intentions. The system anticipates possible lines of user inquiry and selects
multimedia components along those lines to display. Reorganization of those components, by the
user, is used by the system to adjust its model.
Observation
The context of a multimedia instance is taken from past and future subjects of user
attention. The entire path taken, or group formed, by a user provides an interpretation
for an individual node. The whole provides the context for the part. Emergence depends
on what the user thinks the data are, though the user does not need to know how they
draw conclusions from observing the data. Semantics can be made to emerge by
observing human and machine agent interaction (Staab, 2002). Context, at each point in
the user's path, is supplied by their navigation (Grosky, Sreenath, & Fotouhi, 2002). The
user's interpretation can differ from the author's intentions. The Web is considered a directed graph (nodes: Web pages; edges: links). Adjacent nodes are considered
likely to have similar semantics, though attempts are made to detect points of interest
change. The meaning of a Web page (and multimedia instances in general) emerges
through use and observation.
Grouping
In order to model what humans can observe, it is often helpful to model human
vision. In computer vision, grouping algorithms (based on human vision) are used to
form higher-level structures from units within an image (Engbers & Smeulders, 2003). An
algorithm can be an emergent technique if it can adapt dynamically to context.
Context can come from sources other than users. Multiple media can be associated
with data events to help in disambiguating semantics (Nakamura & Kanade, 1997).
Context in genetic algorithms is sensed over many generations, if one interprets
better performance in the environment as a response to context. The domain, in schemata
agreement, is partly defined by the parties involved.
Mechanism
Mechanisms of automatic, implicit unit selection and interaction are yet to be
developed for semantic emergence. This is a gap in the literature that will need to be filled.
Current mechanisms involve the user as a unit. The semantics emerge through interaction
of the user's own context with multimedia components (Santini & Jain, 1999; Kerne, 2002).
The user decides which things interact, either actively or passively. In genetic algorithms, fitness functions decide how gene strings evolve (Gero & Ding, 1997). Genetic
algorithms can be used to lessen the implicit selection problem by reducing the search
spaces of how units interact, which units interact, and which things are considered units.
A similar situation arises with evaluation of emerged semantics. The current
thinking is that humans are needed to evaluate accuracy or reasonableness. In genetic
algorithms, the representation (genotype) can be evaluated indirectly by testing the
phenotype (expression).
Simply having all the necessary sensory (and other) information present will not
necessarily result in interaction occurring. Information from ontology could be used in
decision making, or in suggesting other units for interaction. Explicitly identifying units
for interaction might be a practical nonemergent step. Units can be feature patterns rather
than individual features (Fan, Gao, Luo, & Hacid, 2003). Templates can be used to search
for units suggested by the ontology. Well-known video structures can be used to locate
salient units within video sequences (Russell, 2000; Dorai & Venkatesh, 2001; Venkatesh
& Dorai, 2001). Data can provide context by affecting the perception or emotions of the
observer. Emotions can be referents. Low-level units, such as tempo and colour, in the
multimedia instance, act as symbols which reference them.
Formation
In genetic algorithms, synthesis occurs between generations. The genotype is self-organizing. With direct manipulation and user observation, synthesis and organization
come in the form of users putting things together.
In schemata agreement, region synthesis leads to self-organization. It is designed
to be adaptable to unfamiliar schemata. Agreement can be used to capture relationships
later. Emergence occurs as pairs of nodes in decentralised P2P (peer-to-peer) systems
attempt to form global semantic agreements by mapping their respective schemata
(Aberer, Cudre-Mauroux, & Hauswirth, 2003). Regions of similar property emerge as
nodes in the network are connected pairwise, and as other nodes link to them (Langley,
2001). Unfortunately, this work does not deal with multimedia. There has been recent
interest in combining the areas of multimedia, data mining and knowledge discovery.
However, the semantics here are not emergent. There is also data mining research into
multimedia using Self-Organizing Maps (SOM), but this is not concerned with semantics
(Petrushin, Kao, & Khan, 2003; Simoff & Zaiane, 2000).
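The Self-Organizing Map mentioned above can be sketched in one dimension: a line of units competes for each input, and the winner together with its index neighbours is pulled toward that input, so nearby units come to represent nearby inputs with no central controller. This is the standard Kohonen update in its simplest form; the data values and all constants below are invented for illustration.

```python
import random

# One-dimensional SOM sketch: competition (winner selection) plus
# cooperation (neighbourhood update) self-organizes the map.

def train_som(data, units=10, epochs=200, lr=0.3, radius=2, seed=0):
    random.seed(seed)
    weights = [random.random() for _ in range(units)]  # 1-D map of 1-D units
    for _ in range(epochs):
        for x in data:
            # competition: the unit closest to the input wins
            best = min(range(units), key=lambda i: abs(weights[i] - x))
            # cooperation: the winner and its index neighbours move toward x
            for i in range(units):
                if abs(i - best) <= radius:
                    weights[i] += lr * (x - weights[i])
    return weights

# three loose clusters of feature values; after training, some unit ends up
# near each cluster, though no unit was ever told about "clusters"
data = [0.05, 0.1, 0.5, 0.55, 0.9, 0.95]
trained = train_som(data)
```

The same loop extends to vector-valued multimedia features by replacing the absolute difference with a vector distance; the emergent property, that map topology comes to mirror input topology, is unchanged.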
CONCLUSION
A major difficulty in researching the concept of emergent semantics in multimedia
is that there are no complete systems integrating the various techniques. While there is
work in knowledge representation, with respect to both semantics and multimedia, to the
best of our knowledge, there's very little in interaction, synthesis and self-organization. There is the work on schemata agreement (nonmultimedia) and some work
on Self-Organizing Maps (nonsemantic), but nothing combining them. The little that has
been done involves users to provide context and genetic algorithms to reduce problem
spaces.
One of the gaps to be filled is developing interaction mechanisms, which enable
possibly unanticipated data to interact with each other and their environment. Even if
we can trust the process, we are still dependent on its inputs: the simple units that
interact. The set of units needs to be sufficiently rich to enable acceptable emergence.
Ideally, salient features (even patterns) should naturally select themselves during
emergence, though this may require participation of all units, placing a high computational load on the system. Part of the problem, for emergence techniques, is that the simple
interactions must occur in parallel, and in numbers great enough to realise self-organization. The future will probably have more miniaturized systems, capable of true
parallelism in quantum computers. A cubic millimetre of the brain holds the equivalent
of 4 km of axonal wiring (Koch, 2001). Perhaps greater parallelism will permit interaction
of all available units.
There is motivation for research into nonverbal computing, where the users are
illiterate (Jain, 2003). Without user ability to issue and access abstract concepts, the
concepts must be inferred. Experiential computing (Jain, 2003; Sridharan, Sundaram, &
Rikasis, 2003) allows users to interact with the system environment, without having to
build a mental model of the environment. They seek a symbiosis formed from human and
machine, taking advantage of their respective strengths. These systems are insight
facilitators. They help us make sense of our own context by engaging our senses directly,
as opposed to being confronted by an abstract description. Experiential computing, while
in its infancy now, might in the future enable implicit relevance feedback. The user's
interactions with the system could cause both emergence and verification of semantics.
REFERENCES
Aberer, K., Cudre-Mauroux, P., & Hauswirth, M. (2003). The chatty Web: Emergent
semantics through gossiping. Paper presented at the WWW2003, Budapest, Hungary.
Bonabeau, E., & Theraulaz, G. (2000). Swarm smarts. Scientific American, 282(3), 54-61.
Crutchfield, J. P. (1993). The calculi of emergence. Paper presented at the Complex
Systems - from Complex Dynamics to Artificial Reality, Numazu, Japan.
Davis, M., Dorai, C., & Nack, F. (2003). Understanding media semantics. Berkeley, CA:
ACM Multimedia 2003 Tutorial.
Dorai, C., & Venkatesh, S. (2001, September 10-12). Bridging the semantic gap in content
management systems: Computational media aesthetics. Paper presented at the
COSIGN 2001: Computational Semiotics (pp. 94-99), CWI Amsterdam.
Engbers, E. A., & Smeulders, A. W. M. (2003). Design considerations for generic
grouping in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4), 445-457.
Fan, J., Gao, Y., Luo, H., & Hacid, M.-S. (2003). A novel framework for semantic image
classification and benchmark. Paper presented at the ACM SIGKDD, Washington,
DC.
Gero, J. S., & Ding, L. (1997). Learning emergent style using an evolutionary approach.
In B. Varma & X. Yao (Eds.), Proceedings of ICCIMA (pp. 171-175), Gold
Coast, Australia.
Grosky, W. I., Sreenath, D. V., & Fotouhi, F. (2002). Emergent semantics and the
multimedia semantic Web. SIGMOD Record, 31(4), 54-58.
Hillmann, D. (2003). Using Dublin core. Retrieved February 16, 2004, from http://
dublincore.org/documents/usageguide/
Holland, J. H. (2000). Emergence: From chaos to order (1st ed.). Oxford: Oxford
University Press.
Hoogs. (2001, 10-12 October). Multi-modal fusion for video understanding. Paper
presented at the 30th Applied Imagery Pattern Recognition Workshop (pp. 103-108), Washington, DC.
Jain, R. (2003). Folk computing. Communications of the ACM, 46(3), 27-29.
Kerne, A. (2002). Concept-context-design: A creative model for the development of
interactivity. Paper presented at the Creativity and Cognition, Vol. 4 (pp. 92-122),
Loughborough, UK.
Koch, C. (2001). Computing in single neurons. In R. A. Wilson & F. C. Keil (Eds.), The
MIT Encyclopedia of the Cognitive Sciences (pp. 174-176). Cambridge, MA: MIT
Press.
Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human
activities from video images based on concept hierarchy of actions. International
Journal of Computer Vision, 50(2), 171-184.
Kuipers, B. J. (2000). The spatial semantic hierarchy. Artificial Intelligence, 119, 191-233.
Langley, A. (2001). Freenet. In A. Oram (Ed.), Peer-to-peer: Harnessing the benefits of
a disruptive technology (pp. 123-132). Sebastopol, CA: O'Reilly.
Langton, C. (1990). Computation at the edge of chaos: Phase transitions and emergent
computation. Physica D, 42(1-3), 12-37.
Maedche. (2002). Emergent semantics for ontologies. IEEE Intelligent Systems, 17(1),
85-86.
McLaughlin, B. P. (2001). Emergentism. In R. A. Wilson & F. C. Keil (Eds.), The MIT
Encyclopedia of the Cognitive Sciences (pp. 267-269). Cambridge, MA: MIT Press.
Minsky, M. L. (1988). The society of mind (1st ed.). New York: Touchstone (Simon &
Schuster).
Nack, F. (2002). The future of media computing. In S. Venkatesh & C. Dorai (Eds.), Media
computing (159-196). Boston: Kluwer.
Nakamura, Y., & Kanade, T. (1997, November). Spotting by association in news video. Paper presented at the Fifth ACM International Multimedia Conference (pp. 393-401), Seattle, Washington.
OED. (2003). Oxford English Dictionary. Retrieved February 2004, from
dictionary.oed.com/entrance.dtl
Petrushin, V. A., Kao, A., & Khan, L. (2003). The Fourth International Workshop on Multimedia Data Mining, MDM/KDD 2003, 6(1), 106-108.
Potgeiter, A., & Bishop, J. (2002). Complex adaptive systems, emergence and engineering: The basics. Retrieved February 20, 2004, from http://people.cs.uct.ac.za/~yng/Emergence.pdf
Reynolds, C. (1987). Flocks, herds, and schools: A distributed behavioral model.
Computer Graphics, 21(4), 25-34.
Russell, D. (2000). A design pattern-based video summarization technique. Paper
presented at the Proceedings of the 33rd Hawaii International Conference on
System Sciences (p. 3048).
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999, January). Interfaces for emergent semantics in multimedia databases. Paper presented at the SPIE, San Jose, California.
Simoff, S. J., & Zaiane, O. R. (2000). Report on MDM/KDD2000: The First International
Workshop on Multimedia Data Mining. SIGKDD Explorations, 2(2), 103-105.
Sridharan, H., Sundaram, H., & Rikakis, T. (2003, November 7). Computational models for experiences in the arts, and multimedia. Paper presented at the First ACM Workshop on Experiential Telepresence, ACM Multimedia 2003, Berkeley, CA, USA.
Staab, S. (2002). Emergent semantics. IEEE Intelligent Systems, 17(1), 78-79.
Venkatesh, S., & Dorai, C. (2001). Computational media aesthetics: Finding meaning
beautiful. IEEE Multimedia, 10-12.
Waldrop, M. M. (1992). Life at the edge of chaos. In Complexity, (pp. 198-240). New York:
Touchstone (Simon & Schuster).
Chapter 16
ABSTRACT
The computation of emergent semantics for blending media into creative compositions
is based on the idea that meaning is endowed upon the media in the context of other
media and through interaction with the user. The interactive composition of digital
content in modern production environments remains a challenging problem since
much critical semantic information resides implicitly within the media, the relationships
between media models, and the aesthetic goals of the creative artist. The composition
of heterogeneous media types depends upon the formulation of integrative structures
for the discovery and management of semantics. This semantics emerges through the
application of generic blending operators and a domain ontology of pre-existing
media assets and synthesis models. In this chapter, we will show the generation of
emergent semantics from blending networks in the domains of audio generation from
synthesis models, automated home video editing, and information mining from
multimedia presentations.
INTRODUCTION
Today, there exists a plethora of pre-existing digital media content, synthesis
models, and authored productions that are available for the creation of new media
productions for games, presentations, reports, illustrated manuals, and instructional
materials for distance education. Technologies from sophisticated authoring environments for nonlinear video editing, audio synthesis, and information management systems are increasingly finding their way into a new class of easy-to-use, partially automated authoring tools. This trend in media production is expanding the life cycle of digital media from content-centric authoring, storage, and distribution to include user-centric semantics for performing stylized compositions, information mining, and the
reuse of the content in ways not envisioned at the time of the original media creation. The
automation of digital media production at a semantic level remains a challenging problem
since much critical information resides implicitly within the media, the relationships
between media, and the aesthetic goals of the creative artist. A key problem in modern
production environments is therefore the discovery and management of media semantics
that emerges from the structured blending of pre-existing media assets. This chapter
introduces a model-based framework for media blending that supports the creative
composition of media elements from pre-existing resources.
The vast quantity of pre-existing media from CDs, the Internet, and local recordings
that are currently available has motivated recent research into automation technologies
for digital media (Davis, 1995; Funkhouser et al., 2004; Kovar & Gleicher, 2003).
Traditional authoring tools require extensive training before the user becomes proficient, and even skilled professionals normally spend enormous amounts of time composing relatively simple productions. This contrasts with the needs of the non-professional media author
who would prefer high-level insights into how media elements can be transformed to create the target production, as well as tools to automate the composition from semantically meaningful models. Such creative insights arise from the ability to flexibly
manipulate information and discover new relationships relative to a given task. However,
current methods of information retrieval and content production do not adequately
support exploration and discovery in mixed media (Santini, Gupta, & Jain, 2001). A key
problem for media production environments is that the task semantics for content
repurposing depends upon both the media types and the context of the current task. In
this chapter we claim that many semantics-based operations, including summarization, retrieval, composition, and synchronization, can be represented as a more general operation called media blending. Blending is an operation that occurs across two or more media elements to yield a new structure called the blend. The blend is formed by
inheriting partial semantics from the input media and generating an emergent structure
containing information from the current task and the source media. Thus the semantics
of the blend emerges from interactions among the media descriptions, the task to be
performed, and the creative input of the user.
Automated support for managing the semantics of media content would be beneficial for diverse applications, such as video editing (Davis, 1995; Kellock & Altman, 2000),
sound synthesis (Rolland & Pachet, 1995), and mining information from presentations
(Dorai, Kermani, & Stewart, 2001). A common characteristic among these domains that
will be emphasized in this chapter is the need to manage multiple media sources at the
semantic level. For sound production, there is a rich set of semantics associated with
sound effects collections and audio synthesis models that typically come with semantically labeled control parameters. In the case of automatic home video editing, the
control logic is informed by the relationships between music structure and video cuts as
described in film theory to yield a production with a particular composition style (Sharff,
1982). In the case of presentation mining from e-learning content, there is an association
between pedagogical structures within a lecture video and other content resources, such
as textbooks and slide presentations, that can be used to inform a search engine when
responding to a student's query. In each case, the user-centric production of media
involves the dynamic blending of information from different media. In this chapter, we
will show the construction of blending networks for user-centric media processing in the
domains of audio generation from sound synthesis models, automated home video
editing, and presentation mining.
BACKGROUND
Models are fundamental for the construction of blending networks (Veale & O'Donoghue, 2000). Blending networks have their origins in frame-based reasoning systems and have recently been applied in cognitive linguistics to link discourse analysis
with fundamental structures of cognition. According to Conceptual Integration Theory
from cognitive linguistics, thought and language depend upon our capabilities to
manipulate webs of mappings between mental spaces (Fauconnier, 1997). These mental
space mappings form the basis for the understanding of metaphors and other forms of
discourse as conceptual blends. Similarly, the experiential qualities of media constitute
a form of discourse that can only be understood through the creation of deep models of
media (Staab, Maedche, Nack, Santini, & Steels, 2002). Prior work on conceptual blending
provides a theoretical framework for the extension of blending theory to digital media.
Consequently, audio perception, video appreciation, and information mining may be
viewed as a form of media discourse. Accordingly, the claim in this chapter is that
principles of conceptual blending derived from analysis of language usage may also be
applied to the processing of media. In the remainder of this section we will describe three
scenarios for media blending, then review the literature on conceptual blending and
metaphor. The following sections will relate these structures to concrete examples of
media blending.
Audio Models
Intricate relations between audio perception and cognition associated with sound
production techniques pose interesting challenges regarding the semantics that emerge
from combinations of constituent elements. The semantics of sound tend to be more flexible than those of graphics, and bear a more tenuous relationship to the world of objects and events. For example, the sound of a
crunching watermelon can be indicative of a cool refreshing indulgence on a summer day,
or it can add juicy impact to a punch for which the sound is infamously used in film
production. The art of sound effects production depends heavily on the combination,
reuse, and recontextualization of libraries of prerecorded material. Labeling sounds in a
database in a way that supports reuse in flexible semantic contexts is a challenge.
A current trend in audio is the move toward structured representations (Rolland &
Pachet, 1995) that we will call models. A sound model is a parameterized algorithm for generating a class of sounds, as shown schematically in Figure 1.

Figure 1. Control parameters are mapped through the model to synthesizer parameters, which drive a synthesizer algorithm to produce the audio signal

Models are useful in media production partly because of their low memory/bandwidth requirements, since it takes much less memory to parameterize a synthesis model than it does to code the raw
audio data. Also, models meet the requirement from interactive media, such as games, that
audio be generated in real time in response to unpredictable events in an interactive
environment.
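To make the two-stage structure of Figure 1 concrete, the following minimal Python sketch implements a hypothetical sound model. The control names (pitch, brightness), the control-to-parameter mapping, and the additive-synthesis algorithm are illustrative assumptions, not models from this chapter.

```python
import math

class SoundModel:
    """A parameterized sound model in the sense of Figure 1: semantic control
    parameters are mapped to low-level synthesizer parameters, which then
    drive a synthesis algorithm that emits audio samples."""

    def __init__(self, sample_rate=8000):
        self.sample_rate = sample_rate

    def map_controls(self, pitch, brightness):
        # Map semantic controls to synth parameters: a MIDI-style pitch number
        # becomes a fundamental frequency in Hz, and brightness in [0, 1]
        # becomes a harmonic count.
        freq = 440.0 * 2 ** ((pitch - 69) / 12)
        harmonics = 1 + int(brightness * 7)
        return {"freq": freq, "harmonics": harmonics}

    def synthesize(self, params, duration=0.01):
        # Synthesizer algorithm: additive synthesis, summing harmonics
        # with 1/n amplitude weighting.
        n = int(self.sample_rate * duration)
        samples = []
        for i in range(n):
            t = i / self.sample_rate
            samples.append(sum(
                math.sin(2 * math.pi * params["freq"] * (h + 1) * t) / (h + 1)
                for h in range(params["harmonics"])))
        return samples

    def play(self, pitch, brightness):
        return self.synthesize(self.map_controls(pitch, brightness))

model = SoundModel()
audio = model.play(pitch=69, brightness=0.5)  # concert A, medium brightness
```

Because only the few control values need to be stored or transmitted, the model is far more compact than the audio it generates, which is the memory/bandwidth argument made above.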
The sound model design process involves building an algorithm from component
signal generators and transformers (modulators, filters, etc.) that are patched together
in a signal flow network that generates audio at the output. Models are designed to meet
specifications on a) the class of sounds the model needs to cover and b) the method of
controlling the model through parameters that are exposed to the user. Models are
associated with semantics in a way that general-purpose synthesizers are not, because
they are specialized to create a much narrower range of sounds. They also take on semantics by virtue of their real-time interactivity: they have responsive behaviors in a way that recorded sounds do not.
As media objects, models present interesting opportunities and challenges for
effective exploitation in graphical, audio, or mixed media. A database of sound models
is different from a database of recorded sounds in that the accessible sounds in the
database are (i) not actually present, but potential, and (ii) infinite in variability due to the
dynamic parameterization that recorded sounds do not afford. Model building is a labor-intensive job for experts, so exploiting a database of pre-existing sound models has potentially tremendous value.
Another trend in audio, as well as other forms of digital media, is to attempt to
automatically extract semantics from raw media data. The utility of being able to identify "a baby crying" or "a window breaking" in an audio stream should be self-apparent, as
should the difficulty of the task. Typically, audio analysis is based on adaptive
association between low-level signal features (such as spectral centroid, basis vectors,
zero crossings, pitch, and noise measures) and labels provided by a supervisor, or
based on an association with data from another media stream such as video. The difficulty
lies in the fact that there is no such thing as "the" semantics; any semantics there may be are dependent upon contexts both within and outside the media itself. The human-in-the-loop, and the intermediate representations between physical attributes and deep semantics that models offer, can be effective bridges across this gap.
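As a concrete illustration of such low-level features, here is a small, self-contained sketch (not from the chapter) of two of the descriptors named above, zero-crossing rate and spectral centroid, using a naive DFT that is adequate for short frames:

```python
import math

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return crossings / max(len(signal) - 1, 1)

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency, via a naive O(n^2) DFT."""
    n = len(signal)
    mags, freqs = [], []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(signal))
        im = sum(-s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(signal))
        mags.append(math.hypot(re, im))
        freqs.append(k * sample_rate / n)
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0

# A pure 1 kHz tone: its spectral centroid sits at the tone's frequency.
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(64)]
```

A real system would compute such frame-level features with an FFT and feed them into a classifier trained against supervisor-provided labels, as the paragraph above describes.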
Video Editing
The blending of two or more media that combines perceptual aspects from each
media to create a new effect is a common technique in film production. In film editing, the
visual presentation of the scene tells the story from the characters' point of view. Music is added to convey information about a character's emotional state, such as fear,
excitement, calm, and joy. Thus, when video cut points selected at key visual events are synchronized with associated features in the music, the audience experiences the blended media according to the emergent semantics of the cinematic edit (Sharff,
1982).
The non-professional media author of a home video may know what style of editing
they prefer, but lack the detailed knowledge, or time, to perform the editing operations.
Similarly, they may know what music selections to add to the edited video, but lack the
tools and insight to match the beat, tempo, and other features from the music with suitable
events in the video content. The challenge for semi-automated video editing tools is to
combine the stylistic editing logic with metadata descriptions of the selected music and
video, then opportunistically blend the source media to create the final product (Kellock
& Altman, 2000; Davis, 1995).
Presentation Mining
The utilization of media semantics is important not only for audio synthesis and video editing, but also for information-intensive tasks, such as composing and subsequently mining multimedia presentations. There is a rapidly growing body of corporate media archives, multimedia presentations, and modularized distance-learning courseware containing valuable information that remains inaccessible outside the
original production context. For instance, a common technique for authoring modular courseware is to produce a series of short, self-contained multimedia presentations for topics in the syllabus, then customize the composition of these elements for the target
audience (Thompson Learning, n.d.; WebCT, n.d.). The control logic for the sequencing
and navigation through the course content is specified through the use of description languages. However, these descriptions normally do not include the semantics of pedagogical events, domain models, or dependencies among media resources that would aid the user's exploration of the media. Once the course is constructed, it becomes very
difficult to modify or adapt the content to new contexts.
Recorded corporate presentations and distance learning lectures are notoriously
difficult to search for information or reuse in a different context. This difficulty arises from
the fact that the semantics of the presentation is fixed at the time of production. The media
blending framework is designed to support the discovery and generation of emergent
semantics through the use of ontologies for modeling domain information, composition
logic, and media descriptions.
Sources as descriptors. Dog barking, tires screeching, gun shot. The benefit of
using sources as semantic descriptors is that the descriptions come from lay
language that everybody speaks. Sources are very succinct descriptions and come
with a rich set of relationships to other objects that we know about. A drawback
to sources as descriptors is that some sounds have no possible, or at least obvious,
physical cause (e.g., the sound of an engine changing in size). Even if a physical
source is responsible for a sound, it may be impossible to identify. Similarly, any
given sound may have many unrelated possible sources. Finally, a given source
can have acoustically unrelated sounds associated with it, for example, a train
generates whistles, steam, rolling, and horn sounds.
Actions and events as descriptors. Dog barking, tires screeching, gun shot. Russolo's early musical noise machines, or intonarumori, had onomatopoetic names allied with actions, including howler, roarer, crackler, rubber, hummer, gurgler, hisser, whistler, burster, croaker, and rustler (Russolo, 1916). The benefit
of actions and events as descriptors is that they can often be assigned even when
source identification is impossible (a screech is descriptive whether the sound is
from tires or a child). Actions and events are also familiar to a layperson for
describing sounds (scraping, falling, pounding, screaming, sliding, rolling, coughing, clicking). A drawback is that in some cases it may be difficult or impossible for
sounds to be described this way. Unrelated sounds can also have the same
description in terms of actions and events. Finally, the description can be quite subjective: one person's "gust" is another's "blow."
Source attributes as sound descriptors. Big dog, metal floor, hollow wood. Such
descriptions are often easier to obtain than source identification and are still useful
even when source identification is impossible. These attributes are often scalar,
which makes them quantitative and easier to deal with for a computer. The
drawbacks are that it may be difficult to assign attributes for some sounds, many
sounds may have the same attributes, and the assignment can be quite subjective.
Sounds may also belong together simply because they frequently co-occur in the
environment or in man-made media. A "beach sounds" class could include crashing waves, shouting people, and dogs barking. Loose categories such as "indoor" and "outdoor" are often useful, especially in media production. A recording of a dog barking indoors would be useless for an outdoor scene.
When producers have the luxury of a high budget and are creating their own sound
effects, sounds are typically constructed from a combination of recorded material,
synthetic material, and manipulation. Typical manipulation techniques include time reversal; digital effects such as filtering, delay, and pitch shifting; and overlaying many different tracks to create an audio composite. To achieve the desired psychological
impact for an event with sound, it is often the case that recordings of actual sounds
generated by the real events are useless. For example, the sounds of a real punch or a
real gun firing are entirely inadequate for creating the impression of a punch or a gun shot
in cinema. The sounds that one might want to use as starting material to construct the
effects come from unrelated material with possibly unrelated semantic labels in the stored
database (Mott, 1990).
Semantic labels tend to commit a sound in a database to a certain usage unless the
database users know how to work around the labels to suit their new media context. On
the other hand, low-level physical signal attributes are not very helpful at providing
human-usable knowledge about a sound, either. In the 1950s, Pierre Schaeffer made a
valiant attempt at coming up with a set of generic source-independent sound descriptors.
Rough English translations of the descriptors include mass, dynamics, timbre, melodic
profile, mass profile, grain, and pace. He hoped that any sound could be described by
a set of values for these descriptors. He was never satisfied with the results of his
taxonomical attempts.
More recently, Dennis Smalley has developed his theory of Spectromorphology
(Smalley, 1997) with terms that are more directly related to aural perception: onsets
(departure, emergence, anacrusis, attack upbeat, downbeat), continuants (passage,
transition, prolongation, maintenance, statement), terminations (arrival, disappearance,
closure, release, resolution), motions (push/drag, flow, rise, throw/fling, drift, float, fly),
and growth (unidirectional, such as ascent, planar, and descent; and reciprocal, such as parabola, oscillation, and undulation). This terminology has had some success in the
analysis of electroacoustic works of music, which are notoriously difficult due to the
unlimited sonic domain from which they draw their material and because of the lack of
definitive reference to extra-sonic semantics. In fact, the clearest limitation of
spectromorphology is its inability to address the referential dimension of much contemporary music.
The footsteps model is used in the following example to illustrate the central
principles of media blending networks. Consider an audio designer who has been given
the task of creating the sound of two people passing on a stairway. In the model library
there are separate models for a person going up the stairs and for going down the stairs,
but there is no model for two people passing. The key moment for the audio designer is
the event when two people meet on the stairway so that both complete the step at the
same time. This synchronization is not a part of either input model, but it has a semantic
meaning that is crucial for the overall event.
The illustrations of blending networks use diagrams to represent the models and
relationships. In these diagrams, models are represented by circles; parameters by points
in the circles; and connections between parameters by lines. Each model may be realized
as a complex software object that can be modified at the time the blending network is
constructed. Thus the sound designer would use a high level description language to
specify the configuration of the models and their connections. The configuration
description is then compiled into the blending network which could then be run to
produce the desired sound.
The Footstep network contains two input models corresponding to the audio model
for walking up the stairs and the model for walking down the stairs. Each model in Figure 2 is distinct; however, they have semantically similar parameters. The starting time for climbing the stairs is t1, the starting time for descending the stairs is t2, the person going up is p1, and the person going down is p2.
The two audio models have parameters that are semantically labeled. The cross-model mapping that connects corresponding parameters in the input models is illustrated by dashed lines in Figure 3. In addition to the starting times, ti, and the persons, pi, that are specified explicitly in the input models, connections are established between other similar pairs of parameters, such as walking speed, si, and location, li.
The two input models inherit information from an abstract model for walking that
includes percussion sounds, walking styles, and material surfaces. This forms a generic
model that expresses the common features associated with the two inputs. The common
features may be simple parameters, such as start time, person, speed, and location as in
Figure 4. More generally, the generic model may be used to specify the components and
relationships in more complex models as a domain ontology, as we shall see later.
The blending framework in Figure 5 contains a fourth model which is typically called
the blend. The two stair components in the input models are mapped onto a single set
of stairs in the blend. The local times, t1 and t2, are mapped onto a common time t in the
blend. However, the two people and their locations are mapped according to the local time
of the blend. Therefore, the first input model represents the audio produced while going
up the stairs, whereas the second model represents the audio produced while going
down. The projection from these input models onto the blend preserves time and location.
The Footstep network exhibits in the blend model various emergent structures that
are not present in the inputs. This emergent structure is derived from several mechanisms
available through the dynamic construction of the network. For example, the composition
of elements from the inputs causes relations to become available in the blend that do not
exist in either of the inputs. According to this particular construction, the blend contains
two moving individuals instead of the single individual in each of the inputs. The
individuals are moving in opposite directions, starting from opposite ends of the stairs,
and their positions and relative temporal patterns can be compared at any time that they are on the stairs.

Figure 4. Inclusion of the generic model for the Footstep input models
At this point the construction of the blending network is complete and constitutes
a meta-model for the two people walking in opposite directions on the same stairs. Since
this is a generative model, we can now run the scenario dynamically. In the blend there
is new structure: There is no encounter in either of the input models, but the blend
contains the synchronized stepping of the two individuals. The input models continue
to exist in their original form, therefore information about time, location, and walking
speed in the blend space can be projected back to the input models for evaluation there.
This final configuration with projection of the blend model back to the input models is
illustrated in Figure 5.
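The network just described can be sketched in executable form. The sketch below reduces each input model to the parameters named in the text (start time, person, speed, location) and omits the audio generation entirely; all numeric values are hypothetical. The encounter found by run_blend exists in neither input model alone, which is precisely the emergent structure:

```python
class FootstepModel:
    """Input space: a footstep model reduced to the parameters the text names
    (start time t0, person, speed, and location on the stairs)."""

    def __init__(self, person, start_time, speed, start_loc, direction):
        self.person, self.t0, self.speed = person, start_time, speed
        self.l0, self.direction = start_loc, direction  # +1 climbs, -1 descends

    def location(self, t):
        # Position on the stairs (0 = bottom, 1 = top) at shared blend time t.
        return min(1.0, max(0.0, self.l0 + self.direction * self.speed * (t - self.t0)))

def run_blend(up, down, dt=0.01):
    """Blend space: impose a common clock on both inputs and find the moment
    the two walkers occupy the same stair position."""
    t = max(up.t0, down.t0)
    while t < 100.0:
        if abs(up.location(t) - down.location(t)) < up.speed * dt:
            return t
        t += dt
    return None

up = FootstepModel("p1", start_time=0.0, speed=0.1, start_loc=0.0, direction=+1)
down = FootstepModel("p2", start_time=0.0, speed=0.1, start_loc=1.0, direction=-1)
meet = run_blend(up, down)  # both reach the midpoint at t = 5.0
```

Projecting the blend back to the inputs then amounts to evaluating each input model at the blend's common time, for example up.location(meet).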
Blending Theory
Blending is an operation that occurs across two or more input spaces to yield a new
space, the blend. The blend is formed by inheriting partial structure from the input spaces
Figure 5. Space mapping for the blending model in the integrated footstep network
The blending network comprises the following spaces and structures:

Input Spaces: a pair of inputs, I1 and I2, to the network, along with the models for processing the inputs. In the Footstep network, the inputs were the audio models for generating the footstep sounds.

Cross-Space Mapping: direct links between corresponding elements in the input spaces I1 and I2, or a mapping that relates the elements in one input to the corresponding elements in the other.

Generic Space: defines the common structure and organization shared by the inputs and specifies the core cross-space mapping between inputs. The domain ontology for the input models is included in the generic space.

Blend Space: information from the inputs I1 and I2 is partially projected onto a fourth space containing selected relations from the inputs. Additionally, the blend model inherits structure from the ontologies used in the generic model, as well as specific functions derived from the context of the current user task. The two footstep models were integrated into a single blend model with projection of parameter values.

Emergent Structure: the blend contains new information, not explicitly present in the inputs, that becomes available as a result of processing the blending network. In the Footstep network, the synchronization of the footsteps at the meeting point in the blend model was the key emergent structure.
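The four spaces above can be captured as plain data structures. The sketch below is an illustrative rendering, not an implementation from the chapter; the field names and the min merge rule are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Space:
    name: str
    elements: dict  # parameter name -> value (or role, in the generic space)

@dataclass
class BlendingNetwork:
    input1: Space
    input2: Space
    generic: Space        # shared roles; stands in for the domain ontology
    cross_mapping: dict   # input1 parameter -> corresponding input2 parameter
    blend: Space = field(default_factory=lambda: Space("blend", {}))

    def project(self, merge):
        # Partial projection: each mapped parameter pair is combined by a
        # caller-supplied merge function and placed in the blend space.
        for a, b in self.cross_mapping.items():
            self.blend.elements[a] = merge(self.input1.elements[a],
                                           self.input2.elements[b])
        return self.blend

net = BlendingNetwork(
    input1=Space("up", {"t": 0.0, "speed": 1.2}),
    input2=Space("down", {"t": 3.0, "speed": 0.8}),
    generic=Space("walking", {"t": "start time", "speed": "walking speed"}),
    cross_mapping={"t": "t", "speed": "speed"},
)
blend = net.project(min)  # blend keeps the earlier start and the slower speed
```

The cross-space mapping is the only link the projection follows, which mirrors the role it plays in the definitions above.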
The key contribution of Conceptual Integration Theory has been the elaboration
of a mechanism for double scope binding for the explanation of metaphor and the
processing of natural language discourse. The basic diagram for the double scope
binding from the two input models onto the blend model was previously illustrated in
Figure 5. This network is formed using the generic model to perform cross-space mapping
between the two input models, then projecting selected parameters of the input models
onto the blend model. Once the complete network has been composed, the parameter
values are bound and information is continuously propagated while dynamically running
the network. We will next discuss the computational theory for blending, then examine
the double scope binding configuration applied to audio synthesis, video editing, and
presentation mining.
Computational Theory
The blending framework for discovering emergent semantics in media consists of
three main components: ontologies that provide a shared description of the domain;
operators that apply transformations to the inputs and perform computations on the
input models; and an integration mechanism that helps the user discover emergent
structure in the media.
Ontologies
Ontologies are a key enabling technology for semantic media. An ontology may be
defined as a formal and consensual specification of a conceptualization that provides a
shared understanding of a domain that can be communicated across people and application systems. Ontologies may be of several types, ranging from the conceptual specification of a domain to an encoding of computer programs and their relationships. Thus, the use of ontologies brings together two essential elements for discovering semantics in media: a structured representation and a shared, communicable understanding of the domain.
Operators
The linkages between the two input spaces and the media blend in Figure 5 are supported by a set of core operators. Two of these operators are called projection and
compression. Projection is the process in which information in one space is mapped onto
corresponding information in another space. In the video editing domain, projection
occurs through the mapping of temporal structures from music onto the duration and
sequencing of video clips. The hierarchical structure of music and the editing instructions for the video can both be modeled as a graph. Since each model is represented by
a graph structure, projection amounts to a form of graph mapping. In general, the mapping
between models is not direct, so the ontology from the generic space is used to construct
the transformation that maps information between input spaces.
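To make the projection operator concrete, the following minimal sketch (with entirely hypothetical music and video structures) treats both input models as small graphs and uses an ontology-derived correspondence to map temporal structure from music nodes onto video edit nodes:

```python
# A minimal sketch of projection as graph mapping; the structures and the
# "phrase -> clip_slot" correspondence are invented for illustration.
music = {
    "phrase1": {"type": "phrase", "start": 0.0, "end": 4.0},
    "phrase2": {"type": "phrase", "start": 4.0, "end": 6.5},
}

# Ontology-derived correspondence between node types in the two spaces.
correspondence = {"phrase": "clip_slot"}

def project(music_graph, mapping):
    """Map music nodes onto video edit nodes, copying temporal structure."""
    video_graph = {}
    for name, node in music_graph.items():
        video_graph[name] = {
            "type": mapping[node["type"]],
            # Timing is the information projected across spaces.
            "duration": node["end"] - node["start"],
        }
    return video_graph

edit_plan = project(music, correspondence)
print(edit_plan["phrase1"]["duration"])  # 4.0
```

Because the mapping is not direct in general, a real system would construct `correspondence` from the generic-space ontology rather than hard-code it as above.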
Compression is the process in which detail from the input spaces is removed in the
blend space in order to provide a condensed description that can be easily manipulated.
Compression is achieved in an audio morph through the low-dimensional control
parameters for the transformation between the two input sounds. The system performs
these operations in the blended space and projects the results back to any of the available
input spaces.
Integration Mechanism
The consequence of defining the operators for projection and compression is that
a new integration process, called running the blend, becomes possible within this
framework (Fauconnier, 1997). Running the blend is the process of reversing the direction
of causality, thereby using the blend to drive the production of inferences in either of
the input spaces. In the blend space of the video editing example, the duration of an edited
video clip is related to the loudness of the music, and the start and stop times are determined
by salient beats in the music. The process of running the blend causes these constraints
to propagate back to the music input model to determine the loudness value and the timing
of salient beats that satisfy the editing logic of the music video blend. Thus the process
of running the blend means that operations applied in the blend model are projected back
to the inputs to derive emergent semantics, such as music driven video editing.
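The reversal of causality can be sketched as simple constraint inversion; the editing rule and its constant below are invented for illustration:

```python
# Running the blend, sketched as constraint inversion. The blend relates clip
# duration to music loudness via duration = K / loudness (a made-up rule).
K = 10.0  # hypothetical editing-rule constant

def duration_from_loudness(loudness):
    """Forward direction: the music input drives the edit."""
    return K / loudness

def loudness_for_duration(duration):
    """Running the blend: the edit constraint drives inference in the music input."""
    return K / duration

want = 2.5  # desired clip duration in seconds
needed = loudness_for_duration(want)
print(needed)  # 4.0 -- the loudness value that satisfies the editing logic
```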
In the case of mining presentations for information, preprocessing by the system
analyzes the textbook to extract terms and relations, which are then added as concept
instances of a domain ontology within the textbook model. The user seeks to query the
courseware for information that combines the temporal sequencing of the video lecture
models with the structured organization of the textbook model. This integrated view of
the media is constructed by invoking a blending model, such as find path or find
similar, to translate the user query into primitives that are suitable for the input models.
Once the lecture presentation network has been constructed, the user can run the blend
to query the input models for a path through the lecture video that links any two given
topics from the textbook. The integration mechanism of this blend provides a parameterized model of a path that can be used to navigate through the media, mine for relationships, or compose answers to the original query. The emergent semantics of this media
blending model is a path through the video content that exhibits relationships derived
from the textbook. Due to the double scope binding of this network, the blending model
can also be used to project information from the video onto the textbook, thereby
imposing the temporal sequencing of the video presentation onto the hierarchical
organization of the textbook. Additional blending networks, such as the find similar
blend, can be used to integrate information from the two input sources to discover
similarities.
representational space, there are typically an infinite number of paths that can be
traversed to reach one from the other, some of which will be effective in a given usage
context, others possibly not.
The work on this issue has tended to focus on various ways of interpolating
between spectral shapes of recorded sounds (Slaney et al., 1996). This approach works
well when the source and target sounds are static so that the sounds can be transformed
into a spectral representation. Similar to the case with graphical morphs, corresponding
points can be identified on the two objects in this space. A combination of space warping
and interpolation is then used to move between the source and target.
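A minimal illustration of spectral-shape interpolation between two static sounds follows; the test tones, frame size, and naive phase handling are our own simplifications, not the cited method:

```python
import numpy as np

# Two static test tones stand in for recorded source and target sounds.
sr = 8000
t = np.arange(1024) / sr
source = np.sin(2 * np.pi * 220 * t)
target = np.sin(2 * np.pi * 440 * t)

S, T = np.fft.rfft(source), np.fft.rfft(target)

def morph_frame(alpha):
    """Interpolate spectral magnitudes; alpha = 0 gives the source exactly."""
    mag = (1 - alpha) * np.abs(S) + alpha * np.abs(T)
    phase = np.angle(S)  # naive: keep source phase; real morphs warp/align peaks
    return np.fft.irfft(mag * np.exp(1j * phase))

halfway = morph_frame(0.5)  # a sound partway between the two spectral shapes
```

A real morph would also identify corresponding spectral peaks and warp frequency, which is precisely the extra structure a sound model provides.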
A much deeper and more informative representation of a sound is provided in terms of
a sound model. There are several different ways that models can be used to define a
morph. The more the model structures can be exploited, the richer are the possible
emergent semantics.
If two different sounds can be generated by the same model, then a morph can be
trivially defined by selecting a path from the parameter setting that generates one to a
setting that generates the other. In this case, the blend space is the same as the model
for the two sounds, so although the morph may be more interesting than the spectral
variety discussed above, no new semantics can be said to emerge.
If we are given two sounds, each with a separate model capable of generating the
sounds, then the challenge is to find a common representational space in which to create
a path that connects the two sound objects. One possible solution would be to define
a set of feature detectors (e.g., spectral measurements, pitch, measures of noisiness)
that would provide a kind of description of any sound. This solves the problem of finding
a common space in which both source and target can be represented. Next, a region of
the feature space that the two model sound classes have in common needs to be
identified, and paths from the source and target need to be delineated such that they
intersect in that region. If the model ranges do not intersect in the feature space, then
a series of models with ranges that form a connected subspace needs to be created to
support such a path so that a morph can be built using a series of models as illustrated
in Figure 7. This process requires knowledge about the sound generation capabilities of
each model at a given point in feature space.
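The search for a connected series of model ranges can be sketched as a graph traversal; the one-dimensional feature intervals below are hypothetical stand-ins for real model ranges:

```python
# Finding a chain of sound models whose feature-space ranges overlap, so that
# a morph path from the source model to the target model is covered.
from collections import deque

ranges = {  # model name -> (low, high) interval along one feature dimension
    "src":  (0.0, 0.4),
    "mid1": (0.3, 0.7),
    "mid2": (0.6, 0.9),
    "tgt":  (0.85, 1.0),
}

def overlaps(a, b):
    return ranges[a][0] < ranges[b][1] and ranges[b][0] < ranges[a][1]

def model_chain(start, goal):
    """Breadth-first search over models connected by overlapping ranges."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for m in ranges:
            if m not in seen and overlaps(path[-1], m):
                seen.add(m)
                queue.append(path + [m])
    return None  # no connected subspace exists

print(model_chain("src", "tgt"))  # ['src', 'mid1', 'mid2', 'tgt']
```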
We mentioned earlier that a model is defined not only by the sounds within its range,
but also by the paths it can take through that range as determined by the control
parameterizations. The dynamic behavior defined by the possible paths plays a key role
in any semantics the model might be given. The connected feature space region defines
a path between the source and target sounds in a particular way that will create and
constrain a semantic interpretation. However, in this case, the new model is less than
satisfying because, as a combination of other models in which only one is active at a time,
it cannot actually generate sounds that were not possible with the extant models.
Moreover, if the kludging together of models is actually perceived as such, then new
semantics fail to arise.
Another way to solve the problem would be to embed the two different models into
a blended structure where each original model can be viewed as a special case given
by specific parameter settings of the meta-model. This could be done trivially by building
a meta-model that merely mixes the audio output from each model separately, with a
parameter that controls the relative contribution from each submodel. Again we have a
Figure 7. A morph in feature space performed using one model that can generate the
source sound, another model that can generate the target sound, and a path passing
through a point in feature space that both models are capable of generating. If the
source-generating model and the target-generating model do not overlap in feature
space, intermediate models can be used so that a connected path through feature space
is covered.
trivial morph that would not be very satisfying, and because sound mixes from
independent sources are generally perceived as mixes rather than as a unified sound from
a single source, the semantics of the individual component models would presumably be
clearly perceptible.
There are, however, much richer ways of embedding two models into a blended
structure such that each submodel is a sufficient description of the meta-model under
specific parameter settings. The blended structure wraps the two submodels that
generate the morphing source and target sounds and exposes a single reduced set of
parameters. There must exist at least one setting for the meta-model parameters such that
the original morphing target sound is produced, and one such that the original morphing
source sound is produced in order to create the transformation from source sound to
target sound. The meta-model parameterization defines the common space in which both
the original sounds exist and in which any number of paths may be constructed
connecting the two. We discussed this situation earlier, except in this case the meta-model is genuinely new and has its own set of capabilities and constraints defined by
the relationship between the structure of the two original models, but present in neither.
New semantics emerge from the domain ontology, mappings between models, and the
integration network created in the blend.
As a concrete example of an audio morph with emergent semantics, consider two
different sounds: one the result of waveshaping on a sinusoid, the other the result of
amplitude modulation of a sampled noise source as illustrated in Figure 8. Each structure
creates a distinctive kind of distortion of the input signal. One way of combining these
two models into a meta-model is shown in Figure 9. To combine these two models, we use
knowledge about the constituent components of the models, which could be exploited
automatically if they were represented as a formal ontology as discussed above. In
Figure 8. Two kinds of signal distortion: (a) a patch that puts the recorded sample
through a nonlinear transfer function (tanh), where the amount of effect determines
how nonlinear the shaping is, with zero causing the original sample to be heard
unchanged; (b) a sinusoidal amplitude modulation of the recorded sample.
particular, knowing the input and output types and ranges for signal modifying units, and
knowing specific parameter values for which the modifiers have no effect on the signal
(the null condition), we can structure the model for morphing. Knowledge of null
conditions, in particular, was used so that the effect of one submodel on the other would
be nullified at the extreme values of the morphing parameter. Using knowledge of the
modifier units' signal range expectations and transformations permits the models to be
integrated at a much deeper structural level than treating the models as black boxes would
permit.
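A rough sketch of such a meta-model follows; the parameter names and constants are ours, not the chapter's. A single morphing parameter drives each submodel toward its null condition at the opposite extreme, so one end yields pure waveshaping and the other pure amplitude modulation:

```python
import numpy as np

def waveshape(x, amount):
    # amount = 0 is the null condition: tanh(a * x) / tanh(a) -> x as a -> 0
    return x if amount == 0 else np.tanh(amount * x) / np.tanh(amount)

def amp_mod(x, depth, freq, sr):
    # depth = 0 is the null condition: the signal passes through unchanged
    t = np.arange(len(x)) / sr
    return x * (1 - depth + depth * np.sin(2 * np.pi * freq * t))

def meta_model(x, m, ws_amount=3.0, am_depth=1.0, am_freq=5.0, sr=8000):
    """Morph m in [0, 1]: m = 0 is pure waveshaping, m = 1 pure amplitude modulation."""
    y = waveshape(x, (1 - m) * ws_amount)          # WS effect fades out as m -> 1
    return amp_mod(y, m * am_depth, am_freq, sr)   # AM effect fades in as m -> 1

x = np.sin(2 * np.pi * 220 * np.arange(1024) / 8000)
pure_ws = meta_model(x, 0.0)  # identical to waveshaping alone
pure_am = meta_model(x, 1.0)  # identical to amplitude modulation alone
hybrid = meta_model(x, 0.5)   # new territory: AM of a half-strength waveshaped signal
```

Note how the null conditions do the structural work: at the extremes one submodel's effect vanishes, while at in-between values the submodels genuinely compose rather than merely mix.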
Most importantly, blending the individual model structures creates a genuinely new
model capable of a wide range of sounds that neither submodel was capable of generating
alone, yet including the specific sounds from each submodel that were the source and
the target sounds for the morph. A new range of sounds implies new semantic possibilities.
New semantics can be said to arise in another aspect as well. In the particular blend
illustrated above, most of the parameters exposed by the original submodels are still
available for independent control. At the extreme values for the morphing parameter, the
original controls have the same effect that they had in their original context. However,
at the in-between values for the morphing parameter, the controls from the submodels
have an effect on the sound output that is entirely new and dependent upon the particular
submodel blend that is constructed. This emergent property is not present in the trivial
morph described earlier which merely mixed the audio output of the two submodels
individually. Since a morph between two objects is not completely determined by the
endpoints, but by the entire path through the blend space, there is a creative role for a
human-in-the-loop to complete the specification of the morph according to the usage
context.
Figure 9. Both the waveshaping (WS) and the amplitude modulation (AM) models
embedded in a single meta-model. When the WS vs. AM morphing parameter is at
one extreme or the other we get only the effect of either the WS or the AM model
individually. When the morph parameter is at an in-between state, we get a variety of
new combinations of waveshaping of the AM signal and/or amplitude modulation of
the waveshaped signal (depending on the other parameter settings).
We have shown that, given knowledge about how elementary units function within
models in the form of an ontology, structures for different models can be combined in
a way that gives rise to new sound ranges and new handles for control. Semantics
emerge that are related to those of the model constituents, but in rich and complex ways.
There are, in general, many ways that sound models may be combined to form new
structures. Some combinations may work better in certain contexts than others. How
desired semantics can be used to guide the construction process is a topic that warrants
further study.
sound that are routinely used to edit film (Sharff, 1982). The mechanisms that underlie
cinematic editing can be described as a blending network that leverages the cognitive
perceptions of audio and imagery to create a compelling story. The Video Edit example
borrows from such cinematic editing techniques to construct a blending network for the
semi-automatic editing of home video. In this network, the generic model is an encoding
of the cinematic editing rules relating music and video, and the input models represent
structural features of the music and visual events in the video.
Encoding of aesthetic decisions for editing video to music is key for creating the
blending model. Traditional film production techniques start by creating the video track,
then add sound effects and music to enhance the affective qualities of the video. This
is a highly labor-intensive process that does not lend itself well to automation. In the case
of the casual user with a raw home video, the preferred editing commands emphasize
functional operations, such as selecting the overall style of cinematic editing, choosing
to emphasize people in the video, selecting the music, and deciding how much native
audio to include in the final production.
The generic model for the music and video inputs is a collection of editing units that
describe simple relations between fragments of audio and video. Each unit captures
partial information associated with a cinematic editing rule; thus the units in the generic
model can be composed in a graph structure to form more complex editing logic. One
example of an insertion unit specifies that the length of a video clip to be inserted should
be inversely proportional to the loudness of the music. During the construction of the
blending network the variables for video length and music loudness are bound and
specific values are propagated during the subsequent running of the blend to
dynamically produce the final edited video. Another example of a transition unit specifies
how two video clips are to be spliced together. When this unit is added to the graph
structure, it specifies the type of transition between video clips, the duration of the
transition, and the inclusion of audio or graphical special effects. Yet another insertion
unit may relate the timing and visual characteristics of people in the video to various
structural features in the music. The generic model may therefore be viewed as an
ontology of simple editing units that can be composed into a graph structure by the
blending model for subsequent editing of the music and video inputs.
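The composition of editing units can be sketched as follows; the rule constants, loudness values, and transition heuristic are invented for illustration:

```python
# Composing editing units into an edit decision list. Each unit captures one
# cinematic rule; here an insertion unit and a transition unit (both toy rules).
music_sections = [
    {"loudness": 0.8, "beat": 0.0},
    {"loudness": 0.4, "beat": 4.0},
    {"loudness": 0.2, "beat": 9.0},
]

def insertion_unit(section, k=2.0):
    """Clip length inversely proportional to music loudness."""
    return {"clip_length": k / section["loudness"], "start": section["beat"]}

def transition_unit(prev_len, next_len):
    """Pick a transition type from the pacing implied by the clip lengths."""
    return "cut" if min(prev_len, next_len) < 4.0 else "dissolve"

edits = [insertion_unit(s) for s in music_sections]
for a, b in zip(edits, edits[1:]):
    b["transition"] = transition_unit(a["clip_length"], b["clip_length"])

print([round(e["clip_length"], 2) for e in edits])  # [2.5, 5.0, 10.0]
```

Loud, fast sections thus get short clips joined by cuts, while quiet sections get longer clips and gentler transitions, mirroring the binding and propagation described above.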
The video input model contains the raw video footage plus the shots detected in
the video, where each shot is a sequence of contiguous video frames containing similar
images. The raw video frames are then analyzed in terms of features for color, texture,
and motion, as well as simple models for the existence of people in the shot or other salient
events. This analysis provides the metadata for a model representing the input video for
use in the subsequent video editing. For example, the video model includes techniques
for finding parts of the shots which contain human faces. The information about faces
can be combined with the editing logic to create a final production which emphasizes the
people in the video. In this way, the system can automatically construct a people-oriented
model of the input video.
The model for the input music needs to support the editing logic of the generic model
for cinematic editing and the high level commands from the user interface. The basic
music model is composed of a number of parameters, including the tempo, rhythm, and
loudness envelope for the music. The combined inputs of the video model and the music
model in Figure 10 are integrated with the cinematic styles in the blending model to
Figure 10. The blending network for automatic editing of video according to the
affective structures in the music and operators for different cinematic styles.
(Diagram: Music and Raw Video inputs are processed by Music Analysis and Video
Analysis into Music and Video Descriptions; the Composition Logic combines these
with the selected Styles to drive the Media Production.)
produce a series of editing decisions: when to make cuts, which kinds of transitions
to use, what effects to add, and when to add them.
The composition logic in the blend model integrates information from three places:
a video description produced by the video analysis, a music description produced by the
music analysis, and information about the desired editing style as selected by the user.
The composition logic uses the blending model to combine these three inputs in order
to make the best possible production from the given material: one that is as stylish
and artistically pleasing as possible. It does this by representing the blended media
construction as a graph structure and opportunistically selecting content to complete
the media graph. This process results in the emergent semantics of a music video which
inherits partial semantics from the music and from the video.
The Presentation
The Presentation example illustrates the use of emergent structure to facilitate
information retrieval from online e-learning courseware. A simple form of emergent
structure is a path that combines concept relationships from a textbook with temporal
sequencing from the video presentation. The path structure can then be manipulated to
gain insight into the informational content of the courseware by performing all of the
standard operations afforded by paths, such as traversal, compression, expansion,
branching, and the measurement of distance.
Figure 11. Media integration in the Presentation network for the Find Path blend.
(Diagram: a Generic Model ontology of terms links the Textbook Model, with its
topics, text, and terms, and the Lecture Model, with its slides, transcripts, segments,
context, and video time line; the Path Blend integrates terms, topics, and media into
the emergent structure of a path of terms through the video.)
onto the abstract concepts of the domain ontology. The ontology is subsequently
converted to a graph structure for efficient search. The textbook has explicit structure
due to the hierarchical organization of chapters and topics, as well as the table of
contents, and index. There is also implicit structure in the linear sequencing of topics and
the convention among textbooks that simpler material comes before more complex
material.
The lecture model provides the second input which represents the lecture video,
transcripts, and accompanying slide presentation. The metadata for the lecture model can
be derived automatically through the analysis of perceptual events in the video to
classify shots according to the activity of the instructor. The text of the transcripts can
be analyzed to extract terms and indexed for text based queries. The slide presentation
and associated video time stamps provide an additional source of key terms and images
that can be cross mapped to the textbook model.
The generic model for the Presentation network contains the core domain ontology
of terms and relations used in the course. As we shall see later, the cross-space mapping
between the textbook model and the lecture model occurs at the level of extracted terms
and their locations in the respective media. The concepts in the core ontology thus
provide a unified indexing scheme for the term instances that occur as a result of media
processing in the two input models.
The blend model for the find path scenario receives projections of temporal
sequence information from the lecture video and term relations from the textbook. When
the user issues a query to find a path in the lecture video that goes from topic A to topic
B, the blend model first accesses the textbook model to expand the query terms associated
with the topics A and B. The graph representation of the textbook model is then searched
for a path linking these topics. Once a path is found among the textbook terms, the original
query is expanded into a sequence of lecture video queries, one for each of the terms plus
local context from the textbook. The blend model then evaluates each video query and
assembles the selected video clips into the final path structure (see insert in Figure 11).
At this point, the blend model has fully instantiated the path as a blend of the two inputs.
Once the path has been constructed, the user can run the blend to perform various
operations on the path. Note that the blending network has added mappings between the
input models, but has not modified the original models. Thus, the path blend can, for
instance, be used to project the temporal sequencing from the time stamps of the lecture
video back onto the textbook model to construct a navigational path in the textbook with
sequential dependencies from the video.
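The find path expansion can be sketched with a toy textbook graph and lecture index; the terms, relations, and timestamps are invented:

```python
# Sketch of the find-path blend: search the textbook's term graph for a path,
# then expand it into a temporal sequence of lecture video segments.
from collections import deque

textbook = {  # term -> related terms, from the textbook's domain ontology
    "DP": ["optimal substructure"],
    "optimal substructure": ["DP", "GA"],
    "GA": ["optimal substructure"],
}
lecture_index = {  # term -> video timestamps (seconds) where it is discussed
    "DP": [120.0], "optimal substructure": [310.0], "GA": [455.0],
}

def term_path(src, dst):
    """Breadth-first search of the textbook graph for a path linking two topics."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in textbook.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def path_blend(src, dst):
    """Expand the term path into a sequence of (term, video time) segments."""
    return [(t, lecture_index[t][0]) for t in term_path(src, dst)]

print(path_blend("DP", "GA"))
# [('DP', 120.0), ('optimal substructure', 310.0), ('GA', 455.0)]
```

Because the mappings are added alongside the unmodified inputs, the same structure can project the video timestamps back onto the textbook, as described above.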
Operators
Mappings between models in the Presentation network in Figure 11 support a set
of core operators for information retrieval in mixed media. Two of these operators are
called projection and compression. As seen in previous examples, projection is the
process in which information in one model is mapped onto corresponding information
in another model. Since both input models are represented by a graph structure, where
links between nodes are relations, projection between inputs amounts to a form of graph
matching to identify corresponding elements. These elements are then bound so that
information can pass directly between the models. A second source of binding occurs
between each input model and the emergent structure that is constructed in the blend
model. This double scope binding enables the efficient projection of information within
the network.
Compression is another core operator of the blending network that supports media
management through semantics. For example, traditional methods for constructing a
video summary require the application of specialized filters to identify relevant video
segments. The segments are then composed to form the final summary. Instead of
operating directly on the input media, compression operates on the emergent path
structure and projects the results back to the input media. Thus by operating on the path
blend, one can derive the shortest time path, the most densely connected path, or the path
with the fewest definitions from the lecture video. The system performs these operations
on the blended model and projects the results back to either of the available input models
to determine the query result.
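Compression over the emergent path can be sketched as selection among candidate paths; the segment lists and durations are invented:

```python
# Compression operates on the emergent path structure rather than the raw
# media: among candidate paths, select the one with the shortest viewing time.
candidates = [
    [("DP", 30.0), ("memoization", 45.0), ("GA", 25.0)],
    [("DP", 30.0), ("GA", 25.0)],
]

def total_time(path):
    return sum(duration for _term, duration in path)

shortest = min(candidates, key=total_time)
print(total_time(shortest))  # 55.0
```

Other compression criteria, such as the most densely connected path, would simply swap in a different `key` function over the same path structure.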
The consequence of defining the operators for projection and compression is that
a new process, called running the blend, becomes possible within this framework.
Running the blend is the process of reversing the direction of causality within the
network, thereby using the blend to drive the production of inferences in either of the
input spaces. In the find path example, the application of projection and compression
on the path blend means that the user can manage the media using higher level semantics.
Moreover, all of the standard operations on paths, such as contracting, expanding,
reversing, etc., can now be performed and their consequences projected back onto the
input spaces. Finally, the user can perform a series of queries in the blend and project
the results back to the inputs to view the results.
Integration Mechanism
The network of models in the Presentation blend provides an integration mechanism
for the multimedia resources. Once the network is constructed, it is possible to process
user queries by running the blend. In the find path scenario, the student began with a
request to find a relationship between Dynamic Programming (DP) and Greedy Algorithms (GA). The system searches the domain ontology of the textbook model to discover
a set of possible paths linking DP with GA, subject to user preferences and event
descriptions. The user preferences, event descriptions, and relations among the path
nodes in the ontology are used to formulate a focused search for similar content in the
video presentation. The resultant temporal sequence of video segments is added to the
emergent path structure in the blend model.
As discussed previously, when the student requested to find a path from topic DP
to topic GA, a conceptual blend was formed which combined the ontology from the
textbook with the temporal sequencing of topics from the lecture. The result was a
chronological path through the sequence of topics linking DP to GA. This path can now
be used in an intuitive way to compress time, expand the detail, select alternative routes,
or combine with another path. The resultant path through a sequence of interrelated
media segments in the find path blend is the emergent structure arising from the
processing of the user's query. Thus, one can now start from the constructed path and
project information back onto the input spaces to mine for additional information that was
previously inaccessible. For example, one could use the path to select a sequence of text
locations in the textbook that correspond to the same chronological presentation of the
topics that occurs in the lecture. Thus, the blending network effectively uses the
instructor's knowledge about the pedagogical sequencing of topics to provide a
navigational guide through the textbook.
We have designed a system for indexing mixed media content using text, audio,
video, and slides, and for the segmentation of the content into various lecture components.
The GUI for the ontology-based exploration and navigation of lecture videos is shown
in Figure 12. By manipulating these lecture components in the media blend, we are able
to present media information to support insight generation and aid in the recovery of
comprehension failures during the viewing of lecture videos. This dynamic composition
of cross media blends provides a malleable representation for generating insights into
the media content by allowing the user to manage the media through the high level
semantics of the media blend.
FUTURE TRENDS
The development of the Internet and the World Wide Web has led to the
globalization of text-based exchanges of information. The subsequent use of web
services for the automatic generation of web pages from databases for both human and
machine communication is being facilitated by the development of Semantic Web
technologies. Similarly, we now have the capacity to capture and share multimedia
content on a large scale. Clearly, the plethora of pre-existing digital media and the
popularization of multimedia applications among non-professional users will drive the
demand for authoring tools that provide a high level of automation.
Figure 12. User interface for the Presentation network. Display contains the following
frames (clockwise from top left corner): Path Finder, Video Player, Slide, Slide Index,
Textbook display, and a multiple timeline display of search results presented as
gradient color hotspots on a bar chart.
Preliminary attempts toward the use of models for the generation of sound effects
for games and film, as well as the retrieval of video from databases, have been primarily
directed toward human-to-human communication. The increasing use of generative
models for media synthesis and the ability to dynamically construct networks for
combining these models will create new ways for people to experience media. Since the
semantics of the media is not fixed, but arises from the media and the way that it is used,
the discovery of emergent semantics through ontology-based operations is becoming
a significant trend in multimedia research. The convergence of generative models,
automation, and ontologies will also facilitate the exchange of media information between
machines and support the development of a media semantic web.
In order to realize these goals, further progress is needed in the following technologies:
- Use of semantic descriptions for the composition of models into blending networks.
- Formalization of ontologies for domain knowledge, synthesis units, and relationships among generative models.
The trend toward increasing the automation of media production through the
creation of media models relies upon the ability to manage the semantics that emerges
from user-centric operations on the media.
CONCLUSIONS
In this chapter we have presented a framework for media blending that has proved
useful for discovering emergent semantics. Concrete examples drawn from the domains
of video editing, sound synthesis and the exploration of multimedia content for lecture
based courseware have been used to illustrate the key components of the framework.
Ontologies for sound synthesis components and the perceptual relations among sounds
were used to describe how emergent properties arise from the morphing of two audio
models into a new model. From the domain of automatic home video editing, we have
described how the basic operators of projection and compression lead to the emergence
of a stylistically edited music video with combined semantics of the source music and
video. In the video presentation example, we have shown how multiple media specific
ontologies can be used to transform high level user queries into detailed searches in the
target media.
In each of the above cases, the discovery and/or generation of emergent semantics
involved the integration of descriptions from four distinct spaces. The two input spaces
contain the models and metadata descriptions of the source media that are to be
combined. The generic space contains the domain-specific information and mappings that relate elements in the two input spaces. Finally, the blend space is where information from the other spaces is combined to generate a new production according to audio synthesis designs, cinematic editing rules, or navigational paths in
presentations as discussed in this chapter.
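As a rough illustration of the four-space structure described above, the blending step can be sketched as a merge of two input-space metadata descriptions through generic-space mappings. The function, attribute names, and merge rule below are illustrative assumptions, not the chapter's implementation:

```python
# Hypothetical sketch of the four-space blending structure.
def blend(input1, input2, generic):
    """Combine two input-space descriptions via generic-space mappings.

    `generic` maps a shared concept name to the attribute that realizes
    it in each input space; the blend keeps both realizations together.
    """
    blend_space = {}
    for concept, (attr1, attr2) in generic.items():
        blend_space[concept] = (input1[attr1], input2[attr2])
    return blend_space

# Input spaces: metadata for a music track and a home video (invented values).
music = {"tempo": 120, "mood": "upbeat"}
video = {"cut_rate": 30, "tone": "playful"}

# Generic space: domain mappings relating elements of the two inputs.
generic = {"pace": ("tempo", "cut_rate"), "feel": ("mood", "tone")}

print(blend(music, video, generic))
# {'pace': (120, 30), 'feel': ('upbeat', 'playful')}
```

A real blend space would additionally apply editing rules or synthesis designs to the paired values; this sketch only shows how the generic space relates the two inputs.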
REFERENCES
Benitez, A. B., & Chang, S. F. (2002). Multimedia knowledge integration, summarization
and evaluation. Proceedings of the 2002 International Workshop on Multimedia
Data Mining, Edmonton, Alberta, Canada (pp. 39-50).
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to
algorithms. Cambridge, MA: MIT Press.
Davis, M. (1995). Media streams: An iconic visual language for video representation.
In Baecker, R. M., Grudin, J., Buxton, W. A. S., & Greenberg, S. (Eds.), Readings
in human-computer interaction: Toward the year 2000 (2nd ed.) (pp. 854-866). San
Francisco: Morgan Kaufmann Publishers.
Dorai, C., Kermani, P., & Stewart, A. (2001, October). E-learning media navigator.
Proceedings of the 9th ACM International Conference on Multimedia, Ottawa,
Canada (pp. 634-635).
Fauconnier, G. (1997). Mappings in thought and language. Cambridge, UK: Cambridge
University Press.
Funkhouser, T., Kazhdan, M., Shilane, P., Min, P., Kiefer, W., Tal, A., et al. (2004,
August). Modeling by example. ACM Transactions on Graphics (SIGGRAPH
2004).
Kellock, P., & Altman, E. J. (2000). System and method for media production. Patent WO
02/052565.
Kovar, L., & Gleicher, M. (2003). Flexible automatic motion blending with registration
curves. Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, San Diego, California (pp. 214-224).
Mott, R. L. (1990). Sound Effects: Radio, TV and film. Focal Press.
Nack, F., & Hardman, L. (2002). Towards a syntax for multimedia semantics. CWI
Technical Report, INS-R0204, April.
Rolland, P.-Y., & Pachet, F. (1995). Modeling and applying the knowledge of synthesizer
patch programmers. In G. Widmer (Ed.), Proceedings of the IJCAI-95 International
Workshop on Artificial Intelligence and Music, 14th International Joint Conference on Artificial Intelligence, Montreal, Canada. Retrieved June 1, 2004, from
http://citeseer.ist.psu.edu/article/rolland95modeling.html
Russolo, L. (1916). The art of noises (B. Brown, Trans.). New York: Pendragon Press.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 337-351.
Sharff, S. (1982). The elements of cinema: Toward a theory of cinesthetic impact. New
York: Columbia University Press.
Slaney, M., Covell, M., & Lassiter, B. (1996). Automatic audio morphing. Proceedings
of IEEE International Conference Acoustics, Speech and Signal Processing,
Atlanta, 1-4. Retrieved June 1, 2004, from http://citeseer.nj.nec.com/
slaney95automatic.html
Smalley, D. (1997). Spectromorphology: Explaining sound shapes. Organized Sound,
2(2), 107-126.
Staab, S., Maedche, A., Nack, F., Santini, S., & Steels, L. (2002). Emergent semantics. IEEE
Intelligent Systems: Trends & Controversies, 17(1), 78-86.
Thompson Learning (n.d.). Retrieved June 1, 2004, from http://www.thompson.com/
Veale, T., & O'Donoghue, T. (2000). Computation and blending. Cognitive Linguistics,
11, 253-281.
WebCT (n.d.). Retrieved June 1, 2004, from http://www.webct.com
Glossary
B Frame: One of three picture types used in MPEG video. B pictures are bidirectionally
predicted, based on both previous and following pictures. B pictures usually require
the fewest bits. B pictures do not propagate coding errors, since they are not used
as a reference by other pictures.
Bandwidth: The amount of data that can be transferred through a specific medium is
physically constrained. This constraint, measured as the amount of data that can be
transferred per unit of time, is known as the bandwidth of the medium. Bandwidth is
measured in bps (bits per second).
Bit-rate: The rate at which a presentation is streamed, usually expressed in Kilobits per
second (Kbps).
bps: Bits-Per-Second
Compression/Decompression: A method of encoding/decoding signals that reduces the
data rate, allowing transmission (or storage) of more information than the medium
could otherwise support.
Extensible Markup Language: see XML
Encoding/Decoding: Encoding is the process of changing data from one form into
another according to a set of rules specified by a codec. The data is usually a file
containing audio, video, or still images. Often the encoding is done to make a file
compatible with specific hardware (such as a DVD Player) or to compress or reduce
the space the data occupies.
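A minimal round-trip can illustrate both the compression and the encoding/decoding entries, using zlib as a stand-in codec; the glossary does not prescribe any particular codec:

```python
import zlib

original = b"multimedia " * 100           # highly repetitive source data
encoded = zlib.compress(original)         # change form per the codec's rules
decoded = zlib.decompress(encoded)        # recover the original exactly

assert decoded == original
print(len(original), len(encoded))        # the encoded form is much smaller
```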
P Frame: A P-frame is a video frame encoded relative to the past reference frame. A
reference frame is a P- or I-frame. The past reference frame is the closest preceding
reference frame.
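The prediction idea behind P-frames can be sketched with raw per-sample deltas. Real MPEG uses motion-compensated macroblocks rather than pixel subtraction; this toy example only shows why a predicted frame can be stored in few bits:

```python
# Toy illustration of predictive coding (a simplification of MPEG).
reference = [10, 10, 10, 12]          # reference frame (I- or P-frame)
current   = [10, 11, 10, 12]          # frame to encode

p_frame = [c - r for c, r in zip(current, reference)]   # mostly zeros
decoded = [r + d for r, d in zip(reference, p_frame)]   # decoder side

print(p_frame)   # [0, 1, 0, 0]
assert decoded == current
```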
Pixel: A picture element; images are made of many tiny pixels. For example, a 13-inch
computer screen is made of 307,200 pixels (640 columns by 480 rows).
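The pixel count quoted above follows directly from the column and row counts:

```python
# The 640 x 480 example from the definition, checked directly.
columns, rows = 640, 480
print(columns * rows)  # 307200
```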
Protocol: A protocol is a set of technical conventions that allows communication
between different electronic devices; it consists of rules governing that communication.
Query: Queries are the primary mechanism for retrieving information from a database and
consist of questions presented to the database in a predefined format. Many
database management systems use the Structured Query Language (SQL) standard
query format.
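For example, a query in the SQL format mentioned above can be run against an in-memory SQLite database; the table and data below are invented for illustration:

```python
import sqlite3

# An SQL query against an in-memory SQLite database (illustrative schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE videos (title TEXT, seconds INTEGER)")
con.executemany("INSERT INTO videos VALUES (?, ?)",
                [("intro", 90), ("lecture", 3200)])

# The query: a question presented to the database in SQL's predefined format.
result = con.execute("SELECT title FROM videos WHERE seconds > 100").fetchall()
print(result)  # [('lecture',)]
```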
QuickTime: A desktop video standard developed by Apple Computer. QuickTime
Animation was created for lossless compression of animated movies and QuickTime
Video was created for lossy compression of desktop video.
Scene: A meaningful segment of the video.
Schema: A schema is a mechanism for defining the structure of data, similar to defining data types.
SQL (Structured Query Language): A specialized programming language for sending
queries to databases.
Streaming: Multimedia files are typically large, and a user does not want to wait until
the entire file has been received over an Internet connection. Streaming makes it
possible to view a portion of a file's content before the entire file arrives: the
file's data is sent continuously from the server, and the user can begin viewing the
streamed data while it loads.
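The idea can be sketched as consuming a media source in chunks as they arrive, rather than waiting for the whole file; the chunk size and data below are arbitrary:

```python
import io

# Sketch of streaming: yield chunks so a player can start before the end.
def stream_chunks(source, chunk_size=4):
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk            # the consumer can process this chunk now

media = io.BytesIO(b"abcdefghij")  # stands in for a network source
for chunk in stream_chunks(media):
    print(chunk)
```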
TCP (Transmission Control Protocol): The TCP protocol ensures the safe transmission
of data between two hosts. Information is transmitted in packets.
TCP/IP: The TCP and IP protocols in combination form the basic protocol suite of the Internet.
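A minimal loopback sketch of a TCP exchange between two endpoints on the same machine; the message is arbitrary and the OS chooses the port:

```python
import socket
import threading

# One host listens; the other connects, sends, and receives the echo.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # let the OS pick a free port
server.listen(1)

def echo_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))  # echo the received bytes back
    conn.close()

t = threading.Thread(target=echo_once)
t.start()

client = socket.create_connection(server.getsockname())
client.sendall(b"hello over TCP")
data = client.recv(1024)           # TCP delivers the bytes reliably, in order
client.close()
t.join()
server.close()
print(data)
```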
URL (Uniform Resource Locator): The standard way to give the address of any resource
on the Internet that is part of the World Wide Web (WWW).
Web (WWW) (World Wide Web): The universe of hypertext (HTTP) servers, which allow
text, graphics, sound files, and so forth, to be mixed together.
Technology, Perth, on the extraction of high-level elements from feature film, which he
completed in 2003. His research interests include computational media aesthetics, with
application to mining multimedia data for meaning, and computationally assisted,
domain-specific multimedia authoring.
Edward Altman has spent the last decade developing a theory of media blending as a
senior scientist in the Media Semantics Department at the Institute for Infocomm
Research (I2R) and before that as a visiting researcher at ATR Media Integration &
Communications Research Laboratories. His current research involves the development
of Web services, ontology management, and media analysis to create interactive
environments for distance learning. He was instrumental in the development of
semiautomated video editing technologies for muvee Technologies. He received a PhD
from the University of Illinois in 1991 and performed post-doctoral research in a joint
computer vision and cognitive science program at the Beckman Institute for Advanced
Science and Technology.
Susanne Boll is an assistant professor for multimedia and Internet technologies in the Department of Computing Science, University of Oldenburg, Germany. In 2001, Boll received
her doctorate with distinction at the Technical University of Vienna, Austria. Her studies
were concerned with the flexible multimedia document model ZYX, designed and realized
in the context of a multimedia database system. She received her diploma degree with
distinction in computer science at the Technical University of Darmstadt, Germany
(1996). Her research interests lie in the area of personalization of multimedia content,
mobile multimedia systems, multimedia information systems, and multimedia document
models. The research projects that Boll is working on include a framework for personalized multimedia content generation and development of personalized (mobile) multimedia
presentation services. She has been publishing her research results at many international
workshops, conferences and journals. Boll is an active member of SIGMM of the ACM
and German Informatics Society (GI).
Cheong Loong Fah received a BEng from the National University of Singapore and a
PhD from the University of Maryland, College Park, Center for Automation Research
(1990 and 1996, respectively). In 1996, he joined the Department of Electrical and
Computer Engineering, National University of Singapore, where he is currently an
assistant professor. His research interests are related to the basic processes in the
perception of three-dimensional motion, shape, and their relationship, as well as the
application of these theoretical findings to specific problems in navigation and in
multimedia systems, for instance, in the problems of video indexing in large databases.
Isabel F. Cruz is an associate professor of computer science at the University of Illinois
at Chicago (UIC). She holds a PhD in computer science from the University of Toronto.
In 1996, she received a National Science Foundation CAREER award. She is a member of
the National Research Council's Mapping Science Committee (2004-2006). She has been
invited to give more than 50 talks worldwide, has served on more than 70 program
committees, and has more than 60 refereed publications in databases, Semantic Web,
visual languages, graph drawing, user interfaces, multimedia, geographic information
systems, and information retrieval.
Ajay Divakaran received a BE (with Honors) in electronics and communication engineering from the University of Jodhpur, Jodhpur, India (1985), and an MS and PhD from
Rensselaer Polytechnic Institute, Troy, New York (1988 and 1993, respectively). He was
an assistant professor with the Department of Electronics and Communications Engineering, University of Jodhpur, India (1985-1986). He was a research associate at the
Department of Electrical Communication Engineering, Indian Institute of Science, in
Bangalore, India (1994-1995). He was a scientist with Iterated Systems Inc., Atlanta,
Georgia (1995-1998). He joined Mitsubishi Electric Research Laboratories (MERL) in 1998
and is now a senior principal member of the technical staff. He has been an active
contributor to the MPEG-7 video standard. His current research interests include video
analysis, summarization, indexing and compression, and related applications. He has
published several journal and conference papers, as well as four invited book chapters
on video indexing and summarization. He currently serves on program committees of key
conferences in the area of multimedia content analysis.
Thomas S. Huang received a BS in electrical engineering from the National Taiwan
University, Taipei, Taiwan, ROC, and an MS and ScD in electrical engineering from the
Massachusetts Institute of Technology (MIT), Cambridge. He was with the faculty of
the Department of Electrical Engineering at MIT (1963-1973) and with the faculty of the
School of Electrical Engineering and Signal Processing at Purdue University, West
Lafayette, Indiana (1973-1980). In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now William L. Everitt distinguished professor of electrical and
computer engineering, research professor at the Coordinated Science Laboratory, and
head of the Image Formation and Processing Group at the Beckman Institute for
Advanced Science and Technology. He is also co-chair of the institute's major research
theme (human computer intelligent interaction). During his sabbatical leaves, he has
been with the MIT Lincoln Laboratory, Lexington, MA; IBM T.J. Watson Research
Center, Yorktown Heights, NY; and the Rheinisches Landesmuseum, Bonn, West Germany.
He held visiting professor positions at the Swiss Federal Institutes of Technology,
Zurich and Lausanne, Switzerland; University of Hannover, West Germany; INRSTelecommunications, University of Quebec, Montreal, QC, Canada; and University of
Tokyo, Japan. He has served as a consultant to numerous industrial forums and
government agencies both in the United States and abroad. His professional interests
lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. He has published 14 books and more than 500 papers
in network theory, digital filtering, image processing, and computer vision. He is a
founding editor of the international journal Computer Vision, Graphics, and Image
Processing, and editor of the Springer Series in Information Sciences (Springer-Verlag).
Dr. Huang is a member of the National Academy of Engineering; a foreign member of the
Chinese Academies of Engineering and Sciences; and a fellow of the International
Association of Pattern Recognition and of the Optical Society of America. He has
received a Guggenheim Fellowship, an Alexander von Humboldt Foundation Senior US
Scientist Award, and a fellowship from the Japan Society for the Promotion of Science. He
received the IEEE Signal Processing Society's Technical Achievement Award in 1987 and
the Society Award in 1991. He was awarded the IEEE Third Millennium Medal in 2000.
In addition, in 2000 he received the Honda Lifetime Achievement Award for contributions to motion analysis. In 2001, he received the IEEE Jack S. Kilby Medal. In 2002, he
received the King-Sun Fu Prize from the International Association of Pattern Recognition
and the Pan Wen-Yuan Outstanding Research Award.
Jesse S. Jin graduated with a PhD from the University of Otago, New Zealand. He worked
as a lecturer at Otago; a lecturer, senior lecturer, and associate professor at the University
of New South Wales; and an associate professor at the University of Sydney. He is now the
chair professor of IT at The University of Newcastle. Professor Jin's areas of interest
include multimedia technology, medical imaging, computer vision and the Internet. He
has published more than 160 articles and 14 authored or edited books. He also has one
patent and is in the process of filing three more. He has received several million dollars
in research funding from government agencies (ARC, DIST, etc.), universities (UNSW,
USyd, Newcastle, etc.), industries (Motorola, NewMedia, Cochlear, Silicon Graphics,
Proteome Systems, etc.), and overseas organisations (NZ Wool Board, UGC HK, CAS,
etc.). He established a spin-off company that won the 1999 ATP Vice-Chancellor New
Business Creation Award. He is a consultant to companies such as Motorola, Computer
Associates, ScanWorld, Proteome Systems, and HyperSoft.
Ashraf A. Kassim (M'81) received his BEng (First Class Honors) and MEng degrees in
electrical engineering from the National University of Singapore (NUS) (1985 and 1987,
respectively). From 1986 to 1988, he worked on the design and development of machine
vision systems at Texas Instruments. He went on to obtain his PhD in electrical and
computer engineering from Carnegie Mellon University, Pittsburgh (1993). Since 1993,
he has been with the Electrical & Computer Engineering Department at NUS, where he
is currently an associate professor and deputy head of the department. Dr. Kassim's
research interests include computer vision, video/image processing, and compression.
Wolfgang Klas is professor at the Department of Computer Science and Business
Informatics at the University of Vienna, Austria, heading the multimedia information
systems group. Until 2000, he was professor with the Computer Science Department at
the University of Ulm, Germany. Until 1996, he was head of the Distributed Multimedia
Systems Research Division (DIMSYS) at GMD-IPSI, Darmstadt, Germany. From 1991 to
1992, Dr. Klas was a visiting fellow at the International Computer Science Institute (ICSI),
University of California at Berkeley, USA. His research interests are in multimedia
information systems and Internet-based applications. He currently serves on the editorial board of the Very Large Data Bases Journal and has been a member and chair of
program committees of many conferences.
Shonali Krishnaswamy is a research fellow at the School of Computer Science and
Software Engineering at Monash University, Melbourne, Australia. Her research interests include service-oriented computing, distributed and ubiquitous data mining, software agents, and rough sets. She received her master's and PhD in computer science from
Monash University. She is a member of IEEE and ACM.
Neal Lesh's research efforts currently focus on human-computer collaborative interface
agents and interactive data exploration. He has recently published papers on a range of
topics including computational biology, data mining, information visualization, human-robot interaction, planning, combinatorial optimization, story-sharing systems, and
intelligent tutoring. Before joining MERL, Neal completed a PhD at the University of
Washington and worked briefly as a post-doctoral student at the University of Rochester.
Joo-Hwee Lim received his BSc (Hons I) and MSc (by research) in computer science from
the National University of Singapore (1989 and 1991, respectively). He joined the
Institute for Infocomm Research, Singapore, in October 1990. He has conducted research
in connectionist expert systems, neural-fuzzy systems, handwriting recognition, multiagent systems, and content-based retrieval. He was a key researcher in two international
research collaborations, namely the Real World Computing Partnership funded by METI,
Japan, and the Digital Image/Video Album project with CNRS, France, and School of
Computing, National University of Singapore. He has published more than 50 refereed
international journal and conference papers in his research areas, including content-based processing, pattern recognition, and neural networks.
Namunu C. Maddage is currently pursuing a PhD in computer science in the School of
Computing, National University of Singapore, Singapore. His research interests are in the
areas of music modeling, music structure analysis and audio/music data mining. He
received a BE in 2000 from the Department of Electrical & Electronic Engineering, Birla
Institute of Technology (BIT), Mesra, India.
Ankush Mittal received BTech and master's (by research) degrees in computer science
and engineering from the Indian Institute of Technology, Delhi. He received a PhD from
the National University of Singapore (2001). Since October 2003, he has been working
as assistant professor at the Indian Institute of Technology - Roorkee. Prior to this, he
was serving as a faculty member in the Department of Computer Science, National
University of Singapore. His research interests are in multimedia indexing, machine
learning, and motion analysis.
Baback Moghaddam is a senior research scientist at Mitsubishi Electric Research Labs
(MERL), Cambridge, MA, USA. His research interests are in computational vision with a
focus on probabilistic visual learning, statistical modeling, and pattern
recognition with application in biometrics and computer-human interface. He obtained
his PhD in electrical engineering and computer science (EECS) from the Massachusetts
Institute of Technology (MIT) in 1997. There, he was a member of the Vision and Modeling
Group at the MIT Media Laboratory, where he developed a fully-automatic vision system
which won DARPA's 1996 FERET Face Recognition Competition. Dr. Moghaddam was
the winner of the 2001 Pierre Devijver Prize from the International Association of Pattern
Recognition for his innovative approach to face recognition and received the Pattern
Recognition Society Award for exceptional quality for his journal paper "Bayesian Face
Recognition." He currently serves on the editorial board of the journal Pattern
Recognition and has contributed to numerous textbooks on image
processing and computer vision (including the core chapter in Springer-Verlag's latest
biometric series, Handbook of Face Recognition).
Anne H.H. Ngu is an associate professor with the Department of Computer Science, Texas
State University, San Marcos, Texas. Ngu received her PhD in 1990 from the University
of Western Australia. She has more than 15 years of experience in research and
development in IT with expertise in integrating data and applications on the Web,
multimedia databases, Web services, and object-oriented technologies. She has worked
in different countries as a researcher, including at the Institute of Systems Science in
Singapore; Tilburg University, The Netherlands; and Telcordia Technologies and MCC in
Austin, Texas. Prior to moving to the United States, she worked as a senior lecturer in
the School of Computer Science and Engineering, University of New South Wales
(UNSW). Currently, she also holds an adjunct associate professor position at UNSW and
a summer faculty scholar position at Lawrence Livermore National Laboratory, California.
Krishnan V. Pagalthivarthi is an associate professor in the Department of Applied
Mechanics, Indian Institute of Technology Delhi, India. Dr. Krishnan received his BTech
from IIT Delhi (1979) and obtained his MSME (1984) and PhD (1988) from the Georgia
Institute of Technology. He has supervised several students studying for their MTech,
MS (R), and PhD degrees and has published numerous research papers in various journals.
Silvia Pfeiffer received her master's degree in computer science and business management from the University of Mannheim, Germany (1993). She returned to that university
in 1994 to pursue a PhD within the MoCA (Movie Content Analysis) project, exploring
novel extraction methods for audio-visual content and novel applications using these.
Her 1999 thesis was about audio content analysis of digital video. Next, she moved
to Australia to work as a research scientist in digital media at the CSIRO in Sydney. She
has explored several projects involving automated content analysis in the compressed
domain, focusing on segmentation applications. She has also made active submissions to
MPEG-7. In January 2001, she conceived the initial ideas for a web of continuous media, the
specifications of which were worked out within the continuous media web research group
that she is heading.
Conrad Parker works as a senior software engineer at CSIRO, Australia. He is actively
involved in various open source multimedia projects, including development of the Linux
and Unix sound editor Sweep. With Dr. Pfeiffer, he developed the mechanisms for
streamable metadata encapsulation used in the Annodex format, and is responsible for
development of the core software libraries, content creation tools and server modules of
the reference implementation. His research focuses on interesting applications of
dynamic media generation, and improved TCP congestion control for efficient delivery
of media resources.
André Pang received his Bachelor of Science (Honors) at the University of New South
Wales, Sydney, Australia (2003). He has been involved with the Continuous Media Web
project since 2001, helping to develop the first specifications and implementations of the
Annodex technology and implementing the first Annodex Browser under Mac OS X.
André is involved in integrating Annodex support into several media frameworks, such
as the VideoLAN media player, DirectShow, xine, and QuickTime. In his spare time, he
enjoys researching about compilers and programming languages, and also codes on
many different open-source projects.
Viranga Ratnaike is a PhD candidate in the School of Computer Science and Software
Engineering, Faculty of Information Technology, Monash University, Melbourne,
Australia. He holds a Bachelor of Applied Science (Honors) in computer science. After
being a programmer for several years, he decided to return to full-time study and pursue
a career in research. His research interests are in emergence, artificial intelligence and
nonverbal knowledge representation.
Olga Sayenko received her BS from the University of Illinois at Chicago (2001). She is
working toward her MS under the direction of Dr. Cruz with the expected graduation date
in July 2004.
Karin Schellner studied computer science at the Technical University of Vienna and
received her diploma degree in 2000. From 1995 to 2001, she was working at IBM Austria.
From 2001 to 2003, she worked at the Department of Computer Science and Business
Informatics at the University of Vienna. Since 2003, she has been a member of Research
Studios Austria Digital Memory Engineering. She has been responsible for the concept,
design and implementation of the data model developed in CULTOS.
Ansgar Scherp received his diploma degree in computer science at the Carl von
Ossietzky University of Oldenburg, Germany (2001), with the diploma thesis "Process
Model and Development Methodology for Virtual Laboratories." Afterwards, he worked
for two years at the University of Oldenburg where he developed methods and tools for
virtual laboratories. Since 2003 he has been working as a scientific assistant at the
research institute OFFIS on the MM4U (Multimedia for you) project. The aim of this
project is the development of a component-based object-oriented software framework
that offers extensive support for the dynamic generation of personalized multimedia
content.
Xi Shao received a BS and MS in computer science from Nanjing University of Posts and
Telecommunications, Nanjing, P.R. China (1999 and 2002, respectively). He is currently
pursuing a PhD in computer science in the School of Computing, National University of
Singapore, Singapore. His research interests include content-based audio/music analysis, music information retrieval, and multimedia communications.
Chia Shen is associate director and a senior research scientist at Mitsubishi Electric
Research Labs (MERL), Cambridge, MA, USA. Dr. Shen's research investigates HCI issues
in our understanding of multi-user, computationally augmented interactive surfaces,
such as digital tabletops and walls. Her research probes new ways of thinking in terms
of UI design and interaction-technique development, and entails the re-examination of
the conventional metaphor and underlying system infrastructure, which have traditionally
been geared towards mouse- and keyboard-based, single-user desktop computers and
devices. Her current research projects include DiamondSpin, UbiTable, and PDH (see
www.merl.com/projects for details).
Jialie Shen received his BSc in applied physics from Shenzhen University, China. He is
now a PhD candidate and associate lecturer in the School of Computer Science and
Engineering at the University of New South Wales (Sydney, Australia). His research
interests include database systems, indexing, multimedia databases and data mining.
Bala Srinivasan is a professor of information technology in the School of Computer
Science and Software Engineering at the Faculty of Information Technology, Monash
Univeristy, Melbourne, Australia. He was formerly an academic staff member of the
Department of Computer Science and Information Systems at the National University of
Singapore and the Indian Institute of Technology, Kanpur, India. He has authored and
jointly edited six technical books and authored and co-authored more than 150
international refereed publications in journals and conferences in the areas of
multimedia databases, data communications, data mining, and distributed systems.
He is a founding chairman of the Australasian Database Conference. He was awarded the
Monash Vice-Chancellor medal for post-graduate supervision. He holds a Bachelor of
Engineering (Honors) in electronics and communication engineering, and a masters and
PhD, both in computer science.
Qi Tian received his PhD in electrical and computer engineering from the University of
Illinois at Urbana-Champaign (UIUC), Illinois (2002). He received his MS in electrical and
computer engineering from Drexel University, Philadelphia, Pennsylvania (1996), and a
BE in electronic engineering from Tsinghua University, China (1992). He has been an
assistant professor in the Department of Computer Science at the University of Texas at
San Antonio (UTSA) since 2002 and an adjunct assistant professor in the Department
of Radiation Oncology at the University of Texas Health Science Center at San Antonio
(UTHSCSA) since 2003. Before he joined UTSA, he was a research assistant at the Image
Formation and Processing (IFP) Group of the Beckman Institute for Advanced Science
and Technology and a teaching assistant in the Department of Electrical and Computer
Engineering at UIUC (1997-2002). During the summers of 2000 and 2001, he was an intern
researcher with the Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA.
In the summer 2003, he was a visiting professor of NEC Laboratories America, Inc.,
Cupertino, CA, in the Video Media Understanding Group. His current research interests
include multimedia, computer vision, machine learning, and image and video processing.
He has published about 40 technical papers in these areas and has served on the program
committee of several conferences in the area of content-based image retrieval. He is a
senior member of IEEE.
Svetha Venkatesh is a professor at the School of Computing at the Curtin University of
Technology, Perth, Western Australia. Her research is in the areas of large-scale pattern
recognition, image understanding and applications of computer vision to image and
video indexing and retrieval. She is the author of about 200 research papers in these areas
and is currently co-director for the Center of Excellence in Intelligent Operations
Management.
Utz Westermann is a member of the Department of Computer Science and Business
Informatics at the University of Vienna. He received his diploma degree in computer
science at the University of Ulm, Germany (1998), and his doctoral degree in technical
sciences at the Technical University of Vienna, Austria (2004). His research interests
lie in the area of context-aware multimedia information systems. This includes metadata
standards for multimedia content, metadata management, XML, XML databases, and
multimedia databases. Utz Westermann has participated in several third-party-funded
projects in this domain.
Campbell Wilson received his masters degree and PhD in computer science from
Monash University. His research interests include multimedia retrieval techniques,
probabilistic reasoning, virtual reality interfaces and adaptive user profiling. He is a
member of the IEEE.
Ying Wu received his PhD in electrical and computer engineering from the University of
Illinois at Urbana-Champaign (UIUC), Urbana, Illinois (2001). From 1997 to 2001, he was
a research assistant at the Beckman Institute at UIUC. During 1999 and 2000, he was
with Microsoft Research, Redmond, Washington. Since 2001, he has been an assistant
professor at the Department of Electrical and Computer Engineering of Northwestern
University, Evanston, Illinois. His current research interests include computer vision,
machine learning, multimedia, and human-computer interaction. He received the Robert
T. Chien Award at UIUC, and is a recipient of the NSF CAREER award.
Lonce Wyse heads the Multimedia Modeling Lab at the Institute for Infocomm Research
(I2R). He also holds an adjunct position at the National University of Singapore, where
he teaches a course in sonic arts and sciences. He received his PhD in 1994 in cognitive
and neural systems from Boston University, specializing in vision and hearing systems,
and then spent a year as a Fulbright Scholar in Taiwan before joining I2R. His current
research focus is applications and techniques for developing sound models.
Changsheng Xu received his PhD from Tsinghua University, China (1996). From 1996 to
1998, he was a research associate professor in the National Lab of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences. He joined the Institute for
Infocomm Research (I2R) of Singapore in March 1998. Currently, he is head of the Media
Analysis Lab in I2R. His research interests include multimedia content analysis/indexing/
retrieval, digital watermarking, computer vision and pattern recognition. He is a senior
member of IEEE.
Jie Yu is a PhD candidate in computer science at the University of Texas at San Antonio
(UTSA). He received his bachelors degree in telecommunication engineering from Dong
Hua University, China (2000). He has been a research assistant and teaching assistant
in the Department of Computer Science at UTSA since 2002. His current research in image
processing is concerned with the study of efficient algorithms in content-based image
retrieval.
Sonja Zillner studied mathematics at the University of Freiburg, Germany, and received her
diploma degree in 1999. Since 2000, she has been a member of the scientific staff of the
Department of Computer Science and Business Informatics at the University of Vienna.
Her research interests lie in the areas of semantic multimedia content modeling and
e-commerce. She has participated in the EU project CULTOS (Cultural Units of Learning
Tools and Services).
Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Samar Zutshi received his masters degree in information technology from Monash
University, Melbourne, Australia, during which he did some work on agent communication.
After a stint in the software industry, he is back at Monash doing what he enjoys:
research and teaching. He is working in the area of relevance feedback in multimedia
retrieval for his PhD.
Index
A
abstraction 188
acoustical music signal 117
ADSR 100
amplitude envelope 100
annodex 161
artificial life 355
attributes ranking method 107
audio
model 365
morph 376
production 368
B
beat space segmentation 122
blending
network 370
theory 372
C
censor 93
Cepstrum 102
class relative indexing 40
classical approach 290
clips 173
clustering 137
CMWeb 161
collection DS 186
combining multiple visual features (CMVF) 3
composite image features 6
computational theory 374
constrained generating procedures (CGP) 352
content-based
multimedia retrieval 289
music classification 99
music summarization 99
context information extraction 82
continuous media Web 160
D
data layer 340
DelaunayView 334
description scheme (DS) 183
digital cameras 33
digital item 164
distance-based access methods 6
domain 235
dynamic authoring 252
dynamic Bayesian network (DBN) 77
F
facial images 35
film semiotics 138
functional aspect 306
G
genotype 352
granularity 187
graphical user interfaces (GUI) 247
ground truth based method 107
H
human visual perception 10
human-computer partnership 223
hybrid dimension reducer 6
hybrid training algorithm 13
I
IETF 161
image
classification 35
databases 1
distortion 19
feature dimension reduction 4
feature vectors 2
similarity measurement 5
iMovie 226
indexing 1
instrument
detection 119
identification 119
integration layer 341
Internet 162
Internet engineering task force 161
interpretation models 150
J
just-in-time (JIT) 228
L
LayLab 336
logical media parts 315
low energy component 102
S
semantic
indexing 37
region detection 117
region indexing (SRI) 34
Web 306
semantics 30, 136, 333
Shockwave file format 249
Sightseeing4U 277
SMIL 162
song structure 121
spatial
access methods (SAMs) 2
relationship 140
relationship operators 148
spectral
centroid 101
contrast feature 103
spectrum
flux 101
rolloff 101
speech 118
Sports4U 278
synchronized multimedia integration
language 162
system architecture 340
T
temporal
grouping 137
motion activity 80
ordering 88
relationship 140
relationship operators 148
URI 166
timbral textural features 100
time segments 173
time-to-collision (TTC) 80
U
usage information 186
user interaction 187
user interfaces 255
user response 295
user-centric modeling 298
V
video
content 135
data 77
data model 142
edit 381
editing 367
metamodel framework 136
object (VO) 142
retrieval systems 77
semantics 135
VIMET 135
visual keywords 34
W
World Wide Web 160
Z
zero crossing rates 102