
Managing

Multimedia
Semantics
Uma Srinivasan
CSIRO ICT Centre, Australia
Surya Nepal
CSIRO ICT Centre, Australia

IRM Press

Publisher of innovative scholarly and professional information technology titles in the cyberage

Hershey London Melbourne Singapore

Acquisitions Editor: Rene Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editor: Michael Jaquish
Typesetter: Jennifer Neidig
Cover Design: Lisa Tosheff
Printed at: Integrated Book Technology

Published in the United States of America by
IRM Press (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033-1240
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@idea-group.com
Web site: http://www.irm-press.com
and in the United Kingdom by
IRM Press (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk
Copyright © 2005 by Idea Group Inc. All rights reserved. No part of this book may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including
photocopying, without written permission from the publisher.
Product or company names used in this book are for identification purposes only. Inclusion of the
names of the products or companies does not indicate a claim of ownership by IGI of the
trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Managing multimedia semantics / Uma Srinivasan and Surya Nepal, editors.
p. cm.
Summary: "This book is aimed at researchers and practitioners involved in designing and managing
complex multimedia information systems"--Provided by publisher.
Includes bibliographical references and index.
ISBN 1-59140-569-6 (h/c) -- ISBN 1-59140-542-4 (s/c) -- ISBN 1-59140-543-2 (ebook)
1. Multimedia systems. I. Srinivasan, Uma, 1948- II. Nepal, Surya, 1970-
QA76.575.M3153 2005
006.7--dc22
2004029850
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in
this book are those of the authors, but not necessarily of the publisher.

Managing
Multimedia Semantics
Table of Contents
Preface ........................................................................................................................... vi
SECTION 1: SEMANTIC INDEXING AND RETRIEVAL OF IMAGES
Chapter 1
Toward Semantically Meaningful Feature Spaces for Efficient Indexing in Large
Image Databases ............................................................................................................. 1
Anne H.H. Ngu, Texas State University, USA
Jialie Shen, The University of New South Wales, Australia
John Shepherd, The University of New South Wales, Australia
Chapter 2
From Classification to Retrieval: Exploiting Pattern Classifiers in Semantic
Image Indexing and Retrieval ......................................................................................... 30
Joo-Hwee Lim, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Chapter 3
Self-Supervised Learning Based on Discriminative Nonlinear Features and Its
Applications for Pattern Classification ......................................................................... 52
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA
SECTION 2: AUDIO AND VIDEO SEMANTICS: MODELS AND STANDARDS
Chapter 4
Context-Based Interpretation and Indexing of Video Data ............................................. 77
Ankush Mittal, IIT Roorkee, India
Cheong Loong Fah, The National University of Singapore, Singapore
Ashraf A. Kassim, The National University of Singapore, Singapore
Krishnan V. Pagalthivarthi, IIT Delhi, India

Chapter 5
Content-Based Music Summarization and Classification ............................................. 99
Changsheng Xu, Institute for Infocomm Research, Singapore
Xi Shao, Institute for Infocomm Research, Singapore
Namunu C. Maddage, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Qi Tian, Institute for Infocomm Research, Singapore
Chapter 6
A Multidimensional Approach for Describing Video Semantics .................................. 135
Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia
Chapter 7
Continuous Media Web: Hyperlinking, Search and Retrieval of Time-Continuous
Data on the Web ............................................................................................................. 160
Silvia Pfeiffer, CSIRO ICT Centre, Australia
Conrad Parker, CSIRO ICT Centre, Australia
Andre Pang, CSIRO ICT Centre, Australia
Chapter 8
Management of Multimedia Semantics Using MPEG-7 ................................................ 182
Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA
SECTION 3: USER-CENTRIC APPROACH TO MANAGE SEMANTICS
Chapter 9
Visualization, Estimation and User Modeling for Interactive Browsing of Personal
Photo Libraries .............................................................................................................. 193
Qi Tian, University of Texas at San Antonio, USA
Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA
Neal Lesh, Mitsubishi Electric Research Laboratories, USA
Chia Shen, Mitsubishi Electric Research Laboratories, USA
Thomas S. Huang, University of Illinois, USA
Chapter 10
Multimedia Authoring: Human-Computer Partnership for Harvesting Metadata from
the Right Sources .......................................................................................................... 223
Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia
Chapter 11
MM4U: A Framework for Creating Personalized Multimedia Content ........................ 246
Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany

Chapter 12
The Role of Relevance Feedback in Managing Multimedia Semantics: A Survey ........ 288
Samar Zutshi, Monash University, Australia
Campbell Wilson, Monash University, Australia
Shonali Krishnaswamy, Monash University, Australia
Bala Srinivasan, Monash University, Australia
SECTION 4: MANAGING DISTRIBUTED MULTIMEDIA
Chapter 13
EMMO: Tradeable Units of Knowledge-Enriched Multimedia Content ......................... 305
Utz Westermann, University of Vienna, Austria
Sonja Zillner, University of Vienna, Austria
Karin Schellner, ARC Research Studio Digital Memory Engineering,
Vienna, Austria
Wolfgang Klaus, University of Vienna and ARC Research Studio Digital
Memory Engineering, Vienna, Austria
Chapter 14
Semantically Driven Multimedia Querying and Presentation ...................................... 333
Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA
SECTION 5: EMERGENT SEMANTICS
Chapter 15
Emergent Semantics: An Overview ............................................................................... 351
Viranga Ratnaike, Monash University, Australia
Bala Srinivasan, Monash University, Australia
Surya Nepal, CSIRO ICT Centre, Australia
Chapter 16
Emergent Semantics from Media Blending ................................................................... 363
Edward Altman, Institute for Infocomm Research, Singapore
Lonce Wyse, Institute for Infocomm Research, Singapore
Glossary ......................................................................................................................... 391
About the Authors .......................................................................................................... 396
Index .............................................................................................................................. 406


Preface

Today most documented information is in digital form. Digital information, in turn, is rapidly moving from textual information to multimedia information that includes
images, audio and video content. Yet searching and retrieving required information is a
challenging and arduous task, because it is difficult to access just the required parts of
information stored in a database. In the case of text documents, the table of contents
serves as an index to different sections of the document. However, creating a similar
index that points to different parts of multimedia content is not an easy task. Manual
indexing of audiovisual content can be subjective, as there are several ways to describe
the multimedia information depending on the user, the purpose of use, and the task that
needs to be performed. The problem gets even murkier, as the purpose for retrieval is
often completely different from the purpose for which the content was created, annotated and stored in a database.
Work in the area of multimedia information retrieval started with techniques that
could automatically index the content based on some inherent features that could be
extracted from one medium at a time. For example, features that can be extracted from
still images are colour, texture and shape of objects represented in the image. In the
case of a video, static features such as colour, texture and shape are no longer adequate
to index visual content that has been created using powerful film editing techniques
that can shape viewers' experiences. For audio, the types of features that can be
extracted are pitch, tonality, harmonicity, and so forth, which are quite distinct from
visual features.
Feature extraction and classification techniques draw from a number of disciplines such as artificial intelligence, vision and pattern recognition, and signal processing. While automatic feature extraction does offer some objective measures to index the
content of an image, it is insufficient for the retrieval task, as information retrieval is
based on the rich semantic notions that humans can conjecture in their minds while
retrieving audiovisual information. The other alternative is to index multimedia information using textual descriptions. But this has the problem of subjectivity, as it is hard to
have a generic way to first describe and then retrieve semantic information that is
universally acceptable. This is inevitable as users interpret semantics associated with
the multimedia content in so many different ways, depending on the context and use of
the information. This leads to the problem of managing multiple semantics associated with the same material. Nevertheless, the need to retrieve multimedia information grows
inexorably, carrying with it the need to have tools that can facilitate search and retrieval
of multimedia content at a semantic or a conceptual level to meet the varying needs of
different users.
There are numerous conferences that are still addressing this problem. Managing
multimedia semantics is a complex task and continues to be an active research area that
is of interest to different disciplines. Individual papers on multimedia semantics can be
found in many journals and conference proceedings. Meersman, Tari and Stevens
(1999) present a compilation of works that were presented at the IFIP Data Semantics
Working Conference held in New Zealand. The working group focused on issues that
dealt with semantics of the information represented, stored and manipulated by multimedia systems. The topics covered in this book include: data modeling and query
languages for multimedia; methodological aspects of multimedia database design, information retrieval, knowledge discovery and mining, and multimedia user interfaces.
The book covers six main thematic areas. These are: Video Data Modeling and Use;
Image Databases; Applications of Multimedia Systems; Multimedia Modeling; Multimedia Information Retrieval; Semantics and Metadata. This book offers a good glimpse
of the issues that need to be addressed from an information systems design perspective. Here semantics is addressed from the point of view of querying and retrieving
multimedia information from databases.
In order to retrieve multimedia information more effectively, we need to go deeper
into the content and exploit results from the vision community, where the focus has
been in understanding inherent digital signal characteristics that could offer insights
into semantics situated within the visual content. This aspect is addressed in Bimbo
(1999), where the focus is mainly on visual feature extraction techniques used for content-based retrieval of images. The topics discussed are image retrieval by colour similarity, image retrieval by texture similarity, image retrieval by shape similarity, image
retrieval by spatial relationships, and finally one chapter on content-based video retrieval. The focus here is on low-level feature-based content retrieval. Although several algorithms have been developed for detecting low-level features, the multimedia
community has realised that content-based retrieval (CBR) research has to go beyond
low-level feature extraction techniques. We need the ability to retrieve content at more
abstract levels, the levels at which humans view multimedia information. The vision
research then moved on from low-level feature extraction in still images to segment
extraction in videos. Semantics becomes an important issue when identifying what
constitutes a meaningful segment. This shifts the focus from image and video analysis
(of single features) to synthesis of multiple features and relationships to extract more
complex information from videos. This idea is further developed in Dorai and Venkatesh
(2002), where the theme is to derive high-level semantic constructs from automatic
analysis of media. That book uses media production and principles of film theory as the
bases to extract higher-level semantics in order to index video content. The main chapters include applied media aesthetics, space-time mappings, film tempo, modeling colour
dynamics, scene determination using auditive segmentation, and determining effective
events.
In spite of the realisation within the research community that multimedia research
needs to be enhanced with semantics, research output has been discipline-based. Therefore, there is no single source that presents all the issues associated with modeling, representing and managing multimedia semantics in order to facilitate information retrieval at a semantic level desired by the user. And, more importantly, research has
progressed by handling one medium at a time. At the user level, we do know that
multimedia information is not just a collection of monomedia types. Although each
media type has its own inherent properties, multimedia information has a coherence
that can only be perceived if we take a holistic approach to managing multimedia semantics. It is our hope that this book fills this gap by addressing the whole spectrum of
problems that need to be addressed in order to manage multimedia semantics, from an
application perspective, that adds value to the user community.

OUR APPROACH TO
ADDRESS THIS CHALLENGE
The objective of the book, Managing Multimedia Semantics, is to assemble in
one comprehensive volume the research problems, theoretical frameworks, tools and
technologies that contribute towards managing multimedia semantics. The complexity
of managing multimedia semantics has given rise to many frameworks, models, standards and solutions. The book aims to highlight both current techniques and future
trends in managing multimedia semantics.
We systematically define the problem of multimedia semantics and present approaches that help to model, represent and manage multimedia content, so that information systems deliver the promise of providing access to the rich content held in the
vaults of multimedia archives. We include topics from different disciplines that contribute to this field and synthesise the efforts towards addressing this complex problem. It is our hope that the technologies described in the book could lead to the development of new tools to facilitate search and retrieval of multimedia content at a semantic or a conceptual level to meet the varying needs of the user community.

ORGANISATION OF THIS BOOK


The book takes a close look at each piece of the puzzle that is required to address
the multimedia semantic problem. The book contains 16 chapters organised under five
sections. Each section addresses a major theme or topic that is relevant for managing
multimedia semantics. Within a section, each chapter addresses a unique research or
technology issue that is essential to deliver tools and technologies to manage the
multimedia semantics problem.

Section 1: Semantic Indexing and Retrieval of Images


Chapters 1, 2 and 3 deal with semantic indexing, classification and retrieval techniques related to images.
Chapter 1 describes a feature-based indexing technique that uses low-level feature vectors to index and retrieve images from a database. The interesting aspect of the
architecture here is that the feature vector carries some semantic properties of the
image along with low-level visual properties. This is moving one step towards semantic
indexing of images using low-level feature vectors that carry image semantics.


Chapter 2 addresses the semantic gap that exists between a user's query and low-level visual features that can be extracted from an image. This chapter presents a state-of-the-art review of pattern classifiers in content-based image retrieval systems, and
then extends these ideas from pattern recognition to object recognition. The chapter
presents three new indexing schemes that exploit pattern classifiers for semantic indexing.
Chapter 3 takes the next step in the object recognition problem, and proposes a
self-supervised learning algorithm called KDEM (Kernel Discriminant-EM) to speed up
semantic classification and recognition problems. The algorithms are tested for image
classification, hand posture recognition and fingertip tracking.
We then move on from image indexing to context-based interpretation and indexing of videos.

Section 2: Audio and Video Semantics: Models and Standards

Chapter 4 describes the characterisation of video data using the temporal behaviour of features and the context provided by the application domain to situate a shot. A framework based on Dynamic Bayesian Networks is presented to
position the video segment within an application and provide an interpretation within
that context. The framework learns the temporal structure through the fusion of all
features, and removes the cumbersome task of manually designing a rule-based system
for providing the high-level interpretation.
Chapter 5 moves on to audio and presents a comprehensive survey of content-based music summarisation and classification. This chapter describes techniques used
in audio feature extraction, music representation, and summarisation for both audio and
music videos. The chapter further identifies emerging areas in genre classification,
determining song structure, rhythm extraction, and semantic region extraction in music
signals.
Chapter 6 takes a holistic approach to video semantics, presenting a multidimensional model for describing and representing video semantics at several levels of abstraction, from the perceptual to more abstract levels. The video metamodel VIMET
supports incremental description of semantics, and presents a framework that is generic and not definitive, while still supporting the development of application-specific
semantics that exploit feature-based retrieval techniques. Although the chapter addresses video semantics, it provides a nice framework that encompasses several aspects of multimedia semantics.
Chapter 7 presents the Continuous Media Web, an approach that enables the
searching of time-continuous media such as audio and video using extensions to standard Web-based browsing tools and technology. In particular, the chapter presents the
Annodex file format that enables the creation of webs of audio and video documents
using the continuous media markup language (CMML). Annodex extends the idea of
surfing the web of text documents to an integrated approach of searching, surfing and
managing the web of text and media resources.
Chapter 8 examines the role of the new MPEG-7 standard in facilitating the management of multimedia semantics. This chapter presents an overview of the MPEG-7 Content Description Interface and examines the Description Schemes (DS) and Descriptors (Ds) used to address multimedia semantics at several levels of granularity and

abstraction. The chapter presents a discussion on application development using MPEG-7 descriptions. Finally, the chapter discusses some strengths and weaknesses of the
standard in addressing multimedia semantics.

Section 3: User-Centric Approach to Manage Semantics

Chapters 9, 10, 11 and 12 move away from a media-centric approach and take a
user-centric perspective while creating and interacting with multimedia content.
Chapter 9 presents a user-centric algorithm for visualisation and layout for content-based image retrieval from a large photo library. The framework facilitates an intuitive visualisation that adapts to the user's time-varying notions of content, context and
preferences in navigation and style. The interface is designed as a touch-sensitive,
circular table-top display, which is being used in the Personal Digital Historian project
that enables interactive exploratory story telling.
Chapter 10 deals with a holistic approach to multimedia authoring and advances
the idea of creating multimedia authoring tools for the amateur media creator. The
chapter proposes that in order to understand media semantics, the media author needs
to address a number of issues. These involve a deep understanding of the media
creation process; knowledge of the deeper structures of content; and the surface manifestations in the media within an application domain. The chapter explores software
and human interactions in the context of implementing a multimedia authoring tool in a
target domain and presents a future outlook on multimedia authoring.
Chapter 11 presents MM4U, a software framework to support the dynamic composition and authoring of personalised multimedia content. It focuses on how to
assemble and deliver multimedia content personalised to reflect the user's context,
specific background, interest and knowledge, as well as the physical infrastructure
conditions. Further, the application of the MM4U framework is illustrated through the
implementation of two applications: a personalised city guide delivered on a mobile
device, and a personalised sports ticker application that combines multimedia events
(audio, video and text-based metadata) to compose a coherent multimedia application
delivered on the preferred device.
Chapter 12 considers the role of the mature relevance feedback technology, which
is normally used for text retrieval, and examines its applicability for multimedia retrieval.
The chapter surveys a number of techniques used to implement relevance feedback
while including the human in the loop during information retrieval. An analysis of these
techniques is used to develop the requirements of a relevance feedback technique that
can be applied for semantic multimedia retrieval. The requirements analysis is used to
develop a user-centric framework for relevance feedback in the context of multimedia
information retrieval.

Section 4: Managing Distributed Multimedia


Chapters 13 and 14 explore multimedia content retrieval and presentation in a
distributed environment.
Chapter 13 addresses the problem that occurs due to the separation of content
from its description and functionality while exchanging or sharing content in a collaborative multimedia application environment. The chapter proposes a content modeling formalism based on enhanced multimedia metaobjects (Emmo) that can be exchanged in their entirety, covering the media aspect, the semantic aspect and the functional aspect
of the multimedia content. The chapter further outlines a distributed infrastructure and
describes two applications that use Emmo for managing multimedia objects in a collaborative application environment.
Chapter 14 shows how even a limited description of a multimedia object can add
semantic value in the retrieval and presentation of multimedia. The chapter describes a
framework, DelaunayView, that supports distributed and heterogeneous multimedia sources
based on a semantically driven approach for the selection and presentation of multimedia content. The system architecture is composed of presentation, integration and data
layers, and its implementation is illustrated with a case study.

Section 5: Emergent Semantics


The next two chapters explore an emerging research area, emergent semantics, where multimedia semantics emerges and evolves dynamically, responding to unanticipated situations, context and user interaction.
Chapter 15 presents an overview of emergent semantics. Emergence is the phenomenon of complex structures arising from interactions between simple units. Emergent semantics is a symbiosis of several research areas and explores experiential computing as a way for users to interact with the system at a semantic level without having to
build a mental model of the environment.
Chapter 16 provides a practical foundation to this emerging research area. It
explores the computation of emergent semantics from integrative structures that blend
media into creative compositions in the context of other media and user interaction with
the media as they deal with the semantics embedded within the media. The chapter
presents a media blending framework that empowers the media producer to create complex new media assets by leveraging control over emergent semantics derived from
media blends. The blending framework for discovering emerging semantics uses ontologies that provide a shared description of the framework, operators to manage the
computation models and an integration mechanism to enable the user to discover emergent structures in the media.

CONCLUDING REMARKS
In spite of large research output in the area of multimedia content analysis and
management, current state-of-the-art technology offers very little by way of managing
semantics that is applicable for a range of applications and users. Semantics has to be
inherent in the technology rather than an external factor introduced as an afterthought.
Situated and contextual factors need to be taken into account in order to integrate
semantics into the technology. This leads to the notion of emergent semantics, which is user-centered, rather than to technology-driven methods that extract latent semantics. Automatic methods for semantic extraction tend to presuppose that semantics is static,
which is counterintuitive to the natural way semantics evolves. Other interactive technologies and developments in the area of semantic web also address this problem. In
future, we hope to see the convergence of different technologies and research disciplines in addressing the multimedia semantic problem from a user-centric perspective.


REFERENCES

Bimbo, A. D. (1999). Visual information retrieval. San Francisco: Morgan Kaufmann.
Dorai, C., & Venkatesh, S. (2002). Computational media aesthetics. Boston: Kluwer Academic Publishers.
Meersman, R., Tari, Z., & Stevens, S. (Eds.). (1999, January 4-8). Database semantics: Semantic issues in multimedia systems. IFIP TC2/WG2.6 Eighth Working Conference on Database Semantics (DS-8), Rotorua, New Zealand.


Acknowledgments

The editors would like to acknowledge the help of a number of people who contributed in various ways, without whose support this book could not have been published in its current form. Special thanks go to all the staff at Idea Group, who participated from inception of the initial idea to the final publication of the book. In particular,
we acknowledge the efforts of Michele Rossi, Jan Travers and Mehdi Khosrow-Pour
for their continuous support during the project.
No book of this nature is possible without the commitment of the authors. We
wish to offer our heart-felt thanks to all the authors for their excellent contributions to
this book, and for their patience as we went through the revisions. The completion of
this book would have been impossible without their dedication.
Most of the authors of chapters also served as referees for chapters written by
other authors, and they deserve a special note of thanks. We also would like to acknowledge the efforts of other external reviewers: Zahar Al Aghbhari, Saied Tahaghoghi,
A.V. Ratnaike, Timo Volkner, Mingfang Wu, Claudia Schremmer, Santha Sumanasekara,
Vincent Oria, Brigitte Kerherve, and Natalie Colineau.
Last, but not least, we would like to thank CSIRO (Commonwealth Scientific and Industrial Research Organisation) and the support from the Commercial group, in
particular Pamela Steele, in managing the commercial arrangements and letting us get
on with the technical content.
Finally we wish to thank our families for their love and support throughout the
project.
Uma Srinivasan and Surya Nepal
CSIRO ICT Centre, Sydney, Australia
September 2004

Section 1
Semantic Indexing and
Retrieval of Images


Chapter 1

Toward Semantically
Meaningful Feature
Spaces for Efficient
Indexing in Large
Image Databases
Anne H.H. Ngu, Texas State University, USA
Jialie Shen, The University of New South Wales, Australia
John Shepherd, The University of New South Wales, Australia

ABSTRACT

The optimized distance-based access methods currently available for multimedia databases are based on two major assumptions: a suitable distance function is known
a priori, and the dimensionality of image features is low. The standard approach to
building image databases is to represent images via vectors based on low-level visual
features and make retrieval based on these vectors. However, due to the large gap
between the semantic notions and low-level visual content, it is extremely difficult to
define a distance function that accurately captures the similarity of images as perceived
by humans. Furthermore, popular dimension reduction methods suffer from either the
inability to capture the nonlinear correlations among raw data or very expensive
training cost. To address the problems, in this chapter we introduce a new indexing
technique called Combining Multiple Visual Features (CMVF) that integrates multiple
visual features to get better query effectiveness. Our approach is able to produce low-dimensional image feature vectors that include not only low-level visual properties but
also high-level semantic properties. The hybrid architecture can produce feature
vectors that capture the salient properties of images yet are small enough to allow the
use of existing high-dimensional indexing methods to provide efficient and effective
retrieval.

INTRODUCTION
With advances in information technology, there is an ever-growing volume of
multimedia information from emerging application domains such as digital libraries,
the World Wide Web, and Geographical Information Systems (GIS) available online.
However, effective indexing and navigation of large image databases still remain one of the main challenges for modern computer systems. Currently, intelligent image retrieval
systems are mostly similarity-based. The idea of indexing an image database is to extract
the features (usually in the form of a vector) from each image in the database and then
to transform features into multidimensional points. Thus, searching for similarity
between objects can be treated as a search for close points in this feature space and the
distance between multidimensional points is frequently used as a measurement of
similarity between the two corresponding image objects.
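As a concrete (if naive) illustration of this idea, the sketch below, written in Python with NumPy purely for illustration, treats every image as a point in feature space and answers a query by brute-force nearest-neighbour search under Euclidean distance; the access methods discussed next exist precisely to avoid this linear scan.

```python
import numpy as np

def knn_search(query_vec, feature_matrix, k=5):
    """Return indices of the k database vectors closest to query_vec.

    feature_matrix has one row per image; distance in feature space is
    used as a proxy for similarity between the corresponding images.
    """
    dists = np.linalg.norm(feature_matrix - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical usage: 10,000 images, 97-dimensional feature vectors.
db = np.random.rand(10000, 97)
query = np.random.rand(97)
print(knn_search(query, db, k=5))
```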
To efficiently support this kind of retrieval, various kinds of novel access methods
such as Spatial Access Methods (SAMs) and metric trees have been proposed. Typical
examples of SAMs include the SS-tree (White & Jain, 1996), R+-tree (Sellis, 1987) and grid
files (Faloutsos, 1994); for metric trees, examples include the vp-tree (Chiueh, 1994), mvp-tree (Bozkaya & Ozsoyoglu, 1997), GNAT (Brin, 1995) and M-tree (Ciaccia, 1997). While
these methods are effective in some specialized image database applications, many open
problems in image indexing still remain.
Firstly, typical image feature vectors are high dimensional (e.g., some image feature
vectors can have up to 100 dimensions). Since the existing access methods have an
exponential time and space complexity as the number of dimensions increases, for
indexing high-dimensional vectors, they are no better than sequential scanning of the
database. This is the well-known dimensionality curse problem. For instance, methods
based on R-trees can be efficient if the fan-out of the R-tree nodes remain greater than
two and the number of dimensions is under five. The search time with linear quad trees
is proportional to the size of the hyper surface of the query region that grows with the
number of dimensions. With grid files, the search time depends on the directory whose
size also grows with the number of dimensions.
Secondly, there is a large semantic gap existing between low-level media representation and high-level concepts such as person, building, sky, landscape, and so forth.
In fact, while the extraction of visual content from digital images has a long history, it has
so far proved extremely difficult to determine how to use such features to effectively
represent high-level semantics. This is because similarity in low-level visual features may
not correspond to high-level semantic similarity. Moreover, human beings perceive and
identify images by integrating different kinds of visual features in a nonlinear way. This
implies that assuming each type of visual feature contributes equally to the recognition of the images is not supported by the human perceptual system, and an efficient content-based image retrieval system cannot be achieved by considering independent simple visual features.
In terms of developing indexing methods for effective similarity searching in large
image repositories, we are faced with the problem of producing a composite feature vector
that accurately mimics human visual perception. Although many research works have
claimed to support queries on composite features by combining different features into
an integrated index structure, very few of them explain how the integration is implemented. There are two main problems that need to be addressed here. The first one is that
the integrated features (or composite features) typically generate very high-dimensional
feature space, which cannot be handled efficiently by the existing access methods. The
other problem is the discovery of image similarity measures that reflect semantic
similarity at a high level.
There are two approaches to solving the indexing problem. The first approach is to
develop a new spatial index method that can handle data of any dimension and employ
a k-nearest neighborhood (k-NN) search. The second approach is to map the raw feature
space into a reduced space so that an existing access method can be applied. Creating
a generalized high-dimensional index that can handle hundreds of dimensions is still an
unsolved problem. The second approach is clearly more practical. In this chapter, we
focus on how to generate a small but semantically meaningful feature vector so that
effective indexing structures can be constructed.
The second problem is how to use low-level media properties to represent high-level
semantic similarity. In the human perceptual process, the various visual contents in an
image are not weighted equally for image identification. In other words, the human visual
system has different responses to color, texture and shape information in an image. When
the feature vectors extracted from an image represent these visual features, the similarity
measure for each feature type between the query image and an image in the database is
typically computed by a Euclidean distance function. The similarity measure between the
two images is then expressed as a linear combination of the similarity measures of all the
feature types. The question that remains here is whether a linear combination of the
similarity measures of all the feature types best reflects how we perceive images as similar.
So far, no experiments have been conducted to verify this belief.
The main contribution of this work is in building a novel dimension reduction
scheme, called CMVF (Combining Multiple Visual Features), for effective indexing in
large image database. The scheme is designed based on the observation that humans use
multiple kinds of visual features to identify and classify images via a robust and efficient
learning process. The objective of the CMVF scheme is to mimic this process in such a
way as to produce relatively small feature vectors that incorporate multiple features and
that can be used to effectively discriminate between images, thus providing both efficient
(small vectors) and effective (good discrimination) retrieval. The core of the work is to
use a hybrid method that incorporates PCA and neural network technology to reduce the
size of composite image features (nonlinear in nature) so that they can be used with an
existing distance-based index structure without any performance penalty. On the other
hand, improved retrieval effectiveness can, in principle, be achieved by compressing
more discriminating information (i.e., integrating more visual features) into the final
vector. Thus, in this chapter, we also investigate precisely how much improvement in retrieval effectiveness is obtained as more visual features are incorporated. Furthermore,
humans are capable of correctly identifying and classifying images, even in the presence
of moderate amounts of distortion. Since CMVF is being trained to classify images, this
suggests that if we were to train it using not only the original image, but also distorted
versions of that image, it might be more robust in recognizing minor variations of the
image in the future. Another aspect of robustness in CMVF is how much it is affected
by the initial configuration of the neural network. In this chapter, the robustness of CMVF
in these two contexts is also investigated.

BACKGROUND
Image Feature Dimension Reduction
Trying to implement computer systems that mimic how the human visual system
processes images is a very difficult task, because humans:

• use different features to identify and classify images in different contexts, and
• do not give equal weight to various features even within a single context.

This observation suggests that an effective content-based image retrieval system cannot be achieved by considering only a single type of feature and cannot be achieved
by considering only visual content, without taking account of human perception. The
first of these suggests multiple image features are required; the second suggests that
semantic features, based on manual classification of images, are also required.
However, creating an index based on a composite feature vector will typically result in
a very high-dimensional feature space, rendering all existing indexing methods useless.
At the same time, a simple linear combination of different feature types cannot precisely
reflect how human beings perceive images as similar. The natural and practical solution
to these problems lies in discovering a dimension reduction technique, which can fuse
multiple visual content features into a composite feature vector that is low in dimensions
and yet preserves all human-relevant information for image retrieval.
There has been considerable research work on dimension reduction for image
feature vectors. This work can be classified into two general categories: linear dimension
reduction (LDR) and nonlinear dimension reduction (NLDR). The typical examples for
LDR include SVD and PCA (Fukunaga & Koontz, 1970; Kittler & Young, 1973). These
approaches assume that the variance of the data can be accounted for by a small number of
eigenvalues. Thus, LDR works well only for data that exhibits some linear correlation.
However, if the data exhibits some nonlinear correlation, the dimension reduction via
LDR causes significant loss in distance information, which results in less effective query
processing. Due to the complexity of image features, better query effectiveness can be
achieved by using nonlinear dimension reduction. The basis of NLDR is the standard
nonlinear regression analysis as used in the neural network approach, which has been
widely studied in recent years. Systems based on NLDR can maintain a great deal of
knowledge about distance information in the original data source. The information can be represented as neural network weights between units in successive layers. NLDR
typically performs better than LDR in handling feature vectors for image data. The only
drawback of NLDR is that it requires a training process, which can be time consuming.

Image Similarity Measurement


A major task in content-based retrieval is to find the most similar images from a
multimedia database with respect to a query object (image). Various kinds of features can
be used for specifying query objects including descriptive concepts (keywords) and
numerical specification (color, texture and shape). The feature vectors (mainly numerical)
for the given query object are usually derived using basic image processing techniques
such as segmentation and feature extraction. Calculating the similarity between a query
object and an object in the multimedia database is reduced to computing the distance
between two feature vectors.
However, current research has been focused on finding a similarity function that
corresponds only to a single feature (e.g., color information only). That is, only simple
queries, such as how similar two images are in terms of color, are well supported. A typical
example is the work carried out by Bozkaya and Ozsoyoglu (1997). In their work, the similarity measure of a pair of images based on composite feature vectors described by color and texture was proposed as a linear combination of the similarity measures of the individual single-feature vectors. Their proposal can be detailed as follows: Let $\{x_c, x_t\}$ and $\{y_c, y_t\}$ be the color and texture feature vectors that fully describe two images X and Y; then the similarity measure of images X and Y, denoted as $S(X, Y)$, is given by

$$S = w_c S_c + w_t S_t \qquad (1)$$

where $S_c$ and $S_t$ are the color and texture similarity functions respectively, and $w_c$ and $w_t$ are weighting factors. However, the criteria for selecting these weighting factors are
not mentioned in their research work. From the statistics viewpoint, by treating the
weighting factors as normalization factors, the definition is just a natural extension of the
Euclidean distance function to a high-dimensional space in which the coordinate axes
are not commensurable.
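To make Equation 1 concrete, the following hedged sketch (Python/NumPy; the weights shown are arbitrary placeholders, since the original work gives no criteria for choosing them) turns per-feature Euclidean distances into similarities and combines them linearly.

```python
import numpy as np

def feature_similarity(a, b):
    # Convert a Euclidean distance into a similarity score in (0, 1].
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def combined_similarity(x_color, x_texture, y_color, y_texture,
                        w_color=0.5, w_texture=0.5):
    """Linear combination of single-feature similarities (Equation 1)."""
    s_c = feature_similarity(x_color, y_color)
    s_t = feature_similarity(x_texture, y_texture)
    return w_color * s_c + w_texture * s_t
```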
The question that remains to be answered is whether a Euclidean distance function
for similarity measures best correlates with the human perceptual process for image
recognition. That is, when humans perceive two images as similar, can a distance function
given in the form in Equation 1 be defined? Does this same function hold for another pair
of images that are also perceived as similar? So far, no experiments have been conducted
that demonstrate (or counter-demonstrate) whether linear combinations of different
image features are valid similarity measures based on human visual perception. Also, the
importance of designing a distance function that mimics human perception to approximate a perceptual weight of various visual features has not been attempted before. Thus,
incorporating human visual perception into image similarity measurement is the other
major motivation behind our work.


Distance-Based Access Methods


To efficiently support query processing in multidimensional feature space, several
spatial access methods (SAMs) have been proposed. These methods can be broadly
classified into the following types: point access methods and rectangle access methods.
The point quad-tree, which was first proposed in Finkel (1974), is an example of a point
access method. To handle complex objects, such as circles, polygons and any undefined
irregularly shaped objects, minimum bounding rectangles (MBRs) have been used to
approximate the representations of these objects. Thus, the name rectangle access
method. The K-D-B tree (Robinson, 1981) and R+-tree (Sellis, 1987) are typical examples.
However, the applicability of SAMs is limited by two assumptions: (1) for indexing
purposes, objects are represented by means of feature values in a multidimensional
space, and (2) a metric must be used as measure of distance between objects. Furthermore, SAMs have been designed by assuming that distance calculation has negligible
CPU (Central Processing Unit) cost, especially relative to the cost of disk I/O (Input/Output). However, this is not always the case in multimedia applications (Ciaccia &
Patella, 1998). Thus, a more general approach to the similarity indexing problem has
gained some popularity in recent years, leading to the development of so-called metric
trees, which use a distance metric to build up the indexing structure. For metric trees,
objects in a multidimensional space are indexed by their relative distances rather than
their absolute positions. A vantage point is used to compute the distance between two
different points and the search space is divided into two by the median value of this
distance. Several metric trees have been developed so far, including the vp-tree (Chiueh,
1994), the GNAT (Brin, 1995), the mvp-tree (Bozkaya & Ozsoyoglu, 1997) and M-tree
(Ciaccia, 1997).
In this study, our goal is not to develop a new indexing structure for high-dimensional image features but to use an existing one effectively. We choose the well-established M-tree access method as the underlying method for indexing our reduced composite image
visual features. The M-tree is a balanced, paged metric tree that is implemented based on
the GiST (Generalized Search Tree) (Hellerstein, 1995) framework. Since the design of
M-tree is inspired by both principles of metric trees and database access methods, it is
optimized with respect to both CPU (distance computations) and I/O costs.
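As a rough illustration of the metric-tree idea, and not of the M-tree itself, the sketch below builds a simple vantage-point style tree: distances to a vantage point are computed, the median distance splits the objects into an inside and an outside subtree, and the triangle inequality prunes subtrees during a range query. Class and function names are our own.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

class VPNode:
    def __init__(self, points, dist=euclidean):
        self.dist = dist
        self.vantage = points[0]              # simple choice of vantage point
        rest = points[1:]
        if not rest:
            self.radius = self.inside = self.outside = None
            return
        d = [dist(self.vantage, p) for p in rest]
        self.radius = float(np.median(d))     # median distance splits the space
        inside = [p for p, dp in zip(rest, d) if dp <= self.radius]
        outside = [p for p, dp in zip(rest, d) if dp > self.radius]
        self.inside = VPNode(inside, dist) if inside else None
        self.outside = VPNode(outside, dist) if outside else None

    def range_query(self, q, r, out):
        """Collect all indexed points within distance r of query q."""
        dq = self.dist(self.vantage, q)
        if dq <= r:
            out.append(self.vantage)
        if self.radius is None:
            return out
        # Triangle inequality prunes subtrees that cannot contain answers.
        if self.inside and dq - r <= self.radius:
            self.inside.range_query(q, r, out)
        if self.outside and dq + r >= self.radius:
            self.outside.range_query(q, r, out)
        return out
```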

HYBRID DIMENSION REDUCER


In this section, we present a novel approach to indexing large image databases that
uses both low-level visual features and human visual perception. The scheme utilizes a
two-layer hybrid structure that combines the advantages of LDR and NLDR into a single
architecture. Before exploring the detailed structure, we give a brief overview of what kind
of visual content our system considers.

Composite Image Features


In our work so far, we have considered three different visual features: color, texture
and shape. Note that the CMVF is not limited to these three features and it can be further
expanded to include spatial features for more effective indexing.


Color Features
It is known that the human eye responds well to color. In this work, the color feature
is extracted using the histogram technique (Swain & Ballard, 1991). Given a discrete color
space defined by some color axes, the color histogram is obtained by discretizing the
image colors and counting the number of times each discrete color occurs in the image.
In our experiments, the color space we apply is CIE L*u*v. The reason that we select CIE
L*u*v instead of normal RGB or other color space is that it is more perceptually uniform.
The three axes of L*u*v space are divided into four sections respectively, so we get a
total of 64 (4x4x4) bins for the color histogram. However, for the image collection that we
use, there are bins that never receive any count. In our experiments, the color features
are represented as 37-dimensional vectors after eliminating the bins that have zero count.
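A minimal sketch of this colour-histogram extraction is given below, assuming OpenCV and NumPy; the CIE L*u*v conversion and 4x4x4 binning follow the description above, while the reduction from 64 to the 37 non-empty bins would be done once over the whole collection using a fixed bin mask.

```python
import cv2
import numpy as np

def color_histogram(image_bgr, bins_per_axis=4):
    """4x4x4 CIE L*u*v colour histogram (64 bins), L1-normalised."""
    luv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2Luv)
    pixels = luv.reshape(-1, 3).astype(np.float32)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_axis,) * 3,
                             range=((0, 256), (0, 256), (0, 256)))
    hist = hist.flatten()
    return hist / hist.sum()

# Keeping only the bins that are non-empty across the whole collection
# (37 of the 64 in the chapter's data set) would be done afterwards,
# applying the same bin mask to every image.
```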

Texture Features

Texture characterizes objects by providing measures of properties such as smoothness, coarseness and regularity. In this work, the texture feature is extracted using a filter-based method. This method uses amplitude spectra of images. It detects the global
periodicity in the images by identifying high-energy, narrow peaks in the spectrum. The
advantage of filter-based methods is their consistent interpretation of feature data over
both natural and artificial images.
The Gabor filter (Turner, 1986) is a frequently used filter in texture extraction. It
measures a set of selected orientations and spatial frequencies. Six frequencies are
required to cover the range of frequencies from 0 to 60 cycles/degree. We choose 1, 2,
4, 8, 16 and 32 cycles/degree to cover the whole range of human visual perception.
Therefore, the total number of filters needed for our Gabor filter is 30, and texture features
are represented as 30-dimensional vectors.
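The sketch below illustrates one way such a Gabor filter bank could be assembled with OpenCV; the five orientations and the specific kernel parameters are our assumptions (only the six frequencies and the total of 30 filters are stated in the text), so the values are illustrative rather than the authors' settings.

```python
import cv2
import numpy as np

def gabor_texture_features(gray, n_orientations=5,
                           wavelengths=(32, 16, 8, 4, 2, 1)):
    """30-dimensional texture vector from a bank of Gabor filters.

    Six scales combined with an assumed five orientations give the
    30 filters mentioned in the text; each feature is the mean
    magnitude of one filtered image.
    """
    feats = []
    for lam in wavelengths:                      # one wavelength per scale
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations   # evenly spaced orientations
            kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=0.5 * lam,
                                        theta=theta, lambd=lam,
                                        gamma=0.5, psi=0)
            response = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
            feats.append(np.abs(response).mean())
    return np.array(feats)                       # shape (30,)
```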

Shape Features
Shape is an important and powerful attribute for image retrieval. It can represent
spatial information that is not presented in color and texture histograms. In our system,
the shape information of an image is described based on its edges. A histogram of the
edge directions is used to represent global information of shape attribute for each image.
We used the Canny edge operator (Canny, 1986) to generate edge histograms for images
in the preprocessing stage. To solve the scale invariance problem, the histograms are
normalized to the number of edge points in each image. In addition, smoothing procedures presented in Jain and Vailaya (1996) are used to make the histograms invariant to
rotation. The histogram of edge directions is represented by 30 bins. Shape features are
thus represented as 30-dimensional vectors.
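A hedged sketch of this edge-direction histogram is shown below (OpenCV/NumPy); the Canny thresholds are illustrative, and the rotation-invariance smoothing of Jain and Vailaya (1996) is only indicated by a comment.

```python
import cv2
import numpy as np

def edge_direction_histogram(gray, n_bins=30):
    """30-bin histogram of edge directions, normalised by edge count."""
    edges = cv2.Canny(gray, 100, 200)              # thresholds are illustrative
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    angles = np.arctan2(gy, gx)[edges > 0]         # directions at edge pixels
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    n_edge_points = max(len(angles), 1)
    # A circular smoothing of this histogram (as in Jain & Vailaya, 1996)
    # would be applied here to soften sensitivity to rotation.
    return hist.astype(np.float32) / n_edge_points  # normalise for scale invariance
```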
When forming composite feature vectors from the three types of features described
above, the most common approach is to use the direct sum operation. Let $x_c$, $x_t$ and $x_s$ be the color, texture and shape feature vectors; the direct sum operation, denoted by the symbol $\oplus$, of these three feature vectors is defined as follows:

$$x = x_c \oplus x_t \oplus x_s \qquad (2)$$


Figure 1. A hybrid image feature dimension reduction scheme. The linear PCA appears at the bottom, the nonlinear neural network at the top, and the lower-dimensional vector representation appears in the hidden layer.

The number of dimensions of the composite feature vector $x$ is then the sum of those of the single feature vectors, that is, $\dim(x) = \dim(x_c) + \dim(x_t) + \dim(x_s)$.
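In code, the direct sum of Equation 2 amounts to vector concatenation; the sketch below assumes the three single-feature extractors described above.

```python
import numpy as np

def composite_feature(x_color, x_texture, x_shape):
    """Direct sum (concatenation) of the three single-feature vectors."""
    return np.concatenate([x_color, x_texture, x_shape])

# With 37 colour, 30 texture and 30 shape dimensions, the result is
# a 97-dimensional raw composite feature vector.
```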

Architecture of Hybrid Image Feature Dimension Reducer

Figure 1 shows the overall architecture of our hybrid method, which is basically a
two-tier hybrid architecture: dimension reduction via PCA followed by a three-layer
neural network with a quickprop learning algorithm. Visual content for color, texture and shape is first extracted from each image. The raw feature vector in our system has 97 dimensions (37 for color, 30 for texture and 30 for shape). PCA is useful as an initial dimension reducer, while
further dimension reduction for nonlinear correlations can be handled by NLDR.

PCA for Dimension Reduction

Mathematically, the PCA method can be described as follows. Given a set of N feature vectors $\{x_k = (x_{k1}, x_{k2}, \ldots, x_{kn}) \in R^n \mid k = 1 \ldots N\}$ and the mean vector $\bar{x}$, the covariance matrix $S$ can be calculated as

$$S = \frac{1}{N} \sum_{k=1}^{N} (x_k - \bar{x})(x_k - \bar{x})^T$$

Let $v_i$ and $\lambda_i$ be a pair of eigenvector and eigenvalue of the covariance matrix $S$. Then $v_i$ and $\lambda_i$ satisfy the following:

$$\lambda_i = \sum_{k=1}^{N} \left( v_i^T (x_k - \bar{x}) \right)^2$$

Since trace$(S) = \sum_{i=1}^{n} \lambda_i$ accounts for the total variance of the original set of feature vectors, and since the $\lambda_i$ can be arranged in decreasing order, that is, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$, if the m (where m < n) largest eigenvalues account for a large percentage of the variance then, with an $n \times m$ linear transformation matrix $T$ defined as

$$T = [v_1, v_2, \ldots, v_m], \qquad (3)$$

the $m \times n$ transformation $T^T$ transforms the original n-dimensional feature vectors to m-dimensional ones. That is,

$$T^T (x_k - \bar{x}) = y_k, \qquad k = 1 \ldots N \qquad (4)$$

where $y_k \in R^m$ for all $k$. The matrix $T$ above has orthonormal columns because $\{v_i \mid i = 1 \ldots n\}$ form an orthonormal basis.
The key idea in dimension reduction via PCA is in the computation of the eigenvalues and eigenvectors of $S$, the user-determined value m, and finally the $m \times n$ orthogonal matrix $T^T$, which is the required linear transformation. The feature vectors in the original n-dimensional space can be projected onto an m-dimensional subspace via the transformation $T^T$. The value of m is normally determined by the percentage of variance that the system can afford to lose. The i-th component of the $y_k$ vector in (4) is called the i-th principal component (PC) of the original feature vector $x_k$. Alternatively, one may consider just the i-th column of the $T$ matrix defined in (3), and the i-th principal component of $x_k$ is simply

$$y_{ki} = v_i^T (x_k - \bar{x})$$

where $v_i$ is the i-th eigenvector of $S$.

PCA has been employed to reduce the dimensions of single feature vectors so that
an efficient index can be constructed for image retrieval in an image database (Euripides
& Faloutsos, 1997; Lee, 1993). It has also been applied to image coding, for example, for
removing correlation from highly correlated data such as face images (Sirovich & Kirby,
1987). In this work, PCA is used as the first step in the NLDR method where it provides
optimal reduced-dimensional feature vectors for the three-layer neural network, and thus speeds up the NLDR training time.
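A compact sketch of this linear first stage is given below (NumPy). The value of m is chosen as the smallest number of leading principal components that retains a given fraction of the total variance, mirroring the "percentage of variance the system can afford to lose" criterion; the threshold used here is an arbitrary example.

```python
import numpy as np

def pca_reduce(X, variance_to_keep=0.95):
    """Project rows of X onto the m leading principal components.

    X has one raw feature vector per row (e.g., 97-dimensional); the
    returned matrix holds the m-dimensional vectors passed on to the
    neural network stage.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    S = np.cov(Xc, rowvar=False)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(ratios, variance_to_keep) + 1)
    T = eigvecs[:, :m]                            # n x m transformation matrix
    return Xc @ T, mean, T                        # y_k = T^T (x_k - mean)
```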


Classification Based on Human Visual Perception


Gestalt psychologists (Behrens, 1984) have observed that the human visual system
deals with images by organizing parts into wholes using perceptual grouping, rather than
by perceiving individual image components and then assembling them. A consequence
of this is that our mind perceives whole objects even when we are looking at only a part
or some component of that object. The principles of perceptual organization proposed
by Gestaltists include closure, continuity, proximity and similarity (Lowe, 1985), which
have been applied successfully in feature detection and scene understanding in machine
vision. With these principles, our perceptual system integrates low-level features into
high-level structures. Then, these high-level structures will be further combined until
a semantically meaningful representation is achieved.
Another fundamental and powerful Gestalt principle of visual perceptual organization is identification of objects from the surroundings. In the real world, when we are
presented with an image, we tend to see things. Even when there may be little contrast
between the objects and the background, our perceptual system does not seem to have
any major difficulty in determining which is figure and which is background (Lerner et
al., 1986). For example, a ship stands out against the background of sea and sky, a camel
and a man stand out against a background of desert sand, or a group of people is easily
distinguishable from a forest background. Furthermore, we would distinguish an image
of a camel against a background of desert sand as more similar to an image of a camel and
a man against the same background than to an image of a camel against a sandy beach.
In general, we incorporate all the information in color, texture, shape and other visual or spatial features, under the context presented to us, and classify the image into the appropriate category.
In conducting our experiments on image classification based on human perception, we first prepared a set of 163 images, called the test-images, from our 10,000-image collection. This set covers all 14 categories of images in the collection. Within this set, the images in each category are similar to each other in color, texture and shape. We set up a simple image classification experiment on the Web and asked seven people (subjects), all from different backgrounds, to perform the experiments. At the beginning of each experiment, a query image was
arbitrarily chosen from the test-images and presented to the subjects. The subjects were
then asked to pick the top 20 images from the test-images that were similar in color, texture and shape to the query image. Any image that was selected by more than three subjects was assigned to the same class as the query image and was then deleted from the test-images. The experiment was repeated until every image in the test-images had been categorized into an appropriate class. The end result of the experiments is that images that are similar to each other in color, texture and shape are put into the same class based on human visual perception. These classification results are used in the NLDR process described below.
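The assignment rule used in this experiment can be written down directly; the following toy sketch assumes hypothetical per-subject selection lists and the more-than-three-subjects threshold stated above:

from collections import Counter

def images_joining_query_class(subject_selections, threshold=3):
    """Images selected by more than `threshold` of the subjects for one query image."""
    votes = Counter()
    for selection in subject_selections:
        votes.update(set(selection))
    return {img for img, count in votes.items() if count > threshold}

# hypothetical selections made by the seven subjects
selections = [["img01", "img07"], ["img01", "img03"], ["img01", "img07"],
              ["img01"], ["img07", "img01"], ["img02"], ["img03", "img01"]]
print(images_joining_query_class(selections))  # {'img01'}: picked by six of seven subjects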

Neural Network for Dimension Reduction


The advantage of using a neural network for NLDR is that the neural network can
be trained to produce an effective solution. In the CMVF framework, a three-layer
perceptron neural network with the quickprop learning algorithm (Fahlman, 1988) is used to perform dimension reduction on composite image features. The network in fact acts as a nonlinear dimensionality reducer (Figure 2).

Figure 2. A three-layer multilayer perceptron layout

In Wu (1997), a special neural network called learning based on experiences and perspectives (LEP) has been used to create categories of images in the domains of human faces and trademarks; however, no details are given in that work on how the training samples were created.
For our system, the training samples are tuples of the form (v, c) where v is a feature
vector, which can be either a single-feature vector or a composite feature vector, and c
is the class number to which the image represented by v belongs. We note that the class
number for each feature vector is determined by the experiments mentioned in the
previous subsection. Figure 2 depicts the three-layer neural network that we used. The
units in the input layer accept the feature vector v of each training pattern; the number
of units in this layer therefore corresponds to the number of dimensions of v. The hidden
layer is configured to have fewer units. The number of units in the output layer corresponds
to the total number of image classes M. Given that (v, c) is a training pattern, the input
layer will accept vector v while the output layer will contain (0,...,0,1,0,...,0)T, which is a
vector of dimension M that has a 1 for the c-th component and 0s everywhere else.
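As a small illustration of this target encoding (a sketch; the class numbering is assumed to be 1-based):

import numpy as np

def one_hot_target(c, M):
    """Target vector (0, ..., 0, 1, 0, ..., 0)^T with a 1 at the c-th position (1-based)."""
    t = np.zeros(M)
    t[c - 1] = 1.0
    return t

print(one_hot_target(3, 14))  # 14 image classes, class number c = 3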
Each unit i in the neural network is a simple processing unit that calculates its activation $s_i$ based on its predecessor units $p_i$; the overall incoming activation of unit i is given as

$$net_i = \sum_{j \in p_i} s_j w_{ij} + \theta_i \qquad (5)$$

where j is a predecessor unit of i, $w_{ij}$ is the interconnection weight from unit j to unit i, and $\theta_i$ is the bias value of unit i. Passing the value $net_i$ through a nonlinear activation function gives the activation value $s_i$ of unit i. The sigmoid logistic function

$$s_i = \frac{1}{1 + e^{-net_i}} \qquad (6)$$

is used as the activation function. Supervised learning is appropriate in our neural


network system because we have a well-defined set of training patterns. The learning

process governed by the training patterns will adjust the weights in the network so that
a desired mapping of input to output activation can be obtained. Given that we have a
set of feature vectors and their appropriate class number classified by the subjects, the
goal of the supervised learning is to seek the global minimum of cost function E

$$E = \frac{1}{2}\sum_{p}\sum_{j}\left(t_{pj} - o_{pj}\right)^2 \qquad (7)$$

where $t_{pj}$ and $o_{pj}$ are, respectively, the target output and the actual output for feature vector p at node j. The rule for updating the weights of the network can be defined as
follows:

$$\Delta w_{ij}(t) = \eta\, d(t) \qquad (8)$$

$$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t) \qquad (9)$$

where $\eta$ is the parameter that controls the learning rate, and $d(t)$ is the direction along which the weights need to be adjusted in order to minimize the cost function E. There are many learning algorithms for performing weight updates. The quickprop algorithm is one of the most frequently used adaptive learning paradigms. The weight update can be obtained by the equation

$$\Delta w_{ij}(t) = \frac{\dfrac{\partial E}{\partial w_{ij}}(t)}{\dfrac{\partial E}{\partial w_{ij}}(t-1) - \dfrac{\partial E}{\partial w_{ij}}(t)}\; \Delta w_{ij}(t-1) \qquad (10)$$
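To make the learning rule concrete, the sketch below (Python/NumPy) implements a sigmoid forward pass in the spirit of Equations 5 and 6 and an elementwise quickprop-style update following Equation 10. The layer sizes in the toy usage and the zero-denominator fallback are illustrative assumptions rather than details taken from the chapter; the learning rate 0.9 and maximum step size 1.75 follow the values quoted in the text.

import numpy as np

def sigmoid(net):
    # Equation 6: s = 1 / (1 + exp(-net))
    return 1.0 / (1.0 + np.exp(-net))

def forward(v, W_ih, b_h, W_ho, b_o):
    """Forward pass of the three-layer perceptron (Equations 5 and 6)."""
    hidden = sigmoid(W_ih @ v + b_h)       # hidden-unit activations
    output = sigmoid(W_ho @ hidden + b_o)  # one output per image class
    return hidden, output

def quickprop_step(grad, prev_grad, prev_delta, lr=0.9, mu=1.75):
    """Elementwise quickprop-style update in the spirit of Equation 10.

    grad, prev_grad: dE/dw at steps t and t-1; prev_delta: previous weight change.
    Where the quickprop ratio is unusable, a plain gradient step -lr*grad is taken.
    """
    denom = prev_grad - grad
    usable = np.abs(denom) > 1e-12
    qp = grad / np.where(usable, denom, 1.0) * prev_delta
    qp = np.clip(qp, -mu * np.abs(prev_delta), mu * np.abs(prev_delta))
    return np.where(usable, qp, -lr * grad)

# toy usage: 40 PCA inputs -> 10 hidden units -> 14 classes
rng = np.random.default_rng(0)
W_ih, b_h = 0.1 * rng.normal(size=(10, 40)), np.zeros(10)
W_ho, b_o = 0.1 * rng.normal(size=(14, 10)), np.zeros(14)
hidden, output = forward(rng.normal(size=40), W_ih, b_h, W_ho, b_o)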

The training procedure of the network consists of repeated presentations of the


inputs (the feature vectors v in the training tuples) and the desired output (the class
number c for v) to the network. The weights of the network are initially set to random small
continuous values. Our network adopts the learning by epoch approach. This means
that the updates of weights only happen after all the training samples have been
presented to the network. In the quickprop-learning algorithm, there are two important
parameters: the learning rate for the gradient descent and the maximum step size. These two parameters govern the convergence of network learning. In general, the learning rate for gradient descent can vary from 0.1 to 0.9. In our system, the learning rate is kept constant during network training, and the maximum step size is set to 1.75. In every
iteration of the training, the error generated will be in the direction of the minimum error
function. This is due to the fact that the training starts in the direction of the eigenvectors
associated with the largest eigenvalue for each feature. Thus, the network has less
chance of being trapped in a local minimum.
The total gradient error or the total number of error bits indicates the condition of
network convergence. When this value does not change during network training, the

network is said to have converged. The total error is the sum of the differences between the actual outputs and the desired outputs. Since the network also functions as a pattern classifier, convergence can equivalently be measured by the total number of error bits, which is determined by the difference between the actual and the desired output.
During the network training process, the network weights gradually converge and
the required mapping from image feature vectors to the corresponding classes is
implicitly stored in the network. After the network has been successfully trained, the
weights that connect the input and hidden layers are the entries of a transformation that maps the feature vectors v to lower-dimensional vectors. When a high-dimensional feature vector is passed through the network, its activation values in the hidden units form a lower-dimensional vector that keeps the most important discriminative information of the original feature vector.
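In other words, once training is complete, only the input-to-hidden half of the network is needed to index an image. A minimal sketch, assuming the trained weights W_ih and biases b_h are available as arrays (names are illustrative):

import numpy as np

def reduce_with_network(v, W_ih, b_h):
    """Hidden-unit activations of the trained network form the reduced feature vector."""
    return 1.0 / (1.0 + np.exp(-(W_ih @ v + b_h)))   # sigmoid of Equation 6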

The Hybrid Training Algorithm


The complete training algorithm for this hybrid dimension reduction is given as
follows:
Step 1: For each type of feature vector, compute the covariance matrix over all N images.
Step 2: Apply the eigen-decomposition to each of the computed covariance matrices from Step 1. This process yields a list of eigenvectors and eigenvalues, which are normally sorted in decreasing order of eigenvalue.
Step 3: Compute the total variance $s = \sum_{i=1}^{n}\lambda_i$ and select the m largest eigenvalues whose sum just exceeds τ% of s, where τ is a predefined cut-off value. This step selects the m largest eigenvalues that account for τ% of the total variance of the feature vectors.
Step 4: Construct the matrix T using the m corresponding eigenvectors as given in Equation 3.
Step 5: Obtain the new representation yk for each image feature vector xk by applying the
PCA transformation given in Equation 4.
Step 6: Select the training samples from the image collection. Group these training
samples into different classes as determined by the experiments described in
Section 3.2.2.
Step 7: Construct the composite feature vectors zk from the color, texture and shape
feature vectors using the direct sum operation defined in Equation 2.
Step 8: Prepare the training patterns (zk, ck) for all k, where ck is the class number to which the composite feature vector zk belongs.
Step 9: Set all the weights and node offsets of the network to small random values.
Step 10: Present the training patterns z k as input and ck as output to the network. The
training patterns can be different on each trial; alternatively, the training patterns
can be presented cyclically until the weights in the network stabilize.
Step 11: Use the quickprop-learning algorithm to update the weights of the network.
Step 12: Test the convergence of the network. If the condition of convergence of the
network is satisfied, then stop the network training process. Otherwise, go back
to Step 10 and repeat the process. If the network does not converge, it needs a new
starting point. Thus, it is necessary to go back to Step 9 instead of Step 10.
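For experimentation, the overall pipeline of Steps 1-12 can be approximated with off-the-shelf components. The sketch below uses scikit-learn's PCA and a logistic-activation MLP with 10 hidden units as stand-ins; it is not the authors' quickprop implementation, and the feature values and class labels are random placeholders.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# placeholder raw features for 163 training images (three feature types) and their classes
color, texture, shape = rng.random((163, 64)), rng.random((163, 48)), rng.random((163, 36))
labels = rng.integers(1, 15, size=163)

# Steps 1-5: PCA per feature type, retaining at least 99% of the variance
reduced = [PCA(n_components=0.99, svd_solver="full").fit_transform(f)
           for f in (color, texture, shape)]
# Step 7: composite feature vector by direct sum (concatenation)
z = np.hstack(reduced)
# Steps 8-12: train a three-layer network with 10 hidden units on (zk, ck)
net = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic", max_iter=2000)
net.fit(z, labels)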


Steps 1~5 cover the PCA dimension reduction procedure, which was applied to all
images in the data rather than only to the training samples. This has the advantage that
the covariance matrix for each type of single feature vector contains the global variance
of images in the database. The number of principal components to be used is determined
by the cut-off value τ. There is no formal method for choosing this cut-off value; in Step 3, τ is set to 99, so that the minimum variance retained after PCA dimension reduction is at least 99%.
After the completion of PCA, the images are classified into classes in Step 6. Steps
7~12 then prepare the necessary input and output values for the network training
process. The network training corresponds to Steps 8~11. As noted above, the weight
of each link is initialized to a random small continuous value. In the quickprop-learning
algorithm, the parameter that limits the step-size is set to 1.75, and the learning rate
for the gradient descent can vary from 0.1 to 0.9. Each time we apply the quickprop-learning algorithm, the weight of each link in the network is updated. After a specified
number of applications of the quickprop-learning algorithm, the convergence of the
network is tested in Step 12. At this point, it is decided whether the network has
converged or a new starting weight is required for each link of the network. In the latter
case, the process involved in Steps 9~12 is repeated.

EXPERIMENTS AND DISCUSSIONS


In this section, we present experimental results that demonstrate the effectiveness of the feature vectors generated by CMVF, by comparing it to systems that generate reduced feature vectors based solely on PCA and on a pure neural network without initial PCA. To further illustrate the advantage of CMVF, its robustness against various kinds of image distortion and against the initial setup of the neural network is also presented.

The CMVF

The CMVF framework has been designed and fully implemented in the C++ and Java programming languages, and an online demonstration with a CGI-based Web interface is available for users to evaluate the system (Shen et al., 2003). Figure 3 presents the various components of this system. A user can submit an image, either from the existing image database or from another source, as a query. The system will search for the images that are most similar in visual content; the matching images are displayed in similarity order, starting from the most similar, and users can score the results. The query can be executed with any of the following retrieval methods: PCA only, neural network only, and CMVF with different visual feature combinations. Users can also choose a distorted version of the selected image as the query example to demonstrate CMVF's robustness against image variability.

Figure 3. Overall architecture of a content-based image retrieval system based on CMVF

Test Image Collection

To conduct the experiment, we constructed a collection of 10,000 images. These images were retrieved from different public domain sources, and can be classified under a number of high-level semantic categories covering natural scenery, architectural buildings, plants, animals, rocks, flags, and so forth. All images were scaled to the same size (128 × 128 pixels).
A subset of this collection was then selected to form the training samples (test-images). There were three steps involved in forming the training samples. Firstly, we
decided on the number of classes according to the themes of the image collection and
selected one image for each class from the collection of 10,000 images. This can be done
with the help of a domain expert. Next, we built three M-tree image databases for the
collection. The first one used color as the index, the second used texture as the index and
the third one used shape as the index. For each image in each class, we retrieved the most
similar images in color using the color index to form a color collection. We then repeated
the same procedure to get images similar in texture and in shape for each image in each
class to form a texture collection and a shape collection. Finally, we got our training
samples1 that are similar in color, in texture and in shape by taking the intersection of
images from the color, texture and shape collections. The training samples (test-images)
were presented to the subjects for classification. To test the effectiveness of additional
feature integration in image classification and retrieval, we use the same procedure as
mentioned in the previous section for generating test-images with additional visual
feature.

Evaluation Metrics
In our experiment, since not all relevant images are examined, some common
measurements such as standard Recall and Precision are inappropriate. Thus, we select
the concepts of normalized precision (Pn) and normalized recall (Rn) (Salton & McGill, 1993) as metrics for evaluation. High Precision means that we have few false alarms (i.e.,
few irrelevant images are returned) while high Recall means we have few false dismissals
(i.e., few relevant images are missed). The formulas for these two measures are

$$R_n = 1 - \frac{\sum_{i=1}^{R}(rank_i - i)}{R\,(N - R)}$$

$$P_n = 1 - \frac{\sum_{i=1}^{R}(\log rank_i - \log i)}{\log\!\left(\dfrac{N!}{(N-R)!\,R!}\right)}$$

where N is the number of images in the dataset (equal to 10,000 here), R is the number of relevant images, and the rank order of the i-th relevant image is denoted by $rank_i$. During the test, the top 60 images are evaluated in terms of similarity.
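A direct transcription of the two measures into code (a sketch; `ranks` holds the rank positions of the R relevant images among the N database images, and log-gamma is used to evaluate the logarithm of the factorial denominator):

import math

def normalized_recall(ranks, N):
    """R_n as defined above; `ranks` are the ranks of the R relevant images."""
    R = len(ranks)
    return 1.0 - (sum(ranks) - sum(range(1, R + 1))) / (R * (N - R))

def normalized_precision(ranks, N):
    """P_n as defined above, using log-gamma for log(N! / ((N-R)! R!))."""
    R = len(ranks)
    num = sum(math.log(r) for r in ranks) - sum(math.log(i) for i in range(1, R + 1))
    den = math.lgamma(N + 1) - math.lgamma(N - R + 1) - math.lgamma(R + 1)
    return 1.0 - num / den

ranks = [1, 3, 4, 10, 25]   # hypothetical ranks of five relevant images
print(normalized_recall(ranks, 10000), normalized_precision(ranks, 10000))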

Query Effectiveness of Reduced Dimensional Image Features
To compare the effectiveness of the three different methods for image feature
dimension reduction, a set of experiments was carried out. In these experiments, we used the M-tree as the implementation basis for the indexing structure. The dimension of the M-tree is set to 10, which corresponds to the number of hidden units used in the neural networks. In fact, every image in the collection can serve as a query image; we randomly selected 20 images from each category of the collection as queries. Figure 4 shows the results of queries posed against all 14 classes of images using the three M-trees, which index the three feature spaces generated by CMVF, the pure neural network and PCA.
Figure 4. Comparing the hybrid method with PCA and the neural network on average normalized recall and precision rate, obtained with the visual feature combination of color, texture and shape: (a) average normalized recall rate and (b) average normalized precision rate, plotted against Class ID.

Table 1. Comparison of different dimensionality reduction methods in query effectiveness and training cost

Dimension Reduction Method    Ave. Recall Rate (%)    Ave. Prec. Rate (%)    Training Cost (epochs)
PCA                           63.1                    44.6                   N/A
Neural Network                77.2                    60.7                   7035
CMVF                          77.2                    60.7                   4100

As shown in Figure 4, CMVF achieves a significant improvement in similarity search over PCA for every category in the collection. The improvement ranges from 14.3% to 30% for recall and from 23.2% to 37% for precision, depending on the image class. The reason for this better performance is that in CMVF, we build indexing vectors

from high-dimensional raw feature vectors via PCA and a trained neural network
classifier, which can compress not only various kinds of visual features but also semantic
classification information into a small feature vector. Moreover, we can also see from
Figure 4 that the recall and precision values of neural network and hybrid method are
almost the same. The major difference between the two approaches is the time required
to train the network. Based on Table 1, compared with the pure neural network, CMVF saves nearly 40% of the training time in the learning process. This efficiency is gained by using a relatively small number of neural network inputs. One can therefore conclude that it is advantageous to use hybrid dimension reduction to reduce the dimensions of image features
for effective indexing.
An example to illustrate the query effectiveness of different dimension reduction
methods is shown in Appendix A. We use an image of a cat as the query example. Compared with PCA, CMVF achieves superior retrieval results: in the first nine results, CMVF returns nine out of nine matches, whereas PCA retrieves only two similar images in the top nine. On the other hand, the query effectiveness of the reduced feature space produced by CMVF is very close to that of the pure neural network, which also returns nine out of nine
matches. The major difference is the order of different images in the final result list. We
conclude from this experiment that by incorporating human visual perception, CMVF
indeed is an effective and efficient dimension reduction technique for indexing large
image databases.

Effects on Query Effectiveness Improvement with Additional Visual Feature Integration
One of our conjectures is that it is possible to obtain effective retrieval results from low-dimensional indexing vectors if these vectors are constructed from a combination of multiple visual features. Thus, when more discriminative information is integrated into the final vector, systematic performance improvement can be achieved. To find out how various visual feature configurations contribute to the improvement of query results, a series of experiments was carried out, progressively incorporating new visual features into CMVF and comparing the results on a single set of queries. The system was tested with four different visual feature combinations: (color, texture), (color, shape), (shape, texture) and (color, texture, shape).
Figure 5. Comparison of query effectiveness with different dimension reduction schemes. In each case, (a) the average normalized recall rate and (b) the average normalized precision rate are plotted against Class ID for four feature combinations: (color, texture, shape), (color, texture), (color, shape) and (shape, texture). Figure 5a shows CMVF, Figure 5b the pure neural network, and Figure 5c PCA applied to the linear concatenation of the features.

As shown in Figures 5a and 5b, after the addition of the shape feature to CMVF and the neural network, there is a significant improvement in the recall and precision rates. On average, using color, texture and shape gives an additional 13% and 18% improvement in recall and precision rate, respectively, over the other three configurations, each of which considers only two features. However, the advantage of CMVF over the pure neural
network is that it requires less training cost to achieve results with the same quality. On
the other hand, from Figure 5c, we can see that the query effectiveness of a feature vector generated by PCA does not show any improvement with additional visual feature integration; in fact, there is a slight drop in precision and recall rate in some cases. For example, for image class 5, if the system uses only color and texture, a 61% normalized recall rate can be achieved, whereas the normalized recall rate with a feature combination that includes color, texture and shape is only 60%.
Appendix B shows an example of the query effectiveness gain due to the addition of the shape feature. We used an image of a cat as the query example. With the feature configuration including color, texture and shape, CMVF retrieved 12 cat images in the first 12 matches; without the shape feature, only seven cat images were returned in the top 12 matches.

Robustness
Robustness is a very important feature for a Content-Based Image Retrieval (CBIR)
system. In this section, we investigate CMVF's robustness against both image distortion and the initial configuration of the neural network.

Image Distortion
Humans are capable of correctly identifying and classifying images, even in the
presence of moderate amounts of distortion. This property is potentially useful in real-life image database applications, where the query image may have accompanying noise and distortion; a typical example is the low-quality scan of a photograph. Since CMVF is trained to reduce the dimensionality of raw visual feature vectors, this suggests that if we were to train it using not only the original images but also distorted versions of those images, it might be more robust in recognizing an image with minor noise or distortion.
We therefore added images modified with different kinds of alterations to the learning examples used for training, and carried out a series of experiments to determine how much improvement would occur with this additional training. We randomly chose 10 images
from each category in the training data, and applied a specific distortion to each image
and included the distorted image in the training data. This process was repeated for each
type of distortion, to yield a neural network that should have been trained to recognize
images in the presence of any of the trained distortions. In order to evaluate the effect
of this on query performance, we ran the same set of test queries to measure precision
and recall rate. However, each query image was distorted before using it as query, and
the ranks of the result images for this query were compared against the ranks of result
images for the nondistorted query image. This was repeated for varying levels of
distortion.
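As an illustration of this augmentation step, the sketch below uses Pillow to generate a few distorted copies of one training image; the particular operations and parameters are examples in the spirit of the alterations studied here, not the authors' exact settings.

from PIL import Image, ImageFilter, ImageEnhance

def distorted_variants(path):
    """Yield a few distorted copies of one training image."""
    img = Image.open(path)
    yield img.filter(ImageFilter.GaussianBlur(radius=5))     # blur
    yield img.filter(ImageFilter.MedianFilter(size=5))       # median filter
    yield ImageEnhance.Brightness(img).enhance(1.3)           # brighten
    yield ImageEnhance.Sharpness(img).enhance(2.0)            # sharpen
    yield ImageEnhance.Color(img).enhance(1.5)                # increase saturation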
Figure 6 summarizes the results and Appendix C shows a query example. With
incorporation of human visual perception, CMVF is a robust indexing technique. It can
perform well on different kinds of image variations including color distortion, sharpness
changes, shifting and rotation (Gonzalez & Woods, 2002). The experiment shows that on

the average, CMVF is robust to blurring with 11×11 Gaussian and median filters, random spread by 10 pixels, pixelization by nine pixels, and various kinds of noise including Gaussian and salt-and-pepper noise.

Neural Network Initialization

Another aspect of robustness to investigate in CMVF is the degree to which it is


affected by the initial configuration of the neural network. In CMVF, the weights of the
neural network are initially set to a small random continuous value, so the system may
end up with different configurations for the same training data. It is thus important to
know how much the final query effectiveness will be influenced by the initial choice of
weights. In order to investigate this, we focused on how the initial weights would
influence the final ranking of query results. We built twenty dimension reducers with a
different initial configuration for each of them, and then ran the same set of query images
for each resultant neural network, and compared the query result lists. First, we randomly
selected a query image and performed a similarity search using system one. From the
result list, we chose the top 60 results as reference images. We then ran the same query
example on the other 19 systems and compared the ranks of these 60 reference images.
Rank deviation, rank_dev, was used to measure the rank difference for the same reference image across different models:

$$rank\_dev = \frac{\sum_{s=1}^{S}\sum_{n=1}^{N}\left| rank_n^s - ini\_rank_n \right|}{S \cdot N}$$

where N is the total number of reference images in the study list, ini_rank_n is the initial rank of reference image n, rank_n^s is the rank of reference image n in system s, and S is the number of systems with different initial states. If CMVF is
insensitive to its initialization, reference images should have roughly the same ranking
in each of the systems. Table 3 shows that this is not the case. The average rank_dev
for all reference images is 16.5; thus, overall, the initialization of the neural network
does influence the result. However, in order to study this effect in more detail, we divided
the reference images into six groups (study lists) based on their initial position in system
one: group 1 represents the top 10 (most similar) images (with initial rank from 1 to 10),
group 2 contains the next most similar images (with initial rank from 11 to 20), and so on,
up to group 6, which contains images initially ranked 51-60. If we look at the lower part
of the reference image list (such as group 5 and group 6), we can see that rank_dev is
quite large. This means the initial status of the neural network has a big impact on the
order of results. However, the rank_dev is fairly small for the top part (such as group 1)
of the ranked list. This indicates that for top-ranked images (the most similar images),
the results are relatively insensitive to differences in the neural network initial configuration.
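The measure above translates directly into code. A minimal sketch (the rank dictionaries are hypothetical; the function follows the averaged form of rank_dev given earlier):

def rank_deviation(initial_ranks, system_ranks):
    """Average absolute rank change of the reference images across the S systems."""
    S, N = len(system_ranks), len(initial_ranks)
    total = sum(abs(ranks[img] - initial_ranks[img])
                for ranks in system_ranks for img in initial_ranks)
    return total / (S * N)

initial = {"img1": 1, "img2": 2, "img3": 3}                  # ranks in system one
others = [{"img1": 1, "img2": 4, "img3": 3},
          {"img1": 2, "img2": 2, "img3": 7}]                 # ranks in two other systems
print(rank_deviation(initial, others))   # (0 + 2 + 0 + 1 + 0 + 4) / (2 * 3) = 1.166...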

Analysis and Discussion


Figure 6. Robustness of the CMVF against various image alterations. Each panel plots the rank of the target image against the degree of alteration: (a) blur with Gaussian and median filters (filter size); (b) brighten, darken and sharpen (percentage of variation); (c) pixelize and random spread (pixels of variation); (d) Gaussian noise (standard deviation); (e) salt and pepper noise (percentage of noise pixels); (f) more and less saturation (percentage of variation).

The results show that the proposed hybrid dimension reduction method is superior to the other two dimension reduction methods, PCA and the pure neural network, applied alone. In this section we present a discussion of the issues related to the
performance of this hybrid method.

Parameters for Network Training


A wide variety of parameter values were tested in order to find an optimal choice
for the network-learning algorithm in the experiments just discussed. However, in
practice, it is often undesirable or even impossible to perform large parameter test series.
Moreover, different practical applications may require different sets of network parameters. In our case, the optimal parameters for the quickprop algorithm are a maximum step size of 1.75 and a learning rate of 0.9.
The number of hidden units used can also greatly affect the network convergence and learning time. The more hidden units there are, the easier it is for the network to learn, because more hidden units can retain more information. However, since the network is a dimension reducer, the number of hidden units is restricted to a practical limit.

Number of Principal Components used in Network Training


In the hybrid dimension reduction, the inputs to the network are not the original
image features but the transformed image features from PCA. The number of Principal
Components (PCs) selected may affect the network performance. It may not be necessary
to take too many PCs for network training. On the other hand, the network may not be
trained well with too few PCs since some important information of the feature vectors may
have been excluded in the network training process. To complement the study of
efficiency of our techniques, we report the results of using different PCs for the hybrid dimension reduction for the collection of images in this section. Table 4 shows the learning time for different numbers of PCs.

Table 3. Rank deviation comparison between different study lists

Class No                           1      2      3      4      5      6      7      8
rank_dev for all reference images  14.5   18.6   16.3   17.2   17.8   15.4   15.9   15.7
rank_dev for group 1               0.4    0.5    0.7    0.4    0.6    0.3    0.8    0.5
rank_dev for group 2               1.2    1.3    1.8    1.9    1.3    1.8    1.7    2.8
rank_dev for group 3               5.7    7.1    6.6    5.9    7.5    7.8    7.6    6.7
rank_dev for group 4               10.4   12.3   11.8   12.9   11.7   10.5   10.9   11.4
rank_dev for group 5               26.4   38.3   28.8   32.9   36.7   33.5   34.9   32.4
rank_dev for group 6               42.7   52.1   47.6   48.9   49.5   38.8   39.6   40.7

Class No                           9      10     11     12     13     14     Average
rank_dev for all reference images  15.9   17.4   17.1   15.9   16.1   16.9   16.5
rank_dev for group 1               0.7    0.6    0.6    0.5    0.7    0.6    0.6
rank_dev for group 2               2.1    2.3    1.9    1.7    1.6    2.0    1.8
rank_dev for group 3               7.5    6.8    6.9    6.7    7.1    6.9    6.9
rank_dev for group 4               12.4   9.8    10.7   12.1   12.5   10.3   11.4
rank_dev for group 5               31.4   35.8   33.3   34.6   32.9   31.6   33.1
rank_dev for group 6               41.5   48.8   46.1   47.4   44.1   42.8   45.1
It can be seen that the number of PCs for the best network training in our application depends on their total variance. There are no significant differences in the time required for network training from 35 to 50 PCs, since they account for more than 99% of the total variance. Moreover, since the eigenvalues are in decreasing order, increasing the number of PCs beyond the first 40 does not require much extra time to train the network; for example, there is only a 40-epoch difference between 45 PCs and 50 PCs. However, if we choose a number of PCs whose total variance is less than 90% of the total variance, then the differences are significant: it takes 7100 epochs for 10 PCs, which account for 89.7% of the total variance, to reach the ultimate network error of 0.02, far more than the epochs needed when more than 35 PCs are used.

Scalability and Updates


The number of images that we used in our experiments for testing our dimension reducer is 10,000, which is a reasonably large image database collection. In our experience, the most time-consuming part of the system is not the neural network training process itself, but the collection of training samples for the neural network system. For example, it took us around 40 hours to collect a suitable set of training samples (163) from the 10,000 images, versus 8 minutes to train on those samples using a SUN Sparc machine with 64MB of RAM. The creation of training samples is a one-time job that is performed off-line.
The indexing structure that we used is the well-known M-tree whose scalability has been
demonstrated in many spatial information systems. If a new image needs to be added, the
image features such as color, texture and shape should be extracted first, then combined
together. The combined image features are passed through PCA and neural network for
dimension reduction. The reduced feature vector can be easily inserted into the M-tree.
However, if a new image class needs to be added, the neural network system has to be
retrained and the indexes rebuilt. On the other hand, if an image needs to be deleted, all that is required is the deletion of the corresponding entry from the M-tree, which is much simpler.
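The insertion path just described can be sketched as follows (the PCA transforms, network weights and index object are placeholders standing in for the system's stored state; the chapter does not give this code):

import numpy as np

def insert_new_image(feature_parts, pca_transforms, pca_means, W_ih, b_h, index):
    """Project a new image's raw features and add the reduced vector to the index.

    feature_parts: the raw (color, texture, shape) feature vectors of the new image.
    pca_transforms / pca_means: the per-feature matrices T and mean vectors from training.
    W_ih, b_h: input-to-hidden weights and biases of the trained network.
    index: any object with an insert() method, standing in for the M-tree.
    """
    reduced = [(x - mean) @ T for x, T, mean in zip(feature_parts, pca_transforms, pca_means)]
    z = np.concatenate(reduced)                      # composite feature vector
    key = 1.0 / (1.0 + np.exp(-(W_ih @ z + b_h)))    # hidden-layer activations = index key
    index.insert(key)
    return key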

Table 4. Learning time for different numbers of PCs

Number of PCs    Total Variance %    Learning Errors    Learning Time (Epochs)
7                81.5                57.3               >100,000
10               89.7                0.02               7100
15               93.8                0.02               4320
20               95.5                0.02               3040
25               97.5                0.02               1830
30               98.1                0.02               1440
35               99.1                0.02               1200
40               99.4                0.02               870
45               99.7                0.02               910
50               99.8                0.02               950


SUMMARY
To tackle the dimensionality curse problem for multimedia databases, we have
proposed a novel indexing scheme by combining different types of image features to
support queries that involve composite multiple features. The novelty of this approach
is that various visual features and semantic information can be easily fused into a small
feature vector that provides effective (good discrimination) and efficient (low dimensionality) retrieval. The core of this scheme is to combine PCA and a neural network into a
hybrid dimension reducer. PCA provides the optimal selection of features to reduce the training time of the neural network. Through the learning phase of the network, the context that the human visual system uses for judging the similarity of visual features in images is
acquired. This is implicitly represented as the network weights after the training process.
The feature vectors computed at the hidden units of the neural network (which have a small number of dimensions) become our reduced-dimensional composite image
features. The distance between any two feature vectors at the hidden layer can be used
directly as a measure of similarity between the two corresponding images.
We have developed a learning algorithm to train the hybrid dimension reducer. We
tested this hybrid dimension reduction method on a collection of 10,000 images. The
result is that it achieved the same level of accuracy as the standard neural network
approach with a much shorter network training time. We have also presented the output
quality of our hybrid method for indexing the test image collection using M-trees. This
shows that our proposed hybrid dimension reduction of image features can correctly and
efficiently reduce the dimensions of image features and accumulate the knowledge of
human visual perception in the weights of the network. This suggests that other existing
access methods may also be used efficiently. Furthermore, the experimental results illustrate that by integrating additional visual features, CMVF's retrieval effectiveness can be improved significantly. Finally, we have demonstrated that CMVF can be
made robust against a range of image distortions, and is not significantly affected by the
initial configuration of the neural network. The issue that remains to be studied is
establishing a formal framework to study the effectiveness and efficiency of additional
visual feature integration. There is also a need to investigate more advanced machine
learning techniques that can incrementally reclassify images as new images are added.

REFERENCES

Behrens, R. (1984). Design in the visual arts. Englewood Cliffs, NJ: Prentice Hall.
Bozkaya, T., & Özsoyoglu, M. (1997). Distance-based indexing for high-dimensional metric spaces. In Proceedings of the 16th ACM SIGMOD International Conference on Management of Data (SIGMOD'97), Tucson, Arizona, USA (pp. 357-368).
Brin, S. (1995). Near neighbor search in large metric spaces. In Proceedings of the 21st
International Conference on Very Large Data Bases (VLDB95), Zurich, Switzerland (pp. 574-584).
Canny, J. (1986). A computational approach to edge detection. IEEE Trans. Pattern Anal.
Mach. Intell., 8(6), 679-698.


Chiueh, T. (1994). Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Databases (VLDB94), Santiago de Chile, Chile
(pp. 582-593).
Ciaccia, P., & Patella, M. (1998). Bulk loading the M-tree. In Proceedings of the Ninth Australian Database Conference (ADC'98), Perth, Australia (pp. 15-26).
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for
similarity search in metric spaces. In Proceeding of the 23rd VLDB International
Conference on Very Large Databases (VLDB97), Athens, Greece (pp. 426-435).
Euripides, G.M.P., & Faloutsos, C. (1997). Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering, 9(3), 435-447.
Fahlman, S.E. (1988). An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Carnegie Mellon University.
Faloutsos, C., Barber, R., Flickner, M., Niblack, W., Petkovic, D., & Equitz, W. (1994). Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3/4), 231-261.
Fukunaga, K., & Koontz, W. (1970). Representation of random processes using the Karhunen-Loève expansion. Information and Control, 16(1), 85-101.
Hellerstein, J.M., Naughton, J.F., & Pfeffer, A. (1995). Generalized search trees for database systems. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), Zurich, Switzerland (pp. 562-573).
Gonzalez, R., & Woods, R. (2002). Digital image processing. New York: Addison Wesley.
Jain, A.K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition, 29(8), 1233-1244.
Kittler, J., & Young, P. (1973). A new approach to feature selection based on the Karhunen-Loève expansion. Pattern Recognition, 5(4), 335-352.
Lee, D., Barber, R.W., Niblack, W., Flickner, M., Hafner, J., & Petkovic, D. (1993). Indexing for complex queries on a query-by-content image database. In Proceedings of SPIE Storage and Retrieval for Image and Video Database III, San Jose, California (pp. 24-35).
Lerner, R.M., Kendall, P.C., Miller, D.T., Hultsch, D.F., & Jensen, R.A. (1986). Psychology. New York: Macmillan.
Lowe, D.G. (1985). Perceptual organization and visual recognition. Kluwer Academic.
Salton, G., & McGill, M. (1993). Introduction to modern information retrieval. New York:
McGraw-Hill.
Sellis, T., Roussopoulos, N., & Faloutsos, C. (1987). The R+-tree: A dynamic index for
multidimensional objects. In Proceedings of the 12th International Conference on
Very Large Databases (VLDB87), Brighton, UK (pp. 507-518).
Shen, J., Ngu, A.H.H., Shepherd, J., Huynh, D., & Sheng, Q.Z. (2003). CMVF: A novel
dimension reduction scheme for efficient indexing a large image database. In
Proceedings of the 22nd ACM SIGMOD International Conference on Management
of Data (SIGMOD03), San Diego, California (p. 657).
Sirovich, L., & Kirby, M. (1987). A low-dimensional procedure for the identification of
human faces. Journal of Optical Society of America, 4(3), 519.
Swain, M.J., & Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11-32.
Turner, M. (1986). Texture discrimination by Gabor functions. Biological Cybernetics, 55, 71-82.


White, D., & Jain, R. (1996). Similarity indexing with the SS-tree. In Proceedings of the 12th International Conference on Data Engineering, New Orleans (pp. 516-523).
Wu, J.K. (1997). Content-based indexing of multimedia databases. IEEE Transactions on Knowledge and Data Engineering, 9(6), 978-989.

ENDNOTE
1. The size of the training sample is predefined. In this study, the size is 163.


APPENDIX A
This example compares the query effectiveness of different dimension reduction methods, including CMVF, the pure neural network and PCA, with the feature combination of color, texture and shape.

Query Result with CMVF: Nine out of nine matches

Query Result with Neural Network: Nine out of nine matches

Query Result with PCA: Two out of nine matches


APPENDIX B
An example that demonstrates query effectiveness improvement due to integration
of shape information

Query result of CMVF with color and texture: Seven out of twelve matches

Query result of CMVF with color, texture and shape: Twelve out of twelve matches


APPENDIX C

Demonstration of the robustness of CMVF against various image alterations. Only the best four results are presented. The first image in every column is the query example, and the database has 10,000 images.

(a) Blur with 11x11 Gaussian filter

(b) Blur with 11x11 Median filter

(c) Pixelize at nine pixels

(d) Random spread at 10 pixels

(e) 12% Salt & pepper noise


Chapter 2

From Classification to Retrieval: Exploiting Pattern Classifiers in Semantic Image Indexing and Retrieval
Joo-Hwee Lim, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia

ABSTRACT

Users query images by using semantics. Though low-level features can be easily
extracted from images, they are inconsistent with human visual perception. Hence, low-level features cannot provide sufficient information for retrieval. High-level semantic
information is useful and effective in retrieval. However, semantic information is
heavily dependent upon semantic image regions and beyond, which are difficult to
obtain themselves. Bridging this semantic gap between computed visual features and
user query expectation poses a key research challenge in managing multimedia
semantics. As a spin-off from pattern recognition and computer vision research more
than a decade ago, content-based image retrieval research focuses on a different
problem from pattern classification though they are closely related. When the patterns
concerned are images, pattern classification could become an image classification
problem or an object recognition problem. While the former deals with the entire image

as a pattern, the latter attempts to extract useful local semantics, in the form of objects,
in the image to enhance image understanding. In this chapter, we review the role of
pattern classifiers in state-of-the-art content-based image retrieval systems and discuss
their limitations. We present three new indexing schemes that exploit pattern classifiers
for semantic image indexing, and illustrate the usefulness of these schemes on the
retrieval of 2,400 unconstrained consumer images.

INTRODUCTION
Users query images by using semantics. For instance, in a recent paper by Enser
(2000), he gave a typical request to a stock photo library, using broad and abstract
semantics to describe the images one is looking for:
Pretty girl doing something active, sporty in a summery setting, beach not wearing
lycra, exercise clothes more relaxed in tee-shirt. Feature is about deodorant so girl
should look active not sweaty but happy, healthy, carefree nothing too posed or
set up nice and natural looking.
Using existing image processing and computer vision techniques, low-level features such as color, texture, and shape can be easily extracted from images. However, they
have proved to be inconsistent with human visual perception, let alone capable of capturing broad and abstract semantics, as illustrated by the example above. Hence, low-level features cannot provide sufficient information for retrieval. High-level semantic
information is useful and effective in retrieval. However, semantic information is heavily
dependent upon semantic image regions and beyond, which are difficult to obtain
themselves. Between low-level features and high-level semantic information, there is a
so-called semantic gap. Content-based image retrieval research has yet to bridge this
gap between the information that one can extract from the visual data and the
interpretation that the same data have for a user in a given situation (Smeulders et al.,
2000).
In our opinion, the semantic gap is due to two inherent problems. One problem is
that the extraction of complete semantics from image data is extremely hard, as it demands
general object recognition and scene understanding. This is called the semantics
extraction problem. The other problem is the complexity, ambiguity and subjectivity in
user interpretation, that is, the semantics interpretation problem. They are illustrated
in Figure 1. We think that these two problems are manifestations of two one-to-many
relations.
In the first one-to-many relation that makes the semantics extraction problem
difficult, a real world object, say a face, can be presented in various appearances in an
image. This could be due to the illumination condition when the image of the face is being
recorded; the parameters associated with the image capturing device (focus, zooming,
angle, distance, etc.); the pose of the person; the facial expression; artifacts such as
spectacles and hats; variations due to moustache, aging, and so forth. Hence, the same
real-world object may not have consistent color, texture and shape as far as computer
vision is concerned.


Figure 1. Semantic gap between visual data and user interpretation (comprising the semantics extraction problem and the semantics interpretation problem)

The other one-to-many relation is related to the semantics interpretation problem.


Given an image, there are usually many possible interpretations due to several factors.
One factor is task-related. Different regions or objects of interest might be focused upon
depending on the task or need at hand. For instance, a user looking for beautiful scenic
images as wallpaper for his or her desktop computer would emphasize the aesthetic
aspect of the images (besides an additional requirement of very high resolution).
Furthermore, differences in culture, education background, gender, and so forth, would
also inject subjectivity into user interpretation of an image, not to mention that perception and judgement are not time-invariant. For example, a Chinese user may look for red-dominant images in designing greeting cards for auspicious events, but these images
may not have special appeal to a European user.
As a spin-off from pattern recognition and computer vision research more than a
decade ago (Smeulders et al., 2000), content-based image retrieval research focuses on
a different problem from pattern classification though they are closely related. In pattern
classification, according to the Bayes decision theory, we should select class Ci with the
maximum a posteriori probability P(Ci|x) for a given pattern x in order to minimize the
average probability of classification error (Duda & Hart, 1973, p. 17). When the construction of pattern classifiers relies on statistical learning from observed data, the models for
the pattern classifiers could be parametric or non-parametric.
When the patterns concerned are images, pattern classification could become an
image classification problem (e.g., Vailaya et al., 2001) or an object recognition problem
(e.g., Papageorgiou et al., 1998). While the former deals with the entire image as a pattern,
the latter attempts to extract useful local semantics, in the form of objects, in the image
to enhance image understanding. Needless to say, the success of accurate object
recognition would result in better scene understanding and hence more effective image
classification.

In content-based image retrieval, the objective of a user is to find images relevant


to his or her information need, expressed in some form of query input to an image retrieval
system. Given an image retrieval system with a database of N images (assuming N is large
and stable for a query session), the hidden information need of a user cast over the N
images can be modeled as the posterior probability of the class of relevant images R given
an expression of the information need in the form of query specification q and an image
x in the current database, P(R|q, x). This formulation follows the formalism of probabilistic text information retrieval (Robertson & Sparck Jones, 1976). Here we assume that
the image retrieval system can compute P(R|q, x) for each x in the database. The objective
of the system is to rank and return the images in descending order of probability of
relevance to the user.
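In code, this retrieval objective amounts to sorting the database by the estimated probability of relevance; in the sketch below the scoring function is an arbitrary placeholder for whatever model supplies P(R|q, x):

def retrieve(query, database, relevance_probability):
    """Return database items ranked by decreasing estimated P(R | query, x)."""
    return sorted(database, key=lambda x: relevance_probability(query, x), reverse=True)

# toy usage with a made-up scoring function standing in for the model of P(R | q, x)
scores = {"a": 0.2, "b": 0.9, "c": 0.5}
print(retrieve("q", ["a", "b", "c"], lambda q, x: scores[x]))  # ['b', 'c', 'a']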
Certainly, the image classification and object recognition problems are related to the
image retrieval problem, as their solutions would provide better image semantics to an
image retrieval system to boost its performance. However, the image retrieval problem
is inherently user-centric or query-centric. There is no predefined class and the number
of object classes to be recognized to support queries is huge (Smeulders et al., 2000) in
unconstrained or broad domains.
In this chapter, we review the role of pattern classifiers in state-of-the-art content-based image retrieval systems and discuss their limitations (Section 2). We propose three
new indexing schemes that exploit pattern classifiers for semantic image indexing
(Section 3) and illustrate the usefulness of these schemes on the retrieval of 2,400
unconstrained consumer images (Section 4). Last but not least, we provide our perspective on the future trend in managing multimedia semantics involving pattern classification and related research challenges in Section 5, followed by a concluding remark.

RELEVANT RESEARCH
User studies on the behavior of users of image collections are limited. The most comprehensive effort in understanding what a user wants to do with an image collection is Enser's work on image (Enser, 1993; Enser, 1995) (and also video (Amitage & Enser,
1997)) libraries for media professionals. Other user studies have focused on newspaper
photo archives (Ornager, 1996; Markkula & Sormunen, 2000), art images (Frost et al.,
2000), and medical image archive (Keister, 1994). Typically, knowledgeable users searched
and casual users browsed. But all users found that both searching and browsing are
useful.
As digital cameras and camera phones proliferate, managing personal image collections effectively and efficiently, with semantic organization and access of the images,
is becoming a genuine problem to be tackled in the near future. The most relevant findings
on how consumers manage their personal digital photos come from the user studies by
K. Rodden (Rodden & Wood, 2003; Rodden, 1999). In particular, Rodden and Wood
(2003) found that few people will perform annotation, and comprehensive annotation is
not practical, either typed or spoken. Without text annotation, it is not possible to perform
text-based retrieval. Hence, the semantic gap problem remains unsolved.
Content-based image retrieval research has progressed from the pioneering feature-based approach (Bach et al., 1996; Flickner et al., 1995; Pentland et al., 1995) to the region-based approach (Carson et al., 1997; Li et al., 2000; Smith & Chang, 1996). In order to bridge

the semantic gap (Smeulders et al., 2000) that exists between computed perceptual visual
features and conceptual user query expectation, detecting semantic objects (e.g., faces,
sky, foliage, buildings, etc.) based on trained pattern classifiers has been an active trend
(Naphade et al., 2003; Town & Sinclair, 2000).
The MiAlbum system uses relevance feedback (Lu et al., 2000) to produce annotation for consumer photos. The text keywords in a query are assigned to positive
feedback examples (i.e., retrieved images that are considered relevant by the user who
issues the query). This would require constant user intervention (in the form of relevance
feedback) and the keywords issued in a query might not necessarily correspond to what
is considered relevant in the positive examples. As an indirect annotation, the annotation
process is slow and inconsistent between users. There is also the problem of small
sampling in retrieval using relevance feedback; the small number of samples would not
have statistical significance. Learning with feedback is not stable due to the inconsistency in users' feedback. The similarity will also vary when people use it for different
applications.
Town and Sinclair (2000) use a semantic labeling approach. An image is segmented
into regular non-overlapping regions. Each region is classified into visual categories of
outdoor scenes by neural networks. Similarity between a query and an image is computed
as either the sum over all grids of the Euclidean distance between classification vectors,
or their cosine of correlation. The evaluation was carried out on more than 1,000 Corel
Photo Library images and about 500 home photos, and better classification and retrieval
results were obtained for the professional Corel images.
In a leading effort by the IBM (International Business Machines, Inc.) research
group to design and detect 34 visual concepts (both objects and sites) in the TREC 2002
benchmark corpus (www.nlpir.nist.gov/projects/trecvid/), support vector machines are
trained on segmented regions in key frames using various color and texture features
(Naphade et al., 2003; Naphade & Smith, 2003). Recently the vocabulary has been
extended to include 64 visual concepts for the TREC 2003 news video corpus (Amir et
al., 2003). Several months of effort were contributed by the TREC participants to manually label the training samples using the VideoAnnEx annotation tool (Lin et al., 2003).
However, highly accurate segmentation of objects is a major bottleneck except for
selected narrow domains when few dominant objects are recorded against a clear
background (Smeulders et al., 2000, p. 1360). The challenge of object segmentation is acute
for polysemic images in broad domains such as unconstrained consumer images. The
interpretation of such scenes is usually not unique, as the scenes may have numerous
conspicuous objects, some with unknown object classes (Smeulders et al., 2000).
Our Semantic Region Indexing (SRI) scheme addresses the issue of local region
classification differently. We have also adopted statistical learning to extract local
semantics in image content, though our detection-based approach does not rely on
region segmentation. In addition, our innovation lies in the reconciliation of multiscale view-based object detection maps and spatial aggregation of soft semantic histograms as
image content signature. Our local semantic interpretation scheme can also be viewed
as a systematic extension of the signs designed for domain-specific applications
(Smeulders et al., 2000, p. 1359) and the visual keywords built for explicit query
specification (Lim, 2001).

Image classification is another approach to bridging the semantic gap that has
received more attention lately (Bradshaw, 2000; Lipson et al., 1997; Szummer & Picard,
1998; Vailaya et al., 2001). In particular, the efforts to classify photos based on contents
have been devoted to indoor versus outdoor (Bradshaw, 2000; Szummer & Picard, 1998), natural
versus man-made (Bradshaw, 2000; Vailaya et al., 2001), and categories of natural scenes (Lipson
et al., 1997; Vailaya et al., 2001). In general, the classifications were made based on low-level features such as color, edge directions, and so forth, and Vailaya et al. presented
the most comprehensive coverage of the problem by dealing with a hierarchy of eight
categories (plus three others) progressively with separately designed features. The
vacation photos used in their experiments are a mixture of Corel photos, personal photos,
video key frames, and photos from the Web.
A natural and useful insight is to formulate image retrieval as a classification
problem. In very general terms, the goal of image retrieval is to return images of a class
C that the user has in mind based on a set of features x computed for each image in the
database. In a probabilistic sense, the system should return images ranked in descending order of the return status value P(C|x), however the desirable class C may be defined. Under this
general formulation, several approaches have emerged.
A Bayesian formulation to minimize the probability of retrieval error (i.e., the
probability of wrong classification) was proposed by Vasconcelos and Lippman
(2000) to drive the selection of color and texture features and to unify similarity measures
with the maximum likelihood criteria. Similarly, in an attempt to classify indoor/outdoor
and natural/man-made images, a Bayesian approach was used to combine class likelihoods resulting from multiresolution probabilistic class labels (Bradshaw, 2000). The
class likelihoods were estimated based on local average color information and complex
wavelet transform coefficients.
In a different way, Aksoy and Haralick (2002) as well as Wu and others (2000)
considered a two-class problem with only the relevance class and the irrelevance class.
A two-level classification framework was proposed by Aksoy and Haralick. Image feature
vectors were first mapped to two-dimensional class-conditional probabilities based on
simple parametric models. Linear classifiers were then trained on these probabilities and
their classification outputs were combined to rank images for retrieval. From a different
motivation, the image retrieval problem was cast as a transductive learning problem by
Wu et al. to include an unlabeled data set for training the image classifier. In particular,
a new discriminant-EM algorithm was proposed to generalize the mapping function
learned from the labeled training data to a specific unlabeled data set. The algorithm was
evaluated on a small database (134 images) of seven classes using 12 labeled images in
the form of relevance feedback.
This classification approach has been popular in specific domains. Medical images
have been grouped by pathological classes for diagnostic purposes
(Brodley et al., 1999) or by imaging modalities for visualization purposes (Mojsilovic &
Gomes, 2002). In the case of facial images (Moghaddam et al., 1998), intrapersonal and
extrapersonal classes of variation between two facial images were modeled. Then the
similarity between the image intensity of two facial images was expressed as a probabilistic measure in terms of the intrapersonal and extrapersonal class likelihoods and priors
using a Bayesian formulation.

Image classification or class-based retrieval approaches are adequate for query by
predefined image class. However, the set of relevant images R may not correspond to any
predefined class C in general. In our Class Relative Indexing (CRI) scheme, image
classification is not the end but a means to compute interclass semantic image indexes
for similarity-based matching and retrieval.
While supervised pattern classifiers allow design of image semantics (local object
classes or global scene classes), a major drawback of the supervised learning paradigm
is the human effort required to provide labeled training samples, especially at the image
region level. Lately, there have been two promising trends that attempt to achieve semantic
indexing of images with minimal or no effort of manual annotation (i.e., semisupervised
or unsupervised learning).
In the field of computer vision, researchers have developed object recognition
systems from unlabeled and unsegmented images (Fergus et al., 2003; Selinger & Nelson,
2001; Weber et al., 2000). In the context of relevance feedback, unlabeled images have
also been used to bootstrap the learning from very limited labeled examples (Wang et al.,
2003; Wu et al., 2000). For the purpose of image retrieval, unsupervised models based on
generic texture-like descriptors without explicit object semantics can also be learned
from images without manual extraction of objects or features (Schmid, 2001). As a
representative of the state of the art, a sophisticated generative and probabilistic model
has been proposed to represent, learn, and detect object parts, locations, scales, and
appearances from fairly cluttered scenes with promising results (Fergus et al., 2003).
Motivated by a machine translation perspective, object recognition is posed as
a lexicon learning problem to translate image regions to corresponding words (Duygulu
et al., 2002). More generally, the joint distribution of meaningful text descriptions and
entire or local image contents are learned from images or categories of images labeled with
a few words (Barnard & Forsyth, 2001; Barnard et al., 2003b; Kutics et al., 2003; Li & Wang,
2003). The lexicon learning metaphor offers a new way of looking at object recognition
(Duygulu et al., 2002) and a powerful means to annotate entire images with concepts
evoked by what is visible in the image and specific words (e.g., fitness, holiday, Paris,
etc. (Li & Wang, 2003)). While the results for the annotation problem on entire images
look promising (Li & Wang, 2003), the correspondence problem of associating words
with segmented image regions remains very challenging (Barnard et al., 2003b) as
segmentation, feature selection, and shape representation are critical and nontrivial
choices (Barnard et al., 2003a).
Our Pattern Discovery Indexing (PDI) scheme addresses the issue of minimal
supervision differently. We do not assume availability of text descriptions for image or
image classes as by Barnard et al. (2003b) as well as Li and Wang (2003). Neither do we
know the object classes to be recognized as by Fergus et al. (2003). We discover and
associate local unsegmented regions with semantics and generate their samples to
construct models for content-based image retrieval, all with minimal manual intervention.
This is realized as a novel three-stage hybrid framework that interleaves supervised and
unsupervised classifications.

USING PATTERN CLASSIFIERS FOR SEMANTIC INDEXING
Semantic Region Indexing
One of the goals in content-based image retrieval is semantic interpretation
(Smeulders et al., 2000, p. 1361). To realize strong semantic interpretation of content, we
propose the use of classifications of local image regions and their statistical aggregates
as image index. In this chapter, we adopt statistical learning to systematically derive these
semantic support regions (SSRs) prior to image indexing. During indexing, the SSRs are
detected from multiscale block-based image regions, as inspired by the multiresolution view-based object recognition framework (Papageorgiou et al., 1998; Sung & Poggio, 1998),
hence without a region segmentation step.
The key in image indexing here is not to record the primitive feature vectors
themselves but to project them into a classification space spanned by semantic labels
and use the soft classification decisions as the local indexes for further aggregation.
Indeed the late K.K. Sung also constructed six face clusters and six nonface clusters and
used the distance between the feature vector of a local image block and these clusters
as the input to the trained face detector rather than using the feature vector directly (Sung
& Poggio, 1998).
To compute the SSRs from training instances, we use support vector machines on
suitable features for a local image patch and denote this feature vector as z. A support
vector classifier Si is a detector for SSR i on z. The classification vector T for region z can
be computed via the softmax function (Bishop, 1995) as

$$T_i(z) = \frac{\exp S_i(z)}{\sum_j \exp S_j(z)} \qquad (1)$$

As each support vector machine is regarded as an expert on an SSR class, the outputs of S_i, ∀i, are set to 0 if there exists S_j, j ≠ i, that has a positive output.
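For concreteness, a minimal Python sketch (not part of the original system; the function name and example values are ours) of how the softmax of Equation 1 could be computed from the raw detector outputs:

```python
import numpy as np

def ssr_classification_vector(decision_values):
    """Soft classification vector T(z) over the SSR classes (Equation 1).

    `decision_values` holds the raw SVM outputs S_i(z) of the SSR detectors
    for one image block z, after the expert-masking step described above.
    The softmax turns them into soft class memberships that sum to one.
    """
    s = np.asarray(decision_values, dtype=float)
    e = np.exp(s - s.max())          # subtracting the max is a standard stability trick
    return e / e.sum()

# Example with three hypothetical detector outputs.
print(ssr_classification_vector([1.2, -0.4, 0.1]))
```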
As we are dealing with heterogeneous consumer photos, we adopt color and texture
features to characterize SSRs. A feature vector z has two parts, namely a color feature
vector zc and a texture feature vector z t. For the color feature, we compute the mean and
standard deviation of each color channel (i.e., z c has six dimensions). We use the YIQ
color space over other color spaces, as it performed better in our experiments. For the
texture feature, we adopted the Gabor coefficients (Manjunath & Ma, 1996). Similarly, the
means and standard deviations of the Gabor coefficients (five scales and six orientations)
in an image block are computed as z t (60 dimensions). Zero-mean normalization (Ortega
et al., 1997) was applied to both the color and texture features. In this chapter, we adopted
polynomial kernels with a modified dot product similarity measure between feature
vectors y and z,

$$y \cdot z = \frac{1}{2}\left(\frac{y^c \cdot z^c}{|y^c|\,|z^c|} + \frac{y^t \cdot z^t}{|y^t|\,|z^t|}\right) \qquad (2)$$
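As an illustration only, the modified dot product of Equation 2 could be coded as below; the names are ours and the color and texture parts are assumed to be already zero-mean normalized:

```python
import numpy as np

def modified_dot_product(y_color, y_texture, z_color, z_texture):
    """Modified dot product of Equation 2: the average of the normalized dot
    products of the color parts and of the texture parts of y and z."""
    def normalized_dot(a, b):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 0.5 * (normalized_dot(y_color, z_color) +
                  normalized_dot(y_texture, z_texture))
```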

To detect SSRs with translation and scale invariance in an image to be indexed, the image is scanned with windows of different scales, following the strategy in view-based object detection (Papageorgiou et al., 1998). In our experiments, we progressively increase the window size from 20×20 to 60×60 at a step of 10 pixels, on a 240×360 size-normalized image. That is, after this detection step, we have five detection maps.
To reconcile the detection maps across different resolutions onto a common basis,
we adopt the following principle: If the most confident classification of a region at
resolution r is less than that of a larger region (at resolution r + 1) that subsumes the
region, then the classification output of the region should be replaced by those of the
larger region at resolution r + 1. Using this principle, we start the reconciliation from the
detection map based on the largest scan window (60×60) to the detection map based on the next-to-smallest scan window (30×30). After four cycles of reconciliation, the detection map based on the smallest scan window (20×20) would have consolidated the detection decisions obtained at other resolutions.
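The reconciliation principle can be sketched in a few lines. This is a simplified, hypothetical rendering: it assumes each detection map is an array of classification vectors and that a precomputed table `subsumes` maps each smaller window to the larger window that spatially contains it. Applying such cycles from the 60×60 map down to the 20×20 map yields the consolidated finest map.

```python
import numpy as np

def reconcile(fine_map, coarse_map, subsumes):
    """One reconciliation cycle between detection maps at resolutions r and r + 1.

    fine_map   : array (n_fine, n_classes) of classification vectors T at resolution r
    coarse_map : array (n_coarse, n_classes) of classification vectors at resolution r + 1
    subsumes   : for each fine region, the index of the larger region that contains it
                 (a hypothetical precomputed geometry table)
    """
    out = fine_map.copy()
    for i, j in enumerate(subsumes):
        # Adopt the coarser decision if it is more confident than the finer one.
        if fine_map[i].max() < coarse_map[j].max():
            out[i] = coarse_map[j]
    return out
```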
Suppose a region Z is comprised of n small equal regions with feature vectors z_1, z_2, …, z_n, respectively. To account for the size of detected SSRs in the spatial area Z, the SSR classification vectors of the reconciled detection map are aggregated as

$$T_i(Z) = \frac{1}{n}\sum_k T_i(z_k) \qquad (3)$$

For Query by Example (QBE), the content-based similarity λ between a query q and an image x can be computed in terms of the similarity between their corresponding local tessellated blocks. For example, the similarity based on the L1 distance measure (city block distance) between query q with m local blocks Y_j and image x with m local blocks Z_j is defined as

$$\lambda(q, x) = 1 - \frac{1}{2m}\sum_j \sum_i \left| T_i(Y_j) - T_i(Z_j) \right| \qquad (4)$$
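Under these definitions, Equations 3 and 4 amount to the short sketch below (illustrative names only, assuming one classification vector per small region and m tessellated blocks per image):

```python
import numpy as np

def aggregate_ssr_histogram(block_vectors):
    """Equation 3: average the reconciled classification vectors of the small
    regions that make up a spatial area Z."""
    return np.mean(np.asarray(block_vectors, dtype=float), axis=0)

def sri_similarity(query_blocks, image_blocks):
    """Equation 4: 1 - (1/2m) * sum over blocks and classes of |T_i(Y_j) - T_i(Z_j)|."""
    Y = np.asarray(query_blocks, dtype=float)   # shape (m, number of SSR classes)
    Z = np.asarray(image_blocks, dtype=float)
    m = Y.shape[0]
    return 1.0 - np.abs(Y - Z).sum() / (2.0 * m)
```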

This is equivalent to histogram intersection (Swain & Ballard, 1991) with further
averaging over the number of local histograms m except that the bins have semantic
interpretation as SSRs. There is a trade-off between content symmetry and spatial
specificity. If we want images of similar semantics with different spatial arrangement (e.g.,
mirror images) to be treated as similar, we can have larger tessellated blocks (i.e., similar
to a global histogram). However, in applications where spatial locations are considered
differentiating, local histograms will provide good sensitivity to spatial specificity.
Furthermore, we can attach different weights to the blocks (i.e., Yj, Zj) to emphasize the
focus of attention (e.g., center). In this chapter, we report experimental results based on
even weights as grid tessellation is used. In this chapter, we have attempted various
similarity and distance measures (e.g., cosine similarity, the L2 distance, the Kullback-Leibler (KL) distance, etc.), and the simple city block distance in Equation 4 gave the best performance.

Figure 2. Examples of semantic support regions shown in top-down, left-to-right order: people (face, figure, crowd, skin), sky (clear, cloudy, blue), ground (floor, sand, grass), water (pool, pond, river), foliage (green, floral, branch), mountain (far, rocky), building (old, city, far), interior (wall, wooden, china, fabric, light)

Table 1. Training statistics of the 26 SSR classes

                    min.   max.   avg.
num. pos. trg.        5     26    14.4
num. sup. vec.        9     66    33.3
num. pos. test        3     13     6.9
num. errors           0     14     5.7
error (%)             0    7.8     3.2

Note that we have presented the features, distance measures, and window sizes of
SSR detection, etc. in concrete forms to facilitate understanding. The SSR methodology
is indeed generic and flexible enough to adapt to different application domains.
For the data set and experiments reported in this chapter, we have designed 26 classes
of SSRs (i.e., S_i, i = 1, 2, …, 26 in Equation 1), organized into eight superclasses as
illustrated in Figure 2. We cropped 554 image regions from 138 images and used 375 of
them (from 105 images) as training data for support vector machines to compute the
support vectors of the SSRs and the remaining one-third for validation. Among all the
kernels evaluated, those with better generalization result on the validation set are used
for the indexing and retrieval tasks. A polynomial kernel with degree 2 and constant 1 (C
= 100) (Joachims, 1999) produced the best result on precision and recall. Hence, it was
adopted in the rest of our experiments.
Table 1 lists the training statistics of the 26 SSR classes. The columns show, left
to right, the minimum, maximum and average of the number of positive training examples
(from a total of 375), the number of support vectors computed from the training examples,
the number of positive test examples (from a total of 179), the number of misclassified
examples on the 179 test set, and the percentage of error on the test set. The negative
training (test) examples for an SSR class are the union of positive training (test) examples
of the other 25 classes. The minimum number of positive training and test examples are
from the Interior:Wooden SSR while their maximum numbers are from the People:Face
class. The minimum and maximum numbers of support vectors are associated with the
Sky:Clear and Building:Old SSRs, respectively. The SSR with the best generalization is
the Interior:Wooden class, and the worst test error belongs to the Building:Old class.

Class Relative Indexing

When we are dealing with QBE, the set of relevant images R is obscure and a query
example q only provides a glimpse into it. In fact, the set of relevant images R does not
exist until a query has been specified. However, to anchor the query context, we can
define prior image classes C_k, k = 1, 2, …, M as prototypical instances of the relevance
class R and compute the relative memberships to these classes of query q. Similarly we
can compute the interclass index for any database image x. These interclass memberships
allow us to compute a form of categorical similarity between q and x (see Equation 7).
In this chapter, as our test images are consumer photos, we design a taxonomy for
consumer photos as shown in Figure 3. This hierarchy is more comprehensive than that
addressed by Vailaya et al. (2001). In particular, we consider subcategories for indoor and
city as well as more common subcategories for nature. We select the seven disjoint
categories represented by the leaf nodes (except the miscellaneous category) in Figure
3 as semantic support classes (SSCs) to model the categorical context of relevance. That
is, we trained seven binary SVMs C_k, k = 1, 2, …, 7 on these categories: interior or objects
indoor (inob), people indoor (inpp), mountain and rocky area (mtrk), parks or gardens
(park), swimming pool (pool), street scene (strt), and waterside (wtsd). Using the softmax
function (Bishop, 1995), the output of classification Rk given an image x is computed as,

$$R_k(x) = \frac{\exp C_k(x)}{\sum_j \exp C_j(x)} \qquad (5)$$

The feature vector of an image for classification is the SRI image index, that is, T_i(Z_j), ∀i, j, as described above.
Figure 3. Proposed taxonomy for consumer photos. The seven disjoint categories (the
leaf nodes except miscellaneous) are selected as semantic support classes to model
categorical context of relevance.

Table 2. Statistics related to SSC learning (left to right): SSC class labels, numbers of positive training examples (p-train), numbers of positive test examples (p-test), numbers of support vectors computed (sv), and the classification rate (rate) on the entire 2,400 collection.

SSC     p-train   p-test     sv    rate
inob        27       107    136    95.7
inpp       172       688    234    85.1
mtrk        13        54    116    98.0
park        61       243    158    92.4
pool        10        42     72    98.7
strt       129       516    259    84.4
wtsd        30       120    151    95.3

To be consistent with the SSR training, we adopted polynomial kernels and the similarity measure between image indexes u = T_i(Y_j) and v = T_i(Z_j) given by

$$u \cdot v = \frac{1}{m}\sum_j \frac{\sum_i T_i(Y_j)\, T_i(Z_j)}{\sqrt{\sum_i T_i(Y_j)^2\, \sum_i T_i(Z_j)^2}} \qquad (6)$$

The similarity between a query q and an image x is computed as

$$\lambda(q, x) = 1 - \frac{1}{2}\sum_k \left| R_k(q) - R_k(x) \right| \qquad (7)$$
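For illustration, the CRI index and match of Equations 5 and 7 reduce to a few lines; the names are ours and the seven SVM outputs C_k(x) are assumed to be given:

```python
import numpy as np

def ssc_membership(class_outputs):
    """Equation 5: softmax over the seven SSC classifier outputs C_k(x)."""
    c = np.asarray(class_outputs, dtype=float)
    e = np.exp(c - c.max())
    return e / e.sum()

def cri_similarity(r_query, r_image):
    """Equation 7: 1 - 0.5 * sum_k |R_k(q) - R_k(x)|."""
    r_query = np.asarray(r_query, dtype=float)
    r_image = np.asarray(r_image, dtype=float)
    return 1.0 - 0.5 * np.abs(r_query - r_image).sum()
```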

Similar to the SSR training, the support vector machines were trained using a
polynomial kernel with degree 2 and constant 1 (C = 100) (Joachims, 1999). For each class,
a human subject was asked to define the list of ground truth images from the 2,400
collection, and 20% of the list was used for training. To ensure unbiased training samples,
we generated 10 different sets of positive training samples from the ground truth list for
each class based on uniform random distribution. The negative training (test) examples
for a class are the union of positive training (test) examples of the other six classes and
the miscellaneous class. The classifier training for each class was carried out 10 times on
these different training sets, and the support vector classifier of the best run was retained.
Table 2 lists the statistics related to the SSC learning. The miscellaneous class (not shown
in the table) has 171 images that include images of dark scenes and bad quality.

Pattern Discovery Scheme


The Pattern Discovery Indexing (PDI) scheme is a semisupervised framework to
discover local semantic patterns and generate their samples for training with minimal
human intervention. Image classifiers are first trained on local image blocks from a small

number of labeled images. Then local semantic patterns are discovered from clustering
the image blocks with high classification output. Training samples are induced from
cluster memberships for support vector learning to form local semantic pattern detectors.
An image is then indexed as a tessellation of local semantic histograms and matched
using histogram intersection similar to that of the SRI scheme.
Given an application domain, some typical classes Ck with their image samples are
identified. The training samples are tessellated image blocks z from the class samples.
After learning, the class models would have captured the local class semantics and a high
SVM output (i.e., C_k(z) ≫ 0) would suggest that the local region z is typical of the semantics
of class k.
With the help of the learned class models Ck, we can generate sets of local image
regions X_k that characterize the class semantics (which in turn capture the semantics of the content domain) as

$$X_k = \{\, z \mid C_k(z) > \beta \,\}, \quad \beta \ge 0 \qquad (8)$$

However, the local semantics hidden in each X_k are opaque and possibly multimodal.
We would like to discover the multiple groupings in each class by unsupervised learning
such as Gaussian mixture modeling and fuzzy c-means clustering. The result of the
clustering is a collection of partitions m_kj, j = 1, 2, …, N_k in the space of local semantics
for each class, where mkj are usually represented as cluster centers and Nk are the numbers
of partitions for each class. Once we have obtained the typical semantic partitions for
each class, we can learn the models of Discovered Semantic Regions (DSR) Si, i = 1, 2,
…, N, where N = Σ_k N_k (i.e., we linearize the ordering of m_kj as m_i). We label a local image block x ∈ ∪_k X_k as a positive example for S_i if it is closest to m_i and as a negative example for S_j, ∀j ≠ i,

$$X_i^+ = \{\, x \mid i = \arg\min_t |x - m_t| \,\} \qquad (9)$$

$$X_i^- = \{\, x \mid i \ne \arg\min_t |x - m_t| \,\} \qquad (10)$$

where |.| is some distance measure. Now we can perform supervised learning again on
X_i^+ and X_i^- using, say, support vector machines S_i(x) as the DSR models.
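A simplified sketch of this sample-induction step (Equations 9 and 10) is given below; it assumes the typicality filtering of Equation 8 has already been applied per class and the surviving blocks pooled, and all names are illustrative:

```python
import numpy as np

def induce_dsr_samples(typical_blocks, centers):
    """Induce positive and negative DSR training sets (Equations 9 and 10).

    typical_blocks : array (n, d) of blocks that passed the typicality test of
                     Equation 8, pooled over all classes
    centers        : array (N, d) of the linearized cluster centers m_i
    """
    # Euclidean distance of every block to every cluster center.
    d = np.linalg.norm(typical_blocks[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    positives = {i: typical_blocks[nearest == i] for i in range(len(centers))}  # Eq. 9
    negatives = {i: typical_blocks[nearest != i] for i in range(len(centers))}  # Eq. 10
    return positives, negatives
```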
To visualize a DSR Si, we can display the image block s i that is most typical among
those assigned to cluster mi that belonged to class k,

$$C_k(s_i) = \max_{x \in X_i^+} C_k(x) \qquad (11)$$

For consumer images used in our experiments, we make use of the same seven
disjoint categories represented by the leaf nodes (except the miscellaneous category) in
Figure 3. The same color and texture features as well as the modified dot product similarity
measure used in the supervised learning framework (Equation 2) are adopted for the
support vector classifier training with polynomial kernels (degree 2, constant 1, C = 100; Joachims, 1999).

Table 3. Training statistics of the semantic classes Ck for bootstrapping local semantics.
The columns (left to right) list the class labels, the size of ground truth, the number of
training images, the number of support vectors learned, the number of typical image
blocks subject to clustering (Ck(z) > 2), and the number of clusters assigned.
Class    G.T.   #trg    #SV   #data   #clus
inob      134     15    1905   1429      4
inpp      840     20    2249    936      5
mtrk       67     10    1090   1550      2
park      304     15     955    728      4
pool       52     10    1138   1357      2
strt      645     20    2424    735      5
wtsd      150     15    2454    732      4

The training samples are 60×60 image blocks (tessellated with 20 pixels
in both directions) from 105 sample images. Hence, each SVM was trained on 16,800 image
blocks. After training, the samples from each class k are fed into classifier Ck to test their
typicalities. Those samples with SVM output Ck(z) > 2 (Equation 8) are subject to fuzzy
c-means clustering. The number of clusters assigned to each class is roughly proportional to the number of training images in each class. Table 3 lists training statistics for
these semantic classes: inob (indoor interior/objects), inpp (indoor people), mtrk
(mountain/rocks), park (park/garden), pool (swimming pool), strt (street), and wtsd
(waterside). Hence, we have 26 DSRs in total.
To build the DSR models, we trained 26 binary SVMs with polynomial kernels
(degree 2, constant 1, C = 100 (Joachims, 1999)), each on 7467 positive and negative
examples (Equations 9 and 10) (i.e., sum of column 5 of Table 3). To visualize the 26 DSRs
that have been learned, we compute the most typical image block for each cluster
(Equation 11) and concatenate their appearances in Figure 4. Image indexing was based
on the steps as in the case of SRI (Equations 1 to 3) and matching uses the same similarity
measure as given in Equation 4.

Figure 4. Most typical image blocks of the DSRs learned (left to right): china utensils
and cupboard top (first four) for the inob class; faces with different background and
body close-up (next five) for the inpp class; rocky textures (next two) for the mtrk class;
green foliage and flowers (next four) for the park class; pool side and water (next two)
for the pool class; roof top, building structures, and roadside (next five) for the strt
class; and beach, river, pond, far mountain (next four) for the wtsd class.

EXPERIMENTAL RESULTS
Dataset and Queries
In this chapter, we evaluate the SRI, CRI and PDI schemes on 2,400 unconstrained
consumer photos. These genuine consumer photos were taken over five years in several
countries in both indoor and outdoor settings. The images are those of the smallest
resolution (i.e., 256×384) from Kodak PhotoCDs, in both portrait and landscape layouts.
After removing possibly noisy marginal pixels, the images are of size 240×360. Figure
5 displays typical photos in this collection. As a matter of fact, this genuine consumer
photo collection includes photos of bad quality (e.g., faded, over- and underexposed,
blurred, etc.) (Figure 6). We retained them in our test to reflect the complexity of the
original data. The indexing process automatically detects the layout and applies the
corresponding tessellation template.
We defined 16 semantic queries and their ground truths (G.T.) among the 2,400
photos (Table 4). In fact, Figure 5 shows, in top-down left-to-right order, two relevant
images for queries Q01-Q16 respectively. As we can see from these sample images, the
relevant images for any query considered here exhibit highly varied and complex visual
appearance. Hence, to represent each query, we have selected three relevant photos as query examples for our experiments, because a single query image is far from satisfactory for capturing the semantics of any query. Indeed, single query images have
resulted in poor precisions and recalls in our initial experiments. The precisions and
recalls were computed without the query images themselves in the lists of retrieved
images.
Figure 5. Sample consumer photos from the 2,400 collection. They also represent two relevant images (top-down, left-right) for each of the 16 queries used in our experiments.

Figure 6. Some consumer photos of bad quality

Table 4. Semantic queries used in QBE experiments


Query   Description               G.T.
Q01     indoor                     994
Q02     outdoor                   1218
Q03     people close-up            277
Q04     people indoor              840
Q05     interior or object         134
Q06     city scene                 697
Q07     nature scene               521
Q08     at a swimming pool          52
Q09     street or roadside         645
Q10     along waterside            150
Q11     in a park or garden        304
Q12     at mountain area            67
Q13     buildings close-up         239
Q14     people close up, indoor     73
Q15     small group, indoor        491
Q16     large group, indoor         45

When a query has multiple examples, q = {q_1, q_2, …, q_K}, the similarity λ(q, x) for any database image is computed as

$$\lambda(q, x) = \max_i \lambda(q_i, x) \qquad (12)$$
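In code, the fusion of Equation 12 is simply a maximum over the per-example similarities; `similarity_fn` stands for any of the measures above (e.g., Equation 4 or Equation 7):

```python
def multi_example_similarity(similarity_fn, query_examples, image):
    """Equation 12: the score of an image is the maximum similarity over the
    query examples q_1, ..., q_K."""
    return max(similarity_fn(q, image) for q in query_examples)
```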

Results and Comparison


In this chapter, we compare our proposed indexing schemes (denoted as SRI, CRI
and PDI) with the feature-based approach that combines color and texture in a linearly
optimal way (denoted as CTO). For each approach, we conducted experiments with
various system parameters and selected their best performances. We looked at both the
overall average precisions (denoted as Pavg) and average precisions at top 30 retrieved
images (denoted as P30) over 16 queries to select the best performances. The choices of
system parameters are described below, before the comparison of the best performances.
For the color-based signature, both global and local (4×4 grid) color histograms with b^3 bins (b = 4, 5, …, 17) in the RGB color space were computed for each image.
In the case of global color histograms, the performance saturated at 4096 (b = 16) and 4913
(b = 17) bins with Pavg = 0.36 and P30 = 0.58. Hence, the one that used fewer bins
was preferred. Among the local color histograms attempted, the one with 2197 bins (b =
13) gave the best precisions with Pavg = 0.36 and P30 = 0.58. Histogram intersection (Swain
& Ballard, 1991) was used to compare two color histograms.
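For reference, a b^3-bin RGB histogram and the histogram intersection measure can be sketched as follows; this is our own illustrative code, not the implementation used in the experiments:

```python
import numpy as np

def rgb_histogram(image, b):
    """Normalized b^3-bin RGB color histogram of an (H, W, 3) uint8 image."""
    q = (image.astype(np.int64) * b) // 256              # quantize each channel into b levels
    idx = (q[..., 0] * b + q[..., 1]) * b + q[..., 2]     # joint bin index in [0, b^3)
    hist = np.bincount(idx.ravel(), minlength=b ** 3).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Histogram intersection (Swain & Ballard, 1991) of two normalized histograms."""
    return float(np.minimum(h1, h2).sum())
```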
For the texture-based signature, we adopted the means and standard deviations of
Gabor coefficients and the associated distance measure as reported in Manjunath and
Ma (1996). The Gabor coefficients were computed with five scales and six orientations.
Convolution windows of 20×20, 30×30, …, 60×60 were attempted. Similarly, we
experimented with both global and local (4×4 grid) signatures. The best results were
obtained when 20×20 windows were used. We obtained Pavg = 0.25 and P30 = 0.30 for
global signatures and Pavg = 0.24 and P30 = 0.38 for local signatures. These inferior results,
when compared to those of color histograms, led us to conclude that a simple statistical
texture descriptor is less effective than a color histogram for heterogeneous consumer
image contents.
The distance measures between a query and an image for the color and texture
methods were normalized within [0, 1] and combined linearly with a weight ω ∈ [0, 1]:

$$\lambda(q, x) = \omega\, \lambda_c(q, x) + (1 - \omega)\, \lambda_t(q, x) \qquad (13)$$

where λ_c and λ_t are the similarities based on color and texture features, respectively. Among
the relative weights attempted at 0.1 intervals, the best fusion was obtained at Pavg = 0.38
and P30 = 0.61 with equal color influence and texture influence for global signatures. In
the case of local signatures, the fusion peaked when the local color histograms were given
a dominant influence of 0.9, resulting in Pavg = 0.38 and P30 = 0.59.
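The fusion of Equation 13 and the weight sweep just described amount to the following sketch; `evaluate` is a hypothetical placeholder for the precision measure used to pick the best weight:

```python
import numpy as np

def fuse_similarities(sim_color, sim_texture, w):
    """Equation 13: linear fusion of the normalized color and texture similarities."""
    return w * sim_color + (1.0 - w) * sim_texture

def best_fusion_weight(evaluate):
    """Sweep the weight at 0.1 intervals and keep the one scoring best under
    `evaluate`, a placeholder for the chosen precision measure."""
    weights = np.round(np.arange(0.0, 1.01, 0.1), 1)
    return max(weights, key=evaluate)
```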
The Precision/Recall curves (averaged over 16 queries) in Figure 7 illustrate the
precisions at various recall values for the four methods compared. All three proposed
indexing schemes outperformed the feature-based fusion approach.

Figure 7. Precision/Recall curves for CTO, SRI, CRI and PDI schemes

Table 5. Average precisions at top numbers of retrieved images (left to right): numbers
of retrieved images, average precisions based on CTO, SRI, CRI and PDI, respectively.
The numbers in parentheses are the relative improvement over the CTO method. The last
row shows the overall average precisions.
Avg. Prec.   CTO     SRI           CRI           PDI
At 20        0.54    0.76 (41%)    0.71 (31%)    0.71 (31%)
At 30        0.59    0.70 (19%)    0.68 (15%)    0.68 (15%)
At 50        0.52    0.62 (19%)    0.64 (23%)    0.63 (21%)
At 100       0.46    0.54 (17%)    0.58 (26%)    0.57 (24%)
Overall      0.38    0.45 (18%)    0.53 (39%)    0.48 (26%)

Table 5 shows the average precisions among the top 20, 30, 50 and 100 retrieved
images as well as the overall average precisions for the methods compared. Overall, the
proposed SRI, CRI and PDI schemes improve over the CTO method by 18%, 39% and 26%,
respectively. The CRI scheme has the best overall average precision of 0.53 while the SRI
scheme retrieves the highest number of relevant images at top 20 and 30 images.

DISCUSSION
The complex task of managing multimedia semantics has attracted a lot of research
interest due to the inexorable growth of multimedia information. While automatic feature
extraction does offer some objective measures to index the content of an image, it is far
from satisfactory to capture the subjective and rich semantics required by humans in
multimedia information retrieval tasks. Pattern classifiers provide a mid-level means to
bridge the gap between low-level features and higher level concepts (e.g., faces,
buildings, indoor, outdoor, etc.).
We believe that object and event detection in images and videos based on
supervised or semisupervised pattern classifiers will continue to be active research
areas. In particular, combining multiple modalities (visual, auditory, textual, Web) to
achieve synergy among the semantic cues from different information sources has been
accepted as a promising direction to create semantic indexes for multimedia contents
(e.g., combining visual and textual modalities for images; auditory and textual modalities
for music; auditory, visual and textual modalities for videos, etc.) in order to enhance
system performance. However, currently there is neither established formalism nor
proven large-scale application to guide or demonstrate the exploitation of pattern
classifiers and multiple modalities in semantic multimedia indexing, respectively. Hence,
we believe principled representation and integration schemes for multimodality and
multiclassifier as well as realistic large-scale applications will be well sought after in the
next few years. While some researchers push towards a generic methodology for broad
applicability, we will also see many innovative uses of multimodal pattern classifiers that
incorporate domain-specific knowledge to solve specific narrow domain multimedia
indexing problems.
Similarly, in the area of semantic image indexing and retrieval, we foresee three
promising trends, among other research opportunities. First, generic object detection
and recognition will continue to be an important research topic, especially in the direction
of unlabeled and unsegmented object recognition (e.g., Fergus et al., 2003). We hope that
the lessons learned in many forthcoming object recognition systems in narrow domains
can be abstracted into some generic and useful guiding principles. Next, complementary
information channels will be utilized to better index the images for semantic access. For
instance, in the area of consumer images, the time stamps available from digital cameras
can help to organize photos into events (Cooper et al., 2003). Associated text information
(e.g., stock photos, medical images, etc.) will provide a rich semantic source in addition
to image content (Barnard & Forsyth, 2001; Barnard et al., 2003b; Kutics et al., 2003; Li
&Wang, 2003). Last, but not least, we believe that pattern discovery (as demonstrated
in this chapter) is an interesting and promising direction for image understanding and
indexing. These three trends (object recognition, text association and pattern discovery)
are not conflicting and their interaction and synergy would produce very powerful
semantic image indexing and retrieval systems in the future.

CONCLUDING REMARKS
In this chapter, we have reviewed several key roles of pattern classifiers in content-based image retrieval systems, ranging from segmented object detection to image scene
classification. We pointed out the limitations related to region segmentation for object
detection, image classification for similarity matching, and manual labeling effort for
supervised learning. Three new semantic image indexing schemes are introduced to
address these issues respectively. They are compared to the feature-based fusion
approach that requires very high-dimensional features to attain a reasonable retrieval
performance on the 2,400 unconstrained consumer images with 16 semantic queries.
Experimental results have confirmed that our three proposed indexing schemes are
effective especially when we consider precisions at top retrieved images. We believe that
pattern classifiers are very useful tools to bridge the semantic gap in content-based image
retrieval. The potential for innovative use of pattern classifiers is promising as demonstrated by our research results presented in this chapter.

ACKNOWLEDGMENTS

We thank T. Joachims for his great SVMlight software and J.L. Lebrun for his 2,400
family photos.

REFERENCES
Aksoy, S., & Haralick, R.M. (2002). A classification framework for content-based image
retrieval. In Proceedings of International Conference on Pattern Recognition
2002 (pp. 503-506).

Bach, J.R. et al. (1996). Virage image search engine: an open framework for image
management. In Storage and Retrieval for Image and Video Databases IV,
Proceedings of SPIE 2670 (pp. 76-87).
Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In
Proceedings of International Conference on Computer Vision 2001 (pp. 408-415).
Barnard, K. et al. (2003a). The effects of segmentation and feature choices in a translation
model of object recognition. In Proceedings of IEEE Computer Vision and Pattern
Recognition 2003 (pp. 675-684).
Barnard, K. et al. (2003b). Matching words and pictures. Journal of Machine Learning
Research, 3, 1107-1135.
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bradshaw, B. (2000). Semantic based image retrieval: A probabilistic approach. In
Proceedings of ACM Multimedia 2000, (pp. 167-176).
Brodley, C.E. et al. (1999). Content-based retrieval from medical image databases: A
synergy of human interaction, machine learning and computer vision. In Proceedings of AAAI (pp. 760-767).
Carson, C. et al. (2002). Blobworld: Image segmentation using expectation-maximization
and its application to image querying. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24(8), 1026-1038.
Cooper, M. et al. (2003). Temporal event clustering for digital photo collections. In
Proceedings of ACM Multimedia 2003 (pp. 364-373).
Duda, R.O., & Hart, P.E. (1973). Pattern classification and scene analysis. New York:
John Wiley & Sons.
Duygulu, P. et al. (2002). Object recognition as machine translation: Learning a lexicon
for a fixed image vocabulary. In Proceedings of European Conference on Computer Vision 2002 (vol IV, pp. 97-112).
Enser, P. (2000). Visual image retrieval: Seeking the alliance of concept based and content
based paradigms. Journal of Information Science, 26(4), 199-210.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised
scale-invariant learning. In Proceedings of IEEE Computer Vision and Pattern
Recognition 2003 (pp. 264-271).
Flickner, M. et al. (1995). Query by image and video content: The QBIC system. IEEE
Computer, 28(9), 23-30.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C.
Burges, & A. Smola (Eds.), Advances in kernel methods - Support vector learning
(pp. 169-184). Boston: MIT-Press.
Kapur, J.N., & Kesavan, H.K. (1992). Entropy optimization principles with applications. New York: Academic Press.
Kutics, A. et al. (2003). Linking images and keywords for semantics-based image retrieval.
In Proceedings of International Conference on Multimedia & Exposition (pp.
777-780).
Li, J., & Wang, J.Z. (2003). Automatic linguistic indexing of pictures by a statistical
modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 1-14.
Li, J., Wang, J.Z., & Wiederhold, G. (2000). Integrated region matching for image retrieval.
Proceedings of ACM Multimedia 2000 (pp. 147-156).

Lim, J.H. (2001). Building visual vocabulary for image indexation and query formulation.
Pattern Analysis and Applications, 4(2/3), 125-139.
Lipson, P., Grimson, E., & Sinha, P. (1997). Configuration based scene classification and
image indexing. In Proceedings of International Conference on Computer Vision
(pp. 1007-1013).
Manjunath, B.S., & Ma, W.Y. (1996). Texture features for browsing and retrieval of image
data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8),
837-842.
Moghaddam, B., Wahid, W., & Pentland, A. (1998). Beyond Eigenfaces: Probabilistic
matching for face recognition. In Proceedings of IEEE International Conference
on Automatic Face and Gesture Recognition (pp. 30-35).
Mojsilovic, A., & Gomes, J. (2002). Semantic based categorization, browsing and retrieval
in medical image databases. In Proceedings of IEEE International Conference on
Image Processing (pp. III 145-148).
Naphade, M.R. et al. (2003). A framework for moderate vocabulary semantic visual
concept detection. In Proceedings of International Conference on Multimedia &
Exposition (pp. 437-440).
Ortega, M. et al. (1997). Supporting similarity queries in MARS. In Proceedings of ACM
Multimedia (pp. 403-413).
Papageorgiou, P.C., Oren, M., & Poggio, T. (1997). A general framework for object
detection. In Proceedings of International Conference on Computer Vision (pp.
555-562).
Pentland, A., Picard, R.W., & Sclaroff, S. (1995). Photobook: Content-based manipulation
of image databases. International Journal of Computer Vision, 18(3), 233-254.
Robertson, S.E. (1977). The probability ranking principle in IR. Journal of Documentation, 33, 294-304.
Schmid, C. (2001). Constructing models for content-based image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition 2001 (pp. 39-45).
Selinger, A., & Nelson, R.C. (2001). Minimally supervised acquisition of 3D recognition
models from cluttered images. In Proceedings of IEEE Computer Vision and
Pattern Recognition 2001 (pp. 213-220).
Smeulders, A.W.M. et al. (2000). Content-based image retrieval at the end of the early
years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12),
1349-1380.
Smith, J.R., & Chang, S.-F. (1996). VisualSEEk: A fully automated content-based image
query system. In Proceedings of ACM Multimedia, Boston, November 20 (pp. 87-98).
Sung, K.K., & Poggio, T. (1998). Example-based learning for view-based human face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(1), 39-51.
Swain, M.J., & Ballard, D.N. (1991). Color indexing. International Journal of Computer
Vision, 7(1), 11-32.
Szummer, M., & Picard, R.W. (1998). Indoor-outdoor image classification. In Proceedings of IEEE International Workshop on Content-based Access of Image and
Video Databases (pp. 42-51).
Town, C., & Sinclair, D. (2000). Content-based image retrieval using semantic visual
categories. Technical Report 2000.14, AT&T Laboratories Cambridge.
Vailaya, A., et al. (2001). Bayesian framework for hierarchical semantic classification of
vacation images. IEEE Transactions on Image Processing, 10(1), 117-130.
Vasconcelos, N., & Lippman, A. (2000). A probabilistic architecture for content-based
image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition
(pp. 1216-1221).
Wang, L., Chan, K.L., & Zhang, Z. (2003). Bootstrapping SVM active learning by
incorporating unlabelled images for image retrieval. In Proceedings of IEEE
Computer Vision and Pattern Recognition (pp. 629-634).
Weber, M., Welling, M., & Perona, P. (2000). Unsupervised learning of models for
recognition. In Proceedings of European Conference on Computer Vision (pp. 18-32).
Wu, Y., Tian, Q., & Huang, T.S. (2000). Discriminant-EM algorithm with application to
image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition
(pp. 1222-1227).

Chapter 3

Self-Supervised Learning
Based on Discriminative
Nonlinear Features and
Its Applications for
Pattern Classification
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA

ABSTRACT

For learning-based tasks such as image classification and object recognition, the
feature dimension is usually very high. The learning is afflicted by the curse of
dimensionality as the search space grows exponentially with the dimension.
Discriminant expectation maximization (DEM) proposed a framework for applying
self-supervised learning in a discriminating subspace. This paper extends the linear
DEM to a nonlinear kernel algorithm, Kernel DEM (KDEM), and evaluates KDEM
extensively on benchmark image databases and synthetic data. Various comparisons
with other state-of-the-art learning techniques are investigated for several tasks of
image classification, hand posture recognition and fingertip tracking. Extensive
results show the effectiveness of our approach.
INTRODUCTION
Invariant object recognition is a fundamental but challenging computer vision task,
since finding effective object representations is generally a difficult problem. Three
dimensional (3D) object reconstruction suggests a way to invariantly characterize
objects. Alternatively, objects could also be represented by their visual appearance
without explicit reconstruction. However, representing objects in the image space is
formidable, since the dimensionality of the image space is intractable. Dimension
reduction could be achieved by identifying invariant image features. In some cases,
domain knowledge could be exploited to extract image features from visual inputs, such
as in content-based image retrieval (CBIR). CBIR is a technique which uses visual content
to search images from large-scale image databases according to users' interests, and has
been an active and fast-advancing research area since the 1990s (Smeulders et al., 2000).
However, in many cases machines need to learn such features from a set of examples
when image features are difficult to define. Successful examples of learning approaches
in the areas of content-based image retrieval, face and gesture recognition can be found
in the literature (Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al., 2000;
Bellhumeur, 1996).
Generally, characterizing objects from examples requires huge training datasets,
because input dimensionality is large and the variations that object classes undergo are
significant. Labeled or supervised information of training samples is needed for
recognition tasks. The generalization abilities of many current methods largely depend
on training datasets. In general, good generalization requires large and representative
labeled training datasets. Unfortunately, collecting labeled data can be a tedious, if not
impossible, process. Although unsupervised or clustering schemes have been proposed
(e.g., Basri et al., 1998; Weber et al., 2000), it is difficult for pure unsupervised approaches
to achieve accurate classification without supervision.
This problem can be alleviated by semisupervised or self-supervised learning
techniques which take hybrid training datasets. In content-based image retrieval (e.g.,
Smeulders et al., 2000; Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al.,
2000), there are a limited number of labeled training samples given by user query and
relevance feedback (Rui et al., 1998). Pure supervised learning on such a small training
dataset will have poor generalization performance. If the learning classifier is overtrained on the small training dataset, over-fitting will probably occur. However, there
are a large number of unlabeled images or unlabeled data in general in the given database.
Unlabeled data contain information about the joint distribution over features which can
be used to help supervised learning. These algorithms assume that only a fraction of the
data is labeled with ground truth, but still take advantage of the entire data set to generate
good classifiers; they make the assumption that nearby data are likely to be generated
by the same class. This learning paradigm could be seen as an integration of pure
supervised and unsupervised learning.
Discriminant-EM (DEM) (Wu et al., 2000) is a self-supervised learning algorithm for
such purposes that uses a small set of labeled data together with a large set of unlabeled data. The
basic idea is to learn discriminating features and the classifier simultaneously by
inserting a multiclass linear discriminant step in the standard expectation-maximization
(EM) (Duda et al., 2001) iteration loop. DEM makes the assumption that the probabilistic

structure of data distribution in the lower-dimensional discriminating space is simplified and could be captured by a lower-order Gaussian mixture.
Fisher discriminant analysis (FDA) and multiple discriminant analysis (MDA)
(Duda et al., 2001) are traditional two-class and multiclass discriminant analysis techniques which treat every class equally when finding the optimal projection subspaces.
Contrary to FDA and MDA, Zhou and Huang (2001) proposed a biased discriminant
analysis (BDA) which treats all positive, that is, relevant, examples as one class, and
negative, that is, irrelevant, examples as different classes for content-based image
retrieval. The intuition behind BDA is that all positive examples are alike, while each negative
example is negative in its own way (Zhou & Huang, 2001). Compared with state-of-the-art methods such as support vector machines (SVM) (Vapnik, 2000), BDA (Zhou &
Huang, 2001) outperforms SVM when the number of negative examples is small (< 20).
However, one drawback of BDA is that it ignores unlabeled data in the learning
process. Unlabeled data could improve the classification under the assumption that
nearby data are likely to be generated by the same class (Cozman & Cohen, 2002). In the past
years there has been a growing interest in the use of unlabeled data for enhancing
classification accuracy in supervised learning such as text classification (e.g., Nigram et
al., 2000; Mitchell, 1999), face expression recognition (e.g., Cohen et al., 2003), and image
retrieval (e.g., Wu et al., 2000; Wang et al., 2003).
DEM differs from BDA in the use of unlabeled data and the way they treat the
positive and negative examples in the discrimination step. However, the discrimination
step is linear in both DEM and BDA, and they have difficulty handling data sets which
are not linearly separable. In CBIR, image distribution is likely, for example, a mixture of
Gaussians, which is highly nonlinearly separable. In this chapter, we generalize the DEM
from a linear setting to a nonlinear one. Nonlinear kernel discriminant analysis transforms
the original data space X to a higher dimensional kernel feature space F and then projects
the transformed data to a lower dimensional discriminating subspace such that
nonlinear discriminating features could be identified and training data could be better
classified in a nonlinear feature subspace.
The rest of this chapter is organized as follows: In the second section, we present
nonlinear discriminant analysis using kernel functions (Wu & Huang, 2001; Tian et al.,
2004). In the third section, two schemes are presented for sampling training data for
efficient learning of nonlinear kernel discriminants. In the fourth section, Kernel DEM is
formulated, and in the fifth section we apply the Kernel DEM algorithm to various applications and compare it with other state-of-the-art methods. Our experiments include standard benchmark testing, image classification using a real image database and synthetic data, view-independent hand posture recognition, and invariant fingertip tracking.
Finally, conclusions and future work are given in the last section.

NONLINEAR DISCRIMINANT ANALYSIS


Preliminary results of applying DEM to CBIR were shown in Wu et al. (2000). In this section, we generalize DEM from a linear setting to a nonlinear one. We first map the data x via a nonlinear mapping Φ into some high-, or even infinite-, dimensional feature space F and then apply linear DEM in the feature space F. To avoid working with the mapped data explicitly (which is impossible if F is of infinite dimension), we adopt

the well-known kernel trick (Schölkopf & Smola, 2002). A kernel function k(x, z) computes a dot product in a feature space F: k(x, z) = Φ(x)^T Φ(z). By formulating the algorithm in F using only dot products, we can replace every occurrence of a dot product by the kernel function k, which amounts to performing the same linear algorithm as before, but implicitly in a kernel feature space F. The kernel principle has quickly gained attention in image classification in recent years (e.g., Zhou & Huang, 2001; Wang et al., 2003; Wu & Huang, 2001; Tian et al., 2004; Schölkopf & Smola, 2002; Wolf & Shashua, 2003).
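To make the kernel trick concrete, the following minimal sketch (in Python with NumPy; our illustration, not part of the original chapter) computes the Gram matrix of a Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / c). Any algorithm written purely in terms of dot products can then operate on this matrix in place of the mapped data Φ(x); the function name and default scale are our own choices.

import numpy as np

def rbf_kernel(X, Z, c=10.0):
    """Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / c).

    X: (n, d) array, Z: (m, d) array; returns the (n, m) Gram matrix.
    The scale c is a free choice (the chapter later sets it empirically).
    """
    # Pairwise squared Euclidean distances between rows of X and rows of Z.
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / c)

# Example: K[i, j] plays the role of the dot product Phi(x_i)^T Phi(x_j).
X = np.random.randn(5, 3)
K = rbf_kernel(X, X)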

Linear Features and Multiple Discriminant Analysis


It is common practice to preprocess data by extracting linear and nonlinear features.
In many feature extraction techniques, one has a criterion assessing the quality of a single
feature which ought to be optimized. Often, one has prior information available that can
be used to formulate quality criteria, or probably even more commonly, the features are
extracted for a certain purpose, for example, for subsequently training some classifier.
What one would like to obtain is a feature which is as invariant as possible while still
covering as much of the information necessary for describing the data's properties of
interest.
A classical and well-known technique that solves this type of problem, considering only one linear feature, is the maximization of the so-called Rayleigh coefficient (Mika et al., 2003; Duda et al., 2001):

J(W) = |W^T S_1 W| / |W^T S_2 W|                (1)

Here, W denotes the weight vector of a linear feature extractor (i.e., for an example x, the feature is given by the projection W^T x), and S_1 and S_2 are symmetric matrices designed such that they measure the desired information and the undesired noise along the direction W. The ratio in Equation (1) is maximized when one covers as much as possible of the desired information while avoiding the undesired.
If we look for discriminating directions for classification, we can choose S_B (between-class variance) to measure the separability of the class centers, that is, S_1 in Equation (1), and S_W to measure the within-class variance, that is, S_2 in Equation (1). In this case, we recover the well-known Fisher discriminant (Fisher, 1936), where S_B and S_W are given by

S_B = Σ_{j=1}^{C} N_j (m_j − m)(m_j − m)^T                (2)

S_W = Σ_{j=1}^{C} Σ_{i=1}^{N_j} (x_i^{(j)} − m_j)(x_i^{(j)} − m_j)^T                (3)

Here we use {x_i^{(j)}, i = 1, ..., N_j}, j = 1, ..., C (C = 2 for Fisher discriminant analysis, FDA) to denote the feature vectors of the training samples. C is the number of classes, N_j is the number of

samples in the jth class, x_i^{(j)} is the ith sample from the jth class, m_j is the mean vector of the jth class, and m is the grand mean of all examples.
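As a concrete illustration of Equations (2) and (3) (a sketch of ours, not the authors' code; the helper name is hypothetical), the NumPy function below estimates S_B and S_W from a labeled sample.

import numpy as np

def scatter_matrices(X, y):
    """Between-class scatter S_B and within-class scatter S_W (Equations 2-3).

    X: (N, d) data matrix; y: (N,) integer class labels.
    """
    d = X.shape[1]
    m = X.mean(axis=0)                    # grand mean of all examples
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)             # class mean m_j
        diff = (m_c - m)[:, None]
        S_B += len(Xc) * diff @ diff.T    # N_j (m_j - m)(m_j - m)^T
        centered = Xc - m_c
        S_W += centered.T @ centered      # sum of (x - m_j)(x - m_j)^T over the class
    return S_B, S_W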
If S_1 in Equation (1) is the covariance matrix

S_1 = (1/C) Σ_{j=1}^{C} (1/N_j) Σ_{i=1}^{N_j} (x_i^{(j)} − m)(x_i^{(j)} − m)^T                (4)

and S_2 the identity matrix, we recover standard principal component analysis (PCA) (Diamantaras & Kung, 1996).
If S_1 is the data covariance and S_2 the noise covariance (which can be estimated analogously to Equation (4), but over examples sampled from the assumed noise distribution), we obtain oriented PCA (Diamantaras & Kung, 1996), which aims at finding a
direction that describes most variance in the data while avoiding known noise as much
as possible.
PCA and FDA, that is, linear discriminant analysis (LDA), are both common
techniques for feature dimension reduction. LDA constructs the most discriminative
features while PCA constructs the most descriptive features in the sense of packing most
energy.
There has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter pays no particular attention to the underlying class structure. An interesting result reported by Martinez and Kak (2001) from their study on face recognition is that this is not always true. According to Martinez and Kak, PCA might outperform LDA when
the number of samples per class is small or when the training data nonuniformly sample
the underlying distribution. When the number of training samples is large and training
data is representative for each class, LDA will outperform PCA.
Multiple discriminant analysis (MDA) is a natural generalization of Fisher's linear discriminant analysis (FDA) to multiple classes (Duda et al., 2001). The goal is to maximize the ratio in Equation (1). The advantage of using this ratio is that it has been proven (Fisher, 1938) that if S_W is a nonsingular matrix, then the ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of S_W^{-1} S_B. It should be noted that W maps the original d_1-dimensional data space X to a d_2-dimensional space (d_2 ≤ C − 1, where C is the number of classes).
For both FDA and MDA, the columns of the optimal W are the generalized eigenvector(s) w_i associated with the largest eigenvalue(s): W_opt = [w_1, w_2, ..., w_{C−1}] contains in its columns the C−1 eigenvectors corresponding to the C−1 largest eigenvalues, that is, S_B w_i = λ_i S_W w_i (Duda et al., 2001).
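Continuing the same illustrative sketch, the MDA projection W can be computed as the leading generalized eigenvectors of the pair (S_B, S_W); the small ridge added to S_W is our own safeguard against singularity, anticipating the regularization discussion later in the chapter. Projecting the data as X @ W then yields the discriminating features.

import numpy as np
from scipy.linalg import eigh

def mda_projection(S_B, S_W, n_components):
    """Solve S_B w = lambda S_W w and keep the leading eigenvectors as columns of W.

    n_components is at most C - 1 for C classes.
    """
    d = S_W.shape[0]
    S_W_reg = S_W + 1e-6 * np.trace(S_W) / d * np.eye(d)   # keep S_W well conditioned
    eigvals, eigvecs = eigh(S_B, S_W_reg)                  # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]                      # largest eigenvalues first
    return eigvecs[:, order[:n_components]]                # d x n_components matrix W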

Kernel Discriminant Analysis


To take into account nonlinearity in the data, we propose a kernel-based approach. The original MDA algorithm is applied in a feature space F which is related to the original space by a nonlinear mapping Φ: x → Φ(x). Since in general the number of components in Φ(x) can be very large or even infinite, this mapping is too expensive and cannot be carried out explicitly, but only through the evaluation of a kernel k, with elements k(x_i, x_j) = Φ(x_i)^T Φ(x_j).

This is the same idea as that adopted by the support vector machine (Vapnik, 2000), kernel PCA (Schölkopf et al., 1998), and invariant feature extraction (Mika et al., 1999b; Roth & Steinhage, 1999). The trick is to rewrite the MDA formulae using only dot products of the form Φ(x_i)^T Φ(x_j), so that the reproducing kernel matrix can be substituted into the formulation and the solution, thus eliminating the need for the direct nonlinear transformation.
Using the superscript Φ to denote quantities in the new space, and writing S_B^Φ and S_W^Φ for the between-class and within-class scatter matrices, we have the objective function in the following form:

W_opt^Φ = arg max_W |W^T S_B^Φ W| / |W^T S_W^Φ W|                (5)

and

S_B^Φ = Σ_{j=1}^{C} N_j (m_j^Φ − m^Φ)(m_j^Φ − m^Φ)^T                (6)

S_W^Φ = Σ_{j=1}^{C} Σ_{i=1}^{N_j} (Φ(x_i^{(j)}) − m_j^Φ)(Φ(x_i^{(j)}) − m_j^Φ)^T                (7)

with m^Φ = (1/N) Σ_{k=1}^{N} Φ(x_k) and m_j^Φ = (1/N_j) Σ_{k=1}^{N_j} Φ(x_k^{(j)}), where j = 1, ..., C, and N is the total number of samples.
In general, there is no other way to express the solution W_opt^Φ ∈ F, either because F is of too high or even infinite dimension, or because we do not even know the actual feature space connected to a certain kernel. Schölkopf and Smola (2002) and Mika et al. (2003) showed that any column of the solution W_opt^Φ must lie in the span of all training samples in F, that is, w_i ∈ F. Thus, for some expansion coefficients α = [α_1, ..., α_N]^T,

w_i = Σ_{k=1}^{N} α_k Φ(x_k) = Φα,    i = 1, ..., N                (8)

where Φ = [Φ(x_1), ..., Φ(x_N)]. We can therefore project a data point x_k onto one coordinate of the linear subspace of F as follows (we drop the subscript on w_i in the ensuing equations):

w^T Φ(x_k) = α^T Φ^T Φ(x_k)                (9)


= α^T [k(x_1, x_k), ..., k(x_N, x_k)]^T = α^T ξ_k                (10)

ξ_k = [k(x_1, x_k), ..., k(x_N, x_k)]^T                (11)

where we have rewritten the dot products Φ(x)^T Φ(z) with the kernel notation k(x, z). Similarly, we can project each of the class means onto an axis of the subspace of the feature space F using only dot products:

w^T m_j^Φ = α^T (1/N_j) Σ_{k=1}^{N_j} [Φ(x_1)^T Φ(x_k), ..., Φ(x_N)^T Φ(x_k)]^T                (12)

= α^T [ (1/N_j) Σ_{k=1}^{N_j} k(x_1, x_k), ..., (1/N_j) Σ_{k=1}^{N_j} k(x_N, x_k) ]^T                (13)

= α^T μ_j                (14)

It follows that

w^T S_B^Φ w = α^T K_B α                (15)

where K_B = Σ_{j=1}^{C} N_j (μ_j − μ)(μ_j − μ)^T, with μ the mean of the kernel vectors ξ_k over all training samples, and

w^T S_W^Φ w = α^T K_W α                (16)

where K_W = Σ_{j=1}^{C} Σ_{k=1}^{N_j} (ξ_k − μ_j)(ξ_k − μ_j)^T. The goal of kernel multiple discriminant analysis (KMDA) is to find


A_opt = arg max_A |A^T K_B A| / |A^T K_W A|                (17)

where A = [α_1, ..., α_{C−1}], C is the total number of classes, N is the number of training samples, and K_B and K_W are N×N matrices which require only kernel computations on the training samples (Schölkopf & Smola, 2002).
Now we can solve for the α's; the projection of a new pattern z onto w is given by Equations (9) and (10). Similarly, algorithms using different matrices for S_1 and S_2 in Equation (1) are easily obtained along the same lines.
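For illustration only (the helper name kmda is ours; it assumes the mda_projection sketch above and a precomputed Gram matrix such as the one produced by the earlier rbf_kernel sketch), K_B and K_W can be assembled directly from the kernel vectors ξ_k and Equation (17) solved as a generalized eigenproblem. A new pattern z is then projected as [k(x_1, z), ..., k(x_N, z)] A, as in Equations (9) and (10).

import numpy as np

def kmda(K, y, n_components):
    """Kernel MDA: maximize |A^T K_B A| / |A^T K_W A| (Equation 17).

    K: (N, N) Gram matrix on the training samples; y: (N,) class labels.
    The k-th column of K is the kernel vector xi_k of Equation (11).
    """
    N = K.shape[0]
    mu = K.mean(axis=1)                         # grand mean of the kernel vectors
    K_B = np.zeros((N, N))
    K_W = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu_j = K[:, idx].mean(axis=1)           # class-mean kernel vector (Equation 14)
        d = (mu_j - mu)[:, None]
        K_B += len(idx) * d @ d.T               # N_j (mu_j - mu)(mu_j - mu)^T
        centered = K[:, idx] - mu_j[:, None]    # xi_k - mu_j for samples of class c
        K_W += centered @ centered.T
    return mda_projection(K_B, K_W, n_components)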

Biased Discriminant Analysis


Biased discriminant analysis (BDA) (Zhou & Huang, 2001) differs from the traditional MDA defined in Equations (1)-(3) and (5)-(7) in a modification of the computation of the between-class scatter matrix S_B and the within-class scatter matrix S_W. They are replaced by S_{N→P} and S_P, respectively:

S_{N→P} = Σ_{i=1}^{N_y} (y_i − m_x)(y_i − m_x)^T                (18)

S_P = Σ_{i=1}^{N_x} (x_i − m_x)(x_i − m_x)^T                (19)

where {x_i, i = 1, ..., N_x} denotes the positive examples, {y_i, i = 1, ..., N_y} denotes the negative examples, and m_x is the mean vector of the set {x_i}. S_{N→P} is the scatter matrix between the negative examples and the centroid of the positive examples, and S_P is the scatter matrix within the positive examples. The subscript N→P indicates the asymmetric property of this approach, that is, the user's biased opinion towards the positive class; hence the name biased discriminant analysis (BDA) (Zhou & Huang, 2001).
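A small illustrative sketch (ours, not the authors' implementation) of the two BDA scatter matrices in Equations (18) and (19); the BDA projection then maximizes |W^T S_{N→P} W| / |W^T S_P W|, for example with the same generalized eigensolver used for MDA above.

import numpy as np

def bda_scatters(X_pos, Y_neg):
    """Biased discriminant analysis scatter matrices (Equations 18-19).

    X_pos: (N_x, d) positive examples; Y_neg: (N_y, d) negative examples.
    Both matrices are centered on the positive mean m_x only.
    """
    m_x = X_pos.mean(axis=0)
    D_neg = Y_neg - m_x
    D_pos = X_pos - m_x
    S_NtoP = D_neg.T @ D_neg    # scatter of the negatives around the positive centroid
    S_P = D_pos.T @ D_pos       # scatter within the positive examples
    return S_NtoP, S_P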

Regularization and Discounting Factors


It is well known that sample-based plug-in estimates of the scatter matrices based on Equations (2), (3), (6), (7), (18) and (19) will be severely biased for a small number of training samples; that is, the large eigenvalues become larger while the small ones become smaller. If the number of feature dimensions is large compared to the number of training examples, the problem becomes ill-posed. Especially in the case of kernel algorithms, we effectively work in the space spanned by all N mapped training examples Φ(x), which are, in practice, often linearly dependent. For instance, for KMDA, a solution with zero within-class scatter (i.e., A^T K_W A = 0) is very likely due to overfitting. A compensation or regularization can be achieved by adding small quantities to the diagonal of the scatter matrices (Friedman, 1989).


TRAINING ON A SUBSET
We still have one problem: although we can avoid working explicitly in the extremely high or infinite dimensional space F, we are now facing a problem in N variables, a number which in many practical applications would not allow us to store or manipulate N×N matrices on a computer anymore. Furthermore, solving, for example, an eigenproblem or a QP of this size is very time consuming (O(N³)). To maximize Equation (17), we need to solve an N×N eigen- or mathematical programming problem, which might be intractable for a large N. Approximate solutions can be obtained by sampling a representative subset of the training data {x_k | k = 1, ..., M}, with M << N, and using ξ̃_k = [k(x_1, x_k), ..., k(x_M, x_k)]^T, computed only against the selected samples, in place of ξ_k. Two data-sampling schemes are proposed.

PCA-Based Kernel Vector Selection

The first scheme is blind to the class labeling. We select representatives, or kernel vectors, by identifying those training samples which are likely to play a key role in the matrix of kernel features Ξ = [ξ_1, ..., ξ_N]. Ξ is an N×N matrix, but rank(Ξ) << N when the size of the training dataset is very large. This fact suggests that some training samples could be ignored in calculating the kernel features ξ.
We first compute the principal components of Ξ. Denote the N×N matrix of concatenated eigenvectors by P. Thresholding the elements of abs(P) by some fraction of its largest element allows us to identify salient PCA coefficients. For each column corresponding to a nonzero eigenvalue, choose the training samples which correspond to a salient PCA coefficient, that is, the training samples corresponding to rows that survive the thresholding. Doing so for every nonzero eigenvalue, we arrive at a decimated training set, which represents data at the periphery of each data cluster. Figure 1 shows an example of KMDA on a 2D (two-dimensional) two-class nonlinearly separable sample set.
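One possible implementation sketch of this PCA-based selection (our illustration; the threshold fraction frac is an assumed free parameter, as the text only says "some fraction of the largest element"):

import numpy as np

def pca_kernel_vector_selection(K, frac=0.5, eig_tol=1e-10):
    """Select kernel vectors from the matrix of kernel features (the Gram matrix).

    Returns the indices of training samples whose absolute PCA coefficients
    exceed frac * max|P| in some component with a nonzero eigenvalue.
    """
    eigvals, P = np.linalg.eigh(K)                 # eigen-decomposition of the Gram matrix
    threshold = frac * np.abs(P).max()
    selected = set()
    for col in np.where(eigvals > eig_tol)[0]:     # columns with nonzero eigenvalue
        rows = np.where(np.abs(P[:, col]) > threshold)[0]
        selected.update(rows.tolist())             # rows that survive the thresholding
    return np.array(sorted(selected))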

Evolutionary Kernel Vector Selection


The second scheme takes advantage of the class labels in the data. We maintain a set of kernel vectors at every iteration which are meant to be the key pieces of data for training. M initial kernel vectors, KV(0), are chosen at random. At iteration k, we have a set of kernel vectors, KV(k), which are used to perform KMDA such that the nonlinear projection y_i^(k) = (w^(k))^T Φ(x_i) = (A_opt^(k))^T ξ_i^(k) of the original data x_i can be obtained. We assume a Gaussian distribution θ^(k) for each class in the nonlinear discriminating space, and the parameters θ^(k) can be estimated from {y^(k)}, so that the labeling and the training error e(k) can be obtained by l_i^(k) = arg max_j p(l_j | y_i, θ^(k)).

If e(k) < e(k−1), we randomly select M training samples from the correctly classified training samples as the kernel vectors KV(k+1) for iteration k+1. Another possibility is that, if any current kernel vector is correctly classified, we randomly select a sample in its topological neighborhood to replace this kernel vector in the next iteration. Otherwise (i.e., when the error does not decrease), we terminate.


Figure 1. KMDA with a 2D two-class nonlinearly separable example: (a) original data; (b) the kernel features of the data; (c) the normalized coefficients of PCA on the kernel feature matrix, in which only a small number of them are large (in black); (d) the nonlinear mapping
The evolutionary kernel vector selection algorithm is summarized below:

Evolutionary Kernel Vector Selection: Given a set of training data D = (X, L) = {(x_i, l_i), i = 1, ..., N},
identify a set of M kernel vectors KV = {v_i, i = 1, ..., M}.

k = 0; e = ∞; KV(0) = random_pick(X);              // Init
do {
    A_opt(k) = KMDA(X, KV(k));                     // Perform KMDA
    Y(k) = Proj(X, A_opt(k));                      // Project X into the discriminating space
    θ(k) = Bayes(Y(k), L);                         // Bayesian classifier
    L(k) = Labeling(Y(k), θ(k));                   // Classification
    e(k) = Error(L(k), L);                         // Calculate error
    if (e(k) < e)
        e = e(k); KV = KV(k); k++;
        KV(k) = random_pick({x_i : l_i(k) = l_i}); // new kernel vectors from correctly classified samples
    else
        KV = KV(k−1);
        break;
    end
}
return KV;

KERNEL DEM ALGORITHM

In this chapter, pattern classification is formulated as a transductive problem: the goal is to generalize the mapping function learned from the labeled training data set L to a specific unlabeled data set U. We make the assumption here that L and U are from the same distribution. This assumption is reasonable because, for example, in content-based image retrieval the query images are drawn from the same image database. In short, pattern classification is to classify the images or objects in the database by

y_i = arg max_{j=1,...,C} p(y_j | x_i, L, U),    x_i ∈ U                (20)

where C is the number of classes and y_i is the class label for x_i.
The expectation-maximization (EM) approach (Duda et al., 2001) can be applied to this transductive learning problem, since the labels of the unlabeled data can be treated as missing values. We assume that the hybrid data set is drawn from a mixed density distribution of C components {c_j, j = 1, ..., C}, which are parameterized by θ = {θ_j, j = 1, ..., C}. The mixture model can be represented as

p(x | θ) = Σ_{j=1}^{C} p(x | c_j; θ_j) p(c_j | θ_j)                (21)

where x is a sample drawn from the hybrid data set D = L ∪ U. We make another assumption that each component in the mixture model corresponds to one class, that is, {y_j = c_j, j = 1, ..., C}.
Since the training data set D is the union of the labeled data set L and the unlabeled data set U, the joint probability density of the hybrid data set can be written as:

p(D | θ) = ∏_{x_i ∈ U} Σ_{j=1}^{C} p(c_j | θ) p(x_i | c_j; θ) · ∏_{x_i ∈ L} p(y_i = c_i | θ) p(x_i | y_i = c_i; θ)                (22)

Equation (22) holds when we assume that each sample is independent of the others. The first part of Equation (22) is for the unlabeled data set, and the second part is for the labeled data set.
The parameters θ can be estimated by maximizing the a posteriori probability p(θ | D); equivalently, this can be done by maximizing log p(θ | D). Letting l(θ | D) = log( p(θ) p(D | θ) ), we have

l(θ | D) = log p(θ) + Σ_{x_i ∈ U} log( Σ_{j=1}^{C} p(c_j | θ) p(x_i | c_j; θ) ) + Σ_{x_i ∈ L} log( p(y_i = c_i | θ) p(x_i | y_i = c_i; θ) )                (23)

Since the log of a sum is hard to deal with, a binary indicator z_i = (z_{i1}, ..., z_{iC}) is introduced, associated with the observation O_j: z_{ij} = 1 if and only if y_i = c_j, and z_{ij} = 0 otherwise, so that

l(θ | D, Z) = log p(θ) + Σ_{x_i ∈ D} Σ_{j=1}^{C} z_{ij} log( p(O_j | θ) p(x_i | O_j; θ) )                (24)

The EM algorithm can be used to estimate the probability parameters θ by an iterative hill-climbing procedure, which alternately calculates E(Z), the expected values of the indicators for all unlabeled data, and estimates the parameters θ given E(Z). The EM algorithm generally reaches a local maximum of l(θ | D).
As an extension of the EM algorithm, Wu et al. (2000) proposed a three-step algorithm, called Discriminant-EM (DEM), which loops between an expectation step, a discriminant step (via MDA), and a maximization step. DEM estimates the parameters of a generative model in a discriminating space.
As discussed in the section on kernel discriminant analysis, Kernel DEM (KDEM) is a generalization of DEM in which, instead of a simple linear transformation to project the data into discriminant subspaces, the data is first projected nonlinearly into a high dimensional feature space F where the data is better linearly separated. The nonlinear mapping Φ(·) is implicitly determined by the kernel function, which must be chosen in advance. The transformation from the original data space X to the discriminating space, which is a linear subspace of the feature space F, is given implicitly by w^T Φ(·), or explicitly by A^T ξ. A low-dimensional generative model is used to capture the transformed data:

p(y | θ) = Σ_{j=1}^{C} p(w^T Φ(x) | c_j; θ_j) p(c_j | θ_j)                (25)

Empirical observations suggest that the transformed data y is approximately Gaussian in the discriminating space, and so in our current implementation we use low-order Gaussian mixtures to model


the transformed data in the discriminating space. Kernel DEM can be initialized by selecting all labeled data as kernel vectors and training a weak classifier based only on the labeled samples. Then the three steps of Kernel DEM are iterated until an appropriate convergence criterion is met:

E-step: set Ẑ^(k+1) = E[Z | D; θ̂^(k)]

D-step: set A_opt^(k+1) = arg max_A |A^T K_B A| / |A^T K_W A|, and project the data into the corresponding linear subspace of the feature space F.

M-step: set θ̂^(k+1) = arg max_θ p(θ | D; Ẑ^(k+1))

The E-step gives probabilistic labels to unlabeled data, which are then used by the
D-step to separate the data. As mentioned above, this assumes that the class distribution
is moderately smooth.
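Purely as an illustrative skeleton (with hypothetical helper names, reusing the rbf_kernel and kmda sketches given earlier, and with a simplified initialization and a hard-labeled D-step rather than the authors' exact scheme), the three KDEM steps could be iterated roughly as follows:

import numpy as np
from scipy.stats import multivariate_normal

def kdem(X, labeled_idx, y_labeled, n_classes, n_iter=10):
    """Sketch of the Kernel DEM loop: D-step (KMDA), M-step (one Gaussian per class), E-step."""
    K = rbf_kernel(X, X)                                 # Gram matrix on the hybrid data set
    N = X.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    onehot = np.eye(n_classes)[np.asarray(y_labeled)]
    Z = np.full((N, n_classes), 1.0 / n_classes)         # soft labels; labeled points stay fixed
    Z[labeled_idx] = onehot
    kv_idx, kv_labels = labeled_idx, np.asarray(y_labeled)   # start from the labeled data only
    for _ in range(n_iter):
        # D-step: KMDA with the current kernel vectors/labels, then project all data.
        A = kmda(K[np.ix_(kv_idx, kv_idx)], kv_labels, n_classes - 1)
        Y = K[:, kv_idx] @ A
        # M-step: per-class Gaussian (prior, mean, covariance) weighted by the soft labels Z.
        models = []
        for c in range(n_classes):
            w = Z[:, c]
            mu = np.average(Y, axis=0, weights=w)
            cov = np.cov(Y.T, aweights=w) + 1e-6 * np.eye(Y.shape[1])
            models.append((w.mean(), multivariate_normal(mu, cov)))
        # E-step: probabilistic labels for the unlabeled data; labeled labels stay fixed.
        resp = np.column_stack([p * g.pdf(Y) for p, g in models])
        Z = resp / (resp.sum(axis=1, keepdims=True) + 1e-300)
        Z[labeled_idx] = onehot
        kv_idx, kv_labels = np.arange(N), Z.argmax(axis=1)
    return Z, A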

EXPERIMENTS AND ANALYSIS


In this section, we compare KMDA and KDEM with other supervised learning techniques on various benchmark datasets and synthetic data for image classification, hand posture recognition, and invariant fingertip tracking tasks. The datasets include the benchmark datasets2, the MIT facial image database3 (CBCL Face Database), the Corel database4, our raw dataset of 14,000 unlabeled hand images together with 560 labeled images, and 1,000 images including both fingertips and nonfingertips.

Benchmark Test for KMDA


In the first experiment, we verify the ability of KMDA combined with our data sampling algorithms. Several benchmark datasets2 are used in the experiments. KMDA is compared with a single RBF classifier (RBF), a support vector machine (SVM), AdaBoost, and the kernel Fisher discriminant (KFD) on the benchmark datasets (Mika et al., 1999a), as well as with linear MDA. Kernel functions that have been proven useful are, for example, the Gaussian RBF, k(x, z) = exp(−‖x − z‖²/c), or polynomial kernels, k(x, z) = (x·z)^d, for some positive constants c ∈ R and d ∈ N, respectively (Schölkopf & Smola, 2002). RBF kernels are used in all kernel-based algorithms.
In Table 1, KMDA-random is KMDA with kernel vectors randomly selected from the training samples, KMDA-pca is KMDA with kernel vectors selected based on PCA, and KMDA-evolutionary is KMDA with kernel vectors selected based on the evolutionary scheme. The benchmark test shows that kernel MDA achieves performance comparable to the other state-of-the-art techniques over the different training datasets, in spite of the use of a decimated training set. Comparing the three schemes for selecting kernel vectors, it is clear that both the PCA-based and the evolutionary-based schemes work slightly better than the random selection scheme, giving smaller error rates and/or smaller standard deviations. Finally, Table 1 clearly shows the superior performance of KMDA over linear MDA.

Table 1. Benchmark test: Average test error and standard deviation in percentage

Method               Banana        Breast-Cancer   Heart
RBF                  10.8±0.06     27.6±0.47       17.6±0.33
AdaBoost             12.3±0.07     30.4±0.47       20.3±0.34
SVM                  11.5±0.07     26.0±0.47       16.0±0.33
KFD                  10.8±0.05     25.8±0.48       16.1±0.34
MDA                  38.43±2.5     28.57±1.37      20.1±1.43
KMDA-random          11.03±0.26    27.4±1.53       16.5±0.85
KMDA-pca             10.7±0.25     27.5±0.47       16.5±0.32
KMDA-evolutionary    10.8±0.56     26.3±0.48       16.1±0.33
(# Kernel Vectors)   120           40              20

Kernel Setting
There are two parameters that need to be determined for kernel algorithms using the RBF (radial basis function) kernel. The first is the degree c and the second is the number of kernel vectors used. Kernel-based approaches are sensitive to the parameters selected; for example, the classical Gaussian (RBF) kernel, k(x, z) = exp(−‖x − z‖²/c), is highly sensitive to the scale parameter c. The performance of kernel-based approaches varies not only across different kernels but also within the same kernel when applied to different image databases, for example. To date there is no general guideline on how to set the parameters beforehand, other than setting them empirically.
In the second experiment, we determine the degree c and the number of kernel vectors empirically, using the Gaussian RBF kernel as an example. The same benchmark dataset as in the previous section is used. Figure 2 shows the average error rate in percentage of KDEM with the RBF kernel under different degrees c and varying numbers of kernel vectors on the heart data. By empirical observation, we find that c = 10 and 20 kernel vectors give nearly the best performance at a relatively low computational cost. Similar results are obtained for tests on the breast-cancer data and the banana data. Therefore this kernel setting is used in the rest of our experiments.

KDEM versus KBDA for Image Classification


As mentioned above, biased discriminant analysis (BDA) (Zhou & Huang, 2001) has achieved satisfactory results in content-based image retrieval when the number of training samples is small (< 20). BDA differs from traditional MDA in that it tends to cluster all the positive samples and scatter all the negative samples away from the centroid of the positive examples.

Figure 2. Average error rate (in percentage) for KDEM with the RBF kernel under varying degree c and number of kernel vectors on the heart data

Figure 3. Comparison of KDEM and KBDA for face and non-face classification (error rate versus the number of negative examples)

This works very well with a relatively small training set. However, BDA is biased towards the centroid of the positive examples. It will be effective only if these positive examples are the most-informative images (Cox et al., 2000; Tong & Wang, 2001), for example, images close to the classification boundary. If the positive examples are most-positive images (Cox et al., 2000; Tong & Wang, 2001), that is, images far away from the classification boundary, then the optimal transformation found from the most-positive images will not help the classification of images on the boundary.
Moreover, BDA ignores the unlabeled data and takes only the labeled data in learning.
In the third experiment, Kernel DEM (KDEM) is compared with Kernel BDA (KBDA) on both an image database and synthetic data. Figure 3 shows the average classification error rate in percentage for KDEM and KBDA with the same RBF kernel for face and nonface classification. The face images are from the MIT facial image database3 (CBCL Face Database) and the nonface images are from a Corel database4. The experiment uses 2,429 face images from the MIT database and 1,385 nonface images (14 categories with about 99 images each), a subset of the Corel database. Some examples of face and nonface images are shown in Figure 4.

Figure 4. Examples of (a) face images from the MIT facial database and (b) non-face images from the Corel database

For the training sets, the face
images are randomly selected from the MIT database with fixed size 100, and nonface
images are randomly selected from the Corel database with varying sizes from five to 200.
The testing set consists of 200 random images (100 faces and 100 nonfaces) from two
databases. The images are resized to 16×16 and converted to column-wise concatenated feature vectors.
In Figure 3, when the number of negative examples is small (< 20), KBDA outperforms
KDEM, and KDEM performs better when more negative examples are provided. This
agrees with our expectation.
In this experiment, the number of negative examples is increased from five to 200. There is a possibility that most of the negative examples are from the same class. To further test the capability of KDEM and KBDA in classifying negative examples with a varying number of classes, we perform experiments on synthetic data, over whose distribution we have more control.
A series of synthetic data sets is generated based on Gaussian or Gaussian mixture models with feature dimensions of 2, 5 and 10 and a number of negative classes varying from 1 to 9. In the feature space, the centroid of the positive samples is set at the origin and the centroids of the negative classes are set randomly at distance 1 from the origin. The variance of each class is a random number between 0.1 and 0.3. The features are independent of each other. We include the 2D synthetic data for visualization purposes. Both the training and testing sets have a fixed size of 200 samples, with 100 positive samples and 100 negative samples drawn from the varying number of classes.
Figure 5 shows the comparison of the KDEM, KBDA, and DEM algorithms on 2D, 5D, and 10D synthetic data.


Figure 5. Comparison of KDEM, KBDA and DEM algorithms on (a) 2-D, (b) 5-D and (c) 10-D synthetic data with a varying number of negative classes (error rate versus the number of classes of negative examples)

In all cases, as the number of negative classes increases from 1 to 9, KDEM always performs better than KBDA and DEM, showing its superior capability for multiclass classification. Linear DEM has performance comparable to KBDA on the 2D synthetic data and outperforms KBDA on the 10D synthetic data. One possible reason is that learning is performed on hybrid data in both DEM and KDEM, while only labeled data


is used in KBDA. This indicates that proper incorporation of unlabeled data in


semisupervised learning does improve classification to some extent.
Moreover, Zhou and Huang (2001) used two parameters μ, γ ∈ [0,1] to control the regularization. The regularized versions of S_P and S_{N→P}, with n being the dimension of the original space and I the identity matrix, are

S_P = (1 − μ) S_P + (μ/n) tr[S_P] I                (26)

S_{N→P} = (1 − γ) S_{N→P} + (γ/n) tr[S_{N→P}] I                (27)

The parameter μ controls shrinkage toward a multiple of the identity matrix, tr[·] denotes the trace of a matrix, and γ is the discounting factor. With different combinations of the (μ, γ) values, the regularized and/or discounted BDA provides a rich set of alternatives: (μ = 0, γ = 1) gives a subspace that is mainly defined by minimizing the scatter among the positive examples, resembling the effect of a whitening transform (the whitening transform is the special case when only positive examples are considered); (μ = 1, γ = 0) gives a subspace that mainly separates the negative examples from the positive centroid, with minimal effort spent on clustering the positive examples; (μ = 0, γ = 0) is the full BDA; and (μ = 1, γ = 1) represents the extreme of discounting all configurations of the training examples and keeping the original feature space unchanged.
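As an illustrative helper (ours, reusing the bda_scatters sketch given earlier), the regularized and discounted scatter matrices of Equations (26) and (27) could be computed as follows; the default (μ, γ) values simply echo the setting discussed in the next paragraph.

import numpy as np

def regularized_bda_scatters(X_pos, Y_neg, mu=0.1, gamma=0.4):
    """Regularized/discounted BDA scatter matrices (Equations 26-27).

    mu shrinks S_P toward a multiple of the identity; gamma discounts S_NtoP.
    """
    S_NtoP, S_P = bda_scatters(X_pos, Y_neg)
    n = S_P.shape[0]                  # dimension of the original space
    I = np.eye(n)
    S_P_reg = (1 - mu) * S_P + (mu / n) * np.trace(S_P) * I
    S_NtoP_reg = (1 - gamma) * S_NtoP + (gamma / n) * np.trace(S_NtoP) * I
    return S_NtoP_reg, S_P_reg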
However, this set of (μ, γ) values was proposed without further testing; Zhou and Huang (2001) only analyzed full BDA (μ = 0, γ = 0). To take a step further, we also investigate the effect of various combinations of (μ, γ) values on the performance of BDA. We test on cropped face images consisting of 94 facial images (48 male and 48 female). We feed BDA a small number of training samples with different values of (μ, γ). We find that full BDA (μ = 0, γ = 0) can be further improved by 41.4% in terms of average error rate with a different setting (μ = 0.1, γ = 0.4). This is a promising result, and we will further investigate the regularization issue for all discriminant-based approaches in future work.

Hand Posture Recognition


In the fourth experiment, we examine KDEM on a hand gesture recognition task. The
task is to classify among 14 different hand postures, each of which represents a gesture
command model, such as navigating, pointing, grasping, etc. Our raw dataset consists
of 14,000 unlabeled hand images together with 560 labeled images (approximately 40
labeled images per hand posture), most from video of subjects making each of the hand
postures. These 560 labeled images are used to test the classifiers by calculating the
classification errors.
Hands are localized in video sequences by adaptive color segmentation, and hand
regions are cropped and converted to gray-level images. Gabor wavelet (Jain & Farrokhnia,
1991) filters with three levels and four orientations are used to extract 12 texture features.
Ten coefficients from the Fourier descriptor of the occluding contour are used to
represent hand shape. We also use area, contour length, total edge length, density, and
second moments of edge distribution, for a total of 28 low-level image features (I-feature).
For comparison, we represent images by coefficients of 22 largest principal components

of the dataset resized to 20×20 pixels (these are eigenimages, or E-features). In our experiments, we use 140 labeled images (10 for each hand posture) and 10,000 unlabeled images (randomly selected from the whole database) for training both EM and DEM.

Table 2. View-independent hand posture recognition: comparison among multilayer perceptron (MLP), nearest neighbor (NN), nearest neighbor with growing templates (NN-G), EM, linear DEM (LDEM), and KDEM. Average error rate in percentage on 560 labeled and 14,000 unlabeled hand images with 14 different hand postures.

Algorithm    MLP    NN     NN-G   EM     LDEM   KDEM
I-Feature    33.3   30.2   15.8   21.4   9.2    5.3
E-Feature    39.6   35.7   20.3   20.8   7.6    4.9
Table 2 shows the comparison. Six classification algorithms are compared in this
experiment. The multilayer perceptron (Haykin, 1999) used in this experiment has one
hidden layer of 25 nodes. We experiment with two schemes of the nearest neighbor
classifier. One uses just 140 labeled samples, and the other uses 140 labeled samples to
bootstrap the classifier by a growing scheme, in which newly labeled samples will be
added to the classifier according to their labels. The labeled and unlabeled data for both
EM and DEM are 140 and 10,000, respectively.
We observe that multilayer perceptrons are often trapped in local minima, and the nearest neighbor classifier suffers from the sparsity of the labeled templates. The poor performance of pure EM is due to the fact that the generative model does not capture the
ground-truth distribution well, since the underlying data distribution is highly complex. It is not surprising that Linear DEM (LDEM) and KDEM outperform the other methods, since the D-step optimizes the separability of the classes.

Figure 6. Data distribution in the projected subspace: (a) linear MDA; (b) kernel MDA. Different postures are more separated and clustered in the nonlinear subspace by KMDA.

Figure 7. (a) Some images correctly classified by both LDEM and KDEM; (b) images that are mislabeled by LDEM but correctly labeled by KDEM; (c) images that neither LDEM nor KDEM can correctly label.
Comparing KDEM with LDEM, we find KDEM often appears to project classes to
approximately Gaussian clusters in the transformed spaces, which facilitate their modeling with Gaussians. Figure 6 shows typical transformed data sets for linear and
nonlinear discriminant analysis, in projected 2D subspaces of three different hand
postures. Different postures are more separated and clustered in the nonlinear subspace
by KMDA. Figure 7 shows some examples of correctly classified and mislabeled hand
postures for KDEM and Linear DEM.

Fingertip Tracking
In some vision-based gesture interface systems, fingers can be used as accurate pointing input devices. Fingertip detection and tracking also play an important role in recovering hand articulations. A difficulty of the task is that fingertip motion often undergoes arbitrary rotations, which makes it hard to characterize fingertips invariantly. In the last experiment, the proposed Kernel DEM algorithm is employed to discriminate fingertips from nonfingertips.
We collected 1,000 training samples including both fingertips and nonfingertips. Nonfingertip samples are collected from the background of the working space. Some training samples are shown in Figure 8. Fifty samples for each of the two classes are manually labeled. Training images are resized to 20×20 and converted to gray-level images. Each training sample is represented by its coefficients of the 22 largest principal components. The Kernel DEM algorithm is performed on this training dataset to obtain a kernel transformation and a Bayesian classifier.


Figure 8. (a) Fingertip samples and (b) non-fingertip samples

Assume that at time t−1, the fingertip location in the image is X_{t−1}. At time t, the predicted location of the fingertip is X̃_t according to the Kalman prediction. For simplicity, the size of the search window is fixed at 10×10, centered at X̃_t. For each location in the search window, a fingertip candidate is constructed from the 20×20 image centered at that location; thus, 100 candidates are tested. A probabilistic label for each fingertip candidate is obtained by classification, and the one with the largest probability is determined as the tracked location at time t. We run the tracking algorithm on sequences containing a large amount of fingertip rotation and complex backgrounds. The tracking results are fairly accurate.
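The search-window step described above could look roughly like the sketch below (our illustration with hypothetical names; classify_prob stands in for the KDEM-trained Bayesian classifier applied to a candidate patch, e.g., to its coefficients on the 22 principal components):

import numpy as np

def track_fingertip(frame, predicted_xy, classify_prob, win=10, patch=20):
    """Pick the most probable fingertip location inside a win x win search window.

    frame: 2D gray-level image; predicted_xy: (row, col) from the Kalman predictor;
    classify_prob: function mapping a patch x patch image to P(fingertip).
    """
    r0, c0 = int(round(predicted_xy[0])), int(round(predicted_xy[1]))
    half = patch // 2
    best_prob, best_loc = -1.0, (r0, c0)
    for dr in range(-win // 2, win // 2):          # win x win = 100 candidate locations
        for dc in range(-win // 2, win // 2):
            r, c = r0 + dr, c0 + dc
            if r - half < 0 or c - half < 0:
                continue                            # candidate window falls off the image
            cand = frame[r - half:r + half, c - half:c + half]   # 20 x 20 candidate patch
            if cand.shape != (patch, patch):
                continue
            p = classify_prob(cand)
            if p > best_prob:
                best_prob, best_loc = p, (r, c)
    return best_loc, best_prob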

CONCLUSION
Two sampling schemes are proposed for efficient, kernel-based, nonlinear, multiple
discriminant analysis. These algorithms identify a representative subset of the training
samples for the purpose of classification. Benchmark tests show that KMDA with these
adaptations not only outperforms the linear MDA but also performs comparably with the
best known supervised learning algorithms. We also present a self-supervised discriminant analysis technique, Kernel DEM (KDEM), which employs both labeled and unlabeled data in training. On synthetic data and real image databases for several applications
such as image classification, hand posture recognition, and fingertip tracking, KDEM
shows superior performance over biased discriminant analysis (BDA), naïve supervised
learning and some other existing semisupervised learning algorithms.
Our future work includes several aspects: (1) We will look further into the regularization factor issue for the discriminant-based approaches on a large database; (2) We
will intelligently integrate biased discriminant analysis for small numbers of training
samples with traditional multiple discriminant analysis on large numbers of training
samples and varying numbers of classes; (3) To avoid the heavy computation over the
whole database, we will investigate schemes of selecting a representative subset of


unlabeled data whenever unlabeled data helps, and perform parametric or nonparametric
tests on the condition when it does not help; (4) Gaussian or Gaussian mixture models
are assumed for data distribution in the projected optimal subspace, even when the initial
data distribution is highly non-Gaussian. We will examine the data modeling issue more closely with Gaussian (or Gaussian mixture) and non-Gaussian distributions.

ACKNOWLEDGMENT
This work was supported in part by the National Science Foundation (NSF) under
EIA-99-75019 at the University of Illinois at Urbana-Champaign, and by the University
of Texas at San Antonio.

REFERENCES

Basri, R., Roth, D., & Jacobs, D. (1998). Clustering appearances of 3D objects. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
Belhumeur, P., Hespanha, J., & Kriegman, D. (1996). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. Proceedings of European Conference
on Computer Vision, Cambridge, UK.
CBCL Face Database #1, MIT Center for Biological and Computational Learning. Online:
http://www.ai.mit.edu/projects/cbcl
Cohen, I., Sebe, N., Cozman, F. G., Cirelo, M. C., & Huang, T. S. (2003). Learning Bayesian
network classifiers for facial expression recognition with both labeled and unlabeled data. Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Madison, WI.
Cox, I. J., Miller, M. L., Minka, T. P., & Papathomas, T. V. (2000). The Bayesian image
retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.
Cozman, F. G., & Cohen, I. (2002). Unlabeled data can degrade classification performance
of generative classifiers. Proceedings of the 15th International Florida Artificial
Intelligence Society Conference, Pensacola, FL, (pp. 327-331).
Cui, Y., & Weng, J. (1996). Hand sign recognition from intensity image sequence with
complex background. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, San Francisco (pp. 88-93).
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks. New
York: John Wiley & Sons.
Duda, R. O., Hart, P. E., & Stork, D.G. (2001). Pattern classification (2nd ed.). New York:
John Wiley & Sons.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, 7, 179-188.
Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of
Eugenics, 8, 376-386.
Friedman, J. (1989). Regularized discriminant analysis. Journal of American Statistical
Association, 84(405), 165-175.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). NJ: Prentice
Hall.
Jain, A. K., & Farrokhnia, F. (1991). Unsupervised texture segmentation using Gabor filters.
Pattern Recognition, 24(12), 1167-1186.
Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23(2), 228-233.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. (1999a). Fisher
discriminant analysis with Kernels. Proceedings of IEEE Workshop on Neural
Networks for Signal Processing.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. R. (1999b).
Invariant feature extraction and classification in kernel spaces. Proceedings of
Neural Information Processing Systems, Denver.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. R. (2003).
Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 25(5).
Mitchell, T. (1999). The role of unlabeled data in supervised learning. Proceedings of
the Sixth International Colloquium on Cognitive Science, Spain.
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from
labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.
Roth, V., & Steinhage, V. (1999). Nonlinear discriminant analysis using kernel functions.
Proceedings of Neural Information Processing Systems, Denver, CO.
Rui, Y., Huang, T. S., Ortega, M., & Mehrotra, S. (1998). Relevance feedback: A power
tool in interactive content-based image retrieval. IEEE Transactions on Circuits
and Systems for Video Technology, 8(5), 644-655.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Boston: MIT Press.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10, 1299-1319.
Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image
retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 22(12), 1349-1380.
Tian, Q., Hong, P., & Huang, T. S. (2000). Update relevant image weights for content-based image retrieval using support vector machines. Proceedings of IEEE International Conference on Multimedia and Expo, New York (vol. 2, pp. 1199-1202).
Tian, Q., Yu, J., Wu, Y., & Huang, T.S. (2004). Learning based on kernel discriminant-EM
algorithm for image classification. IEEE International Conference on Acoustics,
Speech, and Signal Processing, Montreal, Quebec, Canada.
Tieu, K., & Viola, P. (2000). Boosting image retrieval. Proceedings of IEEE International
Conference on Computer Vision and Pattern Recognition, Hilton Head, SC.
Tong, S., & Wang, E. (2001). Support vector machine active learning for image retrieval.
Proceedings of ACM International Conference on Multimedia, Ottawa, Canada
(pp. 107-118).
Vapnik, V. (2000). The nature of statistical learning theory (2nd ed.). Springer-Verlag.
Wang, L., Chan, K. L., & Zhang, Z. (2003). Bootstrapping SVM active learning by
incorporating unlabelled images for image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.

Weber, M., Welling, M., & Perona, P. (2000). Towards automatic discovery of object
categories. Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Hilton Head Island, SC.
Wolf, L., & Shashua, A. (2003). Kernel principle for classification machines with
applications to image sequence interpretation. Proceedings of IEEE International
Conference on Computer Vision and Pattern Recognition, Madison, WI.
Wu, Y., & Huang, T. S. (2001). Self-supervised learning for object recognition based on
kernel discriminant-EM algorithm. Proceedings of IEEE International Conference
on Computer Vision, Vancouver, Canada.
Wu, Y., Tian, Q., & Huang, T. S. (2000). Discriminant EM algorithm with application to
image retrieval. Proceedings of IEEE International Conference on Computer
Vision and Pattern Recognition, Hilton Head Island, SC.
Zhou, X., & Huang, T.S. (2001). Small sample learning during multimedia retrieval using
biasMap. Proceedings of IEEE International Conference on Computer Vision and
Pattern Recognition, Hawaii.

ENDNOTES
1. A term used in the kernel machine literature to denote the new space after the nonlinear transform; not to be confused with the feature space concept used in content-based image retrieval to denote the space of features or descriptors extracted from the media data.
2. The benchmark data sets are obtained from http://mlg.anu.edu.au/~raetsch/.
3. The MIT facial database can be downloaded from http://www.ai.mit.edu/projects/cbcl/software-datasets/FaceData2.html.
4. The Corel database is widely used as a benchmark in content-based image retrieval.


Section 2
Audio and Video Semantics:
Models and Standards


Chapter 4

Context-Based Interpretation and Indexing of Video Data
Ankush Mittal, IIT Roorkee, India
Cheong Loong Fah, The National University of Singapore, Singapore
Ashraf Kassim, The National University of Singapore, Singapore
Krishnan V. Pagalthivarthi, IIT Delhi, India

ABSTRACT

Most of the video retrieval systems work with a single shot without considering the
temporal context in which the shot appears. However, the meaning of a shot depends
on the context in which it is situated and a change in the order of the shots within a
scene changes the meaning of the shot. Recently, it has been shown that to find higher-level interpretations of a collection of shots (i.e., a sequence), intershot analysis is at
least as important as intrashot analysis. Several such interpretations would be
impossible without a context. Contextual characterization of video data involves
extracting patterns in the temporal behavior of features of video and mapping these
patterns to a high-level interpretation. A Dynamic Bayesian Network (DBN) framework
is designed with the temporal context of a segment of a video considered at different
granularity depending on the desired application. The novel applications of the system
include classifying a group of shots called a sequence and parsing a video program into
individual segments by building a model of the video program.

INTRODUCTION
Many pattern recognition problems cannot be handled satisfactorily in the absence
of contextual information, as the observed values under-constrain the recognition
problem leading to ambiguous interpretations. Context is hereby loosely defined as the
local domain from which observations are taken, and it often includes spatially or
temporally related measurements (Yu & Fu, 1983; Olson & Chun, 2001), though our
focus would be on the temporal aspect, that is, measurements and formation of
relationships over larger timelines. Note that our definition does not address a
contextual meaning arising from culturally determined connotations, such as a rose as
a symbol of love.
A landmark in the understanding of film perception was the Kuleshov experiments
(Kuleshov, 1974). He showed that the juxtaposition of two unrelated images would force
the viewer to find a connection between the two, and the meaning of a shot depends on
the context in which it is situated. Experiments concerning contextual details performed
by Frith and Robson (1975) showed that a film sequence has a structure that can be described
through selection rules.
In video data, each shot contains only a small amount of semantic information. A
shot is similar to a sentence in a piece of text; it consists of some semantic meaning which
may not be comprehensible in the absence of sufficient context. Actions have to be
developed sequentially; simultaneous or parallel processes are shown one after the other
in a concatenation of shots. Specific domains contain rich temporal transitional structures that help in the classification process. In sports, the events that unfold are
governed by the rules of the sport and therefore contain a recurring temporal structure.
The rules of production of videos for such applications have also been standardized. For
example, in baseball videos, there are only a few recurrent views, such as pitching, close
up, home plate, crowd and so forth (Chang & Sundaram, 2000). Similarly, for medical
videos, there is a fixed clinical procedure for capturing different video views and thus the
temporal structures are exhibited.
The sequential order of events creates a temporal context or structure. Temporal
context helps create expectancies about what may come next, and when it will happen.
In other words, temporal context may direct attention to important events as they unfold
over time.
With the assumption that there is inherent structure in most video classes,
especially in a temporal domain, we can design a suitable framework for automatic
recognition of video classes. Typically in a Content Based Retrieval (CBR) system, there
are several elements which determine the nature of the content and its meaning. The
problem can thus be stated as extracting patterns in the temporal behavior of each
variable and also in the dynamics of relationship between the variables, and mapping
these patterns to a high-level interpretation. We tackle the problem in a Dynamic
Bayesian Framework that can learn the temporal structure through the fusion of all the
features (for a tutorial, refer to Ghahramani, 1997).
The chapter is organized as follows. A brief review of related work is presented first.
Next we describe the descriptors that we used in this work to characterize the video. The
algorithms for contextual information extraction are then presented along with a strategy
for building larger video models. Then we present the overview of the DBN framework
and structure of DBN. A discussion on what needs to be learned, and the problems in

using a conventional DBN learning approach are also presented in this section. Experiments and results are then presented, followed by discussion and conclusions.

RELATED WORK
Extracting information from the spatial context has found its use in many applications, primarily in remote sensing (Jeon & Landgrebe, 1990; Kittler & Foglein, 1984),
character recognition (Kittler & Foglein, n.d.), and detection of faults and cracks (Bryson
et al., 1994). Extracting temporal information is, however, a more complicated task but it
has been shown to be important in many applications like discrete monitoring (Nicholson,
1994) and plan recognition tasks, such as tracking football players in a video (Intille &
Bobick, 1995) and traffic monitoring (Pynadath & Wellman, 1995). Contextual information
extraction has also been studied for problems such as activity recognition using
graphical models (Hamid & Huang, 2003), visual intrusion detection (Kettnaker, 2003)
and face recognition.
In an interesting analysis done by Nack and Parkes (1997), it is shown how the
editing process can be used to automatically generate short sequences of video that
realize a particular theme, say humor. Thus for extraction of indices like humor, climax and
so forth, context information is very important. A use of context for finding an important
shot in a sequence is highlighted by the work of Aigrain and Joly (1996). They detect
editing rhythm changes through the second-order regressive modeling of shot duration.
The duration of a shot, PRED(n), is predicted by a regression whose coefficients a and b are estimated over a 10-shot sliding window. The rule employed therein is that if (Tn > 2·PRED(n)) or (Tn < PRED(n)/2), then it is likely that the nth shot is an important (distinguished) shot in the sequence. This model was based solely on tapping the rhythm information through shot durations. Dorai and Venkatesh (2001) have recently proposed an algorithmic framework called computational media aesthetics for understanding the dynamic structure of the narrative via analysis of the integration and sequencing of
audio/video elements. They consider expressive elements such as tempo, rhythm and
tone. Tempo or pace is the rate of performance or delivery and it is a reflection of the speed
and time of the underlying events being portrayed and affects the overall sense of time
of a movie. They define P (n), a continuous valued pace function, as

P(n) = W(s(n)) + (m(n) − μ_m) / σ_m

where s(n) refers to shot length in frames, m(n) to motion magnitude, μ_m and σ_m to the mean and standard deviation of motion respectively, and n to the shot number. W(s(n)) is an overall two-part shot-length normalizing scheme, having the property of being more sensitive near the median shot length but decreasing in gradient as shot length increases into the longer range.
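To make the pace computation concrete, the following minimal Python sketch (not taken from Dorai and Venkatesh; the tanh-shaped window W and the example shot descriptors are illustrative assumptions) evaluates P(n) for a list of shots given their length in frames and motion magnitude.

import numpy as np

def pace(shot_lengths, motions):
    """Sketch of P(n) = W(s(n)) + (m(n) - mu_m) / sigma_m.  W is modelled here as a
    saturating normalizer around the median shot length; the actual two-part
    scheme of Dorai and Venkatesh may differ."""
    s = np.asarray(shot_lengths, dtype=float)
    m = np.asarray(motions, dtype=float)
    median_len = np.median(s)
    w = -np.tanh((s - median_len) / median_len)   # steep near the median, flat for long shots
    return w + (m - m.mean()) / (m.std() + 1e-9)

# Example: pace values for five shots.
print(pace([12, 30, 8, 120, 25], [3.1, 0.4, 5.0, 0.2, 1.1]))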
We would like to view the contextual information extraction from a more generic
perspective and consider the extraction of temporal patterns in the behavior of the
variables.


THE DESCRIPTORS
The perceptual level features considered in this work are Time-to-Collision (TTC),
shot editing, and temporal motion activity. Since our emphasis is on presenting algorithms for contextual information, they are only briefly discussed here. The interested
reader can refer to Mittal and Cheong (2001) and Mittal and Altman (2003) for more details.
TTC is the time needed for the observer to reach the object, if the instantaneous relative
velocity along the optical axis is kept unchanged (Meyer, 1994). There exists a specific
mechanism in the human visual system, designed to cause one to blink or to avoid a
looming object approaching too quickly. Video shot with a small TTC evokes fear
because it indicates a scene of impending collision. Thus, TTC can serve as a potent cue
for the characterization of an accident or violence.
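As a rough illustration of how a TTC cue can be computed (a sketch under the constant relative velocity assumption stated above, not the chapter's actual estimator): if the apparent size of an approaching object is tracked over consecutive frames, the TTC is approximately the current size divided by its rate of growth.

def ttc_from_scale(sizes, fps=25.0):
    """Time-to-collision (seconds) from an object's apparent size in consecutive
    frames: TTC ~ s(t) / (ds/dt) for a looming object; None means the object is
    not growing, i.e., no impending collision."""
    ttcs = []
    for prev, cur in zip(sizes, sizes[1:]):
        growth = (cur - prev) * fps            # ds/dt in pixels per second
        ttcs.append(cur / growth if growth > 1e-6 else None)
    return ttcs

# Example: the apparent width of a looming object over five frames.
print(ttc_from_scale([40, 44, 49, 56, 65]))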
Although complex editing cues are employed by the cameramen for making a
coherent sequence, the two most significant ones are the shot transitions and shot
pacing. The use of a particular shot transition, like dissolve, wipe, fade-in, and so forth,
can be associated with the possible intentions of the movie producer. For example,
dissolves have been typically used to bring about smoothness in the passage of time or
place. A shot depicting a close-up of a young woman followed by a dissolve to a shot
containing an old woman suggests that the young woman has become old. Similarly, shot
pacing can be adjusted accordingly for creating the desired effects, like building up of
the tension by using a fast cutting-rate. The third feature, that is, temporal motion feature,
characterizes motion via several measures such as total motion activity, distribution of
motion, local-motion/global-motion, and so forth.
Figure 1. Shots leading to climax in a movie (panels: Chase, Collision Alarm, Climax scene)

Figure 2. Feature values for the shots in Figure 1. Shots in which TTC length is not
shown correspond to a case when TTC is infinite; that is, there is no possibility of
collision.

Figure 1 shows the effectiveness of TTC in the content characterization with the
example of a climax. Before the climax of a movie, there are generally some chase scenes
leading to the meeting of the bad protagonists of a movie with the good ones. One example of such a movie is depicted in this figure, where the camera shows the scene both from the perspective of the prey and of the predator, leading to several long stretches of impending collision, as depicted in Figure 2. During the climax there is direct contact, and therefore the collision length is small and such shots are frequent. Combined with large motion and small
shot length (both relative to the context), TTC could be used to extract the climax
sequences.
Many works in the past have focused on shot classification based on single-shot
features like color, intensity variation, and so forth. We believe that since each video
class represents structured events unfolding in time, the appropriate class signatures are

also present in the temporal domain. The features such as shot-length, motion, TTC, and
so forth, were chosen to illustrate the importance of temporal information, as they are
perceptual in nature, as opposed to low-level features such as color, and so forth.

Figure 3. A hierarchy of descriptors (bottom-up): frame-level descriptors (optical flow, color statistics, frame difference); shot-level descriptors, using context within a shot (collision detection, shot transition, motion descriptors); sequence-level descriptors, using context over shots (feelings, activity, scene characteristics); and large-timeline descriptors, using context over a large timeline (semantic structure, media descriptors)

CONTEXT INFORMATION EXTRACTION


Hierarchy of Context
Depending on the level of abstraction of the descriptors, the context information
can be coarsely (i.e., large neighborhood) or finely integrated. Figure 3 shows a hierarchy
of descriptors in a bottom-up fashion where each representational level derives its
properties from the lower levels. At the lowest level of the hierarchy are properties like
color, optical flow, and so forth, which correspond only to the individual frames. They
might employ the spatial context methods that are found in the image processing
literature.
At the second lowest level are descriptors such as TTC, shot transition details, and
so forth, which derive their characteristics from a number of frames. For example, patterns
in the mean and variance of color features computed over a number of frames (i.e., context)
are used in identifying shot-transition (Yeo & Liu, 1995). Similarly, the context over
frames is used for collision detection by identifying a monotonically decreasing TTC, and
for the extraction of motion descriptors (like local-motion/global-motion).
At the next higher level are the sequence level descriptors (or indices), which might
require the information of several shots. Some examples of these descriptors could be in
terms of feelings (like interesting, horror, excitement, sad, etc.), activity (like accident,
chase, etc.) and scene characteristics (like climax, violence, newscaster, commercial,
etc.). For example, a large number of collisions typically depict a violent scene, whereas
one or two collision shots followed by a fading to a long shot typically depicts a
melancholy scene. If the context information over the neighboring shots is not considered, then these and many other distinctions would not be possible.
Finally, these indices are integrated over a large timeline to obtain semantic
structures (like for news and sports) or media descriptors. An action movie has many
scenes where violence is shown, a thriller has many climax scenes and an emotional movie
has many sentimental scenes (with close-ups, special effects, etc.). These make it
possible to perform automatic labeling as well as efficient screening of the media (for
example, for violence or for profanity).
Descriptors at different levels of the hierarchy need to be handled with different
strategies. One basic framework is presented in the following sections.

The Algorithm
Figure 4 depicts the steps in a context information extraction (mainly the first pass
over the shots). The digitized media is kept in the database and the shot transition module
segments the raw video into shots. The sequence composer groups a number of shots
based on the similarity values of the shots (Jain, Vailaya, & Wei, 1999), depending
on the selected application. For example, for applications like climax detection, the
sequences consist of a much larger number of shots than that for the sports identifier.
The feature extraction of a sequence yields an observation sequence, which
consists of feature vectors corresponding to each shot in the sequence. An appropriate
DBN Model is selected based on the application, which determines the complexity of the
mapping required. Thus, each application has its own model (although the DBN could
be made in such a way that a few applications could share a common model, but the
performance would not be optimal). During the training phase, only the sequences
corresponding to positive examples of the domain are used for learning. During the
querying or labeling phase, DBN evaluates the likelihood of the input observation
sequence belonging to the domain it represents. The sequence labeling module and the
application selector communicate with each other to define the task which needs to be
performed, with the output being a set of likelihoods. If the application involves
classifying into one of the exclusive genres, the label corresponding to the DBN model
with the maximum likelihood is assigned. In general, however, a threshold is chosen
(automatically during the training phase) over the likelihoods of a DBN model. If the
likelihood is more than the threshold for a DBN model, the corresponding label is
assigned. Thus, a sequence can have zero label or multiple labels (such as interesting,
soccer, and violent) after the first pass. Domain rules aid further classification in the
subsequent passes, the details of which are presented in the next section.

Figure 4. Steps in context information extraction (details of the first pass): the media database is segmented into shots by the shot transition detector; the sequence composer groups shots into sequences; the feature extraction module (TTC, motion, color, ...) produces an observation sequence; the application selector selects among the DBN models (Model 1, Model 2, ..., Model k), which label the sequence; the labels are passed on to the second pass
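The first pass of Figure 4 can be summarized in a few lines of Python. The sketch below is only an outline of the data flow; the shot detector, sequence composer, feature extractor, the DBN model objects (here exposing a hypothetical loglik method) and the per-model thresholds are all assumed interfaces, not the chapter's implementation.

def first_pass(video, shot_detector, compose, extract_features, models, thresholds):
    """Outline of the first pass: segment, compose sequences, extract features,
    and let every DBN model whose likelihood exceeds its threshold contribute a
    label (so a sequence may receive zero, one or several labels)."""
    shots = shot_detector(video)                        # shot transition detection
    labelled = []
    for seq in compose(shots):                          # group shots into sequences
        obs = [extract_features(shot) for shot in seq]  # one feature vector per shot
        labels = {name for name, model in models.items()
                  if model.loglik(obs) > thresholds[name]}
        labelled.append((seq, labels))
    return labelled                                     # handed to the subsequent passes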

Context Over Sequences: The Subsequent Passes


The algorithm in the subsequent passes is dependent on the application (or the
goal). The fundamental philosophy, however, is to map the sequences with their labels
onto a one-dimensional domain, and apply the algorithms relevant in the spatial context.
A similar example could be found in the character recognition task, where the classification accuracy of a character could be improved by looking at the letters both preceding
and following it (Kittler & Foglein, 1984). In this case, common arrangements of the letters
(e.g., in English: qu, ee, tion) are used to provide a context within which the character may
be interpreted.
Three examples of algorithms at this level are briefly presented in this section as
follows:
1. Probabilistic relaxation (Hummel & Zucker, 1983) is used to characterize a domain like the identification of climax, which has a lot of variation amongst the training samples. In probabilistic relaxation, each shot within the contextual neighborhood is assigned a label with a given probability (or likelihood) by the DBN models. A sliding window is chosen around the center shot. Iterations of the relaxation process update each of the labels around the central shot with respect to a
compatibility function between the labels in the contextual neighborhood. In this manner, successive iterations propagate context throughout the timeline. The compatibility function encodes the constraint relationships, such as: it is more likely to find a climax scene at a small distance from another climax scene. In other words, the probabilities of both climax scenes are increased after each pass. The relaxation process is stopped when the number of passes exceeds a fixed value or when the passes do not bring about significant changes to the probabilities.
2. Context cues can also aid in improving the classification accuracy of shots. Consider a CBR system that operates on an individual shot or a group of shots (say, 4 to 10), that is, a sequence seq_i. A sequence seq_i is classified into one of the classes (like MTV, soccer, etc.) with label_i on the basis of its low-level and high-level properties. Figure 5 shows the time-layout of sequences with their labels. Effect_i is the transition effect between seq_i and seq_{i+1}. Consider the 2M neighborhood sequences (in one dimension) of seq_i. The nature of the video domain imposes constraint relationships within the neighborhood. For example, it is generally not possible to find a tennis shot in between several soccer shots. On the other hand, a few commercial shots may certainly be present in a soccer match. Our strategy to reduce the misclassifications is to slide a window of size 2M+1 over each sequence and check the labels and effects in the neighborhood against such governing rules (a minimal sketch of this check is given after the list). If label_i and label_{i-1} are to be different (i.e., there is a change of video class), effect_i should be a special effect (and not a flat cut), and label_i should match at least a few labels ahead, that is, label_{i+1} and so on. In the second pass, consecutive items with the same labels are clustered together. A reallotment of the labels is done on the basis of rules, like the length of each clustered sequence should be greater than a minimum value. The second pass can also be used to combine unclassified clustered sequences with appropriate neighboring items. This strategy can also be used to retrieve the entire program. By classifying only the individual shots, retrieval of the entire program is a difficult task. The model can only be built for certain sections of the program. For example, it is not an easy task to model outdoor shots in News, or in-between audience shots in sports. However, through this pass scheme, parts of the same program can also be appropriately combined to form one unit.

Figure 5. Subsequent passes perform analysis on the context line of sequences: a window of 2M+1 sequences seq_{i-M}, ..., seq_i, ..., seq_{i+M} centered at seq_i, each with its label, and the transition effects between consecutive sequences
3. The goal of the subsequent pass could be to extract information about the media (such as the media being an action movie, thriller movie or an emotional movie), or to construct semantic timelines. For media descriptors like action, the pass involves counting the number of violent scenes, with due consideration to their degrees of violence, which is estimated by the likelihood generated by the DBN model. The pass also considers motion, shot length, and so forth, in the scenes classified as nonviolent. Constraint conditions are also enhanced during the pass (for instance, the shot length on average is smaller in an action movie than that in a thriller). The application of constructing semantic timelines is considered in detail in the next section.
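A minimal sketch of the windowed consistency check of item 2 above follows; the label and effect encodings, the flat-cut test and the look-ahead rule are simplifications introduced for illustration only.

def smooth_labels(labels, effects, M=2):
    """Second-pass relabelling: a label change at position i is kept only if it is
    accompanied by a special transition effect (not a flat cut) and is confirmed
    by at least one of the next M labels; otherwise the previous label is
    propagated.  effects[i] is the transition between sequence i-1 and sequence i."""
    out = list(labels)
    for i in range(1, len(out)):
        if out[i] == out[i - 1]:
            continue
        confirmed = out[i] in labels[i + 1:i + 1 + M]
        if effects[i] == 'cut' or not confirmed:
            out[i] = out[i - 1]        # treat the change as a misclassification
    return out

# Example: an isolated 'tennis' label inside a soccer match is smoothed away.
labels  = ['soccer', 'soccer', 'tennis', 'soccer', 'commercial', 'commercial']
effects = [None, 'cut', 'cut', 'cut', 'wipe', 'cut']
print(smooth_labels(labels, effects))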

Building Larger Models


How DBNs can be made to learn larger models of video is illustrated here. Consider,
for example, parsing broadcast news into different sections, and the user is provided with
a facility to browse any one of them. Examples of such queries could be "Show me the sport clip which came in the news" and "Go to the weather report."
If a video can be segmented into its scene units, the user can more conveniently
browse through that video on a scene basis rather than on a shot-by-shot basis, as is
commonly done in practice. This allows a significant reduction of information to be
conveyed or presented to the user.
Zweig (1998) has shown that a Finite State Automaton (FSA) can be modeled using
DBNs. The same idea is explored in the domain of a CBR system. The video programs
evolve through a series of distinct processes, each of which is best represented by a
separate model. When modeling these processes, it is convenient to create submodels
for each stage, and to model the entire process as a composition of atomic parts. By
factoring a complex model into a combination of simpler ones, we achieve a combinatorial
reduction in the number of models that need to be learned. Thus a probabilistic
nondeterministic FSA can be constructed as shown in Mittal and Altman (2003). In the
FSA, each of the states represents a stage of development and can be either manually
or automatically constructed with a few hours of News program.
Since most video programs begin and end with a specific video sequence (which
can be recognized), modeling the entire structure through FSA, which has explicitly
defined start and end states, is justified. The probabilistic FSA of News has transitions
on the type of shot cut (i.e., dissolve, wipe, etc.) with a probability.
The modeling of the FSA by DBN can be done as follows. The position in the FSA
at a specific time is represented by a state variable in the DBN. The DBN transition
variable encodes which arc is taken out of the FSA state at any particular time. The number
of values the transition variable assumes is equal to the maximum out-degree of any of
the states in the FSA. The transition probabilities associated with the arcs in the
automaton are reflected in the class probability tables associated with the transition
variables in the DBN.
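The encoding can be sketched as follows; the news-program states, arcs and probabilities are invented for illustration, and the tables are built with plain numpy rather than with any particular DBN library.

import numpy as np

# Hypothetical probabilistic FSA for a news program (illustrative numbers only).
states = ['intro', 'newscaster', 'report', 'weather', 'end']
arcs = {                       # state -> list of (next state, arc probability)
    'intro':      [('newscaster', 1.0)],
    'newscaster': [('report', 0.7), ('weather', 0.2), ('end', 0.1)],
    'report':     [('newscaster', 0.8), ('report', 0.2)],
    'weather':    [('end', 1.0)],
    'end':        [('end', 1.0)],
}

# The FSA position becomes the DBN state variable; a transition variable with as
# many values as the maximum out-degree selects the outgoing arc, and its class
# probability table mirrors the arc probabilities.
max_out = max(len(a) for a in arcs.values())
trans_cpt  = np.zeros((len(states), max_out))
next_state = np.zeros((len(states), max_out), dtype=int)
for i, s in enumerate(states):
    for j, (dest, p) in enumerate(arcs[s]):
        trans_cpt[i, j]  = p
        next_state[i, j] = states.index(dest)

print(trans_cpt)               # P(transition variable | state); each row sums to 1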

DYNAMIC BAYESIAN NETWORKS


DBNs (Ghahramani, 1997; Nicholson, 1994; Pavlovic, Frey, & Huang, 1999) are a
class of graphical, probabilistic models that encode dependencies among sets of random
variables evolving in time. They generalize the Hidden Markov Models (HMM) and the
Linear dynamical systems by adopting a wider range of topologies and inference
algorithms.
DBN has been used to temporally fuse heterogeneous features like face, skin and
silence detectors for tackling the problem of speaker detection (Garg et al., 2000; Pavlovic
et al., 2000). An important application which demonstrates the potential of DBN is
multivariate classification of business cycles in phases (Sondhauss & Weihs, 1999).
Modeling a DBN with enough hidden states allows the learning of the patterns of
variation shown by each feature in individual video classes (or high-level indices). It can
also establish correlations and associations between the features leading to the learning
of conditional relationships. Temporal contextual information from temporal neighbors
is conveyed to the current classification process via the class transition probabilities.
Besides the fact that DBNs were explicitly developed to model temporal domain, the other
reason for preferring DBN over time-delay neural networks or modified versions of other
learning tools like SVM, and so forth, is that DBN offers the interpretation in terms of
probability that makes it suitable to be part of a larger model.

Structuring the CBR Network

Consider an observation sequence seq_T consisting of feature vectors F_0, . . . , F_T for T + 1 shots (typically seven to 30 shots). Since multiple-label assignment should be
allowed in the domain of multimedia indexing (for example, interesting + soccer), each
video class or index is represented by a DBN model. A DBN model of a class is trained
with preclassified sequences to extract the characteristic patterns in the features. During
the inference phase, each DBN model gives a likelihood measure, and if this exceeds the
threshold for the model, the label is assigned to the sequence.
Let the set of n CBR features at time t be represented by F_t^1, . . . , F_t^n, where the feature vector F_t is part of Z_t, the set of all the observed and hidden variables at time t. The system is modeled as evolving in discrete time steps, and the DBN is a compact representation of the two-time-slice conditional probability distribution P(Z_{t+1} | Z_t).
observation model form a part of the system such that the Markov assumption and the
time-invariant assumption hold. The Markov assumption simply states that the future
is independent of the past given the present. The time-invariant assumption means that
the process is stationary, that is, P (Zt+1 | Zt) is the same for all t, which simplifies the
learning process.
A DBN can be expressed in terms of two Bayesian Networks (BN): a prior network BN_0, which specifies a distribution over the initial states, and a transition network BN→, which represents the transition probability from state Z_t to state Z_{t+1}. Although a DBN defines a distribution over infinite trajectories of states, in practice, reasoning is carried out on a finite time interval 0, . . . , T by unrolling the DBN structure into a long sequence of BNs over Z_0, . . . , Z_T. In time slice 0, the parents of Z_0 and its conditional probability distributions (CPDs) are those specified in the prior network BN_0. In time slice t + 1, the parents of Z_{t+1} and its CPDs are specified in BN→. Thus, the joint distribution over Z_0, . . . , Z_T is

P(z_0, . . . , z_T) = P_{BN_0}(z_0) · Π_{t=0}^{T−1} P_{BN→}(z_{t+1} | z_t)

Consider a DBN model λ and an observation sequence seq_T consisting of feature vectors F_1, . . . , F_m. There are three basic problems, namely inference, learning and decoding of a model, which are useful in a CBR task.

DBN Computational Tasks


The inference problem can be stated as computing the probability P(λ | seq_i) that the observation sequence is produced by the model λ. The classical solution to the DBN inference is based on the same theory as the forward-backward propagation for HMMs (Rabiner, 1989). The algorithm propagates forward messages α_t from the start of the sequence, gathering evidence along the way. A similar process is used to propagate the backward messages β_t in the reverse direction. The posterior distribution γ_t over the states at time t is simply α_t(z) β_t(z) (with suitable re-normalization). The joint posterior over the states at t and t + 1 is proportional to α_t(z_t) P_{BN→}(z_{t+1} | z_t) β_{t+1}(z_{t+1}) P_{BN→}(F_{t+1} | z_{t+1}).
The learning algorithm for dynamic Bayesian networks follows from the EM
algorithm. The goal of sequence decoding in a DBN is to find the most likely state sequence of hidden variables given the observations, that is, X*_T = arg max_{X_T} P(X_T | seq_T). This
task can be achieved by using the Viterbi algorithm (Viterbi, 1967) based on dynamic
programming. Decoding attempts to uncover the hidden part of the model and outputs
the state sequence that best explains the observations. The previous section presents
an application to parse a video program where each state corresponds to a segment of
the program.
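For a single-chain model with discrete hidden states and discrete observations, the likelihood used for labelling can be accumulated with the forward pass as sketched below; the chapter's models also use continuous features, and the toy numbers are illustrative only.

import numpy as np

def forward_loglik(pi, A, B, obs):
    """Forward pass: pi[z] is the prior from BN_0, A[z, z'] the transition
    probability from the transition network, and B[z, o] the observation
    probability.  Returns log P(obs | model), with per-step rescaling for
    numerical stability."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Toy example: two hidden states, three observation symbols.
pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(forward_loglik(pi, A, B, [0, 0, 2, 1]))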

What Can Be Learned?


An important question that can be raised is this: Which patterns can be learned and which cannot? Extracting temporal information is complicated. Some of the aspects of the temporal information that we attempt to model are:
1. Temporal ordering. Many scene events have precedent-antecedent relationships. For example, as discussed in the sports domain, shots depicting interesting events are generally followed by a replay. These replays can be detected in a domain-independent manner and the shots preceding them can be retrieved, which would typically be the highlights of the sport.
2. Logical constraints. These could be of the form that event A occurs either before event B or event C, or later than event D.
3. Time duration. The time an observation sequence lasts is also learned. For example, a climax scene is not expected to last over hundreds of shots. Thus, a very long sequence having feature values (like a large number of TTC shots and small shot-lengths) similar to the climax should be classified as nonclimax.
4. Association between variables. For instance, during the building up of a climax, the TTC measure should continuously decrease while the motion should increase.

The Problems in Learning


Although DBN can simultaneously work with continuous variables and discrete
variables, some characteristic effects, which do not occur often, might not be properly
modeled by DBN learning. A case in point is the presence of a black frame at the end of
every commercial. Since it is just a spike in a time-frame, DBN associates more meaning
to its absence and treats the presence at the end of sequence as noise. Therefore, we
included a binary variable: presence of a black frame, which takes the value 1 for every
frame if at the end of a sequence there is a black frame.
While DBN discovers the temporal correlations of the variables very well, it does
not support the modeling of long-range dependencies and aggregate influences from
variables evolving at different speeds, which could be a frequent occurrence in a video
domain. A suggested solution to this problem is to explicitly search for violations of the Markov property at widely varying time granularities (Boyen et al., 1999).
Another crucial point to be considered in the design of DBN is the number of hidden
states. Insufficient numbers of hidden states would not model all the variables, while an
excessive number would generally lead to overlearning. Since, for each video class a
separate DBN is employed, the optimization of the number of hidden states is done by a trial-and-test method so as to achieve the best performance on the training data. Since feature extraction on video is very inefficient (three frames/second!), we do not have enough data to learn the structure of the DBN at present for our application. We assumed a simple DBN structure as shown in Figure 6.

Figure 6. DBN architecture for the CBR system: a chain of hidden state variables linked by the state evolution model, each emitting a per-shot feature vector through the observation model; the feature vector consists of black frame, shot length, cut type, length of transition, presence of TTC, length of TTC and frame difference. The black frame feature has value 1 if the shot ends with a black frame and 0 otherwise; it is especially relevant in the detection of commercials.
Another characteristic of the DBN learning is that the features which are not
relevant to the class have their estimated probability density functions spread out, which
can easily be noticed and these features can be removed. For example, there is no
significance of the shot-transition length in cricket sequences, and thus this feature can
be removed.
Like most learning tools, DBN also requires the presence of a few representative training sequences of the model to help extract the rules. For example, just by training on the interesting sequences of cricket, the interesting sequences of badminton cannot be extracted. This is because the parameter learning is not generalized enough. DBN specifically learns the shot length characteristic of cricket, along with the length of
the special effect (i.e., wipe) used to indicate replay. On the other hand, a human expert
would probably be in a position to generalize these rules.
Of course, this learning strategy of DBN has its own advantages. Consider for
example, a sequence of cricket shots (which is noninteresting) having two wipe transitions, where the transitions are separated by many flat-cut shots. Since DBN is trained to
expect only one to three shots in between two-wipe transitions, it would not recognize
this as an interesting shot.
Another practical problem that we faced was to decide the optimum length of shots
for DBN learning. If some redundant shots are taken, DBN fails to learn the pattern. For
example, in the climax application, if we train the DBN only with the climax scene, it can perform the classification with good accuracy. However, if we increase the number of shots, the classification accuracy drops. This implies that the training samples should have just enough shots to model or characterize the pattern (which at present requires
human input).
Modeling video programs is usually tough, as they cannot be mapped to templates except in exceptional cases. The number of factors that contribute toward variations in the parameters is large, and thus the stochastic learning of DBN is highly suitable. We therefore consider, in the next section, subsequent passes which take the initial probability assigned by the DBN and try to improve the classification performance based on the task at hand.

EXPERIMENTS AND APPLICATIONS


For the purpose of experimentation, we recorded sequences from TV using a VCR
and grabbed shots in MPEG format. The size of the database was around four hours 30
minutes (details are given in Table 1) from video sequences of different categories. The
frame dimension was 352 × 288.
The principal objective of these experiments is not to demonstrate the computational feasibility of some of the algorithms (for example, TTC or shot detection). Rather
we want to demonstrate that the constraints/laws derived from the structure or patterns
of interaction between producers and viewers are valid. A few applications are considered to demonstrate the working and effectiveness of the present work.
In order to highlight the contribution of the perceptual-level features and the
contextual information discussed in this chapter, other features like color, texture, shape,
and so forth, are not employed. The set of features employed is shown in Figure 6. Many
works in the past (including ours (Mittal & Cheong, 2003)) have focused on shot
classification based on single-shot features like color, intensity variation, and so forth.
We believe that since each video class represents structured events unfolding in time,
the appropriate class signatures are also present in the temporal domain, in features such as shot-length, motion, TTC, and so forth. For example, though the plots of log shot-length of
news, cricket and soccer were very similar, the frequent presence of wipe and high-motion
distinguishes cricket and soccer from the news.

Sequence Classifier
As discussed before, the problem of sequence classification is to assign labels from
one or more of the classes to the observation sequence seq_T consisting of feature vectors F_0, . . . , F_T. Table 2 shows the classification performance of the DBN models for six video
classes. The experiment was conducted by training each DBN model with preclassified
sequences and testing with 30 unclassified sequences of the same class and 50 sequences of other classes.

Table 1. Video database used for experimentation

Type                                                           Duration (min:sec)
Sports                                                         63:50
MTV                                                            12:10
Commercial                                                     17:13
Movie                                                          27:25
BBC News                                                       64:54
News from Singapore TCS Channel 5 (with 2 commercial breaks)   87:50
Total                                                          4 hr 33 min 22 sec

The DBN models for different classes had a different value
of T, which was based on optimization performed during the training phase. The number
of shots for DBN model of news was 5 shots, of soccer 8 shots, and of commercials 20
shots. The news class consists of all the segments, that is, newscaster shots, outdoor
shots, and so forth. The commercial sequences all have a black frame at the end.
There are two paradigms of training the DBN: (i) Through a fixed number of shots
(around seven to 10), and (ii) through a fixed number of frames (400 to 700 frames). The
first paradigm works better than the second one as is clear from the recall and precision
rates. This could be due to two reasons: first, having a fixed number of frames does not
yield proper models for classes, and second, the DBN output is in terms of likelihood,
which decreases as there are a larger number of shots in a sequence. A large number of shots implies more state transitions, and since each transition has a probability of less than one, the overall likelihood decreases. Thus, classes with longer shot lengths are favored in the second paradigm, leading to misclassifications.
In general, DBN modeling gave good results for all the classes except soccer. It is
interesting to note that commercials and news were detected with very high precision
because of the presence of the characteristic black frame and the absence of high-motion,
respectively. A large number of fade-in and fade-out effects are present in the MTV and
commercial classes, and dissolves and wipes are frequently present in sports. The black
frame feature also prevents the MTV class being classified as a commercial, though both
of them have similar shot lengths and shot transitions.
The poor performance of the soccer class could be explained by the fact that the
standard deviations of the DBN parameters corresponding to motion or shot length
features were large, signifying that features such as shot length characterized the soccer class poorly. On the other hand, cricket was much more structured, especially with a large number of wipes.

Table 2. Performance of the sequence classifier: misclassifications (out of 30) and false alarms (out of 10 sequences per class) for Commercial (CO), Cricket (CR), MTV (MT), News (NE), Soccer (SO) and Tennis (TE) under the fixed-number-of-shots and fixed-number-of-frames training paradigms; overall precision and recall are 83.9% and 76.7% for the fixed-number-of-shots paradigm versus 68.3% and 66.5% for the fixed-number-of-frames paradigm

Highlight Extraction
In replays, the following conventions are typically used:
1. Replays are bounded by a pair of identical gradual transitions, which can either be a pair of wipes or a pair of dissolves.
2. Cuts and dissolves are the only transition types allowed between two successive shots during a replay. Figure 7 shows two interesting scenes from the soccer and
cricket videos. Figure 7(a) shows a soccer match in which the player touched the
football with his arm. A cut to a close-up of the referee showing a yellow card is
used. A wipe is employed in the beginning of the replay, showing how the player
touched the ball. A dissolve is used for showing the expression of the manager of
the team, followed by a wipe to indicate the end of the replay. Finally, a close-up
view of the player who received the yellow card is shown.

For the purpose of experiments, only two sports, cricket and soccer, were considered, although the same idea can be extended to most of the sports, if not all. This is
because the interesting scene analysis is based on detecting the replays, the structure
of which is the same for most sports. Below is a typical format of the transition effects
around an interesting shot in cricket and soccer.

E^α · C · G1 · E^β · G2 · E^γ

where
E ∈ {Cut, Dissolve}
C ∈ {Cut}
G1, G2 ∈ {Wipe, Dissolve}; G1 and G2 are the gradual transitions before and after the replay
D ∈ {Dissolve}
α, β, γ ∈ {natural numbers}
· is a "followed by" operator.
Figure 7. Typical formats of the transition effects used in the replay sequences of (a)
soccer match (b) cricket match

This special structure of replay is used for identifying interesting shots. The
training sequence for both sports, which was typically 5 to 7 shots, consisted of
interesting shots followed by the replays. Figure 8 shows the classification performance
of DBN for highlight extraction. A threshold could be chosen on the likelihood returned
by DBN based on training data (such that the threshold is less than the likelihood of most
of the training sequences). All sequences possessing a likelihood more than the
threshold of a DBN model are assigned an interesting class label. Figure 8 shows that
only two misclassifications result with the testing. One recommendation could be to
lower the threshold such that no interesting sequences are missed; although a few
uninteresting sequences may also be labeled and retrieved to the user. Once an
interesting scene is identified, the replays are removed before presenting it to the user.
An interesting detail in this application is that one might want to show the
scoreboard during the extraction of highlights (scoreboards are generally preceded and
followed by many dissolves). DBN can encode in its learning both the previous model
of the shot followed by a replay and the scoreboard sequences with their characteristics.

Climax Characterization and Censoring Violence


Climax is the culmination of a gradual building up of tempo of events, such as chase
scenes, which end up typically in violence or in passive events. Generally, a movie with
a lot of climax scenes is classified as a thriller, while a movie which has more violent scenes is classified as an action movie. Sequences from thriller or action movies could be
used to train a DBN model, which can learn about the climax or violence structures easily.
This application can be used to present a trailer to the user, or to classify the movie into
different media categories and restrict the presentation of media to only acceptable
scenes.
Figure 8. Classification performance for highlight extraction: log-likelihood of each cricket test sequence plotted against a decision threshold, with interesting, uninteresting and misclassified sequences marked

Over the last two decades, a large body of literature has linked the exposure to
violent television with increased physical aggressiveness among children and violent
criminal behavior (Kopel, 1995, p. 17; Centerwall, 1989). A modeling of the censoring
process can be done to restrict access to media containing violence. Motion alone is
insufficient to characterize violence, as many acceptable classes (especially sports like
car racing, etc.) also possess high motion. On the other hand, shot-length and especially
TTC are highly relevant due to the fact that the camera is generally at a short distance
during violence (thus the movements of the actors yield an impression of impending collisions). The cutting rate is also high, as many perspectives are generally covered. The set of features used, though, remains the same as in the sequence classifier application.
Figure 10 shows a few images from a violent scene of a movie.
Figure 9 shows the classification performance for the censoring application. For each sequence in the training and test database, the opinions of three judges were sought in terms of one of three categories: "violent", "nonviolent" and "cannot say". Majority voting was used to decide if a sequence is violent. The test samples were from sports, MTV, commercials and action movies. Figure 9 shows that while the violent scenes were correctly detected, two MTV sequences were misclassified as violent. For one of these MTV sequences, the opinion of two judges was "cannot say", while the third one classified it as violent (Figure 11). The other sequence consisted of too many objects near the camera with a large cutting rate, although it should be classified as nonviolent.

Figure 9. Classification performance for violence detection: log-likelihood of each test sequence plotted against a decision threshold, with misclassified sequences marked

Figure 10. A violent scene that needs to be censored

DISCUSSION AND CONCLUSION


In this chapter, the integration of temporal context information and the modeling of
this information through the DBN framework were considered. Modeling through DBN
removes the cumbersome task of manually designing a rule-based system. The design
of such a rule-based system would have to be based on the low-level details, such as
thresholding; besides, many temporal structures are difficult to observe but could be
extracted by automatic learning approaches. DBN assignment of the initial labels on the
data prepares it for the subsequent passes, where expert knowledge could be used
without much difficulty.
The experiments conducted in this chapter employed a few perceptual-level features. The temporal properties of such features are more readily understood than low-level features. Though the inclusion of low-level features could enhance the characterization of the categories, it would raise the important issue of dealing with the high dimensionality of the feature space in the temporal domain. Thus, the extraction of information, which involves learning, decoding and inference, would require stronger and more efficient algorithms.

Figure 11. An MTV scene which was misclassified as violent. There are many moving objects near the camera and the cutting rate is high.

REFERENCES
Aigrain, P., & Joly, P. (1996). Medium knowledge-based macro-segmentation of video
into sequences. Intelligent Multimedia Information Retrieval.
Boyen, X., Friedman, N., & Koller, D. (1999). Discovering the hidden structure of complex dynamic systems. Proceedings of Uncertainty in Artificial Intelligence (pp. 91-100).
Bryson, N., Dixon, R.N., Hunter, J.J., & Taylor, C. (1994). Contextual classification of
cracks. Image and vision computing, 12, 149-154.
Centerwall, B. (1989). Exposure to television as a risk factor for violence. Journal of Epidemiology, 643-652.
Chang, S. F., & Sundaram, H. (2000). Structural and semantic analysis of video. IEEE
International Conference on Multimedia and Expo (pp. 687-690).
Dorai, C., & Venkatesh, S. (2001). Bridging the semantic gap in content management
systems: Computational media aesthetics. International Conference on Computational Semiotics in Games and New Media (pp. 94-99).
Frith, U., & Robson, J. E. (1975). Perceiving the language of film. Perception, 4, 97-103.
Garg, A., Pavlovic, V., & Rehg, J. M. (2000). Audio-visual speaker detection using
Dynamic Bayesian networks. IEEE Conference on Automatic Face and Gesture
Recognition (pp. 384-390).
Ghahramani, Z. (1997). Learning dynamic Bayesian networks. Adaptive Processing of
Temporal Information. Lecture Notes in AI. SpringerVerlag.
Hamid, I. E., & Huang, Yan. (2003). Argmode activity recognition using graphical models.
IEEE CVPR Workshop on Event Mining: Detection and Recognition of Events in
Video (pp. 1-7).
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling
processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5,
267-287.
Intille, S. S., & Bobick, A. F. (1995). Closed-world tracking. IEEE International Conference on Computer Vision (pp. 672-678).
Jain, A. K., Vailaya, A., & Wei, X. (1999). Query by video clip. Multimedia Systems, 369-384.
Jeon, B., & Landgrebe, D. A. (1990). Spatio-temporal contextual classification of remotely
sensed multispectral data. IEEE International Conference on Systems, Man and
Cybernetics (pp. 342-344).
Kettnaker, V. M. (2003). Time-dependent HMMs for visual intrusion detection. IEEE
CVPR Workshop on Event Mining: Detection and Recognition of Events in Video.
Kittler, J., & Foglein, J. (1984). Contextual classification of multispectral pixel data. Image
and Vision Computing, 2, 13-29.
Kopel, D. B. (1995). Massaging the medium: Analyzing and responding to media violence
without harming the first. Kansas Journal of Law and Public Policy, 4, 17.
Kuleshov, L. (1974). Kuleshov on film: Writing of Lev Kuleshov. Berkeley, CA: University of California Press.
Meyer, F. G. (1994). Time-to-collision from first-order models of the motion fields. IEEE
Transactions of Robotics and Automation (pp. 792-798).
Mittal, A., & Altman, E. (2003). Contextual information extraction for video data. The 9th
International Conference on Multimedia Modeling (MMM), Taiwan (pp. 209-223).
Mittal, A., & Cheong, L.-F. (2001). Dynamic Bayesian framework for extracting temporal
structure in video. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2, 110-115.
Mittal, A., & Cheong, L.-F. (2003). Framework for synthesizing semantic-level indices.
Journal of Multimedia Tools and Application, 135-158.
Nack, F., & Parkes, A. (1997). The application of video semantics and theme representation in automated video editing. Multimedia Tools and Applications, 57-83.
Nicholson, A. (1994). Dynamic belief networks for discrete monitoring. IEEE Transactions on Systems, Man, and Cybernetics, 24(11), 1593-1610.
Olson, I. R., & Chun, M. M. (2001). Temporal contextual cueing of visual attention.
Journal of Experimental Psychology: Learning, Memory, and Cognition.
Pavlovic, V., Frey, B., & Huang, T. (1999). Time-series classification using mixed-state
dynamic Bayesian networks. IEEE Conference on Computer Vision and Pattern
Recognition (pp. 609-615).
Pavlovic, V., Garg, A., Rehg, J., & Huang, T. (2000). Multimodal speaker detection using
error feedback dynamic Bayesian networks. IEEE Conference on Computer Vision
and Pattern Recognition.
Pynadath, D. V., & Wellman, M. P. (1995). Accounting for context in plan recognition with
application to traffic monitoring. International Conference on Artificial Intelligence, 11.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE (vol. 77, pp. 257-286).
Sondhauss, U., & Weihs, C. (1999). Dynamic Bayesian networks for classification of business cycles. SFB Technical Report No. 17. Online at http://www.statistik.uni-dortmund.de/
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal
decoding algorithm. IEEE Transactions on Information Theory (pp. 260-269).
Yeo, B. L., & Liu, B.(1995). Rapid scene analysis on compressed video. IEEE Transactions on Circuits, Systems, and Video Technology (pp. 533-544).
Yu, T. S., & Fu, K. S. (1983). Recursive contextual classification using a spatial stochastic
model. Pattern Recognition, 16, 89-108.
Zweig, G. G. (1998). Speech recognition with dynamic Bayesian networks. PhD thesis,
Dept. of Computer Science, University of California, Berkeley.


Chapter 5

Content-Based Music
Summarization
and Classification
Changsheng Xu, Institute for Infocomm Research, Singapore
Xi Shao, Institute for Infocomm Research, Singapore
Namunu C. Maddage, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Qi Tian, Institute for Infocomm Research, Singapore

ABSTRACT

This chapter aims to provide a comprehensive survey of the technical achievements in


the area of content-based music summarization and classification and to present our
recent achievements. In order to give a full picture of the current status, the chapter
covers the aspects of music summarization in compressed domain and uncompressed
domain, music video summarization, music genre classification, and semantic region
detection in acoustical music signals. By reviewing the current technologies and the
demands from practical applications in music summarization and classification, the
chapter identifies the directions for future research.

INTRODUCTION
Recent advances in computing, networking, and multimedia technologies have
resulted in a tremendous growth of music-related data and accelerated the need to
analyse and understand the music content. Music representation is multidimensional
and time-dependent. How to effectively organize and process such a large variety and quantity of music information to allow efficient browsing, searching and retrieving has been an active research area in recent years. Audio content analysis, especially music content understanding, poses a big challenge for those who need to organize and structure music data. The difficulty arises in converting the featureless collections of raw music data to suitable forms that would allow tools to automatically segment, classify, summarize, search and retrieve large databases. The research community is now at the point where the limitations and properties of developed methods are well understood and used to provide and create more advanced techniques tailored to user needs and able to better bridge the semantic gap between the current audio/music technologies and the semantic needs of interactive media applications.
The aim of this chapter is to provide a comprehensive survey of the technical
achievements in the area of content-based music summarization and classification and
to present our recent achievements. The next section introduces music representation
and feature extraction. Music summarization and music genre classification are presented
in details in the two sections, respectively. Semantic region detection in acoustical music
signals is described in the fifth section. Finally, the last section gives the concluding
remarks and discusses future research directions.

MUSIC REPRESENTATION
AND FEATURE EXTRACTION
Feature extraction is the first step of content-based music analysis. There are many
features that can be used to characterize the music signal. Generally speaking, these
features can be divided into three categories: timbral textural features, rhythmic content
features and pitch content features.

Timbral Textural Features


Timbral textural features are used to differentiate mixture of sounds that may have
the same or similar rhythmic and pitch contents. The use of these features originates from
music-speech discrimination and speech recognition. The calculated features are based
on the short-time Fourier transform (STFT) and are calculated for each frame.

Amplitude Envelope
The amplitude envelope describes the energy change of the signal in the time
domain and is generally equivalent to the so-called ADSR (attack, decay, sustain and
release) of a music song. The envelope of the signal is computed with a frame-by-frame
root mean square (RMS) and a third-order Butterworth low-pass filter (Ellis, 1994). RMS is a perceptually relevant measure and has been shown to correspond closely to the way we hear loudness. The length of the RMS frame determines the time resolution of the
envelope. A large frame length yields low transient information and a small frame length
yields greater transient energy.
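A minimal sketch of the envelope computation described above follows; the frame length and filter cutoff are illustrative choices, not values prescribed by the chapter.

import numpy as np
from scipy.signal import butter, lfilter

def amplitude_envelope(signal, sr, frame_len=1024, cutoff_hz=5.0):
    """Frame-by-frame RMS followed by a third-order Butterworth low-pass filter."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    rms = np.sqrt(np.mean(frames ** 2, axis=1))       # one RMS value per frame
    frame_rate = sr / frame_len                       # sampling rate of the envelope
    b, a = butter(3, cutoff_hz / (frame_rate / 2.0), btype='low')
    return lfilter(b, a, rms)

# Example: a 3-second tone whose loudness rises and decays.
sr = 22050
t = np.arange(3 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) * np.exp(-((t - 1.0) ** 2) / 0.3)
print(amplitude_envelope(x, sr)[:5])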

Spectral Power

For a music signal s(n), each frame is weighted with a Hanning window h(n):

h(n) = (1/2) [1 − cos(2πn/N)]     (1)

where N is the number of samples of each frame. The spectral power of the signal s(n) is calculated as

S(k) = 10 log_10 | (1/N) Σ_{n=0}^{N−1} s(n) h(n) exp(−j2πnk/N) |²     (2)

Spectral Centroid
The spectral centroid (Tzanetakis & Cook, 2002) is defined as the centre of gravity
of spectrum magnitude in STFT.
C_t = Σ_{n=1}^{N} M_t(n) · n / Σ_{n=1}^{N} M_t(n)     (3)

where M_t(n) is the magnitude of the FFT spectrum at the t-th frame and frequency bin n, and N is the number of frequency bins. The spectral centroid is a measure of spectral shape. Higher centroid values correspond to brighter textures with more high frequencies.

Spectrum Rolloff
Spectrum Rolloff (Tzanetakis & Cook, 2002) is the frequency below which 85% of
spectrum distribution is concentrated. It is also a measure of the spectral shape.

Spectrum Flux
Spectrum flux (Tzanetakis & Cook, 2002) is defined as the variation value of
spectrum between two adjacent frames.

SF = Σ_f ( N_t(f) − N_{t−1}(f) )²     (4)

where N_t(f) and N_{t−1}(f) are the normalized magnitudes of the FFT at the current frame t and the previous frame t−1, respectively. Spectrum flux is a measure of the amount of local
spectral changes.
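The three shape features above can be computed from a short-time Fourier transform in a few lines; the sketch below uses Hanning-windowed frames and sum-normalized magnitudes, and the squared-difference form of the flux follows equation (4) (frame and hop sizes are illustrative assumptions).

import numpy as np

def frame_spectra(x, frame_len=1024, hop=512):
    """Magnitude spectra of Hanning-windowed frames (one row per frame)."""
    win = np.hanning(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_centroid(mag):                 # equation (3)
    bins = np.arange(1, mag.shape[1] + 1)
    return (mag * bins).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

def spectral_rolloff(mag, pct=0.85):        # bin below which 85% of the spectrum lies
    cum = np.cumsum(mag, axis=1)
    return np.argmax(cum >= pct * cum[:, -1:], axis=1)

def spectral_flux(mag):                     # equation (4)
    n = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    return np.sum(np.diff(n, axis=0) ** 2, axis=1)

# Example on one second of noise.
mag = frame_spectra(np.random.randn(22050))
print(spectral_centroid(mag)[:3], spectral_rolloff(mag)[:3], spectral_flux(mag)[:3])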

Cepstrum
The mel-frequency cepstra has proven to be highly effective in automatic speech
recognition and in modeling the subjective pitch and frequency content of audio signals.
Psychophysical studies have found the phenomena of the mel pitch scale and the critical
band, and the frequency scale-warping to the mel scale has led to the cepstrum domain
representation.
The cepstrum can be illustrated by means of the Mel-Frequency Cepstral Coefficients (MFCCs). These are computed from the FFT power coefficients (Logan & Chu,
2000). The power coefficients are filtered by a triangular bandpass filter. The filter
consists of K triangular banks. They have a constant mel-frequency interval and cover
the frequency range of 0-4000 Hz. Denoting the output of the filter bank by S_k (k = 1, 2, ..., K), where K is typically set to 19 for speech recognition while a value higher than 19 is used for music signals because music signals have a wider spectrum than speech signals, the MFCCs are calculated as

c_n = sqrt(2/K) Σ_{k=1}^{K} (log S_k) cos[ n (k − 0.5) π / K ],   n = 1, 2, ..., L     (5)

where L is the order of the cepstrum. Figure 1 illustrates the estimation procedure of
MFCC.
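A compact sketch of the pipeline of Figure 1 and equation (5) is given below; the triangular mel filter-bank construction is a common textbook variant, and the sampling rate, FFT size and filter count are assumptions rather than the chapter's exact settings.

import numpy as np

def mel_filterbank(K=20, n_fft=1024, sr=16000, fmax=4000.0):
    """K triangular filters equally spaced on the mel scale over 0..fmax Hz."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(fmax), K + 2))        # K+2 band edges in Hz
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((K, n_fft // 2 + 1))
    for k in range(1, K + 1):
        lo, mid, hi = bins[k - 1], bins[k], bins[k + 1]
        fb[k - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[k - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

def mfcc(frame, fb, L=12):
    """MFCCs via equation (5): c_n = sqrt(2/K) sum_k (log S_k) cos[n (k - 0.5) pi / K]."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    S = fb @ power + 1e-12                                     # filter-bank outputs S_k
    K = len(S)
    n = np.arange(1, L + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.sqrt(2.0 / K) * (np.log(S)[None, :] * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)

# Example: MFCCs of one 1024-sample frame of a 440 Hz tone sampled at 16 kHz.
frame = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)
print(mfcc(frame, mel_filterbank())[:4])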

Zero Crossing Rates


Zero crossing rates are usually suitable for narrowband signals (Deller et al., 2000),
but music signals include both narrowband and broadband components. Therefore, the
short time zero crossing rates can be used to characterize music signals. The N-length
short time zero crossing rates are defined as

Z_s(m) = (1/N) Σ_{n=m−N+1}^{m} | sgn{s(n)} − sgn{s(n−1)} | · w(m−n)     (6)

where w(m) is a rectangular window.

Low Energy Component


Low energy component (LEC) (Wold et al., 1996) is the percentage of frames that
have energy less than the average energy over the whole signal. It measures the amplitude distribution of the signal.
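Both of these amplitude-domain features are straightforward to compute; in the sketch below the zero crossing rate is normalized to crossings per sample, and the frame length is an illustrative choice.

import numpy as np

def short_time_zcr(x, frame_len=1024):
    """Short-time zero crossing rate per frame (rectangular window, as in
    equation (6)), normalized here to crossings per sample."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2.0, axis=1)

def low_energy_component(x, frame_len=1024):
    """Percentage of frames whose energy is below the average frame energy."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.mean(frames ** 2, axis=1)
    return 100.0 * np.mean(energy < energy.mean())

x = np.random.randn(22050)          # one second of noise as a stand-in signal
print(short_time_zcr(x)[:3], low_energy_component(x))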

Figure 1. Estimation procedure of MFCC (digital signal → FFT → mel-scale filter bank → sum → log → DCT → MFCC)


Figure 2. Estimation procedure of octave-based spectral contrast (digital signal → FFT → octave-scale filter bank → peak/valley selection → log → Karhunen-Loeve transform → spectral contrast)

Spectral Contrast Feature


In Jiang et al. (2002), an octave-based spectral contrast feature was proposed to
represent the spectral characteristics of a music clip. These are computed from the FFT
power coefficients. The power coefficients are filtered by a triangular bandpass filter, that
is, an octave scale filter. Then the spectral peaks, valleys, and their differences in each
subband are extracted. The final step for spectral contrast is to use a Karhumen-Loeve
transform to eliminate relativity. Figure 2 illustrates the estimation procedure of octavebased spectral contrast.

Rhythmic Content Features


Rhythmic content features characterize the movement of music signals over time
and contain information such as the regularity of the rhythm, beat, tempo, and time
signature. The feature set for representing rhythm structure is usually extracted from the
beat histogram. Tzanetakis (Tzanetakis et al., 2001) used a beat histogram built from the
autocorrelation function of the signal to extract rhythmic content features. The time-domain
amplitude envelope of each band is extracted by decomposing the music signal
into a number of octave frequency bands. Then, the envelopes of the bands are summed
together, followed by autocorrelation of the resulting sum envelope. The dominant peaks
of the autocorrelation function, corresponding to the various periodicities of the signal's
envelope, are accumulated over the whole music source into a beat histogram where
each bin corresponds to the peak lag. Figure 3 illustrates the procedure of constructing
After the beat histogram is created, the rhythmic content features are extracted from
it. Generally, they include the amplitudes of the first and second
histogram peaks, the ratio of the amplitude of the second peak to that of the first,
the periods of the first and second peaks, the overall sum of the histogram, and so forth.
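The sketch below is a heavily reduced version of this idea: it autocorrelates a single amplitude envelope (the octave-band decomposition and summation are assumed to have been done already) and accumulates a few dominant autocorrelation peaks, indexed by lag, into a histogram. The tempo range, the number of peaks and the peak-suppression width are arbitrary choices for illustration.

import numpy as np

def beat_histogram(envelope, env_rate, min_bpm=40, max_bpm=200, n_peaks=3):
    """Accumulate dominant autocorrelation peaks of an amplitude envelope into a beat histogram."""
    env = np.asarray(envelope, dtype=float)
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]   # non-negative lags only
    lo = int(env_rate * 60.0 / max_bpm)                       # shortest plausible beat period (samples)
    hi = int(env_rate * 60.0 / min_bpm)                       # longest plausible beat period (samples)
    hist = np.zeros(hi + 1)
    search = ac[lo:hi].copy()
    for _ in range(n_peaks):
        k = int(np.argmax(search))
        hist[lo + k] += search[k]                             # bin indexed by the peak lag
        search[max(0, k - 2): k + 3] = 0.0                    # suppress the local neighbourhood
    return hist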

Pitch Content Features


Pitch is a perceptual term which can be approximated by fundamental frequency.
The pitch content features describe the melody and harmony information about music
signals, and a pitch content feature set is extracted based on various multipitch detection
techniques. More specifically, the multipitch detection algorithm described in Tolonen
and Karjalainen (2000) can be used to estimate the pitch. In this algorithm, the signal is
decomposed into two frequency bands and an amplitude envelope is extracted for each
frequency band. The envelopes are summed and an enhanced autocorrelation function
is computed so that the effect of integer multiples of the peak frequencies on multiple
pitch detection is reduced. The prominent peaks of this summary enhanced autocorrelation
function correspond to the main pitches for that short segment of sound and are
accumulated into pitch histograms. Then the pitch content features can be extracted from
the pitch histograms.

Figure 3. Procedure to construct a beat histogram from music signals

MUSIC SUMMARIZATION
The creation of a concise and informative extraction that accurately summarizes
original digital content is extremely important in a large-scale information repository.
Currently, the majority of summaries used commercially are manually produced from the
original content. For example, a movie clip may provide a good preview of the movie.
However, as a large volume of digital content has become publicly available on the
Internet and in other physical storage media during recent years, automatic summarization has become increasingly important and necessary.
There are a number of techniques being proposed and developed to automatically
generate summaries from text (Mani & Maybury, 1999), speech (Hori & Furui, 2000) and
video (Gong et al., 2001). Similar to text, speech and video summarization, music
summarization refers to determining the most common and salient themes of a given music
piece that may be used to represent the music and are readily recognizable by a listener.
Automatic music summarization can be applied to music indexing, content-based music
retrieval and web-based music distribution.
A summarization system for MIDI data has been developed (Kraft et al., 2001). It
uses the repetition nature of MIDI compositions to automatically recognize the main
melody theme segment for a given piece of music. A detection engine converts melody
recognition and music summarization to string processing and provides efficient ways
of retrieval and manipulation. The system recognizes maximal length segments that have
nontrivial repetitions in each track of the MIDI data of music pieces. These segments are
treated as basic units in music composition, and are the candidates for the melody in a
music piece. However, the MIDI format is not sampled audio data (i.e., actual audio sounds);
instead, it contains synthesizer instructions, or MIDI notes, to reproduce audio.
Compared with actual audio sounds, MIDI data cannot provide a real playback experience
or an unlimited sound palette for both instruments and sound effects. On the other
hand, MIDI data is a structured format, so it is easy to create a summary according to its
structure. Therefore, MIDI summarization has little practical significance. In this section,
we focus on music summarization for sound recordings from the real world, both in the
uncompressed domain (e.g., WAV format) and the compressed domain (e.g., MP3 format).

Music Summarization in Uncompressed Domain


Although an exact definition of what makes part of a song memorable or distinguishable is still
unclear, the general consensus is that the most repeated section plays an
important role. Approaches to automatic music summarization typically include two stages.
The first stage is feature extraction. The music signal is segmented into frames and each
frame is characterized by features. Features related to instrumentation, texture, dynamics,
rhythmic characteristics, melodic gestures and harmonic content are used. Unfortunately, some of these features are difficult to extract, and it is not always clear which
features are most relevant. As a result, the first challenge in music summarization is to
determine the relevant features and find a way to extract them. In the second stage (music
structure analysis stage), the most repeated sections are identified based on similarity
analysis using various methods discussed below.

Feature Extraction
The commonly used features for music summarization are timbral texture features,
including:

•	Amplitude envelope, used in Xu et al. (2002)
•	MFCC, used in Logan and Chu (2000), Xu et al. (2002), Foote et al. (2002), and Lu and
	Zhang (2003)
•	Octave-based spectral contrast, used in Lu and Zhang (2003). It considers the spectral
	peak, spectral valley and their differences in each subband. Therefore, it can roughly
	reflect the relative distribution of harmonic and nonharmonic components in the
	spectrum, which complements the weak point of MFCC, that is, that MFCC averages
	the spectral distribution in each subband and thus loses the relative spectral
	information.
•	Pitch, used in Chai and Vercoe (2003). The pitch can be estimated using the
	autocorrelation of each frame. Although all the test data in their experiment are
	polyphonic, the authors believe that this feature is able to capture much information
	for music signals with a leading vocal.


Music Structure Analysis


All approaches to music structure analysis are based on detecting the most
repeated section of a song. As a result, the critical issue in this phase is how to measure
the similarity between different frames or different sections. More precisely, these
approaches can be classified into two main categories: Machine Learning Approaches
and Pattern Matching Approaches. Machine learning approaches attempt to categorize
each frame of a song into a certain cluster based on the similarity distance between this
frame and other frames in the same song. Then the number of frames in each cluster is used
to measure the occurrence frequency. The final summary is generated based on the
cluster that contains the largest number of frames. Pattern matching approaches aim at
matching the underlying candidate excerpts, which include a fixed number of continuous
frames, with the whole song. The final summary is generated based on the best matching
excerpt.
Machine Learning Approach
Since the music structure has to be determined without prior knowledge, unsupervised
learning is a natural fit. Clustering is the most widely used
approach in this category, and there are several music structure analysis methods based
on clustering.
Logan (Logan & Chu, 2000) used clustering techniques to find the most salient part
of a song, which is called the key phrase, in selections of popular music. They proposed
a modified cross-entropy or Kullback-Leibler (KL) distance to measure the similarity
between the different frames. In addition, they proposed a Hidden Markov Model
(HMM)-based summarization method which used each state of HMM to correspond to
a group of similar frames in the song.
Xu (Xu et al., 2002) proposed a clustering-based method to group segmented frames
into different clusters to structure the music content. They used the Mahalanobis
distance as the similarity measure.
Lu and Zhang (2003) divided their music structure analysis into two steps. In the
first step, they used a clustering method similar to that of Xu et al. (2002) to group the frames. In
the second step, they used estimated phrase length and the phrase boundary confidence
of each frame to detect the phrase boundary. In this way, the final music summary will
not include broken music phrases.
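A generic sketch of the clustering idea (not the exact Kullback-Leibler, HMM or Mahalanobis-distance formulations of the works above): frame-level feature vectors are clustered, the largest cluster is taken as the most repeated material, and frames from it are returned as summary candidates. The number of clusters and the summary length are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def summary_frames_by_clustering(frame_features, n_clusters=8, summary_frames=60):
    """Return indices of candidate summary frames drawn from the most populated cluster."""
    X = np.asarray(frame_features, dtype=float)          # shape: (n_frames, n_dims)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    dominant = int(np.bincount(labels).argmax())          # cluster containing the most frames
    candidates = np.where(labels == dominant)[0]
    return candidates[:summary_frames]                    # naive pick: earliest frames of that cluster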
Pattern Matching Approach
The pattern matching approach aims at matching the underlying excerpt with the
whole song to find the most salient part. The best matching excerpt can be the one that
is most similar to the whole song or the one that is repeated most in the whole song.
Foote et al. (2002) and Cooper and Foote (2002) proposed a representation called
similarity matrix for visualizing and analyzing the structure of music. One use of this
representation was to locate points of significant change in music, which they called
audio novelty. The audio novelty score is based on the similarity matrix, which compares
frames of music signals based on the features extracted from the audio. The resulting
summary was selected to maximize quantitative measures of the similarity between
candidate excerpts and the source audio as a whole. In their method, a simple Euclidean
distance or cosine distance was used to measure the similarity between different frames.
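A minimal sketch of such a frame-by-frame similarity matrix using cosine similarity; the feature matrix (one row per frame, e.g., MFCC vectors) is an assumed input, and Euclidean distance could be substituted just as easily.

import numpy as np

def similarity_matrix(frame_features):
    """S[i, j] = cosine similarity between the feature vectors of frames i and j."""
    F = np.asarray(frame_features, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F = F / np.maximum(norms, 1e-10)                  # avoid division by zero for silent frames
    return F @ F.T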


Bartsch and Wakefield (2001) used chroma-based features and the similarity matrix
proposed by Foote for music summarization.
Chai and Vercoe (2003) proposed a Dynamic Programming method to detect the
repetition of fixed-length excerpts in a song one by one. Firstly, they segmented the whole
song into frames, and grouped a fixed number of continuous frames into excerpts. Then
they computed the repetition property of each excerpt in the song using Dynamic
Programming. Consecutive excerpts that have the same repetitive property were
merged into sections, and each section was labelled according to the repetitive relation
(i.e., each section was given a symbol such as A, B, etc.). The final summary was
generated based on the most frequently repeated section.

Evaluation of Summary Result


It is difficult to objectively evaluate the music summarization results because there
is no absolute measure to evaluate the quality of a generated music summary. The only
possible validation is thus through well designed user tests. Basically, the current
subjective evaluation methods can be divided into two categories: the Attributes
Ranking method and the Ground Truth Based method.
The Attributes Ranking Method
The Attributes Ranking method uses appropriate attributes to assess the users'
perception of the systems. Logan and Chu (2000), and Lu and Zhang (2003) used the general
perception of the subjects as the evaluation attribute for the music summary. The rating
has three levels: 3, 2, and 1, representing good, acceptable and unacceptable. By
comparing the average scores of different summarization methods, the highest scoring
method emerges as the best one. The drawback of this evaluation standard is obvious:
it is extremely coarse, not only in the number of rating levels, but also in the number
of evaluated attributes.
Xu et al. (2002) and Shao (Shao et al., 2004) proposed a more refined evaluation
standard. They divided the general perception of a summary into three attributes, namely
Clarity, Conciseness and Coherence. In addition, the rating scale was extended to
five levels.
Chai and Vercoe (2003) considered four novel attributes to evaluate a music
summary: the percentage of the summary that contains a vocal portion, the percentage
of the summary that contains the song's title, the percentage of the summary that starts
at the beginning of a section and the percentage of the summary that starts at the
beginning of a phrase.
The Ground Truth Based Method
The Ground Truth Based method compares the summarization result with a
predefined ground truth, which is generally a summary produced manually
by music experts from the original music.
Lu and Zhang (2003) measured the overlap between an automatically extracted
music summary and the summary generated manually by the music expert.
Cooper and Foote (2002) proposed a method to measure whether the summarization
is good or not by generating summaries of different lengths for the same song and
validating whether the longer summary contains the shorter one. Their basic
assumption is that an ideal longer summary should contain the shorter one.

Music Summarization in Compressed Domain


So far, the music summarization methods mentioned above all operate in the
uncompressed domain. Due to the huge size of music data and the limited bandwidth,
audio/music analysis in the compressed domain is in great demand. Research in this field is
still in its infancy and there are many open questions that need to be solved.
There are a number of approaches proposed for compressed domain audio processing. Most of the work focuses on compressed domain audio segmentation (Tzanetakis
& Cook, 2000; Patel & Sethi, 1996). Compared with compressed domain speech processing, compressed domain music processing is much more difficult, because music consists
of many types of sounds and instrument effects. Wang and Vilermo (2001) used the
window type information encoded in MPEG-1 Layer 3 side information header to detect
beats. The short windows were used for short but intensive sounds to avoid pre-echo.
They found that the window-switching pattern of pop-music beats for their specific
encoder at bit-rates of 64-96 kbps gives (long, long-to-short, short, short, short-to-long,
long) window sequences in 99% of the beats.
However, there is no music summarization method available for the compressed
domain. Due to the large amount of compressed domain music (e.g., MP3) available
nowadays, automatic music summarization in the compressed domain is in high demand.
We have proposed an automatic music summarization method in the compressed
domain (MP3 format) (Shao et al., 2004). Considering the features that have been used
to characterize music content for summarization in the uncompressed domain, we have
developed corresponding features in the compressed domain to approximate them.
The selected compressed-domain features include the amplitude envelope, spectral centroid
and mel-frequency cepstrum. They are extracted frame by frame over a segmentation
window which includes 30 MP3 granules. We have analyzed and illustrated that these
features approximate well the corresponding features in the uncompressed domain (PCM)
(Shao et al., 2004). However, there are two major differences between compressed-domain
and uncompressed-domain feature extraction. Firstly, the time resolution for PCM and
MP3 is different. For PCM samples, we can arbitrarily adjust the window size; but for MP3
samples, the resolution unit is the granule, which means we can only increase or decrease
the window size by one granule (corresponding to 576 PCM samples). Secondly, to conceal
side effects, PCM samples are segmented into fixed-length, overlapping windows to
generate the summary, whereas for MP3 samples we group 30 MP3 granules into a bigger,
non-overlapping window.
Based on the calculated features of each frame, all the machine learning approaches
mentioned for the uncompressed domain can be used to find the most salient part of the
music. We use the clustering method in Xu et al. (2002) to group the music frames and obtain
the structure of the music content; each cluster contains frames with similar features. A
summary can then be generated in terms of this structure and music domain knowledge.
According to music theory, the most distinctive or representative music themes occur
repetitively in an entire music work. The scheme of summary generation can be found in
Xu et al. (2002).


Evaluation

We adopted the Attributes Ranking method to evaluate the summarization results.
Three attributes, namely clarity, conciseness and coherence, are introduced. The
experiment shows that the summarization conducted on MP3 samples is comparable with
the summarization conducted on PCM samples for all genres of music in the test set.
The aim of testing with music of different genres is to determine the effectiveness
of the proposed method in creating summaries across genres. A complete description
of the results can be found in Shao et al. (2004).

Music Video Summarization


Nowadays, many music companies are putting their music video (MTV) products
on the Web, and customers can purchase them online. From the customer's point of view,
they would prefer to watch the highlights of an MTV before deciding
whether to purchase it or not. From the music company's point of view,
they would be glad to stimulate the buying interest of music fans by showing the
highlights of a music video rather than showing the whole video. Although summaries
exist on some Web sites, they are currently generated manually, which requires
expensive manpower and is time-consuming. Therefore, it is crucial to come up with an
automatic summarization approach for music videos.
There are a number of approaches proposed for automatically creating video
summaries. The existing video summarization methods can be classified into two
categories: key-frame extraction and highlight creation. Using a set of key frames to
create a video summary is the most common approach. A great number of key frame
extraction methods (DeMenthon et al. 1998; Gunsel & Tekalp, 1998) have been proposed.
Key frames can help the user identify the desired shots of video, but they are insufficient
to help the user obtain a general idea of whether the created summary is relevant or not.
To make the created summary more relevant and representative of the video content,
video highlight creation methods (Sundaram et al., 2002; Assfalg et al., 2002) are
proposed to reduce a long video into a short sequence and help the user determine
whether a video is worth viewing in its entirety. It can provide an impression of the entire
video content or only contain the most interesting video sequences.
MTV is a special kind of video. It is an extension of music and widely welcomed by
music fans. Nowadays, automatic video summarization has been applied to sports video
(Yow et al., 1995), news video (Nakamura & Kanade, 1997), home video (Gong et al., 2001)
and movies (Pfeiffer et al., 1996). However, there is no widely accepted summarization
technique used for music video.
We have proposed an automatic music video summarization approach (Shao et al.,
2003), which is described in the following subsections.

Structure of Music Video


Video programs such as movies, dramas, talk shows, and so forth, have a strong
synchronization between the audio and visual contents. What we hear from the audio
track is highly correlated with what we see on the screen, and vice versa. For this type
of video program, since synchronization between audio and image is critical, the
summarization has to be either audio-centric or image-centric. The audio-centric summarization can be accomplished by first selecting important audio segments of the original
video based on certain criteria and then concatenating them to compose an audio
summary. To enforce the synchronization, the visual summary has to be generated by
selecting the image segments corresponding to those audio segments which form the
audio summary. Similarly, the image-centric summarization can be created by selecting
representative image segments from the original video to form a visual summary, and then
taking the corresponding audio segments to form the associated audio summary. For
these types of summarizations, either audio or visual contents of the original video will
be sacrificed in the summaries.
However, music video programs do not have a strong synchronization between
their audio and visual contents. Consider a music video program in which an audio
segment presents a song sung by a singer, the corresponding image segment could be
a close-up shot of the singer sitting in front of a piano, or shots of some related interesting
scenes. The audio content does not directly refer to the corresponding visual content.
Since music video programs do not have strong synchronization between the associated
audio and visual contents, we propose to first create an audio and a visual summary
separately, and then integrate the two summaries with partial alignment. With this
approach, we can maximize the coverage for both audio and visual contents without
sacrificing either of them.

Our Proposed Method


Figure 4 is the block diagram of the proposed music video summarization system.
The music video is separated into the audio track and the visual track. For the audio track,
a music summary is created by analyzing the music content based on music features, an adaptive
clustering algorithm and music domain knowledge. For the visual track, shots are
detected and clustered using visual content analysis. Finally, the music video summary
is created by partially aligning the music summary with the clustered visual shots. Since
music summarization for the audio track has been addressed in the previous sections, we
focus here on the process of shot detection and aligning the music
summary with the clustered visual shots. The music summarization scheme can be found in
Xu et al. (2002).
Shot Detection
In general, to create a video summary, the original video sequence must first be
structured into a shot cluster set S. Any pair of clusters in the set must be visually different, and
all shots belonging to the same cluster must be visually similar. We choose the first frame
appearing after each detected shot boundary as a key frame. Therefore, for each shot,
we have a related key frame. When comparing the similarity of two different shots,
we calculate the difference between their two key frames using color histograms.

Dv(i, j) = Σ(e=Y,U,V) Σ(k=1..n) | hi^e(k) − hj^e(k) |    (7)

where hi^e and hj^e are the YUV-level histograms (for channel e) of key frames i and j, respectively.
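A sketch of Equation (7), assuming each key frame is already an (H, W, 3) array in YUV order; the number of histogram bins, the 0-255 value range and the absolute-difference form are assumptions for illustration.

import numpy as np

def key_frame_distance(frame_i, frame_j, n_bins=64):
    """Dv(i, j): summed per-channel histogram differences of two YUV key frames (Eq. 7)."""
    d = 0.0
    for channel in range(3):                                  # Y, U and V channels
        hi, _ = np.histogram(frame_i[..., channel], bins=n_bins, range=(0, 255))
        hj, _ = np.histogram(frame_j[..., channel], bins=n_bins, range=(0, 255))
        d += float(np.sum(np.abs(hi.astype(float) - hj.astype(float))))
    return d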

Figure 4. Block diagram of proposed summarization system

The total number of clusters in S varies depending on the internal structure of the
original video. When given a shot cluster set S, the video sequence with the minimum
redundancy measure is the one in which all the shot clusters have a uniform occurrence
probability and an equal time length of 1.5 seconds (Gong et al., 2001). Based on these
criteria, the video summaries were created using the following major steps:
1.	Segment the video into individual camera shots.
2.	Group the camera shots into clusters based on their visual similarities. After the
	clustering process, each resultant cluster consists of the camera shots whose
	similarity distance to the centre of the cluster is below a threshold D.
3.	For each cluster, find the shot with the longest length, and use it as the representative
	shot for the cluster.
4.	Discard the clusters with a representative shot shorter than 1.5 seconds. For those
	clusters with a representative shot longer than 1.5 seconds, we cut the shot to 1.5
	seconds.
5.	Sort the representative shots of all clusters by the time code, resulting in the
	representative shot set U = {u1, u2, ..., um}, m ≤ n, where n is the total number of
	clusters in shot set S.

Music Video Alignment


The final task for creating a music video summary is the alignment operation that
partially aligns the image segments in the video summary with the associated music
segments. Our goal for the alignment is that the summary should be smooth and natural,
and should maximize the coverage for both music and visual contents of the original
music video without sacrificing audio or visual parts.
Assume that the whole time span Lsum of the video summary is divided by the
alignment into P partitions (required clusters), and the time length of partition i is T i.
Because each image segment forming the visual summary must be at least L min seconds
long (a time slot equals one Lmin duration) as shown in Figure 5, partition i will provide

Ni = Ti / Lmin    (8)

and hence the total number of available time slots becomes

Ntotal = Σ(i=1..P) Ni    (9)

For each partition, the time length of the music sub-summary is three to five
seconds, and the time length of a shot is 1.5 seconds. Therefore, the alignment problem
can be formally described as follows.

Given:
1.	An ordered set of representative shots U = {u1, u2, ..., um}, m ≤ n, where n is the total
	number of clusters in cluster set S.
2.	P partitions and Ntotal time slots.

To extract:
P sets of output shots R = {R1, R2, ..., RP} which are the best matches between shot
set U and the Ntotal time slots, where:

P = the number of partitions
Ri = {ri1, ..., rij, ..., riNi} ⊆ U, i = 1, 2, ..., P, and Ni = Ti / Lmin

where ri1, ..., rij, ..., riNi are the optimal shots selected from the shot set U for the i-th
partition.
By proper reformulation, this problem can be converted into the Minimum Spanning
Tree (MST) problem (Dale, 2003).

Figure 5. Alignment operations on image and music (the audio summary is divided into partitions T1, T2, ..., TP, each of which is further divided into time slots of length Lmin)

Let G = (V, E) represent an undirected graph with a
finite set of vertices V and a weighted edge set E. The MST of a graph defines the lowest-weight
subset of edges that spans the graph in one connected component. To apply the
MST to our alignment problem, we use each vertex to represent a representative shot
ui, and an edge eij = (ui, uj) to represent the similarity between shots ui and uj. The similarity
here is defined as the combination of time similarity and visual similarity, and we give time
similarity a higher weight. The similarity is defined as follows:

eij = (1 − α)·T(i, j) + α·D(i, j)    (10)

where α (0 ≤ α ≤ 1) is a weight coefficient, and D(i, j) and T(i, j) represent the normalized
visual similarity and time similarity, respectively.

D(i, j) is defined as follows:

D(i, j) = Dv(i, j) / max(Dv(i, j))    (11)

where Dv(i, j) is the visual similarity calculated from Equation (7). After normalization,
D(i, j) has a value range from 0 to 1.


T(i, j) is defined as follows:

T(i, j) = 1 / (Fj − Li)  if Li < Fj;  0 otherwise    (12)

where Li is the index of the last frame in the ith shot, and Fj is the index of the first frame
in the jth shot. Using this equation, the closer the two shots are in the time domain, the
higher the time similarity value they get. T(i, j) varies from 0 to 1, reaching 1 when shot j
immediately follows shot i, that is, when there are no other frames between the two shots.
In order to give the time similarity high priority, we set α to less than 0.5. Thus, we can
create a similarity matrix for all shots in the representative shot set U, whose (i, j)th element
is eij. For every partition Ri, we generate an MST based on this similarity matrix.
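The sketch below assembles the edge weights of Equations (10)-(12) for a set of representative shots and hands them to SciPy's minimum spanning tree routine. The value of α, the symmetrization of the (non-symmetric) weights, and the use of Dv from Equation (7) as the raw visual term are assumptions; the chapter's actual traversal (rooting the tree at a matched shot and filling time slots) is not reproduced here.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def shot_edge_weights(first_frames, last_frames, Dv, alpha=0.3):
    """e_ij = (1 - alpha) * T(i, j) + alpha * D(i, j) for representative shots (Eqs. 10-12)."""
    m = len(first_frames)
    D = np.asarray(Dv, dtype=float)
    D = D / D.max() if D.max() > 0 else D               # Eq. (11): normalize the visual term
    E = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            gap = first_frames[j] - last_frames[i]       # frames between shot i's end and shot j's start
            T = 1.0 / gap if gap > 0 else 0.0            # Eq. (12)
            E[i, j] = (1 - alpha) * T + alpha * D[i, j]  # Eq. (10); alpha < 0.5 favours time
    return E

def shot_mst(E):
    """Minimum spanning tree over the symmetrized shot graph, returned as a sparse matrix."""
    return minimum_spanning_tree(np.maximum(E, E.T))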
In summary, for creating a content-rich audio-visual extraction, we propose the
following alignment operations:
1.	Summarize the music track of the music video. The music summary consists of
	several partitions, each of which lasts for three to five seconds. The total duration
	of the summary is about 30 seconds.
2.	Divide each music partition into several time slots, each of which lasts for 1.5
	seconds.
3.	For each music partition, find the corresponding image segment as follows:

In the first time slot of the partition, find the corresponding image segment in the
time domain. If it exists in the representative shot set U, assign it to the first slot and delete
it from the shot set U; if not, identify it in the shot set S, and find the most similar shot in
shot set U using the similarity measure defined in Equation (7). After finding the
shot, take it as the root, apply the MST algorithm to it, find the other shots in shot set U, and
fill them into the subsequent time slots of this partition.

Future Work
We believe that there is still a long way to go for automatically generating transcriptions
from acoustic music signals. Current techniques are not robust and efficient enough.
Thus, analyzing the acoustic music signals directly, without transcription, is of practical
importance in music summarization. Future work will focus on making summarization more
accurate. To achieve this, on the one hand, we need to explore more music
features that can be used to characterize the music content; on the other hand, we need
to investigate more sophisticated music structure analysis methods to create more
accurate and acceptable music summaries. In addition, we will investigate human perception
of music more deeply; for example, what makes part of music sound like a complete
phrase, and what makes it memorable or distinguishable.
For music video, apart from improving the summarization of the audio part, more
sophisticated music/video alignment methods will be developed. Furthermore, some of
the other information in a music video can be integrated to generate the summary. For
example, some Karaoke music videos have lyric captions, which can be detected and
recognized. These captions, together with visual shots and vocal information, can be
used to make a better music video summary.

MUSIC GENRE CLASSIFICATION


The ever-increasing wealth of digitized music on the Internet calls for an automated
organization of music materials. Music genre is an important description that can be used
to classify and characterize music from different sources such as music shops, broadcasts
and the Internet. It is very useful for music indexing and content-based music retrieval.
For humans, it is not difficult to classify music into different genres. Although making
computers understand and classify music genres is a challenging task, there are still
perceptual criteria related to the melody, tempo, texture, instrumentation and rhythmic
structure that can be used to characterize and discriminate different music genres.
A music genre is characterized by common features related to instrumentation,
texture, dynamics, rhythmic characteristics, melodic gestures and harmonic content.
Similar to music summarization, the challenge of genre classification is to determine the
relevant features and find a way to extract them.
Once the features have been extracted, it is then necessary to find an appropriate
pattern recognition method for classification. Fortunately, there are a variety of existing
machine learning and heuristic-based techniques that can be adapted to this task.
Aucouturier and Pachet (2003) presented an overview of the various approaches
for automatic genre classification and categorized them into two categories: prescriptive
approaches and emergent approaches.


Prescriptive Approach
Aucouturier and Pachet (2003) defined the prescriptive approach as an automatic
process that involves two steps: frame-based feature extraction followed by machine
learning.
Tzanetakis et al. (2001) cited a study indicating that humans are able to classify
genre after hearing only 250 ms of a music signal. The authors concluded from this that
it should be possible to make classification systems that do not consider music form or
structure. This implied that real-time analysis of genre could be easier to implement than
thought.
The ideas were further developed in Tzanetakis and Cook (2002), where a fully
functional system was described in detail. The authors proposed to use features related
to timbral texture, rhythmic content and pitch content to classify pieces, and the
statistical values (such as the mean and the variance) of these features were then
computed.
Several types of statistical pattern recognition (SPR) classifiers are used to identify
genre based on feature data. SPR classifiers attempt to estimate the probability density
function for the feature vectors of each genre. The Gaussian Mixture Model (GMM)
classifier and the K-Nearest Neighbor (KNN) classifier were each trained to distinguish
between 20 music genres and three speech genres by feeding them with feature sets
of a number of representative samples of each genre.
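A minimal scikit-learn sketch of the GMM variant of this scheme (the KNN case is a one-liner with a nearest-neighbour classifier): one Gaussian mixture is fitted per genre, and a feature vector is assigned to the genre whose model gives it the highest log-likelihood. The number of mixture components and the diagonal covariance are arbitrary choices here, not values taken from the cited work.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_genre_gmms(features_by_genre, n_components=3):
    """Fit one GMM per genre on that genre's feature vectors (dict: genre -> (n, d) array)."""
    return {
        genre: GaussianMixture(n_components=n_components,
                               covariance_type="diag", random_state=0).fit(X)
        for genre, X in features_by_genre.items()
    }

def classify_genre(models, feature_vector):
    """Pick the genre whose GMM assigns the highest average log-likelihood to the vector."""
    x = np.atleast_2d(feature_vector)
    return max(models, key=lambda genre: models[genre].score(x))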
Pye (2000) used MFCCs as the feature vector. Two statistical classifiers, a GMM and
a tree-based vector quantization scheme, were used separately to classify music into six
types: blues, easy listening, classic, opera, dance and rock.
Grimaldi (Grimaldi et al., 2003) built a system using a discrete wavelet transform to
extract time and frequency features, for a total of 64 time features and 79 frequency
features. This is a greater number of features than Tzanetakis and Cook (2002) used,
although few details were given about the specifics of these features. This work used an
ensemble of binary classifiers to perform the classification, with each classifier trained
on a pair of genres. The final classification is obtained through a vote of the classifiers.
Tzanetakis, in contrast, used single classifiers that processed all features for all genres.
Xu (Xu et al., 2003) proposed a multilayer classifier based on support vector
machines (SVM) to classify music into four genres of pop, classic, rock and jazz. In order
to discriminate different music genres, a set of music features was developed to
characterize music content of different genres and an SVM learning approach was applied
to build a multilayer classifier. For different layers, different features and support vectors
were employed. In the first layer, the music was classified into pop/classic and rock/jazz
using an SVM to obtain the optimal class boundaries. In the second layer, pop/classic
music was further classified into pop and classic music and rock/jazz music was classified
into rock and jazz music. This multilayer classification method can provide a better
classification result than existing methods.
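A sketch of the two-layer idea under the stated four-genre taxonomy: the first SVM separates the pop/classic group from the rock/jazz group, and two second-layer SVMs resolve each pair. The RBF kernel and the grouping labels are assumptions for illustration; the original work's per-layer feature design and support-vector selection are not reproduced.

import numpy as np
from sklearn.svm import SVC

class TwoLayerGenreSVM:
    """Layer 1: {pop, classic} vs {rock, jazz}.  Layer 2: pop vs classic, rock vs jazz."""

    def __init__(self):
        self.layer1 = SVC(kernel="rbf")
        self.pop_classic = SVC(kernel="rbf")
        self.rock_jazz = SVC(kernel="rbf")

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        group = np.isin(y, ["pop", "classic"])          # True for the pop/classic super-class
        self.layer1.fit(X, group)
        self.pop_classic.fit(X[group], y[group])
        self.rock_jazz.fit(X[~group], y[~group])
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        if self.layer1.predict(x)[0]:
            return self.pop_classic.predict(x)[0]
        return self.rock_jazz.predict(x)[0]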

Classification Results and Evaluation


It is impossible to give an exhaustive comparison of these approaches because they
use different target taxonomies and different training sets. However, we can still draw
some interesting remarks.


•	Tzanetakis et al. (2001) achieved 61% accuracy using 50 songs belonging to 10 genres.
•	Pye (2000) reported 90% accuracy on a total set of 175 songs over five genres.
•	Grimaldi et al. (2003) achieved a success rate of 82%, although only four categories
	were used.
•	Xu et al. (2003) reported accuracy of 90% over four categories.
A common remark is that some types of music have proven to be more difficult to
classify than others. In particular, Classic and Techno are easy to classify, while
Rock and Pop are not. A possible explanation for this is that the global frequency
distribution of Classic and Techno is very different from other music types, whereas
many Pop and Rock music versions use the same instrumentation.

Emergent Approach
There are two challenges in the prescriptive method (see the preceding section):
how to determine features to characterize the music and how to find an appropriate
pattern recognition method to perform classifications. The more fundamental problem,
however, is to determine the structure of the taxonomy in which music pieces will be
classified. Unfortunately, this is not a trivial problem. Different people may classify the
same piece differently. They may also select genres from entirely different domains or
emphasize different features. There is often an overlap between different genres, and the
boundaries of each genre are not clearly defined. The lack of universally agreed upon
definitions of genres and relationships between them makes it difficult to find appropriate
taxonomies for automatic classification systems.
Pachet and Cazaly (2000) attempted to solve this problem. They observed that the
taxonomies currently used by the music industry were inconsistent and therefore
inappropriate for the purpose of developing a global music database. They suggested
building an entirely new classification system. They emphasized the goals of producing
a taxonomy that was objective, consistent, and independent from other metadata
descriptors and that supported searches by similarity. They suggested a tree-based
system organized by genealogical relationships as an implementation, where only leaves
would contain music examples. Each node would contain its parent genre and the
differences between its own genre and that of its parent.
Although merits exist, the proposed solution has problems of its own. To begin
with, defining an objective classification system is much easier said than done, and
getting everyone to agree on a standardized system would be far from an easy task,
especially when it is considered that new genres are constantly emerging. Furthermore,
this system did not solve the problem of fuzzy boundaries between genres, nor did it deal
with the problem of multiple parents that could compromise the tree structure.
Since, up to now, there has been no good solution for the ambiguity and inconsistency of
music genre definitions, Pachet et al. (2001) presented the emergent approach as the best
approach to take to achieve automatic genre classification. Rather than using existing
taxonomies, as done in prescriptive systems, emergent systems attempt to let
classifications emerge according to a certain measure of similarity. The authors suggested some
similarity measurements based on audio signals as well as on cultural similarity gleaned
from the application of data mining techniques to text documents. They proposed the use
of both collaborative filtering to search for similarities in the taste profiles of different
individuals and co-occurrence analysis on the play lists of different radio programs and
track listings of CD compilation albums. Although this emergent system has not been
successfully applied to music, the idea of automatically exploiting text documents to
generate genre profiles is an interesting one.

Future Work
There are two directions for prescriptive approaches that need to be investigated
in the future. First, more music features need to be explored, because a better
feature set can improve performance dramatically. For example, some music genres
use the same instrumentation, which implies that timbre features are not good enough
to separate them. Thus we can use rhythm features in the future. Existing beat-tracking
systems are useful in acquiring rhythmic features. However, many existing beat-tracking
systems provide only an estimate of the main beat and its strength. For the purpose of
genre classification, more detailed features such as overall meter, syncopation, use of
rubato, recurring rhythmic gestures and the relative strengths of beats and
subbeats are all significant. Furthermore, we can consider segmenting the music clip
according to its intrinsic rhythmic structure. This captures the natural structure of music
genres better than the traditional fixed-length window segmentation. The second
direction is to scale up unsupervised classification to music genre classification. Since
supervised machine learning methods are limited by the inconsistency of the built-in
taxonomy, we will explore unsupervised machine learning methods which try to
derive a classification directly from the database. We will also investigate the possibility of
combining an unsupervised classification method with a supervised classification
method for music genre classification. For example, the unsupervised method could be
employed to initially classify music into broad and strongly different categories, and the
supervised method could then be employed to classify fine-grained subcategories.
This would partially solve the problem of fuzzy boundaries between genres and could
lead to better overall results.
The emergent approach is able to extract high-level similarity between titles and
artists and is therefore suitable for unsupervised clustering of songs into meaningful
genre-like categories. These techniques suffer from technical problems, such as labelling
clusters. These issues are currently under investigation.

SEMANTIC REGION DETECTION IN ACOUSTICAL MUSIC SIGNAL
Semantic region detection in music signals is a new direction in music content
analysis. Compared with speech signals, music signals are heterogeneous because they
contain different source signal mixtures in different regions. Thus, detecting these
regions in music signals can reduce the complexity of both music content analysis
and information extraction. For example, in order to build a content-based music retrieval
system, the sung vocal line is one of the intrinsic properties of a given music signal. In
automatic music transcription, the instrumental-mixed vocal line should be analyzed in order
to extract music note information such as the type of the instrument and note characteristics
(attack, sustain, release, decay). In automatic music summarization, the semantic
structure of the song (i.e., intro, chorus, verses and bridge, outro) should be identified
accurately. In singing voice identification, it is required to detect and extract the vocal
regions in the music. In music source separation, it is required to identify the vocal and
instrumental section. To remove the voice from the music for applications such as
Karaoke and for automatic lyrics generator, it is also required to detect the vocal sections
in the music signal.
The continuous raw music data can be divided into four preliminary classes: pure
instrumental (PI), pure vocal (PV), instrumental mixed vocal (IMV), and silence (S). The
pure instrumental regions contain signal mixture of many types of musical instruments
such as string type, bowing type, blowing type, percussion type, and so forth. The pure
vocal regions are the vocal lines sung without instrumental music. The IMV regions
contain the mixture of both vocals and instrumental music. Although the silence is not
a common section in popular music, it can be found at the beginning, ending and between
chorus verse transitions in the songs.
The singing voice is the oldest musical instrument and the human auditory
physiology and perceptual apparatus have evolved to a high level of sensitivity to the
human voice. After over three decades of extensive research on speech recognition, the
technology has matured to the level of practical applications. However, speech recognition techniques have limitations when applied to singing voice identification because
speech and singing voice differ significantly in terms of their production and perception
by the human ear (Sundberg, 1987). A singing voice has more dynamic and complicated
characteristics than speech (Saitou et al., 2002). The dynamic range of the fundamental
frequency (F0) contours in a singing voice is wider than that in speech, and F0
fluctuations in singing voices are larger and more rapid than those in speech.
The instrumental signals are broadband and harmonically rich signals compared
with singing voice. The harmonic structures of instrumental music are in the frequency
range up to 15 kHz, whereas the singing voice is in the range below 5 kHz.
Thus it is important to revise the speech processing techniques according to
structural musical knowledge so that these techniques can be applied to music content
analysis such as semantic region detection, where the signal complexity differs for
different regions.
Semantic region detection is a hot topic in content-based music analysis. In the
following subsections, we summarize related work in this area and introduce our new
approach for semantic region detection.

Related Work
Many of the existing approaches for speech, instrument and singing voice detection and identification are based on speech processing techniques.

Boundary Detection in Speech


Speech can be considered as a homogeneous acoustic signal because it contains
the signal of a single source only. Regions in the speech can be classified into voiced,
unvoiced, voiced/unvoiced mixture or silence depending on how the speech model is
excited. The efficiency of extracting the meaning of the continuous speech signal
depends on how accurately the regions are detected in the automatic speech recognition
systems. Many researches on speech analysis have been done over three decades and
methodologies are well established (Rabiner & Juang, 1993; Rabiner & Schafer, 1978). All
vowels and some consonants [m], [n], [l] are voiced while other consonants [f], [s], [t]
are unvoiced. For unvoiced sounds, the source is no longer the phonation of the vocal folds, but
the turbulence caused by air being impeded by the vocal tract. Some consonants ([v], [z])
are mixed sounds (mixture of voiced and unvoiced) that use both phonation and
turbulence to produce the overall sound (Kim, 1999).
Speech is a narrowband (<10 kHz) signal, and voiced and unvoiced regions are
distinctive in the spectrogram. Voiced fricatives produce quasi-periodic pulses. Thus
harmonically spaced strong frequencies in the lower frequency band (<1 kHz) can be
noticed in the spectrogram. Since unvoiced fricatives are produced by exciting the vocal
tract with broadband noise, they appear as a broadband frequency beam in the spectrum.
The analysis of formant structures, which are the resonant frequencies of the vocal tract
tube, has been one of the key techniques for detecting the voiced/unvoiced regions. The pitch
contour and time-domain speech modeling using signal energy and the average zero crossing
rate are some of the other speech features inspected for detecting speech boundaries.
Basic steps for detecting the boundaries in speech signals are shown in Figure
6. The signal is first segmented into short windows of 30~40 ms with 50% overlap, and then
features are extracted. The short window smooths the shape of the lower frequencies in
the spectrum and highlights the lower-frequency resonances of the vocal tract (formants). Another
reason is that a shorter window is capable of detecting dynamic changes in the speech and, with
reasonable window overlap, can capture these temporal properties of the signal.
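A small sketch of this segmentation step, assuming a mono signal x and sample rate sr; the 30-ms window and 50% overlap follow the figures quoted above.

import numpy as np

def frame_signal(x, sr, win_ms=30, overlap=0.5):
    """Cut a signal into short windows (e.g., 30~40 ms) with 50% overlap."""
    x = np.asarray(x, dtype=float)
    win = int(sr * win_ms / 1000.0)
    hop = max(1, int(win * (1.0 - overlap)))
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])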
The linear predictive coding coefficients (LPC) calculated from the speech model,
stationary signal spectrum representation using Cepstral coefficients and dominant
pitch sensitive Mel-scaled Cepstral coefficients are some of the features extracted from
the short time windowed signals and they are modeled with statistical learning methods.
Most speech recognition systems have employed Hidden Markov Models (HMMs)
to detect these boundaries, and it has been found that HMMs are efficient in modeling the
dynamically changing speech properties in different regions.

Instrument Detection and Identification


Research on instrumental music content analysis focuses on identifying musical
instruments, timbre, and musical notes from isolated tones recorded in a studio
environment. The approaches are similar to the steps shown in Figure 6, where statistical
signal processing and neural networks have been employed for feature extraction and
pattern classification.
Fujinaga (1998) trained a K-Nearest Neighbor (K-NN) classifier with spectral
domain features extracted from 1,338 spectral slices representing 23 instruments playing
a range of pitches. The extracted features included the mass or the integral of the curve
(zeroth-order moment), the centroid (first-order moment), the standard deviation (square
root of the second-order central moment), the skewness (third-order central moment),
kurtosis (fourth-order central moment), higher-order central moments (up to 10th), the
fundamental frequency, and the amplitudes of the harmonic partials. For the best
recognition score, a genetic algorithm (GA) was used to find the optimized subset of the
352 main features. In his experiment, it was found that the accuracy varied from 8% to 84%.

Figure 6. Steps for boundary detection in speech signals (speech → signal segmentation/windowing → feature extraction → classification/learning → boundary detection)
Cosi (Cosi et al., 1994) trained a Self-Organizing Map (SOM) with MFCCs extracted
from isolated musical tones of 40 different musical instruments for timbre classification.
Martin (1999) trained a Bayesian network with different types of features such as
spectral features, pitch, vibrato, tremolo features, and note characteristic features to
recognize the nonpercussive musical instruments.
Eronen and Klapuri (2000) proposed a system for musical instrument recognition
using a wide set of features to model the temporal and spectral characteristics of sounds.
Kashino and Murase (1997) compared the classification abilities of a feed-forward
neural network with a K-Nearest Neighbor classifier, both trained with features of the
amplitude envelopes for isolated instrument tones.
Brown (1999) trained a GMM with constant-Q Cepstral coefficients for each
instrument (i.e., oboe, saxophone, flute and clarinet), using approximately one minute of
music data each.
Maddage et al. (2002) extracted spectral power coefficients, ZCR, MFCCs and LPC-derived
Cepstral coefficients from eight pitch-class electric guitar notes (C4 to C5,
shown in Table 1) and employed a nonparametric learning technique (i.e., the nearest
neighbour rule) to classify the musical notes. Over 85% accuracy of correct note
classification was reported using a musical note database which has 100 samples of each
note.

Singing Voice Detection and Identification


For singing voice detection, Berenzweig and Ellis (2001) used probabilistic features,
which are generated from Cepstral coefficients using an MLP neural network acoustic
model with 2000 hidden units. Two HMMs (a vocal HMM and a nonvocal HMM) were
trained with these specific features, which are originally extracted from 61 fragments (one
fragment = 15 seconds) of training data, to classify vocal and instrumental sections of
a given song. However, the reported accuracy was only 81.2% with a 40-fragment training
dataset.
Kim and Brian (2002) first filtered the music signal using an IIR band-pass filter
(200~2000 Hz) to highlight the vocal energies, and then vocal regions were identified by
detecting high amounts of harmonicity in the filtered signal using an inverse comb filter
bank. They achieved 54.9% accuracy with a test set of 20 songs.
Zhang (2003) and Zhang and Kuo (2001) used a simple threshold, which is calculated
using energy, average zero crossing, harmonic coefficients and spectral flux features, to
find the starting point of the vocal part of the music. A similar technique was applied
to detect the semantic boundaries of online audio data (i.e., speech, music and
environmental sound) for classification. However, the vocal detection accuracy was
not reported.
Tsai et al. (2003) trained a 64-mixture vocal GMM and an 80-mixture nonvocal GMM
with MFCCs extracted from 32-ms frames with 10-ms overlap over the training data (216
song tracks). An accuracy of 79.8% was reported on their 200 testing tracks.
In our previous work (Maddage et al., 2003; Gao et al., 2003; Xu et al., 2004), we
trained different statistical learning techniques such as SVM, HMM and MLP neural
networks to detect the vocal and nonvocal boundaries in a hierarchical fashion. Similar to
other methods, our experiments were based on speech-related features, that is, LPC, LPC-derived
Cepstral coefficients, MFCCs, ZCR and Spectral Power (SP). After parameter
tuning we could reach an accuracy of over 80%. In Maddage et al. (2004a), we measured both
the harmonic spacing and the harmonic strength of the instrumental and vocal spectra
in order to detect the instrumental/vocal boundaries in the music.
For singer identification, a vocal and instrumental model combination method has
been proposed in Maddage et al. (2004b). In that method, vocal and instrumental sections
of the songs were characterised using octave scale Cepstral coefficients (OSCC) and LPC
derived Cepstral coefficients (LPCC), respectively, and two GMMs (one for vocal and
the other for instrumental) were trained to highlight the singer characteristics. The experiments performed on a database of 100 songs indicated that the singer identification could
be improved (by 6%) when the instrumental models were combined with the vocal model.
The previous methods have borrowed mature speech processing ideas, such as
fixed-frame-size acoustic signal segmentation (usually a 20~100-ms frame size with 50%
overlap), speech processing/coding feature extraction, and statistical learning procedures or linear thresholds for segment classification, to detect the vocal/nonvocal
boundaries of the music. Although these methods have achieved up to 80% frame-level
accuracy, their performance is limited by the fact that musical knowledge has not been
effectively exploited in these (mostly bottom-up) methods. We believe that a combination of bottom-up and top-down approaches, which combines the strength of low-level
features and high-level musical knowledge, can provide a powerful tool for improving
system performance. In the following subsections, we investigate how well speech
processing techniques can cope with the semantic boundary detection task, and we
propose a novel approach, which considers both signal processing and musical knowledge, to detect semantic regions in acoustical music signals.

Song Structure
The structure of popular music often contains Intro, Verse, Chorus, Bridge and Outro
(Ten Minute Master, 2003). The intro may be two, four or eight bars long (or longer), or
there may be no intro in a song at all. The intro of a pop song is often a flashback of
the chorus. Both verse and chorus are eight to sixteen bars long. Typically the verse is
not as strongly melodic as the chorus. However, the verse and chorus of some songs, such as
Beatles songs, are equally strong, and most people can hum or sing their way through either.
Usually the gap between verse and chorus is linked by a bridge, which may be only two
or four bars. There are also instrumental sections in the song; they can be instrumental
versions of the chorus or verse, or an entirely different tune with an altogether different set
of chords. Silence may act as a bridge between verse and chorus of a song, but such cases
are rare.
Since the sung vocal passages follow the changes in the chord pattern, we can apply
the following knowledge of chords (Goto, 2001) to the timing information of vocal
passages:
1. Chords are more likely to change on beat times than on other positions.
2. Chords are more likely to change on half-note times than on other positions of beat times.
3. Chords are more likely to change at the beginning of the measure than at other positions of half-note times.

Figure 7. Time alignment of the structure in songs with bars and quarter notes (a 4/4 song whose Intro, Verse, Chorus, Verse, Bridge, Chorus and Outro sections are aligned to bar and quarter-note boundaries at bars 1, 2, i, j, k, l, ..., n, with i < j < k < l < n)

A typical song structure and its possible time alignment of bars and quarter notes
are shown in Figure 7. The duration of the song can be measured by the number of quarter
notes in the song, where the quarter-note time length is proportional to the interbeat
timing. The Intro, Verse, Chorus, Bridge and Outro sections usually begin and end at
quarter notes, as illustrated in Figure 7.
Thus we detect both the onsets of the musical notes and the chord pattern changes to
compute the quarter-note time length with high confidence.

Overview of the Proposed Method


The block diagram of the proposed approach is shown in Figure 8. In our approach
we detect the timing information of the song in terms of the quarter-note length, which is
proportional to the interbeat time intervals. The audio is then segmented into frames whose
lengths are proportional to the quarter-note length. To differentiate this from the frequently used
fixed-length segmentation, we call it beat space segmentation (BSS). The technical details
of rhythm extraction and beat space segmentation are given in the next section. There
are two reasons why we use BSS.
1. A careful analysis of the song structure reveals that the time lengths of the semantic regions (PV, IMV, PI & S) are proportional to the interbeat time interval of the music, which corresponds to the quarter-note length (see the preceding and following sections).
2. The dynamic behaviour of a beat-spaced signal section is quasi-stationary (Sundberg, 1987; Rossing et al., 2002). In other words, the musically driven signal properties, such as octave spectral spacing and musical harmonic structure, change in beat space time steps.

Rhythm Extraction and Beat Space Segmentation


Rhythm extraction is important for obtaining metadata from the music. Rhythm can
be perceived as a combination of strong and weak beats (Goto & Muraoka, 1994). A
strong beat usually corresponds to the first and third quarter notes in a measure, and a
weak beat corresponds to the second and fourth quarter notes in a measure.

Figure 8. Block diagram of the proposed approach (the musical audio passes through rhythm extraction, beat space segmentation into quarter-note-length frames, silent frame detection, musically modified feature extraction, and statistical learning and classification into the semantic regions Pure Vocal (PV), Instrumental Mixed Vocals (IMV) and Pure Instrumental (PI))

If the strong
beat constantly alternates with the weak beat, the interbeat interval, which is the temporal
difference between two successive beats, would correspond to the temporal length of
a quarter note.
In our method, the beat corresponds to the sequence of equally spaced phenomenal
impulses which define the tempo of the music (Scheirer, 1998). We assume the meter to
be 4/4, this being the most frequent meter of popular songs, and the tempo of the input
song to be constrained between 30 and 240 M.M. (Mälzel's Metronome: the number of quarter
notes per minute) and almost constant (Scheirer, 1998).
Our proposed rhythm tracking and extraction approach is shown in Figure 9. We
employ a discrete wavelet transform technique to decompose the music signal according
to octave scales. The frequency ranges of the octave scales are detailed in Table 1. The
system detects both onsets of musical notes (positions of the musical notes) and the
chord changes in the music signal. Then, based on musical knowledge, the quarter-note
time length is computed.
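As an illustration of this decomposition step, the following minimal Python sketch performs an octave-scale sub-band decomposition with a discrete wavelet transform. It assumes the PyWavelets package and a mono signal; the wavelet family ('db4') and the fixed seven-level decomposition are illustrative assumptions on our part rather than the exact settings of the system described here.

import numpy as np
import pywt

def octave_subband_decomposition(x, wavelet="db4", levels=7):
    # Decompose a mono music signal into octave-spaced sub-bands with a DWT.
    # pywt.wavedec returns [approximation, detail_level_n, ..., detail_level_1],
    # so the detail bands are ordered from the lowest-frequency octave to the highest.
    coeffs = pywt.wavedec(x, wavelet, level=levels)
    approximation, details = coeffs[0], coeffs[1:]
    return approximation, details

# Example: decompose one second of a synthetic 440 Hz tone sampled at 44.1 kHz.
fs = 44100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440.0 * t)
approx, details = octave_subband_decomposition(tone)
print(len(details), "detail sub-bands; approximation length:", len(approx))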
The onsets are detected by computing both frequency transients and energy
transients in the octave-scale decomposed signals, as described in Duxbury et al. (2002). In
order to detect both hard and soft onsets, we take the weighted summation of the onsets detected
in each sub-band, as shown in Equation (13), where Sb_i(t) is the onset computed in
the i-th sub-band at time t and On(t) is the weighted sum of the sub-band onsets
at time t.
On(t) = \sum_{i=1}^{8} w(i) \, Sb_i(t)                                          (13)

Figure 9. Rhythm tracking and extraction (the audio music is decomposed into octave-scale sub-bands 1 to 8 using wavelets; an onset detector computes energy and frequency transients with moving thresholds and autocorrelation to yield interbeat time information, while a musical chord detector matches the sub-band frequency spectrum against a frequency code book of musical chords, using a distance measure and peak tracking, to yield chord change time information; the two are combined using rhythm knowledge)

Based on a statistical analysis of the autocorrelation of the detected strong and weak
onset times On(t), we obtain an interbeat interval corresponding to the temporal length of
a quarter note. By increasing the sensitivity of peak tracking in the regions between the
detected quarter notes, we obtain a sixteenth-note level accuracy as shown in Figure 10.
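As an illustration of the autocorrelation step, the sketch below estimates the interbeat (quarter-note) interval from an onset-strength curve On(t) sampled on a regular time grid. The function name, the mean removal and the restriction of the search to the 30-240 M.M. tempo range are our own simplifications of the statistical analysis described above.

import numpy as np

def interbeat_interval_from_onsets(onset_strength, hop_seconds,
                                   min_bpm=30.0, max_bpm=240.0):
    # onset_strength: 1-D array of On(t), the weighted sum of sub-band onsets per hop.
    # hop_seconds:    time step between successive onset-strength samples.
    x = np.asarray(onset_strength, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]     # keep non-negative lags only
    min_lag = max(1, int((60.0 / max_bpm) / hop_seconds))   # fastest tempo, shortest lag
    max_lag = min(int((60.0 / min_bpm) / hop_seconds), len(acf) - 1)
    best_lag = min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
    return best_lag * hop_seconds                           # quarter-note length in seconds

In the actual system, the peak tracking between the detected quarter notes is then made more sensitive so that the estimate is refined to sixteenth-note resolution.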
We created a code book which contains the spectra of all possible major and minor
chord patterns. Musical chords are generated synthetically by mixing prerecorded
musical notes of different musical instruments (piano, bass guitar, acoustic guitar,
Roland synthesizer and a MIDI tone database). The size of the code book is reduced
by vector quantization (VQ). To detect the chord, we compute the distance
between the spectrum of the given signal segment and the spectral patterns in the code
book. The closest spectral pattern of the code book is assigned.
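The codebook lookup itself reduces to a nearest-neighbour search over spectral patterns. The sketch below assumes that the (vector-quantized) code book is a matrix holding one magnitude spectrum per chord label and uses a Euclidean distance over unit-normalised spectra; the actual distance measure and normalisation used in the system are not specified here.

import numpy as np

def identify_chord(segment_spectrum, codebook_spectra, chord_labels):
    # segment_spectrum: 1-D magnitude spectrum of one beat-spaced segment.
    # codebook_spectra: 2-D array, one (vector-quantized) chord spectrum per row.
    # chord_labels:     chord names, e.g. ["C major", "C minor", ...].
    seg = segment_spectrum / (np.linalg.norm(segment_spectrum) + 1e-12)
    book = codebook_spectra / (np.linalg.norm(codebook_spectra, axis=1, keepdims=True) + 1e-12)
    distances = np.linalg.norm(book - seg, axis=1)   # distance to every codeword
    return chord_labels[int(np.argmin(distances))]   # the closest spectral pattern is assigned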
The detected chord-change times are compared with the interbeat time information
(the quarter-note length) given by the onset detector. The chord change timings are
multiples of the quarter-note time length, as described in the section on Song Structure.
We group the detected sixteenth-note level pulses in fours, corresponding to the length of a quarter note. The music is then framed into quarter-note-spaced segments for vocal onset analysis based on the musical knowledge of chords.
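A minimal sketch of this framing step is given below; it assumes the sixteenth-note pulse times (in seconds) produced by the onset detector and a mono signal array, and the function name is ours.

import numpy as np

def beat_space_segments(x, sample_rate, sixteenth_pulse_times):
    # Every fourth sixteenth-note pulse marks a quarter-note boundary; the signal
    # is then sliced between consecutive boundaries to obtain beat-spaced segments.
    quarter_times = np.asarray(sixteenth_pulse_times)[::4]
    boundaries = np.round(quarter_times * sample_rate).astype(int)
    return [x[a:b] for a, b in zip(boundaries[:-1], boundaries[1:])]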

Figure 10. Three-second clip of a musical audio signal (a three-second excerpt from "Lady in Red" by Chris DeBurgh; the panels show the signal strength, the results of autocorrelation and the sixteenth-note level plot, against samples at a 44100 Hz sample rate)

Table 1. Musical note frequencies and their placement in the octave scale sub-bands

Sub-band No         01              02         03         04          05           06           07           08
Octave scale        ~B1, C2 to B2   C3 to B3   C4 to B4   C5 to B5    C6 to B6     C7 to B7     C8 to B8     all higher octave scales
Freq. range (Hz)    0~128           128~256    256~512    512~1024    1024~2048    2048~4096    4096~8192    8192~22050

Note    C2 to B2    C3 to B3    C4 to B4    C5 to B5    C6 to B6     C7 to B7     C8 to B8
C       65.406      130.813     261.626     523.251     1046.502     2093.004     4186.008
C#      69.296      138.591     277.183     554.365     1108.730     2217.460     4434.920
D       73.416      146.832     293.665     587.330     1174.659     2349.318     4698.636
D#      77.782      155.563     311.127     622.254     1244.508     2489.016     4978.032
E       82.407      164.814     329.628     659.255     1318.510     2637.020     5274.040
F       87.307      174.614     349.228     698.456     1396.913     2793.826     5587.652
F#      92.499      184.997     369.994     739.989     1479.978     2959.956     5919.912
G       97.999      195.998     391.995     783.991     1567.982     3135.964     6271.928
G#      103.826     207.652     415.305     830.609     1661.219     3322.438     6644.876
A       110.000     220.000     440.000     880.000     1760.000     3520.000     7040.000
A#      116.541     233.082     466.164     932.328     1864.655     3729.310     7458.620
B       123.471     246.942     493.883     987.767     1975.533     3951.066     7902.132
(The ~B1 octave covers 0~64 Hz and the C2 to B2 octave covers 64~128 Hz within sub-band 01.)

Silence Detection
Silence is defined as a segment of imperceptible music, including unnoticeable
noise and very short clicks. We use short-time energy to detect silence. The short-time
energy function of a music signal is defined as

E_n = \frac{1}{N} \sum_{m} [x(m) \, w(n-m)]^2                                          (14)

where x(m) is the discrete-time music signal, n is the time index of the short-time energy,
and w(n) is a rectangular window whose length is equal to the quarter-note time length,
that is

w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}          (15)

If the short-time energy function is continuously lower than a certain set of
thresholds (there may be durations in which the energy is higher than the threshold, but
these durations should be short enough and far enough apart from each other), then the
segment is indexed as silence. Silence segments are removed from the music sequence.
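A minimal sketch of this silence test over beat-spaced frames is shown below. It implements Equation (14) with the rectangular window of Equation (15); the energy threshold is left to the caller, since it is an application-dependent setting.

import numpy as np

def short_time_energy(frame):
    # Equation (14) with the rectangular window of Equation (15): E_n = (1/N) * sum(x^2).
    return float(np.mean(np.square(frame)))

def silent_frame_mask(frames, energy_threshold):
    # True for beat-spaced frames whose short-time energy stays below the threshold;
    # frames flagged as silence are removed from the music sequence.
    return [short_time_energy(f) < energy_threshold for f in frames]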

Feature Extraction

We propose a new frequency scaling, called the Octave Scale, instead of the Mel scale
to calculate Cepstral coefficients (see the section on Timbral Textual Features). The fundamental
frequencies F0 and the harmonic structures of musical notes follow an octave scale,
as shown in Table 1. Sung vocal lines always follow the instrumental line, such that both
pitch and harmonic structure variations are also on an octave scale. In our approach we
divide the whole frequency band into eight sub-bands corresponding to the octaves in
the music. The frequency ranges of the sub-bands are shown in Figure 11.
The useful range of fundamental frequencies of tones produced by musical
instruments is considerably less than the audible frequency range. The highest tone of
the piano has a frequency of 4186 Hz, and this seems to have evolved as a practical upper
limit for fundamental frequencies. We have considered the entire audible spectrum to
accommodate the harmonics (overtones) of the high tones. The range of fundamental
frequencies of the voice demanded in classical opera is 80~1200 Hz, which corresponds
to the low end of the bass voice and the high end of the soprano voice, respectively.
Linearly placing the maximum number of filters in the bands where the majority of
the singing voice is present gives better resolution of the signal in that range. Thus,
we use 6, 8, 12, 12, 8, 8, 6 and 4 filters in sub-bands 1 to 8, respectively. The octave
scale Cepstral coefficients are then extracted according to Equation (5).
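To make the feature computation concrete, the sketch below derives octave-scale Cepstral coefficients for a single beat-spaced frame: triangular filters are placed linearly inside each octave sub-band (6, 8, 12, 12, 8, 8, 6 and 4 filters, as above), log filter energies are taken and a DCT is applied, by analogy with MFCC computation. The triangular filter shape and the DCT stage are our assumptions standing in for Equation (5), which is defined earlier in the chapter.

import numpy as np
from scipy.fft import dct, rfft

OCTAVE_EDGES_HZ = [0, 128, 256, 512, 1024, 2048, 4096, 8192, 22050]
FILTERS_PER_BAND = [6, 8, 12, 12, 8, 8, 6, 4]

def octave_filterbank(n_fft, sample_rate):
    # Triangular filters laid out linearly inside each octave sub-band.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    filters = []
    for (lo, hi), n_filters in zip(zip(OCTAVE_EDGES_HZ[:-1], OCTAVE_EDGES_HZ[1:]),
                                   FILTERS_PER_BAND):
        corners = np.linspace(lo, hi, n_filters + 2)
        for i in range(n_filters):
            left, centre, right = corners[i], corners[i + 1], corners[i + 2]
            rising = (freqs - left) / (centre - left + 1e-12)
            falling = (right - freqs) / (right - centre + 1e-12)
            filters.append(np.clip(np.minimum(rising, falling), 0.0, None))
    return np.array(filters)

def octave_scale_cepstral_coefficients(frame, sample_rate, n_coeffs=12):
    # Magnitude spectrum -> log octave filterbank energies -> DCT (keep n_coeffs).
    spectrum = np.abs(rfft(frame))
    bank = octave_filterbank(len(frame), sample_rate)
    log_energies = np.log(bank @ spectrum + 1e-12)
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs]

The Mel-scale coefficients used for comparison can be computed in the same way with Mel-spaced filter centres.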
In order to make a comparison, we also extract Cepstral coefficients from the Mel
scale. Figure 12 illustrates the deviation of the third Cepstral coefficient derived from both
scales for PV, PI and IMV classes. The frame size is 30 ms without overlap. It can be seen
that the standard deviation is lower for the coefficients derived from the Octave scale,
which would make it more robust in our application.

Figure 11. Filter-band distribution in octave scale for calculating Cepstral coefficients (critical band filter positions over the eight sub-bands: 01: 0~128 Hz (~B1 and C2 to B2), 02: 128~256 Hz (C3 to B3), 03: 256~512 Hz (C4 to B4), 04: 512~1024 Hz (C5 to B5), 05: 1024~2048 Hz (C6 to B6), 06: 2048~4096 Hz (C7 to B7), 07: 4096~8192 Hz (C8 to B8), 08: 8192~22050 Hz (C9 and above))


Figure 12. Third Cepstral coefficient derived from the Mel scale (frames 1~1000, black lines) and the Octave scale (frames 1001~2000, ash-coloured lines) for (a) Pure Vocal (PV), (b) Pure Instrumental (PI) and (c) Instrumental Mixed Vocals (IMV), plotted against frame number

Statistical Learning
To find the semantic region boundaries, we use a two-layer hierarchical classification method, as shown in Figure 13. This has been proven to be more efficient than a
single-layer multiclass classification method (Xu et al., 2003; Maddage et al., 2003).
In the initial experiments (Gao et al., 2003; Maddage et al., 2003), it was found that PV
can be effectively classified from PI and IMV. Thus we separate PV from the other classes
in the first layer. In the second layer, we separate PI from IMV. However, PV regions in
popular music are rare compared with PI and IMV regions. In our experiments, SVM
and GMM are used as the classifiers in layers 1 and 2.
When the classifier is an SVM, layers 1 and 2 are modeled with a parameter-optimized radial
basis kernel function (Vapnik, 1998). The Expectation-Maximization (EM) algorithm
(Bilmes, 1998) is used to estimate the GMM parameters for layers 1 and 2. The orders of
Cepstral coefficients used for the Mel scale and the Octave scale in layers 1 and 2, for both
classifiers (SVM and GMM), and the number of Gaussian mixtures in each layer are shown in
Table 2.
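A minimal sketch of the two-layer scheme is given below, using scikit-learn's RBF-kernel SVM as a stand-in for the trained models; the class name and the frame-label conventions ("PV", "PI", "IMV") are our own. A GMM variant would replace each SVM with a pair of sklearn.mixture.GaussianMixture models (with the mixture counts of Table 2) and assign the class with the higher log-likelihood.

import numpy as np
from sklearn.svm import SVC

class HierarchicalVocalClassifier:
    # Layer 1: PV vs. (PI + IMV); Layer 2: PI vs. IMV, as in Figure 13.

    def __init__(self):
        self.layer1 = SVC(kernel="rbf")   # separates PV from the rest
        self.layer2 = SVC(kernel="rbf")   # separates PI from IMV

    def fit(self, features, labels):
        # features: (n_frames, n_dims) Cepstral features; labels: "PV", "PI" or "IMV".
        features, labels = np.asarray(features), np.asarray(labels)
        self.layer1.fit(features, labels == "PV")
        rest = labels != "PV"
        self.layer2.fit(features[rest], labels[rest] == "PI")
        return self

    def predict(self, features):
        features = np.asarray(features)
        predictions = np.empty(len(features), dtype="<U3")
        is_pv = self.layer1.predict(features).astype(bool)
        predictions[is_pv] = "PV"
        if (~is_pv).any():
            is_pi = self.layer2.predict(features[~is_pv]).astype(bool)
            predictions[~is_pv] = np.where(is_pi, "PI", "IMV")
        return predictions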

Experimental Results
Our experiments are performed using 15 popular English songs (Westlife: Moments, If I Let You Go, Flying Without Wings, My Love and Fragile Heart;
Backstreet Boys: Show Me the Meaning of Being Lonely, Quit Playing Games (with My
Heart), All I Have to Give, Shape of My Heart and Drowning; Michael Learns to
Rock (MLTR): Paint My Love, 25 Minutes, Breaking My Heart, How Many Hours
and Someday) and 5 Sri Lankan songs (Maa Baa la Kale, Mee Aba Wanaye, Erata
Akeekaru, Wasanthaye Aga and Sehena Lowak). All music data are sampled from
commercial CDs at a 44.1 kHz sample rate and 16 bits per sample in stereo.

Figure 13. Hierarchical classification (layer 1 separates the musical audio into Pure Vocals (PV) and Pure Instrumental + Instrumental Mixed Vocals (PI + IMV); layer 2 separates PI + IMV into Pure Instrumental (PI) and Instrumental Mixed Vocals (IMV))

Table 2. Orders of Cepstral coefficients and number of Gaussian mixtures (GMs) employed

Layers     Mel scale   Octave scale   No. of Gaussian Mixtures (GMs)
Layer 1    18          10             24 (PV), 28 (PI + IMV)
Layer 2    24          12             48 (PI), 64 (IMV)

We first conduct experiments on our proposed rhythm tracking algorithm to find
the quarter-note time intervals. We test on 30-second, 60-second and full-length intervals of each song
and note the average quarter-note time length calculated by the rhythm tracking
algorithm. Our system obtains an average accuracy of 95% in the number of
beats detected, with a 20-ms error margin on the quarter-note time intervals.
The music is then framed into quarter-note-spaced segments and experiments are
conducted for the detection of the class boundaries (PV, PI & IMV) of the music. Twenty
songs are used with cross-validation, where three and two songs of each artist are used for training
and testing, respectively, in each turn. We perform three types of experiments:

EXP1: training and testing songs are divided into 30-ms frames with 50% overlap.
EXP2: training songs are divided into 30-ms frames with 50% overlap, and testing songs are framed according to the quarter-note time interval.
EXP3: training and testing songs are framed according to the quarter-note time intervals.

Experimental results (in % accuracy) using SVM learning are shown in Table 3.
The Mel scale is optimized by placing more filters at lower frequencies (Deller et al.,
2000), because the dominant pitches of vocals and musical instruments lie at lower frequencies (< 4000 Hz) (Sundberg, 1987).
Though the training and testing sets for PV are small (not many songs have sections
of only singing voice), it is seen that in all experiments (EXP1~3) the classification
accuracy of PV is higher than that of the other two classes. However, when vocals are mixed with
instruments, finding the vocal boundaries becomes more difficult than for the other two
classes.

Table 3. Results of hierarchical classification using SVM (in % accuracy)

                         Mel scale                     Octave scale
Classifier               PV       PI       IMV        PV       PI       IMV
SVM       EXP1           72.35    68.98    64.57      73.76    73.18    68.05
          EXP2           67.34    65.18    64.87      75.22    74.15    73.16
          EXP3           74.96    72.38    70.17      85.73    82.96    80.36

The results of EXP1 demonstrate the higher performance of the Octave scale
compared with the Mel scale for the 30-ms frame size. In EXP2, a slightly better performance
can be seen for the Octave scale, but not for the Mel scale, compared with EXP1. This
demonstrates that Cepstral coefficients are sensitive to the frame length as well as to the
position of the filters on the Mel or Octave scales. EXP3 achieves the best
performance among EXP1~EXP3, demonstrating the importance of including
musical knowledge in this application. Furthermore, the better results obtained with the
Octave scale demonstrate its ability to model music signals better than the
Mel scale for this application.
The results obtained for EXP3 with SVM and GMM are compared in Figure 14. In Table 2, we have shown the numbers of Gaussian mixtures that
were empirically found to be good for the layer 1 and layer 2 classifications. Since the number
of Gaussian mixtures in layer 2 is higher than in layer 1, this reflects that classifying PI
from IMV is more difficult than classifying PV from the rest. It can be seen that SVM
performs better than GMM in identifying the region boundaries. We can thus infer that
this implementation of SVM, which uses a radial basis kernel function, is more efficient than the GMM method, which uses probabilistic
modeling via the EM algorithm.
Figure 14. Comparison between SVM and GMM in EXP3 (accuracy in %: PV: 85.73 for SVM vs. 84.96 for GMM; PI: 82.56 for SVM vs. 80.96 for GMM; IMV: 80.36 for SVM vs. 79.96 for GMM)


Future Work
In addition to semantic boundary detection in music, the beat space segmentation
platform is useful for music structural analysis, content-based music source separation
and automatic lyrics generation.

CONCLUSION
This chapter reviews past and current technical achievements in content-based
music summarization and classification. We summarized the state of the art in music
summarization, music genre classification and semantic region detection in music
signals. We also introduced our latest work in compressed-domain music summarization,
musical video summarization and semantic boundary detection in acoustical music
signals.
Although many advances have been achieved in the various research areas of content-based music summarization and classification, there are still many research issues that
need to be explored to make content-based music summarization and classification
approaches more applicable in practice. We also identified future research directions
in music summarization, music genre classification and semantic region detection in
music signals.

REFERENCES
Assfalg, J., Bertini, M., DelBimbo, A., Nunziati, W., & Pala, P. (2002). Soccer highlights
detection and recognition using HMMs. In Proceedings IEEE International
Conference on Multimedia and Expo, Lausanne, Switzerland (Vol. 1, pp. 825-828).
Aucouturier, J. J., & Pachet, F. (2003). Representing musical genre: A state of the art.
Journal of New Music Research, 32(1), 1-12.
Bartsch, M.A., & Wakefield, G.H. (2001). To catch a chorus: Using chroma-based
representations for audio thumbnailing. In Proceedings Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New
York (pp. 15-18).
Berenzweig, A. L., & Ellis, D. P. W. (2001). Locating singing voice segments within music
signals. In Proceedings IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA), New Paltz, New York (pp. 119-122).
Bilmes, J. (1998). A gentle tutorial on the EM algorithm and its application to parameter
estimation for Gaussian mixture and hidden Markov models. Technical Report
ICSI-TR-97-021, University of California, Berkeley.
Brown, J. C. (1999, March). Computer identification of musical instruments using pattern
recognition with cepstral coefficients as features. Journal of the Acoustical Society
of America, 105(3), 1064-1072.
Chai, W., & Vercoe, B. (2003). Music thumbnailing via structural analysis. In Proceedings
ACM International Conference on Multimedia, Berkeley, California, (pp. 223-226).
Cooper, M., & Foote, J. (2002). Automatic music summarization via similarity analysis.
In Proceedings International Conference on Music Information Retrieval, Paris
(pp. 81-85).

Cosi, P., De Poli, G., & Prandoni, P. (1994). Timbre characterization with mel-cepstrum and
neural nets. In Proceedings International Computer Music Conference (pp. 42-45).
Dale, N. B. (2003). C++ plus data structures (3rd ed.). Boston: Jones and Bartlett.
Deller, J.R., Hansen, J.H.L., & Proakis, J.G. (2000). Discrete-time processing of speech
signals. IEEE Press.
DeMenthon, D., Kobla, V., & Maybury, M.T. (1998). Video summarization by curve
simplification. In Proceedings of the ACM international conference on Multimedia, Bristol, UK (pp. 211-218).
Duxbury, C., Sandler, M., & Davies, M. (2002). A hybrid approach to musical note onset
detection. In Proceedings of the International Conference on Digital Audio
Effects, Hamburg, Germany (pp. 33-28).
Ellis, G. M. (1994). Electronic filter analysis and synthesis. Boston: Artech House.
Eronen, A., & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients
and temporal features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey (Vol. 2, pp. II753-II756).
Foote, J., Cooper, M., & Girgensohn, A. (2002). Creating music video using automatic
media analysis. In Proceedings ACM international conference on Multimedia,
Juan-les-Pins, France (pp. 553-560).
Fujinaga, I. (1998). Machine recognition of timbre using steady-state tone of acoustic
musical instruments. In Proceedings International Computer Music Conference
(pp. 207-210).
Gao, S., Maddage, N.C., & Lee, C.H. (2003). A hidden Markov model based approach to
musical segmentation and identification. In Proceedings IEEE Pacific-Rim Conference on Multimedia (PCM), Singapore (pp. 1576-1580).
Gong, Y., Liu, X., & Hua, W. (2001). Summarizing video by minimizing visual content
redundancies. In Proceedings IEEE International Conference on Multimedia and
Expo, Tokyo (pp. 788-791).
Goto, M. (2001). An audio-based real-time beat tracking system for music with or without
drum-sounds. Journal of New Music Research, 30(2), 159-171.
Goto, M., & Muraoka, Y. (1994). A beat tracking system for acoustic signals of music.
In Proceedings of the Second ACM International Conference on Multimedia (pp.
365-372).
Grimaldi, M., Kokaram, A., & Cunningham, P. (2003). Classifying music by genre using
a discrete wavelet transform and a round-robin ensemble. Work report. Trinity
College, University of Dublin, Ireland.
Gunsel, B., & Tekalp, A. M. (1998). Content-based video abstraction. In Proceedings
IEEE International Conference on Image Processing, Chicago, Illinois (Vol. 3, pp.
128-132).
Hori, C., & Furui, S. (2000). Improvements in automatic speech summarization and
evaluation methods. In Proceedings International Conference on Spoken Language Processing, Beijing, China (Vol. 4, pp. 326-329).
Jiang, D., Lu, L., Zhang, H., Tao, J., & Cai, L. (2002). Music type classification by spectral
contrast feature. In Proceedings IEEE International Conference on Multimedia
and Expo, Lausanne, Switzerland (Vol. 1, pp. 113-116).
Kashino, K., & Murase, H. (1997). Sound source identification for ensemble music based
on the music stream extraction. In Proceedings International Joint Conference on
Artificial Intelligence, Nagoya, Aichi, Japan (pp. 127-134).

Kim, Y. K. (1999). Structured encoding of the singing voice using prior knowledge of the
musical score. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 47-50).
Kim, Y. K., & Brian, W. (2002). Singer identification in popular music recordings using
voice coding feature. In Proceedings International Symposium of Music Information Retrieval (ISMIR), Paris.
Kraft, R., Lu, Q., & Teng, S. (2001). Method and Apparatus for Music Summarization and
Creation of Audio Summaries. U.S. Patent No. 6,225,546. Washington, DC: U.S.
Patent and Trademark Office.
Logan, B., & Chu, S. (2000). Music summarization using key phrases. In Proceedings
IEEE International Conference on Audio, Speech and Signal Processing, Istanbul,
Turkey (Vol. 2, pp. II749-II752).
Lu, L., & Zhang, H. (2003). Automated extraction of music snippets. Proceedings ACM
International Conference on Multimedia, Berkeley, California (pp. 140-147).
Maddage, N.C., Wan, K., Xu, C., & Wang, Y. (2004a). Singing voice detection using twice-iterated composite Fourier transform. In Proceedings International Conference on
Multimedia and Expo, Taipei, Taiwan.
Maddage, N.C., Xu, C., & Wang, Y. (2003). A SVM-based classification approach to
musical audio. In Proceedings International Symposium of Music Information
Retrieval, Baltimore, Maryland (pp. 243-244).
Maddage, N.C., Xu, C., & Wang, Y. (2004b). Singer identification based on vocal and
instrumental models. In Proceedings International Conference on Pattern Recognition, Cambridge, UK.
Maddage, N.C., Xu, C. S., Lee, C. H., Kankanhalli, M.S., & Tian, Q. (2002). Statistical
analysis of musical instruments. In Proceedings IEEE Pacific-Rim Conference on
Multimedia, Taipei, Taiwan (pp. 581-588).
Mani, I., & Maybury, M.T. (Eds.). (1999). Advances in automatic text summarization.
Boston: MIT Press.
Martin, K. D. (1999). Sound-source recognition: A theory and computational model.
PhD thesis, MIT Media Lab.
Nakamura, Y., & Kanade, T. (1997). Semantic analysis for video contents extraction:
Spotting by association in news video. In Proceedings of ACM International
Multimedia Conference, Seattle, Washington (pp. 393-401).
Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genre. In Proceedings Content-Based Multimedia Information Access Conference, Paris.
Pachet, F., Westermann, G., & Laigre, D. (2001). Musical data mining for EMD. In
Proceedings WedelMusic Conference, Italy.
Patel, N. V., & Sethi, I. K. (1996). Audio characterization for video indexing. In Proceedings SPIE Storage and Retrieval for Still Image and Video Databases IV, San Jose,
California, 2670 (pp. 373-384).
Pfeiffer, S., Lienhart, R., Fischer, S., & Effelsberg, W. (1996). Abstracting digital movies
automatically. Journal of Visual Communication and Image Representation, 7(4),
345-353.
Pye, D. (2000). Content-based methods for the management of digital music. In Proceedings IEEE International Conference on Audio, Speech and Signal Processing,
Istanbul ,Turkey (Vol. 4, pp. 2437-2440).


Rabiner, L.R., & Juang, B.H. (1993). Fundamentals of speech recognition. New York:
Prentice Hall.
Rabiner, L.R., & Schafer, R.W. (1978). Digital processing of speech signals. New York:
Prentice Hall.
Rossing, T.D., Moore, F.R., & Wheeler, P.A. (2002). Science of sound (3rd ed.). Boston:
Addison-Wesley.
Saitou, T., Unoki, M., & Akagi, M. (2002). Extraction of f0 dynamic characteristics and
developments of control model in singing voice. In Proceedings of the 8th
International Conference on Auditory Display, Kyoto, Japan (pp. 275-278).
Scheirer, E.D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of
the Acoustical Society of America, 103(1), 588-601.
Shao, X., Xu, C., & Kankanhalli, M.S. (2003). Automatically generating summaries for
musical video. In Proceedings IEEE International Conference on Image Processing, Barcelona, Spain (vol. 3, pp. II547-II550).
Shao, X., Xu, C., Wang, Y., & Kankanhalli, M.S. (2004). Automatic music summarization
in compressed domain. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada.
Sundaram, H., Xie, L., & Chang, S.F. (2002). A utility framework for the automatic
generation of audio-visual skims. In Proceedings ACM International Conference
on Multimedia, Juan-les-Pins, France (pp. 189-198).
Sundberg, J. (1987). The science of the singing voice. Northern Illinois University Press.
Ten Minute Master No 18: Song Structure. (2003). MUSIC TECH Magazine, 62-63.
Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multipitch analysis
model. IEEE Transactions on Speech and Audio Processing, 8(6), 708-716.
Tsai, W.H., Wang, H.M., Rodgers, D., Cheng, S.S., & Yu, H.M. (2003). Blind clustering
of popular music recordings based on singer voice characteristics. In Proceedings
International Symposium of Music Information Retrieval, Baltimore, Maryland
(pp. 167-173).
Tzanetakis, G., & Cook, P. (2000). Sound analysis using MPEG compressed audio. In
Proceedings IEEE International Conference on Acoustics, Speech, and Signal
Processing, Istanbul, Turkey (Vol. 2, pp. II761-II764).
Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE
Transactions on Speech and Audio Processing, 10(5), 293-302.
Tzanetakis, G., Essl, G., & Cook, P. (2001). Automatic musical genre classification of audio
signals. Proceedings International Symposium on Music Information Retrieval,
Bloomington, Indiana (pp. 205-210).
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.
Wang, Y., & Vilermo, M. (2001). A compressed domain beat detector using MP3 audio
bit streams. In Proceedings ACM International Conference on Multimedia,
Ottawa, Ontario, Canada (pp. 194-202).
Wold, E., Blum, T., Keislar, D., & Wheaton, J. (1996). Content-based classification, search
and retrieval of audio. IEEE Multimedia, 3(3), 27-36.
Xu, C., Maddage, N.C., Shao, X., Cao, F., & Tian, Q. (2003). Musical genre classification
using support vector machines. In Proceedings IEEE International Conference
on Acoustics, Speech, and Signal Processing, Hong Kong, China (pp. V429-V432).


Xu, C., Zhu, Y., & Tian, Q. (2002). Automatic music summarization based on temporal,
spectral and Cepstral features. In Proceedings IEEE International Conference on
Multimedia and Expo, Lausanne, Switzerland (pp. 117-120).
Xu, C.S., Maddage, N.C., & Shao, X. (in press). Automatic music classification and
summarization. IEEE Transactions on Speech and Audio Processing.
Yow, D., Yeo, B.L., Yeung, M., & Liu, G. (1995). Analysis and presentation of soccer
highlights from digital video. In Proceedings of Asian Conference on Computer
Vision, Singapore.
Zhang, T. (2003). Automatic singer identification. In Proceedings International Conference on Multimedia and Expo, Baltimore, Maryland (Vol. 1, pp. 33-36).
Zhang, T., & Kuo, C.C.J. (2001). Audio content analysis for online audiovisual data
segmentation and classification. IEEE Transactions on Speech and Audio
Processing, 9(4), 441-457.


Chapter 6

A Multidimensional
Approach for Describing
Video Semantics
Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia

ABSTRACT

In order to manage large collections of video content, we need appropriate video
content models that can facilitate interaction with the content. The important issue for
video applications is to accommodate different ways in which a video sequence can
function semantically. This requires that the content be described at several levels of
abstraction. In this chapter we propose a video metamodel called VIMET and describe
an approach to modeling video content such that video content descriptions can be
developed incrementally, depending on the application and video genre. We further
define a data model to represent video objects and their relationships at several levels
of abstraction. With the help of an example, we then illustrate the process of developing
a specific application model that develops incremental descriptions of video semantics
using our proposed video metamodel (VIMET).

INTRODUCTION
With the convergence of Internet and Multimedia technologies, video content
holders have new opportunities to provide novel media products and services, by
repurposing the content and delivering it over the Internet. In order to support such
applications, we need video content models that allow video sequences to be represented and managed at several levels of semantic abstraction. Modeling video content
to support semantic retrieval is a hard task, because video semantics means different
things to different people. The MPEG-7 Community (ISO/ISE 2001) has spent considerable effort and time in coming to grips with ways to describe video semantics at several
levels, in order to support a variety of video applications. The task of developing content
models that show the relationships across several levels of video content descriptions
has been left to application developers. Our aim in this chapter is to provide a framework
that can be used to develop video semantics for specific applications, without limiting
the modeling to any one domain, genre or application.
The Webster Dictionary defines the meaning of semantics as "the study of
relationships between signs and symbols and what they represent." In a way, from
the perspective of feature analysis work (MPEG, 2000; Rui, 1999; Gu, 1998; Flickner et al.,
1995; Chang et al., 1997; Smith & Chang, 1997), low-level audiovisual features can be
considered as a subset, or a part of, the visual signs and symbols that convey a meaning.
In this context, audio and video analysis techniques have provided a way to model video
content using some form of constrained semantics, so that video content can be retrieved
at some basic level such as shots. In the larger context of video information systems,
it is now clear that feature analyses alone are not adequate to support video applications.
Consequently, research focus has shifted to analysing videos to identify higher-level
semantic content such as objects and events. More recently, video semantic modeling
has been influenced by film theory or semiotics (Hampapur, 1999; Colombo et al., 2001;
Bryan-Kinns, 2000), where a meaning is conveyed through a relationship of signs and
symbols that are manipulated using editing, lighting, camera movements and other
cinematic techniques. Whichever theory or technology one chooses to follow, it is clear
that we need a video model that allows us to specify relationships between signs and
symbols across video sequences at several levels of interpretation (Srinivasan et al.,
2001).
The focus of this chapter is to present an approach to modeling video content, such
that video semantics can be described incrementally, based on the application and the
video genre. For example, while describing a basketball game, we may wish to describe
the game at several levels: the colour and texture of players uniforms, the segments that
had the crowd cheering loudly, the goals scored by a player or a team, a specific movement
of a player and so on. In order to facilitate such descriptions, we have developed a
framework that is generic and not definitive, but still supports the development of
application-specific semantics.
The next section provides a background survey of some of these approaches used
to model and represent the semantics associated with video content. In the third section,
we present our Video Metamodel Framework (VIMET) that helps to model video
semantics at different levels of abstraction. It allows users to develop and specify their
own semantics, while simultaneously exploiting results of video analysis techniques. In
the fourth section, we present a data model that implements the VIMET metamodel. In
the fifth section we present an example. Finally, the last section provides some conclusions and future directions.


BACKGROUND
In order to highlight the challenges and issues involved in modeling video
semantics, we have organized video semantic modeling approaches into four broad
categories and cite a few related works under each category.

Feature Extraction Combined with Clustering or Temporal Grouping
Most approaches that fall under this category have focused on extracting visual
features such as colour and motion (Rui et al., 1999; Gu, 1998; Flickner et al., 1995), and
abstracting visual signs to identify semantic information such as objects, roles, events
and actions. Some approaches have also included audio analysis to extract semantics
from videos (MPEG MAAATE, 2000). We recognize two types of features that are
extracted from videos: static features and motion-based features. Static features are
mostly perceptual features extracted from stationary images or single frames. Examples
are colour histograms, texture maps, and shape polygons (Rui et al., 1999). These features
have been exploited as attributes that represent colour, texture and shape of objects and
images in a single frame. Motion-based features are those that exploit the basic
spatiotemporal nature of videos (Chang et al., 1997). Examples of motion-based low-level
features are motion-vectors for visual features and frequency bands for audio signals.
These features have been used to segment videos into shots or temporal segments. Once
the features are extracted, they are clustered into groups or segmented into temporal units
to classify video content at a higher semantic level. Hammoud et al. (2001) use such an
approach for modeling video semantics. They extract shots, use clustering to identify
similar shots, and use a time-space graph to represent temporal relationships between
clusters for extracting scenes as semantic units. Hacid et al. (2000) present a database
approach for modeling and querying video data. They use two layers for representing
video content: feature and content layer, and semantic layer. The lowest layer is
characterized by a set of techniques and algorithms for visual features and relationships.
The top layer contains objects of interest, their description and their relationships. A
significant component of this approach is the use of temporal cohesion to attach
semantics to the video.

Relationship-Based Approach for Modeling Video Semantics
This approach is based on understanding the different types of relationships
among video objects and their features. For example, in videos, spatial and temporal
relationships of features that occur over space and time have contributed significantly
to identifying content at a higher level of abstraction. For example, in the MPEG domain,
DCT coefficients and motion vectors are compared over a temporal interval to identify
shots and camera operations (Meng et al., 1995). Similarly, spatial relationships of objects
over multiple frames are used to identify moving objects (Gu, 1998). Smith and Benitez
(2000) present a Multimedia-Extended Entity Relationship (MM-EER) model that captures various types of structural relationships such as spatiotemporal, intersection, and
composition, and provide a unified multimedia description framework for content-based
description of multimedia in large databases. The modeling framework supports different
types of relationships such as generalization, aggregation, association, structural
relationship, and intensional relationship. Baral et al. (1998) also extended ER diagrams
with a special attribute called core. This attribute stores the real object in contrast to
the abstraction, which is reflected in the rest of the attributes. Pradhan et al. (2001) propose
a set of interval operations that can be used to build an answer interval for a given query
by synthesizing video intervals containing keywords similar to those of the query.

Theme-Based Annotation Synchronized to Temporal Structure
This approach is based on using textual annotations synchronized to a temporal
structure (Yap et al. 1996). This approach is often used in television archives where
segments are described thematically and linked to temporal structures such as shots or
scenes. Hjelsvold and Midtstraum (1994) present a general data model for sharing and
reuse of video materials. Their data model is based on two basic ideas: temporal structural
representation of video and thematic annotations and relationships between them.
Temporal structure of video includes frame, shot, scene, sequence, and so forth. The data
model allows users to describe frame sequences using thematic annotations. It allows
detailed descriptions of the content of the video material, which are not necessarily linked
to structural components. More often they are linked to arbitrary frame sequences by
establishing a relationship between the frame sequences and relevant annotations,
which are independent of any structural components. Other works that fall under this
category include (Bryan-Kinns, 2000; Srinivasan, 1999).

Approach Based on Knowledge Models and Film Semiotics
As we move up the semantic hierarchy to represent higher-level semantics in
videos, in addition to extracting features and establishing their relationships, we need
knowledge models that help us understand higher-level concepts that are specific to a
video genre. Models that aim to represent content at higher semantic levels use some
form of knowledge models or semiotic and film theory to understand the semantics of
videos. Film theory or semiotics describes how a meaning is conveyed through a
relationship of signs and symbols that are meaningful to a particular culture. This is
achieved by manipulation of editing, lighting, camera movements and other cinematic
techniques.
Colombo et al. (2001) present a model that allows retrieval of commercial content
based on their salient semantics. The semantics are defined from the semiotic perspective, that is, collections of signs and semantic features like colour, motion, and so forth,
are used to build semantics. It classifies commercials into four different categories based
on semiotics: practical, critical, utopic and playful. A set of rules is then defined to map
the set of perceptual features to each of the four semiotic categories for commercials.
Bryan-Kinns (2000) presents a framework (VCMF) for video content modeling
which represents the structural and semantic regularities of classes of video recordings.
VCMF can incorporate many domain-specific knowledge models to index video at the
semantic level.

Hampapur (1999) presents an approach based on the use of knowledge models to
build domain-specific video information systems. His aim is to provide higher-level
semantic access to video data. The knowledge model proposed in his paper includes
the semantic model, cinematic model and physical world model. Each model has several
sub-models; for example, the cinematic model includes a camera motion model and a
camera geometry model.
We realize that the above categories are not mutually exclusive and many models
and frameworks use a combination of two or more of these modeling strategies.
We now describe our metamodel framework. It uses feature extraction, multimodal
relationships, and semiotic theory to model video semantics at several levels of abstraction. In a way, we use a combination of the four approaches described above.

VIDEO METAMODEL (VIMET)


Figure 1 shows the video metamodel, which we call VIMET. The horizontal axis
shows the temporal dimension of the video and the vertical axis shows the semantic
dimension. A third dimension represents the domain knowledge associated with specialized interpretations of a film or a video (Metz, 1974; Lindley & Srinivasan, 1999). The
metamodel represents: (1) static and motion-based features in both the audio and visual
domains, (2) different types of spatial and temporal relationships, (3) multimodal, multifeature
relationships to identify events, and (4) concepts influenced by principles of film theory.
The terms and concepts used in the metamodel are described below.
The bottom layer shows the standard way in which a video is represented to show
temporal granularity.

Figure 1. A multidimensional data model for building semantics (the shadow part represents the temporal (dynamic) component). The temporal dimension spans object, frame, shot, scene and clip; the semantic dimension rises from audio-video content representation through audio-video features (static, motion-based), single-feature relationships (spatial, temporal) and multifeature, multimodal relationships (structure of features, order of events) to semantic concepts; domain knowledge forms the third dimension.


Object: It represents a single object within a frame. An object may be identified manually or automatically if an appropriate object detection algorithm exists.
Frame: It represents a single image of a video sequence. There are 25 frames (or 30 in certain video formats) in a video of one-second duration.
Shot: A shot is a temporal video unit between two distinct camera operations. Shot boundaries occur due to different types of camera operations such as cuts, dissolves, fades and wipes. There are various techniques used to determine shots. For example, DCT coefficients are used in MPEG videos to identify distinct shot boundaries.
Scene: A scene is a series of consecutive shots constituting a unit from the narrative point of view and sharing some thematic visual content. For example, a video sequence that shows Martina Hingis playing the game point in the 1997 Australian Open final is a thematic scene.
Clip: A clip is an arbitrary length of video used for a specific purpose. For example, a video sequence that shows tourist places around Sydney is a video clip made for tourism purposes.

The semantic dimension helps model semantics in an incremental way using
features and their relationships. It uses information from the temporal dimension to
develop the semantics.

Static features: Represent features extracted from objects and frames. Shape, color
and texture are examples of static features extracted from the objects. Similarly,
global color histogram and average colors are examples of features extracted from
a frame.
Motion-based features: Represent features extracted from video using motion-based information. An example of such a feature is the motion vector.

The next level of conceptualisation occurs when we group individual features to
identify objects and events using some criteria based on spatial and/or temporal
relationships.

Spatial Relationships: Two types of spatial relationships are possible for videos:
topological relationships and directional relationships. Topological relationships
include relations such as contains, covered by, and disjoint (Egenhofer & Franzosa,
1991). Directional relationships include relations such as right-of, left-of, above
and below (Frank, 1996).
Temporal Relationships: This includes the most commonly used temporal relationships from Allen's 13 temporal relationships (Allen, 1983). These include
temporal relations such as before, meets, overlaps, finishes, starts, contains,
equals, during, started by, finished by, overlapped by, met by and after.
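For illustration, a few of Allen's relations can be written directly as predicates over (start, end) time intervals; the sketch below is our own and uses the usual strict-inequality definitions.

def before(a, b):
    # Interval a = (start, end) ends strictly before interval b begins.
    return a[1] < b[0]

def meets(a, b):
    # a ends exactly where b starts.
    return a[1] == b[0]

def overlaps(a, b):
    # a starts first and ends inside b.
    return a[0] < b[0] < a[1] < b[1]

def during(a, b):
    # a lies strictly inside b.
    return b[0] < a[0] and a[1] < b[1]

# Example: a shot spanning (12.0, 15.5) seconds occurs before one spanning (16.0, 18.0).
print(before((12.0, 15.5), (16.0, 18.0)))   # True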

Moving further up in the semantic dimension, we use multiple features both in the
audio and the visual domain, and establish (spatial and temporal) relationships across
these multimodal features to model higher-level concepts. In the temporal dimension,
this helps us identify semantic constructs based on structure of features and order of
events occurring in a video.

Structure of Features: Represents patterns of features that define a semantic construct. For example, a group of regions connected through some spatial relationship over multiple frames, combined with some loudness value spread over a temporal segment, could give an indication of a particular event occurring in the video.
Order of Events: Represents recurring patterns of events identified manually or detected automatically using multimodal feature relationships. For example, a camera pan is a cinematic event that is derived by using a temporal relationship of motion vectors. (The motion vectors between consecutive frames due to the pan should point in a single direction that exhibits a strong modal value corresponding to the camera movement.) Similarly, a sudden burst of sound is a perceptual auditory event. These cinematic and perceptual events may be arranged in a certain sequence from a narrative point of view to convey a meaning. For example, a close-up followed by a loud sound could be used to produce a dramatic effect. Similarly, a crowd cheer followed by a scorecard could be used to determine field goals in basketball videos. (We will describe this in greater detail later in our example.)
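As a sketch of how such an order of events could be operationalised, the fragment below flags a candidate field goal whenever a crowd-cheer audio event is followed, within a short gap, by a scorecard visual event; representing events as (start, end) intervals and the gap parameter are our own illustrative assumptions.

def field_goal_candidates(cheer_events, scorecard_events, max_gap_seconds=5.0):
    # Both inputs are lists of (start, end) times in seconds. A (cheer, scorecard)
    # pair is kept when the scorecard starts shortly after the cheer ends.
    candidates = []
    for cheer in cheer_events:
        for card in scorecard_events:
            gap = card[0] - cheer[1]
            if 0.0 <= gap <= max_gap_seconds:
                candidates.append((cheer, card))
    return candidates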

Modeling the meaning of a video, shot, or sequence requires the description of
the video object at several levels of interpretation. Film semiotics, pioneered by the film
theorist Christian Metz (1974), has identified five levels of cinematic codification that
cover visual features, objects, actions and events depicted in images together with other
aspects of the meaning of the images. These levels are represented in the third dimension
of the metamodel. The different levels interact together, influencing the domain knowledge associated with a video.

Perceptual level: This is the level at which visual phenomena become perceptually
meaningful, the level at which distinctions are perceived by the viewer. This is the
level that is concerned with features such as colour, loudness and texture.
Cinematic level: This level is concerned with formal film and video editing
techniques that are incorporated to produce expressive artifacts. For example,
arranging a certain rhythmic pattern of shots to produce a climax, or introducing
voice-over to shift the gaze.
Diegetic level: This refers to the four-dimensional spatiotemporal world posited
by a video image or a sequence of video images, including spatiotemporal
descriptions of objects, actions, or events that occur within that world.
Connotative level: This level of video semantics is the level of metaphorical,
analogical and associative meanings that the objects and events in a video may
have. An example of connotative significance is the use of facial expression to
denote some emotion.
Subtextual level: This is the level of more specialized, hidden and suppressed
meanings of symbols and signifiers that are related to special cultural and social
groups.

The main idea of this metamodel framework is to allow users to develop their own
application models, based on their semantic notion and interpretation, by specifying
objects and relationships of interest at any level of granularity.


Next we describe the data model that implements the ideas presented in the VIMET
metamodel. The elements of the data model allow application developers to incrementally develop the video semantics that need to be modeled and represented in the context
of the application domain.

VIDEO DATA MODEL


The main elements of the data model are: (i) Video Object (VO), to model a video
sequence of any duration; (ii) VO Attributes, which are either explicitly specified or based
on automatically detected audio, visual, spatial and temporal features; (iii) VO Attribute-level relationships, for computed attribute values; (iv) Video Concept Object (VCO), to
accommodate fuzzy descriptions; and (v) Object-level relationships, to support multimodal
and multifeature relationships across video sequences.

Video Object (VO)


We define a video object (VO) as an abstract object that models a video sequence
at any level of abstraction in the semantic-temporal dimension shown in Figure 1. This
is the most primitive object that can be used in a query.
Definition: A typical Video Object (VO) is a five-tuple
VO = <Xf, Sf, Tf, Vf, Af>
where
Xf is a set of textual attributes,
Sf is a set of spatial attributes,
Tf is a set of temporal attributes,
Vf is a set of visual attributes, and
Af is a set of audio attributes.
Xf represents a set of textual attributes that describe the semantics associated with
the object at different levels of abstraction. This could be metadata or any textual
description a manual annotation of the content or about the content.
Sf represents spatial attributes that specify the spatial bounds of the object. This
pertains to the space occupied by the object in a two-dimensional plane. This could be
X and Y positions of the object or a bounding box of the object.
Ti represents the temporal attributes to describe the temporal bounds of the object.
Temporal attributes includes start time, end time, and so forth. A time interval is a basic
primitive used here.
Vf represents a set of attributes that characterize the object in terms of visual
features. The values of these attributes are typically the result of visual feature extraction
algorithms. Examples of such attributes/features are colour histograms and motion
vectors.
Af represents a set of attributes that characterize the object in terms of aural
features. The values of these attributes are typically the result of audio analysis and
feature extraction algorithms. Examples of such attributes/features are loudness curves
and pitch values.
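A minimal sketch of how the VO five-tuple might be held as a data structure is given below; the Python class and field names are illustrative assumptions and are not part of the VIMET specification.

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class VideoObject:
    # Illustrative container for the VO five-tuple <Xf, Sf, Tf, Vf, Af>.
    # Each attribute set is held as a name-value dictionary.
    textual: Dict[str, Any] = field(default_factory=dict)   # Xf: annotations, transcripts, keywords
    spatial: Dict[str, Any] = field(default_factory=dict)   # Sf: e.g. bounding box (x1, y1, x2, y2)
    temporal: Dict[str, Any] = field(default_factory=dict)  # Tf: start time, end time, duration
    visual: Dict[str, Any] = field(default_factory=dict)    # Vf: e.g. colour histogram, motion vectors
    audio: Dict[str, Any] = field(default_factory=dict)     # Af: e.g. loudness curve, pitch values

# Example: the shot of Figure 2, with a duration and a loudness histogram
shot = VideoObject(temporal={"duration": 20.0},
                   audio={"loudness_histogram": [1, 2, 3, 4, 5]})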

Figure 2. A typical video object with two sets of attributes: audio and temporal
(The figure shows a VO with an audio attribute, a loudness histogram with values in [1, ..., 5], and a temporal attribute, a duration of 20 seconds, each shown as a name-value pair.)

A typical video object, a video shot, is shown in Figure 2. The diagram shows a shot with the temporal attribute duration and the audio attribute loudness histogram. (To keep the diagram simple, we have shown only a subset of possible attributes of a video object.) Each attribute has a value domain and is shown as a name-value pair. For example, the audio attribute set has an audio attribute loudness whose value is given by a loudness histogram. Similarly, the temporal attribute set has an attribute duration whose value is given in seconds.
The applications determine the content of the video objects modeled. For example, a video object could be a frame in the temporal dimension, with an attribute specifying its location in the video. In the semantic dimension, the same frame could have visual features such as colour and texture histograms.

VO Attributes
The attributes of a video object are either intensional or extensional. Extensional
attribute values are text data, drawn from the character domain. The possible sources of
extensional attributes are annotation, transcripts, keywords, textual description, terms
from a thesaurus, and so forth. Extensional attributes fall in the feature category in Figure
1. Intensional attributes have specific value domains where the values are computed
using appropriate feature extraction functions. Where the extensional attributes of the
video objects are semantic in nature, relationships across such objects can be expressed
as association, aggregation, and generalisation as per object-oriented or EER modeling
methodologies. For intensional attributes, however, we need to establish specific
relationships for each attribute type. For example, when we consider the temporal
attribute whose value domain is time-interval, we need to specify temporal relationships and corresponding operations that are valid over time intervals.
We define two sets of value domains for intensional attributes. The first set is the
numerical domain and the second set is the linguistic domain. The purpose of providing a linguistic domain is to allow a fuzzy description of some attributes. For example, a temporal attribute duration could have a value equal to 20 (seconds) in the numerical domain. For the same temporal attribute, we define a second domain, which is a constrained set of linguistic terms such as "short", "long" and "very long". For this, we use a fuzzy linguistic framework, where the numerical values of duration are mapped to fuzzy terms such as "short", "long", and so forth (Driankov et al., 1993). The fuzzy linguistic framework is defined as a four-tuple <X, LX, QX, MX> where

X = the symbolic name of a linguistic variable (an attribute in our case), such as duration.
LX = the set of linguistic terms that X can take, such as "short", "long" and "very long".
QX = the numeric domain in which X can take values, such as time in seconds.
MX = a semantic function which gives meaning to the linguistic terms by mapping X's values from LX to QX. The function MX depends on the type of attribute and the application domain. We defined one such function in Nepal et al. (2001).
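A small sketch of one possible semantic function MX for the duration attribute is shown below. The thresholds are invented for illustration and this is not the function defined in Nepal et al. (2001); a full fuzzy treatment would return membership degrees for each term rather than a single crisp term.

def duration_to_fuzzy_term(seconds: float) -> str:
    # Map a numeric duration (a value from QX, in seconds) to a linguistic
    # term from LX. The boundaries below are illustrative only; in practice
    # MX is tuned to the attribute type and the application domain.
    if seconds < 5:
        return "very short"
    elif seconds < 15:
        return "short"
    elif seconds < 60:
        return "medium"
    elif seconds < 180:
        return "long"
    return "very long"

print(duration_to_fuzzy_term(5))   # -> "short"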

Similarly, when we consider an audio attribute value based on the loudness histogram/curve, we map the numerical values to semantic terms such as "loud", "soft", "very loud" and so on. By using a fuzzy linguistic framework, we are utilising the human ability to distinguish variations in audio and visual features. Figure 3 shows the mapping of a numerical attribute-value domain to a linguistic attribute-value domain.

VO Attribute-Level Relationships
Relationships across video objects can be defined at two levels: the object and attribute levels. We will discuss the object-level relationships in a later section. Here we discuss attribute-level relationships to illustrate the capabilities of the data model.
Figure 3. Mapping of numerical attribute values to fuzzy attribute values
(The figure shows the audio attribute loudness, with numerical values in [1, ..., 5], mapped to the fuzzy terms soft, loud, very soft and very loud, and the temporal attribute duration, with a value of 20 seconds, mapped to the fuzzy terms very short, short, medium, long and very long.)


Table 1. Attribute-level relationships

Attribute   | Relationship Type                   | Relationship Value Set
Brightness  | Visual, based on light intensity    | {Brighter, Dimmer, Similar}
Contrast    | Visual, based on contrast measures  | {Higher, Lower, Similar}
Color       | Visual, based on color              | {Same, Different, Resembles}
Loudness    | Audio, based on sound level         | {Louder, Softer, Similar}
Pitch       | Audio, based on pitch               | {Higher, Lower, Similar}
Size        | Visual, based on size               | {Bigger, Smaller, Similar}
Duration    | Temporal, based on duration         | {Shorter, Longer, Equal}

Each intensional attribute value allows us to define relationships that are valid for that particular attribute-value domain. For example, when we consider visual relationships based on the brightness attribute, we should be able to establish relationships such as brighter-than, dimmer-than, and similar by grouping light intensity values. By exploiting such attribute-level relationships across each intensional attribute, it is possible to establish a multidimensional relationship between video objects. Here each dimension reflects a perceivable variation in the computed values of the relevant intensional attribute.
Table 1 shows an illustrative set of attribute-level relationships. The relationship values are drawn from a predefined constrained set of relationships valid for a particular attribute type. A typical relationship value set for brightness is {brighter-than, dimmer-than, similar}. Each relationship in the set is given a threshold based on the eye's ability to discriminate between light intensity levels. The fuzzy scale helps in accommodating subjectivity at the user level.
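As a sketch of how one such attribute-level relationship could be computed, the function below compares the brightness values of two video objects against a tolerance. The scalar brightness representation and the threshold value are assumptions made for illustration.

def brightness_relationship(b1: float, b2: float, threshold: float = 0.1) -> str:
    # Return the relationship of object 1 to object 2 for the brightness
    # attribute, using the value set of Table 1. `threshold` stands in for
    # the eye's ability to discriminate light intensity levels; both values
    # are assumed to be normalised to [0, 1].
    if abs(b1 - b2) <= threshold:
        return "Similar"
    return "Brighter" if b1 > b2 else "Dimmer"

print(brightness_relationship(0.8, 0.5))  # -> "Brighter"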
Next we give some examples of directional relationships used for spatial reasoning in large-scale spaces, that is, spaces that cannot be seen or understood from a single point of view. Such relationships are useful in application areas such as Geographical Information Systems. In the case of video databases, it is more meaningful to define equivalent positional relationships, as we are dealing with spaces within a frame that can be viewed from a single point. The nine positional relationships are given by AP = {Top, TopLeft, Left, BottomLeft, Bottom, BottomRight, Right, TopRight, Centre}, as shown in Figure 4. These nine relationships can be used as linguistic values for a spatial attribute position given by a bounding box. Similarly, appropriate linguistic terms can be developed for other attributes, as shown in Table 1.

Figure 4. Positional relations in a two-dimensional space
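One way to derive the nine positional terms from a bounding box is to locate the box's centre in a 3x3 partition of the frame, as sketched below; the grid partition is an assumption for illustration, not a mapping prescribed by the chapter.

def positional_term(bbox, frame_width, frame_height):
    # Map a bounding box (x1, y1, x2, y2) to one of the nine positional
    # relationships in AP, using the box centre and a 3x3 grid over the frame.
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    col = min(int(3 * cx / frame_width), 2)    # 0 = left, 1 = centre, 2 = right
    row = min(int(3 * cy / frame_height), 2)   # 0 = top,  1 = centre, 2 = bottom
    grid = [["TopLeft", "Top", "TopRight"],
            ["Left", "Centre", "Right"],
            ["BottomLeft", "Bottom", "BottomRight"]]
    return grid[row][col]

# Example: a text region in the lower right of a 352x288 frame
print(positional_term((240, 230, 340, 280), 352, 288))  # -> "BottomRight"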

Video Concept Object


In order to support the specification of an attribute relationship (using linguistic
terms), we define a video concept object. A Video Concept Object (VCO) is a video object
with a semantic attribute value attached to each intensional attribute. The main difference between a VO and a VCO lies in the domain of the intensional attributes. The attributes of a VO have numerical domains, whereas the attributes of a VCO have as their domain a set of linguistic terms, which forms the domain of semantic attribute values. The members of this set are controlled vocabulary terms that reflect the (fuzzy) semantic value for each attribute type. The domain set will be different for audio, visual, temporal and spatial attributes.
Definition: A typical Video Concept Object (VCO) is a five-tuple

VCO = <Xc, Sc, Tc, Ac, Vc>

where
Xc is a set of textual attributes that define a concept; the attribute values are drawn from the character domain,
Sc represents a set of spatial attributes; the value domain is a set of (fuzzy) terms that describe relative positional and directional relationships,
Tc represents a set of temporal attributes whose values are drawn from a set of fuzzy terms used to describe a time interval or duration,
Ac represents a set of audio attributes (an example is a loudness attribute whose values are drawn from a set of fuzzy terms that describe loudness), and
Vc represents a visual attribute whose values are drawn from a set of fuzzy terms that describe that particular visual attribute.

The relationship between a VO and a VCO is established using a fuzzy linguistic mapping, as shown in Figure 6.
In a fuzzy linguistic model, an attribute name-relationship pair is equivalent to a fuzzy variable-value pair. In general, a user can query and retrieve video from a database using any primitive VO/VCO. An example is "retrieve all videos where an object A is at the right-bottom of the frame". In this example, the expressive power of the query is based on a single attribute value and is limited. A more interesting query would be "retrieve all video sequences where the loudness value is very high and the duration of the sequence is short". Such queries would include VCOs and VOs.
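The derivation of a VCO from a VO can then be sketched as applying one fuzzy linguistic mapping per intensional attribute; the mapping table below is an illustrative assumption and is not taken from the chapter.

from typing import Callable, Dict

def derive_vco(vo_attrs: Dict[str, float],
               mappings: Dict[str, Callable[[float], str]]) -> Dict[str, str]:
    # Produce the linguistic (VCO) attribute values from the numeric (VO)
    # attribute values, using one fuzzy mapping per attribute name.
    return {name: mappings[name](value)
            for name, value in vo_attrs.items() if name in mappings}

# Illustrative mappings; the thresholds are chosen only for this example.
mappings = {
    "loudness": lambda v: "very loud" if v >= 4 else "loud" if v >= 3
                else "soft" if v >= 2 else "very soft",
    "duration": lambda v: "short" if v <= 10 else "long",
}

print(derive_vco({"loudness": 4.5, "duration": 5.0}, mappings))
# -> {'loudness': 'very loud', 'duration': 'short'}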


Figure 6. An example VCO derived using a fuzzy linguistic model
(The figure shows a VO with the audio attribute loudness, valued on a numerical scale [1, ..., 5], and the temporal attribute duration, valued in seconds, mapped through the fuzzy linguistic model to a VCO whose loudness takes values from {very soft, soft, loud, very loud} and whose duration takes values from {very short, short, long, ...}.)

VO and VCO Relationships


In order to retrieve sequences with multiple relationships over different VOs and VCOs, we need appropriate operators. In addition, we need some kind of knowledge or heuristic rule(s) to help us interpret these relationships. This motivates us to define a Video Semantics System.

Definition: A typical Video Semantics System (VSS) is a five-tuple

VSS = <VO/VCO, Sro, Tro, Fro, I>

where
VO/VCO is a set of Video Objects or Video Concept Objects,
Sro is a set of spatial relationship operators,
Tro is a set of temporal relationship operators,
Fro is a set of multimodal operators, and
I is an interpretation model of the video object.

Next we describe some operators and interpretation models. We define spatial operators for manipulating relationships of objects over spatial attributes; this helps in describing the structure of features (Figure 1). We define temporal operators for manipulating relationships over temporal attributes; this helps to describe the order of events (Figure 1). For multimodal, multifeature composition, however, we define fuzzy Boolean operators.


Spatial Relationship Operators (SROs)


A number of spatial relationships with corresponding operators have been proposed and used in spatial databases, geographical information systems and multimedia database systems. We group them into the following categories. It is important to note that SROs are valid only if the VO has attributes (values) that indicate spatial bounds.

•	Topological Relations: Topological relations are spatial relations that are invariant under bijective and continuous transformations that also have continuous inverses. Topological equivalence does not necessarily preserve distances and directions. Instead, topological notions include continuity, interior and boundary, which are defined in terms of neighbourhood relations. The eight topological relationship operators are given as TR = {Disjoint, Meet, Equal, Inside, Contains, Covered_By, Covers, Overlap} (Egenhofer & Franzosa, 1991), as shown in Figure 7. These relationship operators are binary.
•	Relative Positional Relations: The relative positional relationship operators are given by RP = {Right of, Left of, Above, Below, In front of, Behind}. These operators are used to express the relationships among VCOs. These relative positional relationship operators are binary.

Sro = TR ∪ RP

Temporal Relationship Operators (TROs)


As video is a continuous medium, temporal relations provide important cues for video data retrieval. Allen (1983) has defined 13 temporal interval relations. Many variations of Allen's temporal interval relations have been proposed and used in temporal and multimedia databases. These relations are used to define the temporal relationships of events within a video; for example, the weather news appears after the sports news in a daily TV news broadcast. It is important to note that TROs are valid only for VOs that have attributes with temporal bounds. The 13 temporal relationships defined by Allen are given by TR = {Before, Meets, Overlaps, Finishes, Starts, Contains, Equals, During, Started by, Finished by, Overlapped by, Met by, After}. (Note: This set TR corresponds to the set of temporal relationship operators defined as Tro in VSS.)
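For reference, the 13 Allen relations between two intervals can be computed directly from their endpoints. The sketch below assumes intervals given as (start, end) pairs with start < end; the relation names follow the set listed above.

def allen_relation(a, b):
    # Return the Allen (1983) temporal relation of interval a to interval b.
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:  return "Before"
    if a1 > b2:  return "After"
    if a2 == b1: return "Meets"
    if a1 == b2: return "Met by"
    if a1 == b1 and a2 == b2: return "Equals"
    if a1 == b1: return "Starts" if a2 < b2 else "Started by"
    if a2 == b2: return "Finishes" if a1 > b1 else "Finished by"
    if b1 < a1 and a2 < b2: return "During"
    if a1 < b1 and b2 < a2: return "Contains"
    return "Overlaps" if a1 < b1 else "Overlapped by"

# Example: a crowd cheer segment occurring before a scoreboard display
print(allen_relation((12.0, 15.0), (18.0, 20.0)))  # -> "Before"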

Figure 7. Examples of eight basic topological relationships


Multimodal Operators
Traditionally, the approach used for audiovisual content retrieval is based on similarity measures on the extracted features. In such systems, users need to be familiar with the underlying features and express their queries in terms of these (low-level) features. In order to allow users to formulate semantically expressive queries, we define a set of fuzzy terms that can be used to describe a feature using fuzzy attribute relationships. For example, when we are dealing with the loudness feature, we use a set of fuzzy terms such as "very loud", "loud", "soft", and "very soft" to describe the relative loudness at the attribute level. This fuzzy measure is described at the VCO level. A query on a VSS, however, can be multimodal in nature and involve many VCOs. In order to have a generic way to combine multimodal relationships, we use simple fuzzy Boolean operators. The fuzzy Boolean operator set is given by

FR = {AND, OR, NOT}.

(Note: This set FR corresponds to the set of multimodal relationship operators defined as Fro in VSS.)
An example of a multimodal, multifeature query is shown in Figure 8.

Figure 8. An example of a multimodal query using fuzzy and temporal operators on VCOs

(<loudness, loud> AND <duration, short>) BEFORE (<loudness, soft> AND <duration, long>)

(The figure shows how fuzzy operators (AND, OR, NOT) combine attribute-value pairs such as <loudness, loud> and <duration, short> at the VCO level, while temporal operators (BEFORE, AFTER, ...) and spatial operators (RIGHT-OF, ...) relate the resulting VCOs, which are in turn derived from the numerical VO attributes through the fuzzy linguistic mapping.)
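A common realisation of these fuzzy Boolean operators is minimum, maximum and complement over membership degrees; the chapter does not fix a particular choice, so the sketch below is one assumption among several possible ones.

def fuzzy_and(*degrees: float) -> float:
    # Fuzzy AND as the minimum of the membership degrees.
    return min(degrees)

def fuzzy_or(*degrees: float) -> float:
    # Fuzzy OR as the maximum of the membership degrees.
    return max(degrees)

def fuzzy_not(degree: float) -> float:
    # Fuzzy NOT as the complement of the membership degree.
    return 1.0 - degree

# Scoring one clip for "<loudness, loud> AND <duration, short>", given
# illustrative membership degrees computed for that clip:
print(fuzzy_and(0.8, 0.6))  # -> 0.6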


Interpretation Models
In the third section we identified five different levels of cinematic codification and
description that help us interpret the meaning of a video. Lindley and Srinivasan (1998)
have demonstrated empirically that these descriptions are meaningful in capturing
distinctions in the way images are viewed and interpreted by a nonspecialist audience,
and between this audience and the analytical terms used by filmmakers and film critics.
Modeling the meaning of a video, shot, or sequence requires the description of the video object at any or all of the interpretation levels described in the third section. Interpretation models can be based on one or several of these five levels. A model based on perceptual visual characteristics is the subject of a large amount of current research on video content-based retrieval (Ahanger & Little, 1996). Models based on cinematic constructs incorporate expressive artifacts such as camera operations (Adam et al., 2000), lighting schemes and optical effects (Truong et al., 2001). Automated detection of cinematic features is another area of vigorous current research activity (see Ahanger & Little). When modeling at the diegetic level, the basic perceptual features of an image are organised into a four-dimensional spatiotemporal world posited by a video image or sequence of video images. This includes the spatiotemporal descriptions of agents, objects, actions and events that take place within that world. The interpretation model that we illustrate in the fifth section is a diegetic model based on interpretations at the perceptual level. Examples of connotative meanings are the emotions connoted by actions or the expressions on the faces of characters. The subtextual level of interpretation involves representing specialised meanings of symbols and signifiers. For both the connotative and the subtextual levels, a definitive representation of the meaning of a video is in principle impossible. The most that can be expected is an evolving body of interpretations.
In the next section we show how a user/author can develop application semantics by explicitly specifying objects, relationships and interpretation models. The ability to create different models allows users/authors to contextualise the content to be modeled.

EXAMPLE
Here we illustrate the process of developing a specific application model to describe video content at different semantic levels. The process is incremental, where the different elements of the VIMET framework are used to retrieve goal segments from basketball videos. Here we have used a top-down approach to modeling, where we first developed the interpretation models from an empirical investigation of basketball videos (Nepal et al., 2001). We then developed an object model for the application.
Figure 9 shows an instantiation of the metamodel.
The observation of television broadcasts of basketball games gave some insights into commonly occurring patterns of events perceived during the course of a basketball game. We have considered a subset of these as key events that occur repeatedly throughout the video of the game to develop a temporal interpretation model. They were: crowd cheer, scorecard display and change in players' direction. We then identified the audio-video features that correspond to the key events manually observed.

Figure 9. An instance of the metamodel for the video semantic "Goal" in basketball videos, using temporal interpretation model I
(The figure maps the clip (basketball videos) and its audio and video features (loudness value, motion vector, embedded text) at the perceptual and cinematic levels, to single-feature events (crowd cheer, score card) at the diegetic level, to the multimodal, multifeature order of events (crowd cheer before score board), and finally to the connotative interpretation "Goal" using domain knowledge.)

The features identified include energy levels of audio signals, embedded text regions, and change in direction of motion. Since the focus of this chapter is on building the video semantic "goal", we will not elaborate on the feature extraction algorithms in detail. We used the MPEG Maaate audio analysis toolkit (MPEG MAAATE, 2000) to identify high-energy segments from loudness values. Appropriate thresholds were applied to these high-energy segments to capture the event "crowd cheer". The mapping to crowd cheer expresses a simple domain heuristic (diegetic and connotative interpretation) for sports video content. Visual feature extraction consists of identifying embedded text (Gu, 1998), which is mapped to scorecard displays in the sports domain. Similarly, camera pan motion (Srinivasan et al., 1997) is extracted and used to capture change in players' direction. The next level of semantics is developed by exploring the temporal order of events such as crowd cheer and scorecard displays. Here we use temporal interpretation models that show relationships on multiple features from multiple modalities.

Interpretation Models
We observed basketball videos and developed five different temporal interpretation models, as follows.


Model I
This model is based on the first key event, "crowd cheer". Our observation shows that there is a loud cheer within three seconds of scoring a legal goal. Hence, in this model, the basic diegetic interpretation is that a loud cheer follows every legal goal, and a loud cheer only occurs after a legal goal. The model T1 is represented by

T1: Goal → [3 sec] Crowd cheer

Intuitively, one can see that the converse may not always be true, as there may be other cases where a loud cheer occurs, for example, when a streaker runs across the field. Such limitations are addressed to some extent in the subsequent models.

Model II
This model is based on the second key event, "scoreboard display". Our observation shows that the scoreboard display is updated after each goal. Our cinematic interpretation in this model is that a scoreboard display appears (usually as embedded text) within 10 seconds of scoring a legal goal. This is represented by the model T2:

T2: Goal → [10 sec] Scoreboard

The limitation of this model is that the converse may not always be true, that is, a scoreboard display may not always be preceded by a legal goal.

Model III
This model uses a combination of the two key events, with a view to addressing the limitations of T1 and T2. As pointed out earlier, crowd cheers and scoreboard displays may not always indicate a legal goal. Ideally, when we classify segments that show a shooter scoring goals, we need to avoid inclusion of events that do not show a goal, even though there may be a loud cheer. In this model, this is achieved by temporally combining the scoreboard display with the crowd cheer. Here, our diegetic interpretation is that every goal is followed by a crowd cheer within three seconds, and by a scoreboard display within seven seconds after the crowd cheer. This discards events that have a crowd cheer but no scoreboard, and events that have a scoreboard but no crowd cheer.

T3: Goal → [3 sec] Crowd cheer → [7 sec] Scoreboard

Model IV
This model addresses the strict constraints imposed in Model III. Our observations show that while the pattern shown in Model III is valid most of the time, there are cases where field goals are accompanied by loud cheers but no scoreboard display. Similarly, there are cases where goals are followed by scoreboard displays but not crowd cheer, as in the case of free throws. In order to capture such scenarios, we have used a combination of models I and II and propose model IV.


T4: T1 ∪ T2

where ∪ denotes the union of the results from models T1 and T2.

Model V
While model IV covers most cases of legal goals, due to the inherent limitations pointed out in models I and II, model IV could potentially classify segments where there are no goals. Our observations show that model IV captures the maximum number of goals, but it also identifies many non-goal segments. In order to retain the number of goal segments and still remove the non-goal segments, we introduce the third key event, "change in direction of players". In this model, if a crowd cheer appears within 10 seconds of a change in direction, or a scoreboard appears within 10 seconds of a change in direction, there is likely to be a goal within 10 seconds of the change in direction. This is represented as follows:

T5: Goal → [10 sec] Change in direction → [10 sec] Crowd cheer
OR
T5: Goal → [10 sec] Change in direction → [10 sec] Scoreboard

Although in all models the time interval between two key events is hardwired, the main idea is to provide a temporal link between key events to build up high-level semantics.
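As a sketch of how such a model can be applied to detected events, the function below implements the spirit of Model III: it pairs each crowd cheer with a scoreboard display that starts within a given window and backdates the goal by a fixed offset. The window handling and the returned interval are illustrative assumptions, not the chapter's implementation.

def goals_from_model3(cheers, scoreboards, cheer_lag=3.0, board_lag=7.0):
    # `cheers` and `scoreboards` are lists of (start, end) times in seconds.
    # A goal is hypothesised when a cheer is followed by a scoreboard display
    # within `board_lag` seconds; the goal is assumed to precede the cheer by
    # up to `cheer_lag` seconds. Returns candidate goal segments.
    goals = []
    for c_start, c_end in cheers:
        for b_start, b_end in scoreboards:
            if 0 <= b_start - c_end <= board_lag:
                goals.append((max(0.0, c_start - cheer_lag), b_end))
                break
    return goals

cheers = [(12.0, 15.0), (40.0, 42.0)]
boards = [(18.0, 20.0), (70.0, 72.0)]
print(goals_from_model3(cheers, boards))  # -> [(9.0, 20.0)]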

Object Model
For the interpretation models described above, we combine video objects characterized by different types of intensional attributes to develop new video objects or video concept objects. An instance of the video data model is shown in Figure 10. For example, we use the loudness feature and a temporal feature to develop a new video concept object called "crowd cheer", which we first define as follows:

DEFINE CrowdCheer AS
SELECT V1.Start-time, V1.End-time
FROM VCO V1
WHERE <V1.Loudness, Very-high> AND
      <V1.Duration, Short>

Here we use diegetic knowledge about the domain to interpret a segment where a very high loudness value appears for a short duration as a crowd cheer. Similarly, we can define a scorecard as follows:

DEFINE ScoreCard AS
SELECT V2.Start-time, V2.End-time
FROM VCO V2
WHERE <V2.Text-Region, Bottom-Right>

Figure 10. An instance of the data model for the example explained in this section
(The figure shows the semantics "Goal" defined as CrowdCheer BEFORE ScoreCard. CrowdCheer is a VCO obtained by combining <loudness, Very high> AND <Duration, Short>, derived from a VO with a numerical loudness value and start/end times; ScoreCard is a VCO with <Text-Region, Bottom-Right>, derived from a VO with a numerical text-region bounding box and start/end times. The numerical VO attribute values are mapped to the VCO terms through the fuzzy linguistic model.)

We then use the temporal interpretation model to build a query to develop a video
concept object that represents a goal segment. This is defined as follows:
DEFINE Goal AS
SELECT [10] MIN(V1.Start-time, V2.Start-time),
MAX(V1.End-time, V2.End-time)
FROM CrowdCheer V1, ScoreCard V2
WHERE V1 BEFORE V2

Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.

A Multidimensional Approach for Describing Video Semantics

155

Table 2. A list of data sets used for our experiments and the results of our observations

Games | Description | Length | Number of cheers | Number of scoreboard displays | Number of goals | Number of changes in direction of players' movements
A | Australia Vs Cuba 1996 (Women) | 00:42:00 | 31 | 49 | 52 | 74
B | Australia Vs USA 1994 (Women)  | 00:30:00 | 27 | 46 | 30 | 46
C | Australia Vs Cuba 1996 (Women) | 00:14:51 | 16 | 16 | 17 | 51
D | Australia Vs USA 1997 (Men)    | 00:09:37 | 13 | 16 | 18 | 48

Here the query returns the 10 best-fit video sequences that satisfy the query criteria for a goal.
We have implemented this application model for basketball videos. Implementing such a model necessarily involves developing appropriate interpretation models, which is a labour-intensive task. The VIMET framework helps this process by facilitating incremental descriptions of video content.
We now present an evaluation of the interpretation models outlined in the previous section.

Evaluation of Interpretation Models


The data set used to evaluate the different temporal models for building a video
semantic goal in basketball videos is shown in Table 2. The first two clips A and B are
used for evaluating observations and the last two clips are used for evaluating automatic
analysis.
The standard precision-recall method was used to compare our automatically
generated goal segments with the ones manually judged by humans.
Precision is the ratio of the number of relevant clips retrieved to the total number
of clips retrieved. The ideal situation corresponds to 100% precision, when all retrieved
clips are relevant.

precision = |relevant retrieved| / |retrieved|

Recall is the ratio of the number of relevant clips retrieved to the total number of
relevant clips. We can achieve ideal recall (100%) by retrieving all clips from the data set,
but the corresponding precision will be poor.


Table 3. A summary of results of automatic evaluation of various algorithms

Clip | Total number of baskets (relevant) | Algorithm | Total (retrieved) | Correct decision (relevant retrieved) | Precision (%) | Recall (%)
C | 17 | T1 | 15 | 12 | 80    | 70.50
C | 17 | T2 | 20 | 15 | 75    | 88.23
C | 17 | T3 | 11 | 11 | 100   | 64.70
C | 17 | T4 | 24 | 16 | 66.66 | 94.1
C | 17 | T5 | 17 | 15 | 88.23 | 88.23
D | 18 | T1 | 16 | 11 | 68.75 | 61.11
D | 18 | T2 | 19 | 17 | 89.47 | 94.44
D | 18 | T3 | 10 | 10 | 100   | 55.55
D | 18 | T4 | 25 | 18 | 72.0  | 100
D | 18 | T5 | 22 | 16 | 72.72 | 88.88

recall = |relevant retrieved| / |relevant|
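Both measures can be computed directly from the sets of retrieved and relevant segments, as in the small helper below (written for illustration; it is not the evaluation code used for Table 3).

def precision_recall(retrieved: set, relevant: set):
    # Return (precision, recall) as percentages for a set of retrieved clip
    # identifiers compared against the manually judged relevant set.
    hits = len(retrieved & relevant)
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({"s1", "s2", "s3", "s4"}, {"s1", "s2", "s5"}))
# -> (50.0, 66.66...)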

We evaluated the temporal interpretation models in our example data set. The
manually identified legal goals (which include field goals and free throws) of the videos
in our data set are shown in column 2 in Table 3.
The results of the automatic analysis show that the combination of all three key events performs much better, with high recall and precision values (~88%). In model T1, the model correctly identifies 12 goal segments out of 17 for video C and 11 out of 18 for video D. However, the total number of crowd cheer segments detected by our crowd cheer detection algorithm is 15 and 16 for videos C and D, respectively. That is, there are a few legal goal segments that do not have crowd cheer and vice versa. Further analysis shows that crowd cheer events resulting from other interesting events, such as a "fast break", a "clever steal" or a "great check or screen", give false positive results. We observed that most of the field goals are accompanied by crowd cheer. However, many goals scored by free throws are not accompanied by crowd cheer. We also observed that in certain cases a lack of supporters among the spectators for a team yields false negative results. In model T2, the model correctly identifies 15 goals out of 17 in video C and 17 out of 18 in video D. Our further analysis confirmed that legal goals due to free throws are not often accompanied by scoreboard displays, particularly when the scoreboard is updated at the end of the free throws rather than after each free throw. Similarly, our feature extraction algorithm used for the scoreboard not only detects scoreboards but also other textual displays, such as team coach names and the number of fouls committed by a player. Such textual features increase the number of false positive results. We plan to use the heuristics developed here to improve our scoreboard detection algorithm in the future. The above discussion is valid for algorithms T3, T4 and T5 as well.


CONCLUDING REMARKS
Emerging new technologies in video delivery, such as streaming over the Internet, have made video content a significant component of the information space. Video content holders now have an opportunity to provide new video products and services by reusing their video collections. This requires more than content-based analysis systems.
We need effective ways to model, represent and describe video content at several semantic levels that are meaningful to users of video content. At the lowest level, content can be described using low-level features such as colour and texture, and at the highest level the same content can be described using high-level concepts. In this chapter, we have provided a systematic way of developing content descriptions at several semantic levels. In the example in the fifth section, we have shown how (audio and visual) feature extraction techniques can be effectively used together with interpretation models to develop higher-level semantic descriptions of content. Although the example is specific to the sports genre, the VIMET framework and the associated data model provide a platform to develop several descriptions of the same video content, depending on the interpretation level of the annotator. The model thus supports one of the mandates of MPEG-7 by supporting content descriptions that accommodate multiple interpretations.

REFERENCES
Adam, B., Dorai, C., & Venkatesh, S. (2000). Study of shot length and motion as
contributing factors to movie tempo. ACM Multimedia, 353-355.
Ahanger, G., & Little, T.D.C. (1996). A survey of technologies for parsing and indexing
digital video. Journal of Visual Communication and Image Representation, 7(1),
28-43.
Allen, J.F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832-843.
Baral, C., Gonzalez, G., & Son, T. (1998). Conceptual modeling and querying in multimedia databases. Multimedia Tools and Applications, 7, 37-66.
Bryan-Kinns, N. (2000). VCMF: A framework for video content modeling. Multimedia
Tools and Applications, 10(1), 23-45.
Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., & Zhong, D. (1997). VideoQ: An
automated content based video search system using visual cues. ACM Multimedia, 313-324, Seattle, Washington, November.
Colombo, C., Bimbo, A.D., & Pala, P. (2001). Retrieval of commercials by semantic
content: the semiotic perspective. Multimedia Tools and Applications, 13, 93-118.
Driankov, D., Hellendoorn, H., & Reinfrank, M. (1993). An introduction to fuzzy control.
Springer-Verlag.
Egenhofer, M., & Franzosa, R. (1991). Point-set topological spatial relations. International Journal of Geographic Information Systems, 5(2), 161-174.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Proceedings of the
15th ACM Symposium on Principles of Database Systems (pp. 83-99).
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M.,
Hafner, J., Lee, D., Petkovic, D., & Steele, D. (1995). Query by image and video
content: The QBIC system. Computer, 28(9), 23-32.

Frank, A.U. (1996). Qualitative spatial reasoning: Cardinal directions as an example.


International Journal of Geographic Information Systems, 10(3), 269-290.
Gu, L. (1998). Scene analysis of video sequences in the MPEG domain. Proceedings of the IASTED International Conference Signal and Image Processing, October 28-31, Las Vegas.
Hacid, M.S., Decleir, C., & Kouloumdjian, J. (2000). A database approach for modeling and querying video data. IEEE Transactions on Knowledge and Data Engineering, 12(5), 729-750.
Hammoud, R., Chen, L., & Fontaine, D. (2001). An extensible spatial-temporal model for semantic video segmentation. TRANSDOC project. Online at http://transdoc.ibp.fr/
Hampapur, A. (1999). Semantic video indexing: Approaches and issues. SIGMOD
Record, 28(1), 32-39.
Hjelsvold, R., & Midtstraum, R. (1994). Modelling and querying video data. Proceedings
of the 20th VLDB Conference, Santiago, Chile (pp. 686-694).
ISO/IEC JTC1/SC29/WG11 (2001). Overview of the MPEG-7 Standard (version 5).
Singapore, March.
Jain, R., & Hampapur, A. (1994). Metadata in video databases. SIGMOD Record, 23(4), 27-33.
Lindley, C., & Srinivasan, U. (1998). Query semantics for content-based retrieval of video
data: An empirical investigation. Storage and Retrieval Issues in Image- and
Multimedia Databases, in conjunction with the Ninth International Conference
DEXA98, August 24-28, Vienna, Austria.
Meng, J., Juan, Y., & Chang, S.-F. (1995). Scene change detection in an MPEG compressed video sequence. SPIE Symposium on Electronic Imaging: Science & Technology - Digital Video Compression: Algorithms and Technologies, 2419, San Jose, California, February.
Metz, C. (1974). Film language: A semiotics of the cinema (trans. by M. Taylor). The
University of Chicago Press.
MPEG MAAATE (2000). The Australian MPEG Audio Analysis Tool Kit. Online at http://www.cmis.csiro.au/dmis/maaate/
Nepal, S., Srinivasan, U., & Reynolds, G. (2001). Automatic detection of goal segments
in basketball videos. ACM Multimedia 2001, 261-269, Sept-Oct.
Nepal, S., Srinivasan, U., & Reynolds, G. (2001). Semantic-based retrieval model for
digital audio and video. IEEE International Conference on Multimedia and
Exposition (ICME 2001), August (pp. 301-304).
Pradhan, S., Tajima, K., & Tanaka, K. (2001). A query model to synthesize answer intervals from indexed video units. IEEE Transactions on Knowledge and Data Engineering, 13(5), 824-838.
Rui, Y., Huang, T.S., & Chang, S.F. (1999). Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10, 39-62.
Schloss, G.A., & Wynblatt, M.J. (1994). Building temporal structures in a layered
multimedia data model. ACM Multimedia, 271-278.
Smith, J.R., & Benitez, A.B. (2000). Conceptual modeling of audio-visual content. IEEE
International Conference on Multimedia and Expo (ICME 2000), July-Aug. (p.
915).


Smith, J.R., & Chang, S.-F. (1997). Querying by color regions using the VisualSEEk
content-based visual query system. In M. T. Maybury (Ed.), Intelligent multimedia information retrieval. IJCAI.
Srinivasan, U., Gu, L., Tsui, & Simpson-Young, B. (1997). A data model to support
content-based search on digital video libraries. The Australian Computer Journal,
29(4), 141-147.
Srinivasan, U., Lindley, C., & Simpson-Young, B. (1999). A multi-model framework for
video information systems. Database Semantics- Semantic Issues in Multimedia
Systems, January (pp. 85-108). Kluwer Academic Publishers.
Srinivasan, U., Nepal, S., & Reynolds, G. (2001). Modelling high level semantics for video
data management. Proceedings of ISIMP 2001, Hong Kong, May (pp. 291-295).
Tansley, R., Dobie, M., Lewis, P., & Hall, W. (1999). MAVIS 2: An architecture for content
and concept based multimedia information exploration. ACM Multimedia, 203.
Truong, B.T., Dorai, C., & Venkatesh, S. (2001). Determining dramatic intensification via
flashing lights in movies. International Conference on Multimedia and Expo,
August 22-25, Tokyo (pp. 61-64).
Yap, K., Simpson-Young, B., & Srinivasan, U. (1996). Enhancing video navigation with
existing alternate representations. First International Conference on Image Databases and Multimedia Search, Amsterdam, August.


Chapter 7

Continuous Media Web: Hyperlinking, Search and Retrieval of Time-Continuous Data on the Web
Silvia Pfeiffer, CSIRO ICT Centre, Australia
Conrad Parker, CSIRO ICT Centre, Australia
André Pang, CSIRO ICT Centre, Australia

ABSTRACT

The Continuous Media Web project has developed a technology to extend the Web to
time-continuously sampled data enabling seamless searching and surfing with existing
Web tools. This chapter discusses requirements for such an extension of the Web,
contrasts existing technologies and presents the Annodex technology, which enables
the creation of Webs of audio and video documents. To encourage uptake, the
specifications of the Annodex technology have been submitted to the IETF for
standardisation and open source software is made available freely. The Annodex
technology permits an integrated means of searching, surfing, and managing a World
Wide Web of textual and media resources.

INTRODUCTION
Nowadays, the main source of information is the World Wide Web. Its HTTP
(Fielding et al., 1999), HTML (World Wide Web Consortium, 1999B), and URI (Berners-Lee et al., 1998) standards have enabled a scalable, networked repository of any sort of
information that people care to publish in textual form. Web search engines have enabled humanity to search for any information on any public Web server around the world. URI hyperlinks in HTML documents have enabled surfing to related information, giving the Web its full power. Repositories of information within organisations are also building on these standards for much of their internal and external information dissemination.
While Web searching and surfing have become a natural way of interacting with textual information to access its semantic content, no such thing is possible with media. Media on the Web is cumbersome to use: it is handled as "dark matter" that cannot be searched through Web search engines, and once a media document is accessed, only linear viewing is possible, with no browsing or surfing to other semantically related documents.
Multimedia research in recent years has recognised this issue. One means to enable search on media documents is to automate the extraction of content, store the content as index information, and provide search facilities through that index information. This has led to extensive research on the automated extraction of metadata from binary media data, aiming at bridging the semantic gap between automatically extracted low-level image, video, and audio features, and the high level of semantics that humans perceive when viewing such material (see, e.g., Dimitrova et al., 2002).
It is now possible to create and store a large amount of metadata and semantic content from media documents, whether automatically or manually. But how do we exploit such a massive amount of information in a standard way? What framework can we build to satisfy the human need to search for content in media, to quickly find and access it for reviewing, and to manage and reuse it in an efficient way?
As the Web is the most commonly used means for information access, we decided
to develop a technology for time-continuous documents that enables their seamless
integration into the Web's searching and surfing. Our research is thus extending the
World Wide Web with its familiar information access infrastructure to time-continuous
media such as audio and video, creating a Continuous Media Web.
Particular aims of our research are:

•	to enable the retrieval of relevant clips of time-continuous documents through familiar textual queries in Web search engines,
•	to enable the direct addressing of relevant clips of time-continuous documents through familiar URI hyperlinks,
•	to enable hyperlinking to other relevant and related Web resources while reviewing a time-continuous document, and
•	to enable automated reuse of clips of time-continuous documents.

This chapter presents our developed Annodex (annotation and indexing) technology, the specifications of which have been published at the IETF (Internet Engineering Task Force) as Internet-Drafts for the purposes of international standardisation. Implementations of the technology are available at http://www.annodex.net/. In the next section we present related works and their shortcomings with respect to our aims. We then explain the main principles that our research and development work adheres to. The subsequent section provides a technical description of the Continuous Media Web (CMWeb) project and thus forms the heart of this book chapter. We round it off with a
view on research opportunities created by the CMWeb, and conclude the paper with a
summary.

BACKGROUND
The World Wide Web was created by three core technologies (Berners-Lee et al.,
1999): HTML, HTTP, and URIs. They respectively enable:

•	the markup of textual data integrated with the data itself, giving it a structure, metadata, and outgoing hyperlinks,
•	the distribution of Web documents over the Internet, and
•	the hyperlinking to and into Web documents.

In an analogous way, what is required to create a Web of time-continuous documents is:

•	a markup language to create addressable structure, searchable metadata, and outgoing hyperlinks for a continuous media document,
•	an integrated document format that can be distributed via HTTP, making use of existing caching HTTP proxy infrastructure, and
•	a means to hyperlink into a continuous media document.

One expects that the many existing standardisation efforts in multimedia would cover these requirements. However, while the required pieces may exist, they are not packaged and optimised for addressing these issues and for solving them in such a way as to make use of the existing Web infrastructure with the least necessary adaptation effort. Here we look at the three most promising standards: SMIL, MPEG-7, and MPEG-21.

SMIL
The W3C's SMIL (World Wide Web Consortium, 2001), short for Synchronized Multimedia Integration Language, is an XML markup language used for authoring interactive multimedia presentations. A SMIL document describes the sequence of media documents to play back, including conditional playback, loops, and automatically activated hyperlinks. SMIL has outgoing hyperlinks, and elements inside it can be addressed using XPath (World Wide Web Consortium, 1999A) and XPointer (World Wide Web Consortium, 2002).
Features of SMIL cover the following modules:

1. Animation: provides for incorporating animations onto a time line.
2. Content Control: provides for runtime content choices and prefetch delivery.
3. Layout: allows positioning of media elements on the visual rendering surface and control of audio volume.
4. Linking: allows navigation through the SMIL presentation, triggered by user interaction or other triggering events. SMIL 2.0 provides only for in-line link elements.

5. Media Objects: describes media objects that come in the form of hyperlinks to animations, audio, video, images, streaming text, or text. Restrictions of continuous media objects to temporal subparts (clippings) are possible, and short and long descriptions may be attached to a media object.
6. Metainformation: allows description of SMIL documents and attachment of RDF metadata to any part of the SMIL document.
7. Structure: structures a SMIL document into a head and a body part, where the head part contains information that is not related to the temporal behaviour of the presentation and the body tag acts as a root for the timing tree.
8. Timing and Synchronization: provides for different choreographing of multimedia content through timing and synchronization commands.
9. Time Manipulation: allows manipulation of the time behaviour of a presentation, such as control of the speed or rate of time for an element.
10. Transitions: provides for transitions such as fades and wipes.
11. Scalability: provides for the definition of profiles of SMIL modules (1-10) that meet the needs of a specific class of client devices.

SMIL is designed for creating interactive multimedia presentations, not for setting up Webs of media documents. A SMIL document may result in a different experience for every user and therefore is not a single, temporally addressable, time-continuous document. Thus, addressing temporal offsets does not generally make sense on a SMIL document.
SMIL documents cannot generally be searched for clips of interest, as they don't typically contain the information required by a Web search engine: SMIL does not focus on including metadata, annotations and hyperlinks, and thus does not provide the information necessary to be crawled and indexed by a search engine.
In addition, SMIL does not integrate the media documents required for its presentation into one single file, but instead references them from within the XML file. All media data is only referenced, and there is no transport format for a presentation that includes all the relevant metadata, annotations, and hyperlinks interleaved with the media data to provide a streamable format. This would not make sense anyway, as some media data that is referenced in a SMIL file may never be viewed by users, as they may never activate the appropriate action. A SMIL interaction's media streams will be transported on separate connections from the initial SMIL file, requiring the client to perform all the media synchronization tasks, and proxy caching can happen only on each file separately, not on the complete interaction.
Note, however, that a single SMIL interaction, if recorded during playback, can become a single time-continuous media document, which can be treated with our Annodex technology to enable it to be searched and surfed. This may be interesting for archiving and digital record-keeping.

MPEG-21
The ISO/MPEG's MPEG-21 (Burnett et al., 2003) standard is building an open
framework for multimedia delivery and consumption. It thus focuses on addressing how
to generically describe a set of content documents that belong together from a semantic
point of view, including all the information necessary to provide services on these digital
items. This set of documents is called a Digital Item, which is a structured representation in XML of a work, including identification and metadata information.
The representation of a Digital Item may be composed of the following descriptors:

1. Container: is a structure that groups items and/or containers.
2. Item: a group of subitems and/or components bound to relevant descriptors.
3. Component: binds a resource to a set of descriptors including control or structural information of the resource.
4. Anchor: binds a descriptor to a fragment of a resource.
5. Descriptor: associates information (i.e., text or a component) with the enclosing element.
6. Condition: makes the enclosing element optional and links it to the selection(s) that affect its inclusion.
7. Choice: is a set of related selections that can affect an item's configuration.
8. Selection: is a specific decision that affects one or more conditions somewhere within an item.
9. Annotation: is a set of information about an identified element of the model.
10. Assertion: is a fully or partially configured state of a choice, asserting true/false/undecided for predicates associated with the selections for that choice.
11. Resource: is an individually identifiable asset such as a video clip, audio clip, image or textual asset, or even a physical object, locatable via an address.
12. Fragment: identifies a specific point or range within a resource.
13. Statement: is a literal text item that contains information, but is not an asset.
14. Predicate: is an identifiable declaration that can be true/false/undecided.

MPEG-21 further provides for the handling of rights associated with Digital Items,
and for the adaptation of Digital Items to usage environments.
As an example for a Digital Item, consider a music CD album. When it is turned into
a digital item, the album is described in an XML document that contains references to the
cover image, the text on the CD cover, the text on an accompanying brochure, references
to a set of audio files that contain the songs on the CD, ratings of the album, rights
associated with the album, information on the different encoding formats in which the
music can be retrieved, different bitrates that can be supported when downloading, and so forth.
This description supports the handling of a digital CD album as an object: it allows you
to manage it as an entity, describe it with metadata, exchange it with others, and collect
it as an entity.
An MPEG-21 document does not typically describe just one time-continuous document, but rather several. These descriptions are temporally addressable, and hyperlinks can go into and out of them. Metadata can be attached to the descriptions of the documents, making them searchable and indexable by search engines.
As can be seen, MPEG-21 addresses the problem of how to handle groups of files rather than focusing on the markup of a single media file, and therefore does not address how to directly link into time-continuous Web resources themselves. There is an important difference between linking into and out of descriptions of a time-continuous document and linking into and out of a time-continuous document itself: integrated handling provides for cacheability and for direct URI access.


The aims of MPEG-21 are orthogonal to the aims that we pursue. While MPEG-21
enables a better handling of collections of Web resources that belong together in a
semantic way, Annodex enables a more detailed handling of time-continuous Web
resources only. Annodex provides a granularity of access into time-continuous resources that an MPEG-21 Digital Item can exploit in its descriptions of collections of
Annodex and other resources.

MPEG-7
The ISO/MPEG's MPEG-7 (Martinez et al., 2002) standard is an open framework for describing multimedia entities, such as image, video, audio, audiovisual, and multimedia content. It provides a large set of description schemes to create markup in XML format. MPEG-7 description schemes can provide the following features:
1. Specification of links and locators (such as time, media locators, and referencing description tools).
2. Specification of basic information such as people, places, textual annotations, controlled vocabularies, etc.
3. Specification of the spatio-temporal structure of multimedia content.
4. Specification of audio and visual features of multimedia content.
5. Specification of the semantic structure of multimedia content.
6. Specification of the multimedia content type and format for management.
7. Specification of media production information.
8. Specification of media usage (rights, audience, financial) information.
9. Specification of classifications for multimedia content.
10. Specification of user information (user description, user preferences, usage history).
11. Specification of content entities (still region, video/audio/audiovisual segments, multimedia segment, ink content, structured collections).
12. Specification of content abstractions (semantic descriptions, media models, media summaries, media views, media variations).

The main intended use of MPEG-7 is for describing multimedia assets such that they
can be queried or filtered. Just like SMIL and MPEG-21, the MPEG-7 descriptions are
regarded as completely independent of the content itself.
An MPEG-7 document is an XML file that contains any sort of meta information
related to a media document. While the temporal structure of a media document can be
represented, this is not the main aim of MPEG-7 and not typically the basis for attaching
annotations and hyperlinks. This is the exact opposite approach to ours where the basis
is the media document and its temporal structure. Much MPEG-7 markup is in fact not
time-related and thus does not describe media content at the granularity we focus on.
Also, MPEG-7 does not attempt to create a temporally interleaved document format that
integrates the markup with the media data.
Again, the aims of MPEG-7 and Annodex are orthogonal. As MPEG-7 is a format that
focuses on describing collections of media assets, it is a primarily database-driven
approach towards the handling of information, while Annodex comes from a background
of Web-based, and therefore network-based, handling of media streams. A specialisation
(or, in MPEG-7 terms, a profile) of MPEG-7 description schemes may allow the creation of
annotations similar to the ones developed by us, but the transport-based interleaved
document format that integrates the markup with the media data in a streamable fashion
is not generally possible with MPEG-7 annotations. Annotations created in MPEG-7 may
however be referenced from inside an Annodex format bitstream, and some may even be
included directly into the markup of an Annodex format bitstream through the meta and
desc tags.

THE CHALLENGE
The technical challenge for the development of Annodex (Annodex.net, 2004) was
the creation of a solution to the three issues presented earlier:

•	Create an HTML-like markup language for time-continuous data,
•	that can be interleaved with the media stream to create a searchable media document, and
•	create a means to hyperlink by temporal offset into the time-continuous document.
We have developed three specifications:

1.	CMML (Pfeiffer, Parker, & Pang, 2003a), the Continuous Media Markup Language, which is based on XML and provides tags to mark up time-continuous data into sets of annotated temporal clips. CMML draws upon many features of HTML.
2.	Annodex (Pfeiffer, Parker, & Pang, 2003b), the binary stream format to store and transmit interleaved CMML and media data.
3.	Temporal URIs (Pfeiffer, Parker, & Pang, 2003c), which enable hyperlinking to temporally specified sections of an Annodex resource.

Aside from the above technical requirements, the development of these technologies has been guided by several principles and non-technical requirements. It is important
to understand these constraints as they have strongly influenced the final format of the
solution.

•	Hook into existing Web Infrastructure: The Annodex technologies have been designed to hook straight into the existing Web infrastructure with as few adaptations as necessary. Also, the scalability property of the Web must not be compromised by the solution. Thus, CMML is very similar to HTML, temporal URI queries are CGI (Common Gateway Interface) style parameters (NCSA HTTPd Development Team, 1995), temporal URI fragments are like HTML fragments, and Annodex streams are designed to be cacheable by Web proxies.
•	Open Standards: The aim of the CMWeb project is to extend the existing World Wide Web to time-continuous data such as audio and video and create a more powerful networked worldwide infrastructure. Such a goal can only be achieved if the different components that make up the infrastructure interoperate even when created by different providers. Therefore, all core specifications are being published as open international standards with no restraining patent issues.
•	The Annodex Trademark: For an open standard, interoperability of different implementations is crucial to its success. Any implementation that claims to implement the specification but is not conformant, and thus not interoperable, will be counterproductive to the creation of a common infrastructure. Therefore, registering a trademark on the word Annodex enables us to stop non-conformant implementations from claiming to be conformant by using the same name.
•	Free media codecs: For the purpose of standardisation it is important to encourage Internet-wide use of media codecs for which no usage restrictions exist. The codecs must be legal to use in all Internet-connected devices and compatible with existing Web infrastructure. This however does not mean that the technology is restricted to specific codecs; on the contrary, Annodex works for any time-continuously sampled digital data file. Please also note that we do not develop codecs ourselves, but rather provide recommendations for which codecs to support.
•	Open Source: Open standards require reference implementations that people can learn from and make use of for building up the infrastructure. Therefore, the reference software should be published as open source software. According to Tim Berners-Lee this was essential to the development and uptake of the Web (Berners-Lee et al., 1999).
•	Device Independence: As convergence of platforms continues, it is important to design new formats such that they can easily be displayed and interacted with on any networked device, be that a huge screen or a small handheld device screen. Therefore, Annodex is being designed to work independently of any specific features an end device may have.
•	Generic Metadata: Metadata for time-continuous data can come in many different structured or unstructured schemes. It can be automatically or manually extracted, and can follow the standard Dublin Core metadata scheme (Dublin Core Metadata Initiative, 2003) or a company-specific metadata scheme. Therefore, it is important to specify the metadata types in a generic manner to allow free text and any set of name-value pairs as metadata. It must be possible to develop more industry-specific sets of metadata schemes later and make full use of them in Annodex.
•	Simplicity: Above all, the goal of Annodex is to create a very simple set of tools and formats for enabling time-continuous Web resources with the same powerful means of exploration as text content on the Web.

These principles stem from a desire to make simple standards that can be picked up
and integrated quickly into existing Web infrastructure.


THE SOLUTION

Surfing and Searching

The double aim of Annodex is to enable Web users to:

•	view, access and hyperlink between clips of time-continuous documents in the same simple, but powerful way as HTML pages, and
•	search for clips of time-continuous documents through the common Web search engines, and retrieve clips relevant to their query.

Figures 1 and 2 show screen shots of an Annodex Web browser and an Annodex
search engine. The Annodex browser's main window displays the media data, typical
Web browser buttons and fields (at the top), typical media transport buttons (at the
bottom), and a representative image (also called a keyframe) and hyperlink for the
currently displayed media clip. The story board next to the main window displays the list
of clips that the current resource consists of, enabling direct access to any clip in this
table of contents. The separate window on the top right displays the free-text annotation
stored in the description for the current clip, while the one on the lower right displays
the structured metadata stored for the resource or for the current clip. When crawling this
particular resource, a Web search engine can index all this textual information.
The Annodex search engine displayed in Figure 2 is a standard Web search engine
extended with the ability to crawl and index the markup of Annodex resources. It retrieves
clips that are relevant to the users query and presents ranked search results based on
the relevance of the markup of the clips. The keyframe of the clip and its description are
displayed.

Figure 1. Browsing a video about CSIRO astronomy research


Figure 2. Searching for radio galaxies in CSIRO's science CMWeb

Architecture Overview
As the Continuous Media Web technology has to be interoperable with existing
Web technology, its architecture must be the same as that of the World Wide Web (see Figure 3): there is a Web client that issues a URI request over HTTP to a Web server, which resolves it and serves the requested resource back to the client. In the case where the client is a Continuous Media Web browser, the request will be for an Annodex file, which contains all the relevant markup and media data to display the content to the user. In the case where the client is a Web crawler (e.g. part of a Web search engine), the client may add an HTTP Accept request header with a preference for receiving only the CMML markup and not the media data. This is possible because the CMML markup represents all the textual content of an Annodex file and is thus a thin representation of the full media data. In addition, it is a bandwidth-friendly means of crawling and indexing media content, which is very important for the scalability of the solution.

Figure 3. Continuous Media Web Architecture

Annodex File Format

Annodex is the format in which media with interspersed CMML markup is transferred over the wire. Analogous to a normal Web server offering a collection of HTML
pages to clients, an Annodex server offers a collection of Annodex files. After a Web
client has issued a URI request for an Annodex resource, the Web server delivers the
Annodex resource, or an appropriate subpart of it according to the URI query parameters.
Annodex files conceptually consist of multiple media streams and one CMML
annotation stream, interleaved in a temporally synchronised way. The annotation stream
may contain several sets of clips that provide alternative markup tracks for the Annodex
file. The media streams may be complementary, such as an audio track with a video track,
or alternative, such as two speech tracks in different languages. Figure 4 shows an
example Annodex file with three media tracks (light coloured bars) and an annotation
track with a header describing the complete file (dark bar at the start) and several
interspersed clips.
One way to author Annodex files is by creating a CMML markup file and encoding
the media data together with the markup based on the authoring instructions found in
the CMML file. Figure 5 displays the principle of the Annodex file creation process: the
header information of the CMML file and the media streams are encoded at the start of
the Annodex file, while the clips and the actual encoded media data are appended
thereafter in a temporally interleaved fashion.


Figure 4. An example Annodex file, time increasing from left to right

The choice of a binary encapsulation format for Annodex files was one of the
challenges of the CMWeb project. We examined several different encapsulation formats
and came up with a list of requirements:
1.	the format had to provide framing for binary media data and XML markup,
2.	temporal synchronisation between media data and XML markup was necessary,
3.	the format had to provide a temporal track paradigm for interleaving,
4.	the format had to have streaming capabilities,
5.	for fault tolerance, resynchronisation after a parsing error should be simple,
6.	seeking landmarks were necessary to allow random access,
7.	the framing information should only yield a small overhead, and
8.	the format needed to be simple to allow handling on devices with limited capabilities.

Hierarchical formats like MPEG-4 and QuickTime did not qualify due to requirement
2, making it hard to also provide for requirements 3 and 4. An XML-based format also did not qualify because binary data cannot be included in XML tags without having to encode it in base64, which inflates the data size by about 30% and creates unnecessary additional encoding and decoding steps, thus violating requirements 1 and 7. After some discussion, we adopted the Ogg encapsulation format (Pfeiffer, 2003) developed by Xiphophorus (Xiphophorus, 2004). This gave us the additional advantage of having Open Source libraries available on all major platforms, greatly simplifying the task of rolling out format support.

Figure 5. Annodex file creation process

The Continuous Media Markup Language CMML


CMML is simple to understand, as it is HTML-like, though oriented towards a
segmentation of continuous data along its time axis into clips. A sample CMML file is
given below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE cmml SYSTEM "cmml.dtd">
<cmml>
  <stream timebase="0" utc="20040114T153500.00Z">
    <import src="galaxies.mpg" contenttype="video/mpeg" start="npt:0"/>
  </stream>
  <head>
    <title>Hidden Galaxies</title>
    <meta name="author" content="CSIRO"/>
  </head>
  <clip id="findingGalaxies" start="15">
    <a href="http://www.aao.gov.au/galaxies.anx#radio">
      Related video on Detection of Galaxies</a>
    <img src="galaxy.jpg"/>
    <desc>What's out there? ...</desc>
    <meta name="KEYWORDS" content="Radio Telescope, Galaxies"/>
  </clip>
</cmml>
As the sample file shows, CMML has XML syntax, consisting of three main types
of tags: At most one stream tag, exactly one head tag, and an arbitrary number of clip tags.
The stream tag is optional. It describes the input bitstreams necessary for the
creation of an Annodex file in the import tags, and gives some timing information
necessary for the output Annodex file. The imported bitstreams will be interleaved into
multiple tracks of media, even if they start at different time offsets and need to be
temporally realigned through the start attribute.
The markup of a head tag in the CMML document contains information about the
complete media document. Its essential information comprises:

•	structured textual annotations in meta tags, and
•	unstructured textual annotations in the title tag.

Structured annotations are name-value pairs which can follow a new or existing
metadata annotation scheme such as the Dublin Core (Dublin Core Metadata Initiative,
2003).
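As an illustration, the head element of the earlier sample could carry Dublin Core style annotations as name-value pairs. The DC.* names used below are only one possible naming convention and are not mandated by CMML; the values are invented for the example.

<head>
  <title>Hidden Galaxies</title>
  <!-- structured annotations following the Dublin Core element set -->
  <meta name="DC.creator" content="CSIRO"/>
  <meta name="DC.date" content="2004-01-14"/>
  <meta name="DC.subject" content="radio astronomy"/>
</head>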
The markup of a clip tag contains information on the various clips or fragments of
the media:

•	Anchor points provide entry points into the media document that a URI can refer to. Anchor points identify the start time and the name (id) of a clip. This enables URIs to refer to Annodex clips by name.
•	URI hyperlinks can be attached to a clip, linking out to any other place a URI can point to, such as clips in other annodexed media or HTML pages. These are given by the a (anchor) tag with its href attribute. Furthermore, the a tag contains a textual annotation of the link, the so-called anchor text (in the example above: "Related video on Detection of Galaxies"), specifying why the clip is linked to a given URI. Note that this is similar to the a tag in HTML.
•	An optional keyframe in the img tag provides a representative image for the clip and enables display of a story board for Annodex files.
•	Unstructured textual annotations in the desc tags provide for searchability of Annodex files. Unstructured annotation is free text that describes the clip itself.

Each clip belongs to a specific set of temporally non-overlapping clips that make
up one track of annotations for a time-continuous data file. The track attribute of a clip
provides this attribution; if it is not specified, the clip belongs to the default track.
Using the above sample CMML file for authoring Annodex, the result will be a
galaxies.anx file of the form given in Figure 6.

Specifying Time Segments and Clips in URIs


Linking to Time Segments and Clips in URIs
A URI points to a Web resource, and is the primary mechanism on the Web to reach
information. Time-continuous Web resources are typically large data files. Thus, when
a Web user wants to link to the exact segment of interest within the time-continuous
resource, it is desirable that only that segment is transferred. This reduces network load
and user waiting time.
No standardised scheme is currently available to directly link to segments of
interest in a time-continuous Web resource. However, addressing of subparts of Web
resources is generally achieved through URI query specifications. Therefore, we defined
a query scheme to allow direct addressing of segments of interest in Annodex files.
Two fundamentally different ways of addressing information in an Annodex resource are necessary: addressing of clips and addressing of time offsets or time
segments.


Figure 6. Annodex file created from the sample CMML file

Linking to Clips
Clips in Annodex files are identified by their id attribute. Thus, accessing a named
clip in an Annodex (and, for that matter, a CMML) file is achieved with the following CGI
conformant query parameter specification:
id=clip_id
Examples for accessing a clip in the above given sample CMML and Annodex files
are:
http://www.annodex.net/galaxies.cmml?id=findingGalaxies
http://www.annodex.net/galaxies.anx?id=findingGalaxies
On the Annodex server, the CMML and Annodex resources will be pre-processed
as a result of this query before being served out: the file header parts will be retained,
the time basis will be adjusted and the queried clip data will be concatenated at the end
to regain conformant file formats.

Linking to Time Segments


It is also desirable to be able to address any arbitrary time segment of an Annodex
or CMML file. This is again achieved with a CGI conformant query parameter specification:
t=[time-scheme:]time_interval


Available time schemes are npt for normal play time, different smpte specifications
of the Society of Motion Picture and Television Engineers (SMPTE), and clock for a
Universal Time Code (UTC) time specification. For more details see the specification
document (Pfeiffer, Parker, & Pang, 2003c).
Examples for requesting one or several time intervals from the above given sample
CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml?t=85.28
http://www.annodex.net/galaxies.anx?t=npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx?t=smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx?t=clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time
interval covered from that time point onwards until the end of the stream.
The same pre-processing as described above will be necessary on the Annodex
server.

Restricting Views to Time Segments and Clips in URIs


Aside from the query mechanism, URIs also provide a mechanism to address
subparts of Web resources locally on a Web client: URI fragment specifications. We have
found that fragments are a great mechanism to restrict views on Annodex files to a specific
subpart of the resource, e.g. when viewing or editing a temporal subpart of an Annodex
document. Again, two fundamentally different ways of restricting a time-continuous
resource are required: views on a clip and views on time segments.

Views on Clips
Restricting the view on an Annodex (or CMML) file to a named clip makes use of
the value of the id attribute of the clip in a fragment specification:
#clip_id
Examples for local clip views for the above given sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml#findingGalaxies
http://www.annodex.net/galaxies.anx#findingGalaxies
The Web client that is asked for such a resource will ask the Web server for the
complete resource and perform its application-specific operation on the clip only. This
may for example result in a sound editor downloading a complete sound file, then
selecting the named clip for further editing. An Annodex browser would naturally behave
analogously to an existing Web browser that receives an HTML page with a fragment offset:
it will fast-forward to the named clip as soon as that clip has been received.

Views on Time Segments


Analogously to clip views, views can be restricted to time intervals with the
following specification:

#[time-scheme:]time_interval
Examples for restrictions to one or several time intervals from the above given
sample CMML and Annodex files are:
http://www.annodex.net/galaxies.cmml#85.28
http://www.annodex.net/galaxies.anx#npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx#smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx#clock:20040114T153045.25Z
Where only a single time point is given, this is interpreted to relate to the time
interval covered from that time point onwards until the end of the stream. The same usage
examples as described above apply in this case, too. Specifying several time segments
may make sense only in specific applications, such as an editor, where an unconnected
selection for editing may result.

FEATURES OF ANNODEX
While developing the Annodex technology, we discovered that the Annodex file
format addresses many challenges of media research that were not part of the original
goals of its development but emerged serendipitously. Some of these are discussed
briefly in this section.

Multitrack Media File Format


The Annodex file format is based on the Xiph.org Ogg file format (Pfeiffer, 2003)
which allows multiple time-continuous data tracks to be encapsulated in one interleaved
file format. We have extended the file format such that it can be parsed and handled
without having to decode any of the data tracks themselves, making Annodex a generic
multitrack media file format. To that end we defined a generic data track header page which
includes a Content-type field that identifies the codec in use and provides some general
attributes of the track such as its temporal resolution. The multitrack file format now has
three parts:
1.	Data track identifying header pages (primary header pages)
2.	Codec header pages (secondary header pages)
3.	Data pages

For more details refer to the Annodex format specification document (Pfeiffer, Parker, & Pang, 2003b).
A standardised multitrack media format is currently non-existent; many applications, amongst them multitrack audio editors, will be able to take advantage of it, especially since the Annodex format also allows inclusion of arbitrary meta information.


Multitrack Annotations
CMML and Annodex have been designed to provide a means of annotating and
indexing time-continuous data files by structuring their time-line into regions of interest
called clips. Each clip may have structured and unstructured annotations, a hyperlink and
a keyframe. A simple partitioning however does not allow for several different, potentially
overlapping subdivisions of the time-line into clips. After considering several different
solutions for such different subdivisions, we decided to adopt a multitrack paradigm for annotations as well:

•	every clip of an Annodex or CMML file belongs to one specific annotation track,
•	clips within one annotation track cannot overlap temporally,
•	clips on different tracks can overlap temporally as needed,
•	the attribution of a clip to a track is specified through its track attribute; if it is not given, it is attributed to a default track.

This is a powerful concept and can easily be represented in browsers by providing a choice of the track that is visible.
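A hypothetical extract illustrates the idea. The two annotation tracks below, named "chapters" and "speakers" purely for this example, partition the same time-line independently, so their clips may overlap across tracks but not within a track:

<clip id="introduction" track="chapters" start="0">
  <desc>Introduction to the galaxy survey</desc>
</clip>
<clip id="narrator" track="speakers" start="0">
  <desc>Narrator sets the scene</desc>
</clip>
<clip id="astronomer" track="speakers" start="42">
  <desc>Astronomer explains the radio telescope data</desc>
</clip>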

Internationalisation Support
CMML and Annodex have also been designed to be language-independent and
provide full internationalisation support. There are two issues to consider for text in
CMML elements: different character sets and different languages.
As CMML is an XML markup language, different character sets are supported through the encoding attribute of the XML processing instruction, which contains a file-specific character set (World Wide Web Consortium, 2000). A potentially differing character set for an imported media file will be specified in the contenttype attribute of the import tag as a parameter to the MIME type.
Any tag or attribute that could end up containing text in a different language to the other tags may specify its own language. This is only necessary for tags that contain human-readable text. The language is specified in the lang and dir attributes.
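A brief sketch of how this might look in practice follows; the exact placement of the lang and dir attributes shown here is illustrative rather than normative, and the German description text is invented for the example.

<?xml version="1.0" encoding="UTF-8"?>
...
<head>
  <title lang="en">Hidden Galaxies</title>
</head>
<clip id="findingGalaxies" start="15">
  <!-- a description in a language other than the rest of the document -->
  <desc lang="de" dir="ltr">Was ist da draussen?</desc>
</clip>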

Search Engine Support


Web search engines are powerful tools to explore the textual information published
on Web servers. The principle they work from is that crawling hyperlinks that they find
within known Web pages will lead them to more Web pages, and eventually to most of
the Web's content. For all Web resources they can build a search index of their textual
contents and use it for retrieval of a hyperlink in response to a search query.
With binary time-continuous data files, indexing was previously not possible.
However, Annodex allows the integration of time-continuous data files into the crawling
and indexing paradigm of search engines through providing CMML files. A CMML file
represents the complete annotation of an Annodex file with HTML-style anchor tags in
its clip tags that enable crawling of Annodex files. Indexing can then happen on the level
of the complete file or on the level of individual clips. For the complete file, the tags in
the head element (title & meta tags) will be indexed, whereas for clips, the tags in the clip
elements (desc & meta tags) are necessary. The search results should then display
the descriptive content of the title and desc tags, and the representative keyframe given
in the img tag, to provide a nice visual overview of the retrieved clip (see Figure 2).
For retrieval of the CMML file encapsulated in an Annodex file from an Annodex
server, HTTP's content type negotiation is used. The search engine only needs to
include in its HTTP request an Accept header with a higher priority on text/x-cmml than
on application/x-annodex, and a conformant Annodex server will provide the extracted
CMML content for the given Annodex resource.

Caching Web Proxies


HTTP defines a mechanism to cache byte ranges of files in Web proxies. With
Annodex files, this mechanism can be used to also cache time intervals or clips of time-continuous data files, which are commonly large-size files. To that end, the Web server
must provide a mapping of the clip or the time intervals to byte ranges. Then, the Web
proxy can build up a table of ranges that it caches for a particular Annodex resource. If
it receives an Annodex resource request for a time interval or clip that it already stores,
it can serve the data straight out of its cache. Just like the Web server, it may, however,
need to process the resource before serving it: the file header parts need to be prepended
to the data, the timebase needs to be adjusted, and the queried data needs to be
concatenated at the end to regain a conformant Annodex file format. As Annodex allows
parsing of files without decoding, this is a fairly simple operation, enabling a novel use
of time-continuous data on the Web.

Dynamic Annodex Creation


Current Web sites use scripting extensively to automatically create HTML content with up-to-date information extracted from databases. As Annodex and CMML provide clip-structured media data, it is possible to create Annodex content by scripting. The annotation and indexing information of a clip may then be stored in a metadata database with a reference to the clip file. A script can then select clips by querying the metadata database and create an Annodex file on the fly. News bulletins and video blogs are application examples which can be built with such functionality.
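The following sketch suggests what a server-side script might emit for such a news bulletin. The clip ids, start times, URIs and descriptions are invented placeholders that the script would fill in from the metadata database; the element names are simply those of CMML as introduced earlier.

<cmml>
  <head>
    <title>Evening News Bulletin</title>
    <meta name="author" content="CSIRO"/>
  </head>
  <!-- one clip per story returned by the database query -->
  <clip id="story1" start="0">
    <desc>Opening headlines</desc>
  </clip>
  <clip id="story2" start="95">
    <a href="http://example.com/related.anx#background">Background report</a>
    <desc>Radio telescope upgrade announced</desc>
  </clip>
</cmml>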

RESEARCH OPPORTUNITIES
There are a multitude of open research opportunities related to Annodex, some of which are mentioned in this section.
Further research is necessary for exploring transcoding of metadata. A multitude of different markup languages for different kinds of time-continuous data already exist. CMML is a generic means to provide structured and unstructured annotations on clips and media files. Many of the existing ways to mark up media may be transcoded into CMML, and thus utilise the power of Annodex. Transcoding is simple to implement for markup that is also based on XML, because XSLT (World Wide Web Consortium, 1999C) provides a good tool to implement such scripts.
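As a sketch of such a transcoding script, the XSLT fragment below converts a hypothetical source format, assumed here to consist of annotation elements carrying id and start attributes and a text child, into CMML clip elements. The source element and attribute names are invented for the illustration; only the CMML output side follows the tags introduced earlier.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- wrap the transcoded clips in a CMML document -->
  <xsl:template match="/">
    <cmml>
      <head>
        <title><xsl:value-of select="/annotations/@title"/></title>
      </head>
      <xsl:apply-templates select="/annotations/annotation"/>
    </cmml>
  </xsl:template>
  <!-- map each source annotation onto a CMML clip -->
  <xsl:template match="annotation">
    <clip id="{@id}" start="{@start}">
      <desc><xsl:value-of select="text"/></desc>
    </clip>
  </xsl:template>
</xsl:stylesheet>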
Transcoding of metadata directly leads to the question of interoperability with other
standards. MPEG-7 is such a metadata standard for which it is necessary to explore
transcoding; however, MPEG-7 is more than just textual metadata and there may be more
to find. Similarly, interoperability of Annodex with standards like RTP/RTSP (Schulzrinne et al., 1996, 1998), DVD (dvdforum, 2000), MPEG-4 (MPEG Industry Forum, 2002), and MPEG-21 (Burnett et al., 2003) will need to be explored.
Another question that frequently emerges for Annodex is the question of annotating and indexing regions of interest within a video's imagery. We decided that structuring the spatial domain is out of scope for the Annodex technologies and may be re-visited at a future time. Annodex is very specifically designed to solve problems for time-continuous data, and that data may not necessarily have a spatial domain (such as audio data). Also, on different devices the possible interactions have to be very simple, so, for example,
However, it may be possible for specific applications to use image maps with CMML clips
to also hyperlink and describe in the spatial domain. This is an issue to explore in the
future.
Last but not least there are many opportunities to apply and extend existing
multimedia content analysis research to automatically determine CMML markup.

CONCLUSION
This chapter presented the Annodex technology, which brings the familiar searching and surfing capabilities of the World Wide Web to time-continuously sampled data
(Pfeiffer, Parker, and Schremmer, 2003). At the core of the technology are the Continuous
Media Markup Language CMML, the Annodex stream and file format, and clip- and time-referencing URI hyperlinks. These enable the extension of the Web to a Continuous
Media Web with Annodex browsers, Annodex servers, and Annodex search engines.
Annodex is, however, more powerful, as it also represents a standard multitrack media file
format with multitrack annotations, which can be cached on Web proxies and used in Web
server scripts for dynamic content creation. Therefore, Annodex and CMML present a
Web-integrated means for managing multimedia semantics.

ACKNOWLEDGMENT
The authors gratefully acknowledge the comments, contributions, and proofreading
of Claudia Schremmer, who is making use of the Continuous Media Web technology in
her research on metadata extraction of meeting recordings.

REFERENCES
Annodex.net (2004). Open standards for annotating and indexing networked media.
Retrieved January 2004 from http://www.annodex.net
Berners-Lee, T., Fielding, R., & Masinter, L. (1998, August). Uniform resource identifiers
(URI): Generic syntax. Internet Engineering Task Force, RFC 2396. Retrieved
January 2003 from http://www.ietf.org/rfc/rfc2396.txt
Berners-Lee, T., Fischetti, M. & Dertouzos, M.L. (1999). Weaving the Web: The original
design and ultimate destiny of the World Wide Web by its inventor. San Francisco:
Harper.

Burnett, I., Van de Walle, R., Hill, K., Bormans, J., & Pereira, F. (2003). MPEG-21: Goals
and achievements. IEEE Multimedia Magazine, Oct-Dec, 60-70.
Dimitrova, N., Zhang, H.-J., Shahraray, B., Sezan, I., Huang, T., & Zakhor, A. (2002).
Applications of video-content analysis and retrieval. IEEE Multimedia Magazine,
July-Sept, 42-55.
Dublin Core Metadata Initiative. (2003). The Dublin Core Metadata Element Set, v1.1.
February. Retrieved January 2004 from http://dublincore.org/documents/2003/
02/04/dces
dvdforum (2000, September). DVD Primer. Retrieved January 2004 from http://
www.dvdforum.org/tech-dvdprimer.htm
Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L., Leach, P., & Berners-Lee, T.
(1999, June). Hypertext Transfer Protocol HTTP/1.1. Internet Engineering Task
Force, RFC 2616. Retrieved January 2004 from http://www.ietf.org/rfc/rfc2616.txt
Martínez, J.M., Koenen, R., & Pereira, F. (2002). MPEG-7: The generic multimedia content
description standard. IEEE Multimedia Magazine, April-June, 78-87.
MPEG Industry Forum. (2002, February). MPEG-4 users frequently asked questions.
Retrieved January 2004 from http://www.mpegif.org/resources/mpeg4userfaq.php
NCSA HTTPd Development Team (1995, June). The Common Gateway Interface (CGI).
Retrieved January 2004 from http://hoohoo.ncsa.uiuc.edu/cgi/
Pfeiffer, S. (2003, May). The Ogg encapsulation format version 0. Internet Engineering
Task Force, RFC 3533. Retrieved January 2004 from http://www.ietf.org/rfc/
rfc3533.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003a). The Continuous Media Markup Language
(CMML), Version 2.0 (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-cmml-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003b). The Annodex annotation and indexing format
for time-continuous data files, Version 2.0 (work in progress). Internet Engineering
Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/
TR/draft-pfeiffer-annodex-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003c). Specifying time intervals in URI queries and
fragments of time-based Web resources (BCP) (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://
www.annodex.net/TR/draft-pfeiffer-temporal-fragments-02.txt
Pfeiffer, S., Parker, C., & Schremmer, C. (2003). Annodex: A simple architecture to enable
hyperlinking, search & retrieval of time-continuous data on the Web. Proceedings
5th ACM SIGMM International Workshop on Multimedia Information Retrieval
(MIR), Berkeley, California, November (pp. 87-93).
Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (1996, January). RTP: A
transport protocol for real-time applications. Internet Engineering Task Force, RFC
1889. Retrieved January 2004 from http://www.ietf.org/rfc/rfc1889.txt
Schulzrinne, H., Rao, A., & Lanphier, R. (1998, April). Real Time Streaming Protocol
(RTSP). Internet Engineering Task Force, RFC 2326. Retrieved January 2003 from
http://www.ietf.org/rfc/rfc2326.txt
World Wide Web Consortium (1999A). XML Path Language (XPath). W3C XPath,
November 1999. Retrieved January 2004 from http://www.w3.org/TR/xpath/


World Wide Web Consortium (1999B). HTML 4.01 Specification. W3C HTML, December
1999. Retrieved January 2004 from http://www.w3.org/TR/html4/
World Wide Web Consortium (1999C). XSL Transformations (XSLT) Version 1.0. W3C
XSLT, November 1999. Retrieved January 2004 from http://www.w3.org/TR/xslt/
World Wide Web Consortium (2000, October). Extensible Markup Language (XML) 1.0.
W3C XML. Retrieved January 2004 from http://www.w3.org/TR/2000/REC-xml20001006
World Wide Web Consortium (2001, August). Synchronized Multimedia Integration
Language (SMIL 2.0). W3C SMIL. Retrieved January 2004 from http://www.w3.org/
TR/smil20/
World Wide Web Consortium (2002, August). XML Pointer Language (XPointer). W3C
XPointer. Retrieved January 2004 from http://www.w3.org/TR/xptr/
Xiphophorus (2004). Building a new era of Open multimedia. Retrieved January 2004 from
http://www.xiph.org/


Chapter 8

Management of
Multimedia Semantics
Using MPEG-7
Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA

ABSTRACT

This chapter presents the ISO/IEC MPEG-7 Multimedia Content Description Interface
Standard from the point of view of managing semantics in the context of multimedia
applications. We describe the organisation and structure of the MPEG-7 Multimedia
Description Schemes, which are metadata structures for describing and annotating
multimedia content at several levels of granularity and abstraction. As we look at
MPEG-7 semantic descriptions, we realise they provide a rich framework for static
descriptions of content semantics. As content semantics evolves with interaction, the
human user will have to compensate for the absence of detailed semantics that cannot
be specified in advance. We explore the practical aspects of using these descriptions
in the context of different applications and present some pros and cons from the point
of view of managing multimedia semantics.

INTRODUCTION
MPEG-7 is an ISO/IEC Standard that aims at providing a standard way to describe
multimedia content, to enable fast and efficient searching and filtering of audiovisual
content. MPEG-7 has a broad scope to facilitate functions such as indexing, management, filtering, authoring, editing, browsing, navigation, and searching content descriptions. The purpose of the standard is to describe the content in a machine-readable format
for further processing determined by the application requirements.
Multimedia content can be described in many different ways depending on the
context, the user, the purpose of use and the application domain. In order to address the
description requirements of a wide range of applications, MPEG-7 aims to describe content
at several levels of granularity and abstraction to include description of features,
structure, semantics, models, collections and metadata about the content.
Initial research, focused on feature extraction techniques, influenced the description
of content at the perceptual feature level. Examples of visual features that can be extracted
using image-processing techniques are colour, shape and texture. Accordingly, there are
several MPEG-7 Descriptors (Ds) to describe visual features. Similarly there are a number
of low-level Descriptors to describe audio content at the level of spectral, parametric and
temporal features of an audio signal. While these Descriptors describe objective
measures of audio and visual features, they are inadequate for describing content at a higher semantic level, for example relationships among audio and visual descriptors within an image or over a video segment. This need is addressed through the construct called the Multimedia Description Scheme (MDS), also referred to simply as Description
Scheme (DS). Description schemes are designed to describe higher-level content
features such as regions, segments, objects and events, as well as metadata about the
content, its usage, and so forth. Accordingly, there are several groups or categories of
MDS tools.
An important factor that needs to be considered while describing audiovisual content is the recognition that humans interpret and describe the meaning of content in ways that go far beyond visual features and the cinematic constructs introduced in films. While such meanings and interpretations cannot be extracted automatically, because they are contextual, they can be described using free text descriptions. MPEG-7 handles this aspect through several description schemes that are based on structured
free text descriptions.
As our focus is on management of multimedia semantics, we look at MPEG-7 MDS
constructs from two perspectives: (a) the level of granularity offered while describing
content, and (b) the level of abstraction available to describe multimedia semantics. The
second section provides an overview of the MPEG-7 constructs and how they hang
together. The third section looks at MDS tools to manage multimedia semantics at
multiple levels of granularity and abstraction. The fourth section takes a look at the whole
framework from the perspective of different applications. The last section presents some
discussions and conclusions.

MPEG-7 CONTENT DESCRIPTION AND ORGANISATION
The main elements of MPEG-7 as described in the MPEG-7 Overview document
(Martínez, 2003) are a set of tools to describe the content, a language to define the syntax
of the descriptions, and system tools to support efficient storage and transmission,
execution and synchronization of binary encoded descriptions.


The Description Tools provide a set of Descriptors (D) that define the syntax and
the semantics of each feature, and a library of Description Schemes (DS) that specify the
structure and semantics of the relationships between their components, which may be both
Descriptors and Description Schemes. A description of a piece of audiovisual content
is made up of a number of Ds and DSs determined by the application. The description
tools can be used to create such descriptions which form the basis for search and
retrieval. A Description Definition Language (DDL) is used to create and represent the
descriptions. DDL is based on XML and hence allows the processing of descriptions
in a machine-readable format. Content descriptions created using these tools could be
stored in a variety of ways. The descriptions could be physically located with the content
in the same data stream or the same storage system, allowing efficient storage and
retrieval. However, there could be instances where content and its descriptions may not
be colocated. In such cases, we need effective ways to synchronise the content and its
Descriptions. System tools support multiplexing of description, synchronization issues,
transmission mechanisms, file format, and so forth. Figure 1 (Martnez, 2003) shows the
main MPEG-7 elements and their relationships.
MPEG-7 has a broad scope and aims to address the needs of several types of
applications (Vetro, 2001). MPEG-7 descriptions of content could include

•	Information describing the creation and production processes of the content (director, title, short feature movie).
•	Information related to the usage of the content (copyright pointers, usage history, and broadcast schedule).
•	Information on the storage features of the content (storage format, encoding).
•	Structural information on spatial, temporal or spatiotemporal components of the content (example: scene cuts, segmentation in regions, region motion tracking).
•	Information about low-level audio and visual features in the content (example: colors, textures, sound timbres, melody description).
•	Conceptual information of the reality captured by the content (example: objects and events, interactions among objects).
•	Information about how to browse the content in an efficient way (example: summaries, variations, spatial and frequency subbands).
•	Information about collections of objects.
•	Information about the interaction of the user with the content (user preferences, usage history).

Figure 1. MPEG-7 elements: the Description Definition Language, Descriptors and Description Schemes, and their instantiation into descriptions for encoding and delivery

Figure 2. Overview of MPEG-7 multimedia description schemes (Martínez, 2003): basic elements, content description, content management, content organization, navigation and access, and user interaction

MPEG-7 Multimedia Description Schemes (MDS) are metadata structures for


describing and annotating audiovisual content at several levels of granularity and
abstraction (to describe what is in the content) and metadata (a description about the
content). These Multimedia Description Schemes are described using XML to support
readability at the human level and processing capability at the machine level.
MPEG-7 Multimedia DSs are categorised and organised into the following groups:
Basic Elements, Content Description, Content Management, Content Organization,
Navigation and Access, and User Interaction. Figure 2 shows the different categories and
presents a big picture view of Multimedia DSs.

Basic Elements
Basic elements provide the fundamental constructs in defining MPEG-7 DSs. This
includes basic data types and a set of extended data types, such as vectors and matrices, to
describe the features and structural aspects of the content. The basic elements also
include constructs for linking media files, localising specific segments, describing time
and temporal information, place, individual(s), groups, organizations, and other textual
annotations.

Content Description
MPEG-7 DSs for content description are organised into two categories: DSs for
describing structural aspects, and DSs for describing conceptual aspects of the content.
The structural DSs describe audiovisual content at a structural level organised around
a segment. The Segment DS represents the spatial, temporal or spatiotemporal structure
of an audiovisual segment. The Segment DS can be organised into a hierarchical structure
to produce a table of contents for indexing and searching audiovisual content in a
structured way. The segments can be described at different perceptual levels using
Descriptors for colour, texture, shape, motion, and so on. The conceptual aspects are
described using the Semantic DS, which covers objects, events and abstract concepts. The
structure DSs and semantic DSs are related by a set of links that relate different semantic
concepts to content structure. The links relate semantic concepts to instances within the
content described by the segments. Many of the content description DSs are linked to
Ds which are, in turn, linked to DSs in a content management group.

Content Management
MPEG-7 DSs for content management include tools to describe information pertaining to creation and production, media coding, storage and file formats, and content
usage.
Creation information provides information related to the creators of the content,
creation locations, dates, other related material, and so forth. These could be textual
annotations or other multimedia content such as an image of a logo. This also includes
information related to classification of the content from a viewer's point of view.
Media information describes information including location, storage and delivery
formats, compression and coding schemes, and version history based on media profiles.
Usage information describes information related to usage rights, usage record, and
related financial information. While rights management is not handled explicitly, the
Rights DS provides references in the form of unique identifiers to external rights owners
and regulatory authorities.

Navigation and Access


The DSs under this category facilitate browsing and retrieval of audiovisual
content. There are DSs that facilitate browsing in different ways based on summaries,
partitions and decompositions and other variations. The Summary DSs support hierarchical and sequential navigation modes. Hierarchical summaries can be described at
different levels of granularity, moving from coarse high-level descriptions to more
detailed summaries of audiovisual content. Sequential summaries provide a sequence of
images and frames synchronised with audio, and facilitate a slide show style of browsing
and navigation.

Content Organisation
The DSs under this category facilitate organising and modeling collections of
audiovisual content descriptions. The Collection DS helps to describe collections at the
level of objects, segments, and events, based on common properties of the elements in
the collection.

User Interaction
This set of DSs describes user and usage preferences and usage history to facilitate
personalization of content access, presentation and consumption.
For more details of the full list of DSs, the reader is referred to the MPEG-7 URL at
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm and Manjunath et al.
(2002).

REPRESENTATION OF
MULTIMEDIA SEMANTICS
In the previous section we described the MPEG-7 constructs and the method of
organising the MDS from a functional perspective, as presented in various official MPEG-7 documents. In this section we look at the Ds and DSs from the perspective of addressing
multimedia semantics and its management. We look at the levels of granularity and
abstraction that MPEG-7 Ds and DSs are able to support. The structural aspects of
content description are meant to describe content at different levels of granularity
ranging from visual descriptors to temporal segments. The semantic DSs are developed
for the purpose of describing content at several abstract levels in free text, but in a
structured form.
MPEG-7 deals with content semantics by considering narrative worlds. Since
MPEG-7 targets description of multimedia content, which is mostly narrative in nature,
it is reasonable for it to view the participants, background, context, and all the other
constituents of a narrative as the narrative world. Each narrative world can exist as a
distinct semantic description. The components of the semantic descriptions broadly
consist of entities that inhabit the narrative worlds, their attributes, and their relationships with each other.

Levels of Granularity
Let us consider a video of a play that consists of four acts. Then we can segment
the video temporally into four parts corresponding to the acts. Each act can be further
segmented into scenes. Each scene can be segmented into shots, where a shot is defined
as a temporally continuous segment of video captured by a single camera. The shots can
in turn be segmented into frames. Finally, each frame can be segmented into spatial
regions. Note that each level of the hierarchy lends itself to meaningful semantic
description. Each level of granularity lends itself to distinctive Ds. For instance, we
could use the texture descriptor to describe the texture of spatial regions. Such a
description is clearly confined to the lowest level of the hierarchy we just described. The
2-D shape descriptors are similarly confined by definition. Each frame can also be
described using the scalable color descriptor, which is essentially a color histogram. A
shot consisting of several frames, however, has to be described using the group of frames
color descriptor, which aggregates the histograms of all the constituent frames using, for
instance, the median. Note that while it is possible to extend the color description to a
video segment of any length of time, it is most meaningful at the shot level and below.
The MotionActivity descriptor can be used to meaningfully describe any length of video,

since it merely captures the pace or action in the video. Thus, a talking head segment
would be described as "low action" while a car chase scene would be described as "high action". A one-hour movie that mostly consists of car chases could reasonably be described as "high action". The motion trajectory descriptor, on the other hand, is
meaningful only at the shot level and meaningless at any lower or higher level. In other
words, each level of granularity has its own set of appropriate descriptors that may or
may not be appropriate at all levels of the hierarchy. The aim of such description is to
enable content retrieval at any desired level of granularity.
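A much-simplified sketch of how such a temporal hierarchy might be expressed with MPEG-7 description tools is given below. The element names approximate the MPEG-7 MDS and Visual schemas, but the fragment is illustrative rather than schema-valid, and the identifiers, time values and descriptor values are invented.

<VideoSegment id="act1">
  <TemporalDecomposition>
    <VideoSegment id="act1_scene1">
      <TemporalDecomposition>
        <VideoSegment id="act1_scene1_shot3">
          <!-- a shot-level descriptor capturing the overall pace of the shot -->
          <VisualDescriptor xsi:type="MotionActivityType">
            <Intensity>2</Intensity>
          </VisualDescriptor>
          <MediaTime>
            <MediaTimePoint>T00:05:12</MediaTimePoint>
            <MediaDuration>PT8S</MediaDuration>
          </MediaTime>
        </VideoSegment>
      </TemporalDecomposition>
    </VideoSegment>
  </TemporalDecomposition>
</VideoSegment>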

Levels of Abstraction
Note that in the previous section, the hierarchy stemmed from the temporal and
spatial segmentation, but not from any conceptual point of view. Therefore such a
description does not let us browse the content at varying levels of semantic abstraction
that may exist at a given constant level of temporal granularity. For instance, we may be
only interested in dramatic dialogues between character A and character B in one case,
and in any interactions between character A and character B in another. Note that the
former is an instance of the latter and therefore is at a lower level of abstraction. In the
absence of multilayered abstraction, our content browsing would have to be either
excessively general through restriction to the highest level of abstraction, or excessively
particular through restriction to the lowest level of abstraction. Note that to a human
being, the definition of "too general" and "too specific" depends completely on the need
of the moment, and therefore is subject to wide variation. Any useful representation of
the content semantics has to therefore be at as many levels of abstraction as possible.
Returning to the example of interactions between the characters A and B, we can
see that the semantics consists of the entities A and B, with their names being their
attributes and whose relationship with each other consists of the various interactions
they have with each other. MPEG-7 considers two types of abstraction. The first is media
abstraction, that is, a description that can describe more than one instance of similar
content. We can see that the description "all interactions between characters A and B"
is an example of media abstraction since it describes all instances of media in which A
and B interact. The second type of abstraction is formal abstraction, in which the pattern
common to a set of multimedia examples contains placeholders. The description "interaction between any two of the characters in the play" is an example of such formal
abstraction. Since the definition of similarity depends on the level of detail of the
description and the application, we can see that these two forms of abstraction allow us
to accommodate a wide range of abstraction from the highly abstract to the highly
concrete and detailed.
Furthermore, MPEG-7 also provides ways to describe abstract quantities such as
properties, through the Property element, and concepts, through the Concept DS. Such
quantities do not result from an abstraction of an entity, and so are treated separately.
For instance, the beauty of a painting is a property and is not the result of somehow
generalizing its constituents. Concepts are defined as collections of properties that
define a category of entities but do not completely characterize it.
Semantic entities in MPEG-7 mostly consist of narrative worlds, objects, events,
concepts, states, places and times. The objects and events are represented by the Object
and Event DSs respectively. The Object DS and Event DS provide abstraction through
a recursive definition that allows, for example, subcategorization of objects into subobjects.
In that way, an object can be represented at multiple levels of abstraction. For instance,
a continent could be broken down into continent-country-state-district, and so forth, so
that it can be described at varying levels of semantic granularity. Note that the Object
DS accommodates attributes so as to allow for the abstraction we mentioned earlier, that
is, abstraction that is related to properties rather than generalization of constituents such
as districts. The hospitable nature of the continent's inhabitants, for instance, cannot
result from abstraction of districts to states to countries, and so forth.
Semantic entities can be described by labels, by a textual definition, or in terms of
properties or of features of the media or segments in which they occur. The SemanticBase
DS contains such descriptive elements. The AbstractionLevel data type in the
SemanticBase DS describes the kind of abstraction that has been performed in the
description of the entity. If it is not present, then the description is considered concrete.
If the abstraction is a media abstraction, then the dimension of the AbstractionLevel
element is set to zero. If a formal abstraction is present, the dimension of the element is
set to 1 or higher. The higher the value, the higher the abstraction. Thus, a value of
2 would indicate an abstraction of an abstraction.
The Relation DS rounds off the collection of representation tools for content
semantics. Relations capture how semantic entities are connected with each other. Thus,
examples of relations are "doctor-patient," "student-teacher," and so forth. Note that
since each of the entities in the relation lends itself to multiple levels of abstraction and
the relations in turn have properties, there is further abstraction that results from
relations.

APPLICATIONS
As we cover MPEG-7 semantic descriptions, we realize that they provide a rich
framework for static description of content semantics. Such a framework has the inherent
problem of providing an "embarrassment of riches," which makes the management of the
browsing very difficult. Since MPEG-7 content semantics is heavily graph-oriented, it is clear
that it does not scale well as the number of concepts/events/objects goes up. Creation
of a deep hierarchy through very fine semantic subdivision of the objects would result
in the same problem of computational intractability. As the content semantic representation is pushed more and more towards a natural language representation, evidence from
natural language processing research indicates that the computational intractability will
be exacerbated. In our view, therefore, the practical utility of such representation is
restricted to cases in which either the concept hierarchies are not unmanageably broad,
or the concept hierarchies are not unmanageably deep, or both.
Our view is that in interactive systems, the human users will compensate for the
shallowness or narrowness of the concept hierarchies through their domain knowledge.
Since humans are known to be quick at sophisticated processing of data sets of small size,
the semantic descriptions should be at a broad scale to help narrow down the search
space. Thereafter, the human can compensate for the absence of detailed semantics
through use of low-level feature-based video browsing techniques such as video
summarization. Therefore, MPEG-7 semantic representations would be best used in

applications in which a modest hierarchy can help narrow down the search space
considerably. Let us consider some candidate applications.

Educational Applications
At first glance, since education is, after all, intended to be systematic acquisition
of knowledge, a semantics-based description of all the content seems reasonable. Our
experience indicates that restriction of the description to a narrow topic allows for a rich
description within the topic of research and makes for a successful learning experience
for the student. Any application in which the intention is to learn abstract concepts, an
overly shallow concept hierarchy will be a hindrance. Hence, our preference for narrowing the topic itself to limit the breadth of the representation so as to buy some space for
a deeper representation. The so called edutainment systems fall in the same general
category with varying degrees of compromise between the richness of the descriptions
and the size of the database. Such applications include tourist information, cultural
services, shopping, social, film and radio archives, and so forth.

Information Retrieval Applications


Applications that require retrieval from an archive based on a specific query rather
than a top-down immersion in the content typically involve very large databases in
which even a small increase in the breadth and depth of the representation would lead
to an unacceptable increase in computation. Such applications include journalism,
investigation services, professional film and radio archives, surveillance, remote sensing, and so forth. Furthermore, in such applications, the accuracy requirements are much
more stringent. Our view is that only a modest MPEG-7 content semantics representation
would be feasible for such applications. However, even a modest semantic representation would be a vast improvement over current retrieval approaches.

Generation of MPEG-7 Semantic Meta-Data


It is also important to consider how the descriptions would be generated in the first
place. Given the state of the art, the semantic metadata would have to be manually
generated. That is yet another challenge posed by large-scale systems. Once again, the
same strategy of either tackling modest databases, creating modest representations,
or a combination of both would be reasonable. Moreover, if the generation of the
metadata is integrated with its consumption in an interactive application, the user could
enhance the metadata over time. This is perhaps a challenge for future researchers.

DISCUSSION AND CONCLUSION


Managing multimedia content has evolved around textual descriptions and/or
processing audiovisual information and indexing content using features that can be
automatically extracted. The question is: how do we retrieve the content in a meaningful
way? How can we correlate users' semantics with archivists' semantics? Even though
MPEG-7 DSs provide a framework to support such descriptions, MPEG-7 is still a
standard for describing features of multimedia content. Although there are DSs to
describe the metadata related to the content, there is still a gap in describing semantics
that evolves with interaction and the users' context. There is a static aspect to the
descriptions, which limits the adaptive flexibility needed for different types of applications.
Nevertheless, a standard way to describe the relatively unambiguous aspects of content
does provide a starting point for many applications where the focus is content management.
The generic nature of MPEG-7 descriptions can be both a strength and a weakness.
The comprehensive library of DSs is aimed at supporting a large number of applications,
and there are several tools to support the development of descriptions required for a
particular application. However, this requires a deep knowledge of MPEG-7, and the large
scope becomes a weakness, as it is impossible to pick and choose from a huge
library without understanding the implications of the choices made. As discussed in
section 4, often a modest set of content descriptions, DSs and elements may suffice for
a given application. This requires an application developer to first develop the
descriptions in the context of the application domain, determine the DSs to support the
descriptions, and then identify the required elements in the DSs. This is an involved
process and cannot be viewed in isolation of the domain and application context. As
MPEG-7 compliant applications start to be developed, it is possible that there could be
context-dependent elements and DSs that are essential to the application, but not
described in the standard, because the application context cannot be predetermined
during the definition stage.
In conclusion, these are still early days for MPEG-7 and its deployment in
managing the semantic aspects of multimedia applications. As the saying goes, "the
proof of the pudding is in the eating," and the success of the applications will determine
the success of the standard.



Section 3
User-Centric Approach
to Manage Semantics


Chapter 9

Visualization, Estimation
and User Modeling for
Interactive Browsing of
Personal Photo Libraries
Qi Tian, University of Texas at San Antonio, USA
Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA
Neal Lesh, Mitsubishi Electric Research Laboratories, USA
Chia Shen, Mitsubishi Electric Research Laboratories, USA
Thomas S. Huang, University of Illinois, USA

ABSTRACT

Recent advances in technology have made it possible to easily amass large collections
of digital media. These media offer new opportunities and create great demand for new
digital content user-interface and management systems which can help people construct,
organize, navigate, and share digital collections in an interactive, face-to-face social
setting. In this chapter, we have developed a user-centric algorithm for visualization
and layout for content-based image retrieval (CBIR) in large photo libraries. Optimized
layouts reflect mutual similarities as displayed on a two-dimensional (2D) screen,
hence providing a perceptually intuitive visualization as compared to traditional
sequential one-dimensional (1D) content-based image retrieval systems. A framework
for user modeling also allows our system to learn and adapt to a user's preferences. The
resulting retrieval, browsing and visualization can adapt to the user's (time-varying)
notions of content, context and preferences in style and interactive navigation.

INTRODUCTION
Personal Digital Historian (PDH) Project
Recent advances in digital media technology offer opportunities for new story-sharing experiences beyond the conventional digital photo album (Balabanovic et al.,
2000; Dietz & Leigh, 2001). The Personal Digital Historian (PDH) project is an ongoing
effort to help people construct, organize, navigate and share digital collections in an
interactive multiperson conversational setting (Shen et al., 2001; Shen et al., 2003). The
research in PDH is guided by the following principles:
1. The display device should enable natural face-to-face conversation: not forcing everyone to face in the same direction (desktop) or at their own separate displays (hand-held devices).
2. The physical sharing device must be convenient and customary to use: helping to make the computer "disappear."
3. Easy and fun to use across generations of users: minimizing time spent typing or formulating queries.
4. Enabling interactive and exploratory storytelling: blending authoring and presentation.

Current software and hardware do not meet our requirements. Most existing
software in this area provides users with either powerful query methods or authoring
tools. In the former case, the users can repeatedly query their collections of digital
content to retrieve information to show someone (Kang & Shneiderman, 2000). In the
latter case, a user experienced in the use of the authoring tool can carefully craft a story
out of his or her digital content to show or send to someone at a later time. Furthermore,
current hardware is also lacking. Desktop computers are not suitably designed for group,
face-to-face conversation in a social setting, and handheld story-telling devices have
limited screen sizes and can be used only by a small number of people at once. The
objective of the PDH project is to take a step beyond.
The goal of PDH is to provide a new digital content user-interface and management
system enabling face-to-face casual exploration and visualization of digital contents.
Unlike conventional desktop user interfaces, PDH is intended for multiuser collaborative
applications on single display groupware. PDH enables casual and exploratory retrieval,
and interaction with and visualization of digital contents.
We design our system to work on a touch-sensitive, circular tabletop display
(Vernier et al., 2002), as shown in Figure 1. The physical PDH table that we use is a
standard tabletop with a top projection (either ceiling mounted or tripod mounted) that
displays on a standard whiteboard as shown in the right image of Figure 1. We use two
Mimio (www.mimio.com/meet/mimiomouse) styluses as the input devices for the first set


Figure 1. PDH table: (a) an artistic rendering of the PDH table (designed by Ryan Bardsley, Tixel HCI, www.tixel.net) and (b) the physical PDH table

of user experiments. The layout of the entire tabletop display consists of (1) a large story-space area encompassing most of the tabletop up to the perimeter, and (2) one or more
narrow arched control panels (Shen et al., 2001). The present PDH table is
implemented using our DiamondSpin (www.merl.com/projects/diamondspin) circular
table Java toolkit. DiamondSpin is intended for multiuser collaborative applications
(Shen et al., 2001; Shen et al., 2003; Vernier et al., 2002).
The conceptual model of PDH is to focus on developing content organization and
retrieval metaphors that can be easily comprehended by users without distracting from
the conversation. We adopt a model of organizing the materials using the four questions
essential to storytelling: who, when, where, and what (the four Ws). We do not currently


Figure 2. An example of navigation by the four-Ws model (Who, When, Where, What)

support "why," which is also useful for storytelling. Control panels located on the perimeter
of the table contain buttons labeled "people," "calendar," "location," and "events,"
corresponding to these four questions. When a user presses the "location" button, for
example, the display on the table changes to show a map of the world. Every picture in
the database that is annotated with a location will appear as a tiny thumbnail at its
location. The user can pan and zoom in on the map to a region of interest, which increases
the size of the thumbnails. Similarly, by pressing one of the other three buttons, the user
can cause the pictures to be organized by the time they were taken along a linear timeline,
the people they contain, or the event keywords with which the pictures were annotated.
We assume the pictures are partially annotated. Figure 2 shows an example of navigation
of a personal photo album by the four-Ws model. Adopting this model allows users to
think of their documents in terms of how they would like to record them as part of their
history collection, not necessarily in a specific hierarchical structure. The user can make
selections among the four Ws and PDH will automatically combine them to form rich
Boolean queries implicitly for the user (Shen et al., 2001; Shen, Lesh, Vernier, Forlines,
& Frost, 2002; Shen et al., 2003; Vernier et al., 2002).
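As an illustration of how selections on the four Ws could be combined into an implicit Boolean (conjunctive) query over partially annotated pictures, here is a minimal, hedged Python sketch; the annotation fields and matching rules are assumptions made for the example and do not describe the actual PDH implementation.

```python
from typing import Dict, Iterable, List, Optional, Set

# A photo carries (possibly partial) annotations for the four Ws, e.g.
# {"who": {"A", "B"}, "when": 2003, "where": "Sydney", "what": {"birthday"}}
Photo = Dict[str, object]

def matches(photo: Photo,
            who: Optional[Set[str]] = None,
            when: Optional[range] = None,
            where: Optional[str] = None,
            what: Optional[Set[str]] = None) -> bool:
    """AND together only the Ws the user has actually selected; unselected Ws are ignored."""
    if who and not who <= photo.get("who", set()):
        return False
    if when and photo.get("when") not in when:
        return False
    if where and photo.get("where") != where:
        return False
    if what and not what & photo.get("what", set()):
        return False
    return True

def four_w_query(photos: Iterable[Photo], **selections) -> List[Photo]:
    """Implicit Boolean query formed from the user's four-W selections."""
    return [p for p in photos if matches(p, **selections)]

# e.g. four_w_query(album, who={"A", "B"}, when=range(2000, 2004))
```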
The PDH project combines and extends research largely in two areas: (i) human-computer interaction (HCI) and interface (the design of the shared-display devices, user
interface for storytelling and online authoring, and story-listening) (Shen et al., 2001, 2002,
2003; Vernier et al., 2002); (ii) content-based information visualization, presentation and
retrieval (user-guided image layout, data mining and summarization) (Moghaddam et al.,
2001, 2002, 2004; Tian et al., 2001, 2002). Our work has been done along these two lines.
The work by Shen et al. (2001, 2002, 2003) and Vernier et al. (2002) focused on the HCI
and interface design issues of the first research area. The work in this chapter is within the
context of PDH but focuses on the visualization, smart layout, user modeling and retrieval
part. In this chapter, we propose a novel visualization and layout algorithm that can
enhance informal storytelling using personal digital data such as photos, audio and
video in a face-to-face social setting. A framework for user modeling also allows our
system to learn and adapt to a user's preferences. The resulting retrieval, browsing and
visualization can adapt to the user's (time-varying) notions of content, context and
preferences in style and interactive navigation.

Related Work
In content-based image retrieval (CBIR), most current techniques are restricted to
matching image appearance using primitive features such as color, texture, and shape.
Most users wish to retrieve images by semantic content (the objects/events depicted)
rather than by appearance. The resultant semantic gap between user expectations and
the current technology is the prime cause of the poor uptake of CBIR technology. Due
to the semantic gap (Smeulders et al., 2000), visualization becomes very important for the user
to navigate the complex query space. New visualization tools are required to allow for
user-dependent and goal-dependent choices about what to display and how to provide
feedback. The query result has an inherent display dimension that is often ignored. Most
methods display images in a 1D list in order of decreasing similarity to the query images.
Enhancing the visualization of the query results is, however, a valuable tool in helping
the user navigate query space. Recently, Horoike and Musha (2000), Nakazato and Huang
(2001), Santini and Jain (2000), Santini et al. (2001), and Rubner (1999) have also explored
content-based visualization. A common observation in these works is that the
images are displayed in 2D or 3D space from the projection of the high-dimensional
feature spaces. Images are placed in such a way that distances between images in 2D or
3D reflect their distances in the high-dimensional feature space. In the works of Horoike
and Musha (2000) and Nakazato and Huang (2001), the users can view large sets of images
in 2D or 3D space and user navigation is allowed. In the works of Nakazato and Huang
(2001) and Santini et al. (2000, 2001), the systems allow user interaction on image locations
and the formation of new groups. In the work of Santini et al. (2000, 2001), users can manipulate
the projected distances between images and learn from such a display.
Our work (e.g., Tian et al., 2001, 2002; Moghaddam et al., 2001, 2002, 2004) within the
context of PDH shares many common features with the related work (Horoike & Musha,
2000; Nakazato & Huang, 2001; Santini et al., 2000, 2001; Rubner, 1999). However, a
learning mechanism from the display is not implemented in Horoike and Musha (2000),
and 3D MARS (Nakazato & Huang, 2001) is an extension of our work (Tian et al., 2001;
Moghaddam et al., 2001) from 2D to 3D space. Our system differs from the work of Rubner
(1999) in that we adopted different mapping methods. Our work shares some features with
the work by Santini and Jain (2000) and Santini et al. (2001) except that our PDH system
is currently being incorporated into a much broader system for computer- and human-guided
navigation, browsing, archiving, and interactive storytelling with large photo libraries.
The part of this system described in the remainder of this chapter is, however, specifically
geared towards adaptive user modeling and relevance estimation and based primarily on
visual features as opposed to semantic annotation as in Santini and Jain (2000) and
Santini et al. (2001).
The rest of the chapter is organized as follows. In Content-Based Visualization, we
present designs for uncluttered visualization and layout of images (or iconic data in
general) in a 2D display space for content-based image retrieval (Tian et al., 2001;
Moghaddam et al., 2001). In Context and User Modeling, we further provide a mathematical framework for user modeling, which adapts to and mimics the user's (possibly changing)
preferences and style for interaction, visualization and navigation (Moghaddam et al.,
2002, 2004; Tian et al., 2002). Monte Carlo simulations in the Statistical Analysis section
plus the next section on User Preference Study have demonstrated the ability of our
framework to model or mimic users by automatically generating layouts according to
users' preferences. Finally, Discussion and Future Work are given in the final section.

CONTENT-BASED VISUALIZATION
With the advances in technology to capture, generate, transmit and store large
amounts of digital imagery and video, research in content-based image retrieval (CBIR)
has gained increasing attention. In CBIR, images are indexed by their visual contents
such as color, texture, and so forth. Many research efforts have addressed how to extract
these low-level features (Stricker & Orengo, 1995; Smith & Chang, 1994; Zhou et al., 1999),
evaluate distance metrics (Santini & Jain, 1999; Popescu & Gader, 1998) for similarity
measures and look for efficient searching schemes (Squire et al. 1999; Swets & Weng,
1999).
In this section, we present a user-centric algorithm for visualization and layout for
content-based image retrieval. Image features (visual and/or semantic) are used to
display retrievals as thumbnails in a 2D spatial layout or configuration which conveys
pair-wise mutual similarities. A graphical optimization technique is used to provide
maximally uncluttered and informative layouts. We should note that one physical
instantiation of the PDH table is that of a roundtable, for which we have in fact
experimented with polar coordinate conformal mappings for converting traditional
rectangular display screens. However, in the remainder of this chapter, for purposes of
ease of illustration and clarity, all layouts and visualizations are shown on rectangular
displays only.

Traditional Interfaces
The purpose of automatic content-based visualization is augmenting the user's
understanding of large information spaces that cannot be perceived by traditional
sequential display (e.g., by rank order of visual similarities). The standard and commercially prevalent image management and browsing tools currently available primarily use
tiled sequential displays, that is, essentially a simple 1D similarity-based visualization.
However, the user quite often can benefit by having a global view of a working
subset of retrieved images in a way that reflects the relations between all pairs of images,
that is, N² measurements as opposed to only N. Moreover, even a narrow view of one's
immediate surroundings defines context and can offer an indication of how to explore
the dataset. The wider this visible horizon, the more efficient the new query will be

formed. Rubner (1999) proposed a 2D display technique based on multidimensional
scaling (MDS) (Torgeson, 1998). A global 2D view of the images is achieved that reflects
the mutual similarities among the retrieved images. MDS is a nonlinear transformation
that minimizes the stress between high-dimensional feature space and low-dimensional
display space. However, MDS is rotation invariant, nonrepeatable (nonunique), and
often slow to implement. Most critically, MDS (as well as some of the other leading
nonlinear dimensionality reduction methods) provides high-to-low-dimensional projection operators that are not analytic or functional in form, but are rather defined on a point-by-point basis for each given dataset. This makes it very difficult to project a new dataset
in a functionally consistent way (without having to build a post-hoc projection or
interpolation function for the forward mapping each time). We feel that these drawbacks
make MDS (and other nonlinear methods) an unattractive option for real-time browsing
and visualization of high-dimensional data such as images.

Improved Layout and Visualization


We propose an alternative 2D display scheme based on Principal Component
Analysis (PCA) (Jolliffe, 1996). Moreover, a novel window display optimization technique is proposed which provides a more perceptually intuitive, visually uncluttered and
informative visualization of the retrieved images.
Traditional image retrieval systems display the returned images as a list, sorted by
decreasing similarity to the query. The traditional display has one major drawback. The
images are ranked by similarity to the query, and relevant images (as for example used
in a relevance feedback scenario) can appear at separate and distant locations in the list.
We propose an alternative technique to MDS (Torgeson, 1998) that displays mutual
similarities on a 2D screen based on visual features extracted from images. The retrieved
images are displayed not only in ranked order of similarity from the query but also
according to their mutual similarities, so that similar images are grouped together rather
than being scattered along the entire returned 1D list.

Visual Features
We will first describe the low-level visual feature extraction used in our system.
There are three visual features used in our system: color moments (Stricker & Orengo,
1995), wavelet-based texture (Smith & Chang, 1994), and water-filling edge-based
structure feature (Zhou et al., 1999).
The color space we use is HSV because of its decorrelated coordinates and its
perceptual uniformity (Stricker & Orengo, 1995). We extract the first three moments
(mean, standard deviation and skewness) from the three color channels and therefore
have a color feature vector of length 3 × 3 = 9.
For wavelet-based texture, the original image is fed into a wavelet filter bank and
is decomposed into 10 decorrelated subbands. Each subband captures the characteristics of a certain scale and orientation of the original image. For each subband, we extract
the standard deviation of the wavelet coefficients and therefore have a texture feature
vector of length 10.
For water-filling edge-based structure feature vector, we first pass the original
images through an edge detector to generate their corresponding edge map. We extract
eighteen (18) elements from the edge maps, including max fill time, max fork count, and
so forth. For a complete description of this edge feature vector, interested readers are
referred to Zhou et al. (1999).
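As a concrete illustration of the first of these feature groups, the sketch below computes the nine color moments and concatenates the three groups into the 37-dimensional vector used later in this chapter. It assumes the image has already been converted to HSV, treats the wavelet-texture and water-filling extractors as given, and uses one common skewness convention (cube root of the third central moment), which is an assumption rather than a statement of the exact implementation.

```python
import numpy as np

def color_moments_hsv(hsv):
    """First three moments (mean, standard deviation, skewness) of each HSV channel -> 9-D vector.
    `hsv` is an H x W x 3 array already converted to HSV; the skewness convention
    (cube root of the third central moment) is one common choice, assumed here."""
    pixels = hsv.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])            # length 3 x 3 = 9

def feature_vector(hsv, texture10, edge18):
    """Concatenate the three groups (9 color moments, 10 wavelet-texture values,
    18 water-filling edge values) into the 37-D vector used for the PCA Splats."""
    return np.concatenate([color_moments_hsv(hsv),
                           np.asarray(texture10),
                           np.asarray(edge18)])
```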

Dimension Reduction and PCA Splats


To create such a 2D layout, Principal Component Analysis (PCA) (Jolliffe, 1996) is
first performed on the retrieved images to project the images from the high-dimensional
feature space to the 2D screen. Image thumbnails are placed on the screen so that the
screen distances reflect as closely as possible the similarities between the images. If the
computed similarities from the high-dimensional feature space agree with our perception,
and if the resulting feature dimension reduction preserves these similarities reasonably
well, then the resulting spatial display should be informative and useful.
In our experiments, the 37 visual features (nine color moments, 10 wavelet moments
and 18 water-filling features) are preextracted from the image database and stored off-line.
Any 37-dimensional feature vector for an image, when taken in context with other images,
can be projected onto the 2D {x, y} screen based on the first two principal components
normalized by the respective eigenvalues. Such a layout is denoted as a PCA Splat. We
implemented both linear and nonlinear projection methods using PCA and Kruskal's
algorithm (Torgeson, 1998). Projection using a nonlinear method such as
Kruskal's algorithm is an iterative procedure that is slow to converge and converges to
local minima. Therefore the convergence largely depends on the initial starting point and
cannot be repeated. On the contrary, PCA has several advantages over nonlinear
methods like MDS. It is a fast, efficient and unique linear transformation that achieves
the maximum distance preservation from the original high-dimensional feature space to
2D space among all possible linear transformations (Jolliffe, 1996). The fact that it fails
to model nonlinear mappings (which MDS succeeds at) is in our opinion a minor
compromise given the advantages of real-time, repeatable and mathematically tractable
linear projections.
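A minimal sketch of such a PCA Splat projection is shown below. It assumes feature vectors are stored as rows and normalizes the two leading components by the square roots of their eigenvalues (a whitening-style normalization); this reading of "normalized by the respective eigenvalues" is an assumption.

```python
import numpy as np

def pca_splat(X):
    """Project N feature vectors (rows of the N x 37 matrix X) onto 2D screen coordinates,
    using the first two principal components of the working set, each normalized by
    (the square root of) its eigenvalue."""
    Xc = X - X.mean(axis=0)                        # center on the current working set
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1][:2]            # indices of the two largest eigenvalues
    top_vals, top_vecs = evals[order], evecs[:, order]
    coords = Xc @ top_vecs                         # N x 2 projection
    return coords / np.sqrt(np.maximum(top_vals, 1e-12))   # eigenvalue normalization

# screen_xy = pca_splat(features_of_retrieved_images)   # thumbnails placed at screen_xy
```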
We should add that nonlinear dimensionality reduction (NLDR) by itself is a very
large area of research and mostly beyond the scope of this chapter. We only comment
on MDS because of its previous use by Rubner (1999) for CBIR. Use of other iterative
NLDR techniques such as principal curves or bottleneck auto-associative feedforward
networks is usually prohibited by the need to perform real-time and repeatable projections. More recent advances such as IsoMap (Tennenbaum et al., 2000) and Local Linear
Embedding (LLE) (Roweis & Saul, 2000) are also not amenable to real-time or closed-form computation. The most recent techniques such as Laplacian Eigenmaps (Belkin &
Niyogi, 2003) and charting (Brand, 2003) have only just begun to be used and may promise
advances useful in this application domain, although we should hasten to add that the
formulation of subspace weights and their estimation (see section on Context and User
Modeling) is not as straightforward as with the case of linear dimension reduction (LDR)
methods like PCA.
Let us consider a scenario of a typical image-retrieval engine at work in which an
actual user is providing relevance feedback for the purposes of query refinement. Figure
3 shows an example of the retrieved images by the system (which resembles most
traditional browsers in its 1D tile-based layout). The database is a collection of 534
images. The first image (building) is the query. The other nine relevant images are ranked
in second, third, fourth, fifth, ninth, 10th, 17th, 19th and 20th places, respectively.
Figure 3. Top 20 retrieved images (ranked top to bottom and left to right; query is shown
first in the list)

Figure 4. PCA Splat of top 20 retrieved images in Figure 3

Figure 4 shows an example of a PCA Splat for the top 20 retrieved images shown in
Figure 3. In addition to visualization by layout, in this particular example, the sizes
(alternatively contrast) of the images are determined by their visual similarity to the
query. The higher the rank, the larger the size (or the higher the contrast). There is also
a number next to each image in Figure 4 indicating its corresponding rank in Figure 3. The
view of the query image, that is, the top left one in Figure 3, is blocked by the images ranked
19th, fourth, and 17th in Figure 4. A better view is achieved in Figure 7 after display
optimization.
Clearly the relevant images are now better clustered in this new layout as opposed
to being dispersed along the tiled 1D display in Figure 3. Additionally, PCA Splats
convey N² mutual distance measures relating all pair-wise similarities between images,
whereas the ranked 1D display in Figure 3 provides only N.

Display Optimization
However, one drawback of PCA Splat is that some images can be partially or totally
overlapped, which makes it difficult to view all the images at the same time. The overlap
will be even worse when the number of retrieved images becomes larger, for example,
larger than 50. To solve the overlapping problem between the retrieved images, a novel
optimization technique is proposed in this section.
Given a set of retrieved images and their corresponding sizes and positions, our
optimizer tries to find a solution that places the images at the appropriate positions while
deviating as little as possible from their initial PCA Splat positions. Assume the number
of images is N. The image positions are represented by their center coordinates (xi, yi),
i = 1, ..., N, and the initial image positions are denoted as (xoi, yoi), i = 1, ..., N. The minimum
and maximum coordinates of the 2D screen are [xmin, xmax, ymin, ymax]. The image size is
represented by its radius ri for simplicity, i = 1, ..., N, and the maximum and minimum image
sizes are rmax and rmin in radius, respectively. The initial image size is roi, i = 1, ..., N.
To minimize the overlap, the images can be automatically moved away from each
other to decrease the overlap between images, but this will increase the deviation of the
images from their initial positions. Large deviation is certainly undesirable because the
initial positions provide important information about mutual similarities between images.
So there is a trade-off problem between minimizing overlap and minimizing deviation.
Without increasing the overall deviation, an alternative way to minimize the overlap is
to simply shrink the image size as needed, down to a minimum size limit. The image size
will not be increased in the optimization process because this will always increase the
overlap. For this reason, the initial image size roi is assumed to be rmax.
The total cost function is designed as a linear combination of the individual cost
functions taking into account two factors. The first factor is to keep the overall overlap
between the images on the screen as small as possible. The second factor is to keep the
overall deviation from the initial position as small as possible.

$$J = F(p) + \alpha\, S\, G(p) \qquad (1)$$

where F(p) is the cost function of the overall overlap and G(p) is the cost function of the
overall deviation from the initial image positions. S is a scaling factor which brings the
range of G(p) to the same range as F(p); S is chosen to be (N−1)/2. α is a weight with
α ≥ 0. When α is zero, the deviation of the images is not considered in the overlap
minimization. When α is less than one, minimizing overall overlap is more important than
minimizing overall deviation, and vice versa when α is greater than one.

Figure 5. Cost function of overlap, f(p)

The cost function of overall overlap is designed as

$$F(p) = \sum_{i=1}^{N}\sum_{j=i+1}^{N} f(p) \qquad (2)$$

$$f(p) = \begin{cases} 1 - e^{-\sigma_f u^2}, & u > 0 \\ 0, & u \le 0 \end{cases} \qquad (3)$$

where $u = r_i + r_j - \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$ is a measure of overlap. When u ≤ 0, there
is no overlap between the ith image and the jth image, thus the cost is 0. When u > 0, there
is partial overlap between the ith image and the jth image. When u = 2 rmax, the ith image
and the jth image are totally overlapped. σ_f is a curvature-controlling factor.
Figure 5 shows the plot of f(p). With increasing u (u > 0), the cost of overlap also increases.
From Figure 5, σ_f in Equation (3) is calculated by setting T = 0.95 when u = rmax:

$$\sigma_f = -\frac{\ln(1-T)}{u^2}\bigg|_{u = r_{max}} \qquad (4)$$

The cost function of overall deviation is designed as

$$G(p) = \sum_{i=1}^{N} g(p) \qquad (5)$$

Figure 6. Cost function of deviation, g(p)

$$g(p) = 1 - e^{-\sigma_g v^2} \qquad (6)$$

where $v = \sqrt{(x_i - x_i^o)^2 + (y_i - y_i^o)^2}$ is the measure of deviation of the ith image from its
initial position and σ_g is a curvature-controlling factor. (xi, yi) and (xoi, yoi) are the optimized and
initial center coordinates of the ith image, respectively, i = 1, ..., N.
Figure 6 shows the plot of g(p). With increasing v, the cost of deviation also increases.
From Figure 6, σ_g in Equation (6) is calculated by setting T = 0.95 when v = maxsep.
In our work, maxsep is set to 2 rmax.

$$\sigma_g = -\frac{\ln(1-T)}{v^2}\bigg|_{v = maxsep} \qquad (7)$$

The optimization process is to minimize the total cost J by finding a (locally) optimal
set of image sizes and positions. The nonlinear optimization was implemented by
an iterative gradient descent method (with line search). Once converged, the images will
be redisplayed based on the new optimized sizes and positions.
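The following is a minimal sketch of this optimization step, using the cost functions of Equations (1)-(7) as reconstructed above. The gradient is approximated numerically rather than derived analytically, image sizes are held fixed for brevity, and the step size, iteration count and T = 0.95 are illustrative choices, not the authors' exact settings.

```python
import numpy as np

def layout_cost(xy, xy0, r, r_max, alpha, T=0.95):
    """Total cost J = F(p) + alpha * S * G(p) of Equations (1)-(7) for image centers `xy`,
    initial PCA-Splat positions `xy0`, and image radii `r`."""
    n = len(xy)
    sigma_f = -np.log(1 - T) / r_max ** 2               # Equation (4)
    sigma_g = -np.log(1 - T) / (2 * r_max) ** 2         # Equation (7), maxsep = 2 * r_max
    F = 0.0
    for i in range(n):                                   # pairwise overlap cost, Equations (2)-(3)
        for j in range(i + 1, n):
            u = r[i] + r[j] - np.linalg.norm(xy[i] - xy[j])
            if u > 0:
                F += 1 - np.exp(-sigma_f * u ** 2)
    v = np.linalg.norm(xy - xy0, axis=1)                 # deviation cost, Equations (5)-(6)
    G = np.sum(1 - np.exp(-sigma_g * v ** 2))
    S = (n - 1) / 2.0
    return F + alpha * S * G

def optimize_layout(xy0, r, r_max, alpha=1.0, steps=200, lr=0.5, eps=1e-3):
    """Crude gradient descent on the image positions (numerical gradient, sizes kept fixed)."""
    xy = xy0.astype(float).copy()
    for _ in range(steps):
        base = layout_cost(xy, xy0, r, r_max, alpha)
        grad = np.zeros_like(xy)
        for i in range(xy.shape[0]):
            for d in range(2):
                probe = xy.copy()
                probe[i, d] += eps
                grad[i, d] = (layout_cost(probe, xy0, r, r_max, alpha) - base) / eps
        xy -= lr * grad
    return xy
```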
Figure 7 shows the optimized PCA Splat for Figure 3. The image with a yellow frame
is the query image in Figure 3. Clearly, the overlap is minimized while the relevant images
are still close to each other to allow a global view. With such a display, the user can see
the relations between the images, better understand how the query performed, and
subsequently formulate future queries more naturally. Additionally, attributes such as
contrast and brightness can be used to convey rank. We note that this additional visual
aid is essentially a third dimension of information display. For example, images with
higher rank could be displayed with larger size or increased brightness to make them
stand out from the rest of the layout. An interesting example is to display time or

Figure 7. Optimized PCA Splat of Figure 3

timeliness by associating the size or brightness with how long ago the picture was
taken, thus images from the past would appear smaller or dimmer than those taken
recently. A full discussion of the resulting enhanced layouts is deferred to future work.
Also we should point out that despite our ability to clean up layouts for maximal
visibility with the optimizer we have designed, all subsequent figures in this chapter show
Splats without any overlap minimization, because, for illustrating (as well as comparing)
the accuracy of the estimation results in subsequent sections, the absolute position was
necessary and important.

CONTEXT AND USER MODELING


Image content and meaning is ultimately based on semantics. The user's notion
of content is a high-level concept, which is quite often removed by many layers of
abstraction from simple low-level visual features. Even near-exhaustive semantic (keyword) annotations can never fully capture context-dependent notions of content. The
same image can mean a number of different things depending on the particular
circumstance. The visualization and browsing operation should be aware of which
features (visual and/or semantic) are relevant to the user's current focus (or working set)
and which should be ignored. In the space of all possible features for an image, this
problem can be formulated as a subspace identification or feature weighting technique
that is described fully in this section.

Estimation of Feature Weights


By "user modeling" or "context awareness" we mean that our system must be
constantly aware of and adapting to the changing concepts and preferences of the user.
A typical example of this human-computer synergy is having the system learn from a user-generated layout in order to visualize new examples based on identified relevant/
irrelevant features. In other words, we design smart browsers that "mimic" the user, and over
time, adapt to their style or preference for browsing and query display. Given information
from the layout, for example, positions and mutual distances between images, a novel
feature weight estimation scheme, denoted α-estimation, is proposed, where α is a
weighting vector for features, for example, color, texture and structure (and semantic
keywords).
We now describe the subspace estimation of α for visual features only, for example,
color, texture, and structure, although it should be understood that the features could
include visual, audio and semantic features or any hybrid combination thereof.
In theory, the estimation of weights can be done for all the visual features if given
enough images in the layout. The mathematical formulation of this estimation problem
follows.
The weighting vector is α = {α_1, α_2, ..., α_L}, where L is the total length of the color, texture,
and structure feature vector, for example, L = 37 in this chapter. The number of images
in the preferred clustering is N, and X is an L × N matrix whose ith column is the feature
vector of the ith image, i = 1, ..., N. The distance (for example, Euclidean-based) between
the ith image and the jth image, for i, j = 1, ..., N, in the preferred clustering (distance in 2D
space) is d_ij. The weights α_1, α_2, ..., α_L are constrained such that they always sum to 1.
We then define an energy term to minimize with an L_p norm (with p = 2). This cost
function is defined in Equation (8). It is a nonnegative quantity that indicates how well
mutual distances are preserved in going from the original high-dimensional feature space
to 2D space. Note that this cost function is similar to MDS stress, but unlike MDS, the
minimization is seeking the optimal feature weights α. Moreover, the low-dimensional
projections in this case are already known. The optimal weighting parameter α recovered
is then used to weight the original feature vectors before applying a PCA Splat, which will
result in the desired layout.
$$J = \sum_{i=1}^{N}\sum_{j=1}^{N}\Big(d_{ij}^{\,p} - \sum_{k=1}^{L}\alpha_k^{\,p}\,|X_i(k) - X_j(k)|^p\Big)^{2} \qquad (8)$$

The global minimum of this cost function, corresponding to the optimal weight
parameter α, is easily obtained using a constrained (nonnegative) least-squares. To
minimize J, take the partial derivative of J with respect to α_l^p for l = 1, ..., L and set it to
zero:

$$\frac{\partial J}{\partial \alpha_l^{\,p}} = 0, \qquad l = 1, \ldots, L \qquad (9)$$


We thus have

$$\sum_{k=1}^{L}\alpha_k^{\,p}\sum_{i=1}^{N}\sum_{j=1}^{N}|X_i(l) - X_j(l)|^p\,|X_i(k) - X_j(k)|^p = \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}^{\,p}\,|X_i(l) - X_j(l)|^p, \quad l = 1, \ldots, L \qquad (10)$$

Define

$$R(l,k) = \sum_{i=1}^{N}\sum_{j=1}^{N}|X_i(l) - X_j(l)|^p\,|X_i(k) - X_j(k)|^p \qquad (11)$$

$$r(l) = \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}^{\,p}\,|X_i(l) - X_j(l)|^p \qquad (12)$$

and subsequently simplify Equation (10) to:

$$\sum_{k=1}^{L} \alpha_k^{\,p}\, R(l,k) = r(l), \qquad l = 1, \ldots, L \qquad (13)$$

Using the following matrix/vector definitions, with β denoting the vector of weight powers,

$$\beta = \begin{bmatrix} \alpha_1^{\,p} \\ \alpha_2^{\,p} \\ \vdots \\ \alpha_L^{\,p} \end{bmatrix}, \qquad
r = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(L) \end{bmatrix}, \qquad
R = \begin{bmatrix} R(1,1) & R(1,2) & \cdots & R(1,L) \\ R(2,1) & R(2,2) & \cdots & R(2,L) \\ \vdots & \vdots & & \vdots \\ R(L,1) & R(L,2) & \cdots & R(L,L) \end{bmatrix}$$

Equation (13) is simplified to

$$R\,\beta = r \qquad (14)$$

Subsequently β is obtained as a constrained (β ≥ 0) linear least-squares solution
of the above system. The weighting vector α is then simply determined as the p-th root
of β, where we typically use p = 2.
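A minimal sketch of this stress-based estimation is given below, grouped into a small number of feature subspaces (as done later in the chapter with color, texture and structure). SciPy's nonnegative least-squares routine stands in for the constrained solver, and the final renormalization of the recovered weights to sum to 1 is an assumption made to match the stated constraint.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_alpha(X, d2d, groups, p=2):
    """Stress-based alpha-estimation (Equations 8-14), with one weight per feature subspace.
    X:      L x N matrix whose i-th column is the feature vector of the i-th image.
    d2d:    N x N matrix of pairwise distances measured in the 2D layout.
    groups: list of index arrays partitioning the L feature dimensions
            (e.g., the color, texture and structure subspaces)."""
    n_groups, N = len(groups), X.shape[1]
    # D[g, i, j] = sum over dimensions k in group g of |X_i(k) - X_j(k)|^p
    D = np.zeros((n_groups, N, N))
    for g, idx in enumerate(groups):
        Xg = X[idx, :]
        D[g] = (np.abs(Xg[:, :, None] - Xg[:, None, :]) ** p).sum(axis=0)
    # Normal equations R * beta = r of Equations (11)-(14), with beta_g = alpha_g^p
    R = np.einsum('gij,hij->gh', D, D)
    r = np.einsum('ij,gij->g', d2d ** p, D)
    beta, _ = nnls(R, r)                      # constrained (beta >= 0) least squares
    alpha = beta ** (1.0 / p)                 # p-th root recovers the weights
    return alpha / alpha.sum() if alpha.sum() > 0 else alpha   # renormalization: an assumption

# e.g. groups = [np.arange(0, 9), np.arange(9, 19), np.arange(19, 37)]  # color, texture, structure
```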
We note that there is an alternative approach to estimating the subspace weighting
vector α in the sense of minimum deviation, which we have called deviation-based
α-estimation. The cost function in this case is defined as follows:

$$J = \sum_{i=1}^{N} \left| p^{(i)}(x,y) - \hat{p}^{(i)}(x,y) \right| \qquad (15)$$


where p^(i)(x,y) and p̂^(i)(x,y) are the original and projected 2D locations of the ith image,
respectively. This formulation is a more direct approach to estimation since it deals with
the final position of the images in the layout. Unfortunately, however, this approach
requires the simultaneous estimation of both the weight vector and the projection
basis, and consequently requires less-accurate iterative re-estimation techniques (as
opposed to the more robust closed-form solution possible with Equation (8)). A full
derivation of the solution for our deviation-based estimation is given in Appendix A.
Comparing the two estimation methods, stress-based and deviation-based: the
former is the more useful, robust, and identifiable (in the control-theory sense of the word).
The latter uses a somewhat unstable re-estimation framework and does not always give
satisfactory results. However, we still provide a detailed description for the sake of
completeness. The shortcomings of this latter method are immediately apparent from the
solution requirements. This discussion can be found in Appendix A.
For the reasons mentioned above, in all the experiments reported in this chapter, we
use only the stress-based method of Equation (8) for α estimation.
We note that in principle it is possible to use a single weight for each dimension of
the feature vector. However, this would lead to a poorly determined estimation problem,
since it is unlikely (and/or undesirable) to have that many sample images from which to
estimate all individual weights. Even with plenty of examples (an over-determined
system), chances are that the estimated weights would generalize poorly to a new set of
images; this is the same principle used in a modeling or regression problem, where the
order of the model or number of free parameters should be less than the number of
available observations.
Therefore, in order to avoid the problem of over-fitting and the subsequent poor
generalization on new data, it is ideal to use fewer weights. In this respect, the fewer
weights (or the more subspace groupings) there are, the better the generalization performance. Since the origin of all visual features, that is, the 37 features, is basically three
different (independent) visual attributes, color, texture and structure, it seems prudent
to use three weights corresponding to these three subspaces. Furthermore, this number
is sufficiently small to almost guarantee that we will always have enough images in one
layout from which to estimate these three weights. Therefore, in the remaining portion
of the chapter, we only estimate a weighting vector α = {α_c, α_t, α_s}^T, where α_c is the
weight for the color feature of length L_c, α_t is the weight for the texture feature of length L_t,
and α_s is the weight for the structure feature of length L_s. These weights α_c, α_t, α_s
are constrained such that they always sum to 1, and L = L_c + L_t + L_s.
Figure 8 shows a simple user layout where three car images are clustered together
despite their different colors. The same is performed with three flower images (despite
their texture/structure). These two clusters maintain a sizeable separation, thus suggesting two separate concept classes implicit in the user's placement. Specifically, in this
layout the user is clearly concerned with the distinction between "car" and "flower"
regardless of color or other possible visual attributes.
Applying the α-estimation algorithm to Figure 8, the feature weights learned from
this layout are α_c = 0.3729, α_t = 0.5269 and α_s = 0.1002. This shows that the most important
feature in this case is texture and not color, which is in accord with the concepts of "car"
versus "flower" as graphically indicated by the user in Figure 8.


Figure 8. An example of a user-guided layout

Figure 9. PCA Splat on a larger set of images using (a) estimated weights and (b) arbitrary weights

Now that we have the learned feature weights (or modeled the user), what can we
do with them? Figure 9 shows an example of a typical application: automatic layout of a
larger (more complete) set of images in the style indicated by the user. Figure
9(a) shows the PCA Splat using the learned feature weights for 18 cars and 19 flowers. It
is obvious that the PCA Splat using the estimated weights captures the essence of the
configuration layout in Figure 8. Figure 9(b) shows a PCA Splat of the same images but
with a randomly generated α, denoting an arbitrary but coherent 2D layout, which in this
case favors color (α_c = 0.7629). This comparison reveals that proper feature weighting
is an important factor in generating user-desired and sensible layouts. We should
point out that a random α does not generate a random layout, but rather one that is still
coherent, displaying consistent groupings or clustering. Here we have used such
random layouts as substitutes for alternative (arbitrary) layouts that are nevertheless
valid (differing only in the relative contribution of the three features to the final design
Figure 10. (a) An example layout; computer-generated layouts based on (b) reconstruction using learned feature weights and (c) the control (arbitrary weights)

of the layout). Given the difficulty of obtaining hundreds (let alone thousands) of real
user layouts that are needed for more complete statistical tests (such as those in the next
section), random layouts are the only conceivable way of simulating a layout by a
real user in accordance with familiar visual criteria such as color, texture or structure.
Figure 10(a) shows an example of another layout. Figure 10(b) shows the corresponding computer-generated layout of the same images with their high-dimensional
feature vectors weighted by the estimated α, which is recovered solely from the 2D
configuration of Figure 10(a). In this instance the reconstruction of the layout is near
perfect, thus demonstrating that our high-dimensional subspace feature weights can in
fact be recovered from pure 2D information. For comparison, Figure 10(c) shows the PCA
Splat of the same images with their high-dimensional feature vectors weighted by a
random α.
Figure 11 shows another example of a user-guided layout. Assume that the user is
describing her family story to a friend. In order not to disrupt the conversational flow,
she lays out only a few photos from her personal photo collection and expects the
computer to generate a similar and consistent layout for a larger set of images from the
same collection. Figure 11(b) shows the computer-generated layout based on the feature
weights learned from the configuration of Figure 11(a). The computer-generated layout
is achieved using the α-estimation scheme followed by a linear (for example, affine) or
nonlinear transformation. Only the 37 visual features (nine color moments (Stricker
& Orengo, 1995), 10 wavelet moments (Smith & Chang, 1994) and 18 water-filling features
(Zhou et al., 1999)) were used for this PCA Splat. Clearly the computer-generated layout
Figure 11. User modeling for automatic layout: (a) a user-guided layout; (b) computer layout for a larger set of photos (four classes and two photos from each class)

is similar to the user layout, with visually similar images positioned at the user-indicated locations. We should add that in this example no semantic features (keywords)
were used, but it is clear that their addition would only enhance such a layout.

STATISTICAL ANALYSIS
Given the lack of a sufficiently large pool of (willing) human subjects, we undertook a
Monte Carlo approach to testing our user-modeling and estimation method. Monte Carlo
simulation (Metropolis & Ulam, 1949) randomly generates values for uncertain variables
over and over to simulate a model. We thereby simulated 1,000 computer-generated layouts
(representing ground-truth values of α), which were meant to emulate 1,000 actual
user layouts or preferences. In each case, α-estimation was performed to recover the
original values as well as possible. Note that this recovery is only partially effective due
to the information loss in projecting down to a 2D space. As a control, 1,000 randomly
generated feature weights were used to see how well they could match the user layouts
(i.e., by chance alone).
Our primary test database consists of 142 images from the COREL database. It has
seven categories: car, bird, tiger, mountain, flower, church and airplane. Each class has about
20 images. Feature extraction based on color, texture and structure has been done offline and prestored. Although we will be reporting on this test data set due to its
common use and familiarity to the CBIR community, we should emphasize that we have
also successfully tested our methodology on larger and much more heterogeneous image
libraries (for example, real personal photo collections of 500+ images, including family,
friends, vacations, etc.). Depending on the particular domain, one can obtain different
degrees of performance, but one thing is certain: for narrow application domains (for
example, medical, logos, trademarks, etc.) it is quite easy to construct systems which work
extremely well, by taking advantage of the limiting constraints in the imagery.
The following is the Monte Carlo procedure that was used for testing the significance and validity of user modeling with α-estimation (a code sketch of one interpretation of the procedure follows the numbered steps):
Figure 12. Scatter plot of α estimation: estimated weights α̂ versus original weights α

1.	Randomly select M images from the database. Generate arbitrary (random) feature weights α in order to simulate a user layout.
2.	Do a PCA Splat using this ground-truth α.
3.	From the resulting 2D layout, estimate α and denote the estimated α as α̂.
4.	Select a new distinct (nonoverlapping) set of M images from the database.
5.	Do PCA Splats on the second set using the original α, the estimated α̂, and a third random α' (as control).
6.	Calculate the resulting stress in Equation (8) and the layout deviation (2D position error) in Equation (9) for the original, estimated and random (control) values of α, α̂ and α', respectively.
7.	Repeat 1,000 times.
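
Under the assumption that a PCA Splat, an α estimator and the deviation measure of Equation (9) are available as functions, a harness for this procedure might be rendered roughly as follows. The names, the mean-squared-error reading of the deviation, and the stand-in estimator (included only so the loop runs end to end) are ours; the actual α estimator is the one derived in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(1)

def splat(blocks, alpha):
    """Weighted PCA projection of concatenated feature blocks to a 2D layout."""
    X = np.hstack([a * B for a, B in zip(alpha, blocks)])
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                       # (M, 2) screen coordinates

def deviation(P, Q):
    """Layout deviation: mean squared 2D position error (our reading of Eq. 9)."""
    return np.mean(np.sum((P - Q) ** 2, axis=1))

def random_alpha():
    a = rng.random(3)
    return a / a.sum()                         # three block weights summing to 1

def trial(estimate_alpha, M=8, dims=(9, 10, 18)):
    """One Monte Carlo trial following steps 1-6 of the procedure above."""
    set_a = [rng.random((M, d)) for d in dims]     # first set of M images
    set_b = [rng.random((M, d)) for d in dims]     # distinct second set
    alpha = random_alpha()                         # ground-truth weights
    layout = splat(set_a, alpha)                   # simulated "user" layout
    alpha_hat = estimate_alpha(set_a, layout)      # recovered weights
    alpha_rnd = random_alpha()                     # random control weights
    ref = splat(set_b, alpha)                      # target layout on the new set
    return (deviation(splat(set_b, alpha_hat), ref),
            deviation(splat(set_b, alpha_rnd), ref))

# Stand-in estimator so the harness runs end to end; the real estimator is the
# two-step procedure of Appendix A (see the sketch following that appendix).
dummy_estimator = lambda blocks, layout: random_alpha()

scores = [trial(dummy_estimator) for _ in range(200)]
wins = sum(d_est < d_rnd for d_est, d_rnd in scores)
print(f"estimated weights beat random weights in {wins}/200 trials")
```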

The scatter plot of α estimation is shown in Figure 12. Clearly there is a direct linear relationship between the original weights α and the estimated weights α̂. Note that when the original weight is very small (<0.1) or very large (>0.9), the estimated weight is zero or one, correspondingly. This means that when one particular feature weight is very large (or very small), the corresponding feature becomes the most dominant (or least dominant) feature in the PCA; therefore the estimated weight for this feature will be either one or zero. This saturation phenomenon in Figure 12 occurs most prominently for structure (lower left of the rightmost panel), possibly because the structure feature vector is (relatively) high-dimensional. Additionally, structure features are not as well defined as color and texture (e.g., they have less discriminating power).
In terms of actual measures of stress and deviation, we found that the α-estimation scheme yielded the smaller deviation 78.4% of the time and the smaller stress 72.9% of the time. The main reason these values are less than 100% is the nature of the Monte Carlo testing and the fact that, when working in low-dimensional (2D) spaces, random weights can be close to the original weights and hence can often generate similar user layouts (in this case apparently about 25% of the time).

Figure 13. Scatter plot of deviation scores: (a) equal weights (y-axis) versus estimated weights (x-axis); (b) random weights (y-axis) versus estimated weights (x-axis)
We should add that an alternative control or null hypothesis to that of random weights is that of fixed equal weights, α = {1/3, 1/3, 1/3}^T. This weighting scheme corresponds to the assumption that there are to be no preferential biases in the subspace of the features, that they should all count equally in forming the final layout (or default PCA). But the fundamental premise behind this chapter is that there is a changing or variable bias in the relative importance of the different features, as manifested by different user layouts and styles. In fact, if there were no bias in the weights (i.e., they were set equal) then there would be no user modeling or adaptation necessary, since there would always be just one type or style of layout (the one resulting from equal weights). In order to understand this question fully, we compare the results of random weights versus equal weights (each compared to the α-estimation framework advocated).
In an identical set of experiments, replacing the random weights used for comparison layouts with equal weights α = {1/3, 1/3, 1/3}^T, we found a similar distribution of similarity scores. In particular, since the goal is obtaining accurate 2D layouts where positional accuracy is critical, we look at the resulting deviation in the case of both random weights and equal weights versus estimated weights. We carried out a large Monte Carlo experiment (10,000 trials), and Figure 13 shows the scatter plot of the deviation scores. Points above the diagonal (not shown) indicate a deviation performance worse than that of weight estimation. As can be seen here, the test results are roughly comparable for equal and random weights.

In Figure 13, we note that the α-estimation scheme yielded the smaller deviation 72.6% of the time compared to equal weights (as opposed to 78.4% compared with random weights). We therefore note that the results and conclusions of these experiments are consistent despite the choice of equal or random controls, and that ultimately direct α estimation from a user layout is best.


Figure 14. Comparison of the distribution of α-estimation versus nearest-neighbor deviation scores

Finally, we should note that all weighting schemes (random or not) define sensible or coherent layouts. The only difference is in the amount by which color, texture and structure are emphasized. Therefore even random weights generate nice or pleasing layouts; that is, random weights do not generate random layouts.
Another control, other than random (or equal) weights, is to compare the deviation of an α-estimation layout generator to a simple scheme which assigns each new image to the 2D location of its (unweighted, or equally weighted) 37-dimensional nearest neighbor (NN) from the set of images previously laid out by the user. This control scheme essentially operates on the principle that new images should be positioned on screen at the same location as their nearest neighbors in the original 37-dimensional feature space (the default similarity measure in the absence of any prior bias), and thus essentially ignores the operating subspace defined by the user in a 2D layout. The NN placement scheme places a test picture, regardless of its similarity score, directly on top of whichever image currently on the table it is closest to. To do otherwise, for example to place it slightly shifted away, would simply imply the existence of a nondefault smart projection function, which defeats the purpose of this control. The point of this particular experiment is to compare our smart scheme with one which has no knowledge of preferential subspace weightings, and to see how this would subsequently map to (relative) position on the display. The idea behind this is that a dynamic user-centric display should adapt to varying levels of emphasis on color, texture, and structure.
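
The nearest-neighbor control itself is only a few lines: each test image simply inherits the 2D position of its closest previously laid-out image in the unweighted 37-dimensional feature space. A minimal sketch, with array shapes and names of our own choosing:

```python
import numpy as np

def nn_placement(laid_out_feats, laid_out_xy, new_feats):
    """Place each new image at the 2D position of its nearest neighbor
    in the unweighted high-dimensional feature space (the control scheme)."""
    # Pairwise squared distances between new and already laid-out images
    d2 = ((new_feats[:, None, :] - laid_out_feats[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)          # index of the closest laid-out image
    return laid_out_xy[nearest]          # new image sits directly on top of it

rng = np.random.default_rng(2)
placed = nn_placement(rng.random((8, 37)), rng.random((8, 2)), rng.random((20, 37)))
print(placed.shape)  # (20, 2)
```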
The distributions of the outcomes of this Monte Carlo simulation are shown in Figure 14, where we see that the layout deviation using α estimation (red: μ = 0.9691, σ = 0.7776) was consistently lower, by almost an order of magnitude, than that of the nearest-neighbor layout approach (blue: μ = 7.5921, σ = 2.6410). We note that despite the noncoincident overlap of the distributions' tails in Figure 14, in every one of the 1,000 random trials the α-estimation deviation score was found to be smaller than that of the nearest neighbor (a key fact not visible in such a plot).

USER PREFERENCE STUDY


In addition to the computer-generated simulations, we have in fact conducted a preliminary user study that has also demonstrated the superior performance of α estimation over random feature weighting used as a control. The goal was to test whether the estimated feature weights would generate a better layout on a new but similar set of images than random weightings (used as control). The user interface is shown in Figure 15, where the top panel is a coherent layout generated by a random α on a reference image set. From this layout, an estimate of α was computed and used to redo the layout. A layout generated according to random weights was also generated and used as a control. These two layouts were then displayed in the bottom panels with randomized (A vs. B) labels (in order to remove any bias in the presentation). The user's task was to select which layout (A or B) was more similar to the reference layout in the top panel.

In our experiment, six naïve users were instructed in the basic operation of the interface and given the following instructions: (1) both absolute and relative positions of images matter, (2) in general, similar images, like cars, tigers, and so forth, should cluster, and (3) the relative positions of the clusters also matter. Each user performed 50 forced-choice tests with no time limits. Each test set of 50 contained redundant (randomly recurring) tests in order to test the user's consistency.

We specifically aimed at not priming the subjects with very detailed instructions (such as, "It's not valid to match a red car and a red flower because they are both red.").

Figure 15. α-estimation user test interface


Table 1. Results of user-preference study

            Preference for    Preference for     User's
            α estimates       random weights     consistency rate
User 1      90%               10%                100%
User 2      98%               2%                 90%
User 3      98%               2%                 90%
User 4      95%               5%                 100%
User 5      98%               2%                 100%
User 6      98%               2%                 100%
Average     96%               4%                 97%

In fact, the naïve test subjects were told nothing at all about the three feature types (color, texture, structure), the associated weights α, or, obviously, the estimation technique. In this regard, the paucity of the instructions was entirely intentional: whatever mental grouping seemed valid to them was the key. In fact, this very same flexible association by the user is what was specifically tested for in the consistency part of the study.

Table 1 shows the results of this user study. The average preference indicated for the α-estimation-based layout was found to be 96%, and the average user consistency rate was 97%. We note that the α-estimation method of generating a layout in a similar style to the reference was consistently favored by the users. A similar experimental study has shown this to also be true even if the test layouts consist of different images than those used in the reference layout (i.e., similar but not identical images from the same categories or classes).

DISCUSSIONS AND FUTURE WORK


We have designed our system with general CBIR in mind but more specifically for
personalized photo collections. An optimized content-based visualization technique is
proposed to generate a 2D display of the retrieved images for content-based image
retrieval. We believe that both the computational results and the pilot user study support
our claims of a more perceptually intuitive and informative visualization engine that not
only provides a better understanding of query retrievals but also aids in forming new
queries.
The proposed content-based visualization method can be easily applied to project
the images from high-dimensional feature space to a 3D space for more advanced
visualization and navigation. Features can be multimodal, expressing individual visual
features, for example, color alone, audio features and semantic features, for example,
keywords, or any combination of the above. The proposed layout optimization technique
is also quite general and can be applied to avoid overlapping of any type of images,
windows, frames or boxes.
The PDH project is at its initial stage. We have just begun our work in both the user
interface design and photo visualization and layout algorithms. The final visualization
and retrieval interface can be displayed on a computer screen, large panel projection
screens, or, for example, on embedded tabletop devices (Shen et al., 2001; Shen et al.,
2003) designed specifically for purposes of storytelling or multiperson collaborative
exploration of large image libraries.
Many interesting questions still remain for our future research in the area of content-based information visualization and retrieval. The next task is to carry out an extended user-modeling study by having our system learn the feature weights from various sample layouts provided by the user. We have already developed a framework to incorporate visual features with semantic labels for both retrieval and layout.
Another challenging area is automatic summarization and display of large image collections. Since summarization is implicitly defined by user preference, α estimation for user modeling will play a key role in this and other high-level tasks where context is defined by the user.
Finally, incorporation of relevance feedback for content-based image retrieval
based on the visualization of the optimized PCA Splat seems very intuitive and is
currently being explored. By manually grouping the relevant images together at each
relevance feedback step, a dynamic user-modeling technique will be proposed.

ACKNOWLEDGMENTS
This work was supported in part by Mitsubishi Electric Research Laboratories
(MERL), Cambridge, MA, and National Science Foundation Grant EIA 99-75019.

REFERENCES
Balabanovic, M., Chu, L., & Wolff, G. (2000). Storytelling with digital photographs.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
The Hague, The Netherlands (pp. 564-571).
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Brand, M. (2003). Charting a manifold. Mitsubishi Electric Research Laboratories
(MERL), TR2003-13.
Dietz, P., & Leigh, D. (2001). DiamondTouch: A multi-user touch technology. The
Proceedings of the 14th ACM Symposium on User Interface Software and Technology, Orlando, Florida (pp. 219-226).
Horoike, A., & Musha, Y. (2000). Similarity-based image retrieval system with 3D
visualization. Proceedings of IEEE International Conference on Multimedia and
Expo, New York, New York (Vol. 2, pp. 769-772).
Jolliffe, I. T. (1996). Principal component analysis. New York: Springer-Verlag.
Kang, H., & Shneiderman, B. (2000). Visualization methods for personal photo collections: Browsing and searching in the photofinder. Proceedings of IEEE International Conference on Multimedia and Expo, New York, New York.
Metropolis, N. & Ulam, S. (1949). The Monte Carlo method. Journal of the American
Statistical Association, 44(247), 335-341.
Moghaddam, B., Tian, Q., & Huang, T. S. (2001). Spatial visualization for content-based
image retrieval. Proceedings of IEEE International Conference on Multimedia
and Expo, Tokyo, Japan.
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T.S. (2002). PDH: A human-centric
interface for image libraries. Proceedings of IEEE International Conference on
Multimedia and Expo, Lausanne, Switzerland (Vol. 1, pp. 901-904).
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T.S. (2004). Visualization and user-modeling for browsing personal photo libraries. International Journal of Computer Vision, Special Issue on Content-Based Image Retrieval, 56(1-2), 109-130.
Nakazato, M., & Huang, T.S. (2001). 3D MARS: Immersive virtual reality for content-based image retrieval. Proceedings of IEEE International Conference on Multimedia and Expo, Tokyo, Japan.
Popescu, M., & Gader, P. (1998). Image content retrieval from image databases using
feature integration by choquet integral. Proceeding of SPIE Conference on
Storage and Retrieval for Image and Video Databases VII, San Jose, California.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500), 2323-2326.
Rubner, Y. (1999). Perceptual metrics for image database navigation. Doctoral dissertation, Stanford University.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 21(9), 871-883.
Santini, S., & Jain, R., (2000, July-December). Integrated browsing and querying for image
databases. IEEE Multimedia Magazine, 26-39.
Shen, C., Lesh, N., & Vernier, F. (2003). Personal digital historian: Story sharing around
the table. ACM Interactions, March/April (also MERL TR2003-04).
Shen, C., Lesh, N., Moghaddam, B., Beardsley, P., & Bardsley, R. (2001). Personal digital
historian: User interface design. Proceedings of Extended Abstract of SIGCHI
Conference on Human Factors in Computing Systems, Seattle, Washington (pp.
29-30).
Shen, C., Lesh, N., Vernier, F., Forlines, C., & Frost, J. (2002). Sharing and building
digital group histories. ACM Conference on Computer Supported Cooperative
Work.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based
image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(12), 1349-1380.
Smith, J. R., & Chang, S. F. (1994). Transform features for texture classification and
discrimination in large image database. Proceedings of IEEE International Conference on Image Processing, Austin, TX.
Squire, D. M., Müller, H., & Müller, W. (1999). Improving response time by search pruning in a content-based image retrieval system using inverted file techniques. Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries
(CBAIVL), Fort Collins, CO.
Stricker, M., & Orengo, M. (1995). Similarity of color images. Proceedings of SPIE
Storage and Retrieval for Image and Video Databases, San Diego, CA.
Swets, D., & Weng, J. (1999). Hierarchical discriminant analysis for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 21(5), 396-401.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290, 2319-2323.
Tian, Q., Moghaddam, B., & Huang, T. S. (2001). Display optimization for image browsing.
The Second International Workshop on Multimedia Databases and Image Communications, Amalfi, Italy (pp. 167-173).
Tian, Q., Moghaddam, B., & Huang, T. S. (2002). Visualization, estimation and usermodeling for interactive browsing of image libraries. International Conference on
Image and Video Retrieval, London (pp. 7-16).
Torgerson, W. S. (1998). Theory and methods of scaling. New York: John Wiley & Sons.
Vernier, F., Lesh, N., & Shen, C. (2002). Visualization techniques for circular tabletop
interface. Proceedings of Advanced Visual Interfaces (AVI), Trento, Italy (pp. 257-266).
Zhou, S. X., Rui, Y., & Huang, T. S. (1999). Water-filling algorithm: A novel way for image
feature extraction based on edge maps. Proceedings of IEEE International
Conference on Image Processing, Kobe, Japan.
Zwillinger, D. (Ed.) (1995). Affine transformations, 4.3.2 in CRC standard mathematical tables and formulae (pp. 265-266). Boca Raton, FL: CRC Press.

APPENDIX A
Let $P_i = p^{(i)}(x,y) = \begin{bmatrix} x_i \\ y_i \end{bmatrix}$ and $\hat{P}_i = \hat{p}^{(i)}(x,y) = \begin{bmatrix} \hat{x}_i \\ \hat{y}_i \end{bmatrix}$; Equation (15) is rewritten as

$$J = \sum_{i=1}^{N} \| P_i - \hat{P}_i \|^2 \qquad (A.1)$$

Let $X_i$ be the column feature vector of the $i$th image, where $X_i = \begin{bmatrix} X_c^{(i)} \\ X_t^{(i)} \\ X_s^{(i)} \end{bmatrix}$, $i = 1, \ldots, N$. Here $X_c^{(i)}$, $X_t^{(i)}$ and $X_s^{(i)}$ are the corresponding color, texture and structure feature vectors of the $i$th image, and their lengths are $L_c$, $L_t$ and $L_s$, respectively. Let $X'_i = \begin{bmatrix} \alpha_c X_c^{(i)} \\ \alpha_t X_t^{(i)} \\ \alpha_s X_s^{(i)} \end{bmatrix}$ be the weighted high-dimensional feature vector. The weights $\alpha_c$, $\alpha_t$ and $\alpha_s$ are constrained such that they always sum to 1.

$\hat{P}_i$ is estimated by linearly projecting the weighted high-dimensional features to 2D. Let $X' = [X'_1, X'_2, \ldots, X'_N]$; it is an $L \times N$ matrix, where $L = L_c + L_t + L_s$. $\hat{P}_i$ is estimated by

$$\hat{P}_i = U^T (X'_i - X'_m), \quad i = 1, \ldots, N \qquad (A.2)$$

where $U$ is an $L \times 2$ projection matrix and $X'_m$ is the $L \times 1$ mean column vector of the $X'_i$, $i = 1, \ldots, N$.

Substituting $\hat{P}_i$ from Equation (A.2) into Equation (A.1), the problem is therefore one of seeking the optimal feature weights $\alpha$, projection matrix $U$ and column vector $X'_m$ such that $J$ in Equation (A.3) is minimized, given $X_i$, $P_i$, $i = 1, \ldots, N$:

$$J = \sum_{i=1}^{N} \| U^T (X'_i - X'_m) - P_i \|^2 \qquad (A.3)$$

In practice, it is almost impossible to estimate the optimal $\alpha$, $U$ and $X'_m$ simultaneously based on the limited available data $X_i$, $P_i$, $i = 1, \ldots, N$. We thus make some modifications. Instead of estimating $\alpha$, $U$ and $X'_m$ simultaneously, we modified the estimation process into a two-step re-estimation procedure: we first estimate the projection matrix $U$ and column vector $X_m$, then estimate the feature weight vector $\alpha$ based on the computed $U$ and $X_m$, and iterate until convergence.

Let $U^{(0)}$ be the matrix of eigenvectors corresponding to the largest two eigenvalues of the covariance matrix of $X$, where $X = [X_1, X_2, \ldots, X_N]$, and let $X_m^{(0)}$ be the mean vector of $X$. We have

$$P_i^{(0)} = (U^{(0)})^T (X_i - X_m^{(0)}) \qquad (A.4)$$

$P_i^{(0)}$ is the projected 2D coordinate of the unweighted high-dimensional feature vector of the $i$th image. Ideally its target location is $P_i$. To allow for an alignment correction, a rigid transform (Zwillinger, 1995) is applied:

$$\hat{P}_i^{(0)} = A P_i^{(0)} + T \qquad (A.5)$$

where $A$ is a $2 \times 2$ matrix and $T$ is a $2 \times 1$ vector. $A$ and $T$ are obtained by minimizing the $L_2$-norm of $P_i - \hat{P}_i^{(0)}$.

Therefore $J$ in Equation (A.3) is modified to

$$J = \sum_{i=1}^{N} \| A (U^{(0)})^T (X_i - X_m^{(0)}) - (P_i - T) \|^2 \qquad (A.6)$$

Letting $U = U^{(0)} A^T$, $X_m = X_m^{(0)}$ and $\tilde{P}_i = P_i - T$, we still have the form of Equation (A.3). Let us rewrite

$$U^T = \begin{bmatrix} U_{11} & \cdots & U_{1(L_c+L_t+L_s)} \\ U_{21} & \cdots & U_{2(L_c+L_t+L_s)} \end{bmatrix}$$

After some simplification of Equation (A.3), we have

$$J = \sum_{i=1}^{N} \| \alpha_c A_i + \alpha_t B_i + \alpha_s C_i - D_i \|^2 \qquad (A.7)$$

where

$$A_i = \begin{bmatrix} \sum_{k=1}^{L_c} U_{1k} X_c^{(k)}(i) \\ \sum_{k=1}^{L_c} U_{2k} X_c^{(k)}(i) \end{bmatrix}, \quad
B_i = \begin{bmatrix} \sum_{k=1}^{L_t} U_{1(k+L_c)} X_t^{(k)}(i) \\ \sum_{k=1}^{L_t} U_{2(k+L_c)} X_t^{(k)}(i) \end{bmatrix}, \quad
C_i = \begin{bmatrix} \sum_{k=1}^{L_s} U_{1(k+L_c+L_t)} X_s^{(k)}(i) \\ \sum_{k=1}^{L_s} U_{2(k+L_c+L_t)} X_s^{(k)}(i) \end{bmatrix}$$

and $D_i = U^T X_m + \tilde{P}_i$. Here $A_i$, $B_i$, $C_i$ and $D_i$ are $2 \times 1$ vectors.

To minimize $J$, we take the partial derivatives of $J$ with respect to $\alpha_c$, $\alpha_t$ and $\alpha_s$ and set them to zero:

$$\frac{\partial J}{\partial \alpha_c} = 0, \quad \frac{\partial J}{\partial \alpha_t} = 0, \quad \frac{\partial J}{\partial \alpha_s} = 0 \qquad (A.8)$$

We thus have

$$E \alpha = f \qquad (A.9)$$

where

$$E = \begin{bmatrix}
\sum_{i=1}^{N} A_i^T A_i & \sum_{i=1}^{N} A_i^T B_i & \sum_{i=1}^{N} A_i^T C_i \\
\sum_{i=1}^{N} B_i^T A_i & \sum_{i=1}^{N} B_i^T B_i & \sum_{i=1}^{N} B_i^T C_i \\
\sum_{i=1}^{N} C_i^T A_i & \sum_{i=1}^{N} C_i^T B_i & \sum_{i=1}^{N} C_i^T C_i
\end{bmatrix}, \qquad
f = \begin{bmatrix} \sum_{i=1}^{N} A_i^T D_i \\ \sum_{i=1}^{N} B_i^T D_i \\ \sum_{i=1}^{N} C_i^T D_i \end{bmatrix}$$

and $\alpha = [\alpha_c, \alpha_t, \alpha_s]^T$ is obtained by solving the linear Equation (A.9).
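
As a computational companion to this derivation, the sketch below performs one pass of the procedure: an initial PCA projection (A.4), a least-squares affine alignment (A.5), assembly of the per-image vectors of (A.7), and the solution of the 3x3 system of (A.9). It is a minimal single-pass illustration under our own variable names; the iterate-until-convergence loop is omitted, and the final clipping and renormalization of α is our own addition.

```python
import numpy as np

def estimate_alpha(Xc, Xt, Xs, P):
    """Recover feature weights (alpha_c, alpha_t, alpha_s) from a 2D layout P.

    Xc, Xt, Xs : (N, Lc), (N, Lt), (N, Ls) unweighted feature blocks
    P          : (N, 2) user-given 2D positions
    """
    X = np.hstack([Xc, Xt, Xs])                   # (N, L), L = Lc + Lt + Ls
    Xm = X.mean(axis=0)
    Xd = X - Xm
    _, _, Vt = np.linalg.svd(Xd, full_matrices=False)
    U0 = Vt[:2].T                                 # (L, 2): top two eigenvectors
    P0 = Xd @ U0                                  # unweighted 2D projection (A.4)

    # Alignment P_hat0 = A P0 + T fitted by least squares (A.5)
    G = np.hstack([P0, np.ones((len(P), 1))])     # [P0 | 1]
    W, *_ = np.linalg.lstsq(G, P, rcond=None)     # (3, 2): A^T stacked on T
    A, T = W[:2].T, W[2]
    U = U0 @ A.T                                  # reparameterisation after (A.6)
    P_tilde = P - T

    # Per-image 2-vectors A_i, B_i, C_i and D_i of (A.7)
    Lc, Lt = Xc.shape[1], Xt.shape[1]
    Ai = Xc @ U[:Lc]
    Bi = Xt @ U[Lc:Lc + Lt]
    Ci = Xs @ U[Lc + Lt:]
    Di = (U.T @ Xm) + P_tilde

    # Assemble E and f of (A.9) and solve the 3x3 system
    blocks = [Ai, Bi, Ci]
    E = np.array([[np.sum(u * v) for v in blocks] for u in blocks])
    f = np.array([np.sum(u * Di) for u in blocks])
    alpha = np.clip(np.linalg.solve(E, f), 0.0, None)   # our saturation handling
    s = alpha.sum()
    return alpha / s if s > 0 else np.full(3, 1.0 / 3)

# Round trip: weights used to build a layout should be (partially) recoverable
rng = np.random.default_rng(3)
Xc, Xt, Xs = rng.random((30, 9)), rng.random((30, 10)), rng.random((30, 18))
true = np.array([0.6, 0.3, 0.1])
Xw = np.hstack([true[0] * Xc, true[1] * Xt, true[2] * Xs])
Xw -= Xw.mean(axis=0)
_, _, Vt = np.linalg.svd(Xw, full_matrices=False)
P = Xw @ Vt[:2].T
print(estimate_alpha(Xc, Xt, Xs, P))   # estimated (alpha_c, alpha_t, alpha_s)
```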


Chapter 10

Multimedia Authoring:

Human-Computer Partnership
for Harvesting Metadata
from the Right Sources
Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia

ABSTRACT

This chapter takes a look at the task of creating multimedia authoring tools for the
amateur media creator, and the problems unique to the undertaking. It argues that a
deep understanding of both the media creation process, together with insight into the
precise nature of the relative strengths of computers and users, given the domain of
application, is needed before this gap can be bridged by software technology. These
issues are further demonstrated within the context of a novel media collection
environment, including a real-world example of an occasion filmed in order to
automatically create two movies of distinctly different styles. The authors hope that
such tools will enable amateur videographers to produce technically polished and
aesthetically effective media, regardless of their level of expertise.

INTRODUCTION
Accessibility to the means of authoring multimedia artifacts has expanded to
envelop the majority of desktops and homes of the industrialized world in the last decade.
Forrester Research predicts that by 2005, 92% of online consumers will create personal
multimedia content at least once a month (Casares et al., 2002). To take the crude analogy
of the written word, we all now have the paper and pencils at hand, the means to author
our masterpieces or simply communicate. Or rather, we would do if only we knew how to
write. Simply adding erasers, coloured pencils, sharpeners, scissors, and other such
items to the writing desk doesn't help us write War and Peace, or a Goosebumps novel,
nor even a friendly epistle. Similarly, software tools that allow us to cut, copy, and paste
video do not address the overarching difficulty of forming multimedia artifacts that
effectively (and affectively) achieve the desired communication or aesthetic integrity.
There is a large and diverse research community, utilizing techniques from a far-flung variety of fields including human-computer interaction, signal processing, linguistic analysis, computer graphics, video and image databases, information sciences and knowledge representation, computational media aesthetics, and so forth, which has grown up around this problem, offering solutions and identifying problems that are similarly varied. Issues and questions pertinent to the problem include, but are not limited to:

•	What are we trying to help the user do?
•	What is the user's role in the multimedia authoring process?
•	Who should carry the burden for which parts of the media creation process?
•	The nature and choice of metadata.
•	The purpose and power of that metadata: what does it enable?
•	Computer-user roles in determining what to capture or generate.

The objective of this chapter is to emphasize the importance of clearly defining the domain and nature of the creative/authoring activities of the user that we are seeking to support with technology: the flow-on effects of this decision are vitally important to the whole authoring endeavour and impact all consequent stages of the process. Simply put, definition of the domain of application for our technology (the user, audience, means, mood, and so forth of the authoring situation) enables us to more precisely define the lack our technology is seeking to supply, and in turn the nature and extent of metadata or semantic information necessary to achieve the result, as well as the best way of going about getting it. We will use our existing media creation framework, aimed at amateur videographers, to help demonstrate the principles and possible implementations.
The structure of the remainder of the chapter is as follows: We first explore related
work with reference to the traditional three-part media creation process. This alerts us
to the relative density and location of research efforts, notes the importance of a holistic
approach to the media creation process, and helps define the questions we need to
answer when building our own authoring systems. We next examine those questions
arising in more detail. In particular, we note the importance of defining the domain of the technology, which has flow-on implications regarding the nature of the gap that our
technology is trying to close, and the best way of doing so. Finally, we present an example
system in order to offer further insight into the issues discussed.

BACKGROUND
Computer technology, like any other technology, is applied to a problem in order
to make the solution easier or possible. Questions that we should ask ourselves include:
What is the potential user trying to do? What is the lack that we need to supply? What
materials or information do we need to achieve this?
Let us consider solutions and ongoing research in strictly video-related endeavours,
a subset of all multimedia authoring-related research, for the purpose of introducing some
of the terms and issues pertinent to the more detailed discussion that follows. We will,
at times, stray outside this to a consideration of related domains (e.g., authoring of
computer-generated video) for the purpose of illuminating ideas and implications with
the potential for beneficial cross-pollination. We will not be considering research
aimed at abstracting media as opposed to authoring it (e.g., Informedia Video Skims
(Wactlar et al., 1999) or the MoCA group's Video Abstracting (Pfeiffer & Effelsberg,
1997)), although it could be considered authoring in one sense.
Traditionally the process of creating a finished video presentation contains something like the following three phases: preproduction, production and postproduction. For real video (as opposed to entirely computer-generated media), we could be more specific and label the phases scripting/storyboarding (both aural and visual), capture, and editing; that is, decide what footage to capture and how, capture the footage, and finally compose by selection, ordering and fitting together of footage, with all of the major and minor crosstalk and feedback present in any imperfect process.
In abstract terms the three phases partition the process into Planning, Execution and
Polishing, and from this we can see that this partitioning is not the exclusive domain of
multimedia authoring, but is indeed appropriate, even necessary, for any creative or
communicative endeavour. Of course, the relative size of each stage, and the amount and
nature of feedback or revisitation for each stage will be different depending on the
particular kind of multimedia being authored and the environment in which it is being
created, but they nevertheless remain a useful partitioning of this process1. Figure 1 is
a depiction of the media creation workflow with the three phases noted.
One way of classifying work in the area of multimedia authoring technology is to
consider which stage(s) of this authoring process the work particularly targets or
emphasizes. Irrespective of the precise domain of application, where do they perceive the
most problematic lack to be?

Figure 1. Three authoring phases in relation to the professional film creation workflow
[Figure 1 depicts the professional film creation workflow annotated with the three authoring phases: Planning (author and screenwriter, governed by creativity and style, language rules, adaptation constraints, screenplay rules and author/screenwriter intent), Execution (director/cameraman, governed by creativity and style, genre conventions, film grammar and cinematography, and director intent) and Polishing (editor, governed by creativity and style, film grammar and montage), with the viewer's perception shaped by other movies; the artifacts flow from novel to raw footage to movie to the viewer's experience.]


Edit
We will start from the end, the editing phase, and work back toward the beginning.
Historically, this seems to have received the most attention in the development of
software solutions.
Technologies to aid the low-level operations of movie editing are now so firmly
established as to be the staples of commercial software. Software including iMovie,
Pinnacle Studio, Power Director, Video Wave and a host of others, provide the ability to
cut and paste or drag and drop footage, and align video and audio tracks, usually by
means of some flavour of timeline representation for the video under construction.
Additionally, the user is able to create complex transitions between shots and add titles
and credits. The claims about iMovie even run to it having single-handedly made cinematographers out of parents, grandparents, students (iMovie, 2003). Although the materials have changed somewhat (scissors and film to mouse and bytes), conceptually the film is still treated as an opaque entity, the operations blind to any aesthetic or content attributes intrinsic to the material under the scalpel. All essentially provide
help for doing a mechanical process. They are the hammer and drill of the garage. As with
the professional version, what makes for the quality of the final product is the skill of the
hand guiding the scissors.
The next rung up the ladder of solution power sees a more complex approach to
adding smarts to the editing phase. It is here that we still find a lot of interest in the
research community.
Girgensohn et al. (2000) describe their semiautomatic video editor, Hitchcock, which uses an automatic suitability metric based upon how erratic the camera work is deemed to be for a given piece of footage. They also employ a spring model, based upon the unsuitability metric, in order to provide a more global mechanism for manipulating the total output
video length. There are a number of similar applications, both commercial and research,
that follow this approach of bringing such smarts to the editing phase, including SILVER
(Casares et al., 2002), muvee autoProducer and ACD Video Magic.
Muvee autoProducer is particularly interesting in that it allows the user to specify a style, including such options as Chaplinesque, Fifties TV and Cinema. These presumably translate to allowable clip lengths, image filters and transitions, an example of broad genre conventions influencing automatic authoring, which are themselves the end product of cinesthetic considerations. It provides an extreme example of low user burden: all that is required of the user is that they provide a video file, a song or music selection, and select a style.
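
One way to picture how such a named style might translate into concrete editing constraints is a small lookup table; the field names and values below are purely illustrative and are not muvee's actual internals.

```python
from dataclasses import dataclass

@dataclass
class EditStyle:
    """Illustrative bundle of broad genre conventions for automatic editing."""
    name: str
    clip_length_s: tuple      # allowable (min, max) clip duration in seconds
    image_filter: str         # e.g., monochrome, desaturated, none
    transitions: tuple        # permitted shot transitions

# Hypothetical presets in the spirit of "Chaplinesque", "Fifties TV", "Cinema"
STYLES = {
    "chaplinesque": EditStyle("chaplinesque", (0.5, 2.0), "monochrome", ("cut",)),
    "fifties_tv":   EditStyle("fifties_tv", (2.0, 6.0), "desaturated", ("cut", "fade")),
    "cinema":       EditStyle("cinema", (3.0, 10.0), "none", ("cut", "dissolve")),
}

def allowed(style_name, clip_duration):
    """Would a clip of this duration be admissible under the chosen style?"""
    lo, hi = STYLES[style_name].clip_length_s
    return lo <= clip_duration <= hi

print(allowed("chaplinesque", 1.2), allowed("cinema", 1.2))  # True False
```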
Roughly, these approaches all infer something about the suitability or desirability (or otherwise) of a given piece of film based on a mapping between that property and a low-level feature of the video signal. This knowledge is then used to relieve the user of the burden of actually having to carry out the mechanical operations of footage cut and paste
described above. The foundations of these approaches are as strong or weak as the link
between low-level feature and inferred cinematic property. They offer the equivalent
of a spellchecker for our text. These approaches attempt to automatically generate simple
metadata about video footage, a term which is becoming increasingly prominent.
Up to this point we can observe that none of these approaches either demand or
make use of information about footage related to its meaning, its semantics. They might
be able to gauge that a number of frames are the result of erratic camera work, but they are not able to tell us that the object being filmed so poorly is the user's daughter, who
also happened to be the subject of the preceding three shots.
Lindley et al. (2001) present work on interactive and adaptive generation of news
video presentations, using Rhetorical Structure Theory (RST) as an aid to construction.
Video segments, if labeled with a rhetorical functional role, such as elaboration or
motivation, may be combined automatically into a hierarchical structure of RST relations,
and thence into one or more coherent linear or interactive video presentations by
traversing that structure. Here the semantic information relates to rhetorical role and the
emphasis is on the coherency of the presentation.
Lindley et al. (2001, p. 8) note that, "Content representation scheme[s] that can indicate subject matter and bibliographical material such as the sources and originating dates of the video contents of the database" is necessary supplemental metadata for their
chosen genre of news presentation generation. Additionally, they suggest that narrative or associative/categorical techniques may help provide an algorithmic basis for
sequencing material so as to avoid problems such as continuity within subtopics. This
introduces the potential need for narrative theory or, more broadly, some sort of
discoursive theory in addition to the necessary intrinsic (denotative and connotative)
content semantic information.
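
A toy rendering of the idea: segments carry a rhetorical role, are grouped into nucleus-satellite relations, and a linear presentation falls out of a traversal of the resulting structure. The data structures and the nucleus-first ordering are our own simplification of RST, not Lindley et al.'s system.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Segment:
    clip_id: str
    role: str                 # e.g., "elaboration", "motivation", "evidence"

@dataclass
class Relation:
    name: str                 # RST relation name
    nucleus: Union["Relation", Segment]
    satellites: List[Union["Relation", Segment]] = field(default_factory=list)

def linearise(node):
    """Traverse the RST tree nucleus-first to obtain one coherent ordering."""
    if isinstance(node, Segment):
        return [node.clip_id]
    order = linearise(node.nucleus)
    for sat in node.satellites:
        order += linearise(sat)
    return order

story = Relation("elaboration",
                 nucleus=Segment("anchor_intro", "statement"),
                 satellites=[Segment("field_report", "elaboration"),
                             Relation("motivation",
                                      nucleus=Segment("expert_interview", "evidence"),
                                      satellites=[Segment("archive_clip", "background")])])
print(linearise(story))
# ['anchor_intro', 'field_report', 'expert_interview', 'archive_clip']
```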
Of interest is the work of Nack and Parkes (1995) and Nack (1996), who propose a
film editing model, and attempt to automatically compile humorous scenes from existing
footage. Their system, AUTEUR, aims to achieve a video sequence that realises an
overall thematic specification. They define the two prime problems of the video editing
process: Composing the film such that it is perceptible in its entirety, and in a manner
that engages the viewer emotionally and intellectually. The presentation should be
understandable and enthralling, or at least kind of interesting.
In addition to its own logic of story generation, the editor draws upon a knowledge
base which is labeled as containing World, Common sense, Codes, Filmic representations, Individual. Such knowledge is obviously hard to come by, but illustrates the type
of metadata that needs to be brought to bear upon the undertaking.
An interesting tangent to this work is the search for systems that resolve the age-old dilemma of the interactive narrative. For example, see Skov and Andersen (2001), Lang
(1999), and Sack and Davis (1994).
It is apparent that these last few approaches, which we casually lump together, revolve around (a) some theory of presentation, be it narrative, rhetorical or whatever is appropriate for the particular domain of application, coupled with (b) some knowledge about the available or desirable raw material from which inferences relating to its function in terms of that theory may be made. It would also be fair to say that the tacit promise is: the more you tell me about the footage, the more I can do for you.

Capture
Capture refers to the process of actually creating the raw media. For a home movie,
it means the time when the camera is being toted, gathering all of those waves and
embarrassed smiles. Unlike the process of recording incident sound, where the creative
input into the capture perhaps runs to adjusting the overall volume or frequency filter,
there is a whole host of parameters that come into play, which the average home
videographer is unaware of (due in part to the skill of the professional filmmaker, no
doubt; Walter Murch, a professional film editor, states that the best cut is an invisible
one). Any given shot capture attempt, where shot is defined to be the contiguous video
captured between when the start and stop buttons are pressed for a recording, or its
analog in simulated camera footage, be it computer generated or traditionally animated,
veritably bristles with parameters; light, composition, camera and object motion, camera
mounting, z-axis blocking, duration, focal length, and so forth, all impact greatly on the
final captured footage, its content and aesthetic potential.
So, given this difficulty for the amateur (or even professional) camera operator, what
can we do? To belabour our writing analogy, what do we do when we recognize that the
one holding the pen has little idea about what is required of them? We give them a form.
Or we provide friendly staff to answer their queries or offer suggestions.
There is research which focuses on the difficulties of this stage of the multimedia
authoring process. Bobick and Pinhanez (1995) describe work focusing on smart
cameras, able to follow simple framing requests originating with the director of a
production within a constrained environment (their example is a cooking show). This
handles the translation of cinematic directives to physical camera parameters, no mean
contribution on its own, but is reliant on the presence of a knowledgeable director.
There also exists a large literature that addresses the problem of capturing shots in a virtual environment, all the more applicable in these days of purely computer-generated media. The system described by He et al. (1996) takes a description of the events taking place in the virtual world and uses simple film idioms, where an idiom might be an encoded rule as to how to capture a two-way conversation, to produce camera placement specifications in order to capture the action in a cinematically pleasing manner. Or see Tomlinson et al. (2000), who attempt automated cinematography via a camera creature that maps viewed agent emotions to the cinematic techniques that best express
those emotions.
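
The flavour of idiom-based capture planning can be conveyed with a small rule table that expands an event description into camera directives; the idioms, shot names and parameters below are invented for illustration and are not the encodings used by He et al. (1996).

```python
# Hypothetical idiom table: an event type maps to an ordered shot recipe.
IDIOMS = {
    "two_way_conversation": [
        {"shot": "establishing", "framing": "wide", "subject": "both"},
        {"shot": "over_shoulder", "framing": "medium", "subject": "speaker"},
        {"shot": "reverse_over_shoulder", "framing": "medium", "subject": "listener"},
    ],
    "demonstration": [
        {"shot": "master", "framing": "wide", "subject": "presenter"},
        {"shot": "insert", "framing": "close", "subject": "object"},
    ],
}

def plan_shots(event_type, participants):
    """Expand an event into concrete camera directives using its idiom."""
    plan = []
    for step in IDIOMS.get(event_type, []):
        directive = dict(step)
        directive["targets"] = participants.get(step["subject"], participants)
        plan.append(directive)
    return plan

shots = plan_shots("two_way_conversation",
                   {"speaker": ["chef"], "listener": ["host"], "both": ["chef", "host"]})
print(len(shots), shots[0]["framing"])  # 3 wide
```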
These approaches, however, are limited to highly constrained environments, real
or otherwise, which rules out the entire domain of the home movie.
Barry and Davenport (2003) describe interesting work aimed at transforming the role
of the camera from tool to creative partner. Their approach aims to merge subject sense
knowledge, everyday common sense knowledge stored in the Openmind Commonsense
database, and formal sense knowledge, the sort of knowledge gleaned from practiced
videographers, in order to provide on-the-spot shot suggestions. The aim is to help
during the capture process such that the resulting raw footage has the potential to be
sculpted into an engaging narrative come composition time. If taken, shot suggestions
retain their own metadata about the given shot.
The idea here is to influence the quality of the footage that will be presented to the
editing phase for those tools to take advantage of, because, from the point of view of the
editor, garbage in, garbage out.

Planning
Well, if that old adage is just as applicable here, why not shift the quality control
even farther upstream? Instead of the just-in-time (JIT) approach, why not plan for a good
harvest of footage right from the beginning? There are some who take this approach.
There are also tools for mocking up visualizations of the presentation in a variety of manifestations. The storyboard is a popular representation (a series of panels on which sketches depicting important shots or scenes are arranged) and has been a staple of professional filmmaking for a long time. Productions today often take advantage of three-dimensional (3D) modeling techniques to get an initial feel for appropriate shots and possible difficulties. Baecker et al. (1996) provide an example in their Movie [and lecture presentation] Authoring and Design (MAD) system. Aimed at a broad range of
lecture presentation] Authoring and Design (MAD) system. Aimed at a broad range of
users, it uses a variety of metaphors and views, allowing top-down and bottom-up
structuring of ideas, and even the ability to preview the production as it begins to take
shape. They note that an 11-year-old girl was able to create a two-minute film about
herself using a rough template for an autobiography provided by the system designers,
and this without prior experience using the software.
Bailey et al. (2001) present a multimedia storyboarding tool targeted at exploring
numerous behavioral design ideas early in the development of an interactive multimedia
application. One goal of the editor is to help authors determine narration and visual
content length and synchronization in order to achieve a desirable pace to the presentation.
In the field of (semi)automated media production, Kennedy and Mercer (2001) state that, "There is a rich environment for automated reasoning and planning about cinematographic knowledge" (p. 1), referring to the multitude of possibilities available to the
cinematographer for mapping high-level concepts, such as mood, into decisions regarding which cinematic techniques to use. They present a semiautomated planning system
that aids animators in presenting intentions via cinematographic techniques. Instead of
limiting themselves to a specific cinematic technique, they operate at the meta level,
focusing on animator intentions for each shot. The knowledge base that they refer to is
part of the conventions of film making (e.g., see Arijon, 1976, or Monaco, 1981), including
lighting, colour choice, framing, and pacing to enhance expressive power. This is an
example of a planning tool that leverages a little knowledge about the content of the
production.
This technology, where applicable, moves the problem of getting decent footage to the editing phase one step earlier: it is no longer simply impromptu help at capture time, but involves prior cognition.

Holistic: Planning to Editing


If any one of the above parts of the media authoring process is able to result in
better multimedia artifacts, then does it follow that considering the whole process
within a single clearly defined framework will result in larger gains?
Some work in the area appears to place certain demands on all phases of the media
creation process, to be overtly conscious of the relative place and importance of each,
resulting in interdependencies each with the other.
Agamanolis and Bove (2003) present interesting work aimed at video productions
that can re-edit themselves. Although their emphasis is on the re-editable aspect of
the production, they are nevertheless decidedly whole-process conscious. Rather than
the typical linear video stream produced by conventional systems, the content here is
dynamic, able to be altered client-side in response to a viewers profile. Their system,
VIPER, is designed to be a common framework, able to support responsive video
applications of differing domain. The purpose of the production, for example, educational
or home movie, dictates the nature and scope of manual annotations attached to the video
clips, which in turn support the editing model. Editing models are constituted by rules
written in a computer language specifically for the production at hand. They leverage
designer-declared, viewer-settable response variables, which are parameterizations that
dictate the allowable final configurations of the presented video, and are coded specifically for a given production in a computer language. An example might be a "tell me the same story faster" button.
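
The notion of a designer-declared, viewer-settable response variable can be sketched as a parameter consulted by an editing rule when the output is assembled; the names and the pacing rule below are invented for illustration and are not VIPER's actual language.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    duration_s: float
    essential: bool           # manual annotation: must survive any re-edit

def assemble(clips, pace):
    """Re-edit a clip list under a viewer-settable 'pace' response variable.

    pace = 1.0 keeps everything; pace = 2.0 behaves like a
    'tell me the same story faster' button: non-essential clips are dropped
    and what remains is trimmed.
    """
    keep = [c for c in clips if c.essential or pace <= 1.0]
    return [(c.clip_id, round(c.duration_s / pace, 2)) for c in keep]

clips = [Clip("arrival", 8.0, True), Clip("scenery", 12.0, False),
         Clip("birthday_cake", 10.0, True)]
print(assemble(clips, pace=1.0))
print(assemble(clips, pace=2.0))   # shorter, essentials only
```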
Davis (2003) frames the major problems of multimedia creation to be solved as
enabling mass customization of media presentations and making media creation more
accessible for the average home user. He calls into question the tripartite media
production process outlined above as being inappropriate to the defined goals. That
issue aside for now, it is interesting to note the specific deficiencies that his Media
Streams system seeks to address. In calling for a new paradigm for media creation, the
needs illuminated include: (1) capture of guaranteed quality reusable assets and rich
metadata by means of an Active Capture model, and (2) the redeploying of (a) domain
knowledge to the representation in software of media content and structure and the
software functions that dictate recombination and adaptation of those artifacts, and (b)
the creative roles to the designers of adaptive media templates, which are built of those
functions in order to achieve a purpose whilst allowing the desired personalization of that
message (or whatever). He uses two analogies to illustrate the way the structure is fixed in one sense whilst customisable in another: Lego provides a building block, a fixed interface, simple parts from which countless wholes may be made; this is analogous to syntagmatic substitution. Mad Libs, a game involving blind word substitution into an existing sentence template, is an example of paradigmatic substitution, where the user shapes the (probably nonsensical, but nevertheless amusing) meaning of the sentence within the existing syntactical bounds of the sentence. The title, "Editing out editing," alludes to the desire to enable the provider or user to push a button and have the media automatically assembled in a well-formed manner: movies as programs.
These approaches rely on the presence of strong threads running through the
production process from beginning to end. The precise nature of those threads varies.
It may be clearly defined production purpose (this will be an educational presentation,
describing the horrors of X), or a guaranteed chain of consistent annotation (e.g., content
expressed reliably in terms of the ontology in play this family member is present in
this shot), or assumptions about the context of captured material and so forth, or a
combination of these, but it is this type of long-range coherency and consistency of
assumptions and information transmission that enables a quality media artifact in the
final analysis, by whatever criteria quality is judged in the particular instance.
There is another aspect to the multimedia creation process, which we have neglected thus far. It has to do with reusing or repurposing media artifacts, and the
corresponding phase might be called "packaging for further use." The term reuse is
getting a lot of press these days in connection with multimedia. It is a complex topic and
we will not deal with it here except to say that some of the issues that come into play have
already cropped up in the preceding discussion. For example, those systems that
automatically allow for multiple versions of the same presentations contain a sort of
reuse. In order to do that our system needs to know something about the data and about
the context into which we are seeking to insert it.


ISSUES IN DESIGNING
MULTIMEDIA AUTHORING TOOLS
Defining Your Domain
The preceding, somewhat loosely grouped, treatment of research related to multimedia authoring serves to highlight an important issue, one so obviously necessary as to be often neglected: namely, definition of the domain of application of the ideas or system being propounded (do we assume our audience is well aware of it? Ours is the most interesting after all, isn't it?). By domain is meant the complex of assumed context (target user, physical limitations, target audience(s), etc.) and the rules, conventions or proprieties operating (including all levels of genre as commonly understood, and possibly even intrinsic genre à la Hirsch (1967)), which together constitute the air in which the solution lives and becomes efficient in achieving its stated goals.
Consider, just briefly, a few of the possibilities: Intended users can be reluctantly
involved hobbyists, amateurs or professionals. Grasp of, and access to, technology may
range from knowing where the record button is on the old low-resolution, mono camera,
to being completely comfortable with multiple handheld devices and networks of remote
computing power. Some things might come easily to the user, while others are a struggle.
The setting might be business or pleasure, the environment solo or collaborative. The
user might have a day to produce the multimedia artifact, or a year. The intended audience
may be self, family and friends, the unknown interested, the unknown uninterested, or
all of the above, each with constraints reflecting his or her own.
That is not to say that each and every aspect must be explicitly enumerated and
instantiated, nor that they should all be specified to a fine point. Genres and abstractions,
catchalls, exist precisely because of their ability to specify ranges or sets that are more
easily handled. In one sense, they are the result of a consideration of many of the above
factors and serve to funnel all of those concerns into a set of manageable conventions.
What is important, though, is that the bounds you are assuming are made explicit.

Defining the Gap


But why is domain important to know? Can't we forget issues of context and purpose (they are kind of hard to determine sometimes anyway) and concentrate on simple concretes, such as the type of data that we're dealing with? Video capture and manipulation, how hard can that be? Such a data-type-centric approach is appealing, but it doesn't help define the contours of the chief problem of technology highlighted above: the lack our technology is trying to supply.
What is the nature of the gap that we are trying to bridge? Consider an example by way of illustration: in the editing room of a professional feature film, the editor often needs to locate a piece of footage known to exist to fill a need; the problem is retrieval, and the gap is information regarding the whereabouts of the desired artifact. For example, in making the movies of Lord of the Rings, literally hours of footage were chiseled down to seconds in some scenes. For the computer-generated movie Final Fantasy, the problem was locating resources in the vast web of remote repositories, versions and stages of processing. Contrast that situation with a home videographer trying to assemble a
vacation movie from assembled clips. How much of what and which should go where?
The problem in this case is which sequence makes for an interesting progression, and
the corresponding gap is an understanding of film grammar and narrative principles (or
some other theory of discourse structure). The difference is between finding a citation
and knowing the principles of essay writing.

Closing the Gap


Having named the gap, there are, no doubt, a few possible solutions which offer
themselves. In the case of the amateur videographer above, which we will take for our
example from here on in, the solution might be to either educate the user or lower his
expectations. (Education of expectations might be needful prior to education in the actual
craft of film making, as sometimes the problem is that the user does not know what is
possible!) Dismissing the latter as embarrassingly defeatist, we may set ourselves to
educate the user. In the final analysis, this would result in the highest quality multimedia
artifact, but, short of sending the user to film school, and noting that there is probably
a reason that the user is an amateur in the first place (I dont have the time or energy
for a full-blown course on this stuff; no, not even a book), what is possible?
What would we do for the writer of our analogy? We can provide a thesaurus and dictionary, or put squiggly lines under the offending text (there are way too many under these words). We could go a step further and provide grammar checking. By doing these things we are offering help regarding the isolated semantics (partial semantics really, as linguists tell us semantics come in sentences: "The sentence is the fundamental unit of speech" (Hirsch, 1967, p. 232)), in the case of the spellcheck, dictionary, and thesaurus, and syntactical aid via the grammar checker.
But this essentially doesn't help us to write our essay! How do we connect our islands of meaning into island chains, then into a cogent argument or message (call it discourse), and finally express that discourse well in the chosen medium and genre? Let us consider the problem as one of first formulating the discourse we wish to convey (e.g., this scene goes before this one and supports this idea), and secondly expressing that discourse in a surface manifestation (e.g., video), using the particular powers of the medium to heighten its effectiveness2.

Formulate the Discourse


Let us delve a level deeper. Continuing with the example of the amateur videographer, what can we fairly assume that he or she does know? Humans are good at knowing the content (this shot footage contains my son and me skiing) and, at least intuitively, the relationships among the parts (sons have only one father, and skiing can be fun and painful). This is precisely the sort of knowledge that we find difficult to extract from media, model and ply with computation. Capturing this sense common to us all, and the explosion of connotational relationships among its constituents, is an active and, to put it mildly, daunting area of research in the knowledge modeling community. Humans, on the other hand, do this well (is there another standard by which to make the comparison?).
But it is the knitting together of these parts, film clips in this case, which is the first part of the media authoring process the amateur finds difficult. We can liken the selection and ordering of footage to the building of an argument. Knowledge about the isolated pieces (sons and fathers) isn't sufficient to help us form a reasoned chain of logic. We
might sense intuitively where we want to end up (an enjoyable home movie), but we don't
know how to get there. Those very same pieces of knowledge (sons and fathers) do not
tell us anything about how they should be combined when plucked from their semantic
webs and put to the purpose of serving the argument.
But this is the sort of thing we have had success at getting computers to do.
Enumerating possibilities from a given set of rules and constraints is something that we
find easy to express algorithmically.
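To make this concrete, here is a minimal, purely illustrative sketch (in Python) of enumerating clip orderings from a handful of rules and constraints; the clip names, narrative labels and ordering rule are invented for the example and are not part of the system described later in this chapter.

```python
from itertools import permutations

# Hypothetical clips, each tagged with a simple narrative function.
clips = {
    "arrival_at_pool": "setup",
    "child_hesitates": "climax",
    "first_splash": "climax",
    "laughing_afterwards": "resolution",
}

# Toy ordering rule: setup precedes climax, climax precedes resolution.
ORDER = {"setup": 0, "climax": 1, "resolution": 2}

def is_valid(sequence):
    """A sequence is valid if narrative functions never move backwards."""
    ranks = [ORDER[clips[c]] for c in sequence]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

# Enumerate every ordering that satisfies the constraints.
valid_sequences = [seq for seq in permutations(clips) if is_valid(seq)]
for seq in valid_sequences:
    print(" -> ".join(seq))
```

Simple as it is, this is exactly the kind of enumeration under constraints that a machine performs tirelessly and a human finds tedious.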
The particular model that helps us generate our discourse, its level of complexity and emphasis, will vary. But deciding upon an appropriate model is the necessary first step. If the target is an entertaining home movie, one choice is some form of narrative model: it may be simple (resolutions follow climaxes and so forth) or more involved, like Dramatica (www.dramatica.com), involving a highly developed view of story as argument, with its "Story Mind" and four "throughlines" (e.g., the "Impact Character" throughline, who represents an alternate approach to the main character). Following Dramatica's cues is meant to help develop stories without holes in the argument.
If the target is verity, something like Rhetorical Structure Theory (RST) may be more appropriate, given its emphasis on determining an objective basis for the coherency of a document. The nature of its relations seems appropriate for a genre like news, which purports to deal in fact, and we have already seen Lindley et al. (2001) use it for this purpose.
Given that we have a discourse laid out for us in elements, how do we fill them in with content? Lang (1999) uses a generative grammar to produce terminals that are first-order predicate calculus schemas about events, states, goals and beliefs. The story logic has helped us to define what is needed at this abstract level, but how do we instantiate the terminals with pieces of living, breathing content that match the abstract contract but are nuanced by the superior human world model?

Express the Discourse


In addition to not knowing how to build the discourse, the user doesn't understand how to express those discourse elements (now instantiated to content) using the medium of choice, film in this case. We desire to produce a surface manifestation that uses the particular properties of the medium and genre to heighten the effectiveness of the underlying discourse. Our story formulation might have generated a point of conflict. Fear can produce conflict, and perhaps we instantiate this with an eye to our child's first encounter with a swimming pool. But how should I film the child? One way of supporting the idea that the child is fearful is to use the well-known high camera angle, thus shrinking him in proportion to the apparently hostile environment and exacerbating that fear in the viewer's eyes.
The amateur videographer cannot be expected to know this. We noted earlier that there are many parameters to be considered for a given video clip capture. However, there are conventions known to the educated user (i.e., professional filmmakers) that define appropriate parameterizations for given discourse-related goals: for example, parameters relating to emphasis (such as shot duration patterning and audio volume), to emotional state (such as colour atmosphere), to spatial location (such as focal depth and audio cues), and to temporal location (such as shot transition type and textual descriptions). In other words, there exists the possibility of encoding a mapping of discourse goals to cinematic parameters.
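Purely as an illustration of what such an encoding could look like (the goals, parameters and values below are invented rather than drawn from any film grammar reference), the mapping might be a simple lookup from discourse goals to suggested cinematic parameters:

```python
# Hypothetical mapping from discourse-level goals to cinematic parameters.
# The specific values are illustrative, not authoritative film grammar.
DISCOURSE_TO_CINEMA = {
    "subject_is_fearful":  {"camera_angle": "high", "framing": "long_shot"},
    "emphasize_moment":    {"shot_duration": "short", "audio_volume": "raised"},
    "sombre_mood":         {"colour_atmosphere": "desaturated"},
    "approaching_climax":  {"tempo": "rising", "shot_duration": "decreasing"},
}

def shot_directive(goals):
    """Merge the cinematic suggestions for a set of discourse goals."""
    directive = {}
    for goal in goals:
        directive.update(DISCOURSE_TO_CINEMA.get(goal, {}))
    return directive

print(shot_directive(["subject_is_fearful", "approaching_climax"]))
```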
Figure 2. Role of human and computer in the generation of quality media (the figure depicts the chain from structure of discourse to manifestation directive, instantiated manifestation and final (re-)manifestation, with computer and human alternating roles; the final step is constrained by the success of instantiation, i.e., the captured footage)

Where would this leave us in relation to respective roles for human and computer in the generation of quality media that communicates or entertains, and does so well? An important point is that the degree of freedom of the manifestation parameterization asked for (for example, "give me a medium shot of your son next to his friend") is not unlimited under the conditions that attend the home movie maker: the context of the user may be impromptu, or the videographer may be purely an observer who cannot affect the environment being filmed (unlike, for example, automated text generation).

What about Metadata?


It could be argued that if the process under discussion proceeds perfectly, from the conception of the multimedia creation to the capturing of all manifestation directives, there need not be any metadata attached to those manifestations. The job is done. But if our domain tells us that it is a particularly noisy process, liable to missed or skewed directive following (and this is the case with the amateur videographer), or if we desire the ability to make changes to the presentation at the discourse level after the capturing of the manifestation (or indeed we seek a new home for the media via reuse at a later date), then we need enough semantic information about the clips in order to reconnect them with the discourse logic and recompile them into new statements.
What do we need to know about the captured media in order to generate a longer
version of the holiday with the kids, if there is footage, for the grandparents, and a shorter
version for our friends? Or maybe we want a more action-packed version for the kids
themselves, and a more thoughtful version for the parents to watch.
The first point would be to not attempt to record everything. Metadata by definition is data about data. But where does it stop? Surely we could talk about data about data about data, ad infinitum. Who can circumnavigate this semantic web? This forces upon us the need for a relevant scope of metadata. We suggest that the domain and genre provide the scope and nature of the metadata required. Obviously, we must record metadata related to discursive function: for example, this shot is part of the act climax and contains the protagonist. But we must also record manifestation-related metadata (e.g., this is a high-angle, medium shot of three actors). Indeed these representations are needed in order to generate the manifestation directives in the first place, but also to regenerate them in reaction to changes at the discourse level. They should be as domain specific as possible, because without precise terminology, appropriate representations and structures, we can only make imprecise generative statements about the media. For example, for the case of the movie domain, representations such as shots, scenes, and elements of cinematography such as framing type and motion type are appropriate, as they form the vocabulary of film production, theory and criticism.
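One hedged sketch of what such domain-specific shot metadata might look like in practice; the field names are our own invention, chosen to mirror the examples above rather than any particular metadata standard:

```python
from dataclasses import dataclass

@dataclass
class ShotMetadata:
    # Discursive function: where the shot sits in the narrative.
    act: str = "unknown"               # e.g., "act_climax"
    contains_protagonist: bool = False
    # Manifestation-related metadata: how the shot was filmed.
    camera_angle: str = "eye_level"    # e.g., "high"
    framing: str = "medium"            # e.g., "close_up", "medium", "long"
    actor_count: int = 0
    motion: str = "static"             # e.g., "pan", "tilt", "static"

climax_shot = ShotMetadata(act="act_climax", contains_protagonist=True,
                           camera_angle="high", framing="medium", actor_count=3)
print(climax_shot)
```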
They also must provide a sound basis for inferring other, higher-order properties of the media, and at differing resolutions, dependent upon whether more or less detail can be called for or reliably known. For example, representations like shot and motion allow us to infer knowledge of the film tempo (Adams et al., 2002), of which they are constituents. Certain types of inference may require a degree of orderliness in the representations used, such as that imparted by hierarchical systems of classification among parts (taxonomic, partonomic, ontological). For example, knowledge that a particular location forms the setting for a scene implies that it is always present, always part of the scene (even if it is not visible due to other filming parameters), whereas actors may enter and leave the scene. Another example: a partonomy of three-act narrative structure allows us to infer that any dramatic events within the first act are likely to carry the dramatic function of setup; that is, some aspect of the characters or situation is being introduced.
Finally, some forethought as to how easy the user will find the metadata to understand and verify is also pertinent. Asking an amateur videographer to film a shot with a cluttered z-axis (a term associated with placement of objects or actors along an imaginary line running from the camera's point of view into the scene) probably isn't all that wise. However, creative ways of conveying ideas not familiar to a user will be necessary in any case; a 3D mockup might help here.
In short, the discussion could be reduced to one principle: get humans to do what they do well and computers to do what they do well. Our multimedia authoring tool may well need to be a semiautomatic solution, depending on your domain, of course.

EXAMPLE SOLUTION TO
THE RAISED ISSUES
In this section, in order to further concretize the issues raised, we will consider a
specific example of a multimedia authoring system which endeavours to address them.

Our Domain
The domain of this system is the amateur/home user, seeking to make media to be shared with friends and family, with potentially enough inherent interest to be sharable with a wider audience for pleasure. The planning level is assumed to be anything up to last minute, with obvious benefits for more prior notice. Some mechanism for adapting to time constraints, such as "I only have an hour of filming left", is desirable. At present we assume only a single camera is available, and ideally a handheld personal computer (PC). The assumed cinematographic skill of the user is from point-and-click upward. Scope for personal reuse and multiple views of the same discourse, parameterized by genre, is a desired goal. A generally impromptu context is assumed, with some ability to manipulate scene existents possible although not necessary.
The user in this context generally wants to communicate better with their home movies, and implicitly wants something a bit more like what can be seen on TV or at the movies. We can visualize what amateur videographers produce on a scale from a largely unmediated record, easily obtained by simply pointing the camera at anything vaguely interesting to the user (what they have), to a movie, edited well, expressing a level of continuity and coherency, with a sense of narrative (implicit or even overt), and using the particulars of the film medium (camera angle, motion, focal distance, etc.) to heighten expression of content (what they want).
It is instructive to consider the genres on this scale more closely.

Record: This is what we are calling the most unmediated of all footage which the home user is likely to collect. It may be a stationary camera capturing anything happening within its field of view (golf swing, party, etc.). We note that many of the low-level issues, such as adjustment for lighting conditions and focus, are dealt with automatically by the hardware. But that is all you get. That means the resulting footage, left as is, is only effective for a limited range of purposes, mainly as information, as hinted at by the label "record". In other words, "this is what my golf swing looks like", where the purpose might be to locate problems. In the case of the party, the question might be "who was there?"
Moving photo-album: This is where the video camera is being used something like a still camera, but with the added advantage of movement and sound. The user typically walks around and snaps content of interest. As with still images, he may more or less intuitively compose the scene for heightened mediation (for example, close-ups of the faces of kids having fun at the sea), but there is little thought of clip-to-clip continuity. The unit of coherence is the clip.
Thematic or revelatory narrative: Here is where we first find loose threads of coherence between clips/shots, and that coherence lies in an overarching theme or subject. In the case of thematic continuity, an example might be a montage sequence of "baby". By revelatory narrative, we mean something concentrated on simply observing the different aspects and nuances of a situation as it stands, unconcerned with a logical or emotional progression of any sort. There are ties that bind shots together which need to be observed, but they are by no means stringent.
Traditional narrative home movie: By traditional narrative we mean a story in the generally conceived sense, displaying that movement toward some sort of resolution, where "There is a sense of problem-solving, of things being worked out in some way, of a kind of ratiocinative or emotional teleology" (Chatman, 1978, p. 48). The units of semantics are larger than shots, which are subordinated to the larger structures of scenes and sequences. Greater demands are placed upon the humble shot, as it must now snap into scenes. The bristle of shot parameters (framing type, motion, angle) now feeds into threads, such as continuity, running between shots, which must be observed, making shots more difficult to place. Therefore we need more forethought, so that when we get to the editing stage we have the material we need.

How Do We Close the Gap?


How do we move the user and his authored media up this scale? What does the user lack in order to move up it? The gap is the understanding of discourse (in this case narrative) and the transformation of discourse into well-formed surface manifestations (in this case the actual video footage, with its myriad parameters). That is, we note that what got harder above was determining good sequences as well as the links we require between shots.
Let's consider a solution aimed at helping our user achieve something like a traditional narrative in their home movies, and leave realization of the other genres for another time.
Incidentally, there is a lot of support for the notion of utilizing narrative ideas as a means to improving the communicative and expressive properties of home movies. Schultz and Schultz (1972, p. 16) note that "The essential trouble with home movies ... is the lack of a message", and Beal (1974, p. 60) comments that the raison d'être of amateur film, the family film, "needs, at least, the tenuous thread of a story".
We need

A method of generating simple narratives (say, three-act narratives built of turning points, climaxes and resolutions, as used by many feature films) with reference to user-provided constraints (e.g., there are n actors); a toy sketch of such a generator follows this list.
A method of transforming a generated story structure into manifestation directives, specifically shot directives, with reference to a user-parameterizable style or genre.
A way for the user to verify narrative and cinematic metadata of captured footage which adds little extra burden to the authoring process.
A method for detecting poorly captured shot directives, assessing their impact on local and global film aesthetics, such as continuity or tempo, and transformations to recover from this damage where possible, without requiring that the shot directives be refilmed.
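A toy sketch of the first requirement, generating a simple three-act event structure under user constraints; the event names and acts are invented for illustration and are far simpler than a real narrative model such as Dramatica:

```python
# Toy three-act narrative generator (illustrative only).
def three_act_template(occasion, n_actors):
    """Return an ordered list of abstract narrative events for an occasion."""
    acts = {
        "act_1_setup": [f"introduce_actor_{i}" for i in range(1, n_actors + 1)]
                       + [f"establish_{occasion}_setting"],
        "act_2_confrontation": ["turning_point", "rising_action", "climax"],
        "act_3_resolution": ["resolution", "closing_image"],
    }
    # Flatten into an ordered list of (act, event) pairs, the deliverable
    # handed on to the stage that maps events to shot directives.
    return [(act, event) for act, events in acts.items() for event in events]

for act, event in three_act_template("birthday_party", n_actors=2):
    print(f"{act}: {event}")
```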

System Overview
We have implemented a complete video production system that attempts to achieve these goals, constituted by a storyboard, direct, and edit life cycle analogous to the professional film production model. Space permits only a high-level treatment of the system. The salient components of the framework are

a narrative template: an abstraction of an occasion, such as a wedding or birthday party, viewed in narrative terms (climaxes, etc.), which may be easily built for a given occasion or obtained from a preexisting library;
user-specified creative purpose, in terms of recognizable genres, such as action or documentary, which are in turn mapped to affective goals;
those goals are then taken up by a battery of aesthetic structuralizing agents, which automatically produce a shooting-script or storyboard from the narrative template and user purpose;
which provides a directed interactive capture process, resulting in footage that may be automatically edited into a film or altered for affective impact.

Figure 3 to Figure 5 present different views of the media creation process. We will
now discuss what happens at each stage of the workflow depicted in Figure 5. Note that,
while similar to Figure 1, Figure 5 is the amateur workflow in this case.
Figure 3. Overview of media creation framework (the figure shows the storyboarding, directing and editing stages: purpose and a narrative template feed the shooting-scripter and aesthetic structuralizers, which produce the storyboard; directed capture yields a capture record and a shot-to-footage map; and the film assembler turns these into the finished film and movie)

Figure 4. Functional transformation view of media creation framework (the roles of author, screenwriter, director/cameraman and editor correspond to the stages author, mediate, affect, capture, align and redress, transforming the story from events and scenes through shot directives, potential "video", raw footage and a "rough cut" into the final movie)

Author: The purpose of the first stage is to create the abstract, media nonspecific
story for the occasion (wedding, party, anything) that is to be the object of the home
movie. That is to say, the given occasion separated into parts, selected and ordered,
and thus made to form the content of a narrative. It culminates in a narrative
template, which is the deliverable passed to the next stage. Events may have a
relative importance attached to them, which may be used when time constraints
come into play. Templates may be created by the user, either from scratch or
through composition, specialization or generalization of existing templates, built
with the aid of a wizard using a rule-based engine or generative grammar seeded with user input, or else simply selected as is from a library. The user is offered differing levels of input, allowing for his creativity or lack thereof. We have our discourse.

Figure 5. Workflow of amateur media creation (the figure spans the planning, execution and polishing phases: a story idea is selected and creativity applied by the author; the screenwriter, director/cameraman and editor exchange story, cinesthetic and cinematic metadata, drawing on other movies, formulated rules of film grammar, adaptation constraints and edit rules; raw footage becomes the movie, carrying rich composite metadata (e.g., an MPEG-7 stream) through to the viewer's experience)

Mediate: The purpose of this stage is to apply or specialize the narrative template obtained in the author stage to our specific media and domain of the home movie. This stage encapsulates the knowledge required to manifest the abstract events of the discourse in a concrete surface manifestation, in this case the pixels and sound waves of video.
Affect: The purpose of this stage is to transform the initial media-specific directives produced by the mediate stage into directives that maintain correct or well-formed use of the film medium, such as observing good film conventions like continuity, and that also better utilize the particular expressive properties of film in relation to the story, such as raising the tempo toward a climax, and this with reference to the style or genre chosen by the user (for example, a higher tempo is allowed if the user wants an action flick). The end result is a storyboard of shot directives for the user to attempt to capture. Shot directives are the small circles below the larger scene squares of the storyboard in Figure 3. They can also be seen as small circles in the rectangular storyboards of Figure 4. The affect and mediate stages taken together achieve the transformation from story structure to surface manifestation.
Capture: The purpose of this stage is simply to realize all shot directives with actual
footage. The user attempts to capture shots in the storyboard. He is allowed to do
this in any order, and may attempt a given shot directive any number of times if
unhappy with it. A capture is deemed a success or failure with respect to the shot
directive, consisting of all of its cinematic parameters. For example, a shot directive
might require the user to capture the bride and groom in medium shot, at a different
angle than the previous shot. Some parameters are harder to capture than others,
and the level of difficulty may be thresholded by the user. But this does affect which
metadata are attached to the footage when the user verifies it as a success. For
example, if the user is currently not viewing the camera angle parameter of the shot
directive, it is not marked as having been achieved in the given footage. This simple Success/Failure protocol with respect to a stated target avoids burdensome annotation but is deceptively powerful.
Align: The purpose of this stage is to check (where possible) whether the footage captured for a given shot directive has indeed been captured according to the directive, and in the case where more than the required duration has been captured, to select the optimal footage according to the shot directive. The shot directive, plus the user's verification of it as applying to the footage, provides a context that can be used to seed and improve algorithms which normally have low reliability in a vacuum, such as face recognition. It can also potentially point out inconsistencies in the user's understanding of the meaning of shot directive parameters. For example, "You consider these three shots to be close-ups, yet the face in the third is twice as large as in the other two; do you understand what framing type is?"
Redress: Following the align stage, we may have affect goals, that is, metashot properties (a tempo ramp turned into a plateau, or broken continuity), gone awry due to failures in actual capture. Therefore, this stage attempts to achieve, or get closer to, the original affect goals using the realized shot directives. For example, can we borrow footage from another shot to restore the tempo ramp? (A simplified sketch of this staged workflow follows.)
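The following is a highly simplified, purely illustrative skeleton of the staged workflow just described (author, mediate, affect, capture, align, redress); all function bodies are placeholders and do not reflect the actual implementation of the system:

```python
# Illustrative skeleton of the storyboard-direct-edit life cycle.
def author(occasion, constraints):
    """Produce an abstract narrative template (discourse level)."""
    return [("act_1", "setup"), ("act_2", "climax"), ("act_3", "resolution")]

def mediate(template):
    """Specialize abstract events to the home-movie domain."""
    return [{"event": e, "act": a, "medium": "video"} for a, e in template]

def affect(directives, genre):
    """Add cinematic parameters consistent with film convention and genre."""
    for d in directives:
        d["tempo"] = "rising" if genre == "action" and d["event"] == "climax" else "steady"
    return directives

def capture(storyboard):
    """User films each shot directive; success/failure is recorded."""
    return [{"directive": d, "captured": True} for d in storyboard]

def align(footage):
    """Check captured footage against its directive and select the best takes."""
    return [f for f in footage if f["captured"]]

def redress(aligned):
    """Attempt to recover broken affect goals (e.g., a damaged tempo ramp)."""
    return aligned

movie = redress(align(capture(affect(mediate(author("wedding", {})), "action"))))
```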

Discussion of Problems that Surfaced


We will now highlight two problems that directly impinge on the crucial issue of leveraging human knowledge in order to verify metadata in a manner which puts little extra burden on the user.

Misunderstanding of Directives
The unit of instruction for the user is the shot directive. Each shot directive consists
of a number of cinematic primitives which the user is to attempt to achieve in the shot.
The problem lies in the fact that even these terms are subject to misinterpretation by our
average user, remembering that the user may have little grasp of cinematic concepts.
The system has allowances for differing degrees of user comfort with cinematic directives, a requirement stemming from the definition of our target user as "average", that is, variable in skill. Shot directive parameters may be thresholded in number and difficulty. For example, a novice might only want to know what to shoot and whether to use camera motion, whereas someone more comfortable with the camera might additionally want directives concerning angle and aspect, and perhaps even motivation for the given shot in terms of high-level movie elements, such as tempo.
But this still doesn't solve the problem of when a user thinks he understands a shot directive parameter, but in actual fact does not.
In the thumbnails of Figure 6, taken from a recent home movie of a holiday in India built with the authoring system, we see that the same shot directive parameter (framing type, in this case calling for a close-up in two shots) has been realized once correctly as a close-up and once as something more like a medium shot. The incorrectly filmed shot will undoubtedly impact negatively on whatever aesthetic or larger scene orchestration goals for which it was intended.
One possible solution to this problem would be to use a face detection algorithm
to cross-check the user's understanding. The algorithm may be improved with reference to the other cinematic primitives of the shot directive in question (e.g., the subject is at an oblique angle) as well as footage resulting from similar shot directives. Obviously this solution is only applicable to shots where the subject is human.

Figure 6. Thumbnails of incorrectly and correctly shot close-ups, respectively
The goal is to detect this inconsistency and alert the user to the possibility that they
have a misunderstanding about the specific parameter request. Of course, good visualization, 3D or otherwise, of what the system is requesting goes a long way in helping the
user understand.
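A rough sketch of such a cross-check, comparing the size of the largest detected face against the framing type the user has verified; the thresholds and the mapping from face-area ratio to framing type are invented for illustration, and an off-the-shelf OpenCV face detector is assumed to be available:

```python
import cv2  # assumes opencv-python is installed

def estimated_framing(frame):
    """Guess a framing type from the largest detected face (illustrative thresholds)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no human subject detected; the check does not apply
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    ratio = (w * h) / float(frame.shape[0] * frame.shape[1])
    if ratio > 0.10:
        return "close_up"
    if ratio > 0.02:
        return "medium"
    return "long"

def check_framing(frame, claimed):
    """Alert if the user's claimed framing disagrees with the detector's guess."""
    guess = estimated_framing(frame)
    if guess is not None and guess != claimed:
        print(f"Possible misunderstanding: you verified '{claimed}', "
              f"but this looks more like a '{guess}' shot.")
```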

Imprecision in Metadata Verification


A second problem has to do with metadata imprecision. The system uses humans
to verify content metadata, and that metadata in turn is coupled with the narrative
metadata in order to enable automated (re)sequencing and editing of footage via
inferences about high-level film elements (e.g., continuity, visual approach, movie tempo
and rhythm). Some of these inferences rely on an assumed level of veracity in the
metadata values. An example will help.
Figure 7 contains thumbnails from three shots intended to serve the scene function
of Familiar Image. A familiar image shot is a view that is returned to again and again
throughout a scene or section of a scene in order to restabilize the scene for the viewer after the new information of intervening shots. It cues the viewer back into the spatial layout of the scene, and provides something similar to the reiteration of points covered so far of an effective lecturing style.

Figure 7. Thumbnails from three shots intended to serve the scene function of Familiar Image
The two shots on the bottom are shot in a way that the intended familiar image function is achieved; they are similar enough visually. The first shot, however, is not. It was deemed close enough by the user, with reference to the shot directive, but in terms of achieving the familiar image function for which the footage is intended, it fails. In actual fact, the reason for the difference stemmed from an uncontrollable element: a crowd gathered at the first stall.
Here, given the impromptu context of the amateur videographer, part of the solution lies in stressing the importance of the different shot directive parameters. There is provision in the system for ascribing differing levels of importance to shot parameters on a shot-by-shot basis. For a shot ultimately supposed to provide the familiar image function, this would amount to raising the importance of all parameters related to achieving a visual shot composition similar to another shot (aspect, angle, etc., included). This in turn requires that the system prioritize from the discourse level down: at this point, is the stabilizing function of the familiar image more important, impacting on the clarity of the presentation, or is precise subject matter more needful?
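A small sketch of the kind of weighted comparison this implies; the parameter names, weights and scoring rule are invented for illustration and are not those of the implemented system:

```python
# Illustrative weighted match between a shot directive and captured footage.
def weighted_match(directive, footage, weights):
    """Return a score in [0, 1]: how well the footage satisfies the directive,
    with per-parameter importance weights."""
    total = sum(weights.values())
    achieved = sum(w for p, w in weights.items()
                   if footage.get(p) == directive.get(p))
    return achieved / total if total else 0.0

directive = {"framing": "medium", "angle": "eye_level", "subject": "market_stall"}
footage   = {"framing": "medium", "angle": "eye_level", "subject": "crowd"}

# For a familiar-image shot, compositional parameters outweigh precise subject.
familiar_image_weights = {"framing": 3.0, "angle": 3.0, "subject": 1.0}
print(weighted_match(directive, footage, familiar_image_weights))  # 6/7, about 0.86
```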
FUTURE FOR MULTIMEDIA AUTHORING


If we had to guess at a single factor that will cause the greatest improvement in multimedia authoring technology, it would be the continued drawing together of different fields of knowledge over all of the phases of authoring, from purpose to repackaging. It should never be easy to say, "Not my problem."
We've seen discourse theory, media aesthetics, human-computer interaction issues, all the way through to content description standards (e.g., MPEG-7), impact authoring technology. But none of these fields has the ultimate answer, none is stagnant, and each must therefore continually be queried for new insights that may bear on the authoring endeavour.
We use the field of applied media aesthetics to structure our surface manifestations
in order to achieve a certain aesthetic goal. A question we encountered when designing
the system was: How do we resolve conflict among agents representative of different
aesthetic elements (e.g., movie clarity and complexity) who desire mutually conflicting
shot directive parameterizations? We prioritized in a linear fashion, first come first served,
after first doing our best to separate the shot directive parameters that each agent was
able to configure. Is this the only way to do it? Is it the best way? We can ask our friends
in media aesthetics, and in so doing we might even benefit their field. It is interesting to
note that Rhetorical Structure Theory was originally developed as part of studies of
computer-based text generation, but now has a status in linguistics that is independent
of its computational uses (Mann, 1999).
Improvements in hardware, specifically camera technology, will hopefully allow us to provide one point of reference for the storyboard and its shot directives (the camera itself), rather than the current situation, which requires the videographer to have both the camera and a handheld device. This may seem a trivial addition, but it helps cross the burden line for the user, and is therefore very much part of the problem.
Undoubtedly, technology that sees humans and computers doing what they each
do best will be the most effective.

CONCLUSION
We have considered the problem of creating effective multimedia authoring tools. This need has been created by the increased ease with which we generate raw media artifacts (images, sound, video, text), a situation resulting from the growing power of multimedia-enabling hardware and software, coupled with an increasing desire by would-be authors to create and share their masterpieces.
In surveying existing approaches to the problem, we considered the emphases of
each in relation to the traditional process of media creation: planning, execution, and
polishing. We finished with a treatment of some examples of research with a particular
eye to the whole process.
We then identified some of the key issues to be addressed in developing multimedia
authoring tools. They include definition of the target domain, recognition of the nature
of the gap our technology is trying to bridge, and the importance of considering both the
deeper structures relating to content and how it is sequenced and the surface manifestations in media to which they give rise. Additionally, we highlighted the issue of deciding upon the scope and nature of metadata, and the question of how it is instantiated.
We then presented an implementation of a multimedia authoring system for building home movies, in order to demonstrate the issues raised. In simple terms, we found it effective to algorithmically construct the underlying discourse, to humanly fill in metadata, and to algorithmically shape the raw media to the underlying discourse and given genre by means of that same metadata.

REFERENCES
Adams, B., Dorai, C., & Venkatesh, S. (2002). Towards automatic extraction of expressive
elements from motion pictures: Tempo. IEEE Transactions on Multimedia, 4(4),
472-481.
Agamanolis, S., & Bove, V., Jr. (2003). Viper: A framework for responsive television. 10(1), 88-98.
Arijon, D. (1976). Grammar of the film language. Silman-James Press.
Baecker, R., Rosenthal, A., Friedlander, N., Smith, E., & Cohen, A. (1996). A multimedia
system for authoring motion pictures. ACM Multimedia, 31-42.
Bailey, B., Konstan, J., & Carlis, J. (2001). DEMAIS: Designing multimedia applications
with interactive storyboards. In the Ninth ACM International Conference on
Multimedia (pp. 241-250).
Barry, B., & Davenport, G. (2003). Documenting life: Videography and common sense.
In the 2003 International Conference on Multimedia and Expo, Baltimore, MD.
Beal, J. (1974). Cine craft. London: Focal Press.
Bobick, A., & Pinhanez, C. (1995). Using approximate models as source of contextual
information for vision processing. In Proceedings of the ICCV95 Workshop on
Context-Based Vision (pp. 13-21).
Casares, J., Myers, B., Long, A., Bhatnagar, R., Stevens, S., Dabbish, L., et al. (2002).
Simplifying video editing using metadata. In Proceedings of Designing Interactive
Systems (DIS 2002) (pp. 157-166).
Chatman, S. (1978). Story and discourse: Narrative structure in fiction and film. Ithaca,
NY: Cornell University Press.
Davis, M. (2003). Editing out editing. IEEE Multimedia Magazine, Special Edition on Computational Media Aesthetics (pp. 54-64). IEEE Computer Society.
Girgensohn, A., Boreczky, J., Chiu, P., Doherty, J., Foote, J., Golovchinsky, G., et al.
(2000). A semi-automatic approach to home video editing. In Proceedings of the
13th Annual ACM Symposium on User Interface Software and Technology (pp. 81-89).
He, L.-W., Cohen, M., & Salesin, D. (1996). The virtual cinematographer: A paradigm for
automatic real-time camera control and directing. Computer Graphics, 30(Annual
Conference Series), 217-224.
Hirsch Jr., E. (1967). Validity in interpretation. New Haven, CT: Yale University Press.
Apple. (2003). The new iMovie: Video and audio snap into place [Brochure].
Kennedy, K., & Mercer, R. E. (2001). Using cinematography knowledge to communicate
animator intentions. In Proceedings of the First International Symposium on
Smart Graphics, Hawthorne, New York (pp. 47-52).
Lang, R. (1999). A declarative model for simple narratives. In AAAI Fall Symposium on
Narrative Intelligence (pp. 134-141).
Lindley, C. A., Davis, J., Nack, F., & Rutledge, L. (2001). The application of rhetorical
structure theory to interactive news program generation from digital archives. CWI
technical report INS-R0101.
Mann, B. (1999). An introduction to rhetorical structure theory (RST). Retrieved from
http://www.sil.org/linguistics/rst/rintro99.htm
Monaco, J. (1981). How to read a film: The art, technology, language, history and theory
of film and media. Oxford, UK: Oxford University Press.
Nack, F. (1996). AUTEUR: The application of video semantics and theme representation for automated film editing. Doctoral dissertation, Lancaster University, UK.
Nack, F. & Parkes, A. (1995). Auteur: The creation of humorous scenes using automated
video editing. IJCAI-95 Workshop on AI Entertainment and AI/Alife.
Pfeiffer, S., Lienhart, R., & Effelsberg, W. (1997). Video abstracting. Communications of the ACM, 40(12), 54-63.
Sack, W., & Davis, M. (1994). IDIC: Assembling video sequences from story plans and
content annotations. Proceedings of IEEE International Conference on Multimedia Computing and Systems, (pp. 30-36).
Schultz, E. & Schultz, D. (1972). How to make exciting home movies and stop boring your
friends and relatives. London: Robert Hale.
Skov, M., & Andersen, P. (2001). Designing interactive narratives. In COSIGN 2001 (pp.
59-66).
Tomlinson, B., Blumberg, B., & Nain, D. (2000). Expressive autonomous cinematography
for interactive virtual environments. In Proceedings of the Fourth International
Conference on Autonomous Agents (AGENTS 2000) (pp. 317-324), Barcelona, Spain, June 3-7.
Wactlar, H., Christel, M., Gong, Y., & Hauptmann, A. (1999). Lessons learned from
building a terabyte digital video library. IEEE Computer Magazine, 32, 66-73.

ENDNOTES
1. In the literary world, surveys turn up a remarkable variety of authoring styles and an interesting analogy for us, in that they, nevertheless, generally still evince these three distinct creative phases.
2. The reader might note the parallels in this view to the text planning stage and surface realisation stage of a natural language generator.

Chapter 11

MM4U:

A Framework for
Creating Personalized
Multimedia Content
Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany

ABSTRACT

In the Internet age and with the advent of digital multimedia information, we succumb
to the possibilities that the enchanting multimedia information seems to offer, but end
up almost drowning in the multimedia information: Too much information at the same
time, so much information that is not suitable for the current situation of the user, too
much time needed to find information that is really helpful. The multimedia material
is there, but the issues of how the multimedia content is found, selected, assembled, and
delivered such that it is most suitable for the user's interest and background, the user's
preferred device, network connection, location, and many other settings, is far from
being solved. In this chapter, we are focusing on the aspect of how to assemble and
deliver personalized multimedia content to the users. We present the requirements and
solutions of multimedia content modeling and multimedia content authoring as we find
it today. Looking at the specific demands of creating personalized multimedia content,
we come to the conclusion that a dynamic authoring process is needed in which just
in time the individual multimedia content is created for a specific user or user group.
We designed and implemented an extensible software framework, MM4U (short for "MultiMedia for you"), which provides generic functionality for typical tasks of a
dynamic multimedia content personalization process. With such a framework at hand, an application developer can concentrate on creating personalized content in the
specific domain and at the same time is relieved from the basic task of selecting,
assembling, and delivering personalized multimedia content. We present the design of
the MM4U framework in detail with an emphasis on the personalized multimedia composition and illustrate the framework's usage in the context of our prototypical
applications.

INTRODUCTION

Multimedia content today can be considered as the composition of different media elements, such as images and text, audio, and video, into an interactive multimedia
presentation like a guided tour through our hometown Oldenburg. Features of such a
presentation are typically the temporal arrangement of the media elements in the course
of the presentation, the layout of the presentation, and its interaction features. Personalization of multimedia content means that the multimedia content is targeted at a specific
person and reflects this person's individual context, specific background, interest, and
knowledge, as well as the heterogeneous infrastructure of end devices to which the
content is delivered and on which it is presented. The creation of personalized multimedia
content means that for each intended context a custom presentation needs to be created.
Hence, multimedia content personalization is the shift from "one-size-fits-all" to a very individual and personal "one-to-one" provision of multimedia content to the users. This
means in the end that the multimedia content needs to be prepared for each individual
user. However, if there are many different users that find themselves in very different
contexts, it soon becomes obvious that a manual creation of different content for all the
different user contexts is not feasible, let alone economical (see André & Rist, 1996).
Instead, a dynamic, automated process of selecting and assembling personalized
multimedia content depending on the user context seems to be reasonable.
The creation of multimedia content is typically subsumed under the notion of
multimedia authoring. However, such authoring today is seen as the static creation of
multimedia content. Authoring tools with graphical user interfaces (GUI) allow us to
manually create content that is targeted at a specific user group. If the content created
is at all personalizable, then only within a very limited scope. First research approaches
in the field of dynamic creation of personalized multimedia content are promising;
however, they are often limited to certain aspects of the content personalization to the
individual user. Especially when the content personalization task is more complex, these
systems need to employ additional programming. As we observe that programming is
needed in many cases anyway, we follow this observation to its conclusion and propose MM4U (short for "MultiMedia for you"), a component-based, object-oriented software framework to support the software development process of multimedia content personalization applications. MM4U relieves application developers from general tasks in the context of multimedia content personalization and lets them concentrate on the application domain-specific tasks. The framework's components provide generic functionality
for typical tasks of the multimedia content personalization process. The design of the
framework is based on a comprehensive analysis of the related approaches in the field
of user profile modeling, media data modeling, multimedia composition, and multimedia
presentation formats. We identify the different tasks that arise in the context of creating
personalized multimedia content. The different components of the framework support
these different tasks for creating user-centric multimedia content: They integrate the
generic access to user profiles, media data, and associated meta data, provide support
for personalized multimedia composition and layout, as well as create the context-aware
multimedia presentations. With such a framework, the development of multimedia
applications becomes easier and much more efficient for different users with their
different (semantic) contexts. On the basis of the MM4U framework, we are currently
developing two sample applications: a personalized multimedia sightseeing tour and a
personalized multimedia sports news ticker. The experiences we gain from the development of these applications give us important feedback on the evaluation and continuous
redesign of the framework.
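To give a feel for what such framework support buys an application developer, here is a purely hypothetical sketch of a personalization pipeline; the class and method names are our own invention and do not correspond to MM4U's actual API:

```python
# Hypothetical personalization pipeline (not the MM4U API).
class UserProfile:
    def __init__(self, interests, language, device):
        self.interests, self.language, self.device = interests, language, device

class MediaStore:
    def __init__(self, items):
        self.items = items  # each item: dict with media metadata
    def select(self, profile):
        """Select media elements matching the user's interests and language."""
        return [i for i in self.items
                if i["topic"] in profile.interests and i["lang"] == profile.language]

def compose(media, profile):
    """Assemble selected media into a simple sequential presentation structure."""
    return {"device": profile.device, "sequence": [m["id"] for m in media]}

profile = UserProfile(interests={"churches"}, language="en", device="PDA")
store = MediaStore([
    {"id": "img_lambertikirche", "topic": "churches", "lang": "en"},
    {"id": "vid_schloss", "topic": "palaces", "lang": "en"},
])
print(compose(store.select(profile), profile))
```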
The remainder of this chapter is organized as follows: To review the notion of multimedia content authoring, in "Multimedia Content Authoring Today" we present the requirements of multimedia content modeling and the authoring support we find today. Setting off from this, "Dynamic Authoring of Personalized Content" introduces the reader to the tasks of creating personalized multimedia content and why such content can be created only in a dynamic fashion. In "Related Approaches", we address the related approaches we find in the field before we present the design of our MM4U framework in "The Multimedia Personalization Framework" section. As the personalized creation of multimedia content is a central aspect of the framework, "Creating Personalized Multimedia Content" presents in detail the multimedia personalization features of the framework. "Impact of Personalization to the Development of Multimedia Applications" shows how the framework supports application developers and multimedia authors in their effort to create personalized multimedia content. The implementation and first prototypes are presented in "Implementation and Prototypical Applications" before we come to our summary and conclusion in the final section.

MULTIMEDIA CONTENT
AUTHORING TODAY
In this section, we introduce the reader to current notions and techniques of
multimedia content modeling and multimedia content authoring. An understanding of
requirements and approaches in modeling and authoring of multimedia content is a
helpful prerequisite to our goal, the dynamic creation of multimedia content. For the
modeling of multimedia content we present our notion of multimedia content, documents,
and presentation and describe the central characteristics of typical multimedia document
models in the first subsection. For the creation of multimedia content, we give a short
overview of directions in multimedia content authoring today in the second subsection.

Multimedia Content
Multimedia content today is seen as the result of a composition of different media
elements (media content) in a continuous and interactive multimedia presentation.
Multimedia content builds on the modeling and representation of the different media
elements that form the building bricks of the composition. A multimedia document
represents the composition of continuous and discrete media elements into a logically
coherent multimedia unit. A multimedia document that is composed in advance of its rendering is called preorchestrated, in contrast to compositions that take place just before rendering, which are called live or on-the-fly. A multimedia document is an instantiation of
a multimedia document model that provides the primitives to capture all aspects of a
multimedia document. The power of the multimedia document model determines the
degree of the multimedia functionality that documents following the model can provide.
Representatives of (abstract) multimedia document models in research can be found with
CMIF (Bulterman et al., 1991), Madeus (Jourdan et al., 1998), Amsterdam Hypermedia
Model (Hardman, 1998; Hardman et al., 1994a), and ZYX (Boll & Klas, 2001). A multimedia
document format or multimedia presentation format determines the representation of a
multimedia document for the documents exchange and rendering. Since every multimedia presentation format implicitly or explicitly follows a multimedia document model, it
can also be seen as a proper means to serialize the multimedia documents representation for the purpose of exchange. Multimedia presentation formats can either be
standardized, such as the W3C standard SMIL (Ayars et al., 2001), or proprietary such
as the widespread Shockwave file format (SWF) of Macromedia (Macromedia, 2004). A
multimedia presentation is the rendering of a multimedia document. It comprises the
continuous rendering of the document in the target environment, the (pre)loading of
media data, realizing the temporal course, the temporal synchronization between continuous media streams, the adaptation to different or changing presentation conditions and
the interaction with the user.
Looking at the different models and formats we find, and also the terminology in the
related work, there is not necessarily a clear distinction between multimedia document
models and multimedia presentation formats, and also between multimedia documents
and multimedia presentations. In this chapter, we distinguish the notion of multimedia
document models as the definition of the abstract composition capabilities of the model;
a multimedia document is an instance of this model. The term multimedia content or
content representation is used to abstract from existing formats and models, and
generally addresses the composition of different media elements into a coherent multimedia presentation. Independent of the actual document model or format chosen for the
content, one can say that a multimedia content representation has to realize at least three
central aspects: the temporal, spatial, and interactive characteristics of a multimedia
presentation (Boll et al., 2000). However, as many of todays concrete multimedia
presentation formats can be seen as representing both a document model and an
exchange format for the final rendering of the document, we use these as an illustration
of the central aspects of multimedia documents. We present an overview of these
characteristics in the following listing; for a more detailed discussion on the characteristics of multimedia document models we refer the reader to (Boll et al., 2000; Boll & Klas,
2001).

A temporal model describes the temporal dependencies between media elements of a multimedia document. With the temporal model, the temporal course, such as
the parallel presentation of two videos or the end of a video presentation on a
mouse-click event can be described. One can find four types of temporal models:
point-based temporal models, interval-based temporal models (Little & Ghafoor, 1993; Allen, 1983), enhanced interval-based temporal models that can handle time
intervals of unknown duration (Duda & Keramane, 1995; Hirzalla et al., 1995; Wahl
& Rothermel, 1994), event-based temporal models, and script-based realization of
temporal relations. The multimedia presentation formats we find today realize
different temporal models, for example, SMIL 1.0 (Bugaj et al., 1998) provides an
interval-based temporal model only, while SMIL 2.0 (Ayars et al., 2001) also
supports an event-based model.
For a multimedia document not only the temporal synchronization of these
elements is of interest but also their spatial positioning on the presentation media,
for example, a window, and possibly the spatial relationship to other visual media
elements. The positioning of a visual media element in the multimedia presentation
can be expressed by the use of a spatial model. With it one can, for example, place
one image above a caption or define the overlapping of two visual media. Besides
the arrangement of media elements in the presentation, also the visual layout or
design is defined in the presentation. This can range from a simple setting for
background colors and fonts up to complex visual designs and effects. In general,
three approaches to spatial models can be distinguished: absolute positioning,
directional relations (Papadias et al., 1995; Papadias & Sellis, 1994), and topological relations (Egenhofer & Franzosa, 1991). With absolute positioning we subsume both the placement of a media element at an absolute position with respect
to the origin of the coordinate system and the placement at an absolute position
relative to another media element. The absolute positioning of media elements can
be found, for example, with Flash (Macromedia, 2004) and the Basic Language
Profile of SMIL 2.0, whereas the relative positioning is realized, for example, by
SMIL 2.0 and SVG 1.2 (Andersson et al., 2004b).
A very distinct feature of a multimedia document model is the ability to specify user
interaction in order to let a user choose between different presentation paths.
Multimedia documents without user interaction are not very interesting as the
course of their presentation is exactly known in advance and, hence, could be
recorded as a movie. With interaction models a user can, for example, select or
repeat parts of presentations, speed up a movie presentation, or change the visual
appearance. For the modeling of user interaction, one can identify at least three
basic types of interaction: navigational interactions, design interactions, and
movie interactions. Navigational interaction allows the selection of one out of
many presentation paths and is supported by all the considered multimedia
document models and presentation formats.
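As a small illustration of these three aspects only (not a rendering of any particular standard such as SMIL), a toy document model might capture them as follows; all names are invented for the example:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MediaElement:
    name: str
    start: float                   # temporal model: interval-based start time (seconds)
    duration: float
    x: int = 0                     # spatial model: absolute position in the window
    y: int = 0
    link_to: Optional[str] = None  # interaction model: navigational link target

@dataclass
class MultimediaDocument:
    title: str
    elements: List[MediaElement] = field(default_factory=list)

doc = MultimediaDocument("City tour intro", [
    MediaElement("welcome_video", start=0, duration=20, x=0, y=0),
    MediaElement("caption_text",  start=0, duration=20, x=0, y=240),
    MediaElement("menu_image",    start=20, duration=10, link_to="churches_section"),
])
```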

Looking at existing multimedia document models and presentation formats both in industry and research, one can see that these aspects of multimedia content are
implemented in two general ways: The standardized formats and research models
typically implement these aspects in different variants in a structured (XML) fashion as
can be found with SMIL 2.0, HTML+TIME (Schmitz et al., 1998), SVG 1.2, Madeus, and
ZYX. Proprietary approaches, however, represent or program these aspects in an
adequate internal model such as Macromedia's Shockwave format. Independent of the
actual multimedia document model, support for the creation of these documents is needed: multimedia content authoring. We will look at the approaches we find in the
field of multimedia content authoring in the next section.

Multimedia Authoring
While multimedia content represents the composition of different media elements
into a coherent multimedia presentation, multimedia content authoring is the process
in which this presentation is actually created. This process involves parties from
different fields including media designers, computer scientists, and domain experts:
Experts from the domain provide their knowledge in the field; this knowledge forms the
input for the creation of a storyboard for the intended presentation. Such a storyboard
often forms the basis on which creators and directors plan the implementation of the story
with the respective media and with which writers, photographers, and camerapersons
acquire the digital media content. Media designers edit and process the content for the
targeted presentation. Finally, multimedia authors compose the preprocessed and
prepared material into the final multimedia presentation. Even though we described this
as a sequence of steps, the authoring process typically includes cycles. In addition, the
expertise for some of the different tasks in the process can also be held by one single
person. In this chapter, we are focusing on the part of the multimedia content creation
process in which the prepared material is actually assembled into the final multimedia
presentation.
This part is typically supported by professional multimedia development programs,
so-called authoring tools or authoring software. Such tools allow the composition of
media elements into an interactive multimedia presentation via a graphical user interface.
The authoring tools we find here range from domain expert tools to general purpose
authoring tools.

Domain expert tools hide as much as possible the technical details of content
authoring from the authors and let them concentrate on the actual creation of the
multimedia content. The tools we find here are typically very specialized and
targeted at a very specific domain. An example for such a tool has been developed
in the context of our previous research project Cardio-OP (Klas et al., 1999) in the
domain of cardiac surgery. The content created in this project is an interactive
multimedia book about topics in the specialized domain of cardiac surgery. Within
the project context, an easy-to-use authoring wizard was developed to allow
medical doctors to easily create pages of a multimedia book in cardiac surgery.
The Cardio-OP-Wizard guides the domain experts through the authoring process
by a digital storyboard for a multimedia book on cardiac surgery. The wizard hides
as much technical detail as possible.
On the other end of the spectrum of authoring tools we find highly generalized tools
such as Macromedia Director (Macromedia, 2004). These tools are independent of
the domain of the intended presentation and let the authors create very sophisticated multimedia presentations. However, the authors typically need to have high
expertise in using the tool. Very often programming in an integrated programming
language is needed to achieve special effects or interaction patterns. Consequently, the multimedia authors need programming skills and along with this some
experience in software development and software engineering.

Whereas a multimedia document model has to represent the different aspects of time, space, and interaction, multimedia authoring tools must allow the authors to
actually assemble the multimedia content. However, the authors are normally experts
from a specific domain. Consequently, the only authoring tools that are practicable to
create multimedia content for a specific domain are those that are highly specialized and
easy to use.

DYNAMIC AUTHORING OF PERSONALIZED CONTENT
The authoring process described above so far represents a manual authoring of
multimedia content, often with high effort and cost involved. Typically, the result is a
multimedia presentation targeted at a certain user group in a special technical context.
However, the one-size-fits-all fashion of the multimedia content created does not
necessarily satisfy different users' needs. Different users may have different preferences concerning the content and may also access the content over different networks and on different end
devices. For a wider applicability, the authored multimedia content needs to carry some
alternatives that can be exploited to adapt the presentation to the specific preferences
of the users and their technical settings. Figure 1 shows an illustration of the variation
possibilities that a simple personalized city guide application can possess. The root of
the tree represents the multimedia presentation for the personalized city tour. If this
presentation was intended for both Desktop PC and PDA, this results in two variants of
the presentation. If then some tourists are interested only in churches, museums, or
palaces and would like to receive the content in either English or German, this already
sums up to 12 variants. If then the multimedia content should be available in different
presentation formats, the number of variation possibilities within a personalized city tour
increases again. Even though different variants are not necessarily entirely different and
may have overlapping content, the example is intended to illustrate that the flexibility of
multimedia content to personalize to different user contexts quickly leads to an explosion
of different options. And still the content can only be personalized within the flexibility
range that has been anchored in the content.
From our point of view, an efficient and competitive creation of personalized
multimedia content can only come from a system approach that supports the dynamic
authoring of personalized multimedia content. A dynamic creation of such content allows
for a selection and composition of just those media elements that are targeted at the user's specific interests and preferences. Generally, dynamic authoring comprises the same steps and tasks that occur with static authoring, but with the difference that the creation process is postponed to the time when the targeted user context is known and the presentation is created for this specific context. To be able to efficiently create presentations for (m)any given contexts, a manual authoring of a presentation meeting the user's needs is not an option; instead, a dynamic content creation is needed.
As we look into the process of dynamic authoring of personalized multimedia
content, it is apparent that this process involves different phases and tasks. We identify
the central tasks in this process that need to be supported by a suitable solution for
personalized content creation.

Figure 1. Example of the variation possibilities within a personalized city guide application: the presentation branches by end device (Desktop PC, Pocket PC), by sight category (churches, museums, palaces), and by language (English, German) into 12 variants, and further by presentation format (SMIL, SVG, and HTML for the Desktop PC; SMIL BLP, Mobile SVG, and HTML for the Pocket PC) into 36 variants.

Figure 2 depicts the general process of creating personalized multimedia content. The core of this process is an application we call personalization engine. The input
parameters to this engine can be characterized by three groups: The first group of input parameters consists of the media elements with the associated meta data that constitute the content from which the personalized multimedia presentations are selected and assembled. The second group comprises the user's personal and technical context. The user profile includes information about, for example, the user's current task, the location and environment (like weather and loudness), his or her knowledge, goals, preferences and interests, abilities and disabilities, as well as demographic data. The technical context is described by the type of the user's end device and its hardware and software characteristics, for example, the available memory and media players, as well as possible
network connections and input devices. The third group of input parameters influences
the general structure of the resulting personalized multimedia presentation and subsumes other preferences a user could have for the multimedia presentation.
Within the personalization engine, these input parameters are now used to author the personalized multimedia presentation. First, the personalization engine exploits all available information about the user's context and his or her end device to select, by means of the media meta data, those media elements that are most relevant to the user's interests and preferences and that best meet the characteristics of the end device. In the next step, the selected media elements are assembled and arranged by the personalization engine, again with regard to the user profile information and the characteristics of the end device, into the personalized multimedia content, represented in an internal document model (Scherp & Boll, 2004b).

Figure 2. General process of personalizing multimedia content: media data with associated meta data, the user profile, and the technical environment feed the "personalization engine", which selects, assembles (guided by document structure, rules and constraints, and layout and style), transforms, and presents the content in formats such as SMIL, SVG, or Flash, illustrated in the figure with a sample page about the Horst-Janssen Museum in Oldenburg.

This internal document model
abstracts from the different characteristics of today's multimedia presentation formats and, hence, forms the greatest common denominator of these formats. Even though our abstract model does not reflect the fancy features of some of today's multimedia
presentation formats, it supports the very central multimedia features of modeling time,
space, and interaction. It is designed to be efficiently transformed to the concrete syntax
of the different presentation formats. For the assembly, the personalization engine uses
the parameters for document structure, the layout and style parameters, and other rules
and constraints that describe the structure of the personalized multimedia presentation,
to determine among others the temporal course and spatial layout of the presentation.
The center of Figure 2 sketches this temporal and spatial arrangement of selected media
elements over time in a spatial layout following the document structure and other
preferences. Only then, in the transformation phase, is the multimedia content in the internal document model transformed into a concrete presentation format. Finally, the just generated personalized multimedia presentation is rendered and displayed by the actual
end device.
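
To make these steps more tangible, the following sketch outlines the select-assemble-transform chain in Java. All type and method names used here (PersonalizationEngine, MediaStore, Composer, Generator, and the small record types) are assumptions chosen for illustration and do not reproduce the actual MM4U interfaces.

import java.util.List;
import java.util.Map;

// Hypothetical types standing in for the three input groups and the internal document model.
record MediaElement(String uri, Map<String, String> metaData) {}
record UserProfile(Map<String, String> preferences) {}
record DeviceProfile(int displayWidth, int displayHeight, List<String> playerFormats) {}
record PresentationPrefs(String documentStructure, String layoutStyle) {}
record InternalDocument(Object rootOperator) {}

interface MediaStore { List<MediaElement> select(UserProfile user, DeviceProfile device); }
interface Composer { InternalDocument compose(List<MediaElement> media, PresentationPrefs prefs, DeviceProfile device); }
interface Generator { String transform(InternalDocument document); }  // e.g., serialize to SMIL or SVG markup

final class PersonalizationEngine {
    private final MediaStore store;
    private final Composer composer;
    private final Generator generator;

    PersonalizationEngine(MediaStore store, Composer composer, Generator generator) {
        this.store = store; this.composer = composer; this.generator = generator;
    }

    // Select, assemble, and transform; presenting the result is left to the player on the end device.
    String createPresentation(UserProfile user, DeviceProfile device, PresentationPrefs prefs) {
        List<MediaElement> media = store.select(user, device);              // context-dependent selection
        InternalDocument document = composer.compose(media, prefs, device); // composition in the internal format
        return generator.transform(document);                               // transformation to a concrete format
    }
}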

RELATED APPROACHES
In this section we present the related approaches in the field of personalized
multimedia content creation. We first discuss the creation of personalizable multimedia
content with today's authoring environments before we come to research approaches
that address a dynamic composition of adapted or personalized multimedia content.
Multimedia authoring tools like Macromedia Director (Macromedia, 2004) today
require high expertise from their users and create multimedia presentations that are
targeted only at a specific user or user group. Everything personalizable needs to be programmed or scripted within the tool's programming language. Early work in the field
of creating advanced hypermedia and multimedia documents can be found, for example,
with the Amsterdam Hypermedia Model (Hardman, 1998; Hardman et al., 1994b) and the
authoring system CMIFed (van Rossum, 1993; Hardman et al., 1994a) as well as with the
ZYX (Boll & Klas, 2001) multimedia document model and a domain-specific authoring
wizard (Klas et al., 1999). In the field of standardized models, the declarative description
of multimedia documents with SMIL allows for the specification of adaptive multimedia
presentations by defining presentation alternatives by using the switch element. A
manual authoring of such documents that are adaptable to many different contexts is too
complex; also, existing authoring tools such as the GRiNS editor for SMIL from Oratrix
(Oratrix, 2004) are still tedious to handle. Some SMIL tools provide support for the
switch element to define presentation alternatives; a comfortable interface for editing
the different alternatives for many different contexts, however, is not provided. Consequently, we have been working on the approach in which a multimedia document is
authored for one general context and is then automatically enriched by the different
presentation alternatives needed for the expected user contexts in which the document
is to be viewed (Boll et al., 1999). However, this approach is reasonable only for a limited
number of presentation alternatives and limited presentation complexity in general.
Approaches that dynamically create personalized content are typically found on
the Web, for example, Amazon.com (Amazon, 1996-2004) or MyYahoo (Yahoo!, 2002).
However, these systems remain text-centric and are not concerned with the complex
composition of media data in time and space into real multimedia presentations. On the
pathway to an automatic generation of personalized multimedia presentations, we
primarily find research approaches that address personalized media presentations only:
For example, the home-video editor Hyper-Hitchcock (Girgensohn et al., 2003; Girgensohn
et al., 2001) provides a preprocessing of a video such that users can interactively select
clips to create their personal video summary. Other approaches create summaries of
music or video (Kopf et al., 2004; Agnihotri et al., 2003). However, the systems provide
an intelligent and intuitive access to large sets of (continuous) media rather than a
dynamic creation of individualized content. An approach that addresses personalization
for videos can be found, for example, with IBM's Video Semantic Summarization System
(IBM Corporation, 2004a) which is, however, still concentrating on one single media type.
Towards personalized multimedia we find interesting work in the area of adaptive
hypermedia systems which has been going on for quite some years now (Brusilovsky
1996; Wu et al., 2001; De Bra et al., 1999a, 2000, 2002b; De Carolis et al., 1998, 1999). The
adaptive hypermedia system AHA! (De Bra et al., 1999b, 2002a, 2003) is a prominent
example here which also addresses the authoring aspect (Stash & De Bra, 2003), for
example, in adaptive educational hypermedia applications (Stash et al., 2004). However,
though these and further approaches integrate media elements in their adaptive
hypermedia presentations, synchronized multimedia presentations are not in their focus.
Personalized or adaptive user interfaces allow the navigation and access of
information and services in a customized or personalized fashion. For example, work done
in the area of personalized agents and avatars considers presentation generation
exploiting natural language generation and visual media elements to animate the agents
and avatars (de Rosis et al., 1999). These approaches address the human computer
interface; the general issue of dynamically creating arbitrary personalized multimedia
content that meets the user's information needs is not in their research focus.
A very early approach towards the dynamic creation of multimedia content is the
Coordinated Multimedia Explanation Testbed (COMET), which is based on an expert system and different knowledge databases and uses constraints and plans to actually
generate the multimedia presentations (Elhadad et al., 1991; McKeown et al., 1993).
Another interesting approach to automating the multimedia authoring process has been developed at the DFKI in Germany with the two knowledge-based systems WIP (Knowledge-based Presentation of Information) and PPP (Personalized Plan-based Presenter).
WIP is a knowledge-based presentation system that automatically generates instructions for the maintenance of technical devices by plan generation and constraint solving.
PPP enhances this system by providing a lifelike character to present the multimedia
content and by considering the temporal order in which a user processes a presentation
(André, 1996; André & Rist, 1995, 1996). Also a very interesting research approach
towards the dynamic generation of multimedia presentations is the Cuypers system (van
Ossenbruggen et al., 2000) developed at the CWI. This system employs constraints for the description of the intended multimedia presentation and logic programming for the generation of a multimedia document (CWI, 2004). The multimedia document group at
INRIA in France developed within the Opéra project a generic architecture for the
automated construction of multimedia presentations based on transformation sheets and
constraints (Villard, 2001). This work is continued within the succeeding project Web,
Accessibility, and Multimedia (WAM) with the focus on a negotiation and adaptation
architecture for multimedia services for mobile devices (Lemlouma & Layaïda, 2003,
2004).
However, we find limitations with existing systems when it comes to their expressiveness and their support for flexible personalized content creation. Many approaches for
personalization are targeted at a specific application domain in which they provide a very
specific content personalization task. The existing research solutions typically use a
declarative description like rules, constraints, style sheets, configuration files, and the
like to express the dynamic, personalized multimedia content creation. However, they can
solve only those presentation generation problems that can be covered by such a
declarative approach; whenever a complex and application-specific personalization
generation task is required, the systems find their limit and need additional programming
to solve the problem. Additionally, the approaches we find usually rely on fixed data
models for describing user profiles, structural presentation constraints, technical infrastructure, rhetorical structure, and so forth, and use these data models as an input to their
personalization engine. The latter evaluates the input data, retrieves the most suitable
content, and tries to most intelligently compose the media into a coherent aesthetic
multimedia presentation. A change of the input data models as well as an adaptation of
the presentation generator to more complex presentation generation tasks is difficult if not infeasible. Additionally, for these approaches the border between the declarative
descriptions for describing content personalization constraints and the additional
programming needed is not clear and differs from solution to solution. This leads us to
the development of a software framework that supports the development of personalized
multimedia applications.

MULTIMEDIA PERSONALIZATION FRAMEWORK
Most of the research approaches presented above apply to text-centered information only, are limited with regard to their personalizability, or are targeted at very specific
application domains. As mentioned above, we find that existing research solutions in the
field of multimedia content personalization provide interesting solutions. They typically
use a declarative description like style sheets, transformation rules, presentation
constraints, configuration files, and the like to express the dynamic, personalized
multimedia content creation. However, they can solve only those presentation generation problems that can be covered by such a declarative approach; whenever a complex
and application-specific personalization generation task is required, the systems find
their limit and need additional programming to solve the problem. To provide application
developers with a general, domain independent support for the creation of personalized
multimedia content we pursue a software engineering approach: the MM4U framework.
With this framework, we propose a component-based object-oriented software framework that relieves application developers of general tasks in the context of multimedia
content personalization and lets them concentrate on the application domain-specific
tasks. It supports the dynamic generation of arbitrary personalized multimedia presentations and therewith provides substantial support for the development of personalized
multimedia applications. The framework does not reinvent multimedia content creation
but incorporates existing research in the field and also can be extended by domain and
application-specific solutions. In the following subsection, we identify the general design goals of this framework, derived from an extensive study of related work and our own experiences. In the next subsection, we present the general design of the MM4U framework, and then we give a detailed insight into the framework's layered architecture in the last subsection.

General Design Goals for the MM4U Framework


The overall goal of MM4U is to simplify and to reduce the costs of the development
process of personalized multimedia applications. Therefore, the MM4U framework has
to provide the developers with support for the different tasks of the multimedia
personalization process as shown in Figure 2. These tasks comprise assistance for the
access to media data and associated meta data as well as user profile information and the
technical characteristics of the end device. The framework must also provide for the
selection and composition of media elements into a coherent multimedia presentation.
Finally, the personalized multimedia content must be created for delivery and rendering
on the user's end device.
In regard to these different tasks, we conducted an extensive study of related work:
In the area of user profile modeling we considered among others Composite Capability/
Preference Profile (Klyne et al., 2003), FIPA Device Ontology Specification (Foundation
for Intelligent Physical Agents, 2002), User Agent Profile (Open Mobile Alliance, 2003),
Customer Profile Exchange (Bohrer & Holland, 2004), (Fink et al., 1997), and (Chen & Kotz,
2000). In regard to meta data modeling, we studied different approaches of modeling meta
data and approaches for meta data standards for multimedia, for example, Dublin Core
(Dublin Core Metadata Initiative, 1995-2003) and Dublin Core Extensions for Multimedia
Objects (Hunter, 1999), Resource Description Framework (Beckett & McBride, 2003), and
the MPEG-7 Multimedia content description standard (ISO/IEC JTC 1/SC 29/WG 11, 1999,
2001a-e). For multimedia composition we analyzed the features of multimedia document
models, including SMIL (Ayars et al., 2001), SVG (Andersson et al., 2004b), Macromedia
Flash (Macromedia, 2004), Madeus (Jourdan et al., 1998), and ZYX (Boll & Klas, 2001).
For the presentation of multimedia content, respective multimedia presentation frameworks were considered, including the Java Media Framework (Sun Microsystems, 2004),
MET++ (Ackermann 1996), and PREMO (Duke et al., 1999). Furthermore, other existing
systems and general approaches for creating personalized multimedia content were considered, including the Cuypers engine (van Ossenbruggen et al., 2000) and the
Standard Reference Model for Intelligent Multimedia Presentation Systems (Bordegoni
et al., 1997).
We also derived design requirements for the framework from first prototypes of
personalized multimedia applications we developed in different fields such as a personalized sightseeing tour through Vienna (Boll, 2003), a personalized mobile paper chase
game (Boll et al., 2003), and a personalized multimedia music newsletter.
From the extensive study of related work and the first experiences and requirements
we gained from our prototypical applications, we developed the single layers of the
framework. We also derived three general design goals for MM4U. These design goals are:

•	The framework is to be designed such that it is independent of any special application domain, that is, it can be used to generate arbitrary personalized multimedia content. Therefore, it provides general multimedia composition and personalization functionality and is flexible enough to be adapted and extended concerning the particular requirements of the concrete personalization functionalities a personalized application needs.

•	The access to user profile information and media data with its associated meta data must be independent of the particular solutions for storage, retrieval, and processing of such data. Rather, the framework should provide a unified interface for the access to existing solutions. With distinct interfaces for the access to user profile information and media data with associated meta data, it is the framework's task to use and exploit existing (research) profile and media storage systems for the personalized multimedia content creation.

•	The third design goal for the framework is what we call presentation independence. The framework is to be independent of, for example, the technical characteristics of the end devices, their network connection, and the different multimedia output formats that are available. This means that the framework can be used to generate equivalent multimedia content for the different users and output channels and their individual characteristics. This multichannel usage implies that the personalized multimedia content generation task is to be partitioned into a composition of the multimedia content in an internal representation format and its later transformation into arbitrary (preferably standardized) presentation formats that can be rendered and displayed by end devices.

These general design goals have a crucial impact on the structure of the multimedia
personalization framework, which we present in the following section.

General Design of the MM4U Framework


A software framework like MM4U is a semifinished software architecture, providing
a software system as a generic application for a specific domain (Pree, 1995). The MM4U
framework comprises components, which are bound together by their interaction
(Szyperski et al., 2002), and realizes generic support for personalized multimedia applications. Each component is realized as an object-oriented framework and consists of a
set of abstract and concrete classes. Depending on the usage of a framework, the so-called white-box and black-box frameworks can be distinguished (respectively,
white-box and gray-box reuse). A framework is used as a black-box if the concrete
application that uses the framework adapts its functionality by different compositions
of the frameworks classes. In this case the concrete application uses only the built-in
functionality of the framework, that is, those modules with which the framework is already
equipped. In contrast, the functionality of a white-box framework is refined or extended
by a concrete application, by adding additional modules through inheritance of (abstract) classes. Between these two extremes, arbitrary shades of gray are possible
(Szyperski et al., 2002). The design of the MM4U framework lies somewhere in the middle
between pure black-box and pure white-box. Being a domain independent framework,
MM4U needs to be configured and extended to meet the specific requirements of a
concrete personalized multimedia application. The framework provides many modules, for example, to access media data and associated meta data as well as user profile information, and to generate the personalized multimedia content in a standardized output format; these modules can be reused for different application areas (black-box usage). For the very application-specific personalization functionality, the framework can be extended correspondingly (white-box usage).
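
The difference between the two usage styles can be pictured as follows. The class names are assumptions for illustration only: black-box reuse merely combines classes the framework already ships with, whereas white-box reuse refines the framework through inheritance.

import java.util.ArrayList;
import java.util.List;

// Assumed framework base class and two built-in subclasses (names are illustrative only).
abstract class MultimediaOperator {
    protected final List<MultimediaOperator> children = new ArrayList<>();
    public MultimediaOperator add(MultimediaOperator child) { children.add(child); return this; }
}
class SequentialOperator extends MultimediaOperator {}   // shipped with the framework
class ParallelOperator extends MultimediaOperator {}     // shipped with the framework

// Black-box usage: the application only composes the built-in classes.
class BlackBoxUsage {
    MultimediaOperator buildSlideshow(List<MultimediaOperator> slides, MultimediaOperator audio) {
        SequentialOperator sequence = new SequentialOperator();
        slides.forEach(sequence::add);
        return new ParallelOperator().add(audio).add(sequence);
    }
}

// White-box usage: the application extends the framework through inheritance,
// adding composition behavior the framework does not provide out of the box.
class NewsTickerOperator extends MultimediaOperator {
    public MultimediaOperator addHeadline(MultimediaOperator headline) {
        return add(headline); // plus application-specific logic, omitted in this sketch
    }
}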
The usage of the MM4U framework by a concrete personalized multimedia application is illustrated schematically in Figure 3. The personalized multimedia application
uses the functionality of the framework to create personalized multimedia content, and
integrates it with whatever application-dependent functionality is needed, either by using
the already built-in functionality of the framework or by extending it for the specific
requirements of the concrete personalized multimedia application.
With respect to the multimedia software development process the MM4U framework assists the computer scientists during the design and implementation phase. It
alleviates the time-consuming multimedia content assembly task and lets the computer
scientists concentrate on the development of the actual application. The MM4U
framework provides functionality for the single tasks of the personalization engine as
described in the section on Dynamic Authoring of Personalized Content. It offers the
computer scientists support for integrating and accessing user profile information and
media data, selecting media elements according to the user's profile information, composing these elements into coherent multimedia content, and generating this content in standardized multimedia document formats to be presented on the user's end device.
When designing a framework, the challenge is to identify the points where the
framework should be flexible, that is, to identify the semantic aspects of the framework's application domain that have to be kept flexible. These points are the so-called hot spots
and represent points or sockets of the intended flexibility of a framework (Pree, 1995).
Each hot spot constitutes a well-defined interface where proper modules can be plugged
in. When designing the MM4U framework we identified hot spots where adequate modules for supporting the personalization task can be plugged in that provide the required functionality.

Figure 3. Usage of the MM4U framework by a personalized multimedia application: the application adds its application-dependent functionality on top of the framework's support for the generation of the multimedia presentation, personalized multimedia composition, access to media objects and meta data, and access to user profile information.
As depicted in Figure 3, the MM4U framework provides four types of such hot
spots, where different types of modules can be plugged in. Each hot spot represents a
particular task of the personalization process. The hot spots can be realized by plugging
in a module that implements the hot spot's functionality for a concrete personalized multimedia application. These modules can be either application-dependent or application-independent. For example, the access to media data and associated meta data is not necessarily application-dependent, whereas the composition of personalized multimedia
content can be heavily dependent on the concrete application.
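
Pictured in code, the four hot spots correspond to four plug-in interfaces held by the framework, for each of which a concrete application supplies a module. The interfaces and signatures below are purely illustrative assumptions, not the actual MM4U hot spot definitions.

import java.util.List;

// One assumed plug-in interface per hot spot of Figure 3; the real MM4U interfaces may differ.
interface UserProfileAccess { Object loadProfile(String userId); }
interface MediaDataAccess { List<Object> queryMedia(Object contextObject); }
interface CompositionModule { Object compose(List<Object> media, Object profile); }
interface PresentationGeneration { String generate(Object internalDocument); }

// The framework holds exactly one module per hot spot.
final class Mm4uHotSpots {
    UserProfileAccess profileAccess;       // often application-independent
    MediaDataAccess mediaAccess;           // often application-independent
    CompositionModule composition;         // frequently application-specific
    PresentationGeneration generation;     // depends only on the targeted output format

    void plugIn(UserProfileAccess p, MediaDataAccess m, CompositionModule c, PresentationGeneration g) {
        profileAccess = p; mediaAccess = m; composition = c; generation = g;
    }
}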
After the general design of the framework, we take a closer look at the concrete
architecture of MM4U and its components in the next section.

Design of the Framework Layers


For supporting the different tasks of the multimedia personalization process, which
are the access to user profile information and media data, selection and composition of
media elements into a coherent presentation, and rendering and display of the multimedia presentation on the end device, a layered architecture seems to be best suited for MM4U.
The layered design of the framework is illustrated in Figure 4. Each layer provides modular
support for the different tasks of the multimedia personalization process. The access to
user profile information and media data is realized by the layers (1) and (2), followed by
the two layers (3) and (4) in the middle for composition of the multimedia presentation
in an internal object-oriented representation and its later transformation into a concrete
presentation output format. Finally, the top layer (5) realizes the rendering and display
of the multimedia presentation on the end device.
To be most flexible for the different requirements of the concrete personalized
multimedia applications, the framework's layers allow extending the functionality of
MM4U by embedding additional modules as indicated by the empty boxes with dots. In
the following descriptions, the features of the framework are described along its different
layers. We start from the bottom of the architecture and end with the top layer.
Figure 4. Overview of the multimedia personalization framework MM4U: layer (1) comprises the User Profile Connectors (e.g., URI and CC/PP profile storage connectors) and the Media Data Connectors (e.g., URI and IR media system connectors), layer (2) the User Profile Accessor and the Media Pool Accessor, layer (3) the Multimedia Composition operators (e.g., Sequential, Parallel, Slideshow, Citytour), layer (4) the Presentation Format Generators (SMIL 2.0, SMIL 2.0 BLP, SVG 1.2, Mobile SVG), and layer (5) the Multimedia Presentation.

(1) Connectors: The User Profile Connectors and the Media Data Connectors bring the
user profile data and media data into the framework. They integrate existing
systems for user profile stores, media storage, and retrieval solutions. As there are
many different systems and formats available for user profile information, the User
Profile Connectors abstract from the actual access to and retrieval of user profile
information and provide a unified interface to the profile information. With this
component, the different formats and structures of user profile models can be made
accessible via a unified interface. For example, a flexible URIProfileConnector we
developed for our demonstrator applications gains access to user profiles over the
Internet. These user profiles are described as hierarchically ordered key-value pairs.
This is a quite simple model but already powerful enough to allow effective pattern-matching queries on the user profiles (Chen & Kotz, 2000). However, as shown in Figure 4, a User Profile Connector for the access to, for example, a Composite Capability/Preference Profile (CC/PP) server could also be plugged into the framework.
On the same level, the Media Data Connectors abstract from the access to media
elements in different media storage and retrieval solutions that are available today
with a unified interface. The different systems for storage and content-based
retrieval of media data are interfaced by this component. For example, the URIMediaConnector, which we developed for our demonstrator applications, provides flexible access to media objects and their associated meta data from the Internet via the http or ftp protocols. The meta data is stored in a single index file that not only describes the technical characteristics of the media elements and contains the location where to find them on the Internet, but also comprises additional information about them, for example, a short description of what is shown in a picture or keywords for which one can search. By analogy with the access to user profile information, another Media Data Connector plugged into the framework could provide access to other media and meta data sources, for example, an image retrieval (IR) system like IBM's QBIC (IBM Corporation, 2004b). (A sketch of such a connector and accessor pair is given after this list.)
The Media Data Connector supports the query of media elements by the client
application (client-pull) as well as the automatic notification of the personalized
application when a new media object arises in the media database (server-push).
The latter is required, for example, by the personalized multimedia sports news
ticker (see the section about Sports4U) which is based on a multimedia event space
(Boll & Westermann, 2003).
(2) Accessors: The User Profile Accessor and the Media Pool Accessor provide the
internal data model of the user profiles and media data information within the
system. Via this layer the user profile information and media data needed for the
desired content personalization are accessible and processable for the application.
The Connectors and Accessors are designed such that they are not reinventing
existing systems for user modeling or multimedia content management. Rather, they provide a seamless integration of these systems by distinct interfaces and
comprehensive data models. In addition, when a personalized multimedia application uses more than one user profile database or media database, the Accessor layer
encapsulates the resources so that the access to them is transparent to the client
application.

While the following layers (3) to (5) each constitute a single component within the MM4U framework, the Accessor layer and the Connectors layer do not. Instead, the left side and the right side of the layers (1) and (2), i.e., the User Profile Accessor and User Profile Connectors as well as the Media Pool Accessor and Media Data Connectors, each form one component in MM4U.
(3) Multimedia Composition: The Multimedia Composition component comprises abstract operators in compliance with the composition capabilities of multimedia
composition models like SMIL, Madeus, and ZYX, which provide complex multimedia composition functionality. It employs the data from the User Profile Accessor
and the Media Pool Accessor for the multimedia composition task. The Multimedia
Composition component is designed such that additional, possibly more complex or application-specific composition operators can be developed and seamlessly plugged into the framework. The result of the multimedia composition is an internal object-oriented representation of the personalized multimedia content, independent of the different presentation formats.
(4) Presentation Format Generators: The Presentation Format Generators work on the
internal object-oriented data model provided by the Multimedia Composition
component and convert it into a standardized presentation format that can be
displayed by the corresponding multimedia player on the client device. In contrast
to the multimedia composition operators, the Presentation Format Generators are
completely independent of the concrete application domain and only rely on the
targeted output format. In MM4U, we have already developed Presentation Format
Generators for SMIL 2.0, the Basic Language Profile (BLP) of SMIL 2.0 for mobile
devices (Ayars et al., 2001), SVG 1.2, Mobile SVG 1.2 (Andersson et al., 2004a)
comprising SVG Tiny for multimedia-ready mobile phones and SVG Basic for pocket
computers like Personal Digital Assistants (PDA) and Handheld Computers (HHC),
and HTML (Raggett et al., 1998). We are currently working on Presentation Format
Generators for Macromedia Flash (Macromedia, 2004) and other multimedia document model formats including HTML+TIME, the 3GPP SMIL Language Profile (3rd
Generation Partnership Project, 2003b), which is a subset of SMIL used for scene
description within the Multimedia Messaging Service (MMS) interchange format
(3rd Generation Partnership Project, 2003a), and XMT-Omega, a high-level abstraction of MPEG-4 based on SMIL (Kim et al., 2000).
(5) Multimedia Presentation: The Multimedia Presentation component on top of the
framework realizes the interface for applications to actually play the presentation
of different multimedia presentation formats. The goal here is to integrate existing
presentation components of the common multimedia presentation formats like
SMIL, SVG, or HTML+TIME which the underlying Presentation Format Generator
produces. So the developers benefit from the fact that only players for standardized multimedia formats need to be installed on the user's end device and that they need not spend any time and resources on developing their own rendering and display engine for their personalized multimedia application.
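
The division of labour between a Connector and its Accessor can be sketched as follows. The interface and method names are assumptions for illustration; the actual MM4U signatures may differ.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Assumed minimal data types.
record ContextObject(Map<String, String> entries) {}
record MediaElement(String uri, Map<String, String> metaData) {}

// Connectors layer: one implementation per concrete media store or retrieval system.
interface MediaDataConnector {
    List<MediaElement> query(ContextObject context);     // client-pull
    void subscribe(Consumer<MediaElement> listener);      // server-push notification of new media objects
}

// Accessor layer: offers the framework-internal data model and hides how many stores are connected.
final class MediaPoolAccessor {
    private final List<MediaDataConnector> connectors = new ArrayList<>();

    void plugIn(MediaDataConnector connector) { connectors.add(connector); }

    List<MediaElement> select(ContextObject context) {
        List<MediaElement> result = new ArrayList<>();
        for (MediaDataConnector connector : connectors) {
            result.addAll(connector.query(context));       // access to all connected media sources is transparent
        }
        return result;
    }
}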

The layered architecture of MM4U permits easy adaptation to the particular requirements that can occur in the development of personalized multimedia applications. So special user profile connectors as well as media database connectors can be embedded into the Connectors layer of the MM4U framework to integrate the most diverse and individual solutions for storage, retrieval, and gathering of user profile information and media data. With the ability to extend the Multimedia Composition layer by complex and sophisticated composition operators, arbitrary personalization functionality can be added to the framework. The Presentation Format Generator component allows integrating any output format into the framework to support the most diverse multimedia players that are available for the different end devices.
The personalized selection and composition of media elements and operators into
a coherent multimedia presentation is the central task of the multimedia content creation
process which we present in more detail in the following section.

CREATING PERSONALIZED MULTIMEDIA CONTENT
The MM4U framework provides the general functionality for the dynamic composition of media elements and composition operators into a coherent personalized
multimedia presentation. Having presented the framework layers in the previous section,
we now look in more detail at how the layers contribute to the different tasks in the general
personalization process as shown in Figure 2. The Media Data Accessor layer provides
the personalized selection of media elements by their associated meta data and is
described in the next subsection. The Multimedia Composition layer supports the
composition of media elements in time and space in the internal multimedia representation format in three different manners, which are presented in detail in the next three
subsections, and the final subsection describes the last step, the transformation of the multimedia content in the internal document model into an output format that is actually
delivered to and rendered by the client devices. This is supported by the Presentation
Format Generators layer.

Personalized Multimedia Content Selection


For creating personalized multimedia content, first those media elements have to be
selected from the media databases that are most relevant to the user's request. This
personalized media selection is realized by the Media Data Accessor and Media Data
Connector component of the framework. For the actual personalized selection of media
elements, a context object is created within the Multimedia Composition layer carrying
the user profile information, technical characteristics of the end device, and further
application-specific information. With this context object, the unified interface of the
Media Data Accessor for querying media elements is called. The context object is handed
over to the concrete Media Data Connector of the connected media database. Within the
Media Data Connector, the context object is mapped to the meta data associated with the
media elements in the database, and those media elements are determined that best match the request, that is, the given context object. It is important to note that the Media
Data Accessor and Media Data Connector layers integrate and embrace existing multimedia information systems and modern content-based multimedia retrieval solutions. This
means that the retrieval of the best match can only be left to the underlying storage
and management systems. The framework can only provide comprehensive and lean interfaces to these systems. These systems can be our own URIMediaServer accessed by the
URIMediaConnector but also other multimedia databases or (multi)media retrieval
solutions. The result set of the query is handed back by the Accessor to the composition
layer.
For example, the context object for our mobile tourist guide application carries
information about user interests and preferences with respect to the sights of the city,
the display size of the end device, and the location for the tourist guide. The Media Data
Connector, realized in this case by our URIMediaConnector, processes this context
object and returns images and videos from those sights in Oldenburg that both match
the user's interests and preferences as well as the limited display size of the mobile device.
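
Expressed in code, the selection step of this example could be triggered roughly as follows. The key-value layout of the context object and all names are assumptions for illustration and mirror the connector/accessor sketch given earlier.

import java.util.List;
import java.util.Map;

// Assumed minimal types, mirroring the earlier connector/accessor sketch.
record ContextObject(Map<String, String> entries) {}
record MediaElement(String uri, Map<String, String> metaData) {}
interface MediaPoolAccessor { List<MediaElement> select(ContextObject context); }

final class TouristGuideSelection {
    List<MediaElement> selectSightMedia(MediaPoolAccessor accessor) {
        // Context object carrying the user's interests, the device characteristics, and the location.
        ContextObject context = new ContextObject(Map.of(
                "interests", "churches,museums",
                "display.width", "240",
                "display.height", "320",
                "location", "Oldenburg"));
        // The accessor hands the context object to the connected Media Data Connector,
        // which maps it onto the media meta data and returns the best-matching media elements.
        return accessor.select(context);
    }
}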
Based on the personalized selection of media elements, the Multimedia Composition layer provides the assembly of these media elements in three different manners, namely the basic, complex, and sophisticated composition of multimedia content, which are described in the following sections.

Basic Composition Functionality


With the basic composition functionality the MM4U framework provides the basic
bricks for composing multimedia content. It forms the basis for assembling the selected
media elements into personalized multimedia documents and provides the means for
realizing the central aspects of multimedia document models, that is, the temporal model,
the spatial layout, and the interaction possibilities of the multimedia presentation. The
temporal model of the multimedia presentation is determined by the temporal relationships between the presentation's media elements formed by the composition operators.
The spatial layout expresses the arrangement and style of the visual media elements in
the multimedia presentation. Finally, with the interaction model the user interaction of
the multimedia presentation is determined, in order to let the user choose between
different paths of a presentation. For the temporal model, we selected an interval-based
approach as found in Duda & Keramane (1995). The spatial layout is realized by a
hierarchical model for media positioning (Boll & Klas, 2001). For interaction with the user, navigational and decision interaction are supported, as can be found with SMIL (Ayars
et al., 2001) and MHEG-5 (Echiffre et al., 1998; International Organisation for Standardization, 1996).
A basic composition operator or basic operator can be regarded as an atomic unit
for multimedia composition, which cannot be further broken down. Basic operators are
quite simple but applicable for any application area and therefore most flexible. Basic
temporal operators realize the temporal model, and basic interaction operators realize the
interaction possibilities of the multimedia presentation, as specified above. The two
basic temporal operators Sequential and Parallel, for example, can be used to present media elements one after the other in a sequence or to present media elements in parallel at the same time. With basic temporal operators and media elements, the temporal course of the presentation can be determined, for example, as a slideshow as depicted in Figure 5.
The operators are represented by white rectangles and the media elements by gray ones.
The relation between the media elements and the basic operators is shown by the edges
beginning with a filled circle at an operator and ending with a filled rhombus (a diamond) at a media element or another operator. The semantics of the slideshow shown
in Figure 5 are that it starts with the presentation of the root element, which is the Parallel
operator. The semantics of the Parallel operator are that it shows the operators and media
elements that are attached to it at the same time. This means that the audio file starts to
play while simultaneously the Sequential operator is presented. The semantics of the
Sequential operator are to show the attached media elements one after another, so while
the audio file is played in the background, the four slides are presented in sequence.
Besides the basic composition operators, the so-called projectors are part of the
Multimedia Composition layer. Projectors can be attached to operators and media
elements to define, for example, the visual and acoustical layout of the multimedia
presentation. Figure 6 shows the slideshow example from above with projectors attached.
The spatial position as well as the width and height of the single slide media elements
are determined by the corresponding SpatialProjectors. The volume, treble, bass, and
balance of the audio medium are determined by the attached AcousticProjector.
Figure 5. Slideshow as an example of assembled multimedia content

Figure 6. Adding layout to the slideshow example
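
Expressed in code, the slideshow of Figures 5 and 6 could be assembled roughly as follows. The operator and projector classes are assumed, minimal stand-ins and not the actual MM4U classes.

import java.util.ArrayList;
import java.util.List;

// Assumed minimal operator and projector classes mirroring the figures.
abstract class Operator {
    final List<Operator> children = new ArrayList<>();
    final List<Object> projectors = new ArrayList<>();
    Operator add(Operator child) { children.add(child); return this; }
    Operator attach(Object projector) { projectors.add(projector); return this; }
}
class Parallel extends Operator {}
class Sequential extends Operator {}
class Media extends Operator { final String uri; Media(String uri) { this.uri = uri; } }
record SpatialProjector(int x, int y, int width, int height) {}
record AcousticProjector(int volume, int treble, int bass, int balance) {}

class SlideshowAssembly {
    Operator build() {
        Sequential slides = new Sequential();                            // slides shown one after another
        for (int i = 1; i <= 4; i++) {
            slides.add(new Media("slide" + i + ".jpg")
                    .attach(new SpatialProjector(0, 0, 320, 240)));      // spatial layout per slide
        }
        Operator audio = new Media("background.mp3")
                .attach(new AcousticProjector(80, 0, 0, 0));             // volume, treble, bass, balance
        return new Parallel().add(audio).add(slides);                    // audio plays while the slides run
    }
}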

Besides temporal operators, the Multimedia Composition component offers basic operators for specifying the interaction possibilities of the multimedia presentation.
Interaction can be added, for example, by using the basic operator InteractiveLink. It
defines a link represented by a single media element or a fragment of a multimedia
presentation that is clickable by the user and a target presentation the user receives if
he or she clicks on the link.
The description above presents some examples of the basic composition functionality the MM4U framework offers. The framework comprises customary composition
operators for creating multimedia content as provided by modern multimedia presentation formats like SMIL and SVG. Even though the basic composition functionality does
not reflect the fancy features of some of today's multimedia presentation formats, it
supports the very central multimedia features of modeling time, space, and interaction.
This allows the transformation of the internal document model into many different
multimedia presentation formats for different end devices.
With the basic multimedia composition operators the framework offers, arbitrary
multimedia presentations can be assembled. However, so far the MM4U framework
provides just basic multimedia composition functionality. In the same way that one
would use an authoring tool to create SMIL presentations, for example, the GRiNS editor
(Oratrix, 2004), one can also use a corresponding authoring tool for the basic composition
operators the MM4U framework offers to create multimedia content. For reasons of
reusing parts of the created multimedia presentations, for example, a menu bar or a presentation's layout, and for convenience, there is a need for more complex and
application-specific composition operators that provide a more convenient support for
creating the multimedia content.

Complex Composition Functionality


For creating presentations that are more complex, the Multimedia Composition layer
provides the ability to abstract from basic to complex operators. A complex composition
operator encapsulates the composition functionality of an arbitrary number of basic
operators and projectors and provides the developers with a more complex and application-specific building block for creating the multimedia content. Complex composition
operators are composed of basic and other complex operators. As complex composition
operators not only embed basic but also other complex operators, they provide for a reuse
of composition operators. In contrast to the basic operators, the complex composition
operators can be dismantled into their individual parts. Figure 7 depicts a complex
composition operator for our slideshow example. It encapsulates the media elements,
operators, and projectors of the slideshow (the latter are omitted in the diagram to reduce
complexity). The complex operator Slideshow, indicated by a small c symbol in the upper
right corner, represents an encapsulation of the former slideshow presentation in a complex object and itself forms a building block for more complex multimedia composition.
Complex operators, as described above, define fixed encapsulated presentations.
Their temporal flow, spatial layout, and the used media elements cannot be changed
subsequently. However, a complex composition operator does not necessarily need to
specify all media elements, operators, and projectors of the respective multimedia
document tree. Instead, to be more flexible, some parts can be intentionally left open.
These parts constitute the parameters of a complex composition operator and have to be
filled in for concrete usage of these operators. Such parameterized complex composition
operators are one means to define multimedia composition templates within the MM4U
framework. However, only prestructured multimedia content can be created with these
templates, since the complex composition operators can only encapsulate presentations
of a fixed structure.

Figure 7. The slideshow example as a complex composition operator

Figure 8 shows the slideshow example as a parameterized complex composition operator. In this case, the complex operator Slideshow comprises the two basic operators Parallel and Sequential. The Slideshow's parameters are the place holders for the single slides and have to be instantiated when the operator is used within the multimedia composition. The slideshow's audio file is already preselected. In addition, the parameters of a complex composition operator can be typed, that is, they expect a special type of operator or media element. The Slideshow operator would expect visual media elements for the parameters Slide 1 to Slide 4. To indicate the complex operator's parameters, they are visualized by rectangles with dotted lines. The preselected audio file is already encapsulated in the complex operator as illustrated in Figure 7.

Figure 8. The slideshow example as a parameterized complex composition operator
In the same way that projectors are attached to basic operators in the section on basic composition functionality, they can also be attached to complex operators. The SpatialProjector attached to the Slideshow operator shown in Figure 8 determines that the slideshow's position within a multimedia presentation is the position x = 100 pixels and y = 50 pixels in
relation to the position of its parent node.
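
A parameterized complex operator like the one in Figure 8 could be sketched as follows, under assumed class names: the audio file is preselected and encapsulated, the four slides are typed parameters filled in at usage time, and the attached SpatialProjector positions the whole slideshow relative to its parent.

import java.util.List;

// Assumed minimal types; not the actual MM4U classes.
interface Operator {}
record Media(String uri) implements Operator {}
record VisualMedia(String uri) implements Operator {}                   // typed parameter: visual media only
record Sequential(List<Operator> children) implements Operator {}
record Parallel(List<Operator> children) implements Operator {}
record SpatialProjector(int x, int y) {}
record Positioned(Operator content, SpatialProjector position) implements Operator {}  // assumed way to attach a projector

// Parameterized complex operator: encapsulates the fixed structure of the slideshow.
final class SlideshowOperator {
    private final Media audio = new Media("background.mp3");            // preselected, already encapsulated

    Operator instantiate(VisualMedia slide1, VisualMedia slide2,
                         VisualMedia slide3, VisualMedia slide4,
                         SpatialProjector position) {
        Sequential slides = new Sequential(List.<Operator>of(slide1, slide2, slide3, slide4));
        Parallel slideshow = new Parallel(List.<Operator>of(audio, slides));
        return new Positioned(slideshow, position);                     // e.g., x = 100, y = 50 relative to the parent
    }
}

// Usage, for example:
// new SlideshowOperator().instantiate(new VisualMedia("a.jpg"), new VisualMedia("b.jpg"),
//         new VisualMedia("c.jpg"), new VisualMedia("d.jpg"), new SpatialProjector(100, 50));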
With basic and complex composition operators one can build multimedia composition functionality that is equivalent to the composition functionality of advanced
multimedia document models like Madeus (Jourdan et al., 1998) and ZYX (Boll & Klas,
2001). Though complex composition operators can have an arbitrary number of parameters and can be configured individually each time they are used, the internal structure
of complex operators is still static. Once a complex operator is defined, the number of
parameters and their type are fixed and cannot be changed. Using a complex composition operator can be regarded as filling in fixed composition templates with suitable media
elements. Personalization can only take place in selecting those media elements that best fit the user profile information. For the dynamic creation of personalized multimedia content, even more sophisticated composition functionality is needed that allows the
composition operators to change the structure of the generated multimedia content at
runtime. To realize such sophisticated composition functionality, additional composition logic needs to be included into the composition operators, which cannot be
expressed anymore even by the mentioned advanced document models we find in the
field.

Sophisticated Composition Functionality


With basic and complex composition functionality, we already provide the dynamic
composition of prestructured multimedia content by parameterized multimedia composition templates. However, such templates are only flexible concerning the selection of
the concrete composition parameters. To achieve an even more flexible dynamic content
composition, the framework provides sophisticated composition operators, which allow
determining the document structure and layout during creation time by additional
composition logic. Multimedia composition templates defined by using such sophisticated composition operators are no longer limited to creating prestructured multimedia
content only, but determine the document structure and layout of the multimedia content on the fly, depending on the user profile information, the characteristics of the used end device, and any other additional information. The latter can be, for example, a
database containing sightseeing information. Such sophisticated composition operators exploit the basic and complex composition operators the MM4U framework offers
but allow more flexible, possibly application-specific multimedia composition and personalization functionality with their additional composition logic. This composition
logic can be realized by using document structures, templates, constraints and rules, or
by plain programming. Independent of how the sophisticated multimedia content
composition functionality is actually realized, the result of this composition process is
always a multimedia document tree that consists of basic and complex operators,
projectors, as well as media elements. In our graphical notation, sophisticated composition operators are represented in the same way as complex operators, but are labeled
with a small s symbol in the upper right corner.
Figure 9 shows an example of a parameterized sophisticated composition operator, the CityMap. This operator generates a multimedia presentation containing a city map image together with a set of available sightseeing spots on it. The parameters of this sophisticated operator are the city map image, which can be of arbitrary size, and a spot image used for presenting the sights on the map. Furthermore, the CityMap operator reads out the positions of the sights on a reference map (indicated by the table on the right) and automatically recalculates the positions depending on the size of the actual city map image. Which spots are selected and actually presented on the city map depends on the user profile, in particular the types of sights he or she is interested in and the categories a sight belongs to. In addition, the size of the city map image is selected to best fit the display of the end device. The CityMap operator is used within our personalized city guide prototype presented in the section on Sightseeing4U and serves there for desktop PCs as well as mobile devices.

Figure 9. Insight into the sophisticated composition operator CityMap

The multimedia document tree generated by the CityMap operator is shown in the bottom part of Figure 9. Its root element is the Parallel operator. Attached to
it are the image of the city map and a set of InteractiveLink operators. Each InteractiveLink
represents a spot on the city map, instantiated by the spot image. The user can click on
the spots to receive multimedia presentations with further information about the sights.
The positions of the spot images on the city map are determined by the SpatialProjectors.
The personalized multimedia presentations about the sights are represented by the
sophisticated operators Target 1 to Target N.
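As an illustration of the kind of composition logic encapsulated in such an operator, the following sketch builds a document tree of this shape from a user profile and a reference map. All class and method names (Parallel, InteractiveLink, SpatialProjector, Sight, UserProfile, and so on) are hypothetical and only mirror the description above; they are not the actual CityMap implementation.

    import java.util.List;

    // Hypothetical sketch of CityMap-like composition logic; not the actual MM4U code.
    public class CityMapSketch {

        private static final int REFERENCE_MAP_WIDTH = 1000;  // assumed width of the reference map

        public Parallel compose(Image cityMap, Image spotImage,
                                List<Sight> referenceMap, UserProfile profile) {
            Parallel root = new Parallel();
            root.add(cityMap);                                 // the city map image is the first child

            double scale = cityMap.getWidth() / (double) REFERENCE_MAP_WIDTH;

            for (Sight sight : referenceMap) {
                // select only those sights that match the user's interest categories
                if (!profile.isInterestedIn(sight.getCategory())) {
                    continue;
                }
                // recalculate the spot position for the size of the actual city map image
                int x = (int) Math.round(sight.getRefX() * scale);
                int y = (int) Math.round(sight.getRefY() * scale);

                // each spot is an interactive link to a further presentation (Target i)
                InteractiveLink spot = new InteractiveLink(spotImage, sight.getTargetPresentation());
                spot.attachProjector(new SpatialProjector(x, y));
                root.add(spot);
            }
            return root;  // a document tree of operators, projectors, and media elements
        }
    }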
The CityMap operator is one example of extending the personalization functionality of the MM4U framework by a sophisticated, application-specific multimedia composition operator, here in the area of (mobile) tourism applications. This operator, for example, was developed by programming the required dynamic multimedia composition functionality. However, the realization of the internal composition logic of sophisticated operators is independent of the technology and programming language used. The same composition logic could also be realized with a different technology, for example, a constraint-based approach. Though the actual realization of the personalized multimedia composition functionality would be different, the multimedia document tree generated by such a constraint- or rule-based sophisticated operator would be the same as depicted in Figure 9.

Sophisticated composition operators make it possible to embrace very different solutions for realizing personalized multimedia composition functionality. This can be plain programming, as in the example of the CityMap operator; document structures and templates that are dynamically selected according to the user profile information and filled with media elements relevant to the user; or systems describing the multimedia content generation by constraints and rules.
The core MM4U framework might not offer every kind of personalized multimedia composition functionality one might require, since the personalization functionality always depends on the actual application to be developed and thus can be very specific. Instead, the framework provides the basis for developing sophisticated personalized multimedia composition operators, such that every application can integrate its own personalization functionality into the framework. So, every sophisticated composition operator can be seen as a small application in itself that carries out a particular multimedia personalization task. This small application can be reused within others and thus extends the functionality of the framework. The Multimedia Composition component allows a seamless plug-in of arbitrary sophisticated composition operators into the MM4U framework. This enables even the most complex personalized multimedia composition task to simply be plugged into the system and used by a concrete personalized multimedia application.
With the sophisticated composition operators, the MM4U framework provides its
most powerful and flexible functionality to generate arbitrary personalized multimedia
content. However, this multimedia content is still represented in an internal document
model and has to be transformed into a presentation format that can be rendered and
displayed by multimedia players on the end devices.

From Multimedia Composition to Presentation


In the last step of the personalization process, the personalized multimedia content represented in our internal document model is transformed by the Presentation Format Generators component into one of the supported standardized multimedia presentation formats, which can be rendered and displayed on the client device. The output format of the multimedia presentation is selected according to the user's preferences and the capabilities of the end device, that is, the available multimedia players and the multimedia presentation formats they support.
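A simple way to picture this selection step is a lookup over the formats supported by the players available on the device, intersected with the user's preference order. The sketch below is only an illustration under that assumption; the types and method names are hypothetical, not the actual generator component.

    // Hypothetical sketch of output format selection; not the actual MM4U component.
    PresentationFormat selectFormat(UserProfile profile, Device device) {
        // formats the user prefers, in order of preference (e.g., SMIL, SVG, Flash)
        for (PresentationFormat preferred : profile.getPreferredFormats()) {
            // a format is usable if some player installed on the device can render it
            for (Player player : device.getInstalledPlayers()) {
                if (player.supports(preferred)) {
                    return preferred;
                }
            }
        }
        return device.getDefaultFormat();   // fall back to the device's default format
    }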
The Presentation Format Generators adapt the characteristics and facilities of the internal document model provided by the Multimedia Composition layer, with regard to the time model used, the spatial layout, and the interaction possibilities, to the particular characteristics and syntax of the concrete presentation format. For example, the spatial layout of our internal document model is realized by a hierarchical model that supports the positioning of media elements relative to other media elements. This relative positioning is supported by most of today's presentation formats, for example, SMIL, SVG, and XMT-Omega. However, there are multimedia presentation formats that do not support such a hierarchical model and only allow absolute positioning of visual media elements with regard to the presentation's origin, for example, the Basic Language Profile of SMIL, 3GPP SMIL, and Macromedia's Flash. In this case, the Presentation Format Generators component transforms the hierarchically organized spatial layout of the internal document model into a spatial layout of absolute positioning. How the transformation of the
spatial model is actually performed, and how the temporal model and interaction possibilities of the internal document model are transformed into the characteristics and syntax of the concrete presentation formats, is intentionally omitted in this book chapter due to its focus on the composition and assembly of the personalized multimedia content; it is described in Scherp and Boll (2005).
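To illustrate the spatial part of such a transformation, the sketch below flattens the hierarchical, relative positioning of the internal model into absolute coordinates by accumulating the offsets along the path from the root. The node and projector classes are hypothetical stand-ins for the internal document model, not the actual generator code (java.awt.Point and java.util.Map are assumed to be imported).

    // Hypothetical sketch: converting relative positions into absolute ones for formats
    // such as SMIL BLP, 3GPP SMIL, or Flash, which do not support hierarchical layout.
    void assignAbsolutePositions(Node node, int parentX, int parentY,
                                 Map<Node, Point> absolutePositions) {
        SpatialProjector p = node.getSpatialProjector();   // may be null, e.g., for audio elements
        int x = parentX + (p != null ? p.getX() : 0);      // offset relative to the parent node
        int y = parentY + (p != null ? p.getY() : 0);
        absolutePositions.put(node, new Point(x, y));

        for (Node child : node.getChildren()) {            // recurse over operators and media elements
            assignAbsolutePositions(child, x, y, absolutePositions);
        }
    }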

IMPACT OF PERSONALIZATION ON THE DEVELOPMENT OF MULTIMEDIA APPLICATIONS
The multimedia personalization framework MM4U presented so far provides support for developing sophisticated personalized multimedia applications. The parties involved in the development of such applications are typically a heterogeneous team of developers from different fields, including media designers, computer scientists, and domain experts. In this section, we describe what challenges personalization brings to the development of personalized multimedia applications and how and where the MM4U framework can support the developer team in accomplishing its job.
In the next subsection, the general software engineering issues with regard to personalization are discussed. We describe how personalization affects the individual members of the heterogeneous developer team and how the MM4U framework supports the development of personalized multimedia applications. The challenges that arise when domain experts create personalized multimedia content using an authoring tool are presented in the subsection after that. We also introduce how the MM4U framework can be used to develop a domain-specific authoring tool in the field of e-learning content, which aims to hide the technical details of content authoring from the authors and lets them concentrate on the actual creation of the personalized multimedia content.

Influence of Personalization on Multimedia Software Engineering
We observe that software engineering support for multimedia applications, such as proper process models and development methodologies, is hardly to be found in this area. Furthermore, the existing process models and development methodologies for multimedia applications, for example, Rout and Sherwood (1999) and Engels et al. (2003), do not support personalization aspects. However, personalization requirements complicate the software development process even more and increase the development costs, since every individual alternative and variant has to be anticipated, considered, and actually implemented. Therefore, there is a high demand for support of the development process of such applications. In the following paragraphs, we first introduce how personalization affects the software development process with respect to the multimedia content creation process in general. Then we identify what support the developers of personalized multimedia applications need and consider where the MM4U framework supports the development process.
Since the meaning of the term personalization profoundly depends on the application's context, it has to be reconsidered whenever developing a personalized application for
a new domain. Rossi et al. (2001) claim that personalization should be considered directly from the beginning, when a project is conceived. Therefore, the first activity when developing a personalized multimedia application is to determine the personalization requirements, that is, which aspects of personalization should be supported by the actual application. For example, in the case of an e-learning application, the personalization aspects concern the automatic adaptation to the different learning styles of the students and their prior knowledge about the topic. In addition, different degrees of difficulty should be supported by a personalized e-learning application. In the case of a personalized mobile tourism application, however, the user's location and his or her surroundings would be of interest for personalization instead. These personalization aspects must be kept in mind during every activity throughout the whole development process. The decision regarding which personalization aspects are to be supported has to be incorporated in the analysis and design of the personalized application and will hopefully entail a flexible and extendible software design. However, this increases the overall complexity of the application to be developed and automatically leads to a higher development effort, including a longer development duration and higher costs. Therefore, a good requirements analysis is crucial when developing personalized applications, lest one dissipate one's energy on bad software design with respect to the personalization aspects.
When transferring the requirements for developing personalized software to the specific requirements of personalized multimedia applications, one can say that personalization affects all members of the developer team, the domain experts, the media designers, and the computer scientists, and places higher demands on them.
The domain expert normally contributes to the development of multimedia applications by providing input for drawing storyboards of the specific application's domain. These storyboards are normally drawn by media designers and are the most important means of communicating the later application's functionality within the developer team. When personalization comes into play, it is difficult to draw such storyboards because of the many possible alternatives and different paths in the application that personalization implies. Consequently, the storyboards change with regard to, for example, the individual user profiles and the end devices that are used. When drawing storyboards for a personalized multimedia application, the points in the storyboard where personalization is required have to be identified and visualized. Storyboards have to be drawn for every typical personalization scenario concerning the concrete application. This drawing task should be supported by interactive graphical tools to create personalized storyboards and to identify reusable parts and modules of the content.
It is the task of the media designer in the development of multimedia applications to plan, acquire, and create media elements. With personalization, media designers additionally have to think about the usage of media elements for personalization purposes, that is, the media elements have to be created and prepared for different contexts. When acquiring media elements, the media designers must consider for which user context the media elements are created and which aspects of personalization are to be supported, for example, different styles, colours, and spatial dimensions. Possibly a set of quite similar media assets has to be developed that differ only in certain aspects. For example, an image or video has to be transformed for different end device resolutions, colour depths,
and network connections. Since personalization means to (re)assemble existing media elements into a new multimedia presentation, the media designers will also have to identify reusable media elements. This means that the storyboards must already capture the personalization aspects. Not only the content but also the layout of the multimedia application can change depending on the user context. So the media designers have to create different visual layouts for the same application to serve the needs of different user groups. For example, an e-learning system for children would generate colourful multimedia presentations with many auditory elements and a comic-like virtual assistant, whereas the system would present the same content in a much more factual style for adults. This short discussion shows that personalization already affects storyboarding and media acquisition. Creating media elements for personalized multimedia applications requires better and more elaborate planning of the multimedia production. Therefore, a good media production strategy is crucial, due to the high costs involved in the media production process. Consequently, the domain experts and the media designers need to be supported by appropriate tools for planning, acquiring, and editing media elements for personalized multimedia applications.
The computer scientists actually have to develop the multimedia personalization functionality of the concrete application. What this personalization functionality is depends heavily on the concrete application domain and is communicated with the domain experts and media designers by using personalized storyboards. With personalization, the design of the application has to be more flexible and more abstract to meet the requirements of changing user profile information and different end device characteristics. This is where the MM4U framework comes into play. It provides the computer scientists with the general architecture of the personalized multimedia application and supports them in designing and implementing the concrete multimedia personalization functionality. When using the MM4U framework, the computer scientists must know how to use and how to extend it. The framework provides the basis for developing both basic and sophisticated multimedia personalization functionality, for example, the Slideshow or the CityMap operator presented in the section on content. To assist the computer scientists methodically, we are currently working on guidelines and checklists for how to develop the personalized multimedia composition operators and how to apply them. Consequently, developing personalized multimedia applications with the MM4U framework basically means, for the computer scientists, the design, development, and deployment of multimedia composition operators for generating personalized content. The concept of the multimedia personalization operators introduced in the content section, namely that every concrete personalized multimedia application is itself a new composition operator, increases the reuse of existing personalization functionality. Furthermore, the interface design of the sophisticated operators makes it possible to embrace existing approaches that are able to generate multimedia document trees, for example, trees such as can be generated with the basic and complex composition functionality of the MM4U framework.

Influence of Personalization on Multimedia Content Authoring
Authoring of multimedia content is the process in which the multimedia presentations are actually created. This creation process is typically supported by graphical
authoring tools, for example, Macromedia's Authorware and Director (Macromedia, 2004), Toolbook (Click2learn, 2001-2002), (Arndt, 1999), and (Gaggi & Celentano, 2002). For creating the multimedia content, these authoring tools follow different design philosophies and metaphors. The metaphors can be roughly categorized into script-based, card/page-based, icon-based, timeline-based, and object-based authoring (Rabin & Burns, 1996). All these different metaphors have the same goal: to support authors in creating their content. Even though a set of valuable authoring tools has been developed on the basis of these metaphors, the metaphors do not necessarily provide a suitable means for authoring personalized content.
From our research project Cardio-OP we derived early experiences with personalized content authoring for domain experts in the field of cardiac surgery (Klas et al., 1999; Greiner & Rose, 1998; Boll et al., 2001). One of the tools developed by a project partner, the Cardio-OP Authoring Wizard, is a page-based, easy-to-use multimedia authoring environment enabling medical experts to compose a multimedia book on operative techniques in the domain of cardiac surgery for three different target groups: medical doctors, nurses, and students. The Authoring Wizard guides the author through the particular authoring steps and offers dialogues specifically tailored to the needs of each step. Coupled tightly with an underlying media server, the Authoring Wizard allows the use of every precious piece of media data available at the media server in all of the instructional applications at different educational levels. This promotes the reuse of expensively produced content in a variety of different contexts.
Personalization of the e-learning content is required here, since the three target groups have different views of and knowledge about the domain of cardiac surgery. Therefore, the target groups require different information from such a multimedia book, presented at an adequate level of difficulty for each group.
However, the experience we gained from deploying this tool shows that it is hard to provide domain authors with an adequately intuitive user interface for the creation of personalized multimedia e-learning content for three educational levels. It was a specific challenge for the computer scientists involved in the project to provide both media creation tools and a multimedia authoring wizard that allow the domain experts to insert knowledge into the system while hiding the technical details from them as much as possible.
On the basis of the MM4U framework, we are currently developing a smart authoring tool aimed at domain experts for creating personalized multimedia e-learning content. The tool works at the what-you-see-is-what-you-get (WYSIWYG) level and can be seen as a specialized application employing the framework to create personalized content. The content source from which this personalized e-learning content is created is the LEBONED repositories. Within the LEBONED project (Oldenettel & Malachinski, 2003), digital libraries are integrated into learning management systems. To use the content managed by the LEBONED system for new e-learning units, multimedia authoring support is needed for assembling existing e-learning modules into new, possibly more complex, units. In the e-learning context, the background of the learners is very relevant to the content that meets their learning demands; that is, personalized multimedia content can match the users' background knowledge and interests much better than a one-size-fits-all e-learning unit. The creation of an e-learning unit, on the other hand, cannot be supported by a merely automatic
process. Rather, the domain experts want to control the assembly of the content because they are responsible for the content conveyed. The smart authoring tool guides the domain experts through the composition process and supports them in creating presentations that still provide flexibility with respect to the targeted user context. In the e-learning context, we can expect domain experts such as lecturers who want to create a new e-learning unit but do not want to be bothered with the technical details of (multimedia) authoring.
We use the MM4U framework to build the multimedia composition and personalization functionality of this smart authoring tool. For this, the Multimedia Composition component supports the creation and processing of arbitrary document structures and templates. The authoring tool exploits this composition functionality to achieve a document structure that is suited to the given content domain and the targeted audience. The Media Data Accessor supports the authoring tool in those parts in which it lets the author choose from only those media elements that are suitable for the intended user contexts and that can be adapted to the user's infrastructure. Using the Presentation Format Generators, the authoring tool finally generates the presentations for the different end devices of the targeted users. Thus the authoring process is guided and specialized with regard to selecting and composing personalized multimedia content. For the development of this authoring tool, the framework fulfils the same function in the process of creating personalized multimedia content as described in the previous section on the framework. However, the creation of personalized content is not achieved at once but step by step during the authoring process.

IMPLEMENTATION AND PROTOTYPICAL APPLICATIONS
The framework, with its components, classes, and interfaces, is specified using the Unified Modeling Language (UML) and has been implemented in Java. The development of the framework is carried out as an iterative software development process with stepwise refinement and enhancement of the framework's components. The redesign phases are triggered by the actual experience of implementing the framework but also by employing the framework in several application scenarios. In addition, we are planning to provide a beta version of the MM4U framework to other developers for testing the framework and for developing their own personalized multimedia applications with MM4U.
Currently, we are implementing several application scenarios to prove the applicability of MM4U in different application domains. These prototypes are the first stress test for the framework. At the same time, the development of the sample applications gives us important feedback about the comprehensiveness and the applicability of the framework. In the following sections, two of our prototypes based on the MM4U framework are introduced: in the Sightseeing4U subsection, a prototype of a personalized city guide is presented, and in the Sports4U subsection, a prototype of a personalized multimedia sports news ticker is described.


Sightseeing4U: A Generic Personalized City Guide


Our first prototype using the MM4U framework is Sightseeing4U, a generic personalized city guide application (Scherp & Boll, 2004a, 2004b; Boll et al., 2004). It can be used to develop personalized tourist guides for arbitrary cities, both for desktop PCs and for mobile devices such as PDAs (Personal Digital Assistants). The generic Sightseeing4U application uses the MM4U framework and its modules as depicted in Figure 3. The concrete demonstrator we developed for our hometown Oldenburg in Northern Germany covers the pedestrian zone and comprises video and image material of about 50 sights. The demonstrator is developed for desktop PCs as well as PDAs (Scherp & Boll, 2004a). It supports personalization with respect to the user's interests, for example, churches, museums, and theatres, and preferences such as the favorite language. Depending on the specific sightseeing interests, the proper sights are automatically selected for the user. This is realized by category matching of the user's interests with the meta data associated with the sights. Figure 10 and Figure 11 show some screenshots of our city guide application in different output formats and on different end devices. The presentation in Figure 10 is targeted at a user interested in culture, whereas the presentation in Figure 11 is generated for a user who is hungry and searches for a good restaurant in Oldenburg. The different interests of the users result in different spots being presented on the map of Oldenburg. When clicking on a certain spot, the user receives a multimedia presentation with further information about the sight (see the little boxes to which the arrows point).

Figure 10. Screenshots of the city guide application for a user interested in culture (presentation generated in SMIL 2.0 and SMIL 2.0 BLP format, respectively): (a) RealOne Player (RealNetworks, 2003) on a desktop PC; (b) PocketSMIL player (INRIA, 2003) on a PDA

Figure 11. Screenshots of the Sightseeing4U prototype for a user searching for a good restaurant (output generated in SVG 1.2 and Mobile SVG format, respectively): (a) Adobe SVG plug-in (Adobe Systems, 2001) on a Tablet PC; (b) Pocket eSVG viewer (EXOR, 2001-2004) on a PDA

Thereby, the media elements for the multimedia presentation are automatically selected to best fit the end device's characteristics. For example, a user sitting at a desktop PC receives a high-quality video about the palace of Oldenburg, as depicted in Figure 10a, while a mobile user gets a smaller video of lower quality, as in Figure 10b. In the same way, the user searching for a good restaurant in Oldenburg receives either a high-quality video when using a Tablet PC, as depicted in Figure 11a, or a smaller one that meets the limitations of the mobile device, as shown in Figure 11b. If no video of a particular sight is available at all, the personalized tourist guide automatically selects images instead and generates a slideshow for the user.
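The selection logic just described, category matching against the user's interests followed by a device-dependent choice of media with a slideshow fallback, could be sketched as follows. The classes and methods used here (Sight, UserProfile, Device, findVideoFor, and so on) are hypothetical illustrations, not the actual Sightseeing4U code.

    // Hypothetical sketch of the Sightseeing4U selection logic; names are illustrative only.
    List<Operator> selectSightPresentations(List<Sight> sights, UserProfile profile, Device device) {
        List<Operator> selected = new ArrayList<>();
        for (Sight sight : sights) {
            // category matching of the user's interests with the sight's meta data
            if (!profile.getInterestCategories().contains(sight.getCategory())) {
                continue;
            }
            Video video = sight.findVideoFor(device);   // considers resolution, bit rate, codec, ...
            if (video != null) {
                selected.add(new VideoClip(video));     // high quality on a PC, smaller on a PDA
            } else {
                selected.add(new Slideshow(sight.getImages()));  // fallback: generate a slideshow
            }
        }
        return selected;
    }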

Sports4U: A Personalized Multimedia Sports News Ticker
A second prototype that uses our MM4U framework is the personalized multimedia sports news ticker called Sports4U. The Sports4U application exploits the Medither multimedia event space introduced in Boll and Westermann (2003). The Medither is based on a decentralized peer-to-peer infrastructure and allows one to publish, find, and be notified about any kind of multimedia event of interest. In the case of Sports4U, the event space forms the media data basis of sports-related multimedia news events. A multimedia sports news event comprises data of different media types, such as a describing text, a title, one or more images, an audio recording, or a video clip. The personalized sports ticker application combines the multimedia data of the selected events, the available meta data, and additional information, for example, from a soccer player database. The application uses a sophisticated composition operator that automatically arranges these multimedia
sports news items into a coherent presentation. It takes into account possible constraints such as a running time limit and particular characteristics of the end device, such as the limited display size of a mobile device. The result is a sports news presentation that can, for example, be viewed with an SMIL player over the Web, as shown in Figure 12.

Figure 12. Screenshots of the personalized sports news ticker Sports4U

With a suitable Media Data
Connector, the Medither is connected to the MM4U framework. This connector not only allows querying for media elements, like the URIMediaConnector does, but also provides notification of incoming multimedia events to the actual personalized application. Depending on the user context, the Sports4U prototype receives, from the pool of sports events in the Medither, the sports news that match the user's profile. The Sports4U application thus relieves the user of the time-consuming task of searching for the sports news he or she might be interested in.
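The dual role of such a connector, pull-style querying plus push-style event notification, can be captured in a small interface sketch. The interface and method names below are hypothetical and only illustrate the idea; they are not the actual Medither connector of the MM4U framework.

    // Hypothetical interface sketch; not the actual MM4U Media Data Connector API.
    public interface MediaEventConnector {

        /** Pull-style access to media elements, comparable to the URIMediaConnector. */
        List<MediaElement> query(MediaQuery query);

        /** Push-style access: register a listener for multimedia events matching a profile. */
        void subscribe(UserProfile profile, MediaEventListener listener);
    }

    interface MediaEventListener {
        /** Invoked by the connector whenever a matching multimedia event arrives. */
        void onMediaEvent(MediaEvent event);
    }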

CONCLUSION
In this chapter, we presented an approach for supporting the creation of personalized multimedia content. We motivated the need for technology to handle the flood of multimedia information, technology that allows for a much more targeted, individual management of and access to multimedia content. To give a better understanding of the content creation process, we introduced the general approaches to multimedia data modeling and multimedia authoring as we find them today. We showed how the need for personalization of multimedia content heavily affects the multimedia content creation process and can only be met by dynamic, (semi)automatic support for the personalized assembly of multimedia content. We looked into existing related approaches, ranging from personalization in the text-centric Web context, through single-media personalization, to the personalization of multimedia content. Especially for complex personalization tasks, we observe that (additional) programming is needed, and we propose software engineering support with our Multimedia for you Framework (MM4U).
We presented the MM4U framework concept in general and, in more detail, the individual layers of the MM4U framework: access to user profile information, personalized
media selection by meta data, composition of complex multimedia presentations, and generation of different output formats for different end devices. As a central part of the framework, we developed multimedia composition operators, which create multimedia content in an internal model and representation for multimedia presentations, integrating the composition capabilities of advanced multimedia composition models. Based on this representation, the framework provides so-called generators to dynamically create different context-aware multimedia presentations in formats such as SMIL and SVG. The usage of the framework and its advantages have been presented in the context of multimedia application developers, but also in the specific case of using the framework's features for the development of a high-level authoring tool for domain experts.
With the framework developed, we achieved our goals concerning the development of a domain-independent framework that supports the creation of personalized multimedia content independent of the final presentation format. Its design allows one to simply use the functionality it provides, for example, the access to media data, associated meta data, and user profile information, as well as the generation of the personalized multimedia content in standardized presentation formats. Hence, the framework relieves the developers of personalized multimedia applications of common tasks needed for content personalization, that is, personalized content selection, composition functionality, and presentation generation, and lets them concentrate on their application-specific job. However, the framework is also designed to be extensible with regard to application-specific personalization functionality, for example, by application-specific personalized multimedia composition functionality. With the applications in the fields of tourism and sports news, we illustrated the usage of the framework in different domains and showed how the framework easily allows one to dynamically create personalized multimedia content for different user contexts and devices.
The framework has been designed not to become yet another framework but to build on and integrate previous and existing research in the field. The design is based on our long-term experience with advanced multimedia composition models and an extensive study of previous and ongoing related approaches. Its interfaces and extensibility explicitly allow one not only to extend the framework's functionality but also to embrace existing solutions of other (research) approaches in the field. The dynamic creation of personalized multimedia content needs semantically richly annotated content in the respective media databases. In turn, the newly created content itself not only provides users with personalized multimedia information but at the same time forms new, semantically even richer multimedia content that can be retrieved and reused. The composition operators provide common multimedia composition functionality but also allow the integration of very specific operators. The explicit decision for presentation independence through a comprehensive internal composition model makes the framework independent of any specific presentation format and prepares it for future formats to come.
Even though a software engineering approach towards the dynamic creation of personalized multimedia content may not be the obvious one, we are convinced that our framework fills the gap between dynamic personalization support based on abstract data models, constraints, or rules, and application-specific programming of personalized multimedia applications. With the MM4U framework, we are contributing to a new but obvious research challenge in the field of multimedia research, that is, the shift from tools for the manual creation of static multimedia content towards techniques for the dynamic
creation of context-aware and personalized multimedia content, which is needed in many application fields. Due to its domain independence, MM4U can be used by arbitrary personalized multimedia applications, each application applying a different configuration of the framework. Consequently, for providers of applications, the framework approach supports a cheaper and quicker development process and thereby contributes to more efficient personalized multimedia content engineering.

REFERENCES

3rd Generation Partnership Project (2003a). TS 26.234; Transparent end-to-end packet-switched streaming service; protocols and codecs (Release 5). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26234.htm
3rd Generation Partnership Project (2003b). TS 26.246; Transparent end-to-end packet-switched streaming service: 3GPP SMIL language profile (Release 6). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26246.htm
Ackermann, P. (1996). Developing object oriented multimedia software: Based on
MET++ application framework. Heidelberg, Germany: dpunkt.
Adobe Systems, Inc., USA (2001). Adobe SVG Viewer. Retrieved February 25, 2004, from
http://www.adobe.com/svg/
Agnihotri, L., Dimitrova, N., Kender, J., & Zimmerman, J. (2003). Music videos miner. In
ACM Multimedia.
Allen, J. F. (1983, November). Maintaining knowledge about temporal intervals. In Commun. ACM, 26(11).
Amazon, Inc., USA (1996-2004). Amazon.com. Retrieved February 20, 2004, from http://www.amazon.com/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004a). Mobile SVG
profiles: SVG Tiny and SVG Basic. W3C recommendation 25/03/2004. Retrieved
June 10, 2004, from http://www.w3.org/TR/SVGMobile12/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004b). Scalable vector
graphics (SVG) 1.2 specification. W3C working draft 05/10/2004. Retrieved June
10, 2004, from http://www.w3c.org/Graphics/SVG/
André, E. (1996). WIP/PPP: Knowledge-based methods for fully automated multimedia authoring. In Proceedings of the EUROMEDIA'96. London.
André, E., & Rist, T. (1995). Generating coherent presentations employing textual and visual material. In Artif. Intell. Rev, 9(2-3). Kluwer Academic Publishers.
André, E., & Rist, T. (1996, August). Coping with temporal constraints in multimedia presentation planning. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon.
Arndt, T. (1999, June). The evolving role of software engineering in the production of
multimedia applications. In IEEE International Conference on Multimedia Computing and Systems Volume 1, Florence, Italy.
Ayars, J., Bulterman, D., Cohen, A., Day, K., Hodge, E., Hoschka, P., et al. (2001).
Synchronized multimedia integration language (SMIL 2.0) specification. W3C
Recommendation 08/07/2001. Retrieved February 23, 2004, from http://
www.w3c.org/AudioVideo/

Beckett, D., & McBride, B. (2003). RDF/XML syntax specification (revised). W3C recommendation 15/12/2003. Retrieved February 23, 2004, from http://www.w3c.org/RDF/
Bohrer, K., & Holland, B. (2004). Customer profile exchange (cpexchange) specification
version 1.0, 20/10/2000. Retrieved January 27, 2004, from http://
www.cpexchange.org/standard/cpexchangev1_0F.zip
Boll, S. (2003, July). Vienna 4 U - What Web services can do for personalized multimedia
applications. In Proceedings of the Seventh Multi-Conference on Systemics
Cybernetics and Informatics (SCI 2003), Orlando, Florida, USA.
Boll, S., & Klas, W. (2001). ZYX - A multimedia document model for reuse and adaptation.
In IEEE Transactions on Knowledge and Data Engineering, 13(3).
Boll, S., Klas, W., & Wandel, J. (1999, November). A cross-media adaptation strategy for
multimedia presentations. Proc. of the ACM Multimedia Conf. 99, Part 1, Orlando,
Florida, USA.
Boll, S., Klas, W., Heinlein, C., & Westermann, U. (2001, August). Cardio-OP - Anatomy
of a multimedia repository for cardiac surgery. Technical Report TR-2001301,
University of Vienna, Austria.
Boll, S., Klas, W., & Westermann, U. (2000, August). Multimedia Document Formats - Sealed Fate or Setting Out for New Shores? In Multimedia - Tools and Applications, 11(3).
Boll, S., Krösche, J., & Scherp, A. (2004, September). Personalized multimedia meets location-based services. In Proceedings of the Multimedia-Informationssysteme Workshop associated with the 34th annual meeting of the German Society of Computing Science, Ulm, Germany.
Boll, S., Krösche, J., & Wegener, C. (2003, August). Paper chase revisited - A real world game meets hypermedia (short paper). In Proc. of the Intl. Conference on Hypertext (HT'03), Nottingham, UK.
Boll, S., & Westermann, U. (2003, November). Medither - An event space for context-aware multimedia experiences. In Proc. of International ACM SIGMM Workshop on Experiential Telepresence, Berkeley, CA, USA.
Bordegoni, M., Faconti, G., Feiner, S., Maybury, M. T., Rist, T., Ruggieri, S., et al. (1997,
December). A standard reference model for intelligent multimedia presentation
systems. In ACM Computer Standards & Interfaces, 18(6-7).
Brusilovsky, P. (1996). Methods and techniques of adaptive hypermedia. User Modeling
and User Adapted Interaction, 6(2-3).
Bugaj, S., Bulterman, D., Butterfield, B., Chang, W., Fouquet, G., Gran, C., et al. (1998).
Synchronized multimedia integration language (SMIL 1.0) specification. W3C
Recommendation 06/15/1998. Retrieved June 10, 2004, from http://www.w3.org/
TR/REC-smil/
Bulterman, D. C. A., van Rossum, G., & van Liere, R. (1991). A structure of transportable,
dynamic multimedia documents. In Proceedings of the Summer 1991 USENIX
Conf., Nashville, TN, USA.
Chen, G., & Kotz, D. (2000). A survey of context-aware mobile computing research.
Technical Report TR2000-381. Dartmouth University, Department of Computer
Science,
Click2learn, Inc, USA (2001-2002). Toolbook Standards-based content authoring.
Retrieved February 6, 2004 from http://home.click2learn.com/en/toolbook/
index.asp
CWI (2004). The Cuypers multimedia transformation engine. Amsterdam, The Netherlands. Retrieved February 25, 2004, from http://media.cwi.nl:8080/demo/
De Bra, P., Aerts, A., Berden, B., De Lange, B., Rousseau, B., Santic, T., Smits, D., & Stash,
N. (2003, August). AHA! The Adaptive Hypermedia Architecture. Proceedings of
the ACM Hypertext Conference, Nottingham, UK.
De Bra, P., Aerts, A., Houben, G.-J., & Wu, H. (2000). Making general-purpose adaptive
hypermedia work. In Proc. of the AACE WebNet Conference, San Antonio, Texas.
De Bra, P., Aerts, A., Smits, D., & Stash, N. (2002a, October). AHA! version 2.0: More
adaptation flexibility for authors. In Proc. of the AACE ELearn2002 Conf.
De Bra, P., Brusilovsky, P., & Conejo, R. (2002b, May). Proc. of the Second Intl. Conf.
for Adaptive Hypermedia and Adaptive Web-Based Systems, Malaga, Spain,
Springer LNCS 2347.
De Bra, P., Brusilovsky, P., & Houben, G.-J. (1999a, December). Adaptive hypermedia:
From systems to framework. ACM Computing Surveys, 31(4).
De Bra, P., Houben, G.-J., & Wu, H. (1999b). AHAM: A dexter-based reference model for
adaptive hypermedia. In Proceedings of the 10th ACM Conf. on Hypertext and
hypermedia: returning to our diverse roots, Darmstadt, Germany.
De Carolis, B., de Rosis, F., Andreoli, C., Cavallo, V., & De Cicco, M. L. (1998). The Dynamic Generation of Hypertext Presentations of Medical Guidelines. The New Review of Hypermedia and Multimedia, 4.
De Carolis, B., de Rosis, F., Berry, D., & Michas, I. (1999). Evaluating plan-based
hypermedia generation. In Proc. of European Workshop on Natural Language
Generation, Toulouse, France.
de Rosis, F., De Carolis, B., & Pizzutilo, S. (1999). Software documentation with animated
agents. In Proc. of the 5th ERCIM Workshop on User Interfaces For All, Dagstuhl,
Germany.
Dublin Core Metadata Initiative (1995-2003). Expressing simple Dublin Core in RDF/
XML,1995-2003. Retrieved February 2, 2004, from http://dublincore.org/documents/2002/07/31/dcmes-xml/
Duda, A., & Keramane, C. (1995). Structured temporal composition of multimedia data.
Proceedings of the IEEE International Workshop Multimedia-Database-Management Systems.
Duke, D. J., Herman, I., & Marshall, M. S. (1999). PREMO: A framework for multimedia
middleware: Specification, rationale, and java binding. New York: Springer.
Echiffre, M., Marchisio, C., Marchisio, P., Panicciari, P., & Del Rossi, S. (1998, January-March). MHEG-5 - Aims, concepts, and implementation issues. In IEEE Multimedia.
Egenhofer, M. J., & Franzosa, R. (1991, March). Point-Set Topological Spatial Relations.
Int. Journal of Geographic Information Systems, 5(2).
Elhadad, M., Feiner, S., McKeown, K., & Seligmann, D. (1991). Generating customized
text and graphics in the COMET explanation testbed. In Proc. of the 23rd
Conference on Winter Simulation. IEEE Computer Society, Phoenix, Arizona,
USA.
Engels, G., Sauer, S., & Neu, B. (2003, October). Integrating software engineering and
user-centred design for multimedia software developments. In Proc. IEEE Symposia on Human-Centric Computing Languages and Environments - Symposium on

Visual/Multimedia Software Engineering, Auckland, New Zealand. IEEE Computer Society Press.
Exor International Inc. (2001-2004). eSVG: Embedded SVG. Retrieved February 12, 2004,
from http://www.embedding.net/eSVG/english/overview/overview_frame.html
Fink, J., Kobsa, A., & Schreck, J. (1997). Personalized hypermedia information through
adaptive and adaptable system features: User modeling, privacy and security
issues. In A. Mullery, M. Besson, M. Campolargo, R. Gobbi, & R. Reed (Eds.),
Intelligence in services and networks: Technology for cooperative competition.
Berlin: Springer.
Foundation for Intelligent Physical Agents (2002). FIPA device ontology specification,
2002. Retrieved January 23, 2004, from http://www.fipa.org/specs/fipa00091/
Gaggi, O., & Celentano, A. (2002). A visual authoring environment for prototyping
multimedia presentations. In Proceedings of the IEEE Fourth International
Symposium on Multimedia Software Engineering.
Girgensohn, A., Bly, S., Shipman, F., Boreczky, J., & Wilcox, L. (2001). Home video editing
made easy Balancing automation and user control. In Proc. of the HumanComputer Interaction, Tokyo, Japan.
Girgensohn, A., Shipman, F., & Wilcox, L. (2003, November). Hyper-Hitchcock: Authoring
Interactive Videos and Generating Interactive Summaries. In Proc. ACM Multimedia.
Greiner, C., & Rose, T. (1998, November). A Web based training system for cardiac
surgery: The role of knowledge management for interlinking information items. In
Proc. The World Congress on the Internet in Medicine, London.
Hardman, L. (1998, March). Modeling and Authoring Hypermedia Documents. Doctoral
dissertation, University of Amsterdam, The Netherlands.
Hardman, L., Bulterman, D. C. A., & van Rossum, G. (1994b, February). The Amsterdam
Hypermedia Model: Adding time and context to the Dexter Model. In Comm. of the
ACM, 37(2).
Hardman, L., van Rossum, G., Jansen, J., & Mullender, S. (1994a). CMIFed: A transportable hypermedia authoring system. In Proc. of the Second ACM International
Conference on Multimedia, San Francisco.
Hirzalla, N., Falchuk, B., & Karmouch, A. (1995). A temporal model for interactive
multimedia scenarios. In IEEE Multimedia, 2(3).
Hunter, J. (1999, October). Multimedia metadata schemas. Retrieved June 16, 2004, from
http://www2.lib.unb.ca/Imaging_docs/IC/schemas.html
IBM Corporation, USA. (2004a). IBM research Video semantic summarization systems.
Retrieved June 15, 2004, from http://www.research.ibm.com/MediaStar/
VideoSystem.html#Summarization%20Techniques
IBM Corporation, USA. (2004b). QBIC home page. Retrieved June 16, 2004, from http://wwwqbic.almaden.ibm.com/
INRIA (2003). PocketSMIL 2.0. Retrieved February 24, 2004, from http://
opera.inrialpes.fr/pocketsmil/
International Organisation for Standardization (1996). ISO 13522-5, Information technology - Coding of multimedia and hypermedia information - Part 5: Support for base-level interactive applications. Geneva, Switzerland: International Organisation for Standardization.

ISO/IEC (1999, July). JTC 1/SC 29/WG 11, MPEG-7: Context, Objectives and Technical
Roadmap, V.12. ISO/IEC Document N2861. Geneva, Switzerland: Int. Organisation
for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001a, November). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 1: Systems. ISO/IEC Final Draft International Standard 15938-1:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001b, September). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 2: Description definition language. ISO/IEC Final Draft Int. Standard 15938-2:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001c, July). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 3: Visual. ISO/IEC Final Draft Int. Standard 15938-3:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001d, June). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 4: Audio. ISO/IEC Final Draft Int. Standard 15938-4:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001e, October). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 5: Multimedia description schemes. ISO/IEC Final Draft Int. Standard 15938-5:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
Jourdan, M., Layaïda, N., Roisin, C., Sabry-Ismail, L., & Tardif, L. (1998). Madeus, an authoring environment for interactive multimedia documents. ACM Multimedia.
Kim, M., Wood, S., & Cheok, L.-T. (2000, November). Extensible MPEG-4 textual format
(XMT). In Proc. of the 8th ACM Multimedia Conf., Los Angeles.
Klas, W., Greiner, C., & Friedl, R. (1999, July). Cardio-OP: Gallery of cardiac surgery. IEEE International Conference on Multimedia Computing and Systems (ICMS 99), Florence, Italy.
Klyne, G., Reynolds, F., Woodrow, C., Ohto, H., Hjelm, J., Butler, M. H., & Tran, L. (2003).
Composite capability/preference profile (CC/PP): Structure and vocabularies W3C Working Draft 25/03/2003.
Kopf, S., Haenselmann, T., Farin, D., & Effelsberg, W. (2004). Automatic generation of
summaries for the Web. In Proceedings Electronic Imaging 2004.
Lemlouma, T., & Layaïda, N. (2003, June). Media resources adaptation for limited devices. In Proc. of the Seventh ICCC/IFIP International Conference on Electronic Publishing ELPUB 2003, Universidade do Minho, Portugal.
Lemlouma, T., & Layaïda, N. (2004, January). Context-aware adaptation for mobile devices. IEEE International Conference on Mobile Data Management, Berkeley, California, USA.
Little, T. D. C., & Ghafoor, A. (1993). Interval-based conceptual models for time-dependent multimedia data. In IEEE Transactions on Knowledge and Data
Engineering, 5(4).
Macromedia, Inc., USA (2003, January). Using Authorware 7. [Computer manual].
Available from http://www.macromedia.com/software/authorware/

Macromedia, Inc., USA (2004). Macromedia. Retrieved June 15, 2004, from http://
www.macromedia.com/
McKeown, K., Robin, J., & Tanenblatt, M. (1993). Tailoring lexical choice to the user's
vocabulary in multimedia explanation generation. In Proc. of the 31st conference
on Association for Computational Linguistics, Columbus, Ohio.
Oldenettel, F., & Malachinski, M. (2003, May). The LEBONED metadata architecture. In
Proc. of the 12th International World Wide Web Conference, Budapest, Hungary
(pp. 207-216). ACM Press, Special Track on Education.
Open Mobile Alliance (2003). User agent profile (UA Prof). 20/05/2003. Retrieved
February 10, 2004, from http://www.openmobilealliance.org/
Oratrix (2004). GRiNS for SMIL Homepage. Retrieved February 23, 2004, from http://
www.oratrix.com/GRiNS
Papadias, D., & Sellis, T. (1994, October). Qualitative representation of spatial knowledge
in two-dimensional space. In VLDB Journal, 3(4).
Papadias, D., Theodoridis, Y., Sellis, T., & Egenhofer, M. J. (1995, March). Topological
relations in the world of minimum bounding rectangles: A study with R-Trees. In
Proc. of the ACM SIGMOD Conf. on Management of Data, San Jose, California.
Pree, W. (1995). Design patterns for object-oriented software development. Boston:
Addison-Wesley.
Rabin, M. D., & Burns, M. J. (1996). Multimedia authoring tools. In Conference Companion on Human Factors in Computing Systems, Vancouver, British Columbia,
Canada, ACM Press.
Raggett, D., Le Hors, A., & Jacobs, I. (1998). HyperText markup language (HTML)
version 4.0. W3C Recommendation, revised on 04/24/1998. Retrieved February 20,
2004, from http://www.w3c.org/MarkUp/
RealNetworks (2003). RealOne Player. Retrieved February 25, 2004, from http://
www.real.com/
Rossi, G., Schwabe, D., & Guimarães, R. (2001, May). Designing personalized Web
applications. In Proceedings of the tenth World Wide Web (WWW) Conference,
Hong Kong. ACM.
Rout, T. P., & Sherwood, C. (1999, May). Software engineering standards and the
development of multimedia-based systems. In Fourth IEEE International Symposium and Forum on Software Engineering Standards. Curitiba, Brazil.
Scherp, A., & Boll, S. (2004a, March). MobileMM4U - Framework support for dynamic
personalized multimedia content on mobile systems. In Multikonferenz
Wirtschaftsinformatik 2004, special track on Technologies and Applications for
Mobile Commerce.
Scherp, A., & Boll, S. (2004b, October). Generic support for personalized mobile multimedia tourist applications. Technical demonstration for the ACM Multimedia
Conference, New York, USA.
Scherp, A., & Boll, S. (2005, January). Paving the last mile for multi-channel multimedia
presentation generation. In Proceedings of the 11th International Conference on
Multimedia Modeling, Melbourne, Australia.
Schmitz, P., Yu, J., & Santangeli, P. (1998). Timed interactive multimedia extensions for
HTML (HTML+TIME). W3C, version 09/18/1998. Retrieved February 20, 2004, from
http://www.w3.org/TR/NOTE-HTMLplusTIME

Stash, N., Cristea, A., & De Bra, P. (2004). Authoring of learning styles in adaptive
hypermedia. In WWW'04 Education Track. New York: ACM.
Stash, N., & De Bra, P. (2003, June). Building Adaptive Presentations with AHA! 2.0. Proceedings of the PEG Conference, Saint Petersburg, Russia.
Sun Microsystems, Inc. (2004). Java media framework API. Retrieved February 15, 2004,
from http://java.sun.com/products/java-media/jmf/index.jsp
Szyperski, C., Gruntz, D., & Murer, S. (2002). Component software: Beyond object-oriented programming (2nd ed.). Boston: Addison-Wesley.
van Ossenbruggen, J.R., Cornelissen, F.J., Geurts, J.P.T.M., Rutledge, L.W., & Hardman,
H.L. (2000, December). Cuypers: A semiautomatic hypermedia presentation system. Technical Report INS-R0025. CWI, The Netherlands.
van Rossum, G., Jansen, J., Mullender, S., & Bulterman, D. C. A. (1993). CMIFed: A
presentation environment for portable hypermedia documents. In Proc. of the First
ACM International Conference on Multimedia, Anaheim, California.
Villard, L. (2001, November). Authoring transformations by direct manipulation for
adaptable multimedia presentations. In Proceeding of the ACM Symposium on
Document Engineering, Atlanta, Georgia.
Wahl, T., & Rothermel, K. (1994, May). Representing time in multimedia systems. In Proc.
IEEE Int. Conf. on Multimedia Computing and Systems, Boston.
Wu, H., de Kort, E., & De Bra, P. (2001). Design issues for general-purpose adaptive
hypermedia systems. In Proc. of the 12th ACM Conf. on Hypertext and Hypermedia,
Århus, Denmark.
Yahoo!, Inc. (2002). MyYahoo!. Retrieved February 17, 2004, from http://my.yahoo.com/


Chapter 12

The Role of Relevance Feedback in Managing Multimedia Semantics: A Survey

Samar Zutshi, Monash University, Australia


Campbell Wilson, Monash University, Australia
Shonali Krishnaswamy, Monash University, Australia
Bala Srinivasan, Monash University, Australia

ABSTRACT

Relevance feedback is a mature technique that has been used to take user subjectivity
into account in multimedia retrieval. It can be seen as an attempt to bridge the semantic
gap by keeping a human in the loop. A variety of techniques have been used to
implement relevance feedback in existing retrieval systems. An analysis of these
techniques is used to develop the requirements of a relevance feedback technique that
aims to be capable of managing semantics in multimedia retrieval. It is argued that
these requirements suggest a case for a user-centric framework for relevance feedback
with low coupling to the retrieval engine.

INTRODUCTION
A key challenge in multimedia retrieval remains the issue often referred to as the
semantic gap. Similarity measures computed on low-level features may not correspond
well with human perceptions of similarity (Zhang et al., 2003). Human perceptions of
similarity of multimedia objects such as images or video clips tend to be semantically
based, that is, the perception that two multimedia objects are similar arises from these two
objects evoking similar or overlapping concepts in the user's mind. Therefore, different
users posing the same query may have very different expectations of what they are
looking for. On the other hand, existing retrieval systems tend to return the same results
for a given query. In order to cater to user subjectivity and to allow for the fact that the
user's perception of similarity may be different from the system's similarity measure, the
users need to be kept in the loop.
Relevance feedback is a mature and widely recognised technique for making
retrieval systems better satisfy users' information needs (Rui et al., 1997). Informally,
relevance feedback can be interpreted as a technique that should be able to understand
the user's semantic similarity perception and to incorporate this in subsequent iterations.
This chapter aims to provide an overview of the rich variety of relevance feedback
techniques described in the literature while examining issues related to the semantic
implications of these techniques. Section 2 presents a discussion of the existing literature
on relevance feedback and highlights certain advantages and disadvantages of the
reviewed approaches. This analysis is used to develop the requirements of a relevance
feedback technique that would be an aid in managing multimedia semantics (Section 3).
A high-level framework for such a technique is outlined in Section 4.

RELEVANCE FEEDBACK IN
CONTENT-BASED MULTIMEDIA RETRIEVAL
Broadly speaking, anything the user does or says can be used to interpret
something about their view of a computer system; for example, the time spent at a Web
page, the motion of their eyes while viewing an electronic document, and so forth. In the
context of content-based multimedia retrieval we use the term relevance feedback in the
conventional sense, whereby users are allowed to indicate their opinion of results
returned by a retrieval system. This is done in a number of ways; for example, the user
only selects results that they consider relevant to their query, the user provides positive
as well as negative examples, or the user is left to provide some sort of ranking of the
images. In general terms, the user classifies the result set into a number of categories.
The relevance feedback module should be able to use this classification to improve
subsequent retrieval. It is expected that several successive iterations will further refine
the result set, thus converging to an acceptable result. What makes a result acceptable
is dependent on the user; it may be a single result (the so-called "target searching" of
Cox et al. (1996)). On the other hand, "acceptable" may mean a sufficient number of
relevant results (as when the users are performing a "category search").
Intuitively, for results to approximate semantic retrieval, the relevance feedback
mechanism should understand why users mark the results the way they do. It should
ideally be able to identify not only what is common about the results belonging to a
particular category but also take into account subtleties like what the difference is
between an example being marked nonrelevant and being left unmarked.
There is a rich body of literature on the subject of relevance feedback, particularly
in the context of content-based image retrieval (CBIR). For instance, Zhang et al. (2003)
interpret relevance feedback as a machine-learning problem and present a comprehensive review of relevance feedback algorithms in CBIR based on the learning and searching
natures of the algorithms. Zhou and Huang (2001b) discuss a selection of relevance
feedback variants in the context of multimedia retrieval examined under seven conceptual dimensions. These range from simple dimensions such as "What is the user looking
for" to the highly diverse "The goals and the learning machines" (Zhou & Huang,
2001b). We examine a variety of existing relevance feedback methods, many from CBIR
since this area has received much attention in the literature. We classify them into one
of five broad approaches. While there are degrees of overlap, this taxonomy focuses on
the main conceptual thrust of the techniques and may be taken as representative rather
than exhaustive.

The Classical Approach

In this approach, documents and queries are represented by vectors in an n-dimensional feature space. The feedback is implemented through Query Point Movement
(QPM) and/or Query ReWeighting (QRW). QPM aims to estimate an ideal query point
by moving it closer to positive example points and away from negative example points.
QRW tries to give higher importance to the dimensions that help in retrieving relevant
images and reduce the importance of the ones that do not. The classical approach is
arguably the most mature approach. Although elements of it can be found in almost all
existing techniques, certain techniques are explicitly or very strongly of this type.
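To make these two operations concrete, the following minimal sketch (our own illustration, not code from any of the systems cited; the constants alpha, beta and gamma are arbitrary) combines a Rocchio-style query point movement with an inverse-standard-deviation reweighting of the feature dimensions:

```python
import numpy as np

def query_point_movement(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style QPM: move the query towards the mean of the positive
    examples and away from the mean of the negative examples."""
    new_query = alpha * np.asarray(query, dtype=float)
    if len(relevant) > 0:
        new_query = new_query + beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        new_query = new_query - gamma * np.mean(nonrelevant, axis=0)
    return new_query

def query_reweighting(relevant, eps=1e-6):
    """QRW: give more weight to dimensions on which the relevant examples
    agree (small spread) and less to dimensions on which they vary widely."""
    weights = 1.0 / (np.std(relevant, axis=0) + eps)
    return weights / weights.sum()

def weighted_distance(x, query, weights):
    """Distance used to re-rank the collection after feedback."""
    return np.sqrt(np.sum(weights * (np.asarray(x, dtype=float) - query) ** 2))
```

After each feedback iteration the updated query point and weights would be used to re-rank the collection by the weighted distance.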
The MARS system (Rui et al., 1997) uses a reweighting technique based on a
refinement of the text retrieval approach. Later work by Rui and Huang (1999) has been
adapted by Kang (2003) to use relevance feedback to detect emotional events in video.
To overcome certain disadvantages of these approaches, such as the need for ad-hoc
constants, the relevance feedback task is formalised as a minimisation problem in
Ishikawa et al. (1998).
A novel framework is presented in Rui and Huang (1999) based on a two-level image
model with features like colour, texture and shape occupying the higher level. The lower
level contains the feature vector for each feature. The overall distance between a training
sample and the query is defined in terms of both these levels. The performance of the
MARS and MindReader systems is compared against the novel model in terms of a
percentage of relevant images returned. While relevance feedback boosts retrieval
performance of all the techniques, the novel framework is able to consistently perform
better than MARS and MindReader (Rui & Huang, 1999).
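The two-level idea can be sketched roughly as follows (a simplified rendering under our own naming, not the exact formulation or optimisation procedure of Rui and Huang, 1999): each high-level feature contributes a weighted distance over its own feature vector, and these per-feature distances are combined using a second set of weights that can also be updated from feedback.

```python
import numpy as np

def feature_distance(x, q, w):
    """Lower level: weighted distance within one feature's own vector
    (e.g., a colour histogram)."""
    return np.sqrt(np.sum(w * (x - q) ** 2))

def overall_distance(obj, query, intra_weights, feature_weights):
    """Higher level: combine the per-feature distances (colour, texture,
    shape, ...) using one weight per high-level feature."""
    return sum(u * feature_distance(obj[name], query[name], intra_weights[name])
               for name, u in feature_weights.items())

# Example layout of the two levels (dimensions and values are placeholders):
query = {"colour": np.zeros(8), "texture": np.zeros(4), "shape": np.zeros(6)}
intra_weights = {name: np.ones_like(vec) for name, vec in query.items()}
feature_weights = {"colour": 0.5, "texture": 0.3, "shape": 0.2}
```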
The vector space model is not just restricted to CBIR. Liu and Wan (2003) have
developed two relevance feedback algorithms for use in audio retrieval with time domain
features as well as frequency domain features. The first is a standard deviation based
reweighting while the second relies on minimizing the weighted distance between the
relevant examples and the query. They demonstrate an average precision increase as a
result of the use of relevance feedback and claim that, "Through the relevance feedback
some [...] semantics can be added to the retrieval system", although the claim is not very
well supported.
The recent techniques in this area are characterised by their robustness, solid
mathematical formulation and efficient implementations. However, their semantic underpinnings may be seen as somewhat simplistic due to the underlying assumptions that (a)
a single query vector is able to represent the user's query needs and (b) visually similar
images are close together in the feature space. There is also often an underlying
assumption in methods of this kind that the user's ideal query point remains static
throughout a query session. While this assumption may be justifiable while performing
automated tests for computation of performance benchmarks, it clearly does not capture
users' information needs at a semantic level.

The Probabilistic-Statistical Approach


Feedback used to estimate the probability of other documents/images being
relevant is often performed using Bayesian reasoning, for example, the PicHunter system
(Cox et al., 1996). Cox et al. consider the case in which a user is looking for a specific
target image as the result of their search. The probability that a given datum is the target
is expressed in terms of the probability of the possible user actions (marking). The number
of iterations taken to retrieve the target image (search length) is used as an objective
benchmark. At each iteration four images are displayed. The user can select zero or more
images as relevant. Unmarked ones are considered not relevant (absolute judgement).
A subsequent extension (Cox et al., 1998), the stochastic-comparison search, is based
on interpreting the marked images as more relevant than unmarked ones (relative
judgement); the relevance feedback is addressed as a comparison-searching problem.
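The core of such target search can be viewed as a Bayesian update: after every display, the probability of each database item being the sought target is rescaled by how well that item explains the user's marking. The sketch below uses a deliberately crude user model of our own devising (marked items are assumed to resemble the target more closely than unmarked ones) and is not the actual PicHunter model:

```python
import numpy as np

def bayesian_update(probs, displayed, marked, similarity):
    """probs:      current P(target = i) for every item i (float array)
    displayed:  indices of the items shown in this iteration
    marked:     the subset of displayed items the user marked as relevant
    similarity: function giving a similarity in (0, 1) between two items
    Returns the posterior over targets after observing the user's action."""
    likelihood = np.ones_like(probs)
    for i in range(len(probs)):
        for d in displayed:
            sim = similarity(i, d)
            # crude user model: marking an item is more probable when that
            # item is similar to the target; leaving it unmarked, less so
            likelihood[i] *= sim if d in marked else (1.0 - sim)
    posterior = probs * likelihood + 1e-12   # small constant avoids degeneracy
    return posterior / posterior.sum()
```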
The stochastic model in Geman and Moquet (1999) tries to improve the PicHunter
technique by recognising that the actual interactive process is more random than a noisy
response to a known metric. Hence the metric itself is modeled as a random variable, and
three cases are considered: a fixed metric, a fixed distribution, and a general case in which
the distributions vary with the display and target. The more
complex and general model is seen to perform better in terms of mean search length than
the ones with simplifying assumptions when humans perform the tests.
The BIR probabilistic retrieval model (Wilson, Srinivasan & Indrawan, 2001) relies on inference
within Bayesian networks. Evidential inference within the Bayesian networks is employed for the initial retrieval. Following this, diagnostic inference, suppressed in the
initial retrieval, is used for relevance feedback subsequent to user selection of images
(Wilson & Srinivasan, 2002).
The generality and elaborate modeling of user subjectivity is representative of the
approach taken by the more sophisticated techniques, while the computational complexity associated
with such modeling can be a challenge during implementation. The semantic interpretation of the techniques in this area is more appealing than that of the classical
approach. It seems conceptually more robust to try to estimate the probability of a given
datum being the target given the user's actions and the information signatures of the
multimedia objects. The intuition that elaborate and complex modeling of the user
response pays off seems to be confirmed by Geman and Moquet (1999), who ascribe this behaviour
to the greater allowance for variation and subjectivity in human decisions.


The Machine Learning Approach


The problem of relevance feedback has been considered a machine-learning issue:
the approach is to learn user preferences over time. Typically, the query image and
the images marked relevant are used as positive training samples and the images marked
nonrelevant are used as negative training samples.
MacArthur et al. (2000) use decision trees as the basis of their relevance feedback
framework, which they call RFDT. The samples are used to partition the feature space and
classify the entire database of feature vectors based on the partitioning. RFDT stores
historical information from previous feedback and uses it while inducing the decision
tree. The gain in average precision is documented to improve with a higher number of
images marked relevant at each iteration during testing on high-resolution computed-tomography (HRCT) images of the human lung.
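As a hedged illustration of this style of feedback (our own minimal sketch using scikit-learn, not the RFDT algorithm itself), the marked examples, optionally pooled with samples retained from earlier iterations, train a decision tree that is then used to score and rank the whole collection:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def decision_tree_feedback(database, relevant_idx, nonrelevant_idx, history=None):
    """database: (n_items, n_features) array of feature vectors.
    relevant_idx / nonrelevant_idx: indices marked by the user this round
    (both assumed non-empty so that two classes are present).
    history: optional (X, y) pair of training samples kept from earlier rounds."""
    X = np.vstack([database[relevant_idx], database[nonrelevant_idx]])
    y = np.array([1] * len(relevant_idx) + [0] * len(nonrelevant_idx))
    if history is not None:
        X = np.vstack([X, history[0]])
        y = np.concatenate([y, history[1]])
    tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
    # rank the whole collection by the tree's estimated probability of relevance
    relevant_col = list(tree.classes_).index(1)
    ranking = np.argsort(-tree.predict_proba(database)[:, relevant_col])
    return ranking, (X, y)
```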
Koskela et al. (2002) demonstrate that the PicSOM image retrieval system (Laaksonen
et al., 2000) can be enhanced with relevance feedback. The assumption is that images
similar to each other are located near each other on the Self-Organising Map surfaces.
The semantic implications of these methods seem to have close parallels with those
of the vector space approach.
An interesting subclass of these techniques augments a classical or probabilistic
method by learning from past feedback sessions. For example, the application of a Market
Basket Analysis to data from log files to improve relevance feedback is described in
Müller et al. (2004). The idea is to identify pairs of images that have been marked together
(either as relevant or nonrelevant) and use the frequency of their being marked together
to construct association rules and calculate their probabilities. Depending on the
technique used to combine these probabilities with the existing relevance feedback
formula, significant gains in precision and recall can be achieved.
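The log-mining step can be sketched as counting how often pairs of images have been marked together across past sessions and turning the counts into confidence values for simple association rules (a schematic rendering with our own names, not the exact procedure of Müller et al., 2004):

```python
from collections import Counter
from itertools import combinations

def mine_co_marking(sessions, min_support=2):
    """sessions: list of sets; each set holds the ids of images that were
    marked together (as relevant or as nonrelevant) in one past session.
    Returns rule confidences such as P(j also marked | i marked)."""
    item_counts = Counter()
    pair_counts = Counter()
    for marked in sessions:
        item_counts.update(marked)
        pair_counts.update(combinations(sorted(marked), 2))

    rules = {}
    for (i, j), count in pair_counts.items():
        if count < min_support:
            continue
        rules[(i, j)] = count / item_counts[i]   # confidence of rule i -> j
        rules[(j, i)] = count / item_counts[j]   # confidence of rule j -> i
    return rules
```

The resulting confidences could then be blended with the scores produced by the regular feedback formula so that images frequently co-marked with the currently marked images are boosted.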
An unorthodox variation on the use of machine learning for relevance feedback
involves an attempt to minimize the user involvement. This is done by adapting a self-training neural network to model the notion of video similarity through automated
relevance feedback (Muneesawang & Guan, 2002). The method relies on a specialized
indexing paradigm designed to incorporate the temporal aspects of the information in a
video clip.
The machine learning methods usually place an emphasis on the user's classification of the results, which is a conceptually natural and appealing way to model relevance
feedback. However, obtaining an effective training sample may be a challenge. Further,
care needs to be taken to ensure that considerations of user subjectivity are not lost while
considering historical training data that may have been contributed by multiple users.

The Keyword Integration Approach


A method for integrating semantic (keyword) features and low-level features is
presented in Lu et al. (2000). The low-level features are dealt with in a vector-space style.
A semantic network is built to link the keywords to the images. An extension to this
approach with cross-session memory is presented in Zhang et al. (2003). These techniques show that the computationally efficient vector-space style approach (for low-level features) and the machine-learning approach (for keywords) can be integrated.
While these methods incorporate some elements of other approaches, we present them
separately since the integration of keyword as well as visual features is their distinguishing characteristic rather than their use of such techniques. These techniques enable the
so-called cross-modal queries, which allow users to express queries in terms of keywords
and yet have low-level features taken into account and vice versa. Further, Zhang et al.
(2003) are able to utilise a learning process to propagate keywords to unlabelled images
based on the user's interaction. The motivation is that in some cases the user's
information need is better expressed in terms of keywords (possibly also combined with
low-level features) than in terms of only low-level features. It seems to be taken for
granted in these techniques that, in semantic terms, keywords represent a higher level
of information about the images than the visual features. We elaborate on these
assumptions in Section 3.
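A schematic of the keyword side of such integration, under our own naming and not the actual algorithms of Lu et al. (2000) or Zhang et al. (2003), might maintain weighted links between keywords and images that are strengthened or weakened by feedback and then combined with a low-level feature score:

```python
from collections import defaultdict

class KeywordImageNetwork:
    """Semantic network linking keywords to images with weighted edges."""
    def __init__(self):
        self.links = defaultdict(lambda: defaultdict(float))  # keyword -> image -> weight

    def feedback(self, query_keywords, relevant, nonrelevant, step=1.0):
        """Strengthen links between the query keywords and images the user
        marked relevant; weaken links to images marked nonrelevant.
        Relevant images that had no link yet acquire one (a simple form of
        keyword propagation to unlabelled images)."""
        for kw in query_keywords:
            for img in relevant:
                self.links[kw][img] += step
            for img in nonrelevant:
                self.links[kw][img] -= step

    def keyword_score(self, query_keywords, img):
        """Keyword-based part of the ranking score, to be combined with a
        vector-space score over the low-level visual features."""
        return sum(self.links[kw][img] for kw in query_keywords)
```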

The Mathematical-Conceptual Approach


There has been a recent emergence of techniques that interpret the relevance
feedback problem in a novel manner. Established conceptual and mathematical tools are
used to develop elaborate frameworks in order to capture the semantic richness and
diversity of the problem.
Grootjen and van der Weide (2002) acknowledge that it is difficult to grasp the user's
information need since it is an internal mental state. If the user were able to browse
through an entire document collection and pick out all the relevant ones, such a selection
may be interpreted as a representation of the information need of that user with respect
to the given collection. Reasoning in this manner implies that the result set returned is
an approximation of the user's information need. To ensure that this approximation is
close to the actual need, they postulate a semantic structure of interconnected nodes.
Each node represents a so-called concept and this overall structure becomes a concept
lattice and can be derived using the theory of Formal Concept Analysis (Wille, 1982).
Zhuang et al. (2002) have developed a two-layered graph theoretic model for
relevance feedback in image retrieval that aims to capture the correlations between
images. The two layers are the visual layer and the semantic layer. Each layer is an
undirected graph with the nodes representing images and the links representing correlations. The nodes in the two layers actually represent the same images but the
correlations are interpreted differently. A correlation between images in the visual layer
denotes a similarity based on low-level features while a correlation in the semantic layer
represents a high-level semantic correlation. The visual links are obtained in an offline
manner at the time images are added to the database. An algorithm to learn the semantic
links based on feedback is outlined in the paper. These semantic links are then exploited
in subsequent iterations. The semantic information becomes more beneficial in the long
term as the links are updated.
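A minimal sketch of how the semantic layer might be strengthened from feedback (our own simplification of the idea, not the learning algorithm of Zhuang et al., 2002): each time two images are marked relevant in the same iteration, the semantic link between them is reinforced, and the accumulated links can later be used to pull in semantic neighbours of positively marked images.

```python
from collections import defaultdict

class TwoLayerGraph:
    def __init__(self, visual_links):
        # visual layer: precomputed offline from low-level feature similarity
        self.visual = visual_links                 # {(i, j): similarity}
        # semantic layer: starts empty and is learned from feedback
        self.semantic = defaultdict(float)         # {(i, j): correlation}

    def update_from_feedback(self, relevant, increment=1.0):
        """Strengthen semantic links between images co-marked as relevant."""
        relevant = sorted(relevant)
        for a in range(len(relevant)):
            for b in range(a + 1, len(relevant)):
                self.semantic[(relevant[a], relevant[b])] += increment

    def semantic_neighbours(self, image_id, top_k=5):
        """Images most strongly semantically correlated with image_id."""
        scores = defaultdict(float)
        for (i, j), w in self.semantic.items():
            if i == image_id:
                scores[j] += w
            elif j == image_id:
                scores[i] += w
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
```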
Zutshi et al. (2003) focus strongly on the feedback provided by the user, considering it a classification problem and modeling it using rough set theory. A rough set
decision system is constructed at each iteration of feedback and used to analyse the user
response. Two feature reweighting schemes were proposed, and the integration of this relevance
feedback technique with a probabilistic retrieval system was demonstrated.
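The flavour of such a decision system can be conveyed with a small sketch (our own, using the standard rough-set dependency degree rather than the authors' particular reweighting schemes): the marked results form a decision table of discretised feature values plus the user's judgement, and a feature can be weighted by how well it alone separates the decision classes.

```python
from collections import defaultdict

def dependency_degree(rows, attributes, decision):
    """Standard rough-set dependency: fraction of objects whose equivalence
    class (w.r.t. the chosen attributes) lies entirely in one decision class."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        classes[key].append(row[decision])
    consistent = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return consistent / len(rows)

# Decision system built from one feedback iteration: discretised feature
# values of the marked results plus the user's judgement (values invented).
rows = [
    {"colour": "dark",  "texture": "coarse", "decision": "relevant"},
    {"colour": "dark",  "texture": "fine",   "decision": "relevant"},
    {"colour": "light", "texture": "coarse", "decision": "nonrelevant"},
    {"colour": "dark",  "texture": "coarse", "decision": "nonrelevant"},
]

# Per-feature significance, usable as feature weights in the next iteration.
weights = {a: dependency_degree(rows, [a], "decision") for a in ("colour", "texture")}
```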

Analysis and Emerging Trends


A summary of the approaches reviewed is presented in Table 1 to establish their
advantages and disadvantages with respect to the management of multimedia semantics.

Table 1. Advantages and disadvantages of the various approaches with respect to multimedia semantics

Classical
  Advantages: Computationally more efficient than other approaches; mature.
  Disadvantages: Simplistic semantic interpretation of the user's information need (ideal query vector).

Probabilistic-Statistical
  Advantages: Conceptually appealing model of user response in the more sophisticated methods.
  Disadvantages: Computational costs can become prohibitive with complex modeling.

Machine Learning
  Advantages: Avoids loss of information gained across sessions.
  Disadvantages: Training sample can take time to accumulate; in some techniques user subjectivity is not explicitly catered for.

Keyword Integration
  Advantages: Can take advantage of keyword annotations as well as visual features, which may be able to capture the user's information need well.
  Disadvantages: Possible impedance mismatch between the different techniques used for visual and keyword features; difficulty in obtaining keyword annotations.

Mathematical-Conceptual
  Advantages: Few assumptions regarding the user's information need, hence more user-centric.
  Disadvantages: Computational costs can become prohibitive with complex modeling.
The review of the literature above suggests certain trends. There is a movement to
more general and comprehensive approaches. In the vector-space arena this can be seen
in the changes from MARS (Rui et al., 1997) to MindReader (Ishikawa et al., 1998) to the
novel framework of Rui and Huang (1999). The vector-space style interpretation of the
relevance feedback can be seen to be increasingly general, increasingly complex and
reliant on fewer artificial parameters in its evolution, if we may term it thus. The spirit of
recent generalised methods such as in Rui and Huang (1999) and Geman and Moquet's
(1999) stochastic model (an extension of the work by Cox et al., 1996) acknowledges a
much higher level of complexity than was previously recognised. The application of the
machine learning techniques and the augmentation of relevance feedback methods with
historical information also present a promising avenue. Extending these approaches to
incorporate per-user relevance feedback as well as the deduction of implicit truths that
may apply across a large number of users will probably continue. The use of keyword
features as a means to augment visual features is also a step forward. However, the
success of its adoption in applications other than Web-based ones may depend on the
feasibility of obtaining keyword annotations for large multimedia collections. Such
keyword annotations could perhaps be accumulated manually during user feedback.

CONSIDERATIONS FOR A
RELEVANCE FEEDBACK TECHNIQUE TO
HANDLE MULTIMEDIA SEMANTICS
While considering existing techniques from the perspective of selecting appropriate strategies for a proposed multimedia system or for developing a new one, several
factors need to be taken into account. Zhou and Huang (2001a) have outlined several
such factors while interpreting relevance feedback as a classification problem. Among
these are the small sample issue, that is, the fact that in a given session the user provides
a small number of training examples. Another key issue is the inherent asymmetry
involved the training is in the form of a classification while the desired output of
retrieval is typically a rank-ordered top-k return. They proceed to present a compilation
of a list of critical issues to consider while designing a relevance feedback algorithm
(Zhou & Huang, 2001b). While their analysis continues to be relevant and the issues
presented remain pertinent, they do not explicitly take into account the added complexities of certain semantic implications. We attempt to outline certain critical considerations
that need to be taken into account while selecting an existing relevance feedback
technique (or developing a new one) to be used with a semantically focussed multimedia
retrieval system. These requirements are based on the analysis and review of the
literature in the second section.

Conceptual Robustness and Soft Modeling of the User Response
If semantic issues are a concern in a multimedia retrieval system, the relevance
feedback must be based on a robust conceptual model, which captures at a high level the
complexities of relevance feedback and the issues of subjectivity. Since most existing
techniques are strongly analytical and often rely on what is essentially a reductionist
approach, missing out on important details or making simplifying assumptions too early
in the modeling process can lead to deficiencies at the implementation stage. For
instance, an elaborate vector-space-style model proposes at the very outset that the user
has an "... ideal query vector" in mind and that the distance of the sample vectors
from this ideal vector is a "generalised ellipsoid distance" (Ishikawa et al., 1998). It can
be seen that from a semantic viewpoint, if we acknowledge that the user's information
needs may be complex and highly abstract, claiming that an ideal query vector can
represent this may not be conceptually justifiable. The user's information needs in open-ended browsing may evolve based on the results initially presented to them. Indeed, the
users may not always start out with a fixed, clear query in their mind at all. To allow for
these possibilities, the methods listed under the Mathematical-Conceptual approach
(see that section) attempt to formulate a much higher level and a more general model of
the relevance feedback scenario. This means that simplifying assumptions are only made
at a later stage as and when they become necessary. This avoids the reductionist trap
whereby if something relatively minor is missing at the higher levels, the lower levels may
suffer greatly. In this sense, these approaches could be seen as having a conceptual
advantage. However, this is not to be taken to mean that we endorse all complex and
general models for relevance feedback blindly; nor that we deny that the "harder"
approaches are useful. We merely point out the conceptual appeal of modeling a complex
semantic issue in a fitting manner, and that doing so could be used as the basis of a more
thorough analysis and cater to a wider variety of circumstances.
An analogous argument can be made for the modeling of the user response.
Indicating a few results to be of some type can be challenging to interpret semantically.
There is much room for ambiguity; for example, does marking a given result relevant imply
that this result meets the user's need, or is that an indication that the need is not met but
the result marked relevant is a close approximation? Further, in the case when negative
examples are employed (i.e., the user is allowed to nominate results as either being
relevant or nonrelevant) what does the fact that the user left a result unmarked signify?
For instance, Cox et al. (1998) make a distinction between "absolute judgement", where
images being marked by the user are nominated as relevant while unmarked ones are
considered nonrelevant; and "relative judgement", which interprets the user feedback
to mean that the marked images are more relevant than the unmarked ones. Geman and
Moquet (1999) observe that their most general model with fewest simplifying assumptions may be the most suitable for modeling the human response, which, however, comes
with a penalty in terms of computational complexity. Again, a general model, which
incorporates a "soft" approach, could be seen as more conceptually robust than a
"harder" one.
Another important criterion for conceptual robustness is the compatibility of the
tools used while developing a relevance feedback strategy. For instance, in the keyword-integration techniques a semantic web based on keywords is integrated with a vector-space style representation of low-level features. Care must be taken to ensure that in
cases like these, when multiple tools are combined, conceptually compatible approaches
are used rather than fusing incompatible tools in an ad hoc manner.
Interestingly, Cox et al. (1996) take notice of the fact that PicHunter seems to be able
to place together images that are apparently semantically similar. This remains hard to
explain. They advance as conjecture that their algorithms produce probability distributions that are very complex functions of the feature sets used. This phenomenon of the
feature sets combined with user feedback producing seemingly semantic results does not
seem to be explored in great depth by the literature on relevance feedback and possibly
merits further investigation.

Computational Complexity and Effective Representation


Computational complexity and effective representation of the relevance feedback
problem tend to be conflicting requirements. For the sake of computational efficiency,
certain simplifying assumptions often have to be made, particularly in probabilistic
techniques. These simplifications are designed to make a technique efficient to implement but almost invariably result in a loss of richness of representation. Such simplifications include the assumption that the probability that a given datum is the target is
independent of who the user is, the assumption that the user's action in a particular
iteration is independent of the previous iterations (Cox et al., 1996) and the simplified
metrics of Geman and Moquet (1999) which offer computational benefits over their most
general (and arguably most effective) model. As is often the case, there is a trade-off to
be made. If semantic retrieval is critical, users may be willing to allow for longer
processing times, for example, while matching mug shots of criminal offenders to the
query image of a suspect, high retrieval accuracy may be important enough to warrant
longer waiting times. If, on the other hand, users are likely to use a retrieval system for
so-called category searches, they may prefer lower processing times with acceptable
results rather than very accurate results based on a sophisticated model of their
preferences. However, semantic considerations may add another dimension to the
computing complexity issue. The challenge is interpreting and representing the relevance feedback problem in a way that captures enough detail while remaining
computationally feasible. Ideally, the test of whether enough detail has been captured
could be based on the notion of emergent meaning (Santini et al., 2001). If a relevance
feedback technique can facilitate sufficiently complex interaction between the user and
the retrieval system for semantic behaviour to emerge as a result of the interaction, the
technique has effectively represented all aspects of the problem. The computational
issues are related, since if the user's interaction is useful in the emergence of semantic
behaviour by the system, extended waiting times will influence their state of mind
(frustration, disinterest) and possibly impair the semantic performance of the relevance
feedback technique.

Keyword Features and Semantics


While acknowledging the breakthrough made in recent approaches such as Lu et
al. (2000) and Zhang et al. (2003) whereby relevance feedback models can take advantage
of keywords associated with images, it must be noted that simply incorporating keywords
into a relevance feedback technique may not be sufficient to address the issue of
semantic retrieval. A keyword can be seen as a way to represent a concept, part of a
concept or multiple concepts depending on the keyword itself and its context. In
multimedia collections where there are a relatively small number of ground truths and/
or the concepts that specific keywords refer to are agreed upon, definite improvements
in retrieval can be achieved by incorporating keywords. These can be obtained from
textual data associated with a multimedia object such as existing captions or descriptions
in web pages. However, automatically associating keywords with collections of multimedia objects that have not been somehow manually annotated remains difficult. A
suitable position for a relevance feedback technique to take would be to allow for the use
of keywords so that if they can be obtained (or are already available) their properties can
be exploited. This is indeed done in Zhang et al. (2003).
It is worth bearing in mind that, depending on the application domain, visual
features and keywords may have varying distinguishing characteristics. In relatively
homogenous collections requiring expert knowledge, for instance, keywords would
clearly be very useful, for example, in finding images similar to those of a given species
of flower (www.unibas.ch/botimage). In this case, it is likely that the user's semantic
perception can be reflected in the fact that a domain expert (or a classifier) has assigned
a particular label (the name of the species) to a given image. In other contexts, say a
collection of images of clothes, the user may be looking for matches such as blue shirts
but not trousers, which could be addressed by a combination of visual features (colour
and shape), or by a combination of keywords and visual features (colour and textual
annotation of garment type) or even completely by keywords (textual annotations of
colour and garment type). In such cases, it may not be semantically appropriate to
consider keywords as being higher-level features than other visual features, as seems
to be done by Zhang et al. (2003). This indicates that constructing a semantic web on
top of a multimedia database might not accurately reflect the semantic content of a
multimedia collection.
A final point regarding the use of keywords as a feature: Keywords may be
inadequate to capture the semantic content of media due to the richness of the content
(Dorai et al., 2002). The role of the relevance feedback mechanism then becomes to try
and adapt to the user by capturing the interaction between keywords and visual
information to mimic the user's semantic perception. A future direction along these lines
may be to combine the work of Zhang et al. (2003) with the work of Rui and Huang (1999),
which incorporates the two-layer model for visual information. It may then be possible
to build parallel or even interconnecting networks of visual and textual features in a step
towards truly capturing the semantic content of multimedia objects.

Limitations of Precision and Recall


A perusal of the literature on relevance feedback demonstrates that by far the most
common method of indicating performance is to present precision-recall measures. This
may make it tempting to select a relevance feedback technique that has been documented
as producing the best gain in terms of precision and recall. However, this may not
necessarily be advisable in the case of a multimedia retrieval system that aims to handle
semantic content. From this point of view, two major criticisms can be made of the use
of precision and recall.
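For reference, these measures are usually computed per query over a ranked result list roughly as follows (standard definitions, sketched with our own function names):

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked list of result ids returned by the system.
    relevant:  set of ids judged relevant for this query (usually fixed in
    advance by the experimenters rather than by the end user)."""
    top_k = retrieved[:k]
    hits = sum(1 for r in top_k if r in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g., with ground truth {0, 3, 6, ...} two of the first five results are hits
p, r = precision_recall_at_k(list(range(20)), set(range(0, 30, 3)), k=5)
```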
Firstly, the practical utility of these measures from an end-user point of view (as
opposed to the point of view of researchers) has been questioned. For example, a study
outlined in Su (1994) contained the conclusion that precision was not a good indicator
of user perceptions of quality of content-based image retrieval. In this study, recall was
not evaluated as a quality measure since the authors point out that this would involve
advance knowledge of the contents of the image database by the end users; an unrealistic
assumption for practical image database systems. Such criticism of recall and precision
has in fact been made almost from the time of their conception. For example, it is stated
in Cleverdon (1974) that the use of recall and precision in highly controlled experimental
settings did not necessarily translate well to operational systems. A key contention
common to these negative analyses of recall and precision would appear to be that simple
numerical scores of researcher-determined relevance judgments do not capture the
semantic intentions of real users.
The second criticism to be levelled at recall and precision (and indeed one that can be applied
to many attempts at objective evaluation of content-based multimedia retrieval) concerns
the nature of relevance itself. Relevance is directly correlated with the highly subjective
notion of similarity and is therefore itself highly subjective. The research in this area
indicates that relevance is itself a difficult concept to define. Many different definitions
of relevance are in use in information retrieval (Mizzaro, 1998). In the textual information
retrieval arena, concepts such as contextual pertinence (Ruthven & van Rijsbergen, 1996)
and situational relevance (Wilson, 1973) are also used to further express semantic
information needs of users. Attempts at evaluation of multimedia retrieval systems based
on numeric relevance measures, particularly those drawing comparisons with other systems,
should be made with caution. As stated by Hersh (1994), these measures are important, but
they must be placed in perspective and particular attention paid to their meaning.

USER-CENTRIC MODELING
OF RELEVANCE FEEDBACK
From the review of various feedback techniques in the second section and the
requirements of relevance feedback outlined in the third section, it can be seen that much
of the current literature suggests a strong focus on the underlying retrieval engine.
Indeed the existing literature often tends to be constrained by the value set of the
features, for example, the separate handling of numerical and keyword features (Lu et al.,
2000; Zhang et al., 2003). In this section we make a case for and present an alternate
perspective.
Nastar et al. (1998) point out that image databases can be classified into two types.
The first type is the homogeneous collection where there is a ground truth regarding
perceptual similarity. This may be implicitly obvious to human users, such as in
photographs of people where two images are either of the same person or not. Otherwise,
experts may largely agree upon similarity, for example, two images either represent the
same species of flower, or they do not. However, in the second type, the heterogeneous
collection, no ground truth may be available. A characteristic example of this type is a
stock collection of photographs. Systems designed to perform retrieval on this second
category of collection should therefore be as flexible as possible, adapting to and
learning from each user in order to satisfy their goal (Nastar et al., 1998). This
classification has significance in the context of databases of other multimedia objects
such as video and audio clips.
In homogeneous collections, retrieval strategies may perform adequately even if
they are not augmented by relevance feedback. This is because the feature set and
similarity measure can be chosen in order to best reflect the accepted ground truths. This
would not be true of heterogeneous collections. Relevance feedback becomes increasingly necessary with heterogeneity in large multimedia collections. When there is no
immediately obvious ground truth, the semantic considerations can become extremely
complex. We try to illustrate this complexity in Figure 1, which uses notation similar to
that of ER Diagrams, while eschewing the use of cardinality. The user's information need
may be specific to any of several relevance entities: the users themselves (subjectivity);
the context of their situation (e.g., while considering fashion, the keyword model would
map to a different semantic concept than if cars were being considered); or the semantic
concept(s) associated with their information need (which may or may not be expressible
in terms of keywords and visual features). There would then be zero or more multimedia
objects that would be associated with the user's information need. The key point is that
the information need is potentially specific to all four entities in the diagram.
Figure 1. Complex information need in a heterogeneous collection (an ER-style diagram relating the entities User, Context, Semantic Concept, Multimedia Object, and Information Need)

It can be seen that while considering whether a multimedia object satisfies a user's
information need, all the factors that are to be taken into account are based on the user.
This clearly highlights the importance of relevance feedback for semantic multimedia
retrieval: feedback provided by the user can be used to ensure that the concept
symbolising their information need is not neglected. It follows that the underlying model
of relevance feedback must be explicitly user-centric and semantically based to better
meet the user's information needs.
In order to identify a way to address this challenge, the overall multimedia process
can be reviewed. A high-level overview of the information flow in the relevance feedback
and multimedia retrieval process is presented in Figure 2. In existing approaches the
retrieval mechanism and the relevance feedback module have often been formulated as
closely integrated with each other, such as in the classical approach. In the extreme case,
the initial query can be interpreted as the initial iteration of user feedback. For example,
in PicHunter (Cox et al., 1996; Cox et al., 1998) and MindReader (Ishikawa et al., 1998), there
is no real distinction between the relevance feedback and the retrieval.
As an alternative approach, it is possible to interpret the task of relevance feedback
as being distinct from retrieval, by identifying and separating their roles. Retrieval tries
to answer the question "Which objects are similar to the query specification?" Feedback
deals with the questions "What does the fact that objects were marked in this fashion tell
us about what the user is interested in?" and "How can this information be best conveyed
to the retrieval engine?" As reflected in Figure 2, the task of relevance feedback can then
be considered a module in the overall multimedia retrieval process.
If the relevance feedback technique is modeled in a general fashion, it can focus on
interpreting user input and be loosely coupled to the retrieval engine and the feature set.
By reducing the dependency of the relevance feedback module on the feature set and its
value set, the incorporation of novel visual and semantic features as they become available would be possible.

Figure 2. A high-level overview of the multimedia retrieval process (components include the User Interface, Feature Extractor, Retrieval Engine, Relevance Feedback Module, Object Collection and Log, exchanging the Query, Query Specification, Feature Descriptions, Result Set, Classified Results and Query Modification Specification)
A highly general relevance feedback model would also
support user feedback in terms of an arbitrary number of classes, that is, support the
classification of the result set into the categories "relevant" and "unmarked"; "relevant",
"nonrelevant" and "unmarked"; or even, say, results given an integral ranking between one
and five. The interpretations of each case would, of course, have to be consistent to
maintain semantic integrity.
An interesting effect of modeling relevance feedback in this way is the possibility for the use
of multiple relevance feedback modules with the same retrieval engine. A relevance
feedback controller could then potentially be developed to apply the most suitable
relevance feedback algorithm depending on specific users' requirements in the context
of the collection. This would mean that the controller would have the task of identifying
which relevance feedback technique would best approximate the user's information need
at the semantic level.
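As a purely illustrative sketch of what such low coupling could look like (the interface and class names below are our own and do not correspond to any of the systems surveyed), the relevance feedback module sees only the user's classification of the result set and returns an abstract query modification, while a controller chooses which module to apply for a given user and context:

```python
from abc import ABC, abstractmethod

class QueryModificationSpecification:
    """Abstract description of how the next query should change, expressed
    in terms of example objects rather than concrete feature weights."""
    def __init__(self, emphasise=None, de_emphasise=None):
        self.emphasise = emphasise or []        # objects to move towards
        self.de_emphasise = de_emphasise or []  # objects to move away from

class RelevanceFeedbackModule(ABC):
    @abstractmethod
    def interpret(self, result_set, classification):
        """Turn the user's classification of the result set (e.g. a mapping
        result id -> 'relevant' / 'nonrelevant' / 'unmarked', or an integer
        ranking) into a query modification specification, without assuming
        anything about the retrieval engine's feature set."""

class ExampleBasedFeedback(RelevanceFeedbackModule):
    def interpret(self, result_set, classification):
        pos = [r for r in result_set if classification.get(r) == "relevant"]
        neg = [r for r in result_set if classification.get(r) == "nonrelevant"]
        return QueryModificationSpecification(emphasise=pos, de_emphasise=neg)

class FeedbackController:
    """Chooses the module judged most suitable for this user and context."""
    def __init__(self, modules):
        self.modules = modules

    def select(self, user_profile, context=None):
        # placeholder policy; a real controller could learn this choice from
        # the accumulated feedback log
        return self.modules[0]
```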

CONCLUSION
There is a rich and diverse collection of relevance feedback techniques described
in the literature. Any multimedia retrieval system aiming to cater to users' needs at a
semantic or close to semantic level of performance should certainly incorporate a
relevance feedback strategy as a measure towards narrowing the semantic gap. An
understanding of the generic types of techniques that are available and what bearing
they have on the specific goals of the multimedia retrieval system in question is essential.
The situation is complicated by the fact that in the literature such techniques are often
supported by precision and recall figures and other such objective measures. While
these figures can reveal certain characteristics about the relevance feedback and its
enhancement of the retrieval process, it is safe to say that such figures alone are not
sufficient to compare alternatives, especially in semantic terms. It would appear that at
the moment there are no definitive comparison criteria. However, a subjective consideration of available techniques can be made in accordance with the guidelines outlined and
the conceptual alignment of a relevance feedback technique with the goals of the given
retrieval system. Such an approach has the benefit of being able to eliminate techniques
should they not meet the requirements envisioned for the retrieval system. Should
existing trends continue and relevance feedback systems continue to become more
complex and general, it may be possible to implement a relevance feedback controller, or
a meta-relevance feedback module that can use information gathered from the user's
feedback to actually select and refine the system's relevance feedback strategy to better
meet the user's specific information needs at a semantic level.

REFERENCES

Cleverdon, C. W. (1974). User evaluation of information retrieval systems. Journal of


Documentation, 30, 170-180.
Cox, I., Miller, M., Omohundro, S., & Yianilos, P. (1996). PicHunter: Bayesian relevance
feedback for image retrieval. In Proceedings of the 13th International Conference
on Pattern Recognition (Vol. 3, pp. 361-369).
Cox, I. J., Miller, M. L., Minka, T. P., & Yianilos, P. N. (1998). An optimized interaction
strategy for bayesian relevance feedback. In IEEE Conf. on Comp. Vis. and Pattern
Recognition, Santa Barbara, California (pp. 553-558).
Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T., & Zettl, H. (2002). Media
semantics: Who needs it and why? In Proceedings of the 10th ACM International
Conference on Multimedia (pp. 580 -583).
Geman, D., & Moquet, R. (1999). A stochastic feedback model for image retrieval.
Technical report, Ecole Polytechnique, 91128 Palaiseau Cedex, France.
Grootjen, F. & van der Weide, T. P. (2002). Conceptual relevance feedback. In IEEE
International Conference on Systems, Man and Cybernetics (Vol. 2, pp. 471-476).
Hersh, W. (1994). Relevance and retrieval evaluation: Perspectives from medicine.
Journal of the American Society for Information Science, 45(3), 201-206.
Ishikawa, Y., Subramanya, R., & Faloutsos, C. (1998). MindReader: Querying databases
through multiple examples. In Proceedings of the 24th International Conference
on Very Large Data Bases, VLDB (pp. 218-227).
Kang, H.-B. (2003). Emotional event detection using relevance feedback. In Proceedings
of the International Conference on Image Processing (Vol. 1, pp. 721-724).
Koskela, M., Laaksonen, J., & Oja, E. (2002). Implementing relevance feedback as
convolutions of local neighborhoods on self-organizing maps.
In Proceedings of the International Conference on Artificial Neural Networks
(pp. 981-986), Madrid, Spain.
Laaksonen, J., Koskela, M., Laakso, S. P., & Oja, E. (2000). PicSOM: Content-based image
retrieval with self-organizing maps. Pattern Recognition Letters, 21, 1199-1207.
Liu, M., & Wan, C. (2003). Weight updating for relevance feedback in audio retrieval. In
Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 03) (Vol. 5, pp. 644-647).
Lu, Y., Hu, C., Zhu, X., Zhang, H., & Yang, Q. (2000). A unified framework for semantics
and feature based relevance feedback in image retrieval systems. In Proceedings
of the eighth ACM international conference on Multimedia, Marina del Rey,
California (pp. 31-37).
MacArthur, S., Brodley, C., & Shyu, C.-R. (2000). Relevance feedback decision trees in
content-based image retrieval. In Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries (pp. 68-72).
Mizzaro, S. (1998). How many relevances in information retrieval? Interacting with
Computers, 10(3), 305-322.
Müller, H., Squire, D., & Pun, T. (2004). Learning from user behaviour in image retrieval:
Application of market basket analysis. International Journal of Computer Vision,
56(1-2), 65-77.
Muneesawang, P., & Guan, L. (2002). Video retrieval using an adaptive video indexing
technique and automatic relevance feedback. In Proceedings of the
IEEE Workshop on Multimedia Signal Processing (pp. 220-223).
Nastar, C., Mitschke, M., & Meilhac, C. (1998). Efficient query refinement for image
retrieval. In Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, Santa Barbara, California (pp. 547-552).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information
Sciences, 11, 341-356.

Rui, Y., & Huang, T. S. (1999). A novel relevance feedback technique in image retrieval.
In Proceedings of the Seventh ACM International Conference on Multimedia
(Part 2), Orlando, Florida (pp. 67-70).
Rui, Y., Huang, T., & Mehrotra, S. (1997). Content-based image retrieval with relevance
feedback in MARS. In Proceedings of the IEEE International Conference on
Image Processing (pp. 815-818).
Ruthven, I., & van Rijsbergen, C. J. (1996). Context generation in information retrieval.
In Proceedings of the Florida Artificial Intelligence Research Symposium.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Su, L. T. (1994). The relevance of recall and precision in user evaluation. Journal of the
American Society for Information Science, 45(3), 207-217.
Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of
concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). D. Reidel Publishing Company.
Wilson, C., & Srinivasan, B. (2002). Multiple feature relevance feedback in content based
image retrieval using probabilistic inference networks. In Proceedings of the First
International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD02a), Singapore (pp. 651-655).
Wilson, C., Srinivasan, B., & Indrawan, M. (2001). BIR - The Bayesian network image
retrieval system. In Proceedings of the IEEE International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP2001) (pp. 304-307).
Hong Kong SAR, China.
Wilson, P. (1973). Situational relevance. Information Retrieval and Storage, 9, 457-471.
Zhang, H., Chen, Z., Li, M., & Su, Z. (2003). Relevance feedback and learning in content-based image search. In World Wide Web: Internet and Web Information Systems,
6, (pp. 131-155). The Netherlands: Kluwer Academic Publishers.
Zhou, X. S., & Huang, T. S. (2001a). Comparing discriminating transformations and SVM
for learning during multimedia retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia (pp. 137-146). Ottawa, Canada.
Zhou, X. S., & Huang, T. S. (2001b). Exploring the nature and variants of relevance
feedback. In IEEE Workshop on Content-Based Access of Image and Video
Libraries (CBAIVL 2001), 94-101.
Zhuang, Y., Yang, J., Li, Q., & Pan, Y. (2002). A graphic-theoretic model for incremental
relevance feedback in image retrieval. In International Conference on Image
Processing, 1, 413-416.
Zutshi, S., Wilson, C., Krishnaswamy, S., & Srinivasan, B. (2003). Modelling relevance
feedback using rough sets. In Proceedings of the Fifth International Conference
on Advances in Pattern Recognition (ICAPR 2003) (pp. 495-500).


Section 4
Managing Distributed
Multimedia


Chapter 13

EMMO:

Tradeable Units of Knowledge-Enriched Multimedia Content


Utz Westermann, University of Vienna, Austria
Sonja Zillner, University of Vienna, Austria
Karin Schellner, ARC Research Studio Digital Memory Engineering,
Vienna, Austria
Wolfgang Klas, University of Vienna and
ARC Research Studio Digital Memory Engineering, Vienna, Austria

ABSTRACT

Current semantic approaches to multimedia content modeling treat the content's
media, the semantic description of the content, and the functionality performed on the
content, such as rendering, as separate entities, usually kept on separate servers in
separate files or databases and typically under the control of different authorities. This
separation of content from its description and functionality hinders the exchange and
sharing of content in collaborative multimedia application scenarios. In this chapter,
we propose Enhanced Multimedia MetaObjects (Emmos) as a new content modeling
formalism that combines multimedia content with its description and functionality.
Emmos can be serialized and exchanged in their entirety, covering media, description,
and functionality, and are versionable, thereby establishing a suitable foundation
for collaborative multimedia applications. We outline a distributed infrastructure for
Emmo management and illustrate the benefits and usefulness of Emmos and this
infrastructure by means of two practical applications.

INTRODUCTION
Today's multimedia content formats such as HTML (Raggett et al., 1999), SMIL
(Ayars et al., 2001), or SVG (Ferraiolo et al., 2003) primarily encode the presentation of
content but not the information the content conveys. But this presentation-oriented
modeling only permits the hard-wired presentation of multimedia content exactly in the
way specified; for advanced operations like retrieval and reuse, automatic composition,
recommendation, and adaptation of content according to user interests, information
needs, and technical infrastructure, valuable information about the semantics of content
is lacking.
In parallel to research on the Semantic Web (Berners-Lee et al., 2001; Fensel, 2001),
one can therefore observe a shift in paradigm towards a semantic modeling of multimedia
content. The basic media of which multimedia content consists are supplemented with
metadata describing these media and their semantic interrelationships. These media and
descriptions are processed by stylesheets, search engines, or user agents providing
advanced functionality on the content that can exceed mere hard-wired playback.
Current semantic multimedia modeling approaches, however, largely treat the content's basic media, the semantic description, and the functionality offered on the content as separate entities: the basic media of which multimedia content consists are typically stored on web or media servers; the semantic descriptions of these media are usually stored in databases or in dedicated files on web servers using formats like RDF (Lassila & Swick, 1999) or Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000); the functionality on the content is normally realized as servlets or stylesheets running in application
This inherent separation of media, semantic description, and functionality in
semantic multimedia content modeling, however, hinders the realization of multimedia
content sharing as well as collaborative applications which are gaining more and more
importance, such as the sharing of MP3 music files (Gnutella, n.d.) or learning materials
(Nejdl et al., 2002) or the collaborative authoring and annotation of multimedia patient
records (Grimson et al., 2001). The problem is that exchanging content today in such
applications simply means exchanging single media files. An analogous exchange of
semantically modeled multimedia content would have to include content descriptions
and associated functionality, which are only coupled loosely to the media and usually
exist on different kinds of servers potentially under control of different authorities, and
which are thus not easily moveable.
In this chapter, we give an illustrated introduction to Enhanced Multimedia
MetaObjects (Emmo), a semantic multimedia content modeling approach developed with
collaborative and content sharing applications in mind. Essentially, an Emmo constitutes
a self-contained piece of multimedia content that merges three of the content's aspects into a single object: the media aspect, that is, the media which make up the multimedia content; the semantic aspect, which describes the content; and the functional aspect, by which an Emmo can offer meaningful operations on the content and its description that can be invoked and shared by applications. Emmos in their entirety, including media, content description, and functionality, can be serialized into bundles and are
versionable: essential characteristics that enable their exchangeability in content
sharing applications as well as the distributed construction and modification of Emmos
in collaborative scenarios.

Furthermore, this chapter illustrates how we employed Emmos for two concrete
collaborative and content sharing applications in the domains of cultural heritage and
digital music archives.
The chapter is organized as follows: we begin with an overview of Emmos and show how they differ from existing approaches to multimedia content modeling. We then introduce the conceptual model behind Emmos and outline a distributed Emmo container infrastructure for the storage, exchange, and collaborative construction of Emmos. We then apply Emmos to the representation of multimedia content in two application scenarios. We conclude the chapter with a summary and give an outlook on our current and future work.

BACKGROUND
In this section, we provide a basic understanding of the Emmo idea by means of an
illustrating example. We show the uniqueness of this idea by relating Emmos to other
approaches to multimedia content modeling in the field.

The Emmo Idea


The motivation for the development of the Emmo model is the desire to realize
multimedia content sharing and collaborative applications on the basis of semantically
modeled content but to avoid the limitations and difficulties of current semantic modeling
approaches implied by their isolated treatment of media, semantic description, and
content functionality.
Following an abstract vision originally formulated by (Reich et al., 2000), the
essential idea behind Emmos is to keep semantic description and functionality tied to the
pieces of content to which they belong, thereby creating self-contained units of
semantically modeled multimedia content easier to exchange in content sharing and
collaborative application scenarios. An Emmo coalesces the basic media of which a piece
of multimedia content consists (i.e., the contents media aspect), the semantic description of these media (i.e., the contents semantic aspect), and functionality on the content
(i.e., the contents functional aspect) into a single serializeable and versionable object.
Figure 1 depicts a sketch of a simplified Emmo, which models a small multimedia
photo album of a holiday trip of an imaginary couple Paul and Mary and their friend Peter.
The bottom of the figure illustrates how Emmos address the media aspect of multimedia
content. An Emmo acts as a container of the basic media of which the content that is
represented by the Emmo consists. In our example, the Emmo contains four JPEG images
which constitute the different photographs of the album along with corresponding
thumbnail images.
Media can be contained either by inclusion, that is, raw media data is directly embedded within an Emmo, or by reference via a URI if embedding the raw media data is not feasible because of its size or because the medium is a live stream.
An Emmo further carries a semantic description of the basic media it contains
and the associations between them. This semantic aspect, illustrated to the upper left
of Figure 1, makes an Emmo a unit of knowledge about the multimedia content it
represents.


Figure 1. Aspects of an Emmo


[Figure content: three areas labeled Semantics, Functionality, and Media. The semantic area relates the persons Mary, Paul, and Peter (with Paul and Mary typed as Family Member and Peter as Friend) and the locations Paris, Vienna, Salzburg, France, and Austria to the logical media nodes Picture 1 to Picture 4, each annotated with the date at which it was shot, via depicts, location, part-of, instance-of, and is-a associations. The functionality area lists the operations renderAsSlideShow(persons, locations, dates) and renderAsMap(persons, locations, dates). The media area shows the files picture001.jpg to picture004.jpg together with their thumbnail images.]

For content description, Emmos apply an expressive concept graph-like data model
similar to RDF and Topic Maps. In this graph model, the description of the content
represented by an Emmo is not performed directly on the media that are contained in the
Emmo; instead, the model abstracts from physical media, making it possible to subsume several media objects which constitute only different physical manifestations of logically one and the same medium under a single media node. This is a convenient way to capture alternative media. In Figure 1, for example, each media node Picture 1 to Picture 4 subsumes not only a photo but also its corresponding thumbnail image.
Apart from media, nodes can also represent abstract concepts. By associating an
Emmos media objects with such concepts, it is possible to create semantically rich
descriptions of the multimedia content the Emmo represents. In Figure 1, for instance,
it is expressed that the logical media nodes Picture 1 to Picture 4 constitute photos taken in Paris, Vienna, and Salzburg, showing Peter and Paul, Paul and Mary, and Mary,
respectively. The figure further indicates that nodes can be augmented with primitive
attribute values for closer description: the pictures of the photo album are furnished with
the dates at which they have been shot.
By associating concepts with each other, it is also possible to express domain
knowledge within an Emmo. It is stated in our example that Peter, Paul, and Mary are
Persons, that Paul and Mary are family members, that Peter is a friend, that Paris is located
in France, and that Vienna and Salzburg are parts of Austria.
The Emmo model does not predefine the concepts, association types, and primitive attributes available for media description; these can be taken from arbitrary, domain-specific ontologies. While Emmos thus constitute a very generic, flexible, and expressive approach to multimedia content modeling, they are not a ready-to-use formalism but require an agreed-upon common ontology before they can be employed in an application.


Figure 2. Emmo functionality

Finally, Emmos also address the functional aspect of content. An Emmo can offer operations that can be invoked by applications in order to work with the content the Emmo represents in a meaningful manner. As shown to the top right of Figure 1, our example Emmo provides two operations supporting two different rendition options for the photo album, which are illustrated by the screenshots of Figure 2. As indicated by the left screenshot, the operation renderAsSlideShow() might know how to render the photo album as a classic slide show, given a set of persons, locations, and time periods of interest, on the basis of the contained pictures and their semantic description, by generating an appropriate SMIL presentation. As indicated by the right screenshot, the operation renderAsMap() might know how to render the photo album, given the same data, as a map with thumbnails pointing to the locations where photographs have been taken, by constructing an SVG graphic.
One may think of many further uses of operations. For example, operations could
also be offered for rights clearance, displaying terms of usage, and so forth.
Emmos have further properties: an Emmo can be serialized and shared in its entirety in a distributed content sharing scenario, including its contained media, the semantic description of these media, and its operations. In our example, this means that Paul can pass the photo album Emmo as a whole to Peter, for instance via email or a file-sharing peer-to-peer infrastructure, and Peter can do anything with the Emmo that Paul can also do, including invoking its operations.
Emmos also support versioning. Every constituent of an Emmo is versionable, an
essential prerequisite for applications requiring the distributed and collaborative
authoring of multimedia content. This means that Peter, having received the Emmo from
Paul, can add his own pictures to the photo album while Paul can still modify his local
copy. Thereby, two concurrent versions of the Emmo are created. As the Emmo model
is able to distinguish both versions, Paul can merge them into a final one when he receives
Peter's changes.

Related Approaches
The fundamental idea underlying the concept of Emmos presented beforehand is
that an Emmo constitutes an object unifying three different aspects of multimedia content, namely the media aspect, the semantic aspect, and the functional aspect. In the following, we fortify our claim that this idea is unique.
Interrelating basic media like single images and videos to form multimedia content
is the task of multimedia document models. Recently, several standards for multimedia
document models have emerged (Boll et al., 2000), such as HTML (Raggett et al., 1999), XHTML+SMIL (Newman et al., 2002), HyTime (ISO/IEC JTC 1/SC 34/WG 3, 1997),
MHEG-5 (ISO/IEC JTC 1/SC 29, 1997), MPEG-4 BIFS and XMT (Pereira & Ebrahimi, 2002),
SMIL (Ayars et al., 2001), and SVG (Ferraiolo et al., 2003). Multimedia document models
can be regarded as composite media formats that model the presentation of multimedia
content by arranging basic media according to temporal, spatial, and interaction relationships. They thus mainly address the media aspect of multimedia content. Compared to
Emmos, however, multimedia document models neither interrelate multimedia content
according to semantic aspects nor do they allow providing functionality on the content.
They rely on external applications like presentation engines for content processing.
As a result of research concerning the Semantic Web, a variety of standards have
appeared that can be used to model multimedia content by describing the information it
conveys on a semantic level, such as RDF (Lassila & Swick, 1999; Brickley & Guha, 2002),
Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000), MPEG-7 (especially MPEG-7's graph
tools for the description of content semantics (ISO/IEC JTC 1/SC 29/WG 11, 2001)), and
Conceptual Graphs (ISO/JTC1/SC 32/WG 2, 2001). These standards clearly cover the
semantic aspect of multimedia content. As they also offer means to address media within
a description, they undoubtedly refer to the media aspect of multimedia content as well.
Compared to Emmos, however, these approaches do not provide functionality on
multimedia content. They rely on external software like database and knowledge base
technology, search engines, user agents, and so forth, for the processing of content
descriptions. Furthermore, media descriptions and the media described are separate
entities potentially scattered around different places on the Internet, created and
maintained by different and unrelated authorities not necessarily aware of each other and not necessarily synchronized, whereas Emmos combine media and their semantic
relationships into a single indivisible unit.
There exist several approaches that represent multimedia content by means of
objects. Enterprise Media Beans (EMBs) (Baumeister, 2002) extend the Enterprise Java
Beans (EJBs) architecture (Matena & Hapner, 1998) with predefined entity beans for the
representation of basic media within enterprise applications. These come with rudimentary access functionality but can be extended with arbitrary functionality using the inheritance mechanisms available to all EJBs. Though addressing the media and functional aspects of content, EMBs, in comparison to Emmos, are mainly concerned with single-media content and not with multimedia content. Furthermore, EMBs do not offer any dedicated
support for the semantic aspect of content.
Adlets (Chang & Znati, 2001) are objects that represent individual (not necessarily
multimedia) documents. Adlets support a fixed set of predefined functionality which
enables them to advertise themselves to other Adlets. They are thus content representations that address the media as well as the functional aspect. Different from Emmos,
however, the functionality supported by Adlets is limited to advertisement and there is
no explicit modeling of the semantic aspect.
Tele-Action Objects (TAOs) (Chang et al., 1995) are object representations of
multimedia content that encapsulate the basic media of which the content consists and
interlink them with associations. Though TAOs thus address the media aspect of
multimedia content in a way similar to Emmos, they do not adequately cover the semantic
aspect of multimedia content: only a fixed set of five association types is supported, mainly concerned with temporal and spatial relationships for presentation purposes.
TAOs can further be augmented with functionality. Such functionality is, in contrast to
the functionality of Emmos, automatically invoked as the result of system events and not
explicitly invoked by applications.
Distributed Active Relationships (Daniel et al., 1998) define an object model based
on the Warwick Framework (Lagoze et al., 1996). In the model, Digital Objects (DOs),
which are interlinked with each other by semantic relationships, act as containers of
metadata describing multimedia content. DOs thus do not address the media aspect of
multimedia content but focus on the semantic aspect. The links between containers can
be supplemented with arbitrary functionality. As a consequence, DOs take account of
the functional aspect as well. Different from Emmos, however, the functionality is not
explicitly invoked by applications but implicitly whenever an application traverses a link
between two DOs.

ENHANCED MULTIMEDIA META OBJECTS


Having motivated and illustrated the basic idea behind them, this section semiformally introduces the conceptual model underlying Emmos using UML class diagrams. A formal definition of this model can be found in Schellner et al. (2003). The
discussion is oriented along the three aspects of multimedia content encompassed by
Emmos: the media aspect, the semantic aspect, and the functional aspect.

Figure 3. Management of basic media in an Emmo


[Class diagram: a Connector refers to exactly one MediaProfile and optionally to a MediaSelector. MediaProfile carries low-level attributes such as audioChannels, bandWidth, bitRate, colorDomain, contentType, duration, fileFormat, fileSize, fontSize, fontStyle, frameRate, height, profileID, qualityRate, resolution, samplingRate, and width, and has one or more MediaInstances (inlineMedia : Byte[], locationDescription : String, mediaURL : URL). MediaSelector has the subclasses FullSelector, TemporalSelector (beginMs, durationMs), SpatialSelector (startX, startY, endX, endY), TextualSelector (beginChar, endChar), and CompositeSelector (compositionType).]


Media Aspect
Addressing the media aspect of multimedia content, an Emmo encapsulates the
basic media of which the content it represents is composed. Figure 3 presents the excerpt
of the conceptual model which is responsible for this.
Closely following the MPEG-7 standard and its multimedia description tools (ISO/
IEC JTC 1/SC 29/WG 11, 2001), basic media are modeled by media profiles (represented
by the class MediaProfile in Figure 3) along with associated media instances (represented by the class MediaInstance). Media profiles hold low-level metadata describing
physical characteristics of the media, such as the storage format, file size, and so forth; the media data itself is represented by media instances, each of which may directly embed the data in the form of a byte array or, if that is not possible or feasible, address its storage location by means of a URI. Moreover, if a digital representation is not available, a textual location description can be specified, for example the location of analog tapes in some tape archive. Figure 3 further shows that a media profile can have more than one media instance. In this way, an Emmo can be provided with information about alternative
storage locations of media.
Basic media represented by media profiles and media instances are attached to an
Emmo by means of a connector (see class Connector in Figure 3). A connector does not
just address a basic medium via a media profile; it may also refer to a media selector (see
base class MediaSelector) to address only a part of the medium. As indicated by the
various subclasses of MediaSelector, it is possible to select media parts according to
simple textual, spatial, temporal and textual criteria, as well as an arbitrary combination
of these criteria (see class CompositeSelector). It is thus possible to address the upper
right part of a scene in a digital video starting from second 10 and lasting until second
30 within an Emmo without having to extract that scene and put it into a separate media
file using a video editing tool.
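To make the interplay of connectors, media profiles, and selectors more concrete, the following Java fragment sketches how the video example just mentioned could be expressed. It is a minimal, self-contained illustration whose class and field names merely mirror Figure 3; it is not the actual Emmo container implementation, and the concrete frame dimensions and media URL are made up for the example.

import java.util.ArrayList;
import java.util.List;

// Throwaway classes mirroring the names of Figure 3; not the real Emmo container implementation.
class MediaInstance {                              // one physical manifestation of a medium
    String mediaURL;
    MediaInstance(String mediaURL) { this.mediaURL = mediaURL; }
}

class MediaProfile {                               // low-level metadata plus alternative storage locations
    String fileFormat; int width; int height;
    List<MediaInstance> instances = new ArrayList<>();
}

abstract class MediaSelector {}                    // addresses only a part of a medium

class TemporalSelector extends MediaSelector {
    int beginMs, durationMs;
    TemporalSelector(int beginMs, int durationMs) { this.beginMs = beginMs; this.durationMs = durationMs; }
}

class SpatialSelector extends MediaSelector {
    int startX, startY, endX, endY;
    SpatialSelector(int startX, int startY, int endX, int endY) {
        this.startX = startX; this.startY = startY; this.endX = endX; this.endY = endY;
    }
}

class CompositeSelector extends MediaSelector {    // arbitrary combination of selection criteria
    List<MediaSelector> parts = new ArrayList<>();
}

class Connector {                                  // attaches a (part of a) medium to an Emmo
    MediaProfile profile;
    MediaSelector selector;                        // null means the full medium is addressed
}

public class MediaAspectSketch {
    public static void main(String[] args) {
        MediaProfile video = new MediaProfile();
        video.fileFormat = "video/mpeg"; video.width = 720; video.height = 576;
        video.instances.add(new MediaInstance("http://example.org/holiday/vienna.mpg"));

        // The upper right part of a scene from second 10 to second 30,
        // addressed without extracting it into a separate media file.
        CompositeSelector scene = new CompositeSelector();
        scene.parts.add(new TemporalSelector(10_000, 20_000));
        scene.parts.add(new SpatialSelector(360, 0, 720, 288));

        Connector connector = new Connector();
        connector.profile = video;
        connector.selector = scene;
        System.out.println("Selection criteria combined: " + scene.parts.size());
    }
}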

Semantic Aspect
Out of the basic media which it contains, an Emmo forges a piece of semantically
modeled multimedia content by describing these media and their semantic interrelationships. The class diagram of Figure 4 gives an overview of the part of the Emmo model
that provides these semantic descriptions. As one can see, the basic building blocks of
the semantic descriptions, the so-called entities, are subsumed under the common base

Figure 4. Semantic aspect of Emmos


[Class diagram: Entity is the common base class of LogicalMediaPart, Association, OntologyObject, and Emmo; an Emmo aggregates an arbitrary number of entities.]


Figure 5. Entity details


[Class diagram: Entity carries the attributes OID : String, name : String, description : String, creationDate : long, modifiedDate : long, and creator : String; it may have arbitrary numbers of predecessor and successor versions (themselves entities), types (OntologyObject), and AttributeValues (value : Object), each attribute value referring to an OntologyObject as its attribute.]

class Entity. The Emmo model distinguishes four kinds of entities: namely, logical media
parts, associations, ontology objects, and Emmos themselves, represented by assigned
subclasses. These four kinds of entities have a common nature but each extends the
abstract notion of an entity with additional characteristic features.
Figure 5 depicts the characteristics that are common to all kinds of entities. Each
entity is globally and uniquely identified by its OID, realized by means of a universally unique identifier (UUID) (Leach, 1998), which can be easily created even in distributed
scenarios. To enhance human readability and usability, each entity is further augmented
with additional attributes like a name and a textual description. Moreover, each entity
holds information about its creator and its creation and modification date.
Figure 5 further expresses that entities may receive an arbitrary number of types. A
type is a concept taken from an ontology and represented by an ontology object in the
model. Types thus constitute entities themselves. By attaching types, an entity gets
meaning and is classified in an application-dependent ontology. As mentioned before,
the Emmo model does not come with a predefined set of ontology objects but instead
relies on applications to agree on a common ontology before the Emmo model can be used.
In the example of Figure 6, the entity Picture 3, a logical media part (depicted as a rectangle), which represents the third picture of our example photo album of the holiday trip introduced in the previous section, is an instantiation of the concepts photograph and digital image, represented by the ontology objects photograph and digital image (each pictured by an octagon), respectively. The type relationships are indicated by dashed arrows.

Figure 6. Entity with its types

[Figure content: the logical media part Picture 3 linked by dashed instance-of arrows to the ontology objects digital image and photograph.]

Figure 7. Entity with an attribute value

[Figure content: the entity Picture 3 with an attached attribute value; the attribute is the ontology object date and the value is a java.util.Date object holding 07/28/2003.]
For further description, the Emmo model also allows attaching arbitrary attribute
values to entities (expressed by the class of the same name in the class diagram of Figure
5). Attribute values are simple attribute-value pairs, with the attributes being a concept
of an application-dependent ontology represented by an ontology object entity, and the
value being an arbitrary object suiting the type of the value. The rationale behind
representing attributes by concepts of an ontology and not just by mere string identifiers
is that this allows expressing constraints on the usage of attributes within the ontology;
for example, to which entity types attributes are applicable.
Figure 7 gives an example of attribute values. In the figure, it is stated that the third
picture of the photo album was taken July 28, 2003, by attaching an attribute value
date=07/28/2003 to the entity Picture 3 representing that picture. The attribute date
is modeled by the ontology object date and the value 07/28/2003 is captured by an
object of a suitable date class (represented using the UML object notation).
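As an illustration of how types and attribute values could be attached to an entity in code, consider the following self-contained Java sketch. The classes are simplified stand-ins for the model of Figure 5, not the actual Emmo container API; in particular, the plain string-based ontology objects and the direct attribute map are assumptions made for brevity.

import java.util.ArrayList;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Simplified stand-ins for the entity characteristics of Figure 5; not the real Emmo container API.
class OntologyObject {                                     // a concept of an application-specific ontology
    final String name;
    OntologyObject(String name) { this.name = name; }
    public String toString() { return name; }
}

class Entity {
    final String oid = UUID.randomUUID().toString();       // globally unique OID
    final String name;
    final List<OntologyObject> types = new ArrayList<>();  // an entity may receive several types
    final Map<OntologyObject, Object> attributeValues = new HashMap<>();
    Entity(String name) { this.name = name; }
}

public class EntityDescriptionSketch {
    public static void main(String[] args) {
        OntologyObject photograph   = new OntologyObject("photograph");
        OntologyObject digitalImage = new OntologyObject("digital image");
        OntologyObject date         = new OntologyObject("date");

        Entity picture3 = new Entity("Picture 3");          // the logical media part of Figure 6
        picture3.types.add(photograph);                     // classify the entity ...
        picture3.types.add(digitalImage);
        picture3.attributeValues.put(date,                  // ... and describe it with an attribute value
                new GregorianCalendar(2003, Calendar.JULY, 28).getTime());

        System.out.println(picture3.name + " " + picture3.types + " " + picture3.attributeValues);
    }
}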
As an essential prerequisite for the realization of distributed, collaborative multimedia applications in which multimedia content is simultaneously authored and annotated by different persons at different locations, the Emmo model provides intrinsic
support for versioning. The class diagram of Figure 5 states that every entity is
versionable and can have an arbitrary number of predecessor and successor versions,
all of which have to be entities of the same kind as the original entity. Treating an entity's
versions as entities on their own has several benefits: on the one hand, entities
constituting versions of other entities have their own globally unique OID. Hence,
different versions concurrently derived from one and the same entity at different sites
can easily be distinguished without synchronization effort. On the other hand, different
versions of an entity can be interrelated just like any other entities, allowing one to
establish comparative relationships between entity versions.
Figure 8 exemplifies a possible versioning of our example entity Picture 3. The
original version of this logical part is depicted to the left of the figure. As expressed by
the special arrows indicating the predecessor (pred) and the successor (succ) relationship between different versions of the same entity, two different successor versions of
this original version were created, possibly by two different people at two different

locations. One version augments the logical media part with a date attribute value to denote the creation date of the picture, whereas the other provides an attribute value describing the aperture with which the picture was taken. Finally, as shown by the logical media part at the right side of the figure, these two versions were merged again into a fourth that now holds both attribute values.

Figure 8. Versioning of an entity

[Figure content: the original Picture 3 entity is connected by pred/succ arrows to two concurrently created successor versions, one with an added date attribute value and one with an added aperture attribute value, which are in turn merged into a fourth version holding both attribute values.]
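The following Java sketch replays the versioning scenario of Figure 8 with a deliberately simplified entity class. It is not the actual Emmo container API; the deriveSuccessor() helper and the string-keyed attribute map are assumptions made to keep the example short.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Deliberately simplified versioning; not the real Emmo container API.
class VersionedEntity {
    final String oid = UUID.randomUUID().toString();        // every version carries its own OID
    final String name;
    final Map<String, Object> attributeValues = new HashMap<>();
    final List<VersionedEntity> predecessors = new ArrayList<>();
    final List<VersionedEntity> successors   = new ArrayList<>();

    VersionedEntity(String name) { this.name = name; }

    // Derive a successor version that starts out as a copy of this version.
    VersionedEntity deriveSuccessor() {
        VersionedEntity v = new VersionedEntity(name);
        v.attributeValues.putAll(attributeValues);
        v.predecessors.add(this);
        this.successors.add(v);
        return v;
    }
}

public class VersioningSketch {
    public static void main(String[] args) {
        VersionedEntity original = new VersionedEntity("Picture 3");

        VersionedEntity withDate = original.deriveSuccessor();      // created at one site
        withDate.attributeValues.put("date", "07/28/2003");

        VersionedEntity withAperture = original.deriveSuccessor();  // created concurrently at another site
        withAperture.attributeValues.put("aperture", "f/5.6");

        VersionedEntity merged = withDate.deriveSuccessor();        // merge the two branches
        merged.predecessors.add(withAperture);
        withAperture.successors.add(merged);
        merged.attributeValues.putAll(withAperture.attributeValues);

        System.out.println(merged.attributeValues);                 // holds both attribute values
    }
}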
Having explained the common characteristics shared by all entities, we are now able
to introduce the peculiarities of the four concrete kinds of entities: logical media parts,
ontology objects, associations, and Emmos.

Logical Media Parts


Logical media parts are entities that form the bridge between the semantic aspect
and the media aspect of an Emmo. A logical media part represents a basic medium of which
multimedia content consists on a logical level for semantic description, thereby providing an abstraction from the physical manifestation of the medium. According to the class
diagram of Figure 9, logical media parts can refer to an arbitrary number of connectors (which we already know from our description of the media aspect of an Emmo), permitting one to logically subsume alternative media profiles and instances, representing different media files in possibly different formats and possibly different storage locations, under a common logical media part. The ID of the default profile to use is identified via the
attribute masterProfileID. Since logical media parts do not need to have connectors
associated with them, it is also possible to refer to media within Emmos which do not have
a physical manifestation.

Ontology Objects
Ontology objects are entities that represent concepts of an ontology. We have
already described how ontology objects are used to define entity types and to augment
entities with attribute values. By relating entities such as logical media parts to ontology
objects, they can be given a meaning. As can be seen from the class diagram of Figure
10, the Emmo model distinguishes two kinds of ontology objects represented by two
subclasses of OntologyObject: Concept and ConceptRef. Whereas an instance of
Concept serves to represent a concept of an ontology that is fully captured within the
Emmo model, ConceptRef allows one to reference concepts of ontologies specified in
external ontology languages such as RDF Schema (Brickley & Guha, 2002). The latter is
a pragmatic tribute to the fact that we have not developed an ontology language for
Emmos yet and therefore rely on external languages for this purpose. References to concepts of external ontologies additionally need a special ID (objOID) uniquely identifying the external concept referenced and a label indicating the format of the ontology (ontStandard); for example, RDF Schema.

Figure 9. Logical media parts

[Class diagram: a LogicalMediaPart (masterProfileID : String) refers to an arbitrary number of Connectors.]

Figure 10. Ontology objects

[Class diagram: an OntologyObject (ontType : int) has the two subclasses Concept and ConceptRef (objOID : String, ontStandard : String).]

Figure 11. Association

[Class diagram: an Association is an Entity that has exactly one source entity and one target entity.]

Associations
Associations are entities that establish binary directed relationships between
entities, allowing the creation of complex and detailed descriptions of the multimedia
content represented by the Emmo. As one can see from Figure 11, each association has
exactly one source entity and one target entity. The kind of semantic relationship
represented by an association is defined by the association's type, which is, like the types of other entities, an ontology object representing the concept that captures the type in an ontology. Different from other entities, however, an association is only permitted to have one type, as it can express only a single kind of relationship.
Since associations are first-class entities, they can take part as sources or targets
in other associations like any other entities. This feature permits the creation of very
complex content descriptions, as it facilitates the reification of statements (statements
about statements) within the Emmo model.

Figure 12. Reification


[Figure content: a fancies association connects Mary to Picture 3; this association is the target of a thinks association with source Paul, which in turn is the target of a believes association with source Peter.]

Figure 12 demonstrates how reification can be expressed. In the figure, associations are symbolized by a diamond shape, with solid arrows indicating the source and target of an association and dashed arrows indicating the association type. The example shown in this figure is intended to express that Peter believes that Paul thinks that Mary fancies Picture 3. The statement "Mary fancies Picture 3" is represented at the bottom of the figure by an association of type fancies that connects the ontology object Mary with the logical media part Picture 3. Moreover, this association acts as the target of another association having the type thinks and the source entity Paul, thereby making a statement about the statement "Mary fancies Picture 3". This reification is then further enhanced by attaching another statement to obtain the desired message.
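The reified statement of Figure 12 can also be written down programmatically. The following Java sketch uses throwaway helper classes that merely mimic the notion of first-class associations with exactly one type, one source, and one target; it is not the actual Emmo container API.

// Throwaway classes mimicking first-class associations; not the real Emmo container API.
class Node {                                       // stands in for ontology objects, media parts, etc.
    final String name;
    Node(String name) { this.name = name; }
    public String toString() { return name; }
}

class Association extends Node {                   // binary, directed, exactly one type
    final Node type, source, target;
    Association(Node type, Node source, Node target) {
        super(type + "(" + source + " -> " + target + ")");
        this.type = type; this.source = source; this.target = target;
    }
}

public class ReificationSketch {
    public static void main(String[] args) {
        Node mary = new Node("Mary"), paul = new Node("Paul"), peter = new Node("Peter");
        Node picture3 = new Node("Picture 3");
        Node fancies = new Node("fancies"), thinks = new Node("thinks"), believes = new Node("believes");

        Association s1 = new Association(fancies, mary, picture3);   // "Mary fancies Picture 3"
        Association s2 = new Association(thinks, paul, s1);          // an association targeting an association
        Association s3 = new Association(believes, peter, s2);       // the statement about the statement

        System.out.println(s3);
    }
}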

Emmos
Emmos themselves, finally, constitute the fourth kind of entities. An Emmo is
basically a container that encapsulates arbitrary entities to form a semantically modeled
piece of multimedia content (see the aggregation between the classes Emmo and Entity
in the introductory outline of the model in Figure 4). As one and the same entity can be
contained in more than one Emmo, it is possible to encapsulate different, context-dependent, and even contradicting, views onto the same content within different Emmos; as Emmos are first-class entities, they can be contained within other Emmos and take part in associations therein, allowing one to build arbitrarily nested Emmo structures for the logical organization of multimedia content. These are important characteristics, especially useful for the authoring process, as they facilitate reuse of existing Emmos and the
content they represent.
Figure 13 shows an example where a particular Emmo encapsulates another. In the
figure, Emmos are graphically shown as ellipses. The example depicts an Emmo modeling
a private photo gallery that up to the moment holds only a single photo album (again
modeled by an Emmo): namely, the photo album of the journey to Europe we used as a motivating example in the section illustrating the Emmo idea. Via an association, this album is classified as vacation within the photo gallery. In the course of time, the photo gallery might become filled with additional Emmos representing further photo albums; for example, one that keeps the photos of a summer vacation in Spain. These Emmos can be related to each other. For example, an association might express that the journey to Europe took place before the summer vacation in Spain.

Figure 13. Nested Emmo

[Figure content: a Photo Gallery Emmo (shown as an ellipse) encapsulates the Journey to Europe Emmo and classifies it as vacation via a domain association; the nested Emmo contains the entities Paris, Vienna, Paul, Mary, and Picture 1 to Picture 3 together with the ontology objects digital image and photograph.]
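A minimal sketch of this kind of nesting is given below. The helper classes are invented for illustration only; they merely show that an Emmo, being an entity itself, can be placed inside another Emmo and be classified there via an association.

import java.util.LinkedHashSet;
import java.util.Set;

// Invented helper classes; they only illustrate that an Emmo is itself an entity
// and can therefore be nested and associated within another Emmo.
class Ent {
    final String name;
    Ent(String name) { this.name = name; }
    public String toString() { return name; }
}

class Assoc extends Ent {
    Assoc(String type, Ent source, Ent target) { super(type + "(" + source + " -> " + target + ")"); }
}

class Emmo extends Ent {
    final Set<Ent> entities = new LinkedHashSet<>();   // one entity may be contained in several Emmos
    Emmo(String name) { super(name); }
}

public class NestingSketch {
    public static void main(String[] args) {
        Emmo album   = new Emmo("Journey to Europe");   // the photo album Emmo of Figure 1
        Emmo gallery = new Emmo("Photo Gallery");
        Ent vacation = new Ent("vacation");             // ontology object used for classification

        gallery.entities.add(album);                    // the album is encapsulated in the gallery ...
        gallery.entities.add(vacation);
        gallery.entities.add(new Assoc("domain", album, vacation));  // ... and classified as vacation there

        System.out.println(gallery.entities);
    }
}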

Functional Aspect
Emmos also address the functional aspect of multimedia content. Emmos may offer
operations that realize arbitrary content-specific functionality which makes use of the
media and descriptions provided with the media and semantic aspects of an Emmo and
which can be invoked by applications working with content. The class diagram of Figure
14 shows how this is realized in the model. As expressed in the diagram, an Emmo may
aggregate an arbitrary number of operations represented by the class of the same name.
Each operation has a designator, that is, a name that describes its functionality, which
is represented by an ontology object. Similar to attributes, the motivation behind using
concepts of an ontology as operation designators instead of simple string identifiers is
that this allows one to express restrictions on the usage of operations within an ontology;
for example, the types of Emmos for which an operation is available, the types of the
expected input parameters, and so forth.
The functionality of an operation is provided by a dedicated implementation class
whose name is captured by an operation's implClassName attribute to permit the
dynamic instantiation of the implementation class at runtime. There are not many
restrictions for such an implementation class: the Emmo model merely demands that an
implementation class realizes the OperationImpl interface. OperationImpl enforces the
implementation of a single method only: namely, the method execute() which expects
the Emmo on which an operation is executed as its first parameter, followed by a vector of arbitrary operation-dependent parameter objects. Execute() performs the desired functionality and, as a result, may return an arbitrary object.

Figure 14. Emmos' functionality

[Class diagram: an Emmo aggregates an arbitrary number of Operations; each Operation carries an implClassName : String attribute and a designator represented by an OntologyObject, and instantiates the interface OperationImpl, which declares the single method execute(in e : Emmo, in args : Object[]) : Object.]

Figure 15. Example of Emmo operations

[Figure content: the Journey to Europe Emmo with the two attached implementation classes RenderAsSlideShow:OperationImpl and RenderAsMap:OperationImpl, differentiated via the designators renderAsSlideShow and renderAsMap.]
Figure 15 once more depicts the Emmo modeling the photo album of the journey to
Europe that we already know from Figure 13, but this time enriched with the two
operations already envisioned in the second section: one that traverses the semantic description of the album and returns an SMIL presentation that renders the album as a slide
show, and another that returns an SVG presentation that renders the same album as a map.
For both operations, two implementation classes are provided that are attached to the
Emmo and differentiated via their designators renderAsSlideShow and renderAsMap.
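To give an impression of what such an implementation class could look like, the following Java sketch implements the OperationImpl interface with the signature shown in Figure 14. The Emmo type below is merely a placeholder, and the body of execute() is only indicated; a real renderAsSlideShow implementation would traverse the Emmo's logical media parts and semantic description to assemble the SMIL document.

// The Emmo type below is only a placeholder; the execute() signature follows Figure 14.
interface Emmo { }

interface OperationImpl {
    // First parameter: the Emmo on which the operation is executed;
    // the remaining, operation-dependent parameters arrive in args.
    Object execute(Emmo e, Object[] args);
}

class RenderAsSlideShow implements OperationImpl {
    @Override
    public Object execute(Emmo e, Object[] args) {
        // args might carry the persons, locations, and time periods of interest.
        StringBuilder smil = new StringBuilder("<smil><body><seq>\n");
        // ... traverse the Emmo's logical media parts and semantic description,
        //     appending one slide per matching picture ...
        smil.append("</seq></body></smil>");
        return smil.toString();                 // the generated SMIL presentation
    }
}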

THE EMMO CONTAINER INFRASTRUCTURE


As an elementary foundation for the sharing and collaborative authoring of pieces
of semantically modeled multimedia content on the basis of the Emmo model, we have
implemented a distributed Emmo container infrastructure. Figure 16 provides an overview of this infrastructure, which we are going to describe in more detail in the following
section.
Basically, an Emmo container provides a space where Emmos live. Its main
purpose is the management and persistent storage of Emmos. An Emmo container
provides application programming interfaces that permit applications to fine-grainedly
access, manipulate, traverse, and query the Emmos it stores. This includes the media
aspect of an Emmo with its media profiles and instances, the semantic aspect with all its
descriptional entities such as logical media parts, ontology objects, other Emmos, and
associations, as well as the versioning relationships between those entities. Moreover,
an Emmo container offers an interface to invoke and execute an Emmo's operations, giving
access to the functional aspect of an Emmo.
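Purely as an illustration of this division of labour, the following Java fragment shows how an application might talk to a container. The interface and method names (EmmoContainer, getEmmo, executeOperation) are invented for this sketch and do not reproduce the actual programming interfaces of our implementation.

// Illustrative only: the interface and method names below are invented for this sketch
// and do not reproduce the actual programming interfaces of the Emmo containers.
interface EmmoContainer {
    Object getEmmo(String oid);                                    // fine-grained access by OID
    Object executeOperation(String emmoOid, String designator,     // invoke a stored operation
                            Object[] args);
}

class ContainerClient {
    static Object renderAlbum(EmmoContainer container, String albumOid) {
        // Let the Emmo render itself through one of its own operations.
        return container.executeOperation(albumOid, "renderAsSlideShow", new Object[] {});
    }
}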
Emmo containers are not intended as a centralized infrastructure with a single Emmo
container running at a server (although this is possible). Instead, it is intended to
establish a decentralized infrastructure with Emmo containers of different scales and
sizes running at each site that works with Emmos. Such a decentralized Emmo management naturally reflects the nature of content sharing and collaborative multimedia
applications.
The decentralized approach has two implications. The first implication is that
platform independence and scalability are important in order to support Emmo containers at potentially very heterogeneous sites ranging from home users to large multimedia
content publishers with different operating systems, capabilities, and requirements.
For these reasons, we have implemented the Emmo containers in Java, employing
the object-oriented DBMS ObjectStore for persistent storage. By Java, we obtain
platform independence; by ObjectStore, we obtain scalability, as there does not just exist a full-fledged database server implementation suitable for larger content providers, but also a code-compatible, file-based, in-process variant named PSEPro better suiting the limited needs of home users. It would have been possible to use a similarly scalable relational DBMS for persistent storage as well; we opted for an object-oriented DBMS, however, because of these systems' suitability for handling complex graph structures like Emmos.

Figure 16. Emmo container infrastructure

[Figure content: two Emmo containers, each providing persistent storage of media and semantic relationships, interfaces to access, manipulate, traverse, and query the stored Emmos, and interfaces to invoke and execute operations; Emmos (identified by OIDs such as EMMO_1284) are exchanged between the containers via import/export.]
The second implication of a decentralized infrastructure is that Emmos must be
transferable between the different Emmo containers operated by users that want to share
or collaboratively work on content. This requires Emmo containers to be able to
completely export Emmos into bundles encompassing their media, semantic, and functional aspects, and to import Emmos from such bundles, which is explained in more detail
in the following two subsections.
In the current state of implementation, Emmo containers are rather isolated components, requiring applications to explicitly initiate the import and export of Emmos and to
manually transport Emmo bundles between different Emmo containers themselves. We
are building a peer-to-peer infrastructure around Emmo containers that permits the
transparent search for and transfer of Emmos across different containers.

Exporting Emmos
An Emmo container can export an Emmo into a bundle whose overall structure is
illustrated by Figure 17.
The bundle is basically a ZIP archive which captures all three aspects of an Emmo:
the media aspect is captured by the bundle's media folder. The basic media files of which
the multimedia content modeled by the Emmo consists are stored in this folder.

Figure 17. Structure of an Emmo bundle

!""#$!%&'(')*+(,,+-../('0'-

-1(2-230).+4.%-

!%&'(')*+(,,+-../('0'-

-1(2-230).+4.%-5678

709:+1;<

71/:0

=<<>?@@AAA5>=B<B53B7@>0C8@1CDB>1@>:3<CD122-5E>F

=<<>?@@AAA5>=B<B53B7@>0C8@1CDB>1@<=C7490:822-5E>F

B>1D0<:B9;

D19/1D:9F5E0D


Figure 18. Emmo XML representation


<?xml version="1.0" encoding="UTF-16"?>
<!-- Document created by org.cultos.storage.mdwb.exporter.MdwbXMLExporter -->
<emmo xmlns="http://www.cultos.org/emmos" xmlns:mpeg7="http://www.mpeg7.org/2001/MPEG-7_Schema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.cultos.org/emmos http://www.cultos.org/emmos/XML/emmo.xsd">
<components>
<entities>
<entity xsi:type="LogicalMediaPart" mode="strong">
<oid>E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2</oid>
<name>Beethoven's 5th Symphony</name>
<creationDate>19 November 2003 08:24:54 CET</creationDate>
<modifiedDate>19 November 2003 08:24:55 CET</modifiedDate>
</entity>
<entity xsi:type="Concept" mode="weak">
<oid>E1a8d252-f8f2d7a8a5-3c7cb04afdbd1144ba4d1ea866d93db2</oid>
<name>Classical music</name>
<creationDate>19 November 2003 08:15:09 CET</creationDate>
<modifiedDate>19 November 2003 08:15:09 CET</modifiedDate>
<ontologyType>0</ontologyType>
</entity>
.....
</entities>
<mediaProfiles/>
</components>
<links>
<types>
<typeLink entity="E1a8d252-f8f2e095ed-3c7cb04afdbd1144ba4d1ea866d93db2"
type="E1a8d252-f8f2e0980f-3c7cb04afdbd1144ba4d1ea866d93db2"/>
<typeLink entity="E1a8d252-f8f2e095fc-3c7cb04afdbd1144ba4d1ea866d93db2"
type="E1a8d252-f8f2e098ef-3c7cb04afdbd1144ba4d1ea866d93db2"/>
<typeLink entity="E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2"
type="E1a8d252-f8f2d7a8a5-3c7cb04afdbd1144ba4d1ea866d93db2"/>
</types>
<attributeValues/>
<associations>
<assoLink association="E1a8d252-f8f2e0981a-3c7cb04afdbd1144ba4d1ea866d93db2"
sourceEntity="E1a8d252-f8f2e095fc3c7cb04afdbd1144ba4d1ea866d93db2"
targetEntity="E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2"/>
</associations>
<connectors/>
<predVersions/>
<succVersions/>
<encapsulations/>
</links>
</emmo>

The semantic aspect is captured by a central XML file whose name is given by the OID of the bundled Emmo. This XML file captures the semantic structure of the Emmo, thus describing all of the Emmo's entities, the associations between them, the versioning
relationships, and so forth.
Figure 18 shows a fragment of such an XML file. It is divided into a <components>
section declaring all entities and media profiles relevant for the current Emmo and a
<links> section capturing all kinds of relationships between these entities and media
profiles, such as types, associations, and so forth.
The functional aspect of an Emmo is captured by the bundle's operations folder in which the binary code of the Emmo's operations is stored. Here, our choice for Java
as the implementation language for Emmo containers comes in handy again, as it allows
us to transfer operations in the form of JAR files with platform-independent bytecode even between heterogeneous platforms.
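The following Java sketch assembles such a bundle with the standard java.util.zip classes. It reflects only the folder layout described above (a media folder, a central XML file named after the Emmo's OID, and an operations folder with the operation JARs); the exact entry names and the omission of the manifest are simplifications of this sketch, not a specification of the actual export format.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Assembles a bundle with the folder layout described above; entry names and the
// omitted manifest are simplifications of this sketch, not the actual export format.
public class BundleExportSketch {
    public static void writeBundle(Path target, String emmoOid, String emmoXml,
                                   Path[] mediaFiles, Path[] operationJars) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(target))) {
            zip.putNextEntry(new ZipEntry(emmoOid + ".xml"));        // semantic aspect: <OID>.xml
            zip.write(emmoXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();

            for (Path media : mediaFiles) {                          // media aspect: media folder
                zip.putNextEntry(new ZipEntry("media/" + media.getFileName()));
                Files.copy(media, zip);
                zip.closeEntry();
            }
            for (Path jar : operationJars) {                         // functional aspect: operations folder
                zip.putNextEntry(new ZipEntry("operations/" + jar.getFileName()));
                Files.copy(jar, zip);
                zip.closeEntry();
            }
        }
    }
}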
The export functionality can react to different application needs by offering several
export variants: an Emmo can be exported with or without media included in the bundle,
one can choose whether to also include media that are only referenced by URIs, the
predecessor and successor versions of the contained entities can either be added to the
bundle or omitted, and it can be decided whether to recursively export Emmos contained
within an exported Emmo. The particular export variants chosen are recorded in the
bundle's manifest file.
In order to implement these different export variants, an Emmo container distinguishes three different modes of how entities can be placed in a bundle:

The strong mode is the normal mode for an entity. The bundle holds all information
about an entity including its types, attribute values, immediate predecessor and
successor versions, media profiles (in case of a logical media part), contained
entities (in case of an Emmo), and so forth.

The hollow mode is applicable to Emmos only. The hollow mode indicates that the
bundle holds all information about an Emmo except the entities it contains. The
hollow mode appears in bundles where it was chosen not to recursively export
encapsulated Emmo. In this case, encapsulated Emmos receive the hollow mode;
the entities encapsulated by those Emmos are excluded from the export.

The weak mode indicates that the bundle contains only basic information about
an entity, such as its OID, name, and description but no types, attribute values, and
so forth. Weak mode entities appear in bundles that have been exported without
versioning information. In this case, the immediate predecessor and successor
versions of exported entities are placed into the bundle in weak mode; indirect
predecessor and successor versions are excluded from the export.
The particular mode of an entity within a bundle is marked with the mode attribute
in the entity's declaration in the bundle's XML file (see again Figure 18).

Importing Emmos
When importing an Emmo bundle exported in the way described in the previous
subsection, an Emmo container essentially inserts all media files, entities, and operations
included in the bundle into its local database. In order to avoid duplicates, the container
checks whether an entity with the same OID or whether a media file or JAR file already
exists in the local database before insertion. If a file already exists, the basic strategy of
the importing container is that the local copy prevails.
However, the different export variants for Emmos and the different modes in which
entities might occur in a bundle, as well as the fact that in a collaborative scenario Emmos might have been concurrently modified without creating new versions of entities, demand a more sophisticated handling of duplicate entities on the basis of a timestamp protocol. Depending on the modes of the two entities with the same OID in the bundle and the local database, and the timestamps of both entities, essentially the following treatment
is applied:

A greater mode (weak < hollow < strong) in combination with a more recent
timestamp always wins. Thus, if the local entity has a greater mode and a newer
timestamp, it prevails, and the entity in the bundle is ignored. Similarly, if the local
entity has a lesser mode and an older timestamp, the entity in the bundle completely
replaces the local entity in the database.
If the local entity has a more recent timestamp but a lesser mode, additional data
available for the entity in the bundle (entity types, attribute values, predecessor
or successor versions, encapsulated entities in case of Emmos, or media profiles
in case of logical media parts) complements the data of the local entity, thereby
raising its mode.
In case of same modes but a more recent timestamp of the entity in the bundle, the
entity in the bundle completely replaces the local entity in the database.
In case of same modes but a more recent timestamp of the entity in the local
database, the entity in the database prevails and the entity in the bundle is ignored.
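The decision logic behind this treatment can be summarized in a few lines of Java. The sketch below encodes only the rules listed above; the enum names and the default chosen for the remaining combination (a local entity with a greater mode but an older timestamp) are assumptions of this illustration.

// Only the rules listed above are encoded; enum names and the default for the
// remaining combination are assumptions of this sketch.
enum Mode { WEAK, HOLLOW, STRONG }                 // ordering: weak < hollow < strong

enum ImportAction { KEEP_LOCAL, REPLACE_LOCAL_WITH_BUNDLE, COMPLEMENT_LOCAL }

class DuplicateResolver {
    static ImportAction resolve(Mode localMode, long localTimestamp,
                                Mode bundleMode, long bundleTimestamp) {
        int modeCmp = localMode.compareTo(bundleMode);       // > 0 means the local mode is greater
        boolean bundleNewer = bundleTimestamp > localTimestamp;

        if (modeCmp > 0 && !bundleNewer) return ImportAction.KEEP_LOCAL;                 // greater mode and newer timestamp win
        if (modeCmp < 0 && bundleNewer)  return ImportAction.REPLACE_LOCAL_WITH_BUNDLE;  // lesser mode and older timestamp lose
        if (modeCmp < 0 && !bundleNewer) return ImportAction.COMPLEMENT_LOCAL;           // newer but lesser mode: add the bundle's extra data
        if (modeCmp == 0) return bundleNewer ? ImportAction.REPLACE_LOCAL_WITH_BUNDLE
                                             : ImportAction.KEEP_LOCAL;
        // Greater local mode but older local timestamp is not spelled out in the text;
        // keeping the local copy is the default assumed here.
        return ImportAction.KEEP_LOCAL;
    }
}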

APPLICATIONS
Having introduced and described the Emmo approach to semantic multimedia
content modeling and the Emmo container infrastructure, this section illustrates how
these concepts have been practically applied in two concrete multimedia content sharing
and collaborative applications. The first application, named CULTOS, is in the domain of cultural heritage, and the second application introduces a semantic jukebox.

CULTOS
CULTOS is a European Union (EU)-funded project carried out from 2001 to 2003 with 11 partners from EU countries and Israel1. It has been the task of CULTOS to develop a multimedia collaboration platform for authoring, managing, retrieving, and exchanging Intertextual Threads (ITTs) (Benari et al., 2002; Schellner et al., 2003), knowledge structures that semantically interrelate and compare cultural artifacts such as literature, movies, artworks, and so forth. This platform enables the community of intertextual studies to create and exchange multimedia-enriched pieces of cultural knowledge that incorporate the community's different cultural backgrounds, an important contribution to the preservation of European cultural heritage.
ITTs are basically graph structures that describe semantic relationships between
cultural artifacts. They can take a variety of forms, ranging from spiders and centipedes to associative maps, like the one shown in Figure 19.
The example ITT depicted in the figure highlights several relationships of the poem
The Fall by Tuvia Ribner to other works of art. It states that the poem makes reference
to the 3rd book of Ovid's Metamorphoses and that the poem is an ekphrasis of the painting Icarus's Fall by the famous Dutch painter Breugel.
The graphical representation of an ITT bears strong resemblance to well-known
techniques for knowledge representation such as concept graphs or semantic nets,
although it lacks their formal rigidity. ITTs nevertheless get very complex, as they
commonly make use of constructs such as encapsulation and reification of statements
that are challenging from the perspective of knowledge representation.


Figure 19. Simple intertextual thread


[Figure content: the text The Fall by Ribner is linked by a Referencing association to Book 3 of the text Metamorphoses by Ovid and by an Ekphrasis association to the painting Icarus's Fall by Breugel.]

Encapsulation is intrinsic to ITTs because intertextual studies are not exact sciences. Certainly, the cultural and personal context of a researcher affects the kinds of relationships between pieces of literature he discovers and which are of value to him. As such different views onto a single subject are highly interesting to intertextual studies, ITTs themselves can be relevant subjects of discourse and thus be contained as first-class artifacts within other ITTs. Figure 20 illustrates this point with a more complex ITT that interrelates two ITTs manifesting two different views on Ribner's poem as opposed representations.
Reification of statements is also frequently occurring within ITTs. Since experts in
intertextual studies extensively base their position on the position of other researchers,
statements about statements are common practice within ITTs. In the ITT of Figure 20,
for instance, it is expressed by reification that the statement describing the two depicted ITTs as opposed representations is only the opinion of a certain researcher, B. Zoa.
Given these characteristics of ITTs, we have found that Emmos are very well suited
for their representation in the multimedia collaboration platform for intertextual studies
that is envisioned by CULTOS. Firstly, the semantic aspect of Emmos offers sufficient expressiveness to capture ITTs. Figure 21 shows how the complex ITT of Figure 20 could be represented using Emmos. Due to the fact that associations as well as Emmos themselves are first-class entities, it is even possible to cope with reification of statements as well as with encapsulation of ITTs.

Figure 20. Complex intertextual thread

[Figure content: one ITT relates the text The Fall by Ribner to Book 3 of Ovid's Metamorphoses and to the cultural concept The Fall of Adam & Eve (New Testament, Genesis, ch. II) via Referencing associations; a second ITT relates The Fall by Ribner to the painting Icarus's Fall by Breugel via an Ekphrasis association; an Opposed Representation association connects the two ITTs, and a believes association attributes this statement to the researcher B. Zoa.]

Figure 21. Emmo representing an ITT

[Figure content: the complex ITT of Figure 20 modeled as nested Emmos (Emmo1, Emmo2, Emmo3); connectors (Connector 1 to Connector 5) attach the described artifacts to media such as http://.../Metamorphoses.pdf, http://.../IcarusFall.jpg, http://.../TheFall.doc, TheFall.doc, and FallAdam&Eve.doc, and a RenderingImplementation class provides a Rendering operation.]
Secondly, the media aspect of Emmos allows researchers to enrich ITTs, which so far expressed interrelationships between cultural artefacts only on an abstract level, with digital media about these artefacts, such as a JPEG image showing Breugel's painting Icarus's Fall. The ability to consume these media while browsing an ITT certainly enhances the
comprehension of the ITT and the relationships described therein.
Thirdly, with the functional aspect of Emmos, functionality can be attached to ITTs.
For instance, an Emmo representing an ITT in CULTOS offers operations to render itself
in an HTML-based hypermedia view.
Additionally, our Emmo container infrastructure outlined in the previous section
provides a suitable foundation for the realization of the CULTOS platform. Their ability to persistently store Emmos, as well as their interfaces, which enable applications to fine-grainedly traverse and manipulate the stored Emmos and invoke their operations, make
Emmo containers an ideal ground for the authoring and browsing applications for ITTs
that had to be implemented in the CULTOS project. Figure 22 gives a screenshot of the
authoring tool for ITTs that has been developed in the CULTOS project which runs on
top of an Emmo container.
Moreover, their decentralized approach allows the setup of independent Emmo
containers at the sites of different researchers; their ability to import and export Emmos
with all the aspects they cover facilitates the exchange of ITTs, including the media by

which they are enriched as well as the functionality they offer. This enables researchers to share and collaboratively work on ITTs in order to discover and establish new links between artworks as well as different personal and cultural viewpoints, thereby paving the way to novel insights into a subject. The profound versioning within the Emmo model further enhances this kind of collaboration, allowing researchers to concurrently create different versions of an ITT at different sites, to merge these versions, and to highlight differences between these versions.

Figure 22. CULTOS authoring tool for ITTs

Semantic Jukebox
One of the most prominent (albeit legally disputed) multimedia content sharing
applications is the sharing of MP3 music files. Using peer-to-peer file sharing infrastructures such as Gnutella, many users gather large song libraries on their home PCs which

they typically manage with one of the many jukebox programs available, such as Apple's iTunes (Apple Computer, n.d.). The increasing use of ID3 tags (ID3v2, n.d.) within MP3 files for song description, optional free-text attributes capturing metadata like the interpreter, title, and genre of a song, alleviates the management of such libraries.
Nevertheless, ID3-based song management quickly reaches its limitations. While
ID3 tags enable jukeboxes to offer reasonably effective search functionality for songs
(provided the authors of ID3 descriptions spell the names of interpreters, albums, and
genres consistently), more advanced access paths to song libraries are difficult to realize.
Apart from other songs of the same band or genre, for instance, it is difficult to find songs
similar to the one that is currently playing. In this regard, it would also be interesting to
be able to navigate to other bands in which artists of the current band played as well or
with which the current band appeared on stage together. But such background knowledge cannot be captured with ID3 tags.
Using Emmos and the Emmo container infrastructure, we have implemented a
prototype of a semantic jukebox that considers background knowledge about music. The
experience we have gained from this prototype shows that the Emmo model is well suited to representing knowledge-enriched pieces of music in a music sharing scenario. Figure 23 gives a sketch of such a music Emmo, which holds some knowledge about the song "Round Midnight".

Figure 23. Knowledge about the song "Round Midnight" represented by an Emmo

[Figure 23 depicts ontology objects such as Composition, Artist, Performance, and Record, connected by associations labeled "composed by", "played by", "has manifestation", "assigned to", and "date of issue" (10/26/1955), together with a media profile pointing to http://.../roundmid.mp3 and a RenderAsTimelineinSVG rendering operation.]

Its media aspect enables the depicted Emmo to act as a container of MP3 music files. In our example, this is a single MP3 file with the song "Round Midnight" that is connected as a media profile to the logical media part "Round Midnight" in the center of the figure.
The Emmo's semantic aspect allows us to express rich background knowledge about music files. For this purpose, we have developed a basic ontology for the music domain featuring concepts such as Artist, Performance, Composition, and Record, which all appear as ontology objects in the figure. The ontology also features various association types that allow us to express that "Round Midnight" was composed by Thelonious Monk and that the particular performance by Miles Davis can be found on the record "Round About Midnight". The ontology further defines attributes for expressing temporal information, such as the issue date of a record.
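Purely as an illustration of the kind of background knowledge involved, the statements described above could be written down as plain RDF along the following lines. The namespace and the property names (composedBy, playedBy, assignedTo, dateOfIssue) are assumptions made for this sketch only; this is not the Emmo serialization format, which additionally captures versioning, media profiles, and operations.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:music="http://example.org/music#">
  <!-- hypothetical music ontology instance, not an Emmo serialization -->
  <music:Composition rdf:about="http://example.org/music#RoundMidnight">
    <music:composedBy rdf:resource="http://example.org/music#TheloniousMonk"/>
  </music:Composition>
  <music:Performance rdf:about="http://example.org/music#RoundMidnightByMilesDavis">
    <music:playedBy rdf:resource="http://example.org/music#MilesDavis"/>
    <music:assignedTo rdf:resource="http://example.org/music#RoundAboutMidnight"/>
  </music:Performance>
  <music:Record rdf:about="http://example.org/music#RoundAboutMidnight">
    <!-- temporal attribute; the date is the one shown in Figure 23 -->
    <music:dateOfIssue>1955-10-26</music:dateOfIssue>
  </music:Record>
</rdf:RDF>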
The functional aspect, finally, enables the Emmo to support different renditions of the knowledge it contains. To demonstrate this, we have realized an operation that, given a time interval as its parameter, produces an SVG timeline rendition (see the screenshot in Figure 24) arranging important events, such as the foundation of bands and the birth and death dates of artists, around a timeline. More detailed information for each event can be obtained by clicking on the corresponding icons on the timeline.
Further operations could be imagined; for example, operations that provide rights
clearance functionality for the music files contained in the Emmo, which is a crucial issue
in music sharing scenarios.

Figure 24. Timeline rendition of a music Emmo


Our Emmo container infrastructure provides a capable storage foundation for semantic jukeboxes. Their ability to manage Emmos at a fine-grained level, as well as their scalability, allows them to be deployed in both small-scale file-based and large-scale database server configurations. Thus, Emmo containers constitute suitable hosts for knowledge-enriched music libraries of private users as well as for libraries of professional institutions such as radio stations. Capable of exporting and importing Emmos to and from bundles, Emmo containers also facilitate the sharing of music between different jukeboxes. Their versioning support even makes it possible to move from mere content sharing scenarios to collaborative scenarios in which different users cooperate to enrich and edit Emmos with their knowledge about music.

CONCLUSION
Current approaches to semantic multimedia content modeling typically regard the basic media of which the content is composed, the description of these media, and the functionality of the content as conceptually separate entities. This leads to difficulties with multimedia content sharing and collaborative applications. In response to these difficulties, we have proposed Enhanced Multimedia Meta Objects (Emmos) as a novel approach to semantic multimedia content modeling. Emmos coalesce the media of which multimedia content consists, their semantic descriptions, and the functionality of the content into single indivisible objects. Emmos in their entirety are serializable and
versionable, making them a suitable foundation for multimedia content sharing and
collaborative applications. We have outlined a distributed container infrastructure for
the persistent storage and exchange of Emmos. We have illustrated how Emmos and the
container infrastructure were successfully applied for the sharing and collaborative
authoring of multimedia-enhanced intertextual threads in the CULTOS project and for the
realization of a semantic jukebox.
We strive to extend the technological basis of Emmos. We are currently developing
a query algebra, which permits declarative querying of all the aspects of multimedia content captured by Emmos, and integrating this algebra into our Emmo container implementation. Furthermore, we are wrapping the Emmo containers as services in a peer-to-peer network in order to provide seamless search for and exchange of Emmos in a
distributed scenario. We also plan to develop a language for the definition of ontologies
that is adequate for use with Emmos. Finally, we are exploring the handling of copyright
and security within the Emmo model. This is certainly necessary as Emmos might not just
contain copyrighted media material but also carry executable code with them.

REFERENCES

Apple Computer (n.d.). iTunes. Retrieved 2004 from http://www.apple.com


Ayars, J., Bulterman, D., Cohen, A., et al. (2001). Synchronized multimedia integration
language (SMIL 2.0). W3C Recommendation, World Wide Web Consortium
(W3C).
Baumeister, S. (2002). Enterprise Media Beans specification. Public Draft Version 1.0, IBM Corporation.


Benari, M., Ben-Porat, Z., Behrendt, W., Reich, S., Schellner, K., & Stoye, S. (2002).
Organizing the knowledge of arts and experts for hypermedia presentation. Proceedings of the Conference of Electronic Imaging and the Visual Arts, Florence,
Italy.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American.
Boll, S., Klas, W., & Westermann, U. (2000). Multimedia document formats - sealed fate
or setting out for new shores? Multimedia - Tools and Applications, 11(3).
Brickley, D., & Guha, R.V. (2002). Resource description framework (RDF) vocabulary
description language 1.0: RDF Schema. W3C Working Draft, World Wide Web
Consortium (W3C).
Chang, H., Hou, T., Hsu, A., & Chang, S. (1995). Tele-Action objects for an active
multimedia system. Proceedings of the International Conference on Multimedia
Computing and Systems (ICMCS 1995), Ottawa, Canada.
Chang, S., & Znati, T. (2001). Adlet: An active document abstraction for multimedia
information fusion. IEEE Transactions on Knowledge and Data Engineering,
13(1).
Daniel, R., Lagoze, C., & Payette, S. (1998). A metadata architecture for digital libraries.
Proceedings of the Advances in Digital Libraries Conference, Santa Barbara,
California.
Fensel, D. (2001). Ontologies: A silver bullet for knowledge management and electronic
commerce. Heidelberg: Springer.
Ferraiolo, J., Fujisawa, J., & Jackson, D. (2003). Scalable vector graphics (SVG) 1.1. W3C
Recommendation, World Wide Web Consortium (W3C).
Gnutella (n.d.). Retrieved 2003 from http://www.gnutella.com
Grimson, J., Stephens, G., Jung, B., et al. (2001). Sharing health-care records over the
internet. IEEE Internet Computing, 5(3).
ID3v2 (n.d.). [Computer software]. Retrieved 2004 from http://www.id3.org
ISO/IEC JTC 1/SC 29 (1997). Information technology - Coding of hypermedia information - part 5: support for base-level interactive applications. ISO/IEC International Standard 13522-5:1997, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 29/WG 11 (2001). Information technology - Multimedia content
description interface - part 5: Multimedia description schemes. ISO/IEC Final
Draft International Standard 15938-5:2001, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 34/WG 3 (1997). Information technology - Hypermedia/time-based structuring language (HyTime). ISO/IEC International Standard 10744:1997, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 34/WG 3 (2000). Information technology - SGML applications - topic
maps. ISO/IEC International Standard 13250:2000, International Organization for
Standardization/International Electrotechnical Commission (ISO/IEC).
ISO/IEC JTC 1/SC 32/WG 2 (2001). Conceptual graphs. ISO/IEC International Standard,
International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).


Lagoze, C., Lynch, C., & Daniel, R. (1996). The warwick framework: A container
architecture for aggregating sets of metadata. Technical Report TR 96-1593,
Cornell University, Ithaca, New York.
Lassila, O., & Swick, R.R. (1999). Resource description framework (RDF) model and
syntax specification. W3C Recommendation, World Wide Web Consortium (W3C).
Leach, P. J. (1998, February). UUIDs and GUIDs. Network Working Group Internet-Draft,
The Internet Engineering Task Force (IETF).
Matena, V., & Hapner, M. (1998). Enterprise Java Beans TM. Specification Version 1.0,
Sun Microsystems Inc.
Nejdl, W., Wolf, B., Qu, C., et al. (2002). EDUTELLA: A P2P networking infrastructure
based on RDF. Proceedings of the Eleventh International World Wide Web
Conference (WWW 2002), Honolulu, Hawaii.
Newman, D., Patterson, A., & Schmitz, P. (2002). XHTML+SMIL profile. W3C Note,
World Wide Web Consortium (W3C).
Pereira, F., & Ebrahimi T., (Eds.) (2002). The MPEG-4 book. CA: Pearson Education
Reich, S., Behrendt, W., & Eichinger, C. (2000). Document models for navigating digital
libraries. Proceedings of the Kyoto International Conference on Digital Libraries, Kyoto, Japan.
Raggett, D., Le Hors, A., & Jacobs, I. (1999). HTML 4.01 specification. W3C Recommendation, World Wide Web Consortium (W3C).

ENDNOTE
1. See http://www.cultos.org for more details on the project.


Chapter 14

Semantically Driven Multimedia Querying and Presentation
Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA

ABSTRACT

Semantics can play an important role in multimedia content retrieval and presentation.
Although a complete semantic description of a multimedia object may be difficult to
generate, we show that even a limited description can be exploited to provide significant added functionality in the retrieval and presentation of multimedia. In this chapter we describe DelaunayView, a system that supports distributed and heterogeneous multimedia sources and provides a flexible, semantically driven approach to the selection and display of multimedia content.

INTRODUCTION
The goal of a semantically driven multimedia retrieval and presentation system is to explore the semantics of the data so as to provide the user with rich selection criteria
and an expressive set of relationships among the data, which will enable the meaningful
extraction and display of the multimedia objects. The major obstacle in developing such
a system is the lack of an accurate and simple way of extracting the semantic content that
is encapsulated in multimedia objects and in their inter-relationships. However, metadata
that reflect multimedia semantics may be associated with multimedia content. While
metadata may not be equivalent to an ideal semantic description, we explore and
demonstrate its possibilities in our proposed framework. DelaunayView is envisioned as
a system that allows users to retrieve multimedia content and interactively specify its
presentation using a semantically driven approach.
DelaunayView incorporates several ideas from the earlier systems Delaunay (Cruz &
Leveille, 2000) and DelaunayMM (Cruz & James, 1999). In the DelaunayView framework,
multimedia content is stored in autonomous and heterogeneous sources annotated with
metadata descriptions in resource description framework (RDF) format (Klyne & Carroll,
2004). One such source could be a database storing scientific aerial photographs and
descriptions of where and when the photographs were taken. The framework provides
tools for specifying connections between multimedia items that allow users to create an
integrated virtual multimedia source that can be queried using RQL (Karvounarakis et al.,
2002) and keyword searches. For example, one could specify how a location attribute from
the aerial photo database maps to another location attribute of an infrared satellite image
database so that a user can retrieve images of the same location from both databases.
In DelaunayView, customizable multimedia presentation is enabled by a set of
graphical interfaces that allow users to bind the retrieved content to presentation
templates (such as slide sorters or bipartite graphs), to specify content layout on the
screen, and to describe how the dynamic visual interaction among multimedia objects can
reflect the semantic relationships among them. For example, a user can specify that aerial
photos will be displayed in a slide sorter on the left of the workspace, satellite images
in another slide sorter on the bottom of the workspace, and that when a user selects a
satellite image, the aerial photos will be reordered so that the photos related to the
selected image appear first in the sorter.
In this chapter we describe our approach to multimedia querying and presentation and
focus on how multimedia semantics can be used in these activities. In Background we
discuss work in multimedia presentation, retrieval, and description; we also introduce
concepts relating to metadata modeling and storage. In A Pragmatic Approach to
Multimedia Presentation, we present a case study that illustrates the use of our system
and describe the system architecture. In Future Work we describe future research
directions and summarize our findings in Conclusions.

BACKGROUND
A multimedia presentation system relies on a number of technologies for describing, retrieving and presenting multimedia content. XML (Bray et al., 2000) is a widely
accepted standard for interoperable information exchange. MPEG-7 (Martínez, 2003; Chang et al., 2001) makes use of XML to create rich and flexible descriptions of multimedia content. DelaunayView relies on multimedia content descriptions for the retrieval and presentation of content, but it uses RDF (Klyne & Carroll, 2004) rather than XML. We chose RDF over XML because of its richer modeling capabilities, whereas in other components of the DelaunayView system we have used XML (Cruz & Huang, 2004).
XML specifies a way to create structured documents that can be easily exchanged
over the Web. An XML document contains elements that encapsulate data. Attributes
may be used to describe certain properties of the elements. Elements participate in

hierarchical relationships that determine the document structure. XML Schema (Fallside,
2001) provides tools for defining elements, attributes, and document structure. One can
define typed elements that act as building blocks for a particular schema. XML Schema
also supports inheritance, namespaces, and uniqueness.
MPEG-7 (Martínez, 2003) defines a set of tools for creating rich descriptions of
multimedia content. These tools include Descriptors, Description Schemes (DS)
(Salembier & Smith, 2001) and the Description Definition Language (DDL) (Hunter,
2001). MPEG-7 descriptions can be expressed in XML or in binary format. Descriptors
represent low-level features such as texture and color that can be extracted automatically.
Description Schemes are composed of multiple Descriptors and Description Schemes to
create more complex descriptions of the content. For example, the MediaLocator DS
describes the location of a multimedia item. The MediaLocator is composed of the
MediaURL descriptor and an optional MediaTime DS: the former contains the URL that
points to the multimedia item, while the latter is meaningful in the case where the
MediaLocator describes an audio or a video segment. Figure 1 shows an example of a MediaLocator DS and its descriptors RelTime and Duration, which describe, respectively, the start time of a segment relative to the beginning of the entire piece and the duration of the segment.

Figure 1. MediaLocator description scheme
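Figure 1 shows an actual example; purely as a schematic sketch (the element names below follow the description just given rather than the normative MPEG-7 schema, and the values are invented), such a locator for a video segment could look roughly like this:

<!-- schematic sketch only; element names and value formats may differ from the MPEG-7 standard -->
<MediaLocator>
  <MediaURL>http://example.org/media/lecture.mpg</MediaURL>
  <MediaTime>
    <!-- start of the segment relative to the beginning of the entire piece -->
    <RelTime>PT0H05M30S</RelTime>
    <!-- length of the segment -->
    <Duration>PT0H01M00S</Duration>
  </MediaTime>
</MediaLocator>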
The Resource Description Framework (RDF) offers an alternative approach to
describing multimedia content. An RDF description consists of statements about
resources and their properties. An RDF resource is any entity identifiable by a URI. An
RDF statement is a triple consisting of subject, predicate, and object. The subject is the
resource about which the statement is being made. The predicate is the property being
described. The object is the value of this property. RDF Schema (RDFS) (Brickley &
Guha, 2001) provides mechanisms for defining resource classes and their properties. If
an RDF document conforms to an RDF schema (expressed in RDFS), resources in the
document belong to classes defined in the schema. A class definition includes the class
name and a list of class properties. A property definition includes the domain (the subject of the corresponding RDF triple) and the range (the object). The RDF Query Language
(RQL) (Karvounarakis et al., 2002) is a query language for RDF and RDFS documents. It
supports a select-from-where structure, basic queries and iterators that combine the
basic queries into nested and aggregate queries, and generalized path expressions.
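As a minimal illustration of the triple model (the ex namespace and property names are invented for this example), the following RDF/XML fragment makes two statements about an image resource; in each statement the image is the subject, ex:creator or ex:title is the predicate, and the literal value is the object:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <!-- the subject resource, identified by its URI -->
  <rdf:Description rdf:about="http://example.org/images/42">
    <ex:creator>J. Smith</ex:creator>
    <ex:title>Aerial view of the Arctic ice sheet</ex:title>
  </rdf:Description>
</rdf:RDF>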
MPEG-7 Description Schemes and DelaunayView differ in their approach to multimedia semantics. In MPEG-7, semantics are represented as a distinct description scheme that is narrowly aimed at narrative media. The Semantic DS includes description schemes for places, objects, events, and agents - people, groups, or other active entities -
that operate within a narrative world associated with a multimedia item. In DelaunayView
any description of multimedia content can be considered a semantic description for the
purposes of multimedia retrieval. DelaunayView recognizes that, depending on the
application, almost any description may be semantically valuable. For example, an aerial
photo of the arctic ice sheet depicts some areas of intact ice sheet and others of open
water. Traditional image processing techniques can be applied to the photo to extract
light and dark regions that represent ice and open water respectively. In the climate
research domain, the size, shape, and locations of these regions constitute the semantic
description of the image. In MPEG-7 however, this information will be described with the
StillRegion DS, which does not carry semantic significance.
Beyond their diverse perspectives on the nature of the semantics that they
incorporate, MPEG-7 and DelaunayView use different approaches to representing semantic descriptions: MPEG-7 uses XML, while DelaunayView uses RDF. An XML document
is structured according to the tree paradigm: each element is a node and its children are
the nodes that represent its subelements. An RDF document is structured according to
the directed graph paradigm: each resource is a node and each property is a labeled
directed edge from the subject to the object of the RDF statement. Unlike XML, where
schema and documents are separate trees, an RDF document and its schema can be
thought of as a single connected graph. This property of RDF enables straightforward
implementation of more powerful keyword searches as a means of selecting multimedia
for presentation. Thus using RDF as an underlying description format gives users more
flexibility in selecting content for presentation.
Another distinction between MPEG-7 and DelaunayView is the latter's focus on multimedia presentation. A reference model for intelligent multimedia presentation systems encompasses an architecture consisting of control, content, design,
realization, and presentation display layers (Bordegoni et al., 1997). The user interacts
with the control layer to direct the process of generating the presentation. The content
layer includes the content selection component that retrieves the content, the media
allocation component that determines in what form content will be presented, and
ordering components. The design layer produces the presentation layout and further
defines how individual multimedia objects will be displayed. The realization layer
produces the presentation from the layout information provided by the design layer. The
presentation display layer displays the presentation. Individual layers interact with a
knowledge server that maintains information about customization.
LayLab demonstrates an approach to multimedia presentation that makes use of
constraint solving (Graf, 1995). This approach is based on primitive graphical constraints
such as under or beside that can be aggregated into complex visual techniques (e.g.,
alignment, ordering, grouping, and balance). Constraint hierarchies can be defined to
specify design alternatives and to resolve overconstrained states. Geometrical placement heuristics are constructs that combine constraints with control knowledge.
Additional work in multimedia presentation and information visualization can be
found in Baral et al. (1998), Bes et al. (2001), Cruz and Lucas (1997), Pattison and Phillips
(2001), Ram et al. (1999), Roth et al. (1996), Shih and Davis (1997), and Weitzman and
Wittenburg (1994).


A PRAGMATIC APPROACH TO MULTIMEDIA PRESENTATION
In our approach to the design of a multimedia presentation system, we address the
following challenges. Multimedia content is resident in distributed, heterogeneous, and
autonomous sources. However, it is often necessary to access content from multiple
sources. The data models and the design of the sources vary widely and are decided upon
autonomously by the various entities that maintain them. Our approach accommodates
this diversity by using RDFS to describe the multimedia sources in a simple and flexible
way. The schemata are integrated into a single global schema that enables users to
access the distributed and autonomous multimedia sources as if they were a single
source. Another challenge is that the large volume of multimedia objects presented to
the user makes it difficult to perceive and understand the relationships among them. Our
system gives users the ability to construct customized layouts, thus making the semantic
relationships among multimedia objects more obvious.

Case Study
This case study illustrates how multimedia can be retrieved and presented in an integrated view workspace, using as an example a bill of materials for the aircraft industry.
A bill of materials is a list of parts or components required to build a product. In Figure
2, the manufacturing of commercial airplanes is being planned using a coordinated
visualization composed of three views: a bipartite graph, a bar chart, and a slide sorter.
The bipartite graph illustrates the part-subpart relationship between commercial aircraft and their engines, the bar chart displays the number of engines currently available in the inventory of a plant or plants, and the slide sorter shows the maps associated with the manufacturing plants.

Figure 2. A coordinated integrated visualization
First, the user constructs a keyword query using the Search Workspace to obtain
a data set. This process may be repeated several times to get data sets related to airplanes,
engines, and plants. The user can preview the data retrieved from the query, further refine
the query, and name the data set for future use.
Then, relationships are selected (if previously defined) or defined among the data
sets, using metadata, a query, or user annotations. In the first two cases, the user selects
a relationship that was provided by the integration layer. An example of such a
relationship would be the connection that is established between the attribute engine of
the airplane data set (containing one engine used in that airplane) and the engine data
set. Other more complex relationships can be established using an RQL query.
Yet another type of relationship can be a connection that is established by the user.
This interface is shown in Figure 3. In this figure and those that follow, the left panel
contains the overall navigation mechanism associated with the interface, allowing for
any other step of the querying or visualization process to be undertaken. Note that we
chose the bipartite component to provide visual feedback when defining binary relationships. This is the same component that is used for the display of bipartite graphs.
The next step involves creating the views, which are built using templates. A data
set can be applied to different templates to form different views. The interface of Figure
4 illustrates a slide sorter of the maps where the manufacturers of aircraft engines are
located. In this process, data attributes of the data set are bound to visual attributes of
the visual template. For example, the passenger capacity of a plane can be applied to the
height of a bar chart. The users also can further change the view to conform to their
preferences, for example, by changing the orientation of a bar chart from vertical to
horizontal. The sorter allows the thumbnails to be sorted by the values of any of the
attributes of the objects that are depicted by the thumbnails. Individual views can be laid
out anywhere on the panel as shown in Figure 5. The user selects the kind of dynamic
interaction between every pair of views by using a simple customization panel.
Figure 3. Relation workspace


Figure 4. Construction of a view

Figure 5. View layout

In the integrated view, the coordination between individual views has been
established. By selecting a manufacturing plant in the slide sorter, the bar chart displays the inventory situation of the selected plant, for example, the availability of each type of
airplane engine. By selecting more plants in the sorter, the bar chart can display the
aggregate number of available engines over several plants for each type of airplane
engine.
There are two ways of displaying relationships: they can be either represented
within the same visualization (as in the bipartite graph of Figure 2) or as a dynamic
relationship between two different views, as in the interaction between the bar chart and
sorter views. Other interactions are possible in our case study. For example, the bipartite
graph can also react to the user selections on the sorter. As more selections of plants

are performed on the sorter, different types of engines produced by the selected
manufacturer(s) appear highlighted. Moreover, the bipartite graph view can be
refreshed to display only the relationship between the corresponding selected items in
the two data sets.

System Architecture

The DelaunayView system is composed of the presentation, integration, and data
layers. The data layer consists of a number of autonomous and heterogeneous
multimedia data sources that contain images and metadata. The integration layer
connects the individual sources into a single integrated virtual source that makes
multimedia from the distributed sources available to the presentation layer. The
presentation layer includes user interface components that allow users to query the
multimedia sources and to specify how the images returned by the queries should be
displayed and what should be the interaction among those images.

Data Layer

The data layer comprises a number of autonomous multimedia sources that
contain images annotated with metadata. An image has a number of attributes associated
with it. First there are the low-level features that can be extracted automatically. In
addition, there are the application-dependent attributes that may include timestamps,
provenance, text annotations, and any number of other relevant characteristics. All of
these attributes determine the semantics of the image in the application context. Image
attributes are described by an RDF schema. When an image is added to a multimedia
source, it is given a unique identifier and stored in a binary string format. A document
fragment containing image metadata is created and stored in the database.
Example 1: Let us consider a multimedia source that contains aerial photos of the
Arctic ice sheet used in climate research. The relevant metadata attributes include the date and time a photo was taken and the latitude and longitude of the location where it was taken. A photo taken on 08/22/2003 14:07:23 at 81°48'N, 140°E is represented in the
following way: the image file is stored in table arctic (Figure 6) with identifier QZ297492.
The RDF document fragment containing the metadata and referencing the image is shown
in Figure 7.
Figure 6. Table arctic

Figure 7. Arctic aerial photo metadata document

In Figure 7, Line 3 declares the namespace aerial, which contains the source schema. Lines 5-8 contain the RDF fragment that describes the object-relational schema of table arctic; arctic.imageId contains a unique identifier and arctic.imageValue contains the image itself. Note that only the attributes that are a part of the reference to the image are
described; that is, arctic.source is omitted. Lines 9-14 describe the metadata attributes
of an aerial photo. They include timestamp, longitude, and latitude. The property
reference does not describe a metadata attribute, but rather acts as a reference to the
image object.
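The exact serialization is given in Figure 7; purely to make the description concrete, the metadata part of such a fragment could look roughly as follows (the namespace URI, resource URIs, and literal formats are assumptions made for this sketch):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aerial="http://example.org/aerial#">
  <!-- illustrative sketch; the document in Figure 7 may differ in detail -->
  <aerial:image rdf:about="http://example.org/aerial/photos/QZ297492">
    <aerial:timestamp>2003-08-22T14:07:23</aerial:timestamp>
    <aerial:latitude>81°48'N</aerial:latitude>
    <aerial:longitude>140°E</aerial:longitude>
    <!-- reference points to the image object stored in table arctic under imageId QZ297492 -->
    <aerial:reference rdf:resource="http://example.org/aerial/locations/QZ297492"/>
  </aerial:image>
</rdf:RDF>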
The source schema is shown in Figure 8. Lines 4-12 define class imageLocation with
properties key and value. Lines 13-29 define class image with properties timestamp,
longitude, latitude, and reference. Every time an image is added to the source, an RDF fragment conforming to this schema is created and stored in the database. Although each fragment will contain a description of table arctic, this description will be stored only once.

Figure 8. RDFS schema describing arctic photo metadata
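Again only as a sketch of the class and property definitions just described (the exact RDFS constructs and identifiers used in Figure 8 may differ):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- class describing the stored image tuple and its key/value columns -->
  <rdfs:Class rdf:ID="imageLocation"/>
  <rdf:Property rdf:ID="key">
    <rdfs:domain rdf:resource="#imageLocation"/>
  </rdf:Property>
  <rdf:Property rdf:ID="value">
    <rdfs:domain rdf:resource="#imageLocation"/>
  </rdf:Property>
  <!-- class describing an aerial photo; longitude and latitude are defined analogously to timestamp -->
  <rdfs:Class rdf:ID="image"/>
  <rdf:Property rdf:ID="timestamp">
    <rdfs:domain rdf:resource="#image"/>
  </rdf:Property>
  <rdf:Property rdf:ID="reference">
    <rdfs:domain rdf:resource="#image"/>
    <rdfs:range rdf:resource="#imageLocation"/>
  </rdf:Property>
</rdf:RDF>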
We use the RDFSuite (Alexaki et al., 2000) to store RDF and RDFS data. The
RDFSuite provides both persistent storage for the RDF and RDFS data and the implementation of the RQL language. The RDFSuite translates RDF data and schemata into
object-relational format and stores them in a PostgreSQL database. The RQL interpreter
generates SQL queries over the object-relational representation of RDF data and schema
and processes RQL path expressions.

Integration Layer

The integration layer combines all multimedia sources into a single integrated
virtual source. In the context of this layer, a multimedia source is a local source and its
source schema is a local schema. The integrated virtual source is described by the global
schema, which is obtained as a result of the integration of the sources. DelaunayView uses
foreign key relationships to connect individual sources into the integrated virtual source.
Implicit foreign key relationships exist between local sources, but they only become
apparent when all local sources are considered as a whole. The global schema is built
by explicitly defining foreign key relationships. A sequence of foreign key definitions
yields a graph where the local schemata are the subgraphs and the foreign key
relationships are the edges that connect them.
The foreign key relationships are defined with the help of the graphical integration
tool of Figure 9. This tool provides a simple graphical representation of the schemata that
are present in the system and enables the user to specify foreign key relationships
between them. When the user imports a source into the system, its schema is represented
on the left-hand side panel as a box. Individual schemata are displayed on the right-hand
side pane as trees. The user defines a foreign key by selecting a node in each schema that
participates in the relationship, and connecting them by an edge. Figure 9 shows how
a foreign key relationship is defined between airplane and engine schemata. The edge
between engine and name represents that relationship. The graphical integration tool
generates an RDF document that describes all the foreign key relationships defined by
the user.

Figure 9. Graphical integration tool
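Purely as a hypothetical illustration (the fk vocabulary below is invented for this sketch and need not match the format actually produced by the tool), a single foreign key definition such as the one between the airplane and engine schemata could be recorded along these lines:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fk="http://example.org/foreignKey#">
  <!-- hypothetical vocabulary; records that property engine in the airplane schema refers to property name in the engine schema -->
  <fk:ForeignKey rdf:about="http://example.org/integration/fk1">
    <fk:fromSchema>airplane</fk:fromSchema>
    <fk:fromProperty>engine</fk:fromProperty>
    <fk:toSchema>engine</fk:toSchema>
    <fk:toProperty>name</fk:toProperty>
  </fk:ForeignKey>
</rdf:RDF>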
The integration layer contains the mediator engine and the schema repository. The
mediator engine receives queries from the presentation layer, issues queries to the data
sources, and passes results back to the presentation layer. The schema repository
contains the description of the global schema and the mappings from global to local
schemata. The mediator engine receives queries in terms of the global schema, global
queries, and translates them into queries in terms of the local schemata of the individual
sources, local queries, using the information available from the schema repository. We
demonstrate how local queries are obtained by the following example.
Example 2: The engine database and the airplane database are two local sources
and engine name connects the local schemata. In the airplane schema (Figure 10), engine
name is a foreign key and is represented by the property power-plant and in the engine
schema (Figure 11) it is the key and is represented by the property name. Mappings from
the global schema (Figure 12) to the local schema have the form ([global name], ([local
name], [local schema])). We say that a class or a property in the global schema, x, maps
to a local schema S when (x, (y, S)) is in the set of the mappings. For this example, this
set is:
(airplane, (airplane, S1)),
(type, (type, S1)),
(power-plant, (power-plant, S1)),
(power-plant, (name, S2)),
(engine, (engine, S2)),
(thrust, (thrust, S2)),
(name, (name, S2))
Figure 10. Airplane schema S1

Figure 11. Engine schema S2

Figure 12. Global schema

All the mappings are one-to-one, except for the power-plant property that connects the two schemata; power-plant belongs to a set of foreign key constraints maintained
by the schema repository. These constraints are used to connect the set of results from
the local queries.
The global query QG returns the types of airplanes that have engines with thrust
of 115,000 lb:
select B
from {A}type{B}, {A}power-plant{C}, {C}thrust{D}
where D = 115000 lbs
The mediator engine translates QG into QL1, which is a query over the local schema S1, and QL2, which is a query over the local schema S2. The from clause of QG contains three path expressions: {A}type{B}, which contains property type that maps to S1; {A}power-plant{C}, which contains property power-plant that maps both to S1 and to S2; and {C}thrust{D}, which contains property thrust that maps to S2.
To obtain the from clause of a local query, the mediator engine selects those path
expressions that contain classes or properties that map to the local schema. The from
clause of QL1 is: {A}type{B}, {A}power-plant{C}. Similarly, the where clause of a local
query contains only those variables of the global where clause that appear in the local
from clause. D, which is the only variable in the global where clause, does not appear
in the from clause of QL1, so the where clause of QL1 is absent.
The select clause of a local query includes variables that appear in the global select
clause and in the local from clause. B is a part of the global select clause and it appears
in the from clause of QL1, so it will appear in the select clause as well. In addition to the
variables from the global select clause, a local select clause contains variables that are
necessary to perform a join of the local results in order to obtain the global result. These
are the variables is the local from clause that refer to elements of the foreign key
constraint set. C refers to the value of power-plant, which is the only foreign key
constraint, so C is included in the local select clause. Therefore, QL1 is as follows:
select B, C
from {A}type{B}, {A}power-plant{C}
The from clause of QL2 should include {A}power-plant{C} and {C}thrust{D};
power-plant maps to name and thrust maps to thrust in S2. Since D in the global where
clause maps to S2, the local where clause contains D and the associated constraint: D
= 115000 lbs. The global select clause does not contain any variables that map to S2,
so the local select clause contains only the foreign key constraint variable C. The
intermediate version of QL2 is:
select C
from {C}thrust{D}, {A}power-plant{C}
where D = 115000 lbs
The intermediate version of QL2 contains {A}power-plant{C} because power-plant
maps to name in S2. However, {A}power-plant{C} is a special case because it is a foreign
key constraint: we must check whether variables A and C refer to resources that map to
S2. A refers to airplane, therefore it does not map to S2 and {A}power-plant{C} should
be removed from QL2. The final version of QL2 is:
select C
from {C}thrust{D}
where D = 115000 lbs
In summary, the integration layer connects the local sources into the integrated
virtual source and makes it available to the presentation layer. The interface between the
integration and presentation layers includes the global schema provided by the integration layer, the queries issued by the presentation layer, and the results returned by the
integration layer.

Presentation Layer

The presentation layer enables the user to query the distributed multimedia
sources and to create complex multicomponent coordinated layouts to display the query
results. The presentation layer sends user queries to the integration layer and receives
the data sets, which are the query results. A view is created when a data set is attached
to a presentation template that determines how the images in the data set are to be
displayed. The user specifies the position and the orientation of the view and the
dynamic interaction properties of views in the integrated layout.
Images and metadata are retrieved from the multimedia sources by means of RQL
queries to the RDF multimedia annotations stored at the local sources. In addition to RQL
queries, the user may issue keyword searches. A keyword search has three components:
the keyword, the criteria, and the source. Any of the components is optional. A keyword
will match the class or property names in the schema. The criteria match the values of
properties in the metadata RDF document. The source restricts the results of the query
to that multimedia source. The data sets returned by the integration layer are encapsulated in the data descriptors that associate the query, layout, and view coordination
information with the data set. The following example illustrates how a keyword query gets
translated into an RQL query:
Example 3: The keyword search where keyword = airplane, criteria = Boeing,
and source = aircraftDataSource returns resources that are of class airplane or have
a property airplane, have the property value Boeing, and are located in the source
aircraftDataSource. This search is translated into the RQL query of Figure 13 and sent to source aircraftDataSource by the integration layer.

Figure 13. Translation of a keyword search to an RQL query
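As a rough simplification, and using only the select-from-where pattern already shown in Example 2, the part of this search that matches a property airplane with the value Boeing might take the following form; the actual query in Figure 13 is more involved, since it must also cover matches on the class name and restrict the search to the given source:

select A
from {A}airplane{B}
where B = "Boeing"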
DelaunayView includes predefined presentation templates that allow the user to build
customized views. The user chooses attributes of the data set that correspond to
template visual attributes. For example, a view can be defined by attaching the arctic
photo dataset (see Example 1) to the slide sorter template, setting the order-by property
of the view to the timestamp attribute of the data set, and setting the image source
property of the view to the reference attribute of the data set. When a tuple in the data
set is to be displayed, image references embedded in it are resolved and images are
retrieved from multimedia sources.
The user may further customize views by specifying their orientation, position
relative to each other, and coordination behavior. Views are coordinated by specifying
a relationship between the initiating view and the destination view. The initiating view
notifies the destination view of initiating events. An initiating event is a change of view state caused by a user action, such as selecting an image in a slide sorter. The destination view responds to initiating events by changing its own state
according to the reaction model selected by the user. Each template defines a set of
initiating events and reaction models.
In summary, semantics play a central role in the DelaunayView architecture. The data
layer makes semantics available as the metadata descriptions and the local schemata. The
integration layer enables the user to define the global schema that adds to the semantics
provided by the data layer. The presentation layer uses semantics provided by the data
and integration layers for source querying, view definition, and view coordination.


FUTURE WORK
Our future work will further address the decentralized nature of the data layer.
DelaunayView can be viewed as a single node in a network of multimedia sources. This
network can be considered from two different points of view. From a centralized
perspective, the goal is to create a single consistent global schema with which queries
can be issued to the entire network as if it were formed by a single database. From a
decentralized data acquisition point of view, the goal is to answer a query submitted at
one of the nodes. The network becomes relevant when data are required that are not
present at the local node.
In the centralized approach, knowledge of the entire global schema is required to answer a query, while in the decentralized approach only knowledge of paths to the required information is necessary. In the centralized approach, the global schema is static. Local sources are connected to each other one by one, resulting in a global schema that must be modified whenever a local schema changes. Under the decentralized
approach, the integration process can be performed at the time the query is created
(automatically in an ideal system) by discovering the data available at the other nodes.
A centralized global schema must resolve inconsistencies in schema and data in a
globally optimal manner. Under the decentralized approach inconsistencies have to be
resolved only at the level of that node.
The goal of our future work will be to extend DelaunayView to a decentralized peer-to-peer network. Under this architecture, the schema repository will connect to its
neighbors to provide schema information to the mediator engine. Conceptually, a request
for schema information will be recursively transmitted throughout the network to retrieve
the current state of the distributed global schema, but our implementation will adapt
optimization techniques from the peer-to-peer community to make schema retrieval
efficient. The implementation of the mediator engine and the graphical integration tool
will be modified to accommodate the new architecture.
Another goal is to incorporate MPEG-7 Feature Extraction Tools into the framework. Feature extraction can be incorporated into the implementation of the graphical
integration tool to perform automatic feature extraction on the content of the new sources
as they are added to the system. This capability will add another layer of metadata
information that will enable users to search for content by specifying low-level features.

CONCLUSIONS
We have discussed our approach to multimedia presentation and querying from a
semantic point of view, as implemented by our DelaunayView system. This chapter describes
how multimedia semantics can be used to enable access to distributed multimedia
sources and to facilitate construction of coordinated views. Semantics are derived from
the metadata descriptions of multimedia objects in the data layer. In the integration layer,
schemata that describe the metadata are integrated into a single global schema that
enables users to view a set of distributed multimedia sources as a single unified source.
In the presentation layer, the system provides a framework for creating customizable
integrated layouts that highlight semantic relationships between the multimedia objects.
The user can retrieve multimedia data sets by issuing RQL queries or keyword searches.

The datasets thus obtained are mapped to presentation templates to create views. The
position, the orientation, and the dynamic interaction of views can be interactively
specified by the user. The view definition process involves the mapping of metadata
attributes to the graphical attributes of a template. The view coordination process
involves the association of metadata attributes from two datasets and the specification
of how the corresponding views interact. By using the metadata attributes, both the view
definition and the view coordination processes take advantage of the multimedia
semantics.

ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation under
Awards ITR-0326284 and EIA-0091489.
We are grateful to Yuan Feng Huang and to Vinay Bhat for their help in implementing
the system, and to Sofia Alexaki, Vassilis Christophides, and Gregory Karvounarakis
from the University of Crete for providing timely technical support of the RDFSuite.

REFERENCES

Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., & Tolle, K. (2000). The
RDFSuite: Managing voluminous RDF description bases. Technical report,
Institute of Computer Science, FORTH, Heraklion, Greece. Online at http://www.ics.forth.gr/proj/isst/RDF/RSSDB/rdfsuite.pdf
Baral, C., Gonzalez, G., & Son, T. C. (1998). Design and implementation of display
specifications for multimedia answers. In Proceedings of the 14th International
Conference on Data Engineering, (pp. 558-565). IEEE Computer Society.
Bes, F., Jourdan, M., & Khantache, F. (2001). A generic architecture for automated construction of multimedia presentations. In the Eighth International Conference
on Multimedia Modeling.
Bordegoni, M., Faconti, G., Feiner, S., Maybury, M., Rist, T., Ruggieri, S., et al. (1997).
A standard reference model for intelligent multimedia presentation systems.
Computer Standards and Interfaces, 18(6-7), 477-496.
Bray, T., Paoli, J., Sperberg-McQueen, C., & Maler, E. (2000). Extensible markup
language (XML) 1.0 (second edition). W3C Recommendation 6 October 2000.
Online at http://www.w3.org/TR/2000/REC-xml-20001006
Brickley, D., & Guha, R. (2001). RDF vocabulary description language 1.0: RDF schema.
W3C Recommendation 10 February 2004. Online at http://www.w3.org/TR/2004/
REC-rdf-schema-20040210
Cruz, I. F., & Huang, Y. F. (2004). A layered architecture for the exploration of
heterogeneous information using coordinated views. In Proceedings of the IEEE
Symposium on Visual Languages and Human-Centric Computing (to appear).
Cruz, I. F., & James, K. M. (1999). User interface for distributed multimedia database
querying with mediator supported refinement. In International Database Engineering and Application Symposium (pp. 433-441).


Cruz, I. F., & Leveille, P. S. (2000). Implementation of a constraint-based visualization


system. In IEEE Symposium on Visual Languages (pp. 13-20).
Cruz, I. F., & Lucas, W. T. (1997). A visual approach to multimedia querying and
presentation. In Proceedings of the Fifth ACM international conference on
Multimedia (pp. 109-120).
Fallside, D. (2001). XML schema part 0: Primer. W3C Recommendation, 2 May 2001.
Online at http://www.w3.org/TR/2001/REC-xmlschema-0-20010502
Graf, W. H. (1995). The constraint-based layout framework LayLab and its applications.
In Proceedings of ACM Workshop on Effective Abstractions in Multimedia,
Layout and Interaction, San Francisco.
Hunter, J. (2001). An overview of the MPEG-7 Description Definition Language (DDL).
IEEE Transactions on Circuits and Systems for Video Technology, 11(6), 765-772.
Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002).
RQL: A declarative query language for RDF. In the 11th International World Wide
Web Conference (WWW2002).
Klyne, G., & Carroll, J. (2004). Resource description framework (RDF): Concepts and
abstract syntax. W3C Recommendation 10 February 2004. Online at http://
www.w3.org/TR/2004/REC-rdf-concepts-20040210
Martínez, J. M. (Ed.) (2003). MPEG-7 overview. ISO/IEC JTC1/SC29/WG11 N5525.
Pattison, T., & Phillips, M. (2001) View coordination architecture for information
visualisation. In Australian Symposium on Information Visualisation, 9, 165-169.
Ram, A., Catrambone, R., Guzdial, M.J., Kehoe, C.M., McCrickard, D.S., & Stasko, J. T.
(1999). PML: Adding flexibility to multimedia presentations. IEEE Multimedia,
6(2), 40-52.
Roth, S. F., Lucas, P., Senn, J. A., Gomberg, C. C., Burks, M. B., Stroffolino, P. J., et al.
(1996) Visage: A user interface environment for exploring information. In Information Visualization, 3-12.
Salembier, P., & Smith, J. R. (2001). MPEG-7 multimedia description Schemes. IEEE
Transactions on Circuits and Systems for Video Technology, 11(6), 748-759.
Shih, T. K., & Davis, R. E. (1997). IMMPS: A multimedia presentation design system. IEEE
Multimedia, 4(2), 67-78.
Weitzman, L., & Wittenburg, K. (1994). Automatic presentation of multimedia documents
using relational grammars. In Proceedings of the Second ACM International
Conference on Multimedia (pp. 443-451). ACM Press.

Section 5
Emergent Semantics


Chapter 15

Emergent Semantics: An Overview

Viranga Ratnaike, Monash University, Australia
Bala Srinivasan, Monash University, Australia
Surya Nepal, CSIRO ICT Centre, Australia

ABSTRACT

The semantic gap is recognized as one of the major problems in managing multimedia
semantics. It is the gap between sensory data and semantic models. Often the sensory
data and associated context compose situations which have not been anticipated by
system architects. Emergence is a phenomenon that can be employed to deal with such
unanticipated situations. In the past, researchers and practitioners paid little attention
to applying the concepts of emergence to multimedia information retrieval. Recently,
there have been attempts to use emergent semantics as a way of dealing with the
semantic gap. This chapter aims to provide an overview of the field as it applies to
multimedia. We begin with the concepts behind emergence, cover the requirements of
emergent systems, and survey the existing body of research.

INTRODUCTION
Managing media semantics should not necessarily involve semantic descriptions
or classifications of media objects for future use. Information needs, for a user, can be
task dependent, with the task itself evolving and not known beforehand. In such
situations, the semantics and structure will also evolve, as the user interacts with the
content, based on an abstract notion of the information required for the task. That is,
users can interpret multimedia content, in context, at the time of information need. One
way to achieve this is through a field of study known as emergent semantics.

Emergence is the phenomenon of complex structures arising from interactions
between simple units. Properties or features appear that were not previously observed
as functional characteristics of the units. Though constraints on a system can influence
the formation of the emergent structure, they do not directly describe it. While
emergence is a new concept in multimedia, it has been used in fields such as biology,
physics and economics, as well as having a rich philosophical history. To the best of our
knowledge, commercial emergent systems do not currently exist. However, there is
research into the various technologies that would be required. This chapter aims to
outline some characteristics of emergent systems and relevant tools and techniques.
The foundation for computational emergence is found in the Constrained Generating Procedures (CGP) of John Holland (Holland, 2000). Initially there are only simple units
and mechanisms. These mechanisms interact to form complex mechanisms, which in turn
interact to form very complex mechanisms. This interaction results in self-organization
through synthesis. If we relate the concept of CGP to multimedia, the simple units are
sensory data, extracted features or even multimedia objects. Participating units can also
come from other sources such as knowledge bases. Semantic emergence occurs when
meaningful behaviour (phenotype) or complex semantic representation (genotype)
arises from the interaction of these units. This includes user interaction, the influence
of context, and relationships between media.
Context helps to deal with the problem of subjectivity, which occurs when there are
multiple interpretations of a multimedia instance. World knowledge and context help to
select one interpretation from the many. Ideally, we want to form semantic structures that
can be understood by third parties who do not have access to the multimedia instance.
This is not the same as relevance to the user. A system might want to determine what is
of interest to one user, and have that understood by another.
We note that a multimedia scene is not reality; it is merely a reference to a referent
in reality. Similarly, the output from emergence is a reference, hopefully useful to the user
of the information. We use the linguistic terms reference and referent to indicate
existence in the modeled world and the real world, respectively. There is a danger
in confusing the two (Minsky, 1988). The referenced meaning is embedded in our
experience. This is similar to attribute binding using Dublin Core metadata (Hillmann,
2003), where the standard attribute name is associated with the commonly understood
semantic.
The principal benefit of emergence is dealing with unanticipated situations. Units
in unanticipated configurations or situations will still interact with each other in simple
ways. Emergent systems, ideally, take care of themselves, without needing intervention
or anticipation on the part of the system architect (Staab 2002). However, the main
advantage of emergent semantics is also its greatest flaw. As well as dealing with
unanticipated situations, it can also produce unanticipated results. We cannot control
the outcomes. They might be useful, trivial or useless, or in the worst case
misleading. However, we can constrain the scope of output by constraining the inputs
and the ground truths. We can also ask for multiple interpretations. Sometimes, a
structure is better understood if one can appreciate the other forms it can take.
In the next section, we state the requirements of emergent semantics. This will be
followed by a description of existing research. In the last section, we identify gaps in the
research, and suggest future directions.


EMERGENT SYSTEMS
Both complete order (regularity) and complete chaos (randomness) are very simple.
Complexity occurs between the two, at a place known as the edge of chaos (Langton,
1990). Emergence results in complex systems, forming spontaneously from the interactions of many simple units. In nature, emergence is typically expressed in self-assembly,
such as (micro-level) crystal formation and (macro-level) weather systems. These
systems form naturally without centralized control. Similarly, emergence is useful in
computer systems, when centralized control is impractical. The resources needed in these
systems are primarily simple building blocks capable of interacting with each other and
their environment (Holland, 2000). However, we are not interested in all possible complex
systems that may form. We are interested in systems that might form useful semantic
structures. We need to set up environments where the emergence is likely to result in
complex semantic representation or expression (Whitesides & Grzybowski, 2003;
Crutchfield, 1993; Potgeiter & Bishop, 2002). It is therefore necessary to understand the
characteristics and issues involved in emergent information systems.
We lead our discussion through the example of an ant colony. An ant colony is
comprised, primarily, of many small units known as ants. Each ant can only do simple
tasks; for example, walk, carry, lay a pheromone trail, follow a trail, and so forth. However,
the colony is sophisticated enough to thoroughly explore and manage its environment.
Several characteristics of emergent systems are demonstrated in the ant colony
metaphor: interaction, synthesis and self-organization. The main emergent phenomenon is self-organization, expressed in specialized ants being where the colony needs
them, when appropriate. These ants and others, the ant interactions, the synthesis and
self-organization, compose the ant colony. See Bonabeau and Theraulaz (2000) for more
details.
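To make the metaphor concrete, the following toy simulation (a minimal Python sketch of our own; the constants and names are invented for illustration and are not part of any cited system) shows stigmergy at work: each ant chooses between two paths in proportion to the pheromone on them, deposits pheromone in inverse proportion to path length, and pheromone evaporates over time. The colony converges on the shorter path even though no ant knows which path is shorter.

import random

# Minimal stigmergy sketch: ants choose between two paths to food.
# Pheromone is deposited in inverse proportion to path length and
# evaporates each step, so the shorter path comes to dominate
# without any central control. All values are illustrative.
PATHS = {"short": 1.0, "long": 2.0}      # path lengths (arbitrary units)
pheromone = {"short": 1.0, "long": 1.0}  # equal pheromone to start
EVAPORATION = 0.1
N_ANTS = 20

def choose_path():
    total = sum(pheromone.values())
    r = random.uniform(0, total)
    return "short" if r < pheromone["short"] else "long"

for step in range(50):
    deposits = {"short": 0.0, "long": 0.0}
    for _ in range(N_ANTS):
        path = choose_path()
        deposits[path] += 1.0 / PATHS[path]   # shorter path => more pheromone per trip
    for p in pheromone:
        pheromone[p] = (1 - EVAPORATION) * pheromone[p] + deposits[p]

share = pheromone["short"] / sum(pheromone.values())
print(f"Share of pheromone on the short path after 50 steps: {share:.2f}")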
This section describes the characteristics and practical issues of emergent systems.
They constitute our requirements. These include, but are not limited to, interaction,
synthesis, self-organization, knowledge representation, context and evaluation.

Characteristics of Emergent Systems


The requirements of emergent semantic systems (Figure 1) can be seen from three
main perspectives: information, mechanism and formation. Information is what we
ultimately want (the result of the emergence). It can also be what we need (a source of
units for interaction). Some of this information is implicit in the multimedia and context.
Other information is either implicit or explicit in stored knowledge.
It might seem that nothing particularly useful happens at the scope of two units
interacting. However, widening our field of view to take in the interaction of many units,
we should see synthesis of complex semantic units, and eventually self-organization of
the unit population into semantic structures. These semantic structures can then be used
to address the information need and augment the knowledge used by initial interaction.
Context influences the emergent structures. We need mechanisms which enable
different interactions to occur depending on the context. These mechanisms also need
to implicitly select, from the data, the salient units for each situation; either that, or cause
the interaction of those units to have a greater effect.


Figure 1. Emergent semantic systems

Information
If humans are to evaluate the emergence, they must either observe system behaviour
(phenotype) or a knowledge representation (genotype). The system representation must
be translatable to terms a human can understand, or to an intermediate representation that
can provide interaction. Typically, for this to be possible, the domain needs to be well
known. Unanticipated events might not be translated well. Though we deal with the
unanticipated, we must communicate in terms of the familiar. Emergence must be in terms
of the system being interpreted. Otherwise we run the risk of infinite regression
(Crutchfield, 1993). The environment, context and user should be included as part of the
system. We need semantic structures, which contain the result of emergence, to be part
of the system.
Context will either determine which of the many interpretations are appropriate or
constrain the interpretation formation. Context is taken mainly from the user or from the
application domain. Spatial and temporal positioning of features can also provide
context, depending on the domain. The significance of specialized information, such as
geographical position or time point, would be part of application domains such as fire
fighting or astronomy. In film theory, the effect of such positioning on interpretation is known as the Kuleshov effect: reordering shots
in a scene affects interpretation (Davis, Dorai, & Nack, 2003). Context supplies the
system with constraints on relationships between entities. It can also affect the
granularity and form of semantic output: classification, labelled multimedia objects,
metadata, semantic networks, natural language description, or system behaviour. Different people will want to know different things.

Mechanism
The defining characteristic of useful emergent systems is that simple units can
interact to provide complex and useful structures. Interaction1 is the notion that units2
in the system will interact to form a combined entity, which has properties that no unit
has separately. The interaction is significant. Examining the units in isolation will not
completely explain the properties of the whole. For two or more units to interact some
mechanism must exist which enables them to interact. The mere presence of two salient
units doesn't mean that they are able to interact.
Before we can reap the benefits of units interacting, we need units. These units
might be implied by the data. Explicit selection of units by a central controller would not
be part of an emergent process. Emergence involves implicit selection of the right units
to interact. The environment should make it likely for salient units to interact. Possibly
all units interact, with the salient units interacting more. In different contexts, different
units will be the salient units. The context should change which units are more likely to
interact, or the significance of their interaction.

Formation
Bridge laws, linking micro and macro properties, are emergent laws if they are not
semantically implied by initial micro conditions and micro laws (McLaughlin, 2001).
Synthesis involves a group of units composing a recognisable whole. Most
systems instead perform analysis, which involves top-down reduction driven by a control
structure. Synthesis is essentially the interaction mechanism seen at another
level or from a different perspective. A benefit of emergence is that the system designer
is freed from having to anticipate everything. Synthesis involves bottom-up emergence,
which results in a complex structure. The unanticipated interaction of simple units might
carry out an unanticipated and complex task. We lessen the need for a high-level control
structure that tries to anticipate all possible future scenarios. Boids (Reynolds, 1987)
synthesizes flocking behaviour in a population of simple units. Each unit in the flock
follows simple laws, knowing only how to interact with its closest neighbours. The
knowledge of forming a flock isn't stored in any unit in the flock. Unanticipated obstacles
are avoided by the whole flock, which reforms if split up.
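The three local rules are simple enough to state in a few lines of code. The sketch below is our own minimal rendering of Reynolds-style rules (cohesion, alignment, separation); the weights and radii are arbitrary illustrative values, not those of the original Boids implementation.

import math, random

# Minimal boids sketch: each unit follows only local rules with respect
# to nearby neighbours; flocking is not encoded anywhere, it emerges
# from the interactions.
class Boid:
    def __init__(self):
        self.x, self.y = random.uniform(0, 100), random.uniform(0, 100)
        self.vx, self.vy = random.uniform(-1, 1), random.uniform(-1, 1)

def neighbours(boid, flock, radius=15.0):
    return [b for b in flock if b is not boid
            and math.hypot(b.x - boid.x, b.y - boid.y) < radius]

def step(flock):
    for b in flock:
        near = neighbours(b, flock)
        if not near:
            continue
        # Cohesion: steer toward the local centre of mass.
        cx = sum(n.x for n in near) / len(near)
        cy = sum(n.y for n in near) / len(near)
        b.vx += 0.01 * (cx - b.x)
        b.vy += 0.01 * (cy - b.y)
        # Alignment: match the average heading of neighbours.
        b.vx += 0.05 * (sum(n.vx for n in near) / len(near) - b.vx)
        b.vy += 0.05 * (sum(n.vy for n in near) / len(near) - b.vy)
        # Separation: move away from neighbours that are too close.
        for n in near:
            if math.hypot(n.x - b.x, n.y - b.y) < 3.0:
                b.vx += 0.05 * (b.x - n.x)
                b.vy += 0.05 * (b.y - n.y)
    for b in flock:
        b.x += b.vx
        b.y += b.vy

flock = [Boid() for _ in range(30)]
for _ in range(100):
    step(flock)
spread = max(math.hypot(a.x - b.x, a.y - b.y) for a in flock for b in flock)
print(f"max pairwise distance after 100 steps: {spread:.1f}")

After enough steps the population draws together into one or more moving groups, although no rule mentions a flock.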
Self-Organization involves a population of units which appear to determine their
own collective form and processes. Self-assembly is the autonomous organization of
components into patterns or structures without human intervention (Whitaker, 2003;
Whitesides & Grzybowski, 2003). Similar though less complex, self-organization occurs
in artificial life (Waldrop, 1992). It attempts to mimic biological systems by capturing
an abstract model of evolution. Organisms have genes, which specify simple attributes
or behaviours. Populations of organisms interact to produce complex systems.

ISSUES
Evaluation
The Semantic Gap is industry jargon for the gap between sensory information and
the complex model in a human's mind. The same sensory information provides some of
the units which participate in computational emergence. The semantic structures, which
are formed, are the system's complex model. Since emergence is not something
controlled, we cannot make sure that the system's complex model will be the same as the
human's complex model. The ant colony is not controlled, though we consider it
successful. If the ant colony self-organized in a different way, we might consider that
structure successful as well. There may be many acceptable, emergent semantic
structures. We need to know whether the semantic emergence is appropriate, to either
the user or task. Therefore, we need to evaluate the emergence, either through direct
communication of the semantic structure or through system behaviour.

Scalability
This notion of scale is slightly different to traditional notions. We can scale with
respect to domain and richness of data. Most approaches for semantics constrain the
domain of knowledge, such that the constraints themselves provide ground truths. If a
system tries to cater for more domains, it loses some of the ground truths. Richness of
data refers to numbers of units and types of units available. If the amount of data is too
small, we might not have enough interaction to create a meaningful structure. A higher
amount of data increases the number of units and types available. The number of possible
unit pairings grows quadratically with the number of units. A system where all units try to interact with all other
units might stress the processing power of the system. A system without all possible
interactions might miss the salient interactions. It is also uncertain whether increasing
data richness will lead to finer granularity of semantic structure or lesser ability to settle
on a stable structure.

Augmentation
Especially for iterative processes, it might be useful to incrementally add knowledge
back to the system. The danger here is that, in order to reapply what has been learned,
the system will have to recognise situations which have occurred before with different
sensory characteristics; for example, two pictures of the same situation taken from
different angles.

CURRENT RESEARCH
Having described the requirements in abstract, we will now describe the tools and
techniques which address the requirements.

Information
Knowledge representation, for emergence, includes ontology, metadata and genetic algorithm strings. The use of templates and grammars, which can communicate
semantics in terms of the media, aren't emergent techniques as their semantic structure
is predefined and the multimedia content anticipated.
Metadata (data about data) can be used as an alternative semantic description of
multimedia content. MPEG-7 has a description stream, which is associated with the
multimedia stream by using temporal operators. The description resides with the data.
However, it is difficult in advance to provide metadata for every possible future
interpretation of an event. The metadata can instead be derived from emergent semantic
structures. If descriptions are needed, natural language can be derived from predicates
associated with modeled concepts (Kojima, Tamura, & Fukunaga, 2002).
Classically, ontology is the study of being. The computer industry uses the term
to refer to fact bases, repositories of properties and relations between objects, and
semantic networks (such as Princeton's WordNet). Some ontology is used for reference,
with multimedia objects or direct sensory inputs being used to index the ontology
(Hoogs, 2001; Kuipers, 2000). Other ontology attempts to capture how humans communicate their own cognitive structures. The Semantic Web attempts to use ontology to
access the semantics implicit in human communication (Maedche, 2002). Semantic
networks consist of a skeleton of low-level data which can be augmented by adding
semantic annotation nodes (Nack, 2002). The low-level data consists of the multimedia
or ground truths, which can act as units in an emergent system. The annotation nodes
can contain the results of emergence, and they are not permanent. This has the
advantage of providing metadata-like properties, which can also be changed for
different contexts.
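The following fragment is a minimal sketch of this idea (our own illustration, not Nack's design; all node names and labels are invented): a fixed skeleton of low-level descriptions carries annotation nodes that are attached per context and can be discarded, so the same media object presents different metadata-like labels to different tasks.

# Minimal sketch of a semantic network whose low-level skeleton is fixed
# while annotation nodes are attached and discarded per context.
skeleton = {
    "shot_17": {"features": {"dominant_colour": "green", "motion": "low"}},
    "shot_18": {"features": {"dominant_colour": "green", "motion": "high"}},
}

annotations = {}  # context -> {node_id: [labels]}

def annotate(context, node_id, label):
    annotations.setdefault(context, {}).setdefault(node_id, []).append(label)

def view(context):
    """Return the skeleton augmented with this context's annotations."""
    return {node: {**data, "labels": annotations.get(context, {}).get(node, [])}
            for node, data in skeleton.items()}

annotate("sports_query", "shot_18", "goal celebration")
annotate("nature_query", "shot_17", "forest")
print(view("sports_query")["shot_18"]["labels"])   # ['goal celebration']
print(view("nature_query")["shot_18"]["labels"])   # []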
In genetic algorithms, knowledge representation (genotype) lies in evolving strings.
The strings can contain units and the operators that act on them (Gero & Ding, 1997). The
genotypes evolve over several generations, with the successful3 genes selected to
generate the next generation. Knowledge and context are acquired across generations.
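A minimal genetic-algorithm sketch makes the genotype/phenotype split concrete. The fitness function below simply counts bits and stands in for a real evaluation of the expressed behaviour; the population size, mutation rate and selection scheme are illustrative choices of ours, not taken from Gero and Ding (1997).

import random

# Minimal genetic-algorithm sketch: the genotype is a bit string, the
# fitness function scores its expression (phenotype), and successful
# genes are selected and recombined across generations.
GENE_LENGTH, POP_SIZE, GENERATIONS, MUTATION = 20, 30, 40, 0.02

def fitness(genotype):
    return sum(genotype)  # toy stand-in for evaluating the phenotype

def crossover(a, b):
    cut = random.randrange(1, GENE_LENGTH)
    return a[:cut] + b[cut:]

def mutate(genotype):
    return [1 - g if random.random() < MUTATION else g for g in genotype]

population = [[random.randint(0, 1) for _ in range(GENE_LENGTH)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:POP_SIZE // 2]          # successful genotypes survive
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(POP_SIZE - len(parents))]

print("best fitness:", fitness(max(population, key=fitness)))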
Context is taken from the domain, the data instances, or the user. A user's context
is mainly taken from their interaction history. Their personal history and current mental
state are harder to measure. The user's role during context gathering can be active (direct
manipulation) or passive (observation).

Direct Manipulation
The user can actively communicate context to the system. Santini, Gupta, and Jain
(2001) ask their users to organize images in a database. They use the example of a portrait.
If the portrait is in a cluster of paintings, then the semantic is "painting". If it is in a cluster
of people, the semantic is "people" or "faces". The same image can be a reference to
different referents, which can be intangible ideas as well as tangible objects.
CollageMachine (Kerne, 2002) is a Web browsing tool which tries to predict user
browsing intentions. The system tries to predict possible lines of user inquiry and selects
multimedia components of those to display. Reorganization of those components, by the
user, is used by the system to adjust its model.


Observation
The context of a multimedia instance is taken from past and future subjects of user
attention. The entire path taken, or group formed, by a user provides an interpretation
for an individual node. The whole provides the context for the part. Emergence depends
on what the user thinks the data are, though the user does not need to know how they
draw conclusions from observing the data. The emergence of semantics can be driven by
observing human and machine agent interaction (Staab, 2002). Context, at each point in
the user's path, is supplied by their navigation (Grosky, Sreenath, & Fotouhi, 2002). The
user's interpretation can be different from the author's intentions. The Web is considered a directed graph (nodes: Web pages, edges: links). Adjacent nodes are considered
likely to have similar semantics, though attempts are made to detect points of interest
change. The meaning of a Web page (and multimedia instances in general) emerges
through use and observation.
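A toy rendering of this idea (ours, not the cited systems') treats browsing as paths over a directed graph and lets each node accumulate the labels of the nodes visited around it, so a page's emergent description reflects the contexts in which users actually reach it. All node names and labels below are invented.

from collections import Counter

# Minimal sketch: a node's interpretation emerges from the paths users
# take through it. Each visit lets a node accumulate the labels of its
# neighbours on the path, so pages visited in similar contexts end up
# with similar descriptions.
node_labels = {
    "home": {"portal"}, "match_report": {"football"},
    "player_bio": {"football"}, "ticket_shop": {"commerce"},
}
emergent = {n: Counter() for n in node_labels}

def observe_path(path, window=1):
    for i, node in enumerate(path):
        for j in range(max(0, i - window), min(len(path), i + window + 1)):
            if j != i:
                emergent[node].update(node_labels[path[j]])

observe_path(["home", "match_report", "player_bio"])
observe_path(["home", "match_report", "ticket_shop"])
print(emergent["match_report"].most_common())  # labels accumulated from adjacent pages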

Grouping
In order to model what humans can observe, it is often helpful to model human
vision. In computer vision, grouping algorithms (based on human vision) are used to
form higher-level structures from units within an image (Engbers & Smeulders, 2003). An
algorithm can be an emergent technique if it can adapt dynamically to context.
Context can come from sources other than users. Multiple media can be associated
with data events to help in disambiguating semantics (Nakamura & Kanade, 1997).
Context in genetic algorithms is sensed over many generations, if one interprets better
performance in the environment as a response to context. The domain, in schemata
agreement, is partly defined by the parties involved.

Mechanism
Mechanisms of automatic, implicit unit selection and interaction are yet to be
developed for semantic emergence. This is a gap in the literature that will need to be filled.
Current mechanisms involve the user as a unit. The semantics emerge through interaction
of the user's own context with multimedia components (Santini & Jain, 1999; Kerne, 2002).
The user decides which things interact, either actively or passively. In genetic algorithms, fitness functions decide how gene strings evolve (Gero & Ding, 1997). Genetic
algorithms can be used to lessen the implicit selection problem by reducing the search
spaces of how units interact, which units interact, and which things are considered units.
A similar situation arises with evaluation of emerged semantics. The current
thinking is that humans are needed to evaluate accuracy or reasonableness. In genetic
algorithms, the representation (genotype) can be evaluated indirectly by testing the
phenotype (expression).
Simply having all the necessary sensory (and other) information present will not
necessarily result in interaction occurring. Information from ontology could be used in
decision making, or in suggesting other units for interaction. Explicitly identifying units
for interaction might be a practical nonemergent step. Units can be feature patterns rather
than individual features (Fan, Gao, Luo, & Hacid, 2003). Templates can be used to search
for units suggested by the ontology. Well-known video structures can be used to locate
salient units within video sequences (Russell, 2000; Dorai & Venkatesh, 2001; Venkatesh
& Dorai, 2001). Data can provide context by affecting the perception or emotions of the
observer. Emotions can be referents. Low-level units, such as tempo and colour, in the
multimedia instance, act as symbols which reference them.

Formation
In genetic algorithms, synthesis occurs between generations. The genotype is self-organizing. With direct manipulation and user observation, synthesis and organization
come in the form of users putting things together.
In schemata agreement, region synthesis leads to self-organization. It is designed
to be adaptable to unfamiliar schemata. Agreement can be used to capture relationships
later. Emergence occurs as pairs of nodes in decentralised P2P (peer-to-peer) systems
attempt to form global semantic agreements by mapping their respective schemata
(Aberer, Cudre-Mauroux, & Hauswirth, 2003). Regions of similar property emerge as
nodes in the network are connected pairwise, and as other nodes link to them (Langley,
2001). Unfortunately, this work does not deal with multimedia. There has been recent
interest in combining the areas of multimedia, data mining and knowledge discovery.
However, the semantics here are not emergent. There is also data mining research into
multimedia using Self-Organizing Maps (SOM), but this is not concerned with semantics
(Petrushin, Kao, & Khan, 2003; Simoff & Zaiane, 2000).
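For completeness, the following is a minimal self-organizing map update (a simplified version of the standard formulation; grid size, learning rate and neighbourhood schedule are arbitrary, and the toy feature vectors stand in for real multimedia features): each input pulls its best-matching unit and that unit's grid neighbours toward it, so similar inputs end up in nearby map regions without a centralized classifier.

import math, random

# Minimal self-organizing map sketch: a small grid of units competes for
# each input vector; the winner and its grid neighbours move toward the
# input, so similar inputs are mapped to nearby units.
GRID, DIM, EPOCHS = 4, 3, 200

weights = {(i, j): [random.random() for _ in range(DIM)]
           for i in range(GRID) for j in range(GRID)}
data = [[random.random() for _ in range(DIM)] for _ in range(50)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for t in range(EPOCHS):
    lr = 0.5 * (1 - t / EPOCHS)              # decaying learning rate
    radius = 2.0 * (1 - t / EPOCHS) + 0.5    # decaying neighbourhood
    x = random.choice(data)
    winner = min(weights, key=lambda u: dist(weights[u], x))
    for u, w in weights.items():
        grid_d = math.hypot(u[0] - winner[0], u[1] - winner[1])
        if grid_d <= radius:
            influence = math.exp(-grid_d ** 2 / (2 * radius ** 2))
            weights[u] = [wi + lr * influence * (xi - wi)
                          for wi, xi in zip(w, x)]

print("unit (0,0) weights:", [round(v, 2) for v in weights[(0, 0)]])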

CONCLUSION
A major difficulty in researching the concept of emergent semantics in multimedia
is that there are no complete systems integrating the various techniques. While there is
work in knowledge representation, with respect to both semantics and multimedia, to the
best of our knowledge, there's very little in interaction, synthesis and self-organization. There is the work on schemata agreement (nonmultimedia) and some work
on Self-Organizing Maps (nonsemantic), but nothing combining them. The little that has
been done involves users to provide context and genetic algorithms to reduce problem
spaces.
One of the gaps to be filled is developing interaction mechanisms, which enable
possibly unanticipated data to interact with each other and their environment. Even if
we can trust the process, we are still dependent on its inputs: the simple units that
interact. The set of units needs to be sufficiently rich to enable acceptable emergence.
Ideally, salient features (even patterns) should naturally select themselves during
emergence, though this may require participation of all units, placing a high computational load on the system. Part of the problem, for emergence techniques, is that the simple
interactions must occur in parallel, and in numbers great enough to realise self-organization. The future will probably have more miniaturized systems, capable of true
parallelism in quantum computers. A cubic millimetre of the brain holds the equivalent
of 4 km of axonal wiring (Koch, 2001). Perhaps greater parallelism will permit interaction
of all available units.
There is motivation for research into nonverbal computing, where the users are
illiterate (Jain, 2003). Without user ability to issue and access abstract concepts, the
concepts must be inferred. Experiential computing (Jain, 2003; Sridharan, Sundaram, &
Rikakis, 2003) allows users to interact with the system environment, without having to

build a mental model of the environment. They seek a symbiosis formed from human and
machine, taking advantage of their respective strengths. These systems are insight
facilitators. They help us make sense of our own context by engaging our senses directly,
as opposed to being confronted by an abstract description. Experiential computing, while
in its infancy now, might in the future enable implicit relevance feedback. The user's
interactions with the system could cause both emergence and verification of semantics.

REFERENCES

Aberer, K., Cudre-Mauroux, P., & Hauswirth, M. (2003). The chatty Web: Emergent
semantics through gossiping. Paper presented at the WWW2003, Budapest, Hungary.
Bonabeau, E., & Theraulaz, G. (2000). Swarm smarts. Scientific American, 282(3), 54-61.
Crutchfield, J. P. (1993). The calculi of emergence. Paper presented at the Complex
Systems - from Complex Dynamics to Artificial Reality, Numazu, Japan.
Davis, M., Dorai, C., & Nack, F. (2003). Understanding media semantics. Berkeley, CA:
ACM Multimedia 2003 Tutorial.
Dorai, C., & Venkatesh, S. (2001, September 10-12). Bridging the semantic gap in content
management systems: Computational media aesthetics. Paper presented at the
COSIGN 2001: Computational Semiotics (pp. 94-99), CWI Amsterdam.
Engbers, E. A., & Smeulders, A. W. M. (2003). Design considerations for generic
grouping in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4), 445-457.
Fan, J., Gao, Y., Luo, H., & Hacid, M.-S. (2003). A novel framework for semantic image
classification and benchmark. Paper presented at the ACM SIGKDD, Washington,
DC.
Gero, J. S., & Ding, L. (1997). Learning emergent style using an evolutionary approach.
In B. Varma & X. Yao (Eds.), Proceedings of ICCIMA (pp. 171-175), Gold Coast, Australia.
Grosky, W. I., Sreenath, D. V., & Fotouhi, F. (2002). Emergent semantics and the
multimedia semantic Web. SIGMOD Record, 31(4), 54-58.
Hillmann, D. (2003). Using Dublin core. Retrieved February 16, 2004, from http://
dublincore.org/documents/usageguide/
Holland, J. H. (2000). Emergence: From chaos to order (1st ed.). Oxford: Oxford
University Press.
Hoogs, A. (2001, October 10-12). Multi-modal fusion for video understanding. Paper
presented at the 30th Applied Imagery Pattern Recognition Workshop (pp. 103-108), Washington, DC.
Jain, R. (2003). Folk computing. Communications of the ACM, 46(3), 27-29.
Kerne, A. (2002). Concept-context-design: A creative model for the development of
interactivity. Paper presented at the Creativity and Cognition, Vol. 4 (pp. 92-122),
Loughborough, UK.
Koch, C. (2001). Computing in single neurons. In R. A. Wilson & F. C. Keil (Eds.), The
MIT Encyclopedia of the Cognitive Sciences (pp. 174-176). Cambridge, MA: MIT
Press.

Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human
activities from video images based on concept hierarchy of actions. International
Journal of Computer Vision, 50(2), 171-184.
Kuipers, B. J. (2000). The spatial semantic hierarchy. Artificial Intelligence, 119, 191-233.
Langley, A. (2001). Freenet. In A. Oram (Ed.), Peer-to-peer: Harnessing the benefits of
a disruptive technology (pp. 123-132). Sebastopol, CA: O'Reilly.
Langton, C. (1990). Computation at the edge of chaos: Phase transitions and emergent
computation. Physica D, 42(1-3), 12-37.
Maedche, A. (2002). Emergent semantics for ontologies. IEEE Intelligent Systems, 17(1),
85-86.
McLaughlin, B. P. (2001). Emergentism. In R. A. Wilson & F. C. Keil (Eds.), The MIT
Encyclopedia of the Cognitive Sciences (pp. 267-269). Cambridge, MA: MIT Press.
Minsky, M. L. (1988). The society of mind (1st ed.). New York: Touchstone (Simon &
Schuster).
Nack, F. (2002). The future of media computing. In S. Venkatesh & C. Dorai (Eds.), Media
computing (pp. 159-196). Boston: Kluwer.
Nakamura, Y., & Kanade, T. (1997, November). Spotting by association in news video.
Paper presented at the Fifth ACM International Multimedia Conference (pp. 393-401), Seattle, Washington.
OED. (2003). Oxford English Dictionary. Retrieved February 2004, from
dictionary.oed.com/entrance.dtl
Petrushin, V. A., Kao, A., & Khan, L. (2003). The Fourth International Workshop on
Multimedia Data Mining, MDM/KDD 2003. Vol. 6(1). (pp. 106-108).
Potgeiter, A., & Bishop, J. (2002). Complex adaptive systems, emergence and engineering: The basics. Retrieved February 20, 2004, from http://people.cs.uct.ac.za/
~yng/Emergence.pdf
Reynolds, C. (1987). Flocks, herds, and schools: A distributed behavioral model.
Computer Graphics, 21(4), 25-34.
Russell, D. (2000). A design pattern-based video summarization technique. Paper
presented at the Proceedings of the 33rd Hawaii International Conference on
System Sciences (p. 3048).
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image
databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999, Jan). Interfaces for emergent semantics in multimedia
databases. Paper presented at the SPIE, San Jose, California.
Simoff, S. J., & Zaiane, O. R. (2000). Report on MDM/KDD2000: The First International
Workshop on Multimedia Data Mining. SIGKDD Explorations, 2(2), 103-105.
Sridharan, H., Sundaram, H., & Rikakis, T. (2003, November 7). Computational models for
experiences in the arts, and multimedia. Paper presented at the ACM Multimedia
2003, First ACM Workshop on Experiential Telepresence, Berkeley, CA, USA.
Staab, S. (2002). Emergent semantics. IEEE Intelligent Systems, 17(1), 78-79.
Venkatesh, S., & Dorai, C. (2001). Computational media aesthetics: Finding meaning
beautiful. IEEE Multimedia, 10-12.
Waldrop, M. M. (1992). Life at the edge of chaos. In Complexity, (pp. 198-240). New York:
Touchstone (Simon & Schuster).
Whitaker, R. (2003). Self-organization, autopoiesis and enterprises. Retrieved February 3, 2005, from http://www.acm.org/sigois/auto/Main.html
Whitesides, G. M., & Grzybowski, B. (2003). Self-assembly at all scales. Science, 295,
2418-2421.

ENDNOTES
1. The action or influence of persons or things on each other (OED, 2003).
2. The user can also be considered a unit.
3. According to a fitness function, which measures the phenotype (the string's expression or behaviour).


Chapter 16

Emergent Semantics from Media Blending
Edward Altman, Institute for Infocomm Research, Singapore
Lonce Wyse, Institute for Infocomm Research, Singapore

ABSTRACT

The computation of emergent semantics for blending media into creative compositions
is based on the idea that meaning is endowed upon the media in the context of other
media and through interaction with the user. The interactive composition of digital
content in modern production environments remains a challenging problem since
much critical semantic information resides implicitly within the media, the relationships
between media models, and the aesthetic goals of the creative artist. The composition
of heterogeneous media types depends upon the formulation of integrative structures
for the discovery and management of semantics. This semantics emerges through the
application of generic blending operators and a domain ontology of pre-existing
media assets and synthesis models. In this chapter, we will show the generation of
emergent semantics from blending networks in the domains of audio generation from
synthesis models, automated home video editing, and information mining from
multimedia presentations.

INTRODUCTION
Today, there exists a plethora of pre-existing digital media content, synthesis
models, and authored productions that are available for the creation of new media
productions for games, presentations, reports, illustrated manuals, and instructional
materials for distance education. Technologies from sophisticated authoring environments for nonlinear video editing, audio synthesis, and information management systems are increasingly finding their way into a new class of easy-to-use, partially
automated authoring tools. This trend in media production is expanding the life cycle
of digital media from content-centric authoring, storage, and distribution to include user-centric semantics for performing stylized compositions, information mining, and the
reuse of the content in ways not envisioned at the time of the original media creation. The
automation of digital media production at a semantic level remains a challenging problem
since much critical information resides implicitly within the media, the relationships
between media, and the aesthetic goals of the creative artist. A key problem in modern
production environments is therefore the discovery and management of media semantics
that emerges from the structured blending of pre-existing media assets. This chapter
introduces a model-based framework for media blending that supports the creative
composition of media elements from pre-existing resources.
The vast quantity of pre-existing media from CDs, the Internet, and local recordings
that are currently available has motivated recent research into automation technologies
for digital media (Davis, 1995; Funkhouser et al., 2004; Kovar & Gleicher, 2003).
Traditional authoring tools require extensive training before the user becomes proficient
and normally consume an enormous amount of time to compose relatively simple productions, even by
skilled professionals. This contrasts with the needs of the non-professional media author
who would prefer high level insights into how media elements can be transformed to
create the target production, as well as tools to automate the composition from semantically meaningful models. Such creative insights arise from the ability to flexibly
manipulate information and discover new relationships relative to a given task. However,
current methods of information retrieval and content production do not adequately
support exploration and discovery in mixed media (Santini, Gupta, & Jain, 2001). A key
problem for media production environments is that the task semantics for content
repurposing depends upon both the media types and the context of the current task. In
this chapter we claim that many semantics-based operations, including summarization,
retrieval, composition, and synchronization, can be represented as a more general
operation called media blending. Blending is an operation that occurs across two or
more media elements to yield a new structure called the blend. The blend is formed by
inheriting partial semantics from the input media and generating an emergent structure
containing information from the current task and the source media. Thus the semantics
of the blend emerges from interactions among the media descriptions, the task to be
performed, and the creative input of the user.
Automated support for managing the semantics of media content would be beneficial for diverse applications, such as video editing (Davis, 1995; Kellock & Altman, 2000),
sound synthesis (Rolland & Pachet, 1995), and mining information from presentations
(Dorai, Kermani, & Stewart, 2001). A common characteristic among these domains that
will be emphasized in this chapter is the need to manage multiple media sources at the
semantic level. For sound production, there is a rich set of semantics associated with
sound effects collections and audio synthesis models that typically come with semantically labeled control parameters. In the case of automatic home video editing, the
control logic is informed by the relationships between music structure and video cuts as
described in film theory to yield a production with a particular composition style (Sharff,
1982). In the case of presentation mining from e-learning content, there is an association
between pedagogical structures within a lecture video and other content resources, such
as textbooks and slide presentations, that can be used to inform a search engine when
responding to a student's query. In each case, the user-centric production of media
involves the dynamic blending of information from different media. In this chapter, we
will show the construction of blending networks for user-centric media processing in the
domains of audio generation from sound synthesis models, automated home video
editing, and presentation mining.

BACKGROUND
Models are fundamental for the construction of blending networks (Veale &
ODonoghue, 2000). Blending networks have their origins in frame based reasoning
systems and have recently been applied in cognitive linguistics to link discourse analysis
with fundamental structures of cognition. According to Conceptual Integration Theory
from cognitive linguistics, thought and language depend upon our capabilities to
manipulate webs of mappings between mental spaces (Fauconnier, 1997). These mental
space mappings form the basis for the understanding of metaphors and other forms of
discourse as conceptual blends. Similarly, the experiential qualities of media constitute
a form of discourse that can only be understood through the creation of deep models of
media (Staab, Maedche, Nack, Santini, & Steels, 2002). Prior work on conceptual blending
provides a theoretical framework for the extension of blending theory to digital media.
Consequently, audio perception, video appreciation, and information mining may be
viewed as a form of media discourse. Accordingly, the claim in this chapter is that
principles of conceptual blending derived from analysis of language usage may also be
applied to the processing of media. In the remainder of this section we will describe three
scenarios for media blending, then review the literature on conceptual blending and
metaphor. The following sections will relate these structures to concrete examples of
media blending.

Audio Models
Intricate relations between audio perception and cognition associated with sound
production techniques pose interesting challenges regarding the semantics that emerge
from combinations of constituent elements. The semantics of sound tend to be more
flexible than the semantics associated with graphics and have a more tenuous relationship to the world of objects and events than do graphics. For example, the sound of a
crunching watermelon can be indicative of a cool refreshing indulgence on a summer day,
or it can add juicy impact to a punch for which the sound is infamously used in film
production. The art of sound effects production depends heavily on the combination,
reuse, and recontextualization of libraries of prerecorded material. Labeling sounds in a
database in a way that supports reuse in flexible semantic contexts is a challenge.
A current trend in audio is the move toward structured representations (Rolland &
Pachet, 1995) that we will call models. A sound model is a parameterized algorithm for
generating a class of sounds as shown schematically in Figure 1. Models are useful in
media production partly because of their low memory/bandwidth requirements, since it

takes much less memory to parameterize a synthesis model than it does to code the raw
audio data. Also, models meet the requirement from interactive media, such as games, that
audio be generated in real time in response to unpredictable events in an interactive
environment.

Figure 1. A sound model includes a synthesis algorithm capable of generating a
specific range of sounds, and parameters that determine the behaviour of the model
(control parameters are mapped to synthesizer parameters, which drive the synthesizer
algorithm to produce the audio signal)
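The structure of Figure 1 can be sketched in a few lines of code. The mapping and the percussive sine burst below are toy stand-ins of our own for a real synthesis algorithm; parameter names such as weight and pace are invented, chosen only to show how semantics is carried by the control layer.

import math

# Minimal sketch of the Figure 1 structure: named control parameters are
# mapped to synthesis parameters, which drive a simple synthesis
# algorithm. The mapping and the sine-burst "footstep" are toy choices.
SAMPLE_RATE = 16000

def map_controls_to_synth(weight_kg, pace_steps_per_s):
    return {
        "impact_freq_hz": 200.0 - weight_kg,       # heavier walker => duller thud
        "step_interval_s": 1.0 / pace_steps_per_s,
        "decay": 40.0,
    }

def synthesize(synth_params, duration_s=2.0):
    samples = []
    interval = int(synth_params["step_interval_s"] * SAMPLE_RATE)
    for n in range(int(duration_s * SAMPLE_RATE)):
        t_in_step = (n % interval) / SAMPLE_RATE
        env = math.exp(-synth_params["decay"] * t_in_step)  # percussive decay
        samples.append(env * math.sin(2 * math.pi *
                                      synth_params["impact_freq_hz"] * t_in_step))
    return samples

# "A heavy person walking quickly": the semantics live in the parameter names.
audio = synthesize(map_controls_to_synth(weight_kg=95, pace_steps_per_s=2.5))
print(len(audio), "samples generated")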
The sound model design process involves building an algorithm from component
signal generators and transformers (modulators, filters, etc.) that are patched together
in a signal flow network that generates audio at the output. Models are designed to meet
specifications on a) the class of sounds the model needs to cover and b) the method of
controlling the model through parameters that are exposed to the user. Models are
associated with semantics in a way that general-purpose synthesizers are not, because
they are specialized to create a much narrower range of sounds. They also take on
semantics by virtue of the real-time interactivity they have: responsive behaviors
in a way that recorded sounds do not.
As media objects, models present interesting opportunities and challenges for
effective exploitation in graphical, audio, or mixed media. A database of sound models
is different from a database of recorded sounds in that the accessible sounds in the
database are (i) not actually present, but potential and (ii) infinite in variability due to the
dynamic parameterization that recorded sounds do not afford. Model building is a labor-intensive job for experts, so exploiting a database of pre-existing sound models potentially has tremendous value.
Another trend in audio, as well as other forms of digital media, is to attempt to
automatically extract semantics from raw media data. The utility of being able to identify
a baby crying or a window breaking in an audio stream should be self-apparent, as
should the difficulty of the task. Typically, audio analysis is based on adaptive
association between low-level signal features (such as spectral centroid, basis vectors,
zero crossings, pitch, and noise measures) and labels provided by a supervisor, or
based on an association with data from another media stream such as video. The difficulty
lies in the fact that there is no such thing as "the" semantics, and any semantics there
may be are dependent upon contexts both in and outside of the media itself. The human-in-the-loop and the intermediate representations between physical attributes and deep
semantics that models offer can be effective bridges across this gap.

Video Editing
The blending of two or more media that combines perceptual aspects from each
media to create a new effect is a common technique in film production. In film editing, the
visual presentation of the scene tells the story from the character's point of view. Music
is added to convey information about the character's emotional state, such as fear,
excitement, calm, and joy. Thus, when the selection of video cut points along key visual
events are synchronized with associated features in the music, the audience experiences
the blended media according to the emergent semantics of the cinematic edit (Sharff,
1982).
The non-professional media author of a home video may know what style of editing
they prefer, but lack the detailed knowledge, or time, to perform the editing operations.
Similarly, they may know what music selections to add to the edited video, but lack the
tools and insight to match the beat, tempo, and other features from the music with suitable
events in the video content. The challenge for semi-automated video editing tools is to
combine the stylistic editing logic with metadata descriptions of the selected music and
video, then opportunistically blend the source media to create the final product (Kellock
& Altman, 2000; Davis, 1995).

Presentation Mining
The utilization of media semantics is important not only for audio synthesis and
video editing, but also for information intensive tasks, such as composing and subsequently mining multimedia presentations. There are a rapidly growing number of
corporate media archives, multimedia presentations, and modularized distance learning
courseware which contain valuable information that remains inaccessible outside the
original production context. For instance, a common technique for authoring modular
courseware is to produce a series of short, self-contained multimedia presentations for
topics in the syllabus, then customize the composition of these elements for the target
audience (Thompson Learning, n.d.; WebCT, n.d.). The control logic for the sequencing
and navigation through the course content is specified through the use of description
languages. However this normally does not include a semantic description of pedagogical events, domain models, or dependencies among media resources that would aid in the
exploration of the media by the user. Once the course is constructed, it becomes very
difficult to modify or adapt the content to new contexts.
Recorded corporate presentations and distance learning lectures are notoriously
difficult to search for information or reuse in a different context. This difficulty arises from
the fact that the semantics of the presentation is fixed at the time of production. The media
blending framework is designed to support the discovery and generation of emergent
semantics through the use of ontologies for modeling domain information, composition
logic, and media descriptions.

MEDIA BLENDING FRAMEWORK


The key issue of this chapter is to empower the media producer to more easily create
complex media assets by leveraging control over emergent semantics derived from media
blends. Current media production techniques involve the human interaction with
collections of media libraries and the use of specialized processing tools, but they do not
yet provide support for utilizing semantic information in creative compositions. The
development of standards, such as MPEG-7, AAF, and SMIL, facilitates the description,
shared editing, and structured presentation of media elements (Nack & Hardman, 2002).
The combination of description languages and rendering engines for sound (C-Sound,
MAX) and video (DirectShow, QuickTime) provides powerful tools for composing and
rendering media after it is produced. Recent efforts toward automated media production
(Davis, 1995; Kellock & Altman, 2000) begin to demonstrate the power of model-based
tools for authoring creative compositions. These approaches depend upon proprietary
methods and tend to work only in specialized contexts. The objective of media blending
is to unify these disparate approaches under a common framework that results in more
efficient methods for managing media semantics.
In this section we will motivate the need for a media blending framework by citing
current limitations in audio design methods, then provide an illustrative example of an
audio blending network. This section concludes with a concise formulation of media
blending.

Sound Semantics in Audio Production


Production houses typically have hundreds of CDs of sound effect material stored
in databases and access the audio by searching an index of semantic labels. A fragment
of a sound effects database shown in Table 1 illustrates that sounds typically acquire
their semantics from the context in which they were recorded.

Table 1. Semantic descriptions in a database of common sound effects
One thing that makes it difficult to repurpose sounds from a database labeled this
way is that sounds within a category can sound very different, and sounds in different
categories can sound very similar. That is, the sounds in production libraries are
generally not classified by clusters of acoustic features. Instead, there are several classes
of semantic descriptors typically used for sound:

Sources as descriptors. Dog barking, tires screeching, gun shot. The benefit of
using sources as semantic descriptors is that the descriptions come from lay
language that everybody speaks. Sources are very succinct descriptions and come
with a rich set of relationships to other objects that we know about. A drawback
to sources as descriptors is that some sounds have no possible, or at least obvious,
physical cause (e.g., the sound of an engine changing in size). Even if a physical
source is responsible for a sound, it may be impossible to identify. Similarly, any
given sound may have many unrelated possible sources. Finally, a given source
can have acoustically unrelated sounds associated with it, for example, a train
generates whistles, steam, rolling, and horn sounds.

Actions and events as descriptors. Dog barking, tires screeching, gun shot.
Russolo's early musical noise machines, or intonarumori, had onomatopoetic
names allied with actions, including howler, roarer, crackler, rubber, hummer,
gurgler, hisser, whistler, burster, croaker, and rustler (Russolo, 1916). The benefit
of actions and events as descriptors is that they can often be assigned even when
source identification is impossible (a screech is descriptive whether the sound is
from tires or a child). Actions and events are also familiar to a layperson for
describing sounds (scraping, falling, pounding, screaming, sliding, rolling, coughing, clicking). A drawback is that in some cases it may be difficult or impossible for
sounds to be described this way. Unrelated sounds can also have the same
description in terms of actions and events. Finally, the description can be quite
subjective: one person's gust is another's blow.
Source attributes as sound descriptors. Big dog, metal floor, hollow wood. Such
descriptions are often easier to obtain than source identification and are still useful
even when source identification is impossible. These attributes are often scalar,
which makes them quantitative and easier to deal with for a computer. The
drawbacks are that it may be difficult to assign attributes for some sounds, many
sounds may have the same attributes, and the assignment can be quite subjective.

Sounds may also belong together simply because they frequently co-occur in the
environment or in man-made media. A "beach sounds" class could include crashing
waves, shouting people, and dogs barking. Loose categories such as "indoor" and
"outdoor" are often useful, especially in media production. A recording of a dog
barking indoors would be useless for an outdoor scene.
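A record that carries all of these descriptor classes side by side, sketched below with invented field names and values, lets the same recording be retrieved under a source, an action, an attribute, or a co-occurrence category, which is precisely what repurposing requires. This is our illustration only, not the layout of any actual sound-effects database.

from dataclasses import dataclass, field

# Minimal sound-effect record carrying the descriptor classes discussed
# above, so one recording can be found under several descriptions.
@dataclass
class SoundEffect:
    file: str
    sources: list = field(default_factory=list)        # e.g., "dog"
    actions: list = field(default_factory=list)        # e.g., "barking"
    attributes: dict = field(default_factory=dict)     # e.g., {"size": "big"}
    co_occurrence: list = field(default_factory=list)  # e.g., "beach", "outdoor"

library = [
    SoundEffect("sfx_0412.wav", sources=["dog"], actions=["barking"],
                attributes={"size": "big"}, co_occurrence=["outdoor", "beach"]),
    SoundEffect("sfx_0977.wav", sources=["tires"], actions=["screeching"],
                attributes={"surface": "asphalt"}, co_occurrence=["street"]),
]

def search(library, action=None, category=None):
    return [s.file for s in library
            if (action is None or action in s.actions)
            and (category is None or category in s.co_occurrence)]

print(search(library, action="barking", category="outdoor"))  # ['sfx_0412.wav']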
When producers have the luxury of a high budget and are creating their own sound
effects, sounds are typically constructed from a combination of recorded material,
synthetic material, and manipulation. Typical manipulation techniques include time
reversal, digital effects, such as filtering, delay, pitch shifting, and overlaying many
different tracks to create an audio composite. To achieve the desired psychological
impact for an event with sound, it is often the case that recordings of actual sounds
generated by the real events are useless. For example, the sounds of a real punch or a
real gun firing are entirely inadequate for creating the impression of a punch or a gun shot
in cinema. The sounds that one might want to use as starting material to construct the
effects come from unrelated material with possibly unrelated semantic labels in the stored
database (Mott, 1990).

Semantic labels tend to commit a sound in a database to a certain usage unless the
database users know how to work around the labels to suit their new media context. On
the other hand, low-level physical signal attributes are not very helpful at providing
human-usable knowledge about a sound, either. In the 1950s, Pierre Schaeffer made a
valiant attempt at coming up with a set of generic source-independent sound descriptors.
Rough English translations of the descriptors include mass, dynamics, timbre, melodic
profile, mass profile, grain, and pace. He hoped that any sound could be described by
a set of values for these descriptors. He was never satisfied with the results of his
taxonomical attempts.
More recently, Dennis Smalley has developed his theory of Spectromorphology
(Smalley, 1997) with terms that are more directly related to aural perception: onsets
(departure, emergence, anacrusis, attack upbeat, downbeat), continuants (passage,
transition, prolongation, maintenance, statement), terminations (arrival, disappearance,
closure, release, resolution), motions (push/drag, flow, rise, throw/fling, drift, float, fly),
and growth (unidirectional, such as ascent, planar, descent and reciprocal, such as
parabola, oscillation, undulation). This terminology has had some success in the
analysis of electroacoustic works of music, which are notoriously difficult due to the
unlimited sonic domain from which they draw their material and because of the lack of
definitive reference to extra-sonic semantics. In fact, the clearest limitation of
spectromorphology is its inability to address the referential dimension of much contemporary music.

The Blending Network


Sound models are endowed with semantics in their parameterization since the
models are built for a specific media context to cover a certain range of sounds and
manipulate perceptually meaningful features under parametric control. Thus, models are
given names such as "footsteps", and parameters are given names such as "walker's
weight", "limp", and perhaps a variety of characteristics concerning the walking surface.
If one knew the parameter values used for a particular footstep model to create a certain
sound, then one could interpret the semantics from the parameter names and their values
as, for example, "a heavy person walking quickly over a wet gravel surface". A single
model could be associated with a wide range of different semantics depending upon the
assumed values of the parameters.
In the context of audio production, models have one clear advantage over a library
of recorded sounds in that they permit their semantics to be manipulated. Rarely does
a sound producer turn to a library of sounds and use the sound unadulterated in a new
media production. Recordings lend themselves to standard audio manipulation techniques: they can be layered, time-altered, put through standard effects processors
such as compressors and limiters, phateners, harmonizers, etc. However, models, by
design, give a sound designer handles on the sound semantics, at least for the semantics
they were designed to capture. With only a recording, and no generative model, it would
be difficult, for example, to change a walk into a run because in addition to the interval
between the foot impacts there are a myriad of other differences in the sound due to heel-toe timing and the interaction between the foot and the surface characteristics. In a good
footsteps model, the changes would all be a coordinated function of parameters
controlling speed and style.
The footsteps model is used in the following example to illustrate the central
principles of media blending networks. Consider an audio designer who has been given
the task of creating the sound of two people passing on a stairway. In the model library
there are separate models for a person going up the stairs and for going down the stairs,
but there is no model for two people passing. The key moment for the audio designer is
the event when two people meet on the stairway so that both complete the step at the
same time. This synchronization is not a part of either input model, but it has a semantic
meaning that is crucial for the overall event.
The illustrations of blending networks use diagrams to represent the models and
relationships. In these diagrams, models are represented by circles; parameters by points
in the circles; and connections between parameters by lines. Each model may be realized
as a complex software object that can be modified at the time the blending network is
constructed. Thus the sound designer would use a high level description language to
specify the configuration of the models and their connections. The configuration
description is then compiled into the blending network which could then be run to
produce the desired sound.
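A hedged sketch of what such a high-level configuration description might look like is given below; the dictionary schema and the compile_network() helper are illustrative assumptions rather than an existing description language.

# Illustrative configuration for the two-people-on-a-stairway task described above.
CONFIGURATION = {
    "inputs": {
        "up":   {"model": "footsteps_up_stairs",   "params": {"person": "p1", "start_time": "t1"}},
        "down": {"model": "footsteps_down_stairs", "params": {"person": "p2", "start_time": "t2"}},
    },
    "generic": ["person", "start_time", "speed", "location"],   # shared parameter roles
    "cross_mapping": [("up.start_time", "down.start_time"),
                      ("up.person", "down.person"),
                      ("up.speed", "down.speed"),
                      ("up.location", "down.location")],
    "blend": {"shared": ["stairs", "time"], "goal": "two people passing on a stairway"},
}


def compile_network(config: dict) -> dict:
    """'Compile' the description into a runnable structure (here, a flat index of links)."""
    links = {pair: "bound" for pair in config["cross_mapping"]}
    return {"models": list(config["inputs"]), "links": links, "blend": config["blend"]}


if __name__ == "__main__":
    network = compile_network(CONFIGURATION)
    print(network["models"], len(network["links"]), "cross-space links")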
The Footstep network contains two input models corresponding to the audio model
for walking up the stairs and the model for walking down the stairs. Each model in Figure
2 is distinct; however, they have semantically similar parameters. The starting time for climbing the stairs is t1, the starting time for descending the stairs is t2, the person going up is p1, and the person going down is p2.
The two audio models have parameters that are semantically labeled. The cross-model mapping that connects corresponding parameters in the input models is illustrated by dashed lines in Figure 3. In addition to the starting times, ti, and the persons, pi, that
are specified explicitly in the input models, connections are established between other
similar pairs of parameters, such as walking speed, si, and location, li.
The two input models inherit information from an abstract model for walking that
includes percussion sounds, walking styles, and material surfaces. This forms a generic
model that expresses the common features associated with the two inputs. The common
features may be simple parameters, such as start time, person, speed, and location as in
Figure 4. More generally, the generic model may be used to specify the components and
relationships in more complex models as a domain ontology, as we shall see later.
The blending framework in Figure 5 contains a fourth model which is typically called
the blend. The two stair components in the input models are mapped onto a single set
of stairs in the blend. The local times, t1 and t2, are mapped onto a common time t in the
blend. However, the two people and their locations are mapped according to the local time
of the blend. Therefore, the first input model represents the audio produced while going
up the stairs, whereas the second model represents the audio produced while going
down. The projection from these input models onto the blend preserves time and location.
The Footstep network exhibits in the blend model various emergent structures that
are not present in the inputs. This emergent structure is derived from several mechanisms
available through the dynamic construction of the network. For example, the composition
of elements from the inputs causes relations to become available in the blend that do not
exist in either of the inputs. According to this particular construction, the blend contains
two moving individuals instead of the single individual in each of the inputs. The
individuals are moving in opposite directions, starting from opposite ends of the stairs, and their positions and relative temporal patterns can be compared at any time that they are on the stairs.

Figure 2. Input models for the Footstep blending network (Input 1 with parameters p1 and t1; Input 2 with parameters p2 and t2)

Figure 3. Cross-model mapping between the input footstep models (dashed lines connect the corresponding parameters p, t, s, and l of the two inputs)

Figure 4. Inclusion of the generic model for the Footstep input models (the generic space links the shared parameters of Input 1 and Input 2)
At this point the construction of the blending network is complete and constitutes
a meta-model for the two people walking in opposite directions on the same stairs. Since
this is a generative model, we can now run the scenario dynamically. In the blend there
is new structure: There is no encounter in either of the input models, but the blend
contains the synchronized stepping of the two individuals. The input models continue
to exist in their original form, therefore information about time, location, and walking
speed in the blend space can be projected back to the input models for evaluation there.
This final configuration with projection of the blend model back to the input models is
illustrated in Figure 5.

Blending Theory
Blending is an operation that occurs across two or more input spaces to yield a new
space, the blend. The blend is formed by inheriting partial structure from the input spaces and generating an emergent structure containing information not explicitly present in either input.

Figure 5. Space mapping for the blending model in the integrated footstep network (the Generic Space links Input 1 (p1, t1) and Input 2 (p2, t2), which both project into the Blend Space (p1', p2', t'))

A blend involves the use of information from two sources such that there
are two sets of bindings that map between the input spaces and from the input spaces
to the blend space. Computationally, the blending network constitutes a double scope
binding between the pair of inputs and the blend space (Fauconnier, 1997). The double
scope binding configuration for creating a blend is illustrated in Figure 5 and is composed
of the following elements:

• Input Spaces: a pair of inputs, I1 and I2, to the network along with the models for processing the inputs. In the Footstep network, the inputs were the audio models for generating the footstep sounds.
• Cross-Space Mapping: direct links between corresponding elements in the input spaces I1 and I2, or a mapping that relates the elements in one input to the corresponding elements in the other.
• Generic Space: defines the common structure and organization shared by the inputs and specifies the core cross-space mapping between inputs. The domain ontology for the input models is included in the generic space.
• Blend Space: information from the inputs I1 and I2 is partially projected onto a fourth space containing selective relations from the inputs. Additionally, the blend model inherits structure from the ontologies used in the generic model, as well as specific functions derived from the context of the current user task. The two footstep models were integrated into a single blend model with projection of parameter values.
• Emergent Structure: the blend contains new information not explicitly present in the inputs that becomes available as a result of processing the blending network. In the Footstep network, the synchronization of the footsteps at the meeting point in the blend model was the key emergent structure.
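The following minimal Python sketch illustrates these four elements as a data structure; the class names mirror the terms above and are assumptions, not drawn from an existing API.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Space:
    name: str
    elements: Dict[str, object] = field(default_factory=dict)


@dataclass
class BlendingNetwork:
    input1: Space
    input2: Space
    generic: Space                                  # shared structure / domain ontology
    blend: Space = field(default_factory=lambda: Space("blend"))
    cross_mapping: List[Tuple[str, str]] = field(default_factory=list)

    def project(self) -> None:
        """Partially project elements of both inputs into the blend space."""
        for a, b in self.cross_mapping:
            # keep one fused element per mapped pair; emergent relations are added later
            self.blend.elements[a] = (self.input1.elements.get(a), self.input2.elements.get(b))


if __name__ == "__main__":
    up = Space("up", {"t": "t1", "p": "p1"})
    down = Space("down", {"t": "t2", "p": "p2"})
    net = BlendingNetwork(up, down, Space("generic", {"roles": ["time", "person"]}),
                          cross_mapping=[("t", "t"), ("p", "p")])
    net.project()
    print(net.blend.elements)   # {'t': ('t1', 't2'), 'p': ('p1', 'p2')}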
The key contribution of Conceptual Integration Theory has been the elaboration
of a mechanism for double scope binding for the explanation of metaphor and the
processing of natural language discourse. The basic diagram for the double scope
binding from the two input models onto the blend model was previously illustrated in
Figure 5. This network is formed using the generic model to perform cross-space mapping
between the two input models, then projecting selected parameters of the input models
onto the blend model. Once the complete network has been composed, the parameter
values are bound and information is continuously propagated while dynamically running
the network. We will next discuss the computational theory for blending, then examine
the double scope binding configuration applied to audio synthesis, video editing, and
presentation mining.

Computational Theory
The blending framework for discovering emergent semantics in media consists of
three main components: ontologies that provide a shared description of the domain;
operators that apply transformations to the inputs and perform computations on the
input models; and an integration mechanism that helps the user discover emergent
structure in the media.

Ontologies
Ontologies are a key enabling technology for semantic media. An ontology may be defined as a formal and consensual specification of a conceptualization that provides a shared understanding of a domain, an understanding that can be communicated across people and application systems. Ontologies may be of several types, ranging from the conceptual specification of a domain to an encoding of computer programs and their relationships. In addition to a structured representation, ontologies therefore offer the promise of a shared and common understanding of a domain. Thus, the use of ontologies brings together two essential elements for discovering semantics in media:

• Ontologies define a formal semantics for information that can be processed by a computer.
• Ontologies define real-world semantics that facilitates the linking of machine-processable content with user-centric meaning.
Ontologies are used in the media blending framework to model relationships among
media elements, as well as provide a domain model for user-centric operators. This
provides a level of abstraction that is critical for model-based approaches to media
synthesis. As we have seen in the sound synthesis models and automated video editing
examples, the input media may come from disparate sources, be described by differing
metadata schema, and pose unique processing constraints. Moreover, the intended
usage of the media is likely to be different from the initial composition and annotation.
Thus the ontologies provide a critical link between the end user and the computer which
is necessary for emergent semantics.

Operators
The linkages among the two input spaces and the media blend in Figure 5 are
supported by a set of core operators. Two of these operators are called projection and
compression. Projection is the process in which information in one space is mapped onto
corresponding information in another space. In the video editing domain, projection
occurs through the mapping of temporal structures from music onto the duration and
sequencing of video clips. The hierarchical structure of music and the editing instructions for the video can both be modeled as a graph. Since each model is represented by
a graph structure, projection amounts to a form of graph mapping. In general, the mapping
between models is not direct, so the ontology from the generic space is used to construct
the transformation that maps information between input spaces.
Compression is the process in which detail from the input spaces is removed in the
blend space in order to provide a condensed description that can be easily manipulated.
Compression is achieved in an audio morph through the low-dimensional control parameters for the transformation between the two input sounds. The system performs
these operations in the blended space and projects the results back to any of the available
input spaces.
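A hedged sketch of the two operators follows: projection is treated as a mapping between graph nodes guided by a shared vocabulary, and compression as a reduction of detail to a few control values. The toy graphs and the averaging rule are assumptions, and no specific graph library is implied.

def project(source: dict, target: dict, ontology: dict) -> dict:
    """Map each source node onto the target node that shares its ontology concept."""
    mapping = {}
    for s_node in source:
        concept = ontology.get(s_node)
        for t_node in target:
            if concept is not None and ontology.get(t_node) == concept:
                mapping[s_node] = t_node
    return mapping


def compress(values: list, n_controls: int = 2) -> list:
    """Collapse a detailed list of values into a few averaged control parameters."""
    size = max(1, len(values) // n_controls)
    return [sum(chunk) / len(chunk)
            for chunk in (values[i:i + size] for i in range(0, len(values), size))]


if __name__ == "__main__":
    music = {"beat_times": None, "loudness": None}
    video = {"cut_points": None, "clip_energy": None}
    shared = {"beat_times": "timing", "cut_points": "timing",
              "loudness": "intensity", "clip_energy": "intensity"}
    print(project(music, video, shared))              # beat_times -> cut_points, loudness -> clip_energy
    print(compress([0.2, 0.4, 0.9, 0.7, 0.3, 0.1]))   # two coarse control values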

Integration Mechanism
The consequence of defining the operators for projection and compression is that
a new integration process called "running the blend" becomes possible within this
framework (Fauconnier, 1997). Running the blend is the process of reversing the direction
of causality, thereby using the blend to drive the production of inferences in either of
the input spaces. In the blend space of the video editing example, the duration of an edited
video clip is related to the loudness of the music and the start and stop times are determined
by salient beats in the music. The process of running the blend causes these constraints
to propagate back to the music input model to determine the loudness value and the timing
of salient beats that satisfy the editing logic of the music video blend. Thus the process
of running the blend means that operations applied in the blend model are projected back
to the inputs to derive emergent semantics, such as music-driven video editing.
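The following minimal sketch illustrates this idea of running the blend for the music-video case: a constraint stated in the blend (clips shortened as the music gets louder, cuts placed on salient beats) is pushed back to the music description to select concrete values. The data and the inverse-duration rule are illustrative assumptions only.

def run_blend(beats, loudness, base_duration=6.0, loudness_floor=0.2):
    """Return (cut_time, clip_duration) pairs derived from the music description."""
    edits = []
    for beat, level in zip(beats, loudness):
        if level >= loudness_floor:                        # only salient (loud enough) beats
            duration = base_duration / (1.0 + 4.0 * level)  # louder music -> shorter clips
            edits.append((round(beat, 2), round(duration, 2)))
    return edits


if __name__ == "__main__":
    beat_times = [0.0, 1.8, 3.6, 5.4, 7.2]
    beat_loudness = [0.9, 0.1, 0.6, 0.8, 0.3]
    for cut, dur in run_blend(beat_times, beat_loudness):
        print(f"cut at {cut:>4}s, clip length {dur}s")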
In the case of mining presentations for information, preprocessing by the system
analyzes the textbook to extract terms and relations, which are then added as concept
instances of a domain ontology within the textbook model. The user seeks to query the
courseware for information that combines the temporal sequencing of the video lecture
models with the structured organization of the textbook model. This integrated view of
the media is constructed by invoking a blending model, such as "find path" or "find similar," to translate the user query into primitives that are suitable for the input models.
Once the lecture presentation network has been constructed, the user can run the blend
to query the input models for a path through the lecture video that links any two given
topics from the textbook. The integration mechanism of this blend provides a parameterized model of a path that can be used to navigate through the media, mine for relationships, or compose answers to the original query. The emergent semantics of this media
blending model is a path through the video content that exhibits relationships derived
from the textbook. Due to the double scope binding of this network, the blending model
can also be used to project information from the video onto the textbook, thereby
imposing the temporal sequencing of the video presentation onto the hierarchical
organization of the textbook. Additional blending networks, such as the "find similar" blend, can be used to integrate information from the two input sources to discover
similarities.

Thus, the integration mechanism for multimedia presentations produces a path in the blend model. The system can then run the blend on the path using all of the emergent
properties of the path, such as contracting, expanding, branching, and measuring
distance. The specialized domain operators can now be applied and their consequences
projected back onto the input spaces. The blend model for presentations encodes
strategies that the user may use to locate information, discover the relationships between
two topics, or recover from a comprehension failure while viewing the video presentation.

EMERGENT MEDIA SEMANTICS


The ability to derive emergent semantics in the media blending framework depends
upon the specification of an ontology, the definition of operators, and processing the
integration network to run the blend. This section draws upon the domains of audio
synthesis, video editing, and media mining to illustrate the key features of this framework.
The combination of a reference ontology with synthesis models facilitates the
manipulation of media semantics. The example of an audio morph between two input
audio models illustrates the use of model semantics to provide the cross-mapping
between models and the use of the domain ontology to discover emergent structure.
The blending framework for video editing employs a domain specific ontology and
defines a set of operators that are applied to the configuration of models illustrated in
Figure 5. The key operators, as defined in Conceptual Integration Theory (Fauconnier, 1997), enable the construction of a metaphorical space that contains a mapping of
selected properties from each of the input models. The emergent semantics of the edited
music video is derived from the cross space mappings of the blending network.
The effective use of emergent semantics is obtained through the processing of the
blend model. The final part of this chapter uses media mining in the e-learning domain
to provide a detailed example for the integration of a domain ontology, operators, and
a query mechanism that runs the blend. This shows that the blending framework provides
a systematic way to manage the media semantics that emerges from heterogeneous media.

The Audio Morph


Model-based audio synthesis builds sound models from a collection of primitive units and functional modules patched together. Units include oscillators,
noise generators, filters, event generators, and arithmetic operators. Units have certain
types of signals and ranges they expect at their inputs, and produce certain types of signal
at their respective outputs. Some units can be classed together (e.g. signal generators),
and many bear other types of relationships to one another. The composition of units and
modules into sound models is informed and constrained by their parameterizations and
the relations between entities which can be specified as a domain ontology. The ontology
might include information such as the fact that an oscillator takes a frequency argument,
that a sine wave oscillator is-a oscillator, and that the null condition (when a
transform unit has no effect, or the output of a generator is constant) for an oscillator
occurs when its frequency is set to 0. Experts have implicit knowledge of these ontologies
that they use to build and manipulate model structures. There have been attempts to make
these ontologies explicit so that non-experts would have support in achieving their
semantically specified modeling objectives (Slaney, Covell, & Lassiter, 1996).
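Such a unit ontology could be encoded, for illustration only, as plain subject-relation-object triples; the vocabulary below is an assumption and is not taken from any published synthesis ontology.

TRIPLES = [
    ("oscillator",      "has_parameter",  "frequency"),
    ("sine_oscillator", "is_a",           "oscillator"),
    ("oscillator",      "null_condition", "frequency == 0"),
    ("multiplier",      "has_input",      "signal"),
    ("multiplier",      "has_input",      "modulator"),
]


def query(triples, subject=None, relation=None):
    """Return all triples matching the given subject and/or relation."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)]


def inherited(triples, unit, relation):
    """Follow is-a links so a sine_oscillator inherits facts stated for oscillator."""
    facts = query(triples, unit, relation)
    for _, _, parent in query(triples, unit, "is_a"):
        facts += inherited(triples, parent, relation)
    return facts


if __name__ == "__main__":
    print(inherited(TRIPLES, "sine_oscillator", "null_condition"))
    # -> [('oscillator', 'null_condition', 'frequency == 0')]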
Synthesis algorithms are typically represented as a signal flow diagram, as in the MAX/MSP software package. An example of a simple patch for sinusoidal amplitude modulation of a stored audio signal is shown in Figure 6.
The representation of datatype, functional, and relational information within a
domain ontology of sound synthesis units provides a rich new source of information
that can be queried and mined for hints on how to transform or to build new models. The
model components and structure define and constrain the behavior of the sound
generating objects and relate the dynamics of parametric control to changes in the sound.
They do not themselves determine the semantics, but are more closely allied to semantic
descriptions than a flat audio stream. A model component ontology would thus be
useful for making associations between model structures and the semantics that models
are given by designers or by the contexts in which their sounds are used. Relationships,
such as distances and morphologies between models, could be discovered and exploited
based not only on the sounds they produce, but also on the structure of the models that
produce the sounds. Applications of these capabilities include model database management and tools for semantically-driven model building and manipulation.
The availability of a domain ontology along with model building and manipulation
tools provides the key resources for discovering emergent semantics among sound
models. An example of emergent semantics comes from the media morph. Most people
are familiar with the concept in the realm of graphics, perhaps less so with sound. A
morph is the process of smoothly changing one object into another over a period of
time. There are several issues that make this deceptively simple concept challenging to
implement in sound, not the least of which is that the individual source and target sound
objects may themselves be evolving in time. Also, two objects are not enough to
determine a morph; they must both have representations in a common space so that
a path of intermediate points can be defined. Finally, given two objects in a consistent

representational space, there are typically an infinite number of paths that can be traversed to reach one from the other, some of which will be effective in a given usage context, others possibly not.

Figure 6. The central multiplication operation puts a sinusoidal amplitude modulation (coming from the right branch) onto the stored sample signal on the left branch in this simple MAX/MSP patch.
The work on this issue has tended to focus on various ways of interpolating
between spectral shapes of recorded sounds (Slaney et al., 1996). This approach works
well when the source and target sounds are static so that the sounds can be transformed
into a spectral representation. Similar to the case with graphical morphs, corresponding
points can be identified on the two objects in this space. A combination of space warping and interpolation is used to move between the target and source.
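A minimal sketch of the interpolation step of such a spectral morph is shown below; real systems add correspondence finding and space warping, and the toy spectra are assumptions.

import numpy as np


def spectral_morph(source_mag: np.ndarray, target_mag: np.ndarray, alpha: float) -> np.ndarray:
    """Interpolate magnitude spectra: alpha=0 gives the source, alpha=1 the target."""
    assert source_mag.shape == target_mag.shape
    return (1.0 - alpha) * source_mag + alpha * target_mag


if __name__ == "__main__":
    source = np.array([1.0, 0.5, 0.1, 0.0])   # toy magnitude spectrum of the source
    target = np.array([0.0, 0.2, 0.7, 1.0])   # toy magnitude spectrum of the target
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, spectral_morph(source, target, alpha))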
A much deeper and more informative representation of a sound is provided in terms of a sound model. There are several different ways that models can be used to define a
morph. The more the model structures can be exploited, the richer are the possible
emergent semantics.
If two different sounds can be generated by the same model, then a morph can be
trivially defined by selecting a path from the parameter setting that generates one to a
setting that generates the other. In this case, the blend space is the same as the model
for the two sounds, so although the morph may be more interesting than the spectral
variety discussed above, no new semantics can be said to emerge.
If we are given two sounds, each with a separate model capable of generating the
sounds, then the challenge is to find a common representational space in which to create
a path that connects the two sound objects. One possible solution would be to define
a set of feature detectors (e.g., spectral measurements, pitch, measures of noisiness)
that would provide a kind of description of any sound. This solves the problem of finding
a common space in which both source and target can be represented. Next, a region of
the feature space that the two model sound classes have in common needs to be
identified, and paths from the source and target need to be delineated such that they
intersect in that region. If the model ranges do not intersect in the feature space, then
a series of models with ranges that form a connected subspace needs to be created to
support such a path so that a morph can be built using a series of models as illustrated
in Figure 7. This process requires knowledge about the sound generation capabilities of
each model at a given point in feature space.
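As an illustration of this path-building idea (and of Figure 7), the sketch below reduces each model to an interval of a single feature, a deliberate simplification, and searches for a chain of models whose ranges overlap, linking the source model to the target model.

def connected_chain(models: dict, source: str, target: str):
    """Breadth-first search over models whose feature ranges overlap."""
    overlap = lambda a, b: models[a][0] <= models[b][1] and models[b][0] <= models[a][1]
    frontier, seen = [[source]], {source}
    while frontier:
        path = frontier.pop(0)
        if path[-1] == target:
            return path
        for name in models:
            if name not in seen and overlap(path[-1], name):
                seen.add(name)
                frontier.append(path + [name])
    return None


if __name__ == "__main__":
    # (low, high) extent of each model's sound class along a single feature axis
    ranges = {"source": (0.0, 0.3), "bridge_a": (0.25, 0.6),
              "bridge_b": (0.55, 0.8), "target": (0.75, 1.0)}
    print(connected_chain(ranges, "source", "target"))
    # -> ['source', 'bridge_a', 'bridge_b', 'target']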
We mentioned earlier that a model is defined not only by the sounds within its range,
but also by the paths it can take through the range as determined by the control parameterizations. The dynamic behavior defined by the possible paths plays a key role
in any semantics the model might be given. The connected feature space region defines
a path between the source and target sounds in a particular way that will create and
constrain a semantic interpretation. However, in this case, the new model is less than
satisfying because as a combination of other models, only one of which is active at a time,
it can't actually generate sounds that were not possible with the extant models.
Moreover, if the kludging together of models is actually perceived as such, then new
semantics fail to arise.
Another way to solve the problem would be to embed the two different models into a blended structure where each original model can be viewed as a special case given
by specific parameter settings of the meta-model. This could be done trivially by building
a meta-model that merely mixes the audio output from each model separately, with a
parameter that controls the relative contribution from each submodel. Again we have a trivial morph that would not be very satisfying, and because sound mixes from independent sources are generally perceived as mixes rather than as a unified sound by a single source, the semantics of the individual component models would presumably be clearly perceptible.

Figure 7. A morph in feature space performed using one model that can generate the source sound, another model that can generate the target sound, and a path passing through a point in feature space that both models are capable of generating. If the source-generating model and the target-generating model do not overlap in feature space, intermediate models can be used so that a connected path through feature space is covered.
There are, however, much richer ways of embedding two models into a blended
structure such that each submodel is a sufficient description of the meta-model under
specific parameter settings. The blended structure wraps the two submodels that
generate the morphing source and target sounds and exposes a single reduced set of
parameters. There must exist at least one setting for the meta-model parameters such that
the original morphing target sound is produced, and one such that the original morphing
source sound is produced in order to create the transformation from source sound to
target sound. The meta-model parameterization defines the common space in which both
the original sounds exist and in which any number of paths may be constructed
connecting the two. We discussed this situation earlier, except in this case, the meta-model is genuinely new, and has its own set of capabilities and constraints defined by
the relationship between the structure of the two original models, but present in neither.
New semantics emerge from the domain ontology, mappings between models, and the
integration network created in the blend.
As a concrete example of an audio morph with emergent semantics, consider two
different sounds: one the result of waveshaping on a sinusoid, the other the result of
amplitude modulation of a sampled noise source as illustrated in Figure 8. Each structure
creates a distinctive kind of distortion of the input signal. One way of combining these
two models into a meta-model is shown in Figure 9. To combine these two models, we use
knowledge about the constituent components of the models, which could be exploited
automatically if they were represented as a formal ontology as discussed above. In
particular, knowing the input and output types and ranges for signal-modifying units, and knowing specific parameter values for which the modifiers have no effect on the signal (the null condition), we can structure the model for morphing. Knowledge of null conditions, in particular, was used so that the effect of one submodel on the other would be nullified at the extreme values of the morphing parameter. Using knowledge of the modifier units' signal range expectations and transformations permits the models to be integrated at a much deeper structural level than treating the models as black boxes would permit.

Figure 8. Two kinds of signal distortion. (a) This patch puts the recorded sample through a non-linear transfer function (tanh). The amount of effect determines how non-linear the shaping is, with zero causing the original sample to be heard unchanged. (b) A sinusoidal amplitude modulation of the recorded sample.
Most importantly, blending the individual model structures creates a genuinely new
model capable of a wide range of sounds that neither submodel was capable of generating
alone, yet including the specific sounds from each submodel that were the source and
the target sounds for the morph. A new range of sounds implies new semantic possibilities.
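A hedged sketch of this kind of blended meta-model, using plain numpy, is given below: a waveshaping (WS) submodel and an amplitude-modulation (AM) submodel are wrapped so that a single morph parameter drives each submodel toward its null condition at the opposite extreme. The scaling choices are illustrative only and do not reproduce the exact patch of Figure 9.

import numpy as np


def waveshape(signal, amount):
    """tanh waveshaping; amount == 0 is the null condition (signal passes unchanged)."""
    return signal if amount == 0 else np.tanh(amount * signal) / np.tanh(amount)


def amplitude_modulate(signal, depth, mod_freq, sr=8000):
    """Sinusoidal AM; depth == 0 is the null condition (signal passes unchanged)."""
    t = np.arange(len(signal)) / sr
    return signal * (1.0 - depth + depth * np.sin(2 * np.pi * mod_freq * t))


def meta_model(signal, morph, ws_amount=4.0, am_depth=1.0, am_freq=30.0):
    """morph == 0 -> pure WS submodel; morph == 1 -> pure AM submodel; in between, both act."""
    shaped = waveshape(signal, (1.0 - morph) * ws_amount)          # WS nulled as morph -> 1
    return amplitude_modulate(shaped, morph * am_depth, am_freq)   # AM nulled as morph -> 0


if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr // 10) / sr
    x = np.sin(2 * np.pi * 220 * t)           # a toy input signal
    for m in (0.0, 0.5, 1.0):
        print(m, float(np.max(np.abs(meta_model(x, m)))))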
New semantics can be said to arise in another aspect as well. In the particular blend
illustrated above, most of the parameters exposed by the original submodels are still
available for independent control. At the extreme values for the morphing parameter, the
original controls have the same effect that they had in their original context. However,
at the in-between values for the morphing parameter, the controls from the submodels have an effect on the sound output that is entirely new and dependent upon the particular
submodel blend that is constructed. This emergent property is not present in the trivial
morph described earlier which merely mixed the audio output of the two submodels
individually. Since a morph between two objects is not completely determined by the
endpoints, but by the entire path through the blend space, there is a creative role for a
human-in-the-loop to complete the specification of the morph according to the usage
context.
Figure 9. Both the waveshaping (WS) and the amplitude modulation (AM) models embedded in a single meta-model. When the WS vs. AM morphing parameter is at one extreme or the other, we get only the effect of either the WS or the AM model individually. When the morph parameter is at an in-between state, we get a variety of new combinations of waveshaping of the AM signal and/or amplitude modulation of the waveshaped signal (depending on the other parameter settings).

We have shown that, given knowledge about how elementary units function within models in the form of an ontology, structures for different models can be combined in such a way as to give rise to new sound ranges and new handles for control. Semantics
emerge that are related to those of the model constituents, but in rich and complex ways.
There are, in general, many ways that sound models may be combined to form new
structures. Some combinations may work better in certain contexts than others. How
desired semantics can be used to guide the construction process is a topic that warrants
further study.

The Video Edit


The proliferation of digital video cameras has enabled the casual user to easily
create recordings of events. However, these raw recordings are of limited value unless
they are edited. Unfortunately, manual editing of home videos requires considerable time
and skill to create a compelling result. During many decades of experimentation, the film
industry has developed a grammar for the composition of space, time, image, music, and
sound that is routinely used to edit film (Sharff, 1982). The mechanisms that underlie
cinematic editing can be described as a blending network that leverages the cognitive
perceptions of audio and imagery to create a compelling story. The Video Edit example
borrows from such cinematic editing techniques to construct a blending network for the
semi-automatic editing of home video. In this network, the generic model is an encoding
of the cinematic editing rules relating music and video, and the input models represent
structural features of the music and visual events in the video.
Encoding of aesthetic decisions for editing video to music is key for creating the
blending model. Traditional film production techniques start by creating the video track,
then add sound effects and music to enhance the affective qualities of the video. This
is a highly labor-intensive process that does not lend itself well to automation. In the case
of the casual user with a raw home video, the preferred editing commands emphasize
functional operations, such as selecting the overall style of cinematic editing, choosing
to emphasize people in the video, selecting the music, and deciding how much native
audio to include in the final production.
The generic model for the music and video inputs is a collection of editing units that
describe simple relations between fragments of audio and video. Each unit captures
partial information associated with a cinematic editing rule, thus the units in the generic
model can be composed in a graph structure to form more complex editing logic. One
example of an insertion unit specifies that the length of a video clip to be inserted should
be inversely proportional to the loudness of the music. During the construction of the
blending network the variables for video length and music loudness are bound and
specific values are propagated during the subsequent running of the blend to
dynamically produce the final edited video. Another example of a transition unit specifies
how two video clips are to be spliced together. When this unit is added to the graph
structure, it specifies the type of transition between video clips, the duration of the
transition, and the inclusion of audio or graphical special effects. Yet another insertion
unit may relate the timing and visual characteristics of people in the video to various
structural features in the music. The generic model may therefore be viewed as an
ontology of simple editing units that can be composed into a graph structure by the
blending model for subsequent editing of the music and video inputs.
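For illustration, two such editing units might be encoded as follows; the field names and the inverse-loudness rule are assumptions standing in for the cinematic editing logic rather than an existing rule set.

def insertion_unit(loudness: float, base_length: float = 8.0) -> float:
    """Clip length is inversely proportional to music loudness (louder -> shorter clip)."""
    return base_length / (1.0 + 9.0 * max(0.0, min(loudness, 1.0)))


def transition_unit(style: str) -> dict:
    """Pick how two adjacent clips are spliced, depending on the chosen editing style."""
    table = {"energetic": {"type": "hard cut", "duration": 0.0},
             "gentle":    {"type": "cross-fade", "duration": 1.5}}
    return table.get(style, {"type": "hard cut", "duration": 0.0})


if __name__ == "__main__":
    for level in (0.1, 0.5, 0.9):
        print(f"loudness {level}: insert a {insertion_unit(level):.1f}s clip,",
              transition_unit("gentle"))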
The video input model contains the raw video footage plus the shots detected in
the video, where each shot is a sequence of contiguous video frames containing similar
images. The raw video frames are then analyzed in terms of features for color, texture,
motion, as well as simple models for the existence of people in the shot or other salient
events. This analysis provides the metadata for a model representing the input video for
use in the subsequent video editing. For example, the video model includes techniques
for finding parts of the shots which contain human faces. The information about faces
can be combined with the editing logic to create a final production which emphasizes the
people in the video. In this way, the system can automatically construct a people-oriented
model of the input video.
The model for the input music needs to support the editing logic of the generic model
for cinematic editing and the high level commands from the user interface. The basic
music model is composed of a number of parameters, including the tempo, rhythm, and
loudness envelope for the music. The combined inputs of the video model and the music
model in Figure 10 are integrated with the cinematic styles in the blending model to produce a series of editing decisions: when to make cuts, which kinds of transitions to use, what effects to add, and when to add them.

Figure 10. The blending network for automatic editing of video according to the affective structures in the music and operators for different cinematic styles (the Music and Raw Video inputs pass through Music Analysis and Video Analysis to produce a Music Description and Video Description; the Composition Logic combines these with the selected Styles to drive the Media Production)
The composition logic in the blend model integrates information from three places:
a video description produced by the video analysis, a music description produced by the
music analysis, and information about the desired editing style as selected by the user.
The composition logic uses the blending model to combine these three inputs in order
to make the best possible production from the given material, one which is as stylish
and artistically pleasing as possible. It does this by representing the blended media
construction as a graph structure and opportunistically selecting content to complete
the media graph. This process results in the emergent semantics of a music video which
inherits partial semantics from the music and from the video.

The Presentation
The Presentation example illustrates the use of emergent structure to facilitate
information retrieval from online e-learning courseware. A simple form of emergent
structure is a path that combines concept relationships from a textbook with temporal
sequencing from the video presentation. The path structure can then be manipulated to
gain insight into the informational content of the courseware by performing all of the
standard operations afforded by paths, such as traversal, compression, expansion,
branching, and the measurement of distance.

Find Path Scenario


As distance education expands by placing ever larger quantities of course content
online, students and instructors have increasing difficulty navigating the courseware
and assimilating complex relationships from the multimedia resources. The separation in
space and time between students and teachers also makes it difficult to effectively
formulate questions that would resolve media based comprehension failures. The find
path scenario addresses this problem by combining media blending with information
retrieval to assist the student in formulating complex queries about lecture video
presentations.
Suppose that a student is halfway through a computer science course on the Introduction to Algorithms, which teaches the computational theory for popular search optimization techniques (Cormen, Leiserson, Rivest, & Stein, 2001). The topic of
Dynamic Programming was introduced early in the course, then after several intervening
topics the current topic of Greedy Algorithms is presented. The student realizes that
these two temporally distant topics are connected, but the relationship is not evident
from the lecture presentations due to the temporal separation and the web of intervening
dependencies between topics. To resolve this comprehension failure, the student would
like to compose a simple query, such as "Find a path linking greedy algorithms to dynamic programming," and have the presentation system identify a sequence of
locations in the lecture video which can be composed to provide an answer to the query.
Note that the path that the student is seeking does not exist in either the textbook
or the lecture video. The textbook contains a hierarchical organization of topics and
instances of concept relations. The sequencing of topics from the beginning of the book
to the end provides a linear ordering of topics, but not necessarily a chronological
ordering. The lecture videos do not contain the semantics of a path since they have a
purely chronological sequencing of content. As we shall see, partial structure from the
textbook and the video must be projected onto the blend to create the emergent structure
of the path.

Media Blending Network


Traditional approaches to media integration are based on the use of a generic model
to provide a unified indexing scheme for the input media plus cross media mapping
(Benitez & Chang, 2002). The media blending approach adds the blend model as an
integral part of the integration mechanism. This has two consequences. Firstly, the blend
model adds considerable richness to the media semantics in terms of the operators that
can be applied, such as projection, compression, and the propagation of information back
into the input spaces. Secondly, by explicitly providing for emergent structures in the
computational framework, we can potentially achieve a higher level of integration of
multimedia resources. Of particular interest for the Presentation network is the semantics
of a path that emerges from the media blend. Once the path is obtained, then all of the
common operations on paths, such as expand, compress, extend, append, branch, and
the measurement of distance can be applied to the selected media elements.
The configuration of models used to construct the Presentation network is illustrated in Figure 11. The textbook model provides one input to the network. This model
contains instances of terms and relations between terms that can be extracted through
standard text processing techniques. The instances of terms and relations are mapped
onto the abstract concepts of the domain ontology.

Figure 11. Media integration in the Presentation network for the Find Path blend (the Textbook Model and Lecture Model inputs, the Generic Model containing the domain ontology, and the Path Blend whose emergent structure is a path of terms, topics, and video segments)

The ontology is subsequently
converted to a graph structure for efficient search. The textbook has explicit structure
due to the hierarchical organization of chapters and topics, as well as the table of
contents and index. There is also implicit structure in the linear sequencing of topics and
the convention among textbooks that simpler material comes before more complex
material.
The lecture model provides the second input which represents the lecture video,
transcripts, and accompanying slide presentation. The metadata for the lecture model can
be derived automatically through the analysis of perceptual events in the video to
classify shots according to the activity of the instructor. The text of the transcripts can
be analyzed to extract terms and indexed for text based queries. The slide presentation
and associated video time stamps provide an additional source of key terms and images
that can be cross mapped to the textbook model.
The generic model for the Presentation network contains the core domain ontology
of terms and relations used in the course. As we shall see later, the cross-space mapping
between the textbook model and the lecture model occurs at the level of extracted terms
and their locations in the respective media. The concepts in the core ontology thus
provide a unified indexing scheme for the term instances that occur as a result of media
processing in the two input models.
The blend model for the find path scenario receives projections of temporal
sequence information from the lecture video and term relations from the textbook. When
the user issues a query to find a path in the lecture video that goes from topic A to topic
B, the blend model first accesses the textbook model to expand the query terms associated
with the topics A and B. The graph representation of the textbook model is then searched
for a path linking these topics. Once a path is found among the textbook terms, the original
query is expanded into a sequence of lecture video queries, one for each of the terms plus
local context from the textbook. The blend model then evaluates each video query and
assembles the selected video clips into the final path structure (see insert in Figure 11).
At this point, the blend model has fully instantiated the path as a blend of the two inputs.
Once the path has been constructed, the user can run the blend to perform various
operations on the path. Note that the blending network has added mappings between the
input models, but has not modified the original models. Thus, the path blend can, for
instance, be used to project the temporal sequencing from the time stamps of the lecture
video back onto the textbook model to construct a navigational path in the textbook with
sequential dependencies from the video.
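A minimal sketch of this find path blend is given below: a path is searched in a toy textbook term graph, and each term on the path is then turned into a lookup against a toy index of time-stamped lecture segments. The graph, the index, and the function names are illustrative assumptions.

from collections import deque


def find_path(graph: dict, start: str, goal: str):
    """Breadth-first search over the textbook term graph."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None


def run_blend(path, lecture_index):
    """Project the path onto the lecture video: one (term, time) hit per path node."""
    return [(term, lecture_index.get(term)) for term in path]


if __name__ == "__main__":
    textbook = {"greedy algorithms": ["optimal substructure"],
                "optimal substructure": ["overlapping subproblems", "greedy choice"],
                "overlapping subproblems": ["dynamic programming"]}
    lecture = {"greedy algorithms": "01:12:40", "optimal substructure": "00:21:05",
               "overlapping subproblems": "00:24:30", "dynamic programming": "00:19:10"}
    path = find_path(textbook, "greedy algorithms", "dynamic programming")
    print(run_blend(path, lecture))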

Operators
Mappings between models in the Presentation network in Figure 11 support a set
of core operators for information retrieval in mixed media. Two of these operators are
called projection and compression. As seen in previous examples, projection is the
process in which information in one model is mapped onto corresponding information
in another model. Since both input models are represented by a graph structure, where
links between nodes are relations, projection between inputs amounts to a form of graph
matching to identify corresponding elements. These elements are then bound so that
information can pass directly between the models. A second source of binding occurs
between each input model and the emergent structure that is constructed in the blend
model. This double scope binding enables the efficient projection of information within
the network.
Compression is another core operator of the blending network that supports media
management through semantics. For example, traditional methods for constructing a
video summary require the application of specialized filters to identify relevant video
segments. The segments are then composed to form the final summary. Instead of
operating directly on the input media, compression operates on the emergent path
structure and projects the results back to the input media. Thus by operating on the path
blend, one can derive the shortest time path, the most densely connected path, or the path
with the fewest definitions from the lecture video. The system performs these operations
on the blended model and projects the results back to either of the available input models
to determine the query result.
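The following sketch illustrates compression applied to the emergent path rather than to the raw media: several candidate paths through the lecture are reduced to the one satisfying a semantic criterion, and only then projected back to video segments. The segment data and selection criteria are assumptions.

def total_time(path):
    return sum(seg["duration"] for seg in path)


def compress_paths(paths, criterion="shortest_time"):
    """Reduce a set of candidate paths to the single path satisfying the criterion."""
    if criterion == "shortest_time":
        return min(paths, key=total_time)
    if criterion == "fewest_segments":
        return min(paths, key=len)
    raise ValueError(f"unknown criterion: {criterion}")


if __name__ == "__main__":
    candidates = [
        [{"topic": "DP", "duration": 300}, {"topic": "memoization", "duration": 240},
         {"topic": "GA", "duration": 180}],
        [{"topic": "DP", "duration": 420}, {"topic": "GA", "duration": 200}],
    ]
    best = compress_paths(candidates, "shortest_time")
    print([seg["topic"] for seg in best], total_time(best), "seconds")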
The consequence of defining the operators for projection and compression is that
a new process called "running the blend" becomes possible within this framework.
Running the blend is the process of reversing the direction of causality within the
network, thereby using the blend to drive the production of inferences in either of the
input spaces. In the find path example, the application of projection and compression
on the path blend means that the user can manage the media using higher level semantics.
Moreover, all of the standard operations on paths, such as contracting, expanding,
reversing, etc. can now be performed and their consequences projected back onto the
input spaces. Finally, the user can perform a series of queries in the blend and project
the results back to the inputs to view the results.

Integration Mechanism
The network of models in the Presentation blend provides an integration mechanism
for the multimedia resources. Once the network is constructed, it is possible to process
user queries by running the blend. In the find path scenario, the student began with a
request to find a relationship between Dynamic Programming (DP) and Greedy Algorithms (GA). The system searches the domain ontology of the textbook model to discover
a set of possible paths linking DP with GA, subject to user preferences and event
descriptions. The user preferences, event descriptions, and relations among the path
nodes in the ontology are used to formulate a focused search for similar content in the
video presentation. The resultant temporal sequence of video segments is added to the
emergent path structure in the blend model.
As discussed previously, when the student requested to find a path from topic DP
to topic GA, a conceptual blend was formed which combined the ontology from the
textbook with the temporal sequencing of topics from the lecture. The result was a
chronological path through the sequence of topics linking DP to GA. This path can now
be used in an intuitive way to compress time, expand the detail, select alternative routes,
or combine with another path. The resultant path through a sequence of interrelated
media segments in the find path blend is the emergent structure arising from the
processing of the users query. Thus, one can now start from the constructed path and
project information back onto the input spaces to mine for additional information that was
previously inaccessible. For example, one could use the path to select a sequence of text
locations in the textbook that correspond to the same chronological presentation of the
topics that occurs in the lecture. Thus, the blending network effectively uses the
instructor's knowledge about the pedagogical sequencing of topics to provide a
navigational guide through the textbook.
We have designed a system for indexing mixed media content using text, audio, video, and slides, and for segmenting the content into various lecture components. The GUI for the ontology-based exploration and navigation of lecture videos is shown
in Figure 12. By manipulating these lecture components in the media blend, we are able
to present media information to support insight generation and aid in the recovery of
comprehension failures during the viewing of lecture videos. This dynamic composition
of cross media blends provides a malleable representation for generating insights into
the media content by allowing the user to manage the media through the high level
semantics of the media blend.

FUTURE TRENDS
The development of the Internet and the World Wide Web has led to the globalization of text-based exchanges of information. The subsequent use of web
services for the automatic generation of web pages from databases for both human and
machine communication is being facilitated by the development of Semantic Web
technologies. Similarly, we now have the capacity to capture and share multimedia
content on a large scale. Clearly, the plethora of pre-existing digital media and the
popularization of multimedia applications among non-professional users will drive the
demand for authoring tools that provide a high level of automation.

Figure 12. User interface for the Presentation network. Display contains the following
frames (clockwise from top left corner): Path Finder, Video Player, Slide, Slide Index,
Textbook display, and a multiple timeline display of search results presented as
gradient color hotspots on a bar chart.

Preliminary attempts toward the use of models for the generation of sound effects for games and film, as well as the retrieval of video from databases, have been primarily directed toward human-to-human communication. The increasing use of generative
models for media synthesis and the ability to dynamically construct networks for
combining these models will create new ways for people to experience media. Since the
semantics of the media is not fixed, but arises from the media and the way that it is used,
the discovery of emergent semantics through ontology based operations is becoming
a significant trend in multimedia research. The convergence of generative models,
automation, and ontologies will also facilitate the exchange of media information between
machines and support the development of a media semantic web.
In order to realize these goals, further progress is needed in the following technologies:

• Tools to support the development of generative models for media synthesis.
• Use of semantic descriptions for the composition of models into blending networks.
• Automation of user tasks based on the mediation between blending networks, as an extension to the ongoing research in database schema mediation.
• Formalization of ontologies for domain knowledge, synthesis units, and relationships among generative models.
• Discovery and generation of emergent semantics.

The trend toward increasing the automation of media production through the
creation of media models relies upon the ability to manage the semantics that emerges
from user-centric operations on the media.

CONCLUSIONS
In this chapter we have presented a framework for media blending that has proved
useful for discovering emergent semantics. Concrete examples drawn from the domains
of video editing, sound synthesis and the exploration of multimedia content for lecture
based courseware have been used to illustrate the key components of the framework.
Ontologies for sound synthesis components and the perceptual relations among sounds
were used to describe how emergent properties arise from the morphing of two audio
models into a new model. From the domain of automatic home video editing, we have
described how the basic operators of projection and compression lead to the emergence
of a stylistically edited music video with combined semantics of the source music and
video. In the video presentation example, we have shown how multiple media specific
ontologies can be used to transform high level user queries into detailed searches in the
target media.
In each of the above cases, the discovery and/or generation of emergent semantics
involved the integration of descriptions from four distinct spaces. The two input spaces
contain the models and metadata descriptions of the source media that are to be
combined. The generic space contains the domain specific information and mappings
that relate elements in the two input spaces. Finally, the blend space is where the real work
occurs for combining information from the other spaces to generate a new production
according to audio synthesis designs, cinematic editing rules, or navigational paths in
presentations as discussed in this chapter.

REFERENCES
Benitez, A. B., & Chang, S. F. (2002). Multimedia knowledge integration, summarization
and evaluation. Proceedings of the 2002 International Workshop on Multimedia
Data Mining, Edmonton, Alberta, Canada (pp. 39-50).
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to
algorithms. Cambridge, MA: MIT Press.
Davis, M. (1995). Media Streams: An iconic visual language for video representation. In Baecker, R. M., Grudin, J., Buxton, W. A. S., & Greenberg, S. (Eds.), Readings in human-computer interaction: Toward the year 2000 (2nd ed.) (pp. 854-866). San Francisco: Morgan Kaufmann Publishers.
Dorai, C., Kermani, P., & Stewart, A. (2001, October). E-learning media navigator.
Proceedings of the 9th ACM International Conference on Multimedia, Ottawa,
Canada (pp. 634-635).
Fauconnier, G. (1997). Mappings in thought and language. Cambridge, UK: Cambridge
University Press.

Funkhouser, T., Kazhdan, M., Shilane, P., Min, P., Kiefer, W., Tal, A., et al. (2004,
August). Modeling by example. ACM Transactions on Graphics (SIGGRAPH
2004).
Kellock, P., & Altman, E. J. (2000). System and method for media production. Patent WO
02/052565.
Kovar, L., & Gleicher, M. (2003). Flexible automatic motion blending with registration
curves. Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, San Diego, California (pp. 214-224).
Mott, R. L. (1990). Sound effects: Radio, TV, and film. Focal Press.
Nack, F., & Hardman, L. (2002). Towards a syntax for multimedia semantics. CWI
Technical Report, INS-R0204, April.
Rolland, P.-Y., & Pachet, F. (1995). Modeling and applying the knowledge of synthesizer
patch programmers. In G. Widmer (Ed.), Proceedings of the IJCAI-95 International
Workshop on Artificial Intelligence and Music, 14th International Joint Conference on Artificial Intelligence, Montreal, Canada. Retrieved June 1, 2004, from
http://citeseer.ist.psu.edu/article/rolland95modeling.html
Russolo, L. (1916). The art of noises. Barclay Brown (translation). New York: Pendragon
Press.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 337-351.
Sharff, S. (1982). The elements of cinema: Toward a theory of cinesthetic impact. New
York: Columbia University Press.
Slaney, M., Covell, M., & Lassiter, B. (1996). Automatic audio morphing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, 1-4. Retrieved June 1, 2004, from http://citeseer.nj.nec.com/slaney95automatic.html
Smalley, D. (1997). Spectromorphology: Explaining sound-shapes. Organised Sound,
2(2), 107-126.
Staab, S., Maedche, A., Nack, F., Santini, S., & Steels, L. (2002). Emergent semantics. IEEE
Intelligent Systems: Trends & Controversies, 17(1), 78-86.
Thompson Learning (n.d.). Retrieved June 1, 2004, from http://www.thompson.com/
Veale, T., & O'Donoghue, T. (2000). Computation and blending. Cognitive Linguistics,
11, 253-281.
WebCT (n.d.). Retrieved June 1, 2004, from http://www.webct.com


Glossary

B Frame: One of three picture types used in MPEG video. B pictures are bidirectionally
predicted, based on both previous and following pictures. B pictures usually use
the least number of bits. B pictures do not propagate coding errors since they are
not used as a reference by other pictures.
Bandwidth: There are physical constraints on the amount of data that can be transferred
through a specific medium. The constraint is measured in terms of the amount of
data that can be transferred over a measure of time, and is known as the bandwidth
of the particular medium. Bandwidth is measured in bps (bits per second).
Bit-rate: The rate at which a presentation is streamed, usually expressed in Kilobits per
second (Kbps).
bps: Bits-Per-Second
Compression/Decompression: A method of encoding/decoding signals to reduce the
data rate, allowing transmission (or storage) of more information than the medium
would otherwise be able to support.
Extensible Markup Language: see XML
Encoding/Decoding: Encoding is the process of changing data from one form into
another according to a set of rules specified by a codec. The data is usually a file
containing audio, video or still image. Often the encoding is done to make a file
compatible with specific hardware (such as a DVD Player) or to compress or reduce
the space the data occupies.


Feature: A Feature is a distinctive characteristic of the data which signifies something
to somebody. A Feature for a given data set has a descriptor (i.e., feature
representation) and an instantiation (descriptor value); for example, colour histograms.
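For illustration only (a minimal sketch; the 8 bins per channel are an arbitrary choice, not
tied to any particular standard), a colour histogram descriptor value can be instantiated
from raw pixel data as follows:

def colour_histogram(pixels, bins=8):
    # pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    hist = [0] * (3 * bins)
    step = 256 // bins
    count = 0
    for r, g, b in pixels:
        hist[r // step] += 1
        hist[bins + g // step] += 1
        hist[2 * bins + b // step] += 1
        count += 1
    # Normalise so images of different sizes yield comparable descriptor values.
    return [h / count for h in hist] if count else hist

print(colour_histogram([(255, 0, 0), (250, 10, 5)]))  # a mostly-red descriptor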
Frame Rate: The number of images captured or displayed per second. A lower frame rate
produces a less fluid motion and saves disk space. A higher setting results in a fluid
image and a larger movie file.
GIF (Graphic Interchange Format): A common format for image files, especially suitable
for images containing large areas of the same color.
GUI (Graphical User Interface): The set of items and facilities that provides the user
with a graphic means of manipulating screen data rather than being limited to
character-based commands. Graphical user interface tool kits are provided by many
different vendors and contain a variety of components including (but not limited to)
tools for creating and manipulating windows, menu bars, status bars, dialogue boxes,
pop-up windows, scroll or slide bars, icons, radio buttons, and online and
context-dependent help facilities. Graphical user interface tool kits may also provide
facilities for using a mouse to locate and manipulate on-screen data and activate
program components.
Hyperlink: A synonym for link. A hyperlink links a document to another document,
or a document fragment to another document fragment. A link on a document can
be activated, and the user is then taken to the linked document or fragment. In Web
pages, links were originally underlined to indicate their presence.
I Frame: An I frame is encoded as a single image, with no reference to any past or future
frames. Often video editing programs can only cut MPEG-1 or MPEG-2 encoded
video on an I frame since B frames and P frames depend on other frames for encoding
information.
Index: An index is a database feature used for locating data quickly within a table. Indexes
are defined by selecting a set of commonly searched attribute(s) on a table and
using the appropriate platform-specific mechanism to create an index.
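For example (a sketch using Python's standard sqlite3 module; the table and attribute
names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clips (id INTEGER PRIMARY KEY, title TEXT, duration REAL)")
# Index a commonly searched attribute so lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_clips_title ON clips (title)")
conn.execute("INSERT INTO clips (title, duration) VALUES ('sunset', 12.5)")
rows = conn.execute("SELECT id, duration FROM clips WHERE title = 'sunset'").fetchall()
print(rows)  # [(1, 12.5)]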
Interactive Multimedia: Term used interchangeably with multimedia whose input and
output are interleaved, like a conversation, allowing the user's input to depend on
earlier output from the same run.
Internet Protocol: IP provides a datagram service between hosts on a TCP/IP network;
TCP runs on top of IP. IP routes the packets of data that are transmitted to the
correct host, and it also fragments datagrams and reassembles them.
ISO (International Organization for Standardization): The ISO is an international body that
certifies standards ranging from visual signs (such as the i-sign for information,


seen at places such as airports and information kiosks) to language characters
(such as those proposed by ANSI and ASCII).
JPEG (Joint Photographic Experts Group): JPEG is most commonly mentioned as a
format for image files. JPEG format is preferred to the GIF format for photographic
images as opposed to line art or simple logo art.
Key Frame: In movie compression, the key frame is the baseline frame against which
other frames are compared for differences. The key frames are saved in their
entirety, while the frames in between are compressed based on their differences
from the key frame.
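A toy sketch of the idea (hypothetical code, not any real codec): the key frame is stored
whole and the in-between frames are stored as differences from it:

def delta_encode(frames):
    # frames: list of equal-length lists of pixel values; frames[0] is the key frame.
    key = frames[0]
    deltas = [[p - k for p, k in zip(frame, key)] for frame in frames[1:]]
    return key, deltas

def delta_decode(key, deltas):
    # Rebuild each in-between frame by adding its differences back onto the key frame.
    return [key] + [[k + d for k, d in zip(key, delta)] for delta in deltas]

key, deltas = delta_encode([[10, 10, 10], [10, 12, 10], [11, 12, 10]])
assert delta_decode(key, deltas) == [[10, 10, 10], [10, 12, 10], [11, 12, 10]]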
Lossless: A video/image compression method that retains all of the information present
in the original data.
Metamodel: A metamodel is a formal model used to create a lower-level model. A
metamodel consists of the rules and conventions to be used when the lower-level
model is created.
MPEG (Moving Picture Experts Group): A set of ISO and IEC standards:
MPEG-1: Audio - MPEG-1 Layer 2 (MP2) and Layer 3 (MP3), audio codecs widely used
on the Internet and offering good compression ratios; also a widely used video
compression format.
MPEG-2: Audio - AAC (Advanced Audio Codec); an enhanced video compression
format (also used in DVD and HDTV applications); A/V transport and synchronization.
MPEG-4: Adapts these technologies for the Internet (application layer sets); for example,
BIFS offers alternatives and extensions to SMIL and VRML, together with more advanced
audio coding techniques (very low data rates, scalable data rates, combined speech and
audio coding), multiple delivery media, and mixed content.
MPEG-7: A multimedia content description interface providing (meta)search and
identification techniques for media streams.
MPEG-21: Builds an infrastructure for the delivery and consumption of multimedia content.
MPEG-7 Descriptor: A Descriptor is a representation of a Feature. It defines the syntax
and semantics of Feature representation. A single Feature may take on several
descriptors for different requirements.
Multimedia: The use of computers to present text, graphics, video, animation, and sound
in an integrated way.
Noninterlaced: The video signal created when frames or images are rendered from a
graphics program. Each frame contains a single field of lines being drawn one after
another. See also interlaced.


P Frame: A P-frame is a video frame encoded relative to the past reference frame. A
reference frame is a P- or I-frame. The past reference frame is the closest preceding
reference frame.
Pixel: A picture element; images are made of many tiny pixels. For example, a 13-inch
computer screen at 640 × 480 resolution is made of 307,200 pixels (640 columns by 480 rows).
Protocol: A protocol is a standard set of technical conventions allowing communication
between different electronic devices; it consists of the rules governing the communication between those devices.
Query: Queries are the primary mechanism for retrieving information from a database and
consist of questions presented to the database in a predefined format. Many
database management systems use the Structured Query Language (SQL) standard
query format.
QuickTime: A desktop video standard developed by Apple Computer. QuickTime
Animation was created for lossless compression of animated movies and QuickTime
Video was created for lossy compression of desktop video.
Scene: A meaningful segment of the video.
Schema: A Schema is a mechanism for describing the structure of data, similar to defining data types.
SQL (Structured Query Language): A specialized programming language for sending
queries to databases.
Streaming: Multimedia files are typically large, and a user does not want to wait until the
entire file is received through an Internet connection. Streaming makes it possible for a
portion of a file's content to be viewed before the entire file is received: the data of the
file is sent continuously from the server, and the user can begin viewing the streamed
data while it loads.
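A minimal sketch of the receiving side (invented names; real streaming systems add
buffering, timing, and transport control):

def stream_file(path, chunk_size=4096):
    # Yield the file's content in chunks so playback can start
    # before the whole file has arrived.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# A player would consume chunks as they arrive, e.g.:
# for chunk in stream_file("presentation.mpg"):
#     decode_and_play(chunk)  # hypothetical decoder call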
TCP (Transmission Control Protocol): The TCP protocol ensures the reliable transmission of
data between two hosts. Information is transmitted in packets.
TCP/IP: The combination of the TCP and IP protocols, which forms the basic protocol suite of the Internet.
URL (Uniform Resource Locator): The standard way to give the address of any resource
on the Internet that is part of the World Wide Web (WWW).
Web (WWW) (World Wide Web): The universe of hypertext servers (HTTP servers),
which allow text, graphics, sound files, and so forth, to be linked and mixed
together.


XML (Extensible Markup Language): XML is a simplified application profile of SGML
(Standard Generalized Markup Language [ISO 8879]), the international standard for
markup languages. XML allows users to create their own markup tags and is designed
for use on the World Wide Web.


About the Authors


Uma Srinivasan worked as principal research scientist at CSIRO, leading a team of
scientists and engineers in two specialist areas: Multimedia Delivery Technologies and
Health Data Integration. She holds a PhD from the University of New South Wales,
Australia and has over 20 years experience in Information Technology research,
consultancy and management. Her work in the area of video semantics and event
detection has been published in several international conferences, journals and book
chapters. She has been an active member of the MPEG-7 conceptual modelling group. She
has been particularly successful in working with multi-disciplinary teams in developing
new ideas for information architectures and models, many of which have been translated
into elegant information solutions. She serves as a visiting faculty at the University of
Western Sydney, Australia. Dr. Srinivasan is now the founder and director of PHI
Systems, a company specialising in delivering Pervasive Health Information technologies.
Surya Nepal is a senior research scientist working on consistency management in long-running Web service transactions at the CSIRO ICT Centre. His main research
interest is in the development and implementation of algorithms and frameworks for
multimedia data management and delivery. He obtained his BE from Regional Engineering
College, Surat, India; ME from the Asian Institute of Technology, Bangkok, Thailand;
and PhD from RMIT University, Australia. At CSIRO, Surya undertook research into
content-management problems faced by large collections of images and videos. His main
interests are in high-level event detection and processing. He has worked in the areas
of content-based image retrieval, query processing in multimedia databases, event
detection in sports video, and spatio-temporal modeling and querying video databases.
He has several papers in these areas to his credit.
* * *
Brett Adams received a BE in information technology from the University of Western
Australia, Perth, Australia (1995). He then worked for three years developing software,
particularly for the mining industry. In 1999 he began a PhD at the Curtin University of
Technology, Perth, on the extraction of high-level elements from feature film, which he
completed in 2003. His research interests include computational media aesthetics, with
application to mining multimedia data for meaning and computationally assisted,
domain-specific multimedia authoring.
Edward Altman has spent the last decade developing a theory of media blending as a
senior scientist in the Media Semantics Department at the Institute for Infocomm
Research (I2R) and before that as a visiting researcher at ATR Media Integration &
Communications Research Laboratories. His current research involves the development
of Web services, ontology management, and media analysis to create interactive
environments for distance learning. He was instrumental in the development of
semiautomated video editing technologies for muvee Technologies. He received a PhD
from the University of Illinois in 1991 and performed post-doctoral research in a joint
computer vision and cognitive science program at the Beckman Institute for Advanced
Science and Technology.
Susanne Boll is assistant professor for multimedia and Internet-technologies, Department of Computing Science, University of Oldenburg, Germany. In 2001, Boll received
her doctorate with distinction at the Technical University of Vienna, Austria. Her studies
were concerned with the flexible multimedia document model ZYX, designed and realized
in the context of a multimedia database system. She received her diploma degree with
distinction in computer science at the Technical University of Darmstadt, Germany
(1996). Her research interests lie in the area of personalization of multimedia content,
mobile multimedia systems, multimedia information systems, and multimedia document
models. The research projects that Boll is working on include a framework for personalized multimedia content generation and development of personalized (mobile) multimedia
presentation services. She has been publishing her research results at many international
workshops, conferences and journals. Boll is an active member of SIGMM of the ACM
and German Informatics Society (GI).
Cheong Loong Fah received a B Eng from the National University of Singapore and a
PhD from the University of Maryland, College Park, Center for Automation Research
(1990 and 1996, respectively). In 1996, he joined the Department of Electrical and
Computer Engineering, National University of Singapore, where he is currently an
assistant professor. His research interests are related to the basic processes in the
perception of three-dimensional motion, shape, and their relationship, as well as the
application of these theoretical findings to specific problems in navigation and in
multimedia systems, for instance, in the problems of video indexing in large databases.
Isabel F. Cruz is an associate professor of computer science at the University of Illinois
at Chicago (UIC). She holds a PhD in computer science from the University of Toronto.
In 1996, she received a National Science Foundation CAREER award. She is a member of
the National Research Council's Mapping Science Committee (2004-2006). She has been
invited to give more than 50 talks worldwide, has served on more than 70 program
committees, and has more than 60 refereed publications in databases, Semantic Web,
visual languages, graph drawing, user interfaces, multimedia, geographic information
systems, and information retrieval.

Ajay Divakaran received a BE (with Honors) in electronics and communication engineering from the University of Jodhpur, Jodhpur, India (1985), and an MS and PhD from
Rensselaer Polytechnic Institute, Troy, New York (1988 and 1993, respectively). He was
an assistant professor with the Department of Electronics and Communications Engineering, University of Jodhpur, India (1985-1986). He was a research associate at the
Department of Electrical Communication Engineering, Indian Institute of Science, in
Bangalore, India (1994-1995). He was a scientist with Iterated Systems Inc., Atlanta,
Georgia (1995-1998). He joined Mitsubishi Electric Research Laboratories (MERL) in 1998
and is now a senior principal member of the technical staff. He has been an active
contributor to the MPEG-7 video standard. His current research interests include video
analysis, summarization, indexing and compression, and related applications. He has
published several journal and conference papers, as well as four invited book chapters
on video indexing and summarization. He currently serves on program committees of key
conferences in the area of multimedia content analysis.
Thomas S. Huang received a BS in electrical engineering from the National Taiwan
University, Taipei, Taiwan, ROC, and an MS and ScD in electrical engineering from the
Massachusetts Institute of Technology (MIT), Cambridge. He was with the faculty of
the Department of Electrical Engineering at MIT (1963-1973) and with the faculty of the
School of Electrical Engineering and Signal Processing at Purdue University, West
Lafayette, Indiana (1973-1980). In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now William L. Everitt distinguished professor of electrical and
computer engineering, research professor at the Coordinated Science Laboratory, and
head of the Image Formation and Processing Group at the Beckman Institute for
Advanced Science and Technology. He is also co-chair of the institute's major research
theme (human-computer intelligent interaction). During his sabbatical leaves, he has
been with the MIT Lincoln Laboratory, Lexington, MA; IBM T.J. Watson Research
Center, Yorktown Heights, NY; and Rheinishes Landes Museum, Bonn, West Germany.
He held visiting professor positions at the Swiss Federal Institutes of Technology,
Zurich and Lausanne, Switzerland; University of Hannover, West Germany; INRS-Telecommunications, University of Quebec, Montreal, QC, Canada; and University of
Tokyo, Japan. He has served as a consultant to numerous industrial forums and
government agencies both in the United States and abroad. His professional interests
lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. He has published 14 books and more than 500 papers
in network theory, digital filtering, image processing, and computer vision. He is a
founding editor of the International Journal Computer Vision, Graphics, and Image
Processing, and editor of the Springer Series in Information Sciences (Springer Verlag).
Dr. Huang is a member of the National Academy of Engineering; a foreign member of the
Chinese Academies of Engineering and Sciences; and a fellow of the International
Association of Pattern Recognition and of the Optical Society of America. He has
received a Guggenheim Fellowship, an AV Humboldt Foundation Senior US Scientist
Award, and a Fellowship from the Japan Association for the Promotion of Science. He
received the IEEE Signal Processing Society's Technical Achievement Award in 1987 and
the Society Award in 1991. He was awarded the IEEE Third Millennium Medal in 2000.
In addition, in 2000 he received the Honda Lifetime Achievement Award for contributions to motion analysis. In 2001, he received the IEEE Jack S. Kilby Medal. In 2002, he
received the King-Sun Fu Prize from the International Association of Pattern Recognition
and the Pan Wen-Yuan Outstanding Research Award.
Jesse S. Jin graduated with a PhD from the University of Otago, New Zealand. He worked
as a lecturer in Otago, a lecturer, senior lecturer and associate professor at the University
of New South Wales, an associate professor at the University of Sydney. He is now the
chair professor of IT at The University of Newcastle. Professor Jin's areas of interest
include multimedia technology, medical imaging, computer vision and the Internet. He
has published more than 160 articles and 14 authored and edited books. He also has one
patent and is in the process of filing three more. He has received several million dollars
in research funding from government agents (ARC, DIST, etc.), universities (UNSW,
USyd, Newcastle, etc.), industries (Motorola, NewMedia, Cochlear, Silicon Graphics,
Proteome Systems, etc.), and overseas organisations (NZ Wool Board, UGC HK, CAS,
etc.). He established a spin-off company that won the 1999 ATP Vice-Chancellor New
Business Creation Award. He is a consultant to companies such as Motorola, Computer
Associates, ScanWorld, Proteome Systems, HyperSoft.
Ashraf A. Kassim (M81) received his BEng (First Class Honors) and MEng degrees in
electrical engineering from the National University of Singapore (NUS) (1985 and 1987,
respectively). From 1986 to 1988, he worked on the design and development of machine
vision systems at Texas Instruments. He went on to obtain his PhD in electrical and
computer engineering from Carnegie Mellon University, Pittsburgh (1993). Since 1993,
he has been with the Electrical & Computer Engineering Department at NUS, where he
is currently an associate professor and deputy head of the department. Dr. Kassim's
research interests include computer vision, video/image processing, and compression.
Wolfgang Klas is professor at the Department of Computer Science and Business
Informatics at the University of Vienna, Austria, heading the multimedia information
systems group. Until 2000, he was professor with the Computer Science Department at
the University of Ulm, Germany. Until 1996, he was head of the Distributed Multimedia
Systems Research Division (DIMSYS) at GMD-IPSI, Darmstadt, Germany. From 1991 to
1992, Dr. Klas was a visiting fellow at the International Computer Science Institute (ICSI),
University of California at Berkeley, USA. His research interests are in multimedia
information systems and Internet-based applications. He currently serves on the editorial board of the Very Large Data Bases Journal and has been a member and chair of
program committees of many conferences.
Shonali Krishnaswamy is a research fellow at the School of Computer Science and
Software Engineering at Monash University, Melbourne, Australia. Her research interests include service-oriented computing, distributed and ubiquitous data mining, software agents, and rough sets. She received her master's and PhD in computer science from
Monash University. She is a member of IEEE and ACM.
Neal Lesh's research efforts currently focus on human-computer collaborative interface
agents and interactive data exploration. He has recently published papers on a range of
topics including computational biology, data mining, information visualization, human-robot interaction, planning, combinatorial optimization, story-sharing systems, and
intelligent tutoring. Before joining MERL, Neal completed a PhD at the University of
Washington and worked briefly as a post-doctoral student at the University of Rochester.
Joo-Hwee Lim received his BSc (Hons I) and MSc (by research) in computer science from
the National University of Singapore (1989 and 1991, respectively). He joined the
Institute for Infocomm Research, Singapore, in October 1990. He has conducted research
in connectionist expert systems, neural-fuzzy systems, handwriting recognition, multiagent systems, and content-based retrieval. He was a key researcher in two international
research collaborations, namely the Real World Computing Partnership funded by METI,
Japan, and the Digital Image/Video Album project with CNRS, France, and School of
Computing, National University of Singapore. He has published more than 50 refereed
international journal and conference papers in his research areas including content-based processing, pattern recognition, and neural networks.
Namunu C. Maddage is currently pursuing a PhD in computer science in the School of
Computing, National University of Singapore, Singapore. His research interests are in the
areas of music modeling, music structure analysis and audio/music data mining. He
received a BE in 2000 in the Department of Electrical & Electronic Engineering from Birla
Institute of Technology (BIT), Mesra, in India.
Ankush Mittal received a BTech and a master's (by research) degree in computer science
and engineering from the Indian Institute of Technology, Delhi. He received a PhD from
the National University of Singapore (2001). Since October 2003, he has been working
as assistant professor at the Indian Institute of Technology - Roorkee. Prior to this, he
was serving as a faculty member in the Department of Computer Science, National
University of Singapore. His research interests are in multimedia indexing, machine
learning, and motion analysis.
Baback Moghaddam is a senior research scientist of MERL Research Lab at Mitsubishi
Electric Research Labs, Cambridge, MA, USA. His research interests are in computational vision with a focus on probabilistic visual learning, statistical modeling and pattern
recognition with application in biometrics and computer-human interface. He obtained
his PhD in electrical engineering and computer science (EECS) from the Massachusetts
Institute of Technology (MIT) in 1997. Here, he was a member of the Vision and Modeling
Group at the MIT Media Laboratory, where he developed a fully-automatic vision system
which won DARPA's 1996 FERET Face Recognition Competition. Dr. Moghaddam was
the winner of the 2001 Pierre Devijver Prize from the International Association of Pattern
Recognition for his innovative approach to face recognition and received the Pattern
Recognition Society Award for exceptional outstanding quality for his journal paper
Bayesian Face Recognition. He currently serves on the editorial board of the journal
titled Pattern Recognition and has contributed to numerous textbooks on image
processing and computer vision (including the core chapter in Springer Verlags latest
biometric series, Handbook of Face Recognition).


Anne H.H. Ngu is an associate professor with the Department of Computer Science, Texas
State University, San Marcos, Texas. Ngu received her PhD in 1990 from the University
of Western Australia. She has more than 15 years of experience in research and
development in IT with expertise in integrating data and applications on the Web,
multimedia databases, Web services, and object-oriented technologies. She has worked
in different countries as a researcher, including the Institute of Systems Science in
Singapore, Tilburg University, The Netherlands; Telcordia Technologies and MCC in
Austin, Texas. Prior to moving to the United States, she worked as a senior lecturer in
the School of Computer Science and Engineering, University of New South Wales
(UNSW). Currently, she also holds an adjunct associate professor position at UNSW and
summer faculty scholar position at Lawrence Livermore National Laboratory, California.
Krishnan V. Pagalthivarthi is associate professor in the Department of Applied Mechanics, Indian Institute of Technology Delhi, India. Dr. Krishnan received his BTech
from IIT Delhi (1979) and obtained his MSME (1984) and PhD (1988) from Georgia Institute
of Technology. He has supervised several students studying for their MTech, MS
(R), and PhD degrees and has published numerous research papers in various journals.
Silvia Pfeiffer received her master's degree in computer science and business management from the University of Mannheim, Germany (1993). She returned to that university
in 1994 to pursue a PhD within the MoCA (Movie Content Analysis) project, exploring
novel extraction methods for audio-visual content and novel applications using these.
Her thesis of 1999 was about audio content analysis of digital video. Next, she moved
to Australia to work as a research scientist in digital media at the CSIRO in Sydney. She
has explored several projects involving automated content analysis in the compressed
domain, focusing on segmentation applications. She has also actively submitted to
MPEG-7. In January 2001, she had initial ideas for a web of continuous media, the
specifications of which were worked out within the continuous media web research group
that she is heading.
Conrad Parker works as a senior software engineer at CSIRO, Australia. He is actively
involved in various open source multimedia projects, including development of the Linux
and Unix sound editor Sweep. With Dr. Pfeiffer, he developed the mechanisms for
streamable metadata encapsulation used in the Annodex format, and is responsible for
development of the core software libraries, content creation tools and server modules of
the reference implementation. His research focuses on interesting applications of
dynamic media generation, and improved TCP congestion control for efficient delivery
of media resources.
André Pang received his Bachelor of Science (Honors) at the University of New South
Wales, Sydney, Australia (2003). He has been involved with the Continuous Media Web
project since 2001, helping to develop the first specifications and implementations of the
Annodex technology and implementing the first Annodex Browser under Mac OS X.
André is involved in integrating Annodex support into several media frameworks, such
as the VideoLAN media player, DirectShow, xine, and QuickTime. In his spare time, he
enjoys researching about compilers and programming languages, and also codes on
many different open-source projects.

Viranga Ratnaike is a PhD candidate in the School of Computer Science and Software
Engineering, Faculty of Information Technology, Monash University, Melbourne,
Australia. He holds a Bachelor of Applied Science (Honors) in computer science. After
being a programmer for several years, he decided to return to full-time study and pursue
a career in research. His research interests are in emergence, artificial intelligence and
nonverbal knowledge representation.
Olga Sayenko received her BS from the University of Illinois at Chicago (2001). She is
working toward her MS under the direction of Dr. Cruz with the expected graduation date
in July 2004.
Karin Schellner studied computer science at the Technical University of Vienna and
received her diploma degree in 2000. From 1995 to 2001, she was working at IBM Austria.
From 2001 to 2003, she worked at the Department of Computer Science and Business
Informatics at the University of Vienna. Since 2003, she has been a member of Research
Studios Austria Digital Memory Engineering. She has been responsible for the concept,
design and implementation of the data model developed in CULTOS.
Ansgar Scherp received his diploma degree in computer science at the Carl von
Ossietzky University of Oldenburg, Germany (2001), with the diploma thesis "Process
Model and Development Methodology for Virtual Laboratories." Afterwards, he worked
for two years at the University of Oldenburg where he developed methods and tools for
virtual laboratories. Since 2003 he has been working as a scientific assistant at the
research institute OFFIS on the MM4U (Multimedia for you) project. The aim of this
project is the development of a component-based object-oriented software framework
that offers extensive support for the dynamic generation of personalized multimedia
content.
Xi Shao received a BS and MS in computer science from Nanjing University of Posts and
Telecommunications, Nanjing, PR China (1999 and 2002, respectively). He is currently
pursuing a PhD in computer science in the School of Computing, National University of
Singapore, Singapore. His research interests include content-based audio/music analysis, music information retrieval, and multimedia communications.
Chia Shen is associate director and senior research scientist of MERL Research Lab at
Mitsubishi Electric Research Labs, Cambridge, MA, USA. Dr. Shen's research investigates HCI issues in our understanding of multi-user, computationally augmented interactive surfaces, such as digital tabletops and walls. Her research probes new ways of
thinking in terms of UI design and interaction technique development, and entails the re-examination of the conventional metaphor and underlying system infrastructure, which
have been traditionally geared towards mice and keyboard-based, single-user desktop
computers and devices. Her current research projects include DiamondSpin, UbiTable
and PDH (see www.merl.com/projects for details).
Jialie Shen received his BSc in applied physics from Shenzhen University, China. He is
now a PhD candidate and associate lecturer in the School of Computer Science and

Engineering at the University of New South Wales (Sydney, Australia). His research
interests include database systems, indexing, multimedia databases and data mining.
Bala Srinivasan is a professor of information technology in the School of Computer
Science and Software Engineering at the Faculty of Information Technology, Monash
Univeristy, Melbourne, Australia. He was formerly an academic staff member of the
Department of Computer Science and Information Systems at the National University of
Singapore and the Indian Institute of Technology, Kanpur, India. He has authored and
jointly edited six technical books and authored and co-authored more than 150 international refereed publications in journals and conferences in the areas of multimedia
databases, data communications, data mining, and distributed systems.
He is a founding chairman of the Australasian Database Conference. He was awarded the
Monash Vice-Chancellor medal for post-graduate supervision. He holds a Bachelor of
Engineering (Honors) in electronics and communication engineering, and a masters and
PhD, both in computer science.
Qi Tian received his PhD in electrical and computer engineering from the University of
Illinois at Urbana-Champaign (UIUC), Illinois (2002). He received his MS in electrical and
computer engineering from Drexel University, Philadelphia, Pennsylvania (1996), and a
BE in electronic engineering from Tsinghua University, China (1992). He has been an
assistant professor in the Department of Computer Science at the University of Texas at
San Antonio (UTSA) since 2002 and an adjunct assistant professor in the Department
of Radiation Oncology at the University of Texas Health Science Center at San Antonio
(UTHSCSA) since 2003. Before he joined UTSA, he was a research assistant at the Image
Formation and Processing (IFP) Group of the Beckman Institute for Advanced Science
and Technology and a teaching assistant in the Department of Electrical and Computer
Engineering at UIUC (1997-2002). During the summers of 2000 and 2001, he was an intern
researcher with the Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA.
In the summer of 2003, he was a visiting professor at NEC Laboratories America, Inc.,
Cupertino, CA, in the Video Media Understanding Group. His current research interests
include multimedia, computer vision, machine learning, and image and video processing.
He has published about 40 technical papers in these areas and has served on the program
committee of several conferences in the area of content-based image retrieval. He is a
senior member of IEEE.
Svetha Venkatesh is a professor at the School of Computing at the Curtin University of
Technology, Perth, Western Australia. Her research is in the areas of large-scale pattern
recognition, image understanding and applications of computer vision to image and
video indexing and retrieval. She is the author of about 200 research papers in these areas
and is currently co-director for the Center of Excellence in Intelligent Operations
Management.
Utz Westermann is a member of the Department of Computer Science and Business
Informatics at the University of Vienna. He received his diploma degree in computer
science at the University of Ulm, Germany (1998), and his doctoral degree in technical
sciences at the Technical University of Vienna, Austria (2004). His research interests

lie in the area of context-aware multimedia information systems. This includes metadata
standards for multimedia content, metadata management, XML, XML databases, and
multimedia databases. Utz Westermann has participated in several third-party-funded
projects in this domain.
Campbell Wilson received his master's degree and PhD in computer science from
Monash University. His research interests include multimedia retrieval techniques,
probabilistic reasoning, virtual reality interfaces and adaptive user profiling. He is a
member of the IEEE.
Ying Wu received his PhD in electrical and computer engineering from the University of
Illinois at Urbana-Champaign (UIUC), Urbana, Illinois (2001). From 1997 to 2001, he was
a research assistant at the Beckman Institute at UIUC. During 1999 and 2000, he was
with Microsoft Research, Redmond, Washington. Since 2001, he has been an assistant
professor at the Department of Electrical and Computer Engineering of Northwestern
University, Evanston, Illinois. His current research interests include computer vision,
machine learning, multimedia, and human-computer interaction. He received the Robert
T. Chien Award at UIUC, and is a recipient of the NSF CAREER award.
Lonce Wyse heads the Multimedia Modeling Lab at the Institute for Infocomm Research
(I2R). He also holds an adjunct position at the National University in Singapore where
he teaches a course in sonic arts and sciences. He received his PhD in 1994 in cognitive
and neural systems from Boston University, specializing in vision and hearing systems,
and then spent a year as a Fulbright Scholar in Taiwan before joining I2R. His current
research focus is applications and techniques for developing sound models.
Changsheng Xu received his PhD from Tsinghua University, China (1996). From 1996 to
1998, he was a research associate professor in the National Lab of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences. He joined the Institute for
Infocomm Research (I2R) of Singapore in March 1998. Currently, he is head of the Media
Analysis Lab in I2R. His research interests include multimedia content analysis/indexing/
retrieval, digital watermarking, computer vision and pattern recognition. He is a senior
member of IEEE.
Jie Yu is a PhD candidate of computer science at University of Texas at San Antonio
(UTSA). He received his bachelor's degree in telecommunication engineering from Dong
Hua University, China (2000). He has been a research assistant and teaching assistant
in the Department of Computer Science at UTSA since 2002. His current research in image
processing is concerned with the study of efficient algorithms in content-based image
retrieval.
Sonja Zillner studied mathematics at the University of Freiburg, Germany, and received her
diploma degree in 1999. Since 2000, she has been a member of the scientific staff of the
Department of Computer Science and Business Informatics at the University of Vienna.
Her research interests lie in the areas of semantic multimedia content modeling and e-commerce. She has participated in the EU project CULTOS (Cultural Units of Learning
Tools and Services).

Samar Zutshi received his master's degree in information technology from Monash
University, Melbourne, Australia, during which he did some work on agent communication. After a stint in the software industry, he is back at Monash doing what he enjoys:
research and teaching. He is working in the area of relevance feedback in multimedia
retrieval for his PhD.


Index

A
abstraction 188
acoustical music signal 117
ADSR 100
amplitude envelope 100
annodex 161
artificial life 355
attributes ranking method 107
audio
model 365
morph 376
production 368
B
beat space segmentation 122
blending
network 370
theory 372
C
censor 93
Cepstrum 102
class relative indexing 40
classical approach 290
clips 173

clustering 137
CMWeb 161
collection DS 186
combining multiple visual features
(CMVF) 3
composite image features 6
computational theory 374
constrained generating procedures
(CGP) 352
content-based
multimedia retrieval 289
music classification 99
music summarization 99
context information extraction 82
continuous media Web 160
D
data layer 340
DelaunayView 334
description scheme (DS) 183
digital cameras 33
digital item 164
distance-based access methods 6
domain 235
dynamic authoring 252
dynamic Bayesian network (DBN) 77


E
emergent semantics 351
EMMO 305
enhanced multimedia metaobjects 305, 311
Enterprise Java Beans (EJBs) 310
F
facial images 35
film semiotics 138
functional aspect 306
G
genotype 352
granularity 187
graphical user interfaces (GUI) 247
ground truth based method 107
H
human visual perception 10
human-computer partnership 223
hybrid dimension reducer 6
hybrid training algorithm 13
I
IETF 161
image
classification 35
databases 1
distortion 19
feature dimension reduction 4
feature vectors 2
similarity measurement 5
iMovie 226
indexing 1
instrument
detection 119
identification 119
integration layer 341
Internet 162
Internet engineering task force 161
interpretation models 150
J
just-in-time (JIT) 228
K
keyword integration approach 292
Kuleshov experiments 78
L
LayLab 336
logical media parts 315
low energy component 102
M
machine learning approach 106, 292
mathematical-conceptual approach 293
media aspect 306
media
blending 363
content 248
semantics 351
medical images 35
metric trees 2
MM4U 246
MPEG-21 163
MPEG-7 165, 182, 335
multimedia
authoring 223, 251
composition 262
content 183, 247
description scheme (MDS) 183
research 161
semantics 187, 288, 294, 333
software engineering 272
multimedia-extended entity relationship 137
multimodal operators 149
music
genre classification 114
representation 100
structure analysis 106
summarization 104
video 109
video alignment 111
video structure 109
video summarization 109
N
narrative theory 227
network training 22


neural network initialization 20


O
object recognition 30
ontology objects 315
P
pattern
classification 30
classifiers 30
discovery indexing (PDI) 36, 41
discovery scheme 41
matching approach 106
recognition 30
personalization engine 253
personalized multimedia content 246
phenotype 352
pitch content features 103
presentation
layer 345
mining 367
presentation-oriented modeling 306
probabilistic-statistical approach 291
Q
query by example (QBE) 38
R
relevance feedback 288
resource description framework (RDF)
335
rhetorical structure theory (RST) 227
rhythm extraction 122
rhythmic content features 103
S
search
engines 161
engine support 177
segment DS 186
self-organization 355
semantic
aspect 306
DS 186
gap 289

indexing 37
region detection 117
region indexing (SRI) 34
Web 306
semantics 30, 136, 333
Shockwave file format 249
Sightseeing4U 277
SMIL 162
song structure 121
spatial
access methods (SAMs) 2
relationship 140
relationship operators 148
spectral
centroid 101
contrast feature 103
spectrum
flux 101
rolloff 101
speech 118
Sports4U 278
synchronized multimedia integration language 162
system architecture 340
T
temporal
grouping 137
motion activity 80
ordering 88
relationship 140
relationship operators 148
URI 166
timbral textural features 100
time segments 173
time-to-collision (TTC) 80
U
usage information 186
user interaction 187
user interfaces 255
user response 295
user-centric modeling 298
V
video


content 135
data 77
data model 142
edit 381
editing 367
metamodel framework 136
object (VO) 142
retrieval systems 77
semantics 135
VIMET 135
visual keywords 34
W
World Wide Web 160
Z
zero crossing rates 102

