Over the last few years, the rapid diffusion of inexpensive tools (mostly embedded in smartphones,
tablets, and other common mobile devices) for image, video, and audio capturing, together
with improvements in storage technologies, has favored the creation of massive collections of
digital multimedia content. However, the difficulty of finding relevant items increases with the
growth of the amount of available data. To face this problem, two main approaches are often
employed, both involving the use of metadata: the first includes manually annotating
multimedia content by means of textual information (e.g., titles, descriptive keywords from a
limited vocabulary, and predetermined classification schemes), while the second amounts to
using automated feature extraction and object recognition to classify content.
The latter is naturally related to content-based multimedia retrieval (CBMR) systems and often
represents the only viable solution since text annotations are mostly nonexistent or
incomplete; moreover, CBMR has been shown to dramatically reduce the time and effort needed to obtain
multimedia information, whereas frequent annotation additions and updates to massive data
stores often require manually entering all the attributes that might be needed for the queries.
While early algorithms proposed in this field mainly focused on feature-based similarity search
(over images, video, and audio), the interest of researchers has since shifted toward bridging the
gap between low-level features and human interpretation (the semantic gap), rather than concentrating only on
the optimization of the underlying computation. In this direction, interesting
research topics include various kinds of interaction between humans and computers
(experiential and affective computing), the introduction of new features to improve the
detection/recognition process, the analysis of new media types (3D models, virtual reality, etc.),
and the identification of test sets for benchmarking. In this chapter, we describe the most
representative work in the field of content-based multimedia retrieval.
Browsing and Summarization

A wide variety of innovative ways of browsing and summarizing multimedia have been proposed.
Spierenburg and Huijsmans [1997] proposed a method for converting an image database into a
movie. The intuition was that one could cluster a sufficiently large image database so that
visually similar images would be in the same cluster. After the clustering process, one can order
the clusters by the inter-cluster similarity, arrange the images in sequential order and then
convert to a video. This allows a user to have a gestalt understanding of a large image database
in minutes.
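The ordering step described above can be sketched as follows. This is a minimal illustration, assuming each image has already been reduced to a feature vector and assigned to a cluster; the greedy nearest-centroid chaining and the names `centroid` and `order_clusters` are assumptions made for this example, not necessarily the procedure used by Spierenburg and Huijsmans.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def order_clusters(clusters):
    """Chain clusters greedily so that consecutive clusters have close centroids.

    `clusters` maps a cluster id to a list of (image_name, feature_vector) pairs.
    Returns the image names in the order they would appear in the movie.
    """
    centroids = {cid: centroid([f for _, f in items])
                 for cid, items in clusters.items()}
    remaining = sorted(centroids)      # deterministic start: smallest cluster id
    current = remaining.pop(0)
    sequence = [current]
    while remaining:
        # Visit next the unvisited cluster whose centroid is nearest to the current one.
        current = min(remaining,
                      key=lambda c: math.dist(centroids[current], centroids[c]))
        remaining.remove(current)
        sequence.append(current)
    return [name for cid in sequence for name, _ in clusters[cid]]

if __name__ == "__main__":
    demo = {
        0: [("a.png", [0.0, 0.0])],
        1: [("b.png", [10.0, 10.0])],
        2: [("c.png", [1.0, 1.0])],
    }
    print(order_clusters(demo))
```

Playing the returned sequence as video frames places visually similar clusters next to each other, which is the property the movie metaphor relies on.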
Sundaram, et al. [2002] took a similar approach toward summarizing video. They introduced
the idea of a video skim, which is a shortened video composed of informative scenes from the
original video.
The fundamental idea is for the user to be able to receive an abstract of the story but in video
format. Snoek, et al. [2005] propose several methods for summarizing video such as grouping
by categories and browsing by category and in time.
Chiu, et al. [2005] created a system for texturing a 3D city with relevant frames from video
shots. The user would then be able to fly through the 3D city and browse all of the videos in a
directory. The most important frames would be located on the roofs of the buildings in the city
so that a high-altitude fly-through would result in viewing a single frame per video.
Uchihashi, et al. [1999] suggested a method for converting a movie into a cartoon strip in the
Manga style from Japan. This means altering the size and position of the relevant keyframes
from the video based on their importance.
Tian, et al. [2002] took the concept of variable size and positions of images to the next level by
posing it as a general optimization problem: what is the optimal
arrangement of images on the screen so that the user can optimally browse an image database?
Liu, et al. [2004] address the problem of effective summarization of images from WWW image
search engines. They compare a ranked-list summarization method to an image clustering scheme
and report that users found the clustering scheme allowed them to explore the image results
more naturally and effectively.
Content-Based Image Retrieval

The main difference between content-based and text-based retrieval systems is that human
interaction is an indispensable part of the latter system. Humans tend to use high-level features
(concepts) to interpret images and measure their similarity.
In general, there is no direct link between the high-level concepts and the low-level features.
Though many complex algorithms have been designed to describe color, shape, and texture
features, these algorithms cannot adequately model image semantics and have many
limitations when dealing with broad-content image databases.

Eakins distinguishes three levels of queries in CBIR:

• Level 1: Retrieval by primitive features such as color, texture, shape, or the spatial
location of image elements
• Level 2: Retrieval of objects of a given type identified by derived features, with some
degree of logical inference
• Level 3: Retrieval by abstract attributes, involving a significant amount of high-level reasoning
about the purpose of the objects or scenes depicted
A CBIR system should provide full support in bridging the semantic gap between numerical
image features and the richness of human semantics in order to support query by high-level
concepts.
The state-of-the-art techniques for reducing the semantic gap fall mainly into five categories:

• Using object ontology to define high-level concepts
• Using machine learning tools to associate low-level features with query concepts
• Introducing relevance feedback (RF) into the retrieval loop for continuous learning of users'
intentions
• Generating semantic templates (ST) to support high-level image retrieval
• Making use of both the visual content of images and the related textual information
(e.g. tags, keywords, descriptions, etc.) from the Web

Retrieval at Level 3 is difficult and less common.

Possible Level 3 retrieval can be found in domain-specific areas such as art museums or
newspaper libraries. Current systems mostly perform retrieval at Level 2.
There are three fundamental components in these systems:
1. Low-level image feature extraction
2. Similarity measure
3. Semantic gap reduction
Low-level image feature extraction is the basis of CBIR systems. To perform CBIR, image
features can be either extracted from the entire image or from regions (region-based image
retrieval, RBIR).
To perform RBIR, the first step is to implement image segmentation. Then, low-level
features such as color, texture, shape, or spatial location can be extracted from the
segmented regions. Similarity between two images is defined based on region features.
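The first two components (feature extraction and similarity measure) can be illustrated with a short sketch. It assumes that segmentation has already produced each region as a list of (r, g, b) pixels, uses a quantized color histogram as the low-level feature, and compares histograms with histogram intersection. The function names and the best-match averaging rule for comparing two images are assumptions chosen for illustration; the text does not prescribe a particular feature or measure.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalised joint RGB histogram of (r, g, b) tuples with values in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
        count += 1
    if count == 0:
        raise ValueError("region contains no pixels")
    return [h / count for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; equals 1.0 for identical normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def region_based_similarity(regions_a, regions_b):
    """Average, over the regions of image A, of the best match among regions of B.

    Note that this illustrative rule is not symmetric in A and B.
    """
    feats_a = [color_histogram(r) for r in regions_a]
    feats_b = [color_histogram(r) for r in regions_b]
    best = [max(histogram_intersection(fa, fb) for fb in feats_b) for fa in feats_a]
    return sum(best) / len(best)

if __name__ == "__main__":
    red = [(255, 0, 0)] * 4
    blue = [(0, 0, 255)] * 4
    print(region_based_similarity([red, blue], [blue]))
```

Passing the whole image as a single region gives the global (non-region-based) variant of the same comparison.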

Content-Based Audio Retrieval

A major challenge with content-based retrieval of audio is that there are different types of
audio and that the strategies to index and query them differ substantially. This is mainly due to
the psychological aspects of audio perception (and thus perceived similarity), also known as
psychoacoustics.
One large area of research is audio retrieval of music, referred to as Content-Based Music
Retrieval (CBMR) or just Music Information Retrieval (MIR). It is a multidisciplinary field that
straddles different domains — ranging from computer science to psychology.
There is a large community surrounding CBMR, organised in the International Society for Music
Information Retrieval (ISMIR), which holds the annual MIREX evaluation campaign for MIR.
MIR tasks can be characterised by their specificity; that is, the desired degree of similarity
between the document and the query, and their granularity, i.e. whether comparison takes
place on a document level or on the level of fragments thereof.
This classification scheme is illustrated in the figure. Based on these two dimensions,
existing Query-by-Example techniques for music are classified into four larger categories: audio
identification (fingerprinting), audio matching, version identification, and category-based
retrieval. We will not go into detail about the last of these.

• Audio identification: This is a high-specificity, low-granularity task. Given a small audio
fragment as query, the task consists in identifying the particular recording it belongs to.
The notion of similarity is hence very close to identity.
• Audio matching: This is a mid-specificity, mid-granularity task. It consists in finding
variations of a given musical fragment as they occur in different performances or
arrangements of the same piece (e.g. a live performance). These variations may differ in
aspects like tempo or execution of note groups.
• Version identification: This is a low-specificity task. It consists in finding versions that
may differ considerably from the query fragment in terms of instrumentation, key,
tempo, and even melody. Changes like this usually occur in cover songs or remixes of a
musical piece.
High Performance Indexing

In the early multimedia database systems, the multimedia items such as images or video were
frequently simply files in a directory or entries in an SQL database table.
From a computational efficiency perspective, both options exhibited poor performance because
most filesystems use linear search within directories and most databases could only perform
efficient operations on fixed-size elements. Thus, as the size of the multimedia databases or
collections grew from hundreds to thousands to millions of variable-sized items, the computers
could not respond in an acceptable time period.
Even as the typical SQL database systems began to implement higher performance table
searches, the search keys had to be exact such as in text search. Audio, images, and video were
stored as blobs which could not be indexed effectively.
Therefore, researchers [Egas, et al. 1999; Lew 2000] turned to similarity-based databases which
used tree-based indexes to achieve logarithmic performance. Even in the case of multimedia-oriented
databases such as the Informix database, it was still necessary to create custom
datablades to handle efficient similarity searching such as k-d trees [Egas, et al. 1999]. In
general, the k-d tree methods had linear worst-case performance and logarithmic average-case
performance in the context of feature-based similarity searches.
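To make the k-d tree idea concrete, here is a minimal sketch of building a tree over feature vectors and running a nearest-neighbor query with branch pruning. The names `Node`, `build_kdtree`, and `nearest` are hypothetical, and the code illustrates the general data structure rather than the datablade implementation of Egas, et al.

```python
import math

class Node:
    """One k-d tree node: a point, the axis it splits on, and two subtrees."""
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (distance, point) of the stored point closest to `query`."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Descend into the far side only if the splitting plane is closer than
    # the best match so far; otherwise that subtree cannot contain a better point.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

if __name__ == "__main__":
    features = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
    tree = build_kdtree(features)
    print(nearest(tree, (9, 2)))
```

The pruning test is what yields logarithmic behavior on average; when few branches can be pruned (for example in high-dimensional feature spaces), the search degrades toward visiting every node, which matches the linear worst case mentioned above.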
A recent improvement to the k-d tree method is to integrate entropy-based balancing [Scott
and Shyu 2003]. Other data representations have also been suggested besides k-d trees. Ye and
Xu [2003] show that vector quantization can be used effectively for searching large databases.
El-Kwae and Kabuka [2000] propose a two-tier signature-based method for indexing large image
databases. Type 1 signatures represent the properties of the objects found in the images. Type
2 signatures capture the inter-object spatial positioning. Together these signatures allow them
to achieve a 98% performance improvement.
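As an illustration of how vector quantization can narrow a search, the sketch below assigns every feature vector to its nearest codeword and answers a query by scanning only the bucket of the query's nearest codeword. It assumes the codebook is already given (for example from a clustering step) and is a generic approximate scheme, not the specific method of Ye and Xu; all function names are hypothetical.

```python
import math
from collections import defaultdict

def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def build_vq_index(items, codebook):
    """Group items into buckets keyed by their nearest codeword.

    `items` maps an item id to its feature vector.
    """
    index = defaultdict(list)
    for item_id, vector in items.items():
        index[nearest_codeword(vector, codebook)].append((item_id, vector))
    return index

def vq_search(query, index, codebook, k=1):
    """Approximate k-nearest-neighbor search restricted to one bucket."""
    bucket = index.get(nearest_codeword(query, codebook), [])
    ranked = sorted(bucket, key=lambda entry: math.dist(query, entry[1]))
    return [item_id for item_id, _ in ranked[:k]]

if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    items = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [11.0, 10.0]}
    index = build_vq_index(items, codebook)
    print(vq_search([0.5, 0.0], index, codebook, k=2))
```

Because only one bucket is scanned, a true neighbor lying just across a bucket boundary can be missed; probing several of the nearest codewords is a common way to trade speed for recall.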
Shao, et al. [2003] use invariant features together with efficient indexing to achieve near
real-time performance in the context of k-nearest-neighbor searching. Other kinds of high-performance
indexing problems appear when searching peer-to-peer (P2P) networks because of the
curse of dimensionality, the high communication overhead, and the fact that all searches within the
network are based on nearest-neighbor methods.
Muller and Henrich [2003] suggest an effective P2P search algorithm based on compact peer
data summaries. They show that their model allows peers to communicate with only a small
sample of other peers while still retaining high-quality results.