50 Years of object recognition: Directions forward

Alexander Andreopoulos, John K. Tsotsos
IBM Research – Almaden, 650 Harry Road, San Jose, CA 95120-6099, United States
Department of Computer Science and Engineering, Centre for Vision Research, York University, Toronto, ON, Canada M3J 1P3
Article info
Article history:
Received 2 January 2012
Accepted 26 April 2013
Available online 3 May 2013
Keywords:
Active vision
Object recognition
Object representations
Object learning
Dynamic vision
Cognitive vision systems
Abstract
Object recognition systems constitute a deeply entrenched and omnipresent component of modern intel-
ligent systems. Research on object recognition algorithms has led to advances in factory and office auto-
mation through the creation of optical character recognition systems, assembly-line industrial inspection
systems, as well as chip defect identification systems. It has also led to significant advances in medical
imaging, defence and biometrics. In this paper we discuss the evolution of computer-based object recog-
nition systems over the last fifty years, and overview the successes and failures of proposed solutions to
the problem. We survey the breadth of approaches adopted over the years in attempting to solve the
problem, and highlight the important role that active and attentive approaches must play in any solution
that bridges the semantic gap in the proposed object representations, while simultaneously leading to
efficient learning and inference algorithms. From the earliest systems which dealt with the character rec-
ognition problem, to modern visually-guided agents that can purposively search entire rooms for objects,
we argue that a common thread of all such systems is their fragility and their inability to generalize as
well as the human visual system can. At the same time, however, we demonstrate that the performance
of such systems in strictly controlled environments often vastly outperforms the capabilities of the
human visual system. We conclude our survey by arguing that the next step in the evolution of object
recognition algorithms will require radical and bold steps forward in terms of the object representations,
as well as the learning and inference algorithms used.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
Artificial vision systems have fascinated humans since pre-his-
toric times. The earliest mention of an artificial visually-guided
agent appears in classical mythology, where a bronze giant named
Talos was created by the ancient god Hephaestus and was given as
a gift to King Minos of the Mediterranean island of Crete [1].
According to legend the robot served as a defender of the island
from invaders by circling the island three times a day, while also
making sure that the laws of the land were upheld by the island’s
inhabitants.
The fascination with, and interest in, vision systems continue today
unabated, not only due to purely intellectual reasons related to ba-
sic research, but also due to the potential of such automated vision
systems to drastically increase the productive capacity of organiza-
tions. Typically, the most essential component of a practical visu-
ally-guided agent is its object recognition module.
Modern computer vision research has its origins in the early
1960s. The earliest applications were pattern recognition systems
for character recognition in office automation related tasks [2,3].
Early work by Roberts in the 1960s [4] first identified the need to
match two-dimensional features extracted from images with the
three-dimensional representations of objects. Subsequent research
established the practical difficulties in reliably and consistently
accomplishing such a task, especially as the scene complexity in-
creased, as the illumination variability increased, and as time, cost,
and sensor noise constraints became more prevalent.
Early systematic work on vision systems is also traced to the
Hitachi labs in Japan where the term machine vision originated,
to distinguish its more pragmatic goal of constructing practical
applications [5], as compared to the more general term computer
vision, popularly used to also include less pragmatic goals. An early
research thrust in 1964 involved the automation of the wire-bond-
ing process of transistors, with the ultimate goal of replacing hu-
man workers. Even though the automated system achieved 95%
accuracy in lab tests, this was deemed too low to replace human
workers. By 1973 however, fully automated assembly machines
had been constructed [6], resulting in the world’s first image-based
machine for the automatic assembly of semiconductor devices.
This paper has been recommended for acceptance by Sven Dickinson.
Corresponding author.
E-mail addresses: aandreo@us.ibm.com (A. Andreopoulos), tsotsos@cse.yorku.ca (J.K. Tsotsos).
Fax: +1 416 736 5872.
1077-3142/$ - see front matter © 2013 Elsevier Inc. All rights reserved.
Computer Vision and Image Understanding 117 (2013) 827–891
Journal homepage: www.elsevier.com/locate/cviu
Arguably, the most successful application of machine vision
technologies is in the assembly and verification processes of the
semiconductor industry, enabling the mass production and inspec-
tion of complex semiconductors such as wafers [5]. Due to the
sheer complexity of this task, human workers could not have pos-
sibly solved such problems reliably and efficiently, thus demon-
strating how vision technologies have directly contributed to
many countries’ economic development by enabling the semicon-
ductor revolution experienced over the last couple of decades.
Early recognition systems also appeared in biomedical research
for the chromosome recognition task [7,8]. Even though this work
initially had limited impact, its importance became clearer later.
Recognition technologies are also successfully used in the food
industry (e.g., for the automated classification of agricultural prod-
ucts [9]), the electronics and machinery industry (for automated
assembly and industrial inspection purposes [10]), and the phar-
maceutical industry (for the classification of tablets and capsules)
[5]. Many of the models used for representing objects are also
effectively employed by the medical imaging community for the
robust segmentation of anatomical structures such as the brain
and the heart ventricles [11,12]. Handwritten character recogni-
tion systems are also employed in mail sorting machines as well
as for the digitization and automated indexing of documents
[13,14]. Furthermore, traffic monitoring and license plate recogni-
tion systems are also successfully used [15,16], as are monetary bill
recognition systems for use with ATMs [5]. Biometric vision-based
systems for fingerprint recognition [17], iris pattern recognition
[18], as well as finger-vein and palm-vein patterns [19,20] have
also gained acceptance by the law enforcement community and
are widely used.
Despite the evident success of recognition systems that are
tailored for specific tasks, robust solutions to the more general
problem of recognizing complex object classes sensed in poorly
controlled environments remain elusive. Furthermore, it is evident
from the literature on object recognition algorithms that there is
no universal agreement on the definitions of the various vision
subtasks. Commonly encountered terms such as detection, localization,
recognition, understanding, classification, categorization,
verification and identification are often ill defined, leading to
confusion and ambiguity. Vision is popularly defined as the process
of discovering from images what is present in the world and where it
is [23]. Within the context of this paper,
we discern four levels of tasks in the vision problem [24]:
• Detection: is a particular item present in the stimulus?
• Localization: detection plus accurate location of the item.
• Recognition: localization of all the items present in the stimulus.
• Understanding: recognition plus the role of the stimulus in the context
of the scene.
The localization problem subsumes the detection problem by
providing accurate location information of the a priori known item
that is being queried for in the stimulus. The recognition problem
denotes the more general problem of identifying all the objects
present in the image and providing accurate location information
of the respective objects. The understanding problem subsumes
the recognition problem by adding the ability to decide the role
of the stimulus within the context of the observed scene.
There also exist alternative approaches for classifying the various
levels of the recognition problem. For example, [25] discerns
five levels of tasks of increasing difficulty in the recognition
problem:
• Verification: Is a particular item present in an image patch?
• Detection and localization: Given a complex image, decide if a
particular exemplar object is located somewhere in this image,
and provide accurate location information on this object.
• Classification: Given an image patch, decide which of the multiple
possible categories are present in that patch.
• Naming: Given a large complex image (instead of an image
patch as in the classification problem) determine the location
and labels of the objects present in that image.
• Description: Given a complex image, name all the objects present
in the image, and describe the actions and relationships of
the various objects within the context of this image. As the
author indicates, this is also sometimes referred to as scene
understanding.
Within the context of this paper we will discern the detection,
localization, recognition and understanding problems, as previ-
ously defined.
For relatively small object database sizes with small inter-ob-
ject similarity, the problem of exemplar based object detection in
unoccluded scenes, and under controlled illumination and sensing
conditions, is considered solved by the majority of the computer
vision community. Great strides have also been made towards
solving the localization problem. Problems such as occlusion and
variable lighting conditions still make the detection, localization
and recognition problems a challenge. Tsotsos [21] and Dickinson
[22] present the components used in a typical object recognition
system: that is, feature extraction, followed by feature grouping,
followed by object hypothesis generation, followed by an object
verification stage (see Fig. 1). The advent and popularity of machine
learning and bag-of-features approaches have somewhat blurred the
distinction between the above-mentioned components. It is not
uncommon today to come across popular
recognition approaches which consist of a single feature
extraction phase, followed by the application of cascades of one
or more powerful classifiers.
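The four-stage pipeline described above can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration and is not taken from [21,22]: the function names, the use of bright pixels as "features", the distance-based grouping rule, and the "database" that describes each object only by its expected number of features are all hypothetical stand-ins for the far richer components of a real system.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Hypothesis:
    label: str    # name of the candidate object model
    score: float  # toy match score in (0, 1]


def extract_features(image: Sequence[Sequence[int]], threshold: int = 1) -> List[Point]:
    """Stage 1 (toy): every pixel at or above `threshold` is a feature."""
    return [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value >= threshold]


def group_features(features: List[Point], radius: float = 1.5) -> List[List[Point]]:
    """Stage 2 (toy): merge features that lie within `radius` of each other."""
    groups: List[List[Point]] = []
    for p in features:
        merged = [p]
        kept = []
        for g in groups:
            if any(math.dist(p, q) <= radius for q in g):
                merged.extend(g)  # p connects to this group, so absorb it
            else:
                kept.append(g)
        groups = kept + [merged]
    return groups


def generate_hypotheses(group: List[Point], database: Dict[str, int]) -> List[Hypothesis]:
    """Stage 3 (toy): score each model by how closely its feature count matches."""
    return [Hypothesis(label, 1.0 / (1.0 + abs(len(group) - size)))
            for label, size in database.items()]


def verify(hypotheses: List[Hypothesis], min_score: float = 0.9) -> List[Hypothesis]:
    """Stage 4 (toy): keep only hypotheses whose score reaches `min_score`."""
    return [h for h in hypotheses if h.score >= min_score]


def recognize(image: Sequence[Sequence[int]], database: Dict[str, int]) -> List[Hypothesis]:
    """Chain the four stages; return the best verified hypothesis per group."""
    results = []
    for group in group_features(extract_features(image)):
        accepted = verify(generate_hypotheses(group, database))
        if accepted:
            results.append(max(accepted, key=lambda h: h.score))
    return results
```

With a 4x4 binary image containing a 2x2 blob and one isolated pixel, and a hypothetical database `{"square": 4, "dot": 1}`, `recognize` reports a "square" followed by a "dot". Approaches that consist of a single feature extraction phase followed by classifiers would, in this picture, replace stages 2 to 4 with one learned function.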
The definition of an object is somewhat ambiguous and task
dependent, since it can change depending on whether we are deal-
ing with the detection, localization, recognition or understanding
problem. According to one definition [26], the simpler the problem
is (i.e., the further away we are from the image understanding
problem as defined above), the closer the definition of an object is to that
of a set of templates defining the features that the object must
possess under all viewpoints and conditions under which it can be
sensed. As we begin dealing with more abstract problems (such as
the object understanding problem) the definition of an object
becomes more nebulous and dependent on contextual knowledge,
since it depends less on the existence of a finite set of feature
templates. For example, the object class of toys is significantly abstract
and depends on the context. See the work of Edelman [27] for a
characterization of what might constitute a proper definition of an
object. It is important to emphasize that there were multiple starting
points that one can identify for early definitions of what constitutes
an object, since this is highly dependent on the recognition system
used. As previously discussed, one early starting point was work
on the block-world system which led to definitions and
generalizations involving 3D objects. However, there were also other
significantly different early definitions, which emerged from early
applications on character and chromosome recognition and the
analysis of aerial images. These latter applications led to progress
in pattern recognition, feature detection and segmentation but dealt
with objects of a different type. These latter approaches are closely
related to modern 2D appearance based object recognition research.
Arguably, one of the ultimate goals of recognition research is to
identify a common inference, learning and representational framework
for objects, that is not application domain specific. In Section 4.3
we discuss how insights from neuroscience might influence the
community in the search for such a framework.
Early research in computer vision was tightly coupled to the
general AI problem, as is evidenced by the large overlap in the
publication outlets used by the two communities up until the late
1980s. Subsequent approaches to vision research shifted to more
mathematically oriented approaches that were significantly differ-
ent from the classical AI techniques of the time. This spurred a
greater differentiation between the two communities. This
differentiation is somewhat unfortunate, especially for dealing with
the understanding problem defined above, because high level reasoning
capabilities are evidently needed to deal with certain vision problem
instances that become intractable when one attempts to solve them by
extracting a classical object representation from a scene. The need
for reasoning capabilities in artificial vision systems is further sup-
ported by experiments demonstrating that the human visual sys-
tem is highly biased by top-down contextual knowledge (an
executive controller), which can have a drastic effect on how our
visual system perceives the world [29]. More recently a new re-
search thrust is evident, in particular on the part of EU, Japanese,
and South Korean funding agencies, towards supporting the crea-
tion of more pragmatic object recognition systems that are tightly
coupled with cognitive robotics systems. This is evidenced by the
fact that during the last decade, around one billion euros have been
invested by EU-related funding agencies alone, towards supporting
research in cognitive robotics. This last point is further evidenced
by the more recent announcement in the US of the National Robot-
ics Initiative for the development of robots that work alongside hu-
mans in support of individuals or groups. Assuming that current
trends continue, it is fair to predict that research-wise, the future
for vision-based robotics systems looks bright.
The passive approach to vision refers to system architectures
which exhibit virtually no control over the data acquisition pro-
cess, and thus play a minor role in improving the vision system’s
performance. The passive approach has dominated the computer
vision literature, partly due to the influential bottom-up approach
to vision advocated by Marr [23], but also partly due to a number
of difficulties with implementing non-passive approaches to vi-
sion, which are elaborated upon later in this paper. Support for a
passive approach to the vision problem is evident even in one of
the earliest known treatises on vision [30], where vision is de-
scribed as a passive process that is mediated by what is referred
to as the ‘‘transparent’’ (διαφανές), an invisible property of nature
that allows the sense organ to take the form of the visible object.
In contrast, approaches which exhibit a non-trivial degree of
intelligent decision making in the image and data acquisition pro-
cess, are referred to as ‘‘active’’ approaches [31,32]. Active ap-
proaches offer a set of different techniques for solving similar
sets of vision problems. Active approaches are motivated by the
fact that the human visual system has two main characteristics:
the eyes can move and visual sensitivity is highly heterogeneous
across visual space [33]. As we will discuss later in this manuscript,
active approaches are most useful in vision applications where the
issue of mobility and power efficiency becomes a significant factor
for determining the viability of the constructed vision system. We
can classify these approaches as ‘‘limited active’’ approaches which
control a single parameter (such as the focus), and ‘‘fully active’’
approaches which control more than one parameter within its full
range of possibilities.
From the late 1990s and for the next decade, interest in active
vision research by the computer vision community underwent
somewhat of a hiatus. However, the recent funding surge in cogni-
tive vision systems and vision based robotics research has reinvig-
orated research on active approaches to the recognition problem.
Historically, early related work is traced to Brentano [34], who
introduced a theory that became known as act psychology. This rep-
resents the earliest known discussion on the possibility that a sub-
ject’s actions might play an important role in perception. Barrow
and Popplestone [35] presented what is widely considered the first
(albeit limited) discussion on the relevance of object representa-
tions and active perception in computer vision. Garvey [36] also
presented an early discussion on the benefits of purposive ap-
proaches in vision. Gibson [37] critiques the passive approach to
vision, and argues that a visual system should also serve as a medi-
ator in order to direct action and determine when to move an eye
in one direction instead of another direction. Such early research,
followed by a number of influential papers on object representa-
tions [38–41], sparked the titillating and still relatively unexplored
question of how task-directed actions can affect the construction of
optimal (in terms of their encoding length) and robust object rep-
resentations. The concept of active perception was popularized by
Bajcsy [42,31], as ‘‘a problem of intelligent control strategies ap-
plied to the data acquisition process’’. The use of the term active vi-
sion was also popularized by Aloimonos et al. [32] where it was
shown that a number of problems that are ill-posed for a passive
observer, are simplified when addressed by an active observer. Bal-
lard [43] further popularized the idea that a serial component is
necessary in a vision system. Tsotsos [28] proposed that the active
vision problem is a special case of the attention problem, which is
generally acknowledged to play a fundamental role in the human
visual system (see Fig. 2). Tsotsos [29] also presented a relevant lit-
erature survey on the role of attention in human vision.
More recent efforts at formalizing the problem and motivating
the need for active perception, are discussed in [44,45,26]. In
[26] for example, the recognition problem is cast within the frame-
work of Probably-Approximately-Correct learning (PAC learning
[46–48]). This formalization enables the authors to prove the exis-
tence of approximate solutions to the recognition problem, under
various models of uncertainty. In other words, given a ‘‘good’’
object detector the authors provide a set of sufficient conditions such
that for all $0 < \epsilon, \delta < \frac{1}{2}$, with confidence at
least $1 - \delta$ we can efficiently localize the positions of all
the objects we are searching for with an error of at most $\epsilon$
(see Fig. 3). Another important problem
addressed is that of determining a set of constraints on the learned
object’s fidelity, which guarantee that if we fail to learn a
representation for the target object ‘‘quickly enough’’ it was not due to
system noise, due to an insufficient number of training examples, or
due to the use of an over-expressive or over-constrained set of
object representations $\mathcal{H}$ during the object learning/training phase.
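As a rough illustration of the shape of such a guarantee, one can write it as a probabilistic bound. The notation below is our own assumption and not necessarily that of [26]: $x_i$ denotes the true position of the $i$-th target object, $\hat{x}_i$ the position returned by the localization algorithm, $\epsilon$ the tolerated localization error and $\delta$ the tolerated probability of failure. The choice of norm and the use of a maximum over objects are illustrative only.

```latex
\[
  \Pr\Big[\, \max_{i} \,\lVert \hat{x}_i - x_i \rVert \le \epsilon \,\Big]
  \;\ge\; 1 - \delta
  \qquad \text{for every admissible pair } (\epsilon, \delta).
\]
```

In the PAC setting, ‘‘efficiently’’ is usually taken to mean that the resources needed to achieve this bound grow at most polynomially in $1/\epsilon$ and $1/\delta$; whether [26] uses exactly this convention should be checked against that paper.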
Fig. 1. The components used in a typical object recognition system [21,22].
From a less formal perspective, active control of a vision sensor
offers a number of benefits [28,24]. It enables us to: (i) Bring into
the sensor’s field of view regions that are hidden due to occlusion
and self-occlusion. (ii) Foveate and compensate for spatial non-
uniformity of the sensor. (iii) Increase spatial resolution through
sensor zoom and observer motion that brings the region of interest
in the depth of field of the camera. (iv) Disambiguate degenerate
views due to finite camera resolution, lighting changes and in-
duced motion [49]. (v) Compensate for incomplete information
and complete a task.
An active vision system’s benefits must outweigh the associated
execution costs [28,33,24]. The associated costs of an active vision
system include: (i) Deciding which actions to perform and their
execution order. (ii) Executing the commands and bringing the
actuators to their desired state. (iii) Adapting the system to the new
viewpoint, finding the correspondences between the old and new
viewpoints, and compensating for the inevitable ambiguities due to
sensor noise. Modeling these costs explicitly, and accounting for them
when planning a task solution, can yield a significant gain in efficiency.
For example, the cost could include the distance associated with
various paths that a sensor could follow in moving from point A
to point B, and the task could involve searching for an object in a
certain search region. In such a case, the cost can help us locate
the item of interest quickly, by minimizing the distance covered
while searching for the object.
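The distance-based cost in this example can be made concrete with a small sketch. It is not a method from the surveyed literature: the greedy rule (visit next the candidate location with the highest prior probability per unit of travel distance), the function names, and the assumptions that the object sits at exactly one candidate location and is always detected on arrival are all hypothetical simplifications. The greedy rule is a heuristic and is not guaranteed to be optimal.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def plan_search(start: Point, prior: Dict[Point, float]) -> List[Point]:
    """Greedy visiting order: repeatedly move to the candidate location with
    the highest prior probability per unit of travel distance."""
    position = start
    remaining = dict(prior)
    order: List[Point] = []
    while remaining:
        best = max(
            remaining,
            key=lambda v: remaining[v] / max(math.dist(position, v), 1e-9),
        )
        order.append(best)
        position = best
        del remaining[best]
    return order


def expected_travel(start: Point, order: List[Point], prior: Dict[Point, float]) -> float:
    """Expected distance covered before the object is found, assuming it is
    at exactly one candidate location and is detected on arrival."""
    total = sum(prior[v] for v in order)  # normalise the prior
    travelled = 0.0
    expected = 0.0
    position = start
    for v in order:
        travelled += math.dist(position, v)
        expected += (prior[v] / total) * travelled
        position = v
    return expected
```

Comparing `expected_travel` for the order returned by `plan_search` against an arbitrary visiting order shows how an explicit cost model can shorten the search, at the price of the planning computation itself, which is cost (i) in the list above.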
We should point out that a significant portion of the active vi-
sion research has been applied on systems where the vision algo-
rithms are not applied concurrently to the execution of the
actions. Dickmanns introduced a slightly different paradigm to vi-
sion, where machine vision is typically applied on dynamic scenes
viewed from a moving platform, or in other words, where vision
algorithms are executed concurrently to the actions performed
[50]. He referred to this framework as dynamic vision. Even though
early work on active vision [32] was based on the argument that
the observer is in motion, in practice, most active object recogni-
tion systems assume that the sensor is stationary when the images
are acquired. We will discuss this framework in more detail in
Section 3.
In recent work the role of learning algorithms has become much
more important to the object recognition problem. This has re-
sulted in a blurring of the distinction that emerged during the
1980s, between computer vision research and classical AI. This
has also resulted in an emerging debate in the community, as to
the intended audience of many computer vision journals. Often
the main innovation presented in such papers is more closely re-
lated to machine learning, while vision is only treated as a small
after-effect/application of the presented learning-based algorithm.
Another after-effect of this pattern is that recent work has drifted
away from Marr’s early paradigm for vision. Nevertheless, and as
we will see in Section 4, the fact remains that some of the most
successful recognition algorithms rely currently on advanced
learning techniques, thus significantly differentiating them from
early recognition research. In Section 4.3 we discuss how certain
emerging constraints in computing technology might affect the
evolution of learning algorithms over the next few years.
This introductory discussion has outlined the breadth and scope
of the approaches adopted by the vision community over the last
50 years, in attempting to solve the recognition problem. The rest
of the paper presents a more detailed overview of the literature
on object detection, localization and recognition, with a lesser fo-
cus on the efforts made to address the significantly more general
and challenging problem of object understanding. The relevant lit-
erature is broadly categorized into a number of relevant subtopics
that will help the reader gain an appreciation of the diverse
approaches taken by the community in attempting to solve the
problem [51,52]. This survey illustrates the extent to which previous
research has addressed the often overlooked complexity-related
challenges which, in our opinion, have inhibited the creation of
robust generic recognition systems. This work also fills a void, by
presenting a critical and systematic overview of the literature lying at
the intersection of active vision and object recognition. Our work
also supports the position that active and attentive approaches
[31,29] to the object recognition problem constitute the next
natural evolutionary step in object recognition research.
Fig. 2. The spectrum of attentional mechanisms, as proposed by Tsotsos [28].
Notice that within this framework, the active vision problem is subsumed by
the attention problem. An active vision system is typically characterized by
purposive eye, head or body movements that result in the selection of the visual
field, while an attention system is characterized by the full set of mechanisms and
behaviors identified above.
Fig. 3. A non-rigorous overview of the assumptions under which the object localization and recognition problems (as formalized in [26]) are well-behaved and efficiently
solvable/learnable problems. Notice that within this framework the recognition problem (bottom right box) subsumes the localization problem (top right box) in that any
‘‘good’’ solution to the localization problem, could also be used to solve the recognition problem (although not necessarily optimally), and vice versa.
In Chart 1 we project the algorithms surveyed in this paper
along a number of dimensions, and highlight the evolution of the
dimensions’ relative importance over the years. A number of pat-
terns become evident upon inspecting Chart 1. For example there
is a clear increase in focus over the years with respect to the scala-
bility of inference, search efficiency, and training efficiency. At the
same time, in early work there was a significantly greater focus on
the use of 3D in recognition systems. Similarly we see that the
search for powerful indexing primitives and compact object repre-
sentations was always recognized as an important topic in the lit-
erature, while there is less consistency in the use of function,
context and texture features. These points are elaborated later in
this survey.
The remainder of the paper is organized as follows. In Section 2
we survey classical approaches to the object recognition and
understanding problems, where the data acquisition processes
demonstrate limited intelligence. Section 3 further motivates the
active and dynamic approaches to vision. In Section 4 we discuss
some of the most characteristic approaches adopted over the years
by algorithms that have won various vision challenges. The section
ends with a brief discussion as to where the field appears to be
headed in the near future. Section 5 summarizes the paper.
2. Classical approaches
We present a critical overview of classical approaches to the ob-
ject recognition problem. Most of the methods described exhibit
limited purposive control over the data acquisition process. There-
fore, the word ‘‘passive’’ can also be used to differentiate the ap-
proaches described from ‘‘active’’ approaches to the recognition
problem. In subsequent sections we will build upon work in this
section, in order to overview the less developed field of active ap-
proaches to the recognition problem. Though the earliest work ap-
peared in the late nineteen-eighties, the field still remains in its
infancy, with a plethora of open research problems that need fur-
ther investigation. It will become evident, as we review the rele-
vant literature, that a solution to the recognition problem will
require answers to a number of important questions that were
raised in [26]. That is, questions on the effects that finite computa-
tional resources and a finite learning time have in terms of solving
the problem. The problem of constructing optimal object represen-
tations in particular emerges as an important topic in the literature
on passive and active object recognition. Algorithms whose con-
struction is driven by solutions that provide satisfactory answers
to such questions, must form a necessary component of any reli-
able passive or active object recognition system. The categorization
of the relevant literature on classical approaches to the recognition
problem follows the one proposed by Dickinson,
with modifica-
tions in order to include in the survey some more recent neuromor-
phic approaches that have gained in popularity. This section’s
presentation on passive approaches to the recognition problem is
also used to contextualize the discussion in Section 3, on active ap-
proaches to the problem.
As we survey the literature in the field, we use the standard rec-
ognition pipeline described in Fig. 1 as a common framework for
contextualizing the discussion. We will support the hypothesis
that most research on object recognition is reducible to the prob-
lem of attempting to optimize either one of the modules in the
pipeline (feature-extraction → feature-grouping → object-hypothesis →
object-verification) or is reducible to the problem of
attempting to improve the operations applied to the object data-
base in Fig. 1, by proposing more efficient querying algorithms,
or more efficient object representations (which in turn support
better inference and learning algorithms and reduce the related
storage requirements). Sporadically throughout the text, we will
recap our discussion, by comparing some of the most influential
papers discussed so far. We do this by comparing these papers
along various dimensions such as their complexity, the indexing
strength, their scalability, the feasibility of achieving general tasks,
their use of function and context, the level of prior knowledge and
the extent to which they make use of 3D information. We provide
discussions on the assumptions and applicability of the most
interesting papers and discuss the extent to which these methods
address the aspects of detection, localization and recognition/understanding.
Chart 1. A historical perspective (spanning 1971–2012) on the papers that will be discussed with the most detail during the rest of this survey, comparing them along a
number of dimensions. The horizontal axis denotes the mean score of the respective papers from Tables 1–7. Inference Scalability: The focus of the paper on improving the
robustness of the algorithm as the scene complexity or the object class complexity increases. Search Efficiency: The use of intelligent strategies to decrease the time spent
localizing an object when the corresponding algorithm is used for localization. If it is a detection algorithm, this refers to its localization efficiency within the context of a
sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection efficiency). Training Efficiency: The level of automation in the training
process, and the speed with which the training is done. Encoding Scalability: The encoding length of the object representations as the number of objects increases or as the
object representational fidelity increases. Diversity of Indexing Primitives: The distinctiveness and number of indexing primitives used. Uses Function or Context: The degree to
which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is used by the algorithm for inference or model
representations. Uses Texture: The degree to which texture discriminating features are used by the algorithm.
Footnote to the categorization proposed by Dickinson: Personal communication, CSC 2523: Object Modeling and Recognition, University
of Toronto. Also see [51].
2.1. Recognition using volumetric parts
Recognition using volumetric parts such as generalized cylinders
constitutes one of the early attempts at solving the recognition
problem [21,53]. The approach was popularized by a number of
people in the field such as Nevatia and Binford [54,38], Marr
[23,55] and Brooks [56] amongst others. This section briefly over-
views some of the most popular related approaches. It is interest-
ing to notice that the earliest attempts at solving the object
recognition problem used high level 3D parts based objects, such
as generalized cylinders and other deformable objects, such as
geons and superquadrics. However, in practice, it was too difficult
to extract such parts from images. A number of important points
must be made with regard to parts based recognition. High level
primitives such as generalized cylinders, geons and superquad-
rics—which we describe in more detail later in this paper—provide
high level indexing primitives. View-based/appearance-based ap-
proaches on the other hand provide less complex indexing primi-
tives (edges, lines, corners, etc.) which result in an increase in
the number of extracted features. This makes such low level fea-
tures less reliable as indexing primitives when using object dat-
abases numbering thousands of objects. In other words, the
search complexity for matching image features to object database
features increases as the extracted object feature complexity de-
creases. This explains why most of the work that uses such low le-
vel primitives is only applied to object databases with a small
number of objects. The above described problem is often referred
to in the literature as the semantic gap problem. As argued in
[22], a verification/disambiguation/attention-like mechanism is
needed to disambiguate and recognize the objects, because such
primitives of low complexity are often more frequent and ambigu-
ous. In other words, with simple indexing features, the burden of
recognition is no longer determined by the task of deciding which
complex high-level features are in the image, a difficult task by
itself, but instead, it is shifted to the verification stage and discrim-
ination of simple indexing primitives. The parts-based vs. view-
based approach to recognition has generated some controversy
in the field. This controversy is exemplified by a sequence of papers
on what is euphemistically called the Tarr-Biederman debate
[57–60]. The topic of hierarchies of parts-based and view-based
representations has assumed a central role in the literature for
efficiently representing objects and bridging the above described
semantic gap problems. This is exemplified by a number of papers
that we survey.
In Chart 2 and Table 1 we present a comparison, along certain
dimensions, for a number of the papers surveyed in Sections 2.1,
2.2, 2.3, 2.4, which includes a number of recognition models that
use volumetric parts-based representations. For example, since
much of the early research on volumetric parts used manually
trained models, a single star is used in the corresponding training
efficiency columns. It will become evident that most progress in
the field lies in the image classification problem (as opposed to
the more demanding object localization and segmentation prob-
lems), which is aligned with the current needs of industry for im-
age and video search solutions.
Binford [54] and Nevatia and Binford [38] introduced and pop-
ularized the idea of generalized-cylinder-based recognition. A gen-
eralized cylinder is a parametric representation of 3D cylinder-like
objects, defined by a 3D curve called an axis, and planar cross-sec-
tions normal to the axis. These planar cross-sections are defined by
a cross-section function which in turn depends on the 3D curve’s
parameterization. According to Binford’s definition, the cross-sec-
tions’ center of gravity must intersect the 3D curve. Nevatia and
Binford use a subset of generalized cylinders—namely generalized
cones—to recognize the objects present in a particular scene. They
use the range data of a particular scene to segment the scene into
its constituent parts—by clustering together regions with non-rap-
idly changing depth. Each such segmented cluster is, then, further
segmented into parts that can be most easily described by general-
ized cones. They accomplish this by extracting the medial axis of
each segmented cluster (similar to Blum’s medial axis transformation [70]),
and splitting the cluster into a new subcluster whenever
the medial axis changes rapidly. This results in a number of
segments with smoothly changing medial axes which can be de-
scribed by the 2D projections of 3D generalized cones. The authors
do not advocate using any common optimization technique to
determine the rotation and scale of the generalized cone that
should be used. Instead, they advocate using a rather brute force
approach—i.e., project each 3D generalized cone to a 2D image
plane, rotate it a number of times and determine which rotation re-
sults in the best fit. The extracted cylinders are then used to build a
graph-like representation of the detected object. For example, if a
detected cylinder’s cone angle exceeds a particular angle threshold,
that particular part is labeled as conical as opposed to cylindrical.
Elongated parts, with a length to width ratio greater than 3.0, are
similarly labeled as well-defined. These qualitative descriptors are
used as object features/indices which are in turn used to recognize
the object from a database of object descriptors.
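As a minimal sketch of how such qualitative descriptors could act as indices, the following Python fragment labels each fitted part and looks up an object by the collection of its part labels. Only the 3.0 elongation ratio comes from the description above. The cone-angle threshold value, the ConePart fields and the toy database are hypothetical and serve purely as an illustration, and the flat signature is a stand-in for the graph-like representation used by the authors.

```python
from dataclasses import dataclass

ELONGATION_RATIO = 3.0           # length-to-width ratio quoted in the text
CONE_ANGLE_THRESHOLD_DEG = 10.0  # hypothetical value; the text only mentions "a particular angle threshold"


@dataclass(frozen=True)
class ConePart:
    length: float          # extent of the part along its axis
    width: float           # representative cross-section width
    cone_angle_deg: float  # opening angle of the fitted generalized cone


def describe(part):
    """Return the qualitative descriptor (shape label, well-defined flag) of one part."""
    shape = "conical" if part.cone_angle_deg > CONE_ANGLE_THRESHOLD_DEG else "cylindrical"
    well_defined = part.length / part.width > ELONGATION_RATIO
    return (shape, well_defined)


def signature(parts):
    """Order-independent index built from the descriptors of all parts."""
    return tuple(sorted(describe(p) for p in parts))


# Hypothetical database mapping signatures to object names.
DATABASE = {
    (("conical", False), ("cylindrical", True)): "hammer",
}


def recognize(parts):
    """Look up the object whose stored signature matches the detected parts."""
    return DATABASE.get(signature(parts))


detected = [ConePart(9.0, 2.0, 3.0), ConePart(4.0, 2.0, 15.0)]
label = recognize(detected)
```

Because the index is built from coarse labels rather than raw measurements, small errors in the fitted cones leave the lookup key unchanged, which is the property that makes such descriptors attractive as indexing primitives.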
Chart 2. Summary of the Table 1 scores, consisting of the 1971–1996 papers surveyed that make significant contributions to volumetric part modeling, automatic
programming, perceptual organization and interpretation tree search. Notice that 3D parts, and their use for indexing and encoding objects compactly, form a significant
component of this set of papers.
Brooks [56] presents the Acronym object recognition system.
The author again uses generalized cones to model the objects as
shown in Figs. 4 and 5. Volumetric models and spatial relations
amongst object parts are represented using an object graph. The
author also defines a restriction graph which is used to define a
class and subclass hierarchy for the object we are modeling. In a
way, this provides ‘‘scale’’ information which specifies the amount
of detail we wish to use when defining and searching for an object
(see Fig. 5). For each part’s joint the author defines constraints on
the relations between the various parts. Most often these con-
straints are used to define the permissible angles between the var-
ious joints that would lead to acceptable instances of the object.
For example, if we are interested in modeling an articulated object
such as a piston, constraints can be defined denoting the allowable
articulation of the object’s parts that would lead to an instance of a
piston. The author defines a constraint manipulation system and
shows how the geometric reasoning that this model provides can
be used to reason about the model and discover geometric and
quasi-geometric invariants about a particular object model. These
discovered invariants are positioned in a prediction graph, which
is used in conjunction with extracted image features to determine
whether the desired object exists in the image. The distinguishing
characteristic of Brooks’ work is that it is one of the first systems
to obtain reliable results using parts-based recognition and
generalized cylinders.
Zerroug and Nevatia [61] study the use of the orthographic pro-
jection invariants that are generated from instances of generalized
cylinders. These projective invariants are detected from intensity
images. A verification phase is used to verify the goodness-of-
match of 3D shapes based on the extracted image features, thus
providing an alternative approach for recovering 3D volumetric
primitives from images.
In practice, the main limitation of generalized cylinders lies in
the need to adapt the input scene to the model based on volumet-
ric parts. The inability to come up with a good optimization
scheme for extracting parts from images is a significant drawback
of such algorithms. The models we reviewed here are mainly man-
ually trained, and as we will see when we review deformable mod-
els, it is not clear how to automatically extract such 3D parts-based
components from images. Within the context of Fig. 1, the power
of generalized cylinders and cones lies in their potential future
evolution as a representationally compact and powerful indexing
primitive, which was what motivated early research on the topic.
While it is not clear how to achieve repeatability when extracting
these primitives from 2D images, when reliable 3D data is available
the mapping from images to primitives becomes much more pre-
dictable. While ultimately there may not exist a one-to-one map-
ping between an image and a set of generalized cylinders,
Table 1
Comparing some of the more distinct algorithms of Sections 2.1, 2.2, 2.3, 2.4 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are
used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a
not-applicable label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity
increases. Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a
detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve
detection efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of
the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of
indexing primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is
used by the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.
Papers (1971–1996)              Inference    Search      Training    Encoding     Diversity of indexing  Uses function  Uses  Uses
                                scalability  efficiency  efficiency  scalability  primitives             or context     3D    texture
Binford [54]                    *            *           *           **           **                     *              **    *
Nevatia and Binford [38]        *            *           *           **           **                     *              **    *
Brooks [56]                     *            *           *           ***          ***                    *              **    *
Zerroug and Nevatia [61]        *            *           *           ***          ***                    *              **    *
Bolles and Horaud [62]          *            *           **          **           **                     *              ***   *
Ikeuchi and Kanade [41]         *            *           **          **           ***                    *              ***   *
Goad [63]                       *            *           **          **           **                     *              **    *
Lowe [64]                       **           **          *           *            *                      *              **    *
Huttenlocher and Ullman [65]    **           **          *           *            **                     *              **    *
Sarkar and Boyer [66]           **           **          *           ***          ***                    **             **    *
Grimson and Lozano-Perez [67]   **           ***         *           **           **                     *              ***   *
Fan et al. [68]                 **           *           **          *            ***                    *              ***   *
Clemens [69]                    **           ***         **          **           **                     *              ***   *
Fig. 4. Some of the volumetric parts used by the Acronym system of [56].
Fig. 5. Examples of the objects generated by Brooks’ system [56]. (first-row): Three
specializations of the class of electric motors. (second-row): The modeling of
articulated objects, such as these subcomponents of a piston (adapted from [56]).
there may very well exist one-to-many and many-to-one map-
pings between images and generalized cylinders which can lead to
a powerful mechanism for forming object hypotheses.
2.2. Automatic programming
A critical issue in object recognition is the problem of extracting
and organizing the relevant knowledge about an object and turn-
ing this knowledge into a vision program. This is referred to as
automatic programming. This subsection briefly overviews some
of the early work on the automatic generation of recognition pro-
grams [62,41,63]. This entails learning the important features of an
object automatically, learning the most important/discriminating
aspects/views of the object and coming up with search strategies
for identifying the object from a jumble of objects. In Section 2.7
we will discuss the role of affordances in improving recognition
performance, since it is believed that during early childhood devel-
opment the association between an object’s visual appearance and
its usage is primed. This in turn will highlight the existence of a rel-
atively unexplored link between active approaches to visual
inspection (see Section 3) and automatic programming algorithms,
which could in principle improve the full standard recognition
pipeline (see Fig. 1).
Before the publication of the early work on automatic program-
ming, many components of successful recognition programs were
handwritten. For example, in the Acronym system presented in
the previous subsection, the user has to manually define an object
graph and a restriction graph for each one of the objects that
the system should be capable of recognizing. When we are constructing
systems that need to recognize thousands of objects, this is obvi-
ously a slow, expensive and suboptimal process.
Goad [63] published one of the first methods for the automatic
generation of object recognition programs. He compiles the visible
edges of an object in the current field of view into an interpretation
tree and uses this tree to interpret the image. However, this work
relies on a single view/aspect of the object. Similarly, Bolles and
Horaud [62] use three-dimensional models of various objects to
find them in range data. This is a system for the recognition of ob-
jects in a jumble and under partial occlusions. Given candidate rec-
ognized objects, a verification stage follows, and then the
algorithm determines some essential object configurational infor-
mation, such as which objects are on top of each other. A disadvan-
tage of such early work is that it relies heavily on edge/line based
models, which are not always suitable for certain objects that are
differentiated by more subtle features such as color and texture.
Within the context of the standard recognition pipeline of Fig. 1,
this work represents an early effort toward the generation and verification
of object hypotheses. While the problem of localizing objects
of interest from a jumble remains relevant and constitutes one of
the earliest problems that vision systems were tasked with solving,
the general version of the problem is still open.
Ikeuchi and Kanade [41] slightly modified the Koenderink and
van Doorn [71] definition of an object’s aspect to create a multi-
view recognition system based on the aspect graphs of simple 3D
polyhedral objects. According to Ikeuchi and Kanade [41], an as-
pect consists of the set of contiguous viewer directions from which
the same object surfaces are visible. The authors use a tessellated
sphere to sample the object from various viewer directions. Subse-
quently, they classify these samples into equivalence classes of as-
pects. Various features of the object are then extracted to achieve
recognition. Features used include face area, face inertia, number
of surrounding faces for each face, distances between the surrounding
faces and the face, and the extended Gaussian image
(EGI) vectors. The polyhedral object extraction as well as many of
these features depend on accurate object depth/structure extrac-
tion. This is achieved using a light stripe range finder. These
features are used in the interpretation tree for the aspect classifica-
tion stage (see Fig. 6). The interpretation tree provides a methodol-
ogy for determining the aspect currently viewed, the viewer
direction and rotation with respect to the aspect (this is achieved
again by a decision tree type classification on the features) and fi-
nally, once all this information is extracted, it is possible to make a
hypothesis as to the object currently viewed. This is a view based
recognition system and shares the main disadvantage of view
based approaches since a 3D polyhedron with n faces has O(n
) as-
pects making such a system very expensive computationally. This
algorithm uses a light stripe range finder and therefore it belongs
to the group of algorithms that relies on the existence of 3D infor-
mation. Within the context of the standard recognition pipeline
(Fig. 1), this work represents an example of feature grouping for
the discovery of viewpoint invariants. Similar ideas re-emerge in
modern recognition work, often under the disguise of dimensional-
ity reduction techniques, affine invariant interest-point detectors
and features, as well as hierarchical object representations, where
high-level shared features are used to compactly represent an ob-
ject and recognize novel object instances from multiple views. In
Section 3 we will discuss some extensions of aspect graph matching
to the problem of next-view planning and active multiview recognition
systems.
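A minimal sketch of the sampling-and-grouping step described above, under simplifying assumptions that are ours rather than the authors': the object is a cube, visibility is decided by an orthographic back-face test (valid only for convex objects), and a latitude-longitude grid stands in for the tessellated sphere. All function names are illustrative.

```python
import math
from collections import defaultdict

# Outward unit normals of an axis-aligned cube (a stand-in polyhedral object).
CUBE_FACES = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def viewer_directions(n_lat=8, n_lon=16):
    """Yield unit viewer directions at the cell centres of a latitude-longitude grid."""
    for k in range(n_lat):
        theta = (k + 0.5) * math.pi / n_lat
        for j in range(n_lon):
            phi = (j + 0.5) * 2.0 * math.pi / n_lon
            yield (math.sin(theta) * math.cos(phi),
                   math.sin(theta) * math.sin(phi),
                   math.cos(theta))


def visible_faces(direction, faces=CUBE_FACES, eps=1e-9):
    """Faces of a convex object whose outward normal points toward the viewer."""
    return frozenset(
        name for name, normal in faces.items()
        if sum(a * b for a, b in zip(normal, direction)) > eps
    )


def aspect_classes(directions):
    """Group viewer directions into equivalence classes with the same visible faces."""
    classes = defaultdict(list)
    for d in directions:
        classes[visible_faces(d)].append(d)
    return dict(classes)


aspects = aspect_classes(viewer_directions())
```

With this grid every sample falls strictly inside one octant, so the grouping yields eight aspects of three faces each. Under the orthographic assumption, views showing only one or two faces of the cube arise only along isolated directions or great circles, which the grid does not hit.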
As we will see in the next subsections, in more recent work, and
as we begin using more low-level features (edges, lines, SIFT-like
Fig. 6. An object model compiled into an interpretation tree. Adapted from Ikeuchi
and Kanade [41]. This interpretation tree consists of five aspects S1, . . ., S5. At each
node of the tree a feature (such as face moment, topology, etc.) is used to
discriminate each node’s aspect group and ultimately classify each aspect. Then,
more features are used to determine the viewpoint direction and rotation of the
aspects. Given a region in an input image, the corresponding features are estimated
from that region, and if the estimated features lead to a path in this interpretation
tree from the root to a leaf, then the object and its attitude/viewpoint have been
determined successfully.
features [72], etc.) which are currently popular, the need for man-
ual interaction decreases and statistical learning based algorithms
accomplish the learning with much less user interaction. However,
very little work exists in the active vision literature for automati-
cally extracting optimal object representations in terms of the
minimum encoding length and robustness. While there exist hier-
archical approaches, which are meant to provide compact repre-
sentations, there do not exist guarantees that these are the
minimal or the most robust representations. As argued in
[26], such optimal representations constitute an important component
of any real-time vision system. As argued in [73], the
problem of creating object representations that are independent
of sensor specific biases has not received attention commensurate
with its importance in the vision problem. The advantages in using
object representations with a minimal representation length are
well known from the machine learning literature (e.g., Occam’s ra-
zor and smaller storage requirements [74]). These advantages are
especially important in hierarchical object representations, since
the goal of such representations is to minimize the encoding length
through a parts based representation of objects. However, from at
least an information theoretic perspective, there are also advanta-
ges in not using a minimum encoding length as this can add a level
of redundancy, and redundancy makes recognition systems less
fragile. For example, the paper by [75], which we discuss in more
detail in a subsequent section, has an inherent redundancy in its
decision system, which might partially explain its good performance.
It is not clear what the best representation is for maximizing
robustness while also maximizing generalization capability
and minimizing representation length.
2.3. Perceptual organization
Perceptual organization techniques typically attempt to model
the human visual system’s uncanny ability to detect non-accidental
properties of low-level features, and group these features, in order
to build more compact object representations. Therefore, percep-
tual organization techniques represent an attempt at improving
the feature-grouping and indexing-primitive generation of the
standard recognition pipeline (Fig. 1). When we extract low level
features such as edges and lines, we are usually interested in find-
ing some sort of correspondence/alignment between those fea-
tures and mapping these groupings to a model of higher
complexity. In a typical image, the number of features n can be
in the hundreds if not thousands, implying that a brute force ap-
proach is impractical for matching 3D object models to 2D image
features. The Gestaltists’ view is that humans perceive the simplest
possible interpretation of the visible data. The greatest success in
the study of perceptual organization has been achieved by assum-
ing that the aim of perceptual organization is to detect stable im-
age groupings that reflect non-accidental image properties [64].
A number of common factors which predispose the element group-
ing were identified by the Gestaltists [76–80] (also see Fig. 7):
• Similarity: Similar features are grouped together.
• Proximity: Nearby features are usually grouped together.
• Continuity: Features that lead to continuous or “almost” continuous curves are grouped together.
• Closure: Curves/features that create closed curves are grouped together.
• Common fate: Features with coherent motion are grouped together.
• Parallelism: Parallel curves/features are grouped together.
• Familiar configuration: Features whose grouping leads to familiar objects are usually grouped together.
• Symmetry: Curves that create symmetric groups are grouped together.
• Common region: Features that lie inside the same closed region are grouped together.
In the case of familiar configuration in Fig. 7, the features corresponding
to the Dalmatian dog pop out easily if this is a familiar
image. Otherwise, this can be a challenging image to
understand. There is evidence that the brain uses various mea-
sures—such as the total closure, by measuring the total gap of
perceived contours [81]—as intermediate steps in shape forma-
tion and representation. Berengolts and Lindenbaum [82,83] dem-
onstrate that the distribution of saliency—defined as increasing as
a point gets near an edge-point—is probabilistically modeled
fairly accurately along a curve using the first 3 moments of the
distribution and Edgeworth series. Tests are performed for the
distribution of the saliency for points near a curve’s end and far
away from the curve’s end. The predicted saliency distribution
matches closely the distribution in real images. Such probabilistic
methods are useful for making inferences regarding the organiza-
tion of edges/lines in images.
Lowe [64,84] formalizes some of these heuristics in a probabi-
listic framework. He uses these heuristics to ‘‘join’’ lines and edges
that likely belong together and thus decreases the overall complexity
of the model fitting process. In particular, he searches for
lines which satisfy parallelism and collinearity, and searches for
line endpoints which satisfy certain proximity constraints. For
example, given prior knowledge of the average number d of line
segments per unit area, the expected number N of segments within
a radius r of a given line’s endpoint is $N = d\pi r^2$. If this value is very
low for a particular region but a second line endpoint within this
radius r has been detected, this is a strong indication that the
two lines are not together accidentally, and the two lines are
joined. Similar heuristics are defined for creating other Gestalt-like
perceptual groups based on parallelism and collinearity. He uses
these perceptual grouping heuristics in conjunction with an itera-
tive optimization process to fit 3D object models onto images, and
recognize the object(s) in the image.
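The proximity test can be illustrated with a short sketch. The expected number of accidental endpoints is the segment density multiplied by the area of a disc of radius r. The decision threshold max_expected and the function names below are hypothetical, since the text does not state how low the expected count must be before two lines are joined.

```python
import math


def expected_accidental_endpoints(density, radius):
    """Expected number of segments within `radius` of an endpoint: density * pi * radius^2."""
    return density * math.pi * radius ** 2


def should_join(density, endpoint_gap, max_expected=0.1):
    """Join two lines when a neighbour this close would rarely occur by accident.

    `max_expected` is a hypothetical threshold chosen for illustration only.
    """
    return expected_accidental_endpoints(density, endpoint_gap) < max_expected


# Example: 0.01 segments per unit area.
near = should_join(0.01, 1.0)  # expected count is about 0.03
far = should_join(0.01, 5.0)   # expected count is about 0.79
```

The smaller the expected count, the less likely it is that the observed proximity is accidental, so the same quantity could also serve to rank candidate groupings rather than only to threshold them.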
Huttenlocher and Ullman [65] show that under orthographic
projection with a scale factor, three corresponding points between
image features and object model features are sufficient to align a
rigid solid object with an image, up to a reflection ambiguity. By
taking the Canny edges of an image [85], and limiting their feature
search to edge corners and inflection points, they derive an align-
ment algorithm that aligns those features with a 3D model’s corre-
sponding features. Those features are chosen because they are
relatively invariant to rotations and scale changes. An alignment
runs in O(m
) time, where m is the number of model interest
points and n is the number of image interest points. Once a poten-
tial alignment is found, a verification stage takes place where the
image model is projected on the image and all its interest points
are compared with the image’s interest points for proximity.
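The align-then-verify loop can be sketched in a deliberately simplified setting. The fragment below assumes a planar model, for which three point correspondences determine a 2D affine map, and it enumerates all triples by brute force. It is not the authors' 3D alignment procedure, and the function names, tolerance and toy data are assumptions made for illustration.

```python
import math
from itertools import combinations, permutations


def affine_from_three(src, dst):
    """2D affine map sending three src points onto three dst points (Cramer's rule)."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:  # collinear source points: no unique map
        return None

    def solve(u1, u2, u3):
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        t = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        return a, b, t

    return solve(*(p[0] for p in dst)), solve(*(p[1] for p in dst))


def apply(transform, p):
    """Apply the affine map to a single 2D point."""
    (a, b, tx), (c, d, ty) = transform
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)


def verify(transform, model, image, tol):
    """Count projected model points that land within tol of some image point."""
    count = 0
    for m in model:
        px, py = apply(transform, m)
        if any(math.hypot(px - qx, py - qy) <= tol for qx, qy in image):
            count += 1
    return count


def best_alignment(model, image, tol=1e-6):
    """Try every model triple against every ordered image triple, keep the best score."""
    best_score, best_transform = 0, None
    for src in combinations(model, 3):
        for dst in permutations(image, 3):
            transform = affine_from_three(src, dst)
            if transform is None:
                continue
            score = verify(transform, model, image, tol)
            if score > best_score:
                best_score, best_transform = score, transform
    return best_score, best_transform


# Toy data: the image holds the model rotated by 90 degrees, scaled by 2 and
# shifted by (5, 1), plus one clutter point.
MODEL = [(0, 0), (2, 0), (2, 1), (0, 3)]
IMAGE = [(5, 1), (5, 5), (3, 5), (-1, 1), (9, 9)]
score, transform = best_alignment(MODEL, IMAGE)
```

The verification score is what separates a correct alignment from an accidental one: three correspondences always admit some transform, but only a correct hypothesis brings the remaining model points close to image features as well.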
Sarkar and Boyer [66] present one of the earliest attempts at
integrating top-down knowledge in the perceptual organization
of images. The authors use the formalism of Bayesian networks
to construct a system that is capable of better organizing the external
stimuli based on certain Gestalt principles. The system repeatedly
executes two phases. A bottom-up/pre-attentive phase
uses the extracted features from the image to construct a hierarchy
of a progressively more complex organization of the stimuli. Graph
algorithms are then used to mine the image data and find contin-
uous segments, closed figures, strands, junctions, parallels and
intersections. Perceptual inference using Bayesian networks is
used to integrate information about various spatial features and
form composite hypotheses, based on the evidence gathered so
far. The goal is to repeat this process so that ultimately a more
high-level organization and reasoning of the image features is
Yu and Shi [86,87] define the concept of ‘‘repulsion’’ and
‘‘attraction’’ for the perceptual organization of images and figure-
ground segmentation and show how to use normalized-cuts to
segment the images into perceptually similar regions. They argue
that such forces might contribute to phenomena such as pop-out
and texture segmentation and they discuss their importance to
the problem of visual search.
While investigating the role of perceptual organization in vision
is a vibrant topic of research, most commercially successful recog-
nition systems currently rely on far simpler ‘flat’ architectures, typ-
ically consisting of a simple feature extraction layer followed by a
powerful classifier (see Section 4 on the PASCAL challenges). The
need for object representations with a minimal encoding length
was briefly discussed in [26]. For example, Verghese and Pelli
[88] provide some evidence in support of the view that the human
visual system is a limited capacity information processing device,
by experimentally demonstrating that visual attention in humans
processes about 30–60 bits of information. More complex feature
groupings and indexing primitives inspired by the modeling of
non-accidental image properties, could offer another approach
for improving the standard recognition pipeline of Fig. 1.
2.4. Interpretation tree search
A number of authors have worked on interpretation tree search
based algorithms [67–69]. Grimson and Lozano-Perez [67] discuss
how local measurements of three-dimensional positions and sur-
face normals, that are recorded by a set of tactile sensors or by
three dimensional range sensors, are used to identify and localize
objects. This work represents an example of how interpretation
trees can moderate the explosive growth in the size of the hypoth-
esis space as the number of sensed points and the number of sur-
faces associated with the object model is increased. The sensor is
assumed to provide 3D position and local orientation information
of a small number of points on the object, and as such it serves
as another example of a system that makes use of the range infor-
mation provided by an active range sensor. The authors model the
objects as 3D polyhedra with up to six degrees of freedom relative
to the sensors (three translational and three rotational degrees of
freedom), and use local constraints on the distances between faces,
angles between face normals, and angles of vectors between
sensed points. Given s sensed points and n surfaces in each of
the known objects, the total number of possible interpretations is
$n^s$. An interpretation is deemed legal if it is possible to determine
a rotation and translation that would align the two sets of points.
Since it is computationally infeasible to test all $n^s$ possible
combinations, an interpretation tree approach, combined with
tree pruning and a generate-and-test approach, is
used to determine the proper alignment. The constraints used for
tree pruning (see Fig. 8), include (i) the distance constraint, where
the distances between pairs of points must correspond to the dis-
tances on the model, (ii) the angle constraint where the range of
possible angles between normals must contain the angle of known
object model normals, and (iii) the direction constraint, where for
every triple {i, j, k} of model surfaces, the cones of the directions
between the points on the pairs i, j and j, k are extracted and are
used to determine whether three sensed 3D points might also lie
on surfaces i, j, k.
Each node of the interpretation tree represents one of these
model constraints, and at each level of the tree the corresponding
model constraint is compared with one of the possible range-data-
derived constraints of the scene. If one of the three constraints de-
scribed above does not hold, the entire interpretation tree branch
is pruned. As the authors show experimentally and by a probabilis-
tic analysis, the computational benefits are significant, since the
use of such constraints leads to the efficient pruning of hypotheses
Fig. 7. The Gestalt laws.
which in turn speeds up inference. For example, in one of the
experiments that the authors perform, they demonstrate that this
pruning leads to a reduction in the number of candidate hypothe-
ses: from the 312,500,000 initial possible hypotheses for the ob-
ject, only 20 were left. Within the context of Fig. 1, the work by
Grimson and Lozano-Perez [67] represents a successful attempt
at speeding up the hypothesis generation module, by introducing
a number of constraints for solving an initially intractable problem.
This constitutes an exemplar-based recognition system, and as
such it is a good tool for machine vision tasks where we are dealing
with the problem of localizing well-defined geometric objects (e.g.,
assembly line inspection).
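The effect of constraint-based pruning can be sketched on a toy problem. The fragment below is a simplified 2D analogue that we assume for illustration: the model is a polygon whose edges play the role of surfaces, only the distance constraint is applied, and the edges are assumed not to cross each other. The function names and data are hypothetical, and the final pose-estimation test that makes an interpretation legal is omitted.

```python
import math


def point_segment_distance(p, seg):
    """Distance from point p to the closest point of segment seg."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (ax + t * dx), p[1] - (ay + t * dy))


def distance_range(faces, i, j):
    """[min, max] distance between a point on face i and a point on face j."""
    a, b = faces[i], faces[j]
    hi = max(math.hypot(p[0] - q[0], p[1] - q[1]) for p in a for q in b)
    if i == j:
        return 0.0, hi
    lo = min([point_segment_distance(p, b) for p in a]
             + [point_segment_distance(q, a) for q in b])
    return lo, hi


def interpret(points, faces, tol=1e-9):
    """Depth-first search of the interpretation tree with distance-constraint pruning."""
    n = len(faces)
    ranges = [[distance_range(faces, i, j) for j in range(n)] for i in range(n)]
    surviving, visited = [], 0

    def consistent(assignment, k, f):
        # Point k may lie on face f only if its distance to every previously
        # assigned point is achievable between the corresponding faces.
        for i, g in enumerate(assignment):
            d = math.hypot(points[i][0] - points[k][0], points[i][1] - points[k][1])
            lo, hi = ranges[g][f]
            if not (lo - tol <= d <= hi + tol):
                return False
        return True

    def extend(assignment):
        nonlocal visited
        k = len(assignment)
        if k == len(points):
            surviving.append(tuple(assignment))
            return
        for f in range(n):
            visited += 1
            if consistent(assignment, k, f):
                extend(assignment + [f])  # an inconsistent branch is never expanded

    extend([])
    return surviving, visited


# A 4 x 1 rectangle (bottom, right, top, left edges) and three sensed points.
FACES = [((0, 0), (4, 0)), ((4, 0), (4, 1)), ((4, 1), (0, 1)), ((0, 1), (0, 0))]
POINTS = [(0.5, 0.0), (4.0, 0.5), (3.5, 1.0)]
surviving, visited = interpret(POINTS, FACES)
```

Each level of the tree assigns one sensed point to a face, so an unpruned search would enumerate every assignment of faces to points. Rejecting a partial assignment removes its entire subtree, which is why even a weak pairwise constraint reduces the number of hypotheses passed to the more expensive alignment test.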
Fan et al. [68] use dense range data and a graph-based represen-
tation of objects—where the graph captures information about the
various surface patches and their relation to each other—to recog-
nize objects. These relations might indicate connectivity or occlu-
sion. A given scene’s graph is decomposed into subgraphs (e.g.,
feature grouping) and each subgraph ideally represents a detected
object’s graph representation. The matching is performed using
three modules: a screener module, which determines the most
likely candidate views for each object; a graph matcher module,
which compares candidate matching graphs; and an analyzer,
which makes proposals on how to split/merge object graphs.
Features used during the matching between an object model and a
scene-extracted graph include the visible area of each patch, the
orientation of each patch, its principal curvatures, the estimated
occlusion ratio, etc. Each patch is encoded by a node in the graph,
and adjacent patches are encoded with an edge between the nodes.
Heuristic procedures are defined on how to merge/split such
graphs into subgraphs based on the edge strength. Heuristic proce-
dures for matching graphs are also provided. The authors attempt
to address many issues simultaneously, such as object occlusion
and segmenting out background nuisances. However, it is not clear
how easy it is to reliably extract low-level features such as patch
orientations and principal curvatures from images. Furthermore,
the complexity requirements for reliably learning such object rep-
resentations might be quite high. Within the context of the general
recognition framework in Fig. 1, the system makes a proposal for
improving all of the pipeline’s components, from the extraction
and grouping of strong indexing primitives, all the way to the
hypothesis generation and object verification stage. It is not clear,
however, how efficient this object representation is in terms of its
encoding length, and it is, thus, not clear how well it compares to
other similar approaches. Hierarchical representations of objects
will be discussed in more detail later in this survey, and are meant
to provide reusable and compact object parts. They constitute a
popular and closely related extension of graph based representa-
tions of objects.
Many interpretation-tree approaches do not automatically adjust
the expressive power of their representation during training
and online object recognition. As discussed in [26], this could
have serious implications for the training process and the reliability
of online recognition. A fundamental problem of interpretation
tree search is finding a good tradeoff between tree complexity
and generalization ability, and making the system capable
of controlling the representational complexity [26]. However, this
problem of a constant expressive power is shared by most recognition
systems described in this document, so it is not an issue
exclusive to the interpretation-tree approach.
2.5. Geometric invariants
Geometric invariants are often used to provide an efficient
indexing mechanism for object recognition systems (see Fig. 1).
The indexing of these invariants in hash-tables (geometric hashing
[89–95]) is a popular technique for achieving this efficiency.
A desirable property of such geometric invariants is that
they are invariant under certain group actions, thus providing an
equivalence class of object deformations modulo certain transfor-
mations. Typical deformations discussed in the literature include
2D translations, rotations and scalings (similarity transformations),
as well as 2D affine and projective transformations. Many of these
hashing techniques have also been extended to the 3D case and are
particularly useful when there exists reliable range information.
Such rapid indexing mechanisms have also been quite successful
in medical imaging and bioinformatics, and particularly in match-
ing template molecules to target 3D proteins [96]. Thus, under cer-
tain restrictions, geometric hashing techniques can reliably
address a number of problems [95]:
1. Obtaining an object indexing mechanism that is invariant under
certain often-encountered deformations (e.g., affine and projec-
tive deformations).
2. Obtaining an object indexing mechanism that is fairly robust
under partial occlusions.
3. Recognizing objects in a scene with a sub-linear complexity in
terms of the number of objects in the object database. The
inherent parallelism of geometric hashing approaches is
another one of their advantages.
A problem with such group invariants is that perspective pro-
jections do not form a group. Furthermore, early work on the appli-
cation of group invariants was complicated by the fact that other
common object deformations also do not form groups [97] and
cannot be described easily by closed form expressions.
Notice that in the recent literature, geometric invariants tend to
emerge within the context of local feature-based and parts-based
methods (which are discussed in more detail later in Section 2).
For example, interest-point detectors that are invariant with re-
spect to various geometric transformations (translation or affine
invariance for example) are often used to determine regions of
interest, regardless of image or sensor specific transformations/
deformations. This provides a measure of robustness for determining
regions around which features or parts can be extracted reliably
and with a degree of invariance (Agarwal and Roth [98], Weber et al.
[99], Fergus et al. [100], Lazebnik et al. [101], Mikolajczyk and
Schmid [102]).
Fig. 8. The constraints of Grimson and Lozano-Perez [67]. The figure shows three
points A, B, C on the surface of a cube, and the three normals $N_A$, $N_B$, $N_C$ of the
corresponding surface planes. For each pair of points {A, B}, {B, C}, {A, C} the
distance between the two points refers to the corresponding distance constraint. For
each pair of normals {$N_A$, $N_B$}, {$N_B$, $N_C$}, {$N_A$, $N_C$} the angle between the normals refers
to an angle constraint. For each pair of surfaces, the cone spanned by the directions
between all pairs of points on the two surfaces defines another direction constraint:
given three sensed points, the corresponding cones can be extracted and if they
form a subset of the corresponding model cones, a match has occurred. These cones
can be used to prune the interpretation tree.
The origins of geometric hashing for shapes are traced to
Schwartz and Sharir [90,91,95]. Often, subsets of feature
points are used to obtain a coordinate frame of the image’s object,
and all other model/image features are expressed in this coordinate
frame using affine-invariant or projective coordinates. Other popular
invariants include the differential invariants (under Euclidean
actions) of Gaussian curvature and torsion, as well as a number of
invariants related to the plane conics which can be applied to lines,
arcs, and corners [92]. We will discuss some of these invariants
later in this section. These invariants are used to rapidly access a
hash table. Typically, this procedure is repeated a number of times
for each object, votes are accumulated for each such subset of
coordinates, and the object identity hypothesis with the most votes is
chosen as the recognized object. In Charts 3 and 4 and in Table 2 we
present a comparison, along certain dimensions, for a number of
the papers surveyed in this section and in Sections 2.6, 2.7 and 2.8.
Lamdan et al. [89] use regular 2D images. They extract interest points
at locations of deep concavities and sharp convexities. Assume
$e_{00}, e_{10}, e_{01}$ are an affine basis triplet in the plane. The affine
coordinates $(\alpha, \beta)$ of a point $v$ in the plane are given by

$$v = \alpha(e_{10} - e_{00}) + \beta(e_{01} - e_{00}) + e_{00} \qquad (1)$$

Any affine transformation $T$ would result in the same affine coordinates
since

$$Tv = \alpha(Te_{10} - Te_{00}) + \beta(Te_{01} - Te_{00}) + Te_{00} \qquad (2)$$

Given an image model of an object with m interest points, for each
triple of points, the affine coordinates of the other m − 3 points are
extracted. Each such $(\alpha, \beta)$ coordinate is used as an index to insert
into a hash table the affine coordinate basis and an object ID. This
makes it possible to encode each interest point using all possible
affine basis coordinates. For each triplet of interest points in the
image, their corresponding affine coordinate basis is used to
calculate the coordinates $(\alpha, \beta)$ of all the other interest points. These
coordinates are hashed in the hash table. The object entry in the
hash table with sufficient votes is chosen as the recognized object.
If a verification stage also succeeds (where the object edges are
compared with those of the scene), the algorithm has succeeded
in recognizing the object. Otherwise, a new affine coordinate basis
is chosen and the process is repeated (see Fig. 9).
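The two stages of this procedure (offline table construction and online voting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [89]: interest points are given directly as 2D tuples, hash keys are obtained by quantizing the affine coordinates with a hypothetical bin width `step`, and the verification stage is omitted. All function names and the two toy models are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def affine_coords(v, e00, e10, e01):
    """Solve v = a*(e10 - e00) + b*(e01 - e00) + e00 for (a, b)."""
    ux, uy = e10[0] - e00[0], e10[1] - e00[1]
    wx, wy = e01[0] - e00[0], e01[1] - e00[1]
    det = ux * wy - uy * wx
    if abs(det) < 1e-12:
        return None  # collinear basis points do not define a frame
    px, py = v[0] - e00[0], v[1] - e00[1]
    return ((px * wy - py * wx) / det, (ux * py - uy * px) / det)

def keys_for_basis(points, basis, step=0.05):
    """Quantized affine coordinates of all non-basis points."""
    frame = [points[i] for i in basis]
    keys = []
    for i, p in enumerate(points):
        if i in basis:
            continue
        coords = affine_coords(p, *frame)
        if coords is None:
            return []
        keys.append((round(coords[0] / step), round(coords[1] / step)))
    return keys

def build_table(models):
    """Offline stage: index every model under every ordered basis triple."""
    table = defaultdict(list)
    for name, points in models.items():
        for basis in permutations(range(len(points)), 3):
            for k in keys_for_basis(points, basis):
                table[k].append((name, basis))
    return table

def vote(table, scene, basis):
    """Online stage: one scene basis casts votes for (model, basis) entries."""
    votes = defaultdict(int)
    for k in keys_for_basis(scene, basis):
        for entry in table.get(k, ()):
            votes[entry] += 1
    return votes

models = {
    "A": [(0, 0), (1, 0), (0, 1), (2, 1), (1, 3)],
    "B": [(0, 0), (2, 0), (0, 1), (3, 3), (1, 4)],
}
table = build_table(models)
# Scene: model A under the affine map (x, y) -> (2x + y + 5, -x + 3y - 2).
scene = [(2 * x + y + 5, -x + 3 * y - 2) for x, y in models["A"]]
votes = vote(table, scene, basis=(0, 1, 2))
print(max(votes.items(), key=lambda kv: kv[1]))
```

Because the affine coordinates are unchanged by the transformation, the scene basis (0, 1, 2) retrieves the entry stored for model "A" under the same basis with one vote per non-basis point. In a full system a basis with too few votes would be discarded and another scene triplet tried.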
Chart 3. Summary of the scores from the Table 2 papers published between 1973–1999 that make significant contributions to geometric invariants, 3-D shapes/deformable
models, function, context, and appearance based recognition. This set of papers emphasizes the use of powerful indexing and object encoding primitives. We notice that apart
from the 3-D shape/deformable model papers, the other papers do not make much use of 3-D information for inference and object modeling/representation, and very few of
the other papers make use of function or context. In other words there was little crosstalk between the paradigms during 1973–1999.
Chart 4. Summary of the scores from the Table 2 papers published between 2000–2011 that make significant contributions to geometric invariants, 3-D shapes/deformable
models, function, context, and appearance based recognition. This set of papers emphasizes the use of powerful indexing and object encoding primitives. Compared to
Chart 3 we notice an ever smaller role of 3D in recognition, and a greater emphasis on function, context and efficient inference algorithms.
Flynn and Jain [93] describe an approach for 3D to 3D object
matching using invariant feature indexing. Solid models of objects
(composed of cylinders, spheres and planes) are used to determine
corresponding triples $\{(M_1, S_1), (M_2, S_2), (M_3, S_3)\}$, where $M_i$
denotes a model surface and $S_i$ denotes a corresponding scene surface.
For each pair of extracted scene cylinders, spheres and planes,
an invariant feature is defined and extracted. For example, for each
pair of scene cylinders and planes the angle between the plane’s
normal and the cylinder’s axis of symmetry is extracted. Pairs or
triples of such invariant features are used to access tables where
each table entry contains a linked list of all the database object
models composed of the same invariant features. A vote is placed
for each of the objects in that table entry. By performing this process
across all the extracted invariant features of the scene object,
the table object with the most votes is selected as the recognized
object.
Forsyth et al. [92] present a framework on invariant descriptors
for 3D model-based vision. The authors survey the large mathematics
literature on projective geometry and its invariants, and apply
these invariants to the recognition problem. One of the
projective invariants that they discuss, for example, involves the
use of plane conics. A plane conic is given by the values of $x$
satisfying $x^T c x = 0$, where $c$ is some symmetric matrix with a
determinant of 1. A projective invariant of such a conic is given by

$$\frac{(x_1^T c\, x_2)^2}{(x_1^T c\, x_1)(x_2^T c\, x_2)} \qquad (3)$$

where $x_1, x_2$ are any two points not lying on the conic. In other
words, the value in (3) is independent of the coordinate frame in
which $x_1$, $x_2$ and the conic are measured, and is invariant under
projective distortions. For example, this can be useful for consistently
detecting a car’s wheels from multiple viewpoints.
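The invariance of a conic-based quantity can be checked numerically. The sketch below assumes the standard two-point conic invariant $(x_1^T c\, x_2)^2 / ((x_1^T c\, x_1)(x_2^T c\, x_2))$, with points in homogeneous coordinates, a projective map $P$ acting as $x \mapsto Px$ on points and as $c \mapsto P^{-T} c P^{-1}$ on the conic. The matrices, the points and all helper names are hypothetical choices made for illustration.

```python
def matvec(m, x):
    return [sum(m[i][j] * x[j] for j in range(3)) for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def form(x, c, y):
    """Bilinear form x^T c y."""
    return sum(x[i] * c[i][j] * y[j] for i in range(3) for j in range(3))

def conic_invariant(x1, x2, c):
    return form(x1, c, x2) ** 2 / (form(x1, c, x1) * form(x2, c, x2))

c = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]   # unit circle, det(c) = 1
x1, x2 = [2, 0, 1], [0, 3, 1]             # two points off the circle
P = [[2, 1, 0], [0, 1, 1], [1, 0, 3]]     # an invertible projective map
P_inv = inverse3(P)
c_new = matmul(transpose(P_inv), matmul(c, P_inv))
y1 = [5 * v for v in matvec(P, x1)]       # arbitrary homogeneous rescaling
y2 = matvec(P, x2)

before = conic_invariant(x1, x2, c)
after = conic_invariant(y1, y2, c_new)
print(before, after)
```

The two printed values agree up to floating-point error. The ratio is also unaffected by rescaling the homogeneous point vectors, which is why `y1` can be multiplied by an arbitrary factor.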
A problem with many geometric hashing techniques is that the
resulting feature distributions are not uniform, which slows down
the indexing mechanism by unevenly distributing the indices in the
hash-table cells. Such problems are usually addressed by uniformly
rehashing the table through the use of a distribution function that
models well the expected uneven distribution of features in the
original hash-table.
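The effect of rehashing through a distribution function can be sketched in one dimension. As an assumption made for brevity, the sketch uses the empirical cumulative distribution of a set of training keys in place of an analytic model of the feature distribution; `make_rehash` and the synthetic keys are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

def make_rehash(training_keys, n_bins):
    """Return a function mapping a key to a bin via the empirical CDF."""
    ordered = sorted(training_keys)

    def rehash(x):
        rank = bisect_left(ordered, x)  # number of training keys below x
        return min(rank * n_bins // len(ordered), n_bins - 1)

    return rehash

# Synthetic keys in [0, 1) that cluster heavily near zero.
keys = [(i / 1000) ** 3 for i in range(1000)]

# Direct quantization: most keys collide in the first bin.
direct = Counter(min(int(k * 10), 9) for k in keys)

# Rehashing through the CDF spreads the same keys evenly.
rehash = make_rehash(keys, 10)
balanced = Counter(rehash(k) for k in keys)

print(max(direct.values()), max(balanced.values()))
```

With direct quantization the fullest bin holds several hundred keys, so lookups in that cell degrade towards a linear scan, whereas after rehashing every bin holds the same number of keys.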
Bayesian formulations of the problem are also useful in modeling
positional errors in the hash-tables [94,95]. For example, one
can attempt to maximize the probability $P(M_k, i, j, B \mid S')$ by using
Bayes’ theorem to assign weighted votes to the hash-table, where
$M_k$ is an object model, $i, j$ are indices of two distinct points on the
model which correspond to two points from the basis set $B$ (these
two points can define an axis of the current coordinate frame) and
$S'$ is the set of extracted scene points which excludes the currently
chosen basis points in $B$. The use of such a redundant vote
representation scheme can also diminish the need to consider all
possible model basis combinations during hashing and voting.
Geometric hashing algorithms constitute a proven methodology
offering a rapid indexing mechanism. However, there is little work
on bridging the semantic gap between the low-level features typ-
ically extracted from images, and the high order representations
that are ultimately necessary for recognition algorithms to work
well with non-trivial objects, while simultaneously maintaining
the rapid indexing advantages of such hashing approaches.
2.6. Qualitative 3-D shape-based recognition and deformable models
Do humans recognize objects by first recognizing sub-parts of
an object, or are objects recognized as an image/whole in one shot?
Is it perhaps the case that we first learn to recognize an object by
parts, but as we become more familiar with the object, we recognize
it as a whole? The answer to these questions could have
profound implications for the design of computer vision systems. A
number of researchers have addressed this issue. This section
overviews some of the related research.

Table 2
Comparing some of the more distinct algorithms of Sections 2.5, 2.6, 2.7, 2.8 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are
used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a
not-applicable label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity
increases. Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a
detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve
detection efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of
the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of
indexing primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is
used by the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.

| Papers (1973–2011) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Lamdan et al. [89] | ** | **** | ** | *** | ** | * | * | * |
| Forsyth et al. [92] | ** | ** | ** | **** | *** | * | ** | * |
| Flynn and Jain [93] | ** | ** | * | *** | ** | * | *** | * |
| Fischler and Elschlager [107] | * | ** | * | **** | ** | * | * | * |
| Biederman [40] | * | * | * | **** | *** | * | *** | * |
| Marr [23] | * | * | * | *** | *** | * | *** | * |
| Pentland [109] | * | * | * | **** | *** | * | *** | * |
| Sclaroff and Pentland [113] | ** | * | ** | *** | ** | * | * | * |
| Grabner et al. [127] | * | * | * | *** | * | **** | **** | * |
| Stark et al. [128] | ** | ** | ** | ** | ** | **** | * | * |
| Castellini et al. [130] | ** | ** | ** | ** | ** | **** | * | * |
| Ridge et al. [131] | ** | ** | ** | ** | ** | **** | *** | * |
| Saxena et al. [132] | ** | ** | ** | ** | ** | *** | *** | * |
| Hanson and Riseman [126] | ** | ** | ** | ** | *** | *** | *** | *** |
| Stark and Bowyer [117] | * | * | * | *** | ** | **** | *** | * |
| Strat and Fischler [116] | * | * | * | * | *** | *** | **** | *** |
| Hoiem et al. [121] | *** | ** | ** | ** | *** | *** | *** | * |
| Kumar and Hebert [135] | *** | ** | ** | ** | ** | ** | * | * |
| Torralba et al. [119] | *** | ** | ** | ** | ** | *** | * | * |
| Torralba [120] | *** | ** | ** | ** | ** | *** | * | * |
| Wolf and Bileschi [123] | ** | ** | ** | ** | ** | ** | * | * |
| Li and Fei [137] | ** | ** | ** | ** | ** | *** | * | * |
| Murphy et al. [136] | *** | ** | ** | ** | ** | *** | * | * |
| Shotton et al. [138] | *** | ** | *** | *** | ** | *** | * | * |
| Heitz and Koller [139] | *** | ** | *** | ** | *** | *** | * | *** |
| Turk and Pentland [146] | ** | ** | *** | *** | ** | * | * | * |
| Murase and Nayar [141] | ** | ** | *** | *** | ** | * | * | * |
| Huang et al. [147] | ** | ** | ** | *** | ** | * | * | * |
| Leonardis and Bischof [148] | ** | ** | ** | ** | ** | * | * | * |
The combinatorial and minimum encoding length arguments
from the previous sections provide a compelling argument as to
the need for parts-based recognition. It is combinatorially infeasible
to achieve 3D recognition of wholes without parts-based recognition
preceding it. Pelli et al. [103] provide some compelling
arguments in support of the parts-based approach. To support their
arguments, the authors demonstrate that human efficiency in
reading English words is inversely proportional to word length,
where "efficiency" is defined as the ratio of the ideal observer’s
threshold energy to a human observer’s threshold energy
(threshold energy being the minimum energy needed in the
signal/word to make it observable). The authors demonstrate that
despite having read billions of words in their lifetime and the visual
system having learnt them as well as it is possible, humans
do not learn to recognize words as wholes. They demonstrate that
efficiency decreases with increasing word length. If humans recognized
words as wholes, this effect should not be as pronounced. A
word is never learnt as an independent feature and human performance
never surpasses that achievable by letter-based models. A
word cannot be read unless its letters can be separately recognized
and its components are detected. This leads to some interesting
ideas with respect to purely feedforward approaches to recognition
[104–106]. As localization is very difficult using purely feedforward
approaches and since some sort of localization on the
object is necessary to recognize the individual parts, an attention
mechanism [29] is necessary in order to provide this localization/
parts-based information.
It is, however, unknown what the components used in parts-based
recognition are. To this end, numerous hypotheses have
been formulated, which attempt to explain the components used
in parts-based recognition. However, as many years of research on
the subject have shown, the extraction of such parts from 2D images
is far from trivial, and depends strongly on the image complexity and
the similarity of the image features to the finite set of object parts.
Within the context of the recognition framework in Fig. 1, 3D
shapes and deformable models are believed to provide an extre-
mely powerful indexing mechanism. Their main limitation is, how-
ever, the difficulty in extracting and learning such representations
from 2D images. As a result, in modern work such 3D part-based
representations are not very popular (see Table 7 for example).
Nevertheless, many researchers believe that such 3D representations
must play a significant role in bridging the semantic gap of
recognition systems (see the Tarr-Biederman debate in Section 2.1).
As we will notice, there has been little effort in modern work on
merging such 3D parts-based representations with modern view-based
methodologies relying on texture, local features and advanced
statistical learning algorithms.
Fischler and Elschlager [107] present an early system where a
reference image is represented by p components, and a cost is
associated with the relative displacement/deformation in the spatial
position of each component. Biederman [40] suggests that the
components most appropriate for the recognition process are
geons, which are generalized cones such as blocks, cylinders,
wedges and cones. A maximum of 36 such geons are suggested.
He argues in support of the recognition-by-components approach
to vision. The author maintains that these geons are readily
extractable from five detectable edge based properties of images:
curvature, collinearity, symmetry, parallelism and cotermination
(see Fig. 10). Biederman claims that, since these properties are
invariant over viewing directions, it should be readily possible to
extract geons from arbitrary images. Years of research in the field
have demonstrated, however, that the extraction of geons from
arbitrary images is a non-trivial task and most likely more sources
of regularization are needed if reliable extraction from images of
such high-complexity objects is to be achieved. This early research
by Biederman on recognition-by-components significantly influenced
the computer vision community, and spurred a number of
years of intense research in the field. The author argues that the
human visual system recognizes a maximum of 30,000 object clas-
ses, by using an English language dictionary to approximate the
number of nouns in the English language. Notice, however, that
one could also argue that humans are capable of distinguishing
amongst many more than 30,000 objects (millions of objects) since
for each such noun, humans can effortlessly distinguish amongst
many sub-classes (e.g., the noun ‘car’ has multiple distinguishable
sub-categories which are not enumerated in a typical English lan-
guage dictionary). By simple combinatorial arguments Biederman
shows that combinations of 2–3 geons should be more than suffi-
cient to provide accurate indices for recognition. A number of
experiments are performed with human subjects, demonstrating
the ease with which humans can recognize real life object classes
that are represented by the 2D projection of a composition of 2–
3 geons. Tanaka is well known for his research on uncovering the
neuronal mechanisms in the inferotemporal cortex related to the
representation of object images [108].
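The flavour of such a combinatorial argument can be conveyed with a short count. This is not Biederman’s own calculation: the sketch simply counts ordered arrangements of k geons drawn from 36, and the optional `n_relations` parameter (the number of distinguishable spatial relations between adjacent geons) is a hypothetical value introduced only for illustration.

```python
def arrangements(n_geons, k, n_relations=1):
    """Ordered k-geon arrangements, with n_relations choices per adjacent pair."""
    return n_geons ** k * n_relations ** (k - 1)

for k in (2, 3):
    print(k, arrangements(36, k), arrangements(36, k, n_relations=10))
```

Even without any relations, the 46,656 ordered triples exceed the 30,000 object classes mentioned above, and every additional distinguishable relation between geons multiplies the number of available indices further.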
Biederman’s approach is in some ways similar to Marr’s [23]
paradigm for object inference from 2-D images. Marr’s approach,
however, relies on the extraction of 3-D cylinders, rather than
geons, from images. In more detail, Marr proposes three main lev-
els of analysis in understanding vision. These include a primal
sketch of a scene consisting of low-level feature extraction, fol-
lowed by a 2.5D sketch where certain features which add a sense
of depth might be added to the primal sketch, such as cast-shad-
ows and textures, followed by the above described 3-D cylinder
based representation of the objects in the scene.
Even though Biederman’s 36 geons are inherently three-dimensional,
he notes that he is not necessarily supporting an object-centered
approach to recognition. He argues that since the 3D geons
are specifiable from their 2D non-accidental properties, recognition
does not need to proceed by constructing a three-dimensional
interpretation of each volume. Note that the belief that a combination
of object-centered and viewpoint-dependent recognition
takes place in the human visual system is currently more widely
accepted in the vision community. Biederman also argues that
the recognition-by-components framework can explain why modest
amounts of noise or random occlusion, such as a car occluded
by foliage, do not seem to significantly affect human recognition
abilities, as geon-like structures and the extraction of non-accidental
properties provide sufficient regularization to the problem.
Biederman is careful to indicate that he is not arguing that cues such
as color, the statistical properties of textured regions, and the position
of the object in the scene/context do not play a role in recognition.
What he is arguing in support of is that geon-like structures are
essential for primal access: the first contact/attempt at recognition
that is made upon observing a stimulus and accessing our memory
for recognition. Thus, within the context of the general recognition
framework of Fig. 1, Biederman is proposing a potentially powerful
indexing primitive. However, it is not yet clear how to reliably and
consistently extract these primitives from a regular 2D image, nor
is it clear what the optimal algorithm is for learning an object
representation that is composed of such parts.
Fig. 9. The geometric hashing approach by [89].
Fig. 10. Hypothesized processing stages in object recognition according to Biederman [40]. The process relies on the extraction of five types of non-accidental properties from
images, which in turn help in inferring the corresponding 3-D geons.
Fig. 11. Examples of various generated superquadrics.
Pentland [109,110] presents another parts-based approach to
modeling objects using superquadrics (see Fig. 11). Let $\cos\eta = C_\eta$
and $\sin\omega = S_\omega$ (and analogously for $C_\omega$, $S_\eta$). Then, a superquadric
is defined as

$$X(\eta, \omega) = \begin{pmatrix} a_1 C_\eta^{\epsilon_1} C_\omega^{\epsilon_2} \\ a_2 C_\eta^{\epsilon_1} S_\omega^{\epsilon_2} \\ a_3 S_\eta^{\epsilon_1} \end{pmatrix} \qquad (4)$$

where $X(\eta, \omega)$ is a 3D vector that defines a surface parameterized by
$\eta, \omega$. Furthermore, $\epsilon_1, \epsilon_2$ are constant parameters controlling the
surface shape and $a_1, a_2, a_3$ denote the length, width and breadth.
By deforming a number of such superquadrics, and taking their
Boolean combinations, a number of solids are defined.
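A superquadric surface point can be evaluated directly from the standard parameterization $X(\eta, \omega) = (a_1 C_\eta^{\epsilon_1} C_\omega^{\epsilon_2},\ a_2 C_\eta^{\epsilon_1} S_\omega^{\epsilon_2},\ a_3 S_\eta^{\epsilon_1})$. The sketch below is a minimal illustration under one assumption that is not stated in the text: powers are taken as signed powers, $\mathrm{sign}(b)\,|b|^{\epsilon}$, a common convention that keeps the surface real for fractional exponents. The function and parameter names are hypothetical.

```python
from math import cos, sin, copysign

def signed_pow(base, exponent):
    """sign(base) * |base|**exponent, so fractional exponents stay real."""
    return copysign(abs(base) ** exponent, base)

def superquadric_point(eta, omega, a=(1.0, 1.0, 1.0), eps=(1.0, 1.0)):
    """Point X(eta, omega) on a superquadric with sizes a and shape eps."""
    c_eta, s_eta = cos(eta), sin(eta)
    c_om, s_om = cos(omega), sin(omega)
    return (
        a[0] * signed_pow(c_eta, eps[0]) * signed_pow(c_om, eps[1]),
        a[1] * signed_pow(c_eta, eps[0]) * signed_pow(s_om, eps[1]),
        a[2] * signed_pow(s_eta, eps[0]),
    )

# eps = (1, 1) with unit sizes gives a point on the unit sphere.
x, y, z = superquadric_point(0.3, 1.1)
print(x * x + y * y + z * z)

# Smaller exponents give a more box-like shape.
bx, by, bz = superquadric_point(0.3, 1.1, a=(2.0, 3.0, 4.0), eps=(0.5, 0.5))
print((bx / 2) ** 4 + (by / 3) ** 4 + (bz / 4) ** 4)
```

Both printed values are 1 up to floating-point error: the first point lies on the unit sphere, and the second satisfies the implicit equation of the box-like superquadric obtained for $\epsilon_1 = \epsilon_2 = 0.5$.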
The authors propose a methodology for extracting such superquadrics
from images, assuming the existence of accurate estimates
of the surface tilt for each pixel of the image, where the
tilt is defined as $\tau = x_n / y_n$, where $x_n, y_n$ respectively denote the x
and y axis components of the surface normal. Through a simple
regression methodology they show how the superquadric’s center,
orientation and deformation parameters can be reliably estimated.
The idea is that the relatively compact description of each superquadric
can provide a good methodology for indexing into a database
of objects and identifying the object from the image. In
practice, however, extracting such superquadrics from images
has been met with little success. Superquadrics have nevertheless
been successfully applied to other domains where there is significantly
less variability in the image features, such as the medical
imaging domain and the segmentation of the cardiac ventricles.
Within the context of the general object recognition framework
of Fig. 1, the work on superquadrics that was reviewed so far has
not dealt with all the modules of the standard recognition pipeline.
As indicated previously, while it is difficult to extract a one-to-one
map from an image of an object to a superquadric-based represen-
tation, one-to-many mappings may exist that provide a sufficiently
discriminative and efficiently learnable representation [26]. Dick-
inson and Metaxas [112] present another superquadric based ap-
proach to shape recovery, localization and recognition that
addresses to a greater extent the components of Fig. 1, due to the
use of a hierarchical representation of objects. They first use an as-
pect hierarchy to obtain segmentations of the image into likely as-
pects (see Fig. 12). These aspects in turn are used to guide a
superquadric fitting on the 2D images. The superquadric is fit on
the extracted aspects by fitting a Lagrange equation of motion

$$M\ddot{q} + D\dot{q} + Kq = f \qquad (5)$$

where q is a vector containing the superquadric parameters and
parameters for rotation and translation, and f is a vector of image
forces which control the deformation of the differential equation.
These forces depend on the extracted image aspects. The extracted
superquadrics provide a parts-based characterization of the image
and also provide a compact indexing mechanism.
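The behaviour of an equation of motion of the form $M\ddot{q} + D\dot{q} + Kq = f$ can be illustrated by integrating it until the parameters settle. The sketch below makes several simplifying assumptions that are not part of [112]: the mass, damping and stiffness matrices are diagonal (given as lists), the force vector is constant rather than derived from image aspects, and a semi-implicit Euler scheme with a hypothetical step size is used.

```python
def settle(mass, damping, stiffness, force, dt=0.01, steps=2000):
    """Integrate M q'' + D q' + K q = f from rest with semi-implicit Euler."""
    q = [0.0] * len(force)
    v = [0.0] * len(force)
    for _ in range(steps):
        for i in range(len(q)):
            accel = (force[i] - damping[i] * v[i] - stiffness[i] * q[i]) / mass[i]
            v[i] += dt * accel   # update velocity first
            q[i] += dt * v[i]    # then position, using the new velocity
    return q

q = settle(mass=[1.0, 1.0], damping=[2.0, 4.0],
           stiffness=[1.0, 4.0], force=[3.0, -1.0])
print(q)
```

With damping present, the parameters converge to the equilibrium where the internal stiffness forces balance the applied forces, that is, where $Kq = f$. In a fitting context the forces would be recomputed from the image at every step.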
Sclaroff and Pentland [113] provide a different formulation of a
deformable model. Given a closed parameterized curve, which
represents the outline of a segmented object, they decompose the
object into its so-called "modes of deformation". The displacement
vector U of the model’s nodes/landmarks is modeled using the Lagrange
equation of motion

$$M\ddot{U} + D\dot{U} + KU = R \qquad (6)$$

where R denotes the various forces acting on the model and causing
the deformation (such as edges and lines) and M, D, K denote the
element mass, damping and stiffness properties respectively. It is
shown how to use this differential equation to obtain a basis matrix
$\Phi$ of m eigenvectors $\Phi = [\phi_1, \ldots, \phi_m]$. Linear combinations of these
eigenvectors describe the "modes of deformation" of the differential
equation, thus allowing us to deform the model displacements U
until the best matching model is found. Given a model object whose
contour is described by a finite number of landmarks, the authors
present a formulation for determining the displacement U that best
matches those landmarks. If the matching is sufficiently good, we
say that the object has been recognized.
Cootes et al. introduced Active Shape Models (ASMs) and Active
Appearance Models (AAMs) [114,115], which are also quite popu-
lar in the medical imaging domain [111,12] (see Fig. 13). While
superquadric based approaches use a very specific shape model,
AAMs and ASMs try to learn a shape model from general data with
regularization. As previously discussed, such 3D parts-based prim-
itives offer a potentially powerful indexing mechanism, but are of-
ten difficult to extract reliably from images.
2.7. Function and context
Extensive literature exists on the exploitation of prior knowl-
edge about scene context and object function, in order to improve
recognition. Function and context are related topics, since by defi-
nition, information about the function of a certain object implies
that the system is able to extract some information about the scene
context as well. For example, knowledge that a particular set of ob-
jects can be used as a fork, spoon and plate increases the contex-
tual probability that we are in a kitchen and that edible objects
might be close by, which can in turn help improve recognition per-
formance. Conversely, contextual knowledge is strongly related to
function since often the scene context (e.g., are we inside a house
or are we outside, and what is the scale with which the scene is
sensed?) could help us determine whether, for example, a car-like
object is a small toy that is suitable for play, or whether it is a big
vehicle that is suitable for transportation purposes.
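The way recognized objects raise the probability of a scene context can be sketched with a small Bayesian update. All numbers below are hypothetical values chosen for illustration, and the sketch assumes that object occurrences are conditionally independent given the scene, which is a simplification.

```python
def scene_posterior(prior, likelihood, observed):
    """Posterior over scenes given observed objects (naive Bayes assumption)."""
    scores = {}
    for scene, p in prior.items():
        for obj in observed:
            p *= likelihood[scene][obj]
        scores[scene] = p
    total = sum(scores.values())
    return {scene: s / total for scene, s in scores.items()}

# Hypothetical prior over scenes and per-scene object likelihoods.
prior = {"kitchen": 0.2, "office": 0.8}
likelihood = {
    "kitchen": {"fork": 0.6, "spoon": 0.6, "plate": 0.7},
    "office": {"fork": 0.05, "spoon": 0.1, "plate": 0.1},
}

posterior = scene_posterior(prior, likelihood, ["fork", "spoon", "plate"])
print(posterior)
```

Although the kitchen starts with the lower prior, observing a fork, a spoon and a plate together makes it by far the most probable context. The updated scene posterior could in turn be used to re-weight hypotheses for other, more ambiguous objects.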
Fig. 12. The aspect hierarchy used by Dickinson and Metaxas [112].
Fig. 13. A 3-D active appearance model (AAM) used in [12] to model the left
ventricle of the heart. The right image shows a 3D model of the left ventricle of the
heart, which captures the modes of shape deformation during the cardiac cycle as
well as the corresponding image appearance/intensity variations. This model can be
deformed to better fit the data in the volumetric images, and thus achieve better
segmentation. The left image shows a short-axis cardiac MRI slice whose intensity
is modeled by the AAM. A stack of such images produces a 3D volumetric image.
Thus, within the context of the recognition pipeline in Fig. 1 we
see how function and context could in principle affect all the
components of the standard pipeline. For example, contextual knowledge
could place a smaller burden on the level of object
representational detail required by the object database. Similarly,
function and context could affect the feature extraction and grouping
process when there is scene ambiguity due to sensor noise or
occlusions, for example. Likewise, context and function can prune
the hypothesis space, and thus improve the reliability and efficiency
of the object verification and object training phases. We also
notice that related work is by necessity closely related to knowledge
representation frameworks.
Various knowledge representation schemes were implemented
over the years in order to improve the performance of vision systems,
through the integration of task-specific knowledge [116–123].
Contextual knowledge used by such systems typically helps
in answering certain useful questions such as: Where are we?
Are we looking up or down? What kind of objects are typically lo-
cated here? How will the objects in the scene be used?
An early and influential knowledge representation framework is
attributable to Minsky [124]. The essence of Minsky’s frame theory
is encapsulated in the following quotation: ‘‘When one encounters a
new situation (or makes a substantial change in one’s view of the pres-
ent problem) one selects from memory a structure called a Frame. This
is a remembered framework to be adapted to fit reality by changing
details as necessary frames provide us with a structural and concise
means of organizing knowledge.’’ Essentially frames are data struc-
tures for encoding knowledge and represent an early attempt at
modeling the way humans store knowledge. As such they have sig-
nificant applications in vision and influenced early vision research
on context and function. Minsky argued that frames could provide
a global theory of vision. For example, he argued that frames could
be used to encode knowledge about objects, sub-parts, their posi-
tions in rooms, and how these relations might change with chang-
ing viewpoint (is-a relations, part-of relations, part-whole relations
and semantic relations). However, modern recognition research
has moved to a learning/probabilistic based model for representing
knowledge, which is typically represented in terms of graphical
models [125]. In Section 3 we will discuss how knowledge repre-
sentation frameworks were also used in early research on active
object localization and recognition.
Along similar lines, research on exploiting function (a.k.a., affor-
dances) has provided some promising results. Under this frame-
work, the object’s function plays a crucial role in recognition. For
example, if we wish to perform generic recognition and be capable
of recognizing all chairs, we need to identify a chair as any object
on which someone can sit. One can of course argue that this is
no different from the classical recognition paradigm where we
learn an object’s typical features, and based on those features try
to recognize the object. It is simply a matter of learning all the dif-
ferent types of chairs. Nevertheless, and as discussed in the intro-
duction, the huge amount of variation in objects implies that it is
unreasonable to assume that an accurate geometric model will al-
ways exist.
In early work on function, actions and affordances, there was no
focus on how learning is related to the problem. In more recent
work the confluence of learning for actions and affordances has
gained prominence [127,128]. One can think of many object classes
(such as the class of all chairs) which contain elements that at least
visually are completely unrelated. The only intermediate feature
that such classes share is their function. The idea behind learn-
ing-based affordance/function research involves associating visual
features with the function of the object, and then using the object
function to improve recognition.
According to Gibson’s concept of affordances, a human being
associates the sight of an object with the ways in which it is usable
[129,130]. It is believed that during early childhood development
this association is primed by manipulation, randomly at first, and
then in a more and more refined way [130]. According to this
school of thought, a significant reason why human object recognition
is reliable is that humans immediately associate the
sight of an object with its affordances, which results in strong
generalization capabilities.
Recent work on affordances has also focused on its relation to
robotics, by building systems that use vision to
determine how an object should be grasped and manipulated
[131,132]. While some success was achieved for a small number
of object classes, consistently reliable affordance-based grasping
for a large number of object classes has not yet been demonstrated. Notice
that object grasping is a many-to-many relationship, since multi-
ple objects are graspable with the same grasp, and one object
can be associated with multiple kinds of grasps [130].
An early example of a vision system which used a non-trivial
knowledge base is the VISIONS system by Hanson and Riseman
[126] which was progressively developed and improved over a
number of years (also see Fig. 14). This system incorporated a
knowledge representation scheme over numerous layers of repre-
sentation in order to create an advanced vision system [133]. At
the highest level, their system consists of a semantic network of
object recognition routines. This network is hierarchically orga-
nized in terms of part-of compositional relations which, thus,
decompose objects into object parts.
Fig. 14. Overview of the VISIONS system, and the three main layers of representation used in the system. Adapted from [126].
Stark and Bowyer [117] present the GRUFF-2 system for function
based recognition, together with a function-based category
indexing scheme that allows for efficient indexing. Given as input
a polyhedral representation of an object, they define a set of knowledge
primitives which define the object. For example, a king sized
bed is defined by knowledge primitives which specify the total
sleeping area of the bed, the width of the sleeping area, the stabil-
ity of the object—i.e., does it have a sufficient number of legs?—and
so on. A hierarchy is defined for all the objects we wish to recog-
nize. For example, a chair has a number of subcategories (conven-
tional chair, balance chair, lounge chair etc.) and each subcategory
might have another subcategory or it might be a leaf in the hierar-
chy, in which case the set of knowledge primitives from the leaf to
the root need to be verified to see if we are dealing with such an
object. Acceptable ranges are defined for each of the tested fea-
tures—such as the acceptable total sleeping area of the object if
we are to classify it as a bed—and based on the total score of all
the ranges, a confidence measure is defined on the hypothesized
identity of the object. An indexing scheme is also proposed, which
uses the results of the initial input shape estimation to remove
impossible categories. For example, if the total volume of the ob-
ject is not within a specified bound, we know that a big subset of
the objects in our database can be ignored as they do not have
the same volume bounds.
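The hierarchy, range tests and volume-based indexing described above can be illustrated with a small sketch. It is not the GRUFF-2 implementation: the category names, the numeric ranges, the linear fall-off in `range_score` and the use of the minimum as the combined confidence are all assumptions chosen for illustration, and the sketch assumes every range has a non-zero width.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    ranges: dict          # knowledge primitive -> (low, high) acceptable range
    volume_bounds: tuple  # (min, max) volume, used only for index pruning
    children: list = field(default_factory=list)


def range_score(value, low, high):
    """1.0 inside [low, high], falling linearly to 0.0 outside (assumed scoring)."""
    if low <= value <= high:
        return 1.0
    distance = low - value if value < low else value - high
    return max(0.0, 1.0 - distance / (high - low))


def classify(node, measurements, inherited=1.0, path=()):
    """Yield (path, confidence) for each leaf category that survives pruning."""
    low, high = node.volume_bounds
    if not low <= measurements["volume"] <= high:
        return  # indexing step: the whole subtree is impossible
    scores = [range_score(measurements[k], *r) for k, r in node.ranges.items()]
    confidence = min([inherited] + scores)  # primitives from root to leaf must all hold
    path = path + (node.name,)
    if not node.children:
        yield path, confidence
    for child in node.children:
        yield from classify(child, measurements, confidence, path)


# Hypothetical hierarchy and measurements (metres, square metres, cubic metres).
king = Category("king-size bed", {"sleep_width": (1.8, 2.1)}, (0.5, 3.0))
single = Category("single bed", {"sleep_width": (0.8, 1.1)}, (0.3, 2.0))
bed = Category("bed", {"sleep_area": (1.5, 4.5), "legs": (4, 8)}, (0.3, 3.0), [king, single])
chair = Category("chair", {"seat_area": (0.1, 0.4)}, (0.02, 0.3))
root = Category("furniture", {}, (0.0, 10.0), [bed, chair])

shape = {"volume": 1.2, "sleep_area": 4.0, "sleep_width": 2.0, "legs": 4, "seat_area": 4.0}
results = dict(classify(root, shape))
best = max(results, key=results.get)
```

In this toy run the chair subtree is never evaluated because the object volume falls outside its bounds, which is the effect the indexing scheme is meant to achieve.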
Similarly, a large amount of published research exists on con-
text-based vision. Strat and Fischler [116] present the Condor rec-
ognition system (see Fig. 15). It is a system designed for
recognizing complex outdoor natural scenes, by relying to a large
extent on contextual knowledge without depending on geometric
models of the learned objects. The authors avoid two
assumptions that are inherent to many recognition systems,
namely that all objects of interest are definable by a small number
of parts, and that all objects have well defined locally measurable
features. Context sets are defined, which are sets of predicates/image
features that, if satisfied, trigger a certain action. An example
of a context set is {SKY-IS-CLEAR, CAMERA-IS-HORIZONTAL,
RGB-IS-AVAILABLE}. If the context set is true, a certain
operator is executed which helps us determine whether a certain
object (soil, trees, etc.) is present in the image. Such context sets
are used in 3 different types of rules:
• Type I: Candidate generation.
• Type II: Candidate evaluation.
• Type III: Consistency determination.
Type I rules are typically entered manually. In the candidate
evaluation phase, feature selection determines which context
sets are most discriminative for each operator and object/class.
This makes it possible to order the rules and thus,
obtain in a more efficient manner the maximal set (referred to as
the maximal clique in the paper) of objects present in the image.
The maximal clique must contain objects whose rules contradict
each other as little as possible. Since it is intractable to enumerate
all the sets in order to determine the best one, this ordering of rules
is a heuristic that makes it possible to find a good set within rea-
sonable time.
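A context-set rule of the kind described above can be sketched as a subset test followed by an ordering step. The operator names and discriminativeness scores below are hypothetical and only illustrate the mechanism; they are not taken from the Condor system.

```python
def applicable_operators(rules, facts):
    """Return the operators whose context set is fully satisfied, best score first."""
    fired = [(score, op) for context_set, op, score in rules if context_set <= facts]
    return [op for score, op in sorted(fired, reverse=True)]


# Each rule: (context set, operator to execute, assumed discriminativeness score).
rules = [
    (frozenset({"SKY-IS-CLEAR", "CAMERA-IS-HORIZONTAL", "RGB-IS-AVAILABLE"}),
     "find-sky-by-blue-region", 0.9),
    (frozenset({"CAMERA-IS-HORIZONTAL"}), "find-ground-below-horizon", 0.6),
    (frozenset({"RANGE-IS-AVAILABLE"}), "find-trunk-by-depth-edge", 0.8),
]

# Predicates that hold for the current image.
facts = {"SKY-IS-CLEAR", "CAMERA-IS-HORIZONTAL", "RGB-IS-AVAILABLE"}
operators = applicable_operators(rules, facts)
```

Ordering the fired rules by score is a simple stand-in for the heuristic rule ordering mentioned in the text; it avoids enumerating every candidate set.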
Hoiem et al. [121] use probabilistic estimates of 3D geometry of
objects relative to other objects in the scene to make estimates of
the likelihood of the various object hypotheses. For example, if a
current hypothesis detects a building and a person in the image,
but the extracted 3D geometry indicates that the person is taller
than the building, the hypothesis is discarded as highly unlikely.
Their approach can be incorporated as a ‘‘wrapper’’ method around
any object detector. Markov random fields (MRFs) are also a popu-
lar method for incorporating contextual information via spatial
dependencies in the images [134]. In more recent work, Kumar
and Hebert [135] use Discriminative Random Fields (DRFs), an
extension of MRFs, for incorporating neighborhood/scene interac-
tions. The main advantage of DRFs is their ability to relax the con-
ditional independence assumption of MRFs. A number of
researchers use the statistics of bags of localized features (edges,
lines, local orientation, color, etc.) to determine the likely distribu-
tion of those features depending on the scene or current context
[119,120,123,122]. This is often referred to as extracting the gist
of the image. Recent work has examined the use of graphical mod-
els as means of an event-based generative modeling of the objects
in an image [136,137]. For example, Murphy et al. [136] use a con-
ditional random field for simultaneously detecting and classifying
images. The authors take advantage of the correlations between
typical object placements to improve recognition performance
(e.g., a keyboard is typically close to a screen). Instead of modeling
such correlations directly, they use a hidden common cause, which
they call the ‘‘scene’’. In subsequent sections we discuss how such
contextual clues have also been used to improve the efficiency of
next-view-planners in active vision systems.
In more recent work, the role of contextual knowledge ex-
tracted from the outputs of segmentation algorithms and local
neighborhood labeling algorithms was investigated [138,139]. A
recent evaluation of the role of context in recognition algorithms
is presented in [140].
There is a general consensus in the vision community that func-
tion and context contribute significantly to the vastly superior per-
formance of the human visual system, as compared to the
performance of artificial vision systems. As discussed in [26],
function and context can play a significant role in attentional prim-
ing and during the learning process of new object detectors. Edel-
man [27] argues that the major challenges inhibiting the design of
intelligent vision systems include the need to adapt to diverse
tasks, the need to deal with realistic contexts, and the need to pre-
vent vision systems from being driven exclusively by conceptual
knowledge (which effectively corresponds to template matching,
as previously described). Edelman argues that the use of interme-
diate representations instead of full geometric reconstruction, is a
necessary condition for building a versatile artificial vision system.
Fig. 15. The Condor recognition system of Strat and Fischler [116].
2.8. Appearance based recognition
Early research on appearance-based recognition used global
low-level image descriptors based on color and texture histograms
[142]. See Niblack et al. [143], Pontil and Verri [144], Schiele and
Crowley [145] for some related early work. The introduction of
appearance based recognition using Principal Component Analysis
(PCA) arguably provided the first approach for reliable exemplar
based recognition of objects under ideal imaging conditions (e.g.,
no occlusion, controlled lighting conditions, etc.). The first
breakthrough in the area arose with Turk and Pentland’s ‘‘eigenfaces’’
paper [146], which used PCA at the level of image pixels to
recognize faces. A slew of research on appearance based recognition
followed. The work by Cootes et al. on Active Appearance Models
[114,115], constituted an early proof-of-concept on the applicabil-
ity of such appearance based techniques within the context of
other vision-related tasks, such as in the tracking and medical
imaging domains [111]. One of the first approaches using PCA at
the pixel level for recognition was Murase and Nayar’s work
[141]. In contrast to the traditional approach to object recognition,
the recognition problem is formulated as a problem of matching
appearance and not shape. PCA provides a compact representation
of the object appearance parameterized by pose and illumination
(see Fig. 16). For each object of interest, a large set of images is ob-
tained by automatically varying pose and illumination. The image
set is compressed to obtain a low dimensional subspace called an
eigenspace in which an object is represented as a manifold. The ob-
ject is recognized based on the manifold it lies on. Every object is
represented as a parametric manifold in two different eigenspaces:
The universal eigenspace, which is computed using image sets of all
objects, imaged from all views and the object eigenspace which is a
different manifold for each object, computed using only images/
views of a single object. Given an image consisting of an object
of interest, the authors assume the object can be segmented from
the rest of the scene, and is not occluded by other objects.
After a scale and brightness normalization of all the views of a
certain object, they transform the image pixel intensities into a
vector. They call the set of vectors for all views of an object the
object image set, and the union of all object image sets the universal
image set. The idea is that for the object image set $O_p$ of each object
$p$, and for the universal image set $U$, principal component analysis
(PCA) is applied and eigenvectors explaining a certain percentage
of the image variation are retained (typically 90–95%). Using PCA,
an eigenbasis for the universal image set is obtained, and a different
eigenbasis for each object image set is also obtained. For each
object image set, the algorithm projects the images on the universal
image set eigenbasis and on their respective object image set
eigenbasis.
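The two-eigenspace construction can be sketched with NumPy as follows. This is a simplified illustration rather than the authors' implementation: the "images" are synthetic 6-dimensional vectors, illumination is ignored, and the nearest sampled view stands in for the interpolated manifold that the text goes on to describe. The helper names (`eigenbasis`, `project`, `recognize`, `views`) are hypothetical.

```python
import numpy as np


def eigenbasis(images, variance=0.90):
    """Return (mean, basis); basis rows are the leading principal directions
    that together explain at least the requested fraction of the variance."""
    X = np.asarray(images, dtype=float)
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, variance)) + 1
    return mean, vt[:k]


def project(images, mean, basis):
    """Coordinates of one image (1-D) or many images (2-D) in the eigenbasis."""
    return (np.asarray(images, dtype=float) - mean) @ basis.T


def recognize(image, object_sets, variance=0.90):
    """Pick the object in the universal eigenspace, then the view in its own."""
    u_mean, u_basis = eigenbasis(np.vstack(object_sets), variance)
    z = project(image, u_mean, u_basis)
    distances = [np.linalg.norm(project(views, u_mean, u_basis) - z, axis=1).min()
                 for views in object_sets]
    p = int(np.argmin(distances))
    o_mean, o_basis = eigenbasis(object_sets[p], variance)
    z_p = project(image, o_mean, o_basis)
    view_distances = np.linalg.norm(project(object_sets[p], o_mean, o_basis) - z_p, axis=1)
    return p, int(np.argmin(view_distances))


# Synthetic data: two objects, each imaged from 12 views spaced 30 degrees apart.
angles = np.deg2rad(np.arange(0, 360, 30))


def views(base, u, v):
    return base + np.outer(np.cos(angles), u) + np.outer(np.sin(angles), v)


e = np.eye(6)
object_sets = [views(3 * e[0], e[2], e[3]), views(3 * e[1], e[4], e[5])]
query = object_sets[1][4] + 0.01  # object 1, view index 4, slightly perturbed
result = recognize(query, object_sets)
```

The universal eigenspace is used only to decide which object is present, while the per-object eigenspace gives the finer pose estimate.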
Notice that for each training image the algorithm knows the
view from which the image was acquired. The algorithm also simulates
variations in the illumination conditions of each image by
using a single variable. Thus, for each object image set, the projected
point is parameterized in terms of the view from which
the image was acquired and in terms of the illumination conditions
under which the image was acquired. By interpolation, two continuous
functions are obtainable: $g_p(\theta_1, \theta_2)$, the appearance of object
$p$ in the universal eigenbasis from view $\theta_1$ and illumination $\theta_2$, and
$f_p(\theta_1, \theta_2)$, the appearance of object $p$ in the object eigenbasis from
view $\theta_1$ and illumination $\theta_2$. The projection of an image on the universal
eigenbasis gives $z$. The object recognition problem is reduced to
the problem of finding the object $p$ in the universal eigenbasis
which gives the minimum value for $\min_{\theta_1, \theta_2} \| z - g_p(\theta_1, \theta_2) \|$. If we
denote the projection of the image on the eigenbasis corresponding
to the recognized class $p$ as $z_p$, the pose is determined by finding
the $\theta_1$, $\theta_2$ which minimize $\| z_p - f_p(\theta_1, \theta_2) \|$. The above method has
a number of advantages that are shared by a big proportion of the
appearance based recognition literature. It is simple, it does not re-
quire knowledge of the shape and reflectance properties of the ob-
ject, it is efficient since recognition and pose estimation can be
handled in real time, and the method is robust to image noise
and quantization. It also shares a number of disadvantages which
are also common in the appearance based recognition literature.
It is difficult to obtain training data, since (i) there is a need to seg-
ment general scenes before the object training can happen, (ii) the
method requires that the objects are not occluded, (iii) the algo-
rithm cannot easily distinguish between two objects that differ
in one small but important surface detail and (iv) the method
would not work well for objects with a high dimensional eigen-
space and a high number of parameters, since the non-linear opti-
mization problem in high dimensions is notoriously difficult.
Nevertheless, despite these limitations, its training is much easier
than that of manually trained systems, such as the ones in Table 1.
These training limitations are shared by most modern training
algorithms which require labeled and segmented data (see Sec-
tion 4). Thus the algorithm compares favorably to many of the best
performing recognition systems which require detailed manual
A number of researchers proposed algorithms for addressing
these issues. Huang et al. [147] present an approach to recognition
using appearance parts. Zhou et al. [149] use particle filters in con-
junction with inter-frame appearance based modeling to achieve a
robust face tracker and recognizer under pose and illumination
variations. Leonardis and Bischof [148] use RANSAC to handle
occlusion in a more robust way. Given a basis set of eigenvectors
which describe a training set of objects, the authors present a ro-
bust method of determining the coefficients of the eigenvectors
which best match a target image. Using RANSAC, they randomly
select subsets of the target image pixels, and find the optimal
eigenvector coefficients that fit those pixels. At each iteration,
the worst fitting pixels are discarded—as sources of occlusion or
noise, for example—and the process is repeated. At the end, the
algorithm calculates a robust measure of the eigenvector coeffi-
cients that best fit the image. These coefficients are used to recog-
nize the object.
Thus, within the context of the pipeline in Fig. 1, we see that
appearance based approaches are strongly related to the feature
grouping module of recognition algorithms. Their power lies in
the projection of the raw extracted features to a lower dimensional
feature space which is more easily amenable to powerful classifiers.
This demonstrates how powerful feature grouping algorithms
led to some of the earliest reliable exemplar-based recognition
systems. In Section 3 we show how active vision approaches, in
conjunction with dimensionality reduction algorithms, proved
capable of drastically reducing the size of the object
database (see Fig. 1) needed for acceptable recognition performance.
Fig. 16. Example of a potential transformation of an image to parametric
eigenspace (from [141]). Each position on the manifold/contour represents the
eigenbasis coordinates of an object from a certain viewpoint and illumination. Thus,
if the transformed image lies close to the manifold, the image contains an instance
of the object that the manifold represents.
2.9. Local feature-based recognition and constellation methods
Local feature-based recognition methods gained popularity in
the second half of the 1990s mainly due to their robustness in clutter
and partial occlusion [150,151,72,152–155]. Inspired by the
machine learning literature and the introduction of promising
new classifiers (SVMs for example) that could now run within reasonable
time frames on personal computers, researchers started
investigating methods for extracting features from images and
applying machine learning techniques to identify the likely object
from which those features were extracted.
Local-based features are useful for local or global recognition
purposes. Local recognition is useful when we want the ability to
recognize the identity and location of part of an object. Local-based
features are also useful for global recognition purposes, when we
are not interested in the exact location of the object in an image,
but we are just interested in whether a particular image contains
a particular object at any image location (classification). Such glo-
bal features are particularly popular with Content Based Image Re-
trieval (CBIR) systems where the user is typically interested in
extracting global image properties/statistics (see Section 2.12). As
discussed in more detail in Section 4, image categorization/classi-
fication algorithms (which indicate whether an image contains
an instance of a particular object class), are significantly more reli-
able than object localization algorithms whose task is to localize
(or segment) from an image all instances of the object of interest.
Good localization performance has been achieved for restricted ob-
ject classes: in general there still does not exist an object localiza-
tion algorithm that can consistently and reliably localize arbitrary
object classes. In Chart 5 and Table 3 we present a comparison,
along certain dimensions, for a number of the papers surveyed in
Sections 2.9 and 2.10.
An early local feature-based approach to recognition is Rao and
Ballard’s iconic representation algorithm [150], which extracts local
feature vectors encoding the multiscale local orientation of various
image locations. These vectors are used for recognition purposes.
Lowe presents the SIFT algorithm [72], where a number of interest
points are detected in the image using difference-of-Gaussian-like
operators. At each one of those interest points, a feature vector is
extracted. Over a number of scales and over a neighborhood
around the point of interest, the local orientation of the image is
estimated using common techniques from the literature. Each local
orientation angle is expressed with respect to the dominant local
orientation, thus, providing rotation invariance. If a number of such
features are extracted from an object’s template image, we say that
the object is detected in a new test image if a number of similar
feature vectors are localized in the new test image at similar rela-
tive locations. Quite often such features are used as elements of
orientation histograms. A comparison of the similarity between
two such histograms helps determine the similarity between two
shapes. See [156,157] for some early precursors of such ap-
proaches. Currently such approaches are extremely popular in
addressing the image classification problem, and we will discuss
them in more detail in Section 4. Thus, within the context of the
recognition pipeline in Fig. 1, we see that early work on local fea-
tures was most closely related to the feature grouping module. In
Section 2.12 we will see why such features are also useful in reduc-
ing the object database storage requirements of content based im-
age retrieval systems.
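The rotation invariance obtained by expressing local orientations relative to the dominant one can be illustrated with a toy orientation histogram. This is a deliberately reduced sketch and not Lowe's descriptor: the 36-bin histogram, the `(angle, magnitude)` input format and the function names are assumptions, and the example rotation is chosen as a multiple of the bin width so that the effect is exact.

```python
def orientation_histogram(gradients, bins=36):
    """Magnitude-weighted histogram of gradient angles given in degrees."""
    width = 360 / bins
    hist = [0.0] * bins
    for angle, magnitude in gradients:
        hist[int((angle % 360) // width)] += magnitude
    return hist


def normalize_rotation(hist):
    """Circularly shift the histogram so that its dominant bin comes first."""
    dominant = max(range(len(hist)), key=hist.__getitem__)
    return hist[dominant:] + hist[:dominant]


# Hypothetical (angle in degrees, gradient magnitude) samples around a keypoint.
patch = [(5, 1.0), (15, 2.0), (15, 1.5), (95, 0.5), (185, 1.0)]
# The same patch after an in-plane rotation of 40 degrees.
rotated = [((angle + 40) % 360, magnitude) for angle, magnitude in patch]

raw_equal = orientation_histogram(patch) == orientation_histogram(rotated)
normalized_equal = (normalize_rotation(orientation_histogram(patch))
                    == normalize_rotation(orientation_histogram(rotated)))
```

The raw histograms differ after the rotation, whereas the histograms expressed relative to the dominant orientation coincide.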
Mikolajczyk and Schmid present an affine invariant interest
point detector [158] that provides a feature grouping procedure
that is more robust under certain image transformations, and can
thus improve the reliability of the recognition modules in Fig. 1 that
depend on the feature grouping module. The local features are ro-
bust in the presence of affine transformations and changes in scale,
thus, providing invariance under viewpoint changes. Interest point
candidates are first extracted using the multi-scale Harris detector.
Then, based on the gradient distribution of the interest point’s
local neighborhood, an affine transformation is estimated that
makes the local image gradient distribution isotropic and that cor-
rects the displacements in the interest point locations across scales
due to the Harris detector. Once these isotropic neighborhoods are
obtained, any typical feature based recognition approach could be
used. The authors estimate a vector of local image derivatives of
dimension 12, by estimating derivatives up to 4th order. A simple
Mahalanobis comparison formulates hypotheses of matching inter-
est points across two images, and a RANSAC based approach further
refines the correspondences and provides a robust estimate of a
homography describing the transformation between the two
images. Using this homography, if there is a sufficient number of
matching interest points, the two images match, potentially recog-
nizing the object in one image if the object present in the other im-
age is known. Similar approaches are described elsewhere for affine
invariant matching, wide-baseline stereo and multiview recogni-
tion [159–164]. Local image-based features are also used for vision
based localization and mapping with some success [165–167] and
are currently quite popular in content-based image retrieval [168].
Belongie et al. [155] present a metric for shape similarity and
use this to find correspondences amongst shapes that are describ-
able by well defined contours, such as letters, and logos. Each
shape’s outline is discretized into a number of points, and for each
point a log-polar histogram is built of the location of the other
shape points with respect to this point (see Fig. 17). To compare
the correspondence quality of two points $p_i$, $q_j$ on two different objects,
it suffices to compare their respective histograms using the
$\chi^2$ test statistic $C_{ij} = C(p_i, q_j) = \frac{1}{2} \sum_{k} \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)}$, where $h_i(k)$ denotes
the kth entry in the histogram of the relative coordinates of the
points on the shape contour, where the relative coordinates are ex-
pressed with respect to the ith point on the shape contour. A bipar-
tite graph matching algorithm, then, searches for the best matches
amongst all the landmarks on the two shapes.
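A compact sketch of this pipeline is given below: a log-polar histogram per contour point, a chi-squared cost between histograms, and a search for the cheapest one-to-one assignment. The bin layout (3 radial by 8 angular bins, radii normalized by the mean distance) is an assumption, and the brute-force search over permutations is only a stand-in for a proper bipartite matching algorithm, usable for a handful of points.

```python
import math
from itertools import permutations


def log_polar_histogram(points, i, n_r=3, n_theta=8):
    """Normalized log-polar histogram of all other points relative to points[i]."""
    xi, yi = points[i]
    offsets = [(x - xi, y - yi) for j, (x, y) in enumerate(points) if j != i]
    dists = [math.hypot(dx, dy) for dx, dy in offsets]
    scale = sum(dists) / len(dists)  # mean distance, for scale invariance
    hist = [0.0] * (n_r * n_theta)
    for (dx, dy), d in zip(offsets, dists):
        r_bin = min(n_r - 1, max(0, math.floor(math.log2(d / scale)) + n_r - 1))
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = int(angle / (2 * math.pi) * n_theta) % n_theta
        hist[r_bin * n_theta + t_bin] += 1.0 / len(offsets)
    return hist


def chi2_cost(h_i, h_j):
    """Chi-squared statistic between two histograms; empty bins are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h_i, h_j) if a + b > 0)


def match_shapes(shape_a, shape_b):
    """Cheapest one-to-one assignment of points (brute force, equal sizes only)."""
    ha = [log_polar_histogram(shape_a, i) for i in range(len(shape_a))]
    hb = [log_polar_histogram(shape_b, j) for j in range(len(shape_b))]
    cost = [[chi2_cost(a, b) for b in hb] for a in ha]

    def total(perm):
        return sum(cost[i][j] for i, j in enumerate(perm))

    best = min(permutations(range(len(shape_b))), key=total)
    return best, total(best)


# A small contour and a translated copy listed in reverse order.
shape_a = [(0, 0), (4, 0), (4, 3), (1, 5), (0, 2)]
shape_b = [(x + 10, y - 7) for x, y in reversed(shape_a)]
assignment, total_cost = match_shapes(shape_a, shape_b)
```

Because the histograms use only relative coordinates, the translated copy can be matched at zero total cost.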
Chart 5. Summary of the 1995–2012 papers from Table 3. We notice that the local feature-based representation, constellation method, grammar and graph representation
papers surveyed in the respective sections are mostly focused on inference and training efficiency, encoding scalability and expanding the diversity of the indexing primitives
used. There is no consistent effort amongst this group of papers in simultaneously modeling 3D representations, texture and function/context.
Csurka et al. [169] and Sivic and Zisserman [170] introduced the
‘‘bag-of-features’’ approach for recognition, an influential and effi-
cient approach for recognition which was widely adopted by the
community. The main advantages of the framework are its simplicity,
efficiency and invariance under viewpoint changes and background
clutter, which typically results in good image
categorization. The framework has four main steps: (i) detecting
image patches, (ii) assigning each patch, based on its descriptor,
to one of a set of previously mined clusters, (iii) counting the number
of keypoints/features assigned to each cluster and (iv) treating the
bag of features as a feature vector and using a classifier to classify
the respective image.
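Steps (ii) to (iv) can be sketched in a few lines, assuming that patch detection and descriptor extraction have already been done and that a codebook of cluster centres has already been mined (for example with k-means). The two-dimensional descriptors, the labels and the nearest-neighbour classifier are illustrative assumptions; any classifier could take its place.

```python
import math


def nearest_cluster(descriptor, codebook):
    """Index of the codebook centre closest to the descriptor (step ii)."""
    return min(range(len(codebook)), key=lambda c: math.dist(descriptor, codebook[c]))


def bag_of_features(descriptors, codebook):
    """Normalized count of descriptors assigned to each cluster (step iii)."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, codebook)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def classify(histogram, labelled_histograms):
    """Nearest-neighbour classifier over labelled training histograms (step iv)."""
    return min(labelled_histograms, key=lambda item: math.dist(histogram, item[0]))[1]


# Hypothetical codebook, descriptors from one image, and labelled training data.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1, 0), (0, 1), (9, 10)]
training = [([0.7, 0.3], "car"), ([0.2, 0.8], "face")]

histogram = bag_of_features(descriptors, codebook)
label = classify(histogram, training)
```

Note that the histogram discards where in the image each patch was found, which is the source of both the simplicity and the lack of localization of the approach.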
Grauman and Darrell [171] use a ‘‘bag-of-features’’ type of
approach to recognition. They extract SIFT features from a set of
images and then define a pyramid-based metric which measures
the affinity between the features in any two images. A spectral
clustering based approach clusters the images based on this affin-
ity, providing a semi-supervised method for determining classes
from a set of images. Each cluster is then further refined, removing
any potential outliers from the clusters.
Another bag-of-features type of approach is presented in
Nistér and Stewénius [75], where the authors present a recognition
scheme that scales well to a large number of objects. They present
an online test suite using a 40,000 image database from music CD
covers. The authors also tested their system using a set of 6376 la-
beled images that were embedded in a dataset of around 1 million
frames captured from various movies. Features are extracted by
using the Maximally Stable Extremal Region algorithm [164] to lo-
cate regions of interest, followed by fitting an ellipse to each such
region, and followed by transforming each ellipse into a circular
region. Then SIFT features are extracted from these normalized re-
gions and they are quantized using a vocabulary tree algorithm.
Effectively, the vocabulary tree uses hierarchical k-means to create
a feature tree, where at each layer of the tree, the features are
grouped into k subtrees. Each node of the tree is assigned an information
theoretic weight $w_i = \ln(N/N_i)$, where $N$ is the number of
images in the database and $N_i$ is the number of training images
with at least one quantized vector passing through node i in the
tree. A query image is matched with the database images by
extracting all the feature vectors from the query image, and then
finding the path in the tree that best matches each feature vector.
Each node i of the tree is weighted by the number of query image
vectors that traverse the corresponding node i, and this provides
a vector which is matched with each database image vector. The
matching provides the vector’s ‘‘distance’’ to the closest matching
database images. It is surprising that the algorithm gives good per-
formance despite the fact that information about the relative posi-
tion of the various features is discarded. This reinforces the point
discussed elsewhere in this survey, that detection algorithms
which do not attempt to actually localize the position of an object
in an image, tend to perform better than localization algorithms.
Thus, within the context of the pipeline in Fig. 1, we see that this
work makes a proposal on how local features could improve the
feature grouping, object hypothesis and object verification phases,
compared to a baseline feature based approach.
Fig. 17. Belongie’s algorithm [155]. The contour outline of an object is discretized
into a number of points, which are in turn mapped onto a log-polar histogram.
Table 3
Comparing some of the more distinct algorithms of Sections 2.9 and 2.10 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are used
to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable
label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity
increases. Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a
detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve
detection efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of
the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of
indexing primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is
used by the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.
| Papers (1995–2012) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Rao and Ballard [150] | * | ** | ** | ** | ** | * | * | * |
| Lowe [72] | ** | ** | ** | ** | ** | * | * | * |
| Mikolajczyk and Schmid [158] | *** | ** | ** | ** | *** | * | * | * |
| Belongie et al. [155] | *** | ** | ** | ** | *** | * | * | * |
| Csurka et al. [169] | *** | ** | ** | ** | *** | ** | * | * |
| Grauman and Darrell [171] | ** | ** | **** | *** | *** | * | * | * |
| Nistér and Stewénius [75] | **** | *** | *** | *** | *** | * | * | * |
| Sivic and Zisserman [170] | *** | ** | N/A | ** | *** | * | * | * |
| Kokkinos and Yuille [172] | *** | ** | *** | ** | **** | * | * | * |
| Lampert et al. [173] | ** | **** | ** | ** | ** | * | * | * |
| Fergus et al. [100] | ** | ** | **** | ** | *** | * | * | * |
| Fergus et al. [174] | ** | ** | **** | ** | *** | * | * | * |
| Sivic et al. [175] | *** | ** | **** | ** | *** | ** | * | * |
| Ullman et al. [176] | ** | ** | ** | *** | *** | * | * | * |
| Felzenszwalb and Huttenlocher [177] | ** | *** | *** | **** | ** | ** | * | * |
| Leibe and Schiele [178] | ** | ** | ** | *** | ** | * | * | * |
| Li et al. [179] | *** | ** | ** | *** | ** | * | * | * |
| Ferrari et al. [180] | *** | ** | ** | ** | *** | * | * | * |
| Siddiqi et al. [181] | ** | * | ** | **** | ** | * | * | * |
| LeCun et al. [182] | *** | ** | ** | ** | *** | * | * | * |
| Ommer et al. [183] | ** | ** | ** | *** | *** | * | * | * |
| Ommer and Buhmann [184] | ** | ** | ** | *** | *** | * | * | * |
| Deng et al. [185,186] | **** | *** | *** | **** | *** | ** | * | *** |
| Bart et al. [187,188] | *** | ** | **** | *** | *** | ** | * | *** |
| Le et al. [189] | *** | ** | **** | *** | *** | * | * | * |
Sivic and Zisserman [170] present a method inspired by the text
retrieval literature, for detecting objects and scenes from videos. A
number of affine invariant features are extracted and an index is
created using those features. A Mahalanobis distance metric is
used to cluster these features into ‘‘visual words’’, or frequent fea-
tures. Those visual words are used to achieve recognition. For a gi-
ven image/document, each visual word is assigned a weight of
importance which depends on the product of the frequency of
the word in the document with another number which downplays
words that appear frequently in the database. Given a query vector
of the visual words in an image/video sequence, and a set of visual
word vectors with their weights, extracted from our database of
videos, a matching score is based on the scalar product of query
vector with any database vector. The authors also discuss various
ways in which the weights could affect the matching score/ranking
of images/videos.
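As an illustration only, the weighting and matching steps described above can be sketched as follows. The function names, the particular tf-idf variant and the unit normalisation are our own assumptions and are not taken from [170].

```python
import numpy as np

def tfidf_vectors(counts):
    """counts: (num_docs, vocab_size) array of visual-word counts per image."""
    counts = np.asarray(counts, dtype=float)
    num_docs = counts.shape[0]
    # Term frequency: share of each visual word within its own document.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Inverse document frequency: downplays words found in many documents.
    doc_freq = np.count_nonzero(counts, axis=0)
    idf = np.log(num_docs / np.maximum(doc_freq, 1))
    vectors = tf * idf
    # Unit-normalise (an assumption) so the scalar product acts as a cosine score.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12), idf

def rank(query_counts, database_vectors, idf):
    """Score every database vector by its scalar product with the query."""
    q = np.asarray(query_counts, dtype=float)
    q = q / max(q.sum(), 1.0) * idf
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = database_vectors @ q
    return np.argsort(-scores), scores
```

Visual words that occur in every database image receive a zero weight under this variant, so they do not influence the ranking.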
Kokkinos and Yuille [172] present scale invariant descriptors
(SIDs) and use these descriptors as the basis of an object recogni-
tion system that detects whether certain images contain cars, faces
or background texture. The authors use a logarithmic sampling
(centered at each pixel of the image) that is similar to the human
visual front-end. As a result the image region around each pixel
is parameterized in terms of a logarithmically spaced radius r
and a rotation value u. The authors show that as a result of the
non-uniform scale of spatial sampling, it is possible to obtain fea-
ture vectors that are scale and rotation invariant. These feature
vectors depend on the amplitude, orientation and phase at each
corresponding image position. They are obtained by transforming
the corresponding amplitude, orientation and phase maps of each
image into the Fourier domain, resulting in orientation and scale
invariance. These feature vectors are in turn used as the basis of
an object detector: The authors describe a methodology for
extracting candidate sketch tokens from training images and
describing their shape and appearance distributions in terms of
SIDs, which in turn enables object detection to take place.
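The invariance argument above rests on a general property: on a log-polar grid, a change of scale or rotation about the sampling centre acts approximately as a shift along the radius or angle axis, and the magnitude of the discrete Fourier transform is unchanged by circular shifts. The sketch below demonstrates only this property on a synthetic array; treating the radial shift as circular is an idealisation, and the function name is hypothetical rather than part of [172].

```python
import numpy as np

def invariant_descriptor(log_polar_patch):
    # A circular shift of the input only changes the phase of its DFT,
    # so keeping the magnitude discards the shift.
    return np.abs(np.fft.fft2(log_polar_patch))

rng = np.random.default_rng(0)
# Rows: logarithmically spaced radii; columns: angles (synthetic values).
patch = rng.random((16, 32))
# Idealised scale change (row shift) combined with a rotation (column shift).
shifted = np.roll(patch, shift=(3, 5), axis=(0, 1))
same = np.allclose(invariant_descriptor(patch), invariant_descriptor(shifted))
```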
Lampert et al. [173] show how a branch and bound algorithm can be used
in conjunction with a bag-of-visual-words model in order to
achieve efficient image search, by circumventing the sliding win-
dow approach that has dominated much of the literature
[190,191,173,192]. The algorithm is able to localize the target of
interest in linear or sublinear time. The authors also show how
classifiers, such as SVMs, which were considered too slow for sim-
ilar localization tasks, can be used within this framework for effi-
cient object localization. This resulted in significant efficiency
improvements in the hypothesis generation phase of the algorithm
(see Fig. 1), which contributed to its popularity.
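A minimal sketch of such a branch and bound search is given below, assuming that the classifier score of a box is the sum of per-pixel weights (for instance, the weight of the visual word observed at each feature location). The interval parameterisation, the bound and all names are our own simplifications and should not be read as the implementation of [173].

```python
import heapq
import numpy as np

def integral(img):
    # Integral image with a zero border: ii[y, x] = sum of img[:y, :x].
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box_sum(ii, top, bottom, left, right):
    # Sum over rows top..bottom and columns left..right (inclusive); 0 if empty.
    if top > bottom or left > right:
        return 0.0
    return (ii[bottom + 1, right + 1] - ii[top, right + 1]
            - ii[bottom + 1, left] + ii[top, left])

def best_box(weights):
    """Return (score, (top, bottom, left, right)) of the maximum-sum box."""
    h, w = weights.shape
    pos = integral(np.maximum(weights, 0.0))
    neg = integral(np.minimum(weights, 0.0))

    def bound(state):
        (t0, t1), (b0, b1), (l0, l1), (r0, r1) = state
        # Positive weights of the largest box plus negative weights of the
        # smallest box: an upper bound for every box in the set.
        return box_sum(pos, t0, b1, l0, r1) + box_sum(neg, t1, b0, l1, r0)

    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-bound(start), start)]
    while heap:
        neg_score, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        k = int(np.argmax(widths))
        if widths[k] == 0:  # a single box: its bound is its exact score
            (t, _), (b, _), (l, _), (r, _) = state
            return -neg_score, (t, b, l, r)
        lo, hi = state[k]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):  # split the widest interval
            child = state[:k] + (part,) + state[k + 1:]
            heapq.heappush(heap, (-bound(child), child))
```

Because the most promising set of boxes is always expanded first, the first single box removed from the queue is optimal, and large parts of the search space are typically never examined.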
Local feature-based approaches have been successfully applied
to a number of tasks in computer vision. However, it is generally
acknowledged that significantly more complex types of object rep-
resentations are necessary to bridge the semantic gap between low
level and high level representations, which we discussed in
Sections 1 and 2.1. It is important to keep in mind that some recent work
questions whether popular object representations and recognition
algorithms do indeed offer superior performance, as compared to
other much simpler algorithms, or whether this difference in per-
formance is usually just an artifact of biased datasets (see Pinto
et al. [193], Torralba and Efros [194], and Andreopoulos and Tsot-
sos [73]). As discussed in the above papers, the empirical evidence
is clearly pointing to the fact that a common thread of most recog-
nition algorithms is their fragility and their inability to generalize
in novel environments. This is an indication that there is significant
room for breakthrough innovations in the field. While local fea-
tures are the only thing that can be observed reliably, there is a sig-
nificant on-going discussion on the situations when these local
features need to be tied together to obtain more complex represen-
tations [51]. This topic of local features vs. scene representation
tends to re-emerge within the context of the semantic gap prob-
lem. For example, when dealing with commercial vision systems
that are expected to mine massive datasets using a large number
of object classes (see Section 2.12 on content based image retrieval
systems) there is an inverse relationship between an increase in
the system’s efficiency and its reliability: as the complexity of
the extracted features and the constructed scene representations
increases, the computational resources and the amount of training
data required can easily become intractable [26].
Constellation methods are ‘‘parts-and-structure’’ models for
recognition that lie at the intersection of appearance/local-
feature-based methods and parts-based methods. They represent
an attempt to compensate for the simplicity of local-feature-
based representations, by using them to form dictionaries
of more complex and compact object representations
[195,196,179,177,176,197,178,180,198,199,100,174,200–203]. As
such they represent an evolution in the local-feature-based group-
ing phase of the pipeline in Fig. 1. Within this context, the work by
[107] which was previously discussed, could also be classified as
falling within this category since it relies on a parts-based repre-
sentation of objects. In early work on recognition, when referring
to parts-based approaches, authors were often referring to 3-D ob-
ject parts (superquadrics, cylinders, deformable models, etc.),
while more recent local-feature-based approaches are mostly used
to form 2-D parts-based representations of objects. In general,
sophisticated learning techniques have been applied to a much
greater extent on local-feature-based object representations and
constellation methods. This differentiates much of the literature
on 3-D and 2-D parts representations of objects.
An advantage of many constellation methods is that they are
learnt from unsegmented training images and the only supervised
information provided is that the image contains an instance of the
object class [174]. It is not always necessary for precise object
localization information to be provided a priori of course. However,
the less extraneous/background scene information present in the
training images, the better the resulting classifier. Typically, this
is achieved through latent variable estimation using the EM algo-
rithm. A disadvantage of such approaches is that their training
can sometimes be quite expensive. For example, many formula-
tions of constellation methods, typically, require fully connected
graphs, where the graph nodes might represent local features or
parts. As a means of simplifying such problems, authors often
use various heuristics to decrease the connectivity of the related
graphs or to simplify other aspects of the problem. The published
literature does not tend to distinguish its recognition algorithms
as exemplar or generic. The spectrum of object categories encountered
in the literature is as general as that of cars and as specific as
that of cars with a particular pattern of stripes on them. As is
common in the literature, typically only successful approaches are
published, making it difficult to understand why a particular approach
that works well in one situation might not work so well in another.
We discuss this topic in more detail in Section 4.
Fergus, Perona and Zisserman are arguably some of the stron-
gest advocates of the approach and have published a series of pa-
pers on constellation methods, some of which we overview here
[100,195,174,200,203]. Their work is an early example of view-
based approaches combined with a graphical model based learning
and representation framework. Within the context of Fig. 1, their
papers represent a characteristic example of an effort to use low le-
vel indexing primitives to learn progressively more complex prim-
itives (so called words). In [100] the shape, appearance, relative
scale of the parts, and potential occlusion of parts are modeled.
For each image pixel and over a range of image scales the local sal-
iency is detected and the regions whose saliency is above a certain
threshold are the regions from which the features used for recog-
nition are extracted. The saliency metric is a product of the image
intensity histogram’s entropy over an image radius determined by
the scale, weighted by the sum over all image intensities of the rate
of change of the corresponding intensity channel as the scale var-
ies. The training is completely unsupervised, which is the main
strength of the paper. However, the method’s training is extremely
slow as a 6–7 part model with 20–30 features, using 400 training
images, takes 24–36 h to complete on a Pentium 4 PC. A number
of short-cuts have been proposed that improve the training times.
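The saliency measure described above can be sketched for a single pixel as follows. This follows only the verbal description given here (histogram entropy multiplied by the summed change of the histogram between consecutive scales); the bin count, the disk-shaped neighbourhood and the function names are assumptions, and the code is not the detector used in [100].

```python
import numpy as np

def disk_histogram(image, cy, cx, radius, bins=16):
    """Normalised intensity histogram inside a disk (intensities in [0, 1])."""
    yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    hist, _ = np.histogram(image[mask], bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def scale_saliency(image, cy, cx, radii, bins=16):
    """One saliency value per consecutive pair of scales in radii."""
    hists = [disk_histogram(image, cy, cx, r, bins) for r in radii]
    scores = []
    for prev, cur in zip(hists[:-1], hists[1:]):
        nonzero = cur[cur > 0]
        entropy = -np.sum(nonzero * np.log2(nonzero))
        # Rate of change of the histogram as the scale varies.
        change = np.abs(cur - prev).sum()
        scores.append(entropy * change)
    return np.array(scores)
```

A uniform region yields zero saliency at every scale, since its histogram has zero entropy and does not change with the radius.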
The number of parts P is specified a priori. To each part, the
algorithm assigns a feature out of the N features in the image.
The features not assigned to a part are classified as belonging to
the background, and therefore, are irrelevant. The object’s shape
is represented as a Gaussian distribution of the features’ relative
locations and the scale of each part with respect to a reference
frame is also modeled by a Gaussian distribution. Each part’s
appearance is modeled as a 121-dimensional vector whose dimen-
sion is further decreased by applying PCA on the set of all such
121-dimensional vectors in our training set. A Gaussian distribution
is then used to model each part’s appearance. As is common
with such constellation methods, an EM algorithm is applied to
determine the unknown parameters (shape mean, shape covari-
ance matrix, each part’s scale parameters, part occlusion modeling,
appearance mean and covariance matrix). As the E-step of the EM
algorithm would need to search through an exponential number of
parameters ($O(N^{P})$), the A* search algorithm is applied to improve
the training complexity. Once the training is complete, the decision
as to whether a particular object class O is present in the image is
made by maximizing the ratio of probabilities
$P(O \mid \text{parameters}) / P(\neg O \mid \text{parameters})$, where
parameters denotes all the parameters estimated during training
with the EM algorithm. Fei-Fei Li et al. have picked up on this work
and published numerous related papers. In Li et al. [197] for exam-
ple, the authors use an online version of the EM-algorithm so that
the model learning is not done as a batch process.
Fergus et al. [195] extend their approach by also encoding each
part by its curve segments. A Canny edge operator determines all
the curves, and each curve is split into independent segments at
its bi-tangent points. On each such curve a similarity transforma-
tion is applied so that the curve starts at the origin and ends at
(1, 0). The curve endpoint positioned at the origin is determined
by whether or not its centroid falls beneath the x-axis. By evenly
sampling each curve at 15 points along its x-axis, a 15-dimensional
feature vector of the curve is obtained and is modeled by a 15
dimensional Gaussian. The model is again learnt via the EM algo-
rithm. The training data set used contains valid data of the object
we wish to learn but it might also contain background irrelevant
images. RANSAC is used to fit a number of models and determine
the best trained model. By applying each learned model on two
datasets—one containing exclusively background/irrelevant data
and the other containing many correct object instances—the best
model is chosen based on the idea that the best model’s scoring
should be the lowest on the background data
and the highest on the data with valid object instances. The algo-
rithm is used to learn object categories from images indexed by
Google’s search engine.
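The curve normalisation and sampling steps described above can be sketched as follows. The sketch assumes that, after normalisation, the curve can be treated as a function of x so that linear interpolation is meaningful; the function name is hypothetical and the code is not taken from [195].

```python
import numpy as np

def curve_descriptor(points, num_samples=15):
    """points: (n, 2) array of (x, y) positions ordered along the curve."""
    z = points[:, 0] + 1j * points[:, 1]
    # Similarity transform (translation, rotation, scaling) that sends the
    # first endpoint to the origin and the last endpoint to (1, 0).
    z = (z - z[0]) / (z[-1] - z[0])
    # Choose which endpoint sits at the origin so that the centroid lies
    # on or below the x-axis; swapping the endpoints maps z to 1 - z.
    if z.imag.mean() > 0:
        z = (1 - z)[::-1]
    xs = np.linspace(0.0, 1.0, num_samples)
    order = np.argsort(z.real)
    # Evenly sample the curve height along the x-axis.
    return np.interp(xs, z.real[order], z.imag[order])
```

Because the transform is defined by the endpoints alone, the resulting 15-dimensional vector does not change when the input curve is translated, rotated or uniformly scaled.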
In [200] the authors use the concept of probabilistic Latent
Semantic Analysis (pLSA) from the field of textual analysis to
achieve recognition. If we have D documents/images and each doc-
ument/image has a maximum of W words/feature types in it, we
can denote by n(w, d) the number of words of type w in document
d. If z denotes the topic/object, the pLSA model maximizes the log
likelihood of the model over the data:
$$L = \sum_{d=1}^{D} \sum_{w=1}^{W} n(w, d) \log P(w, d) \qquad (7)$$
where
$$P(w, d) = \sum_{z=1}^{Z} P(w \mid z) P(z \mid d) P(d) \qquad (8)$$
and Z is the total number of topics/objects. Again the EM algorithm
is used to estimate latent variables and learn the model densities
P(w|z) and P(z|d). Recognition is achieved by estimating P(z|d) for
the query images.
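For illustration, a compact EM procedure for the pLSA model can be sketched as below. It uses the standard pLSA updates with P(d) fixed to the empirical document frequencies; the random initialisation, the iteration count and the function name are our own assumptions rather than details of [200].

```python
import numpy as np

def plsa(n, num_topics, iters=50, seed=0):
    """n: (W, D) matrix of counts n(w, d). Returns P(w|z), P(z|d), log-likelihoods."""
    rng = np.random.default_rng(seed)
    W, D = n.shape
    p_w_z = rng.random((W, num_topics))
    p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((num_topics, D))
    p_z_d /= p_z_d.sum(0)
    p_d = n.sum(0) / n.sum()
    trace = []
    for _ in range(iters):
        # joint[w, z, d] = P(w|z) P(z|d); summing over z gives P(w|d).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        mix = joint.sum(1)
        # Log-likelihood: sum over w, d of n(w, d) log P(w, d).
        trace.append(float(np.sum(n * np.log(mix * p_d + 1e-300))))
        # E-step: posterior P(z | w, d).
        post = joint / np.maximum(mix[:, None, :], 1e-300)
        # M-step: re-estimate both conditional distributions.
        weighted = n[:, None, :] * post
        p_w_z = weighted.sum(2)
        p_w_z /= p_w_z.sum(0)
        p_z_d = weighted.sum(0)
        p_z_d /= np.maximum(p_z_d.sum(0), 1e-300)
    return p_w_z, p_z_d, trace
```

For a query image, the learnt P(w|z) would be held fixed and only P(z|d) re-estimated, which corresponds to the recognition step described above.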
In [174] the authors address the previously mentioned problem
of a fully connected graphical model representing all the possible
parts-features combinations. By using a Star model to model the
probability distributions, the complexity is reduced to $O(N^{2}P)$. In
the Star model (see Fig. 18) all the other object parts are described
with respect to a landmark part. If the position of the non-land-
mark parts is expressed with respect to the position of the land-
mark parts, translation invariance is also obtained. The authors
also obtain scale invariance by dividing the non-landmark’s loca-
tion by the scale of the landmark’s position. The rest of the ap-
proach is similar to [100].
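To make the difference in scale concrete, the short sketch below compares the two growth rates, taking them to be N^P for the fully connected model and N^2 P for the star model. The values N = 25 and P = 6 are only indicative choices within the ranges quoted earlier for [100], and the constants hidden by the O-notation are ignored.

```python
def full_model_hypotheses(num_features, num_parts):
    # Every part may be assigned any of the N features: N^P combinations.
    return num_features ** num_parts

def star_model_cost(num_features, num_parts):
    # Each non-landmark part is matched relative to the landmark: N^2 * P.
    return num_features ** 2 * num_parts

full = full_model_hypotheses(25, 6)   # 244140625
star = star_model_cost(25, 6)         # 3750
```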
Sivic et al. [175] present a method based on pLSA for detecting
object categories from sets of unlabeled images. The same objec-
tive function as in Eq. (7) is used, where now the EM algorithm
is used to maximize the objective function and discover the top-
ics/object classes corresponding to a number of features. The fea-
tures used by the authors are SIFT-like feature vectors. Two types
of affine covariant regions are computed in each image, using var-
ious methods described in the literature. One method is based on
[158] which we described above. For each such elliptical region a
SIFT like descriptor is calculated. K-means clustering is applied to
these SIFT descriptors to determine the ‘‘words’’ comprising our
data set. The authors demonstrate that even though this is a ‘‘bag
of words’’ type of algorithm, it is feasible to use the algorithm
for localizing/segmenting an object in an image. The authors dem-
onstrate that doublets of features can be used to accomplish this.
Ullman et al. [176] present an approach that uses a constella-
tion of face-part templates (eyes, mouth, nose, etc.) for detecting
faces. The templates are selected using an information maximiza-
tion based approach from a training set, and detection is achieved
by selecting the highest scoring image fragments under the
assumption that the object is indeed present in the image.
Felzenszwalb and Huttenlocher [177] present a recognition
algorithm based on constellations of iconic feature representations
[150] that can also recognize articulated objects. The work is moti-
vated by the pictorial structure models first introduced by Fischler
and Elschlager [107]. The authors use a probabilistic formulation of
deformable parts which are connected by spring-like connections.
The authors indicate that this provides a good generative model of
objects which is of help with generic recognition problems. The
authors test the system with person-tracking systems.
Fig. 18. (left): A fully connected graphical model of a six-part object. (right): A star
model of the same object as proposed by Fergus et al. [174].
Leibe, Schiele and Leonardis [178,198,199] model objects using
a constellation of appearance parts to achieve simultaneous recognition
and segmentation. Image patches of 25 × 25 pixels are extracted
around each interest point detected using the Harris interest
point detector. Those patches are compared to codebook entries
of patches which were discovered by agglomerative clustering on a
codebook of appearance patches of an object of interest. The simi-
larity criterion is based on the normalized grey-scale correlation.
From each such cluster its center is selected as the representative
patch for the cluster. Once an image patch is matched to a codebook
entry, that codebook entry casts votes for the likely objects it might
have come from and places a vote in the image of the object center
with respect to the object patch. This voting mechanism is used to
select the most likely object identity. The likely object is backpro-
jected onto the image and this provides a verification and segmen-
tation of the object. In [198] this work is extended to achieve a
greater amount of scale invariance. Opelt et al. [196] use a similar
approach, except that instead of using patches of appearance, they
use pairs of boundary fragments extracted using the Canny opera-
tor in conjunction with an Adaboost based classification. Li et al.
[179] present what amounts to a feature selection algorithm for
selecting the most meaningful features in the presence of a large
number of distracting and/or irrelevant features. Ferrari et al.
[180] present a method for recognition which initially detects a
single discriminative feature in the image and by exploring the im-
age region around that feature, slowly grows the set of matching
image features.
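The codebook voting mechanism described above for [178] can be sketched as a simple accumulator over candidate object centres. The data layout (a list of matched patches and a table of stored offsets per codebook entry) and the equal splitting of each match's vote among its offsets are assumptions made for this example only.

```python
import numpy as np

def vote_for_centre(matches, offsets, image_shape):
    """matches: list of (y, x, codebook_id) for patches matched to the codebook.
    offsets: dict mapping codebook_id to a list of (dy, dx) displacements from
    the patch to the object centre, as observed during training."""
    acc = np.zeros(image_shape)
    for y, x, cid in matches:
        entry = offsets.get(cid, [])
        for dy, dx in entry:
            cy, cx = y + dy, x + dx
            if 0 <= cy < image_shape[0] and 0 <= cx < image_shape[1]:
                # Each matched patch distributes one vote among its offsets.
                acc[cy, cx] += 1.0 / len(entry)
    peak = np.unravel_index(np.argmax(acc), acc.shape)
    return (int(peak[0]), int(peak[1])), acc
```

The patches that voted for the winning centre can then be projected back onto the image, which is the basis of the verification and segmentation step mentioned above.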
From this survey on local features and constellation methods,
we see that most research efforts in the field have been applied
to the feature grouping phase of the pipeline in Fig. 1. In Section 3
we will discuss a number of active recognition approaches, which
in conjunction with local feature based approaches and constella-
tion methods, form an alternative framework for viewpoint selec-
tion, object hypothesis formation and object verification (Fig. 1).
2.10. Grammars and related graph representations
An often encountered argument in linguistics refers to the need
to use a sparse set of word representations in any given language,
as a means of ensuring redundancy and efficient communication,
despite the existence of potentially ambiguous basic speech signals
[204,205]. As was first argued by Laplace [206], out of the large
set of words that could be formulated by taking random and
finite-length arrangements of the letters in any popular alphabet
(such as the Latin or Greek alphabets), it is this sparsity of chosen
words, and the familiarity associated with some subset of these words,
that makes a valid word stand out as a non-random arrangement
of letters.
This has motivated the vision community to conduct research
into the use of grammars as a means of compactly encoding the
fact that certain parts of an image tend to occur more often in
unison than at random. This in turn precipitates the construction of
compact representations, with all the associated benefits [26].
Thus, grammars provide a formalism for encoding certain recurring
ideas in the vision literature, such as using 2-D and volumetric-
parts for constructing compact object representations, as we have
earlier discussed. As we will demonstrate in this section, the parse
trees associated with a particular grammar, provide a simple
graph-based formalism for matching object representations to
parsed-image representations, and for localizing objects of interest
in an image. It is important to point out that in practice, the pub-
lished literature does not tend to distinguish the recognition algo-
rithms as being exemplar or generic. An early identification of the
task and scope that a particular algorithm is meant to solve, can af-
fect the graph based recognition architecture used. Thus, within
the context of the pipeline in Fig. 1, grammars are meant to offer
a compact, redundant and robust approach to feature grouping.
More formally, a grammar consists of a 4-tuple $G = (V_N, V_T, R, S)$,
where $V_N$ and $V_T$ are finite sets of non-terminal and terminal nodes
respectively, S is a start symbol, and R is a set of functions referred
to as production rules, where each such function is of the form
$\gamma: \alpha \rightarrow \beta$ for some $\alpha, \beta \in (V_N \cup V_T)^{+}$. A language associated with a
grammar G denotes the set of all possible strings that could be
generated by the applications of compositions of production rules
from this grammar. A stochastic grammar associates a probability
distribution with the grammar’s language. Given a string from a
language, the string’s parse tree denotes a sequence of production
rules associated with the corresponding grammar, which generate
the corresponding string. Image grammars use similar production
rules, in order to define in a compact way generative models of ob-
jects, thus, facilitating the generalizability of object recognition
systems which use such production rules for the object representa-
tions. An interesting observation from Table 3 is that very little
work has been done on grammars and graph representations that
simultaneously incorporate function, context, 3D (both in sensing
and object representations), texture and efficient training strategies.
This is a good indication that within the context of graph models
and hierarchical representations, the previously discussed
semantic-gap problem for bridging low level and high level representations
is still open. As will be discussed in Section 3, within
the active vision paradigm a number of similar problems emerge.
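As a toy illustration of these definitions, the sketch below samples a string from a small stochastic grammar and records the production rules applied, which together form the parse tree of the generated string. The rule set is entirely hypothetical, and for simplicity every rule has a single non-terminal on its left-hand side (a context-free special case of the general production rules defined above).

```python
import random

# Non-terminal -> list of (probability, right-hand side). Symbols that do not
# appear as keys are terminals. This toy rule set is purely illustrative.
RULES = {
    "FACE": [(1.0, ["EYES", "NOSE", "MOUTH"])],
    "EYES": [(0.8, ["eye", "eye"]), (0.2, ["eye"])],
    "NOSE": [(1.0, ["nose"])],
    "MOUTH": [(0.5, ["open_mouth"]), (0.5, ["closed_mouth"])],
}

def derive(symbol, rng, applied):
    """Expand symbol into terminals, appending each applied rule to applied."""
    if symbol not in RULES:
        return [symbol]
    options = RULES[symbol]
    rhs = rng.choices([o[1] for o in options],
                      weights=[o[0] for o in options])[0]
    applied.append((symbol, tuple(rhs)))
    out = []
    for s in rhs:
        out.extend(derive(s, rng, applied))
    return out

rng = random.Random(0)
applied_rules = []
string = derive("FACE", rng, applied_rules)  # start symbol S = "FACE"
```

The probability of the sampled string under this derivation is the product of the probabilities of the rules stored in applied_rules.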
Zhu and Mumford [205] classify the related literature on image
grammars into four streams. The earliest stream is attributed to Fu
[207] who applied stochastic web grammars and plex grammars to
simple object recognition tasks. These web and plex grammars
are generalizations of the linguistic grammars earlier discussed,
and are meant to provide a generalization of the standard gram-
mars to 2-D images.
The second stream is related to Blum’s work on medial axes. In-
spired by Blum’s argument [70,208] that medial axes of shape out-
lines are a good and compact representation of shape, Leyton [209]
developed a grammar for growing more complex shapes from sim-
ple objects. More recent work has expanded the scope of graph
based algorithms using shock graphs [181,210–215]. Inspired by
Blum’s concept of the medial axis and given the significance that
symmetry plays in parts-based-recognition systems—symmetry
in generalized cylinders/geons for example—algorithms have ap-
peared for encoding the medial axis of an object in a graph struc-
ture and matching the graph structures of two objects in order to
achieve recognition. Shock graphs encode the singularities that
emerge during the evolution of the ‘‘grassfire’’ that defines the
skeleton/medial-axis of the object. These are the ‘‘protrusion’’,
‘‘neck’’, ‘‘bend’’ and ‘‘seed’’ singularities. Thus, these shocks can
be used to segment the medial axis into a tree-like structure.
Fig. 19 provides examples of the four kinds of shocks that are
typically encountered in an object’s medial axis. By encoding those
shocks into a tree-like structure (see Fig. 20) the recognition prob-
lem is reduced to that of graph/tree matching. As this corresponds
to the largest subgraph isomorphism problem, a large portion of
the research has focused on efficient techniques for matching
two tree structures and making them robust in the presence of
noise. An interesting matching algorithm is proposed by Siddiqi
et al. [181]. The authors represent a shock tree using a 1–0 adjacency
matrix. The authors show that finding the two shock subtrees
whose adjacency representation has the same eigenvalue
sum, provides a good heuristic to finding the largest isomorphic
subtrees and thus, achieves recognition via matching with a tem-
plate object’s shock tree (see Fig. 20). In general, shock graphs pro-
vide a powerful indexing mechanism if segmentations/outlines of
the desired objects are provided. Using such approaches on arbi-
trary images however, requires a segmentation phase, which in
turn remains an unsolved problem and is also probably the most
fundamental problem in computer vision.
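The eigenvalue-based heuristic can be illustrated with the sketch below. It assumes a symmetric 0–1 adjacency matrix and interprets the "eigenvalue sum" as the sum of eigenvalue magnitudes, since the plain sum of the eigenvalues of such a matrix equals its trace and is therefore always zero. This is a simplification for illustration and not necessarily the exact signature used in [181].

```python
import numpy as np

def adjacency(num_nodes, edges):
    """Symmetric 0-1 adjacency matrix of an undirected tree."""
    a = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

def eigen_signature(num_nodes, edges):
    """Sum of eigenvalue magnitudes; unchanged by any relabelling of nodes."""
    eig = np.linalg.eigvalsh(adjacency(num_nodes, edges))
    return float(np.abs(eig).sum())

tree_a = [(0, 1), (0, 2), (1, 3), (1, 4)]
tree_b = [(4, 3), (4, 2), (3, 1), (3, 0)]   # tree_a with relabelled nodes
path = [(0, 1), (1, 2), (2, 3), (3, 4)]     # a different tree on 5 nodes
```

Isomorphic trees always receive the same signature, whereas equal signatures do not guarantee isomorphism, which is why the criterion serves as a heuristic for pruning candidate matches.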
The third stream proposed by Zhu and Mumford [205] refers to
more recent work which was inspired from Grenander’s work on
General Pattern Theory [97]. According to this paradigm, patterns in
nature (including images) are formed by primitives called genera-
tors. The outputs of these generators are joined together using var-
ious graph-like criteria. Random diffeomorphisms applied to these
patterns add another degree of generalization to the generated pat-
terns. And-Or graphs lie within this stream (see Fig. 21). An And-Or
graph uses conjunctions and disjunctions of simple patterns/gener-
ators to define a representation of all the possible deformations of
the object of interest.
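A minimal sketch of how an And-Or graph enumerates object configurations is shown below: an And node requires all of its children (a conjunction), while an Or node selects one of them (a disjunction). The graph itself is a hypothetical example and does not reproduce the contents of Fig. 21.

```python
from itertools import product

# Node -> (kind, children). Nodes that do not appear as keys are
# primitive patterns. The graph is purely illustrative.
GRAPH = {
    "clock": ("and", ["frame", "hands"]),
    "frame": ("or", ["round_frame", "square_frame"]),
    "hands": ("or", ["two_hands", "three_hands"]),
}

def configurations(node):
    """List every configuration (list of primitives) that node can produce."""
    if node not in GRAPH:
        return [[node]]
    kind, children = GRAPH[node]
    child_sets = [configurations(c) for c in children]
    if kind == "or":
        # Disjunction: any configuration of any single child.
        return [cfg for s in child_sets for cfg in s]
    # Conjunction: one configuration from every child, concatenated.
    return [sum(combo, []) for combo in product(*child_sets)]

all_clocks = configurations("clock")
```

Three nodes suffice here to represent four distinct configurations, which hints at the compactness that motivates such representations.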
The fourth stream is similar to the previous stream, with the
main difference being that an extremely sparse image coding mod-
el is used (employing simple image bases derived from Gabor fil-
ters parameterized by scale, orientation and contrast sensitivity
for example) and that the related grammars can be viewed as being
stochastic context free grammars.
A number of feedforward hierarchical recognition algorithms
have been proposed over the years [216,217,104,218,105,
106,183,184,219–221]. Such hierarchical architectures can be
associated with the grammars discussed so far. One of the main
characteristics of such hierarchical representations is that they of-
ten strive for biological plausibility. Typically, such algorithms de-
fine a multiscale feedforward hierarchy, where at the lowest level
of the feedforward hierarchy, edge and line extraction takes place.
During a training phase, combinations of such features are discov-
ered, forming a hierarchical template that is typically matched to
an image during online object search.
For example, LeCun’s work on convolutional networks
[217,222,182], and a number of its variants, have been successfully
used in character recognition systems. Convolutional networks
combine the use of local receptive fields, the use of shared weights,
and spatial subsampling. The use of shared weights and subsam-
pling adds a degree of shift invariance to the network, which is
important since it is difficult to guarantee that the object of inter-
est will always be centered in the input patch that is processed by
the recognition algorithm. Because convolutional networks are
purely feedforward they are also easily parallelizable, which has
contributed to their popularity.
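The effect of shared weights and subsampling on shift behaviour can be demonstrated with the sketch below: applying the same kernel at every position makes the feature map shift together with the input, and pooling then absorbs part of that shift. This is a single untrained layer written for illustration, with hypothetical function names, and is not the architecture of [182].

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide one shared kernel over the image (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # The same weights are used at every location (weight sharing).
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Spatial subsampling: keep the maximum of each size x size block."""
    h = (feature_map.shape[0] // size) * size
    w = (feature_map.shape[1] // size) * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

kernel = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])
original = np.zeros((12, 12))
original[2:5, 2:5] = kernel
moved = np.zeros((12, 12))
moved[4:7, 6:9] = kernel   # the same pattern shifted by (2, 4) pixels
pooled_a = max_pool(correlate_valid(original, kernel))
pooled_b = max_pool(correlate_valid(moved, kernel))
```

In this example the pooled response to the shifted pattern is a shifted copy of the original response, and its peak value is identical, so a later stage that looks only at the strongest response is unaffected by the displacement.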
Another goal of hierarchical architectures is to provide an effi-
cient grammar for defining a set of re-usable object parts. These
parts are typically meant to enable us to efficiently compose multi-
ple views of multiple objects. As a result, the problem of efficient
feature grouping (see Fig. 1) that removes any ambiguities due to
environmental noise or poor imaging conditions, keeps re-emerg-
ing in the literature on hierarchical recognition systems. The resul-
tant ambiguities are one of the main reasons why hierarchical
representations do not scale very well when dealing with
thousands of object classes. For example, Ommer et al. [183]
and Ommer and Buhmann [184] propose a characteristic
methodology that attempts to deal with the complexity of real
world image categorization, by first performing a perceptual
bottom-up grouping of edges followed by a top-down recursive
grouping of features.
Fig. 19. The four types of shocks, as described by Siddiqi et al. [181].
Fig. 20. The shock trees of two objects (top row) and the correspondences between the trees and the medial axis of two views of an object (bottom row). Adapted from Siddiqi
et al. [181].
It is important to point out that unrestricted object representa-
tion lengths and unrestricted representation class sizes can lead to
significant problems when learning a new object’s representation
[26]. Often with graph-like models, and especially in early re-
search, their representation strength (number of nodes and edges)
is hand-picked for the training dataset of interest, which can
potentially lead to a significant bias when tested with new datasets.
Deng et al. [185,186], present an algorithm for learning hierar-
chical relations of semantic attributes from labeled images. These
relations are similar to predicates, and are arranged in the form
of a hierarchical tree structure. The closeness of nodes in this hier-
archical structure can be used to match similar images, which is in
turn used for image classification purposes. The authors also intro-
duce a hashing algorithm for sublinear image retrieval. Similarly,
Bart et al. [187,188] describe a graphical model for learning object
taxonomies. Each node in this hierarchy represents information
that is common to all the hierarchy’s paths that pass from that
node, providing a compact representation of information. The
authors present a Gibbs sampling based approach for learning
the model parameters from training data and discuss potential
applications of such taxonomies in recognition. More recent work
[189] demonstrates the continuous evolution of research on
graph-based object representations using massively large training
data sets and clusters of thousands of CPUs.
2.11. Some more object localization algorithms
We now focus on object localization algorithms which are ro-
bust, and which localize objects in an image efficiently. This pro-
vides an overview of the approaches attempted over the years
for efficiently localizing objects in a static image. In subsequent
sections we deal with the more complex case of active object local-
ization and recognition, where we also have to physically move the
sensor over large distances. As we will see, and as is evident from
other localization algorithms that were previously discussed with-
in other contexts (e.g., [191]), object localization efficiency is clo-
sely related to improvements in the hypothesis generation and
object verification module of recognition systems (Fig. 1). The
breadth of algorithms tested for improving search efficiency is vast.
They range from simple serial search mechanisms with winner-
take-all, and reach all the way to complex systems, integrating par-
allel search with probabilistic decision making that make use of
function, context, and hierarchical object representations. Within
this context, active vision plays an important role. As such, this sec-
tion also serves as a precursor to the active search and recognition
systems discussed in Section 3.
As previously discussed [26], time, noise, as well as other cost
constraints, can make the problem significantly more difficult.
Furthermore, as discussed in [24,26], searching for an object in
an object class without knowledge of the exact target we are
looking for—as the target appearance could vary drastically due
to occlusions for example—makes the problem intractable as
the complexity of the object class increases. This issue becomes
more evident in the case of the SLAM problem [26], where we
want to simultaneously localize but also learn arbitrary new ob-
jects/features that might be encountered in the scene. This leads
to the slightly counter-intuitive conclusion that the feature
detection algorithm/sensor used must have a noise rate that is
neither too high nor too low, since requiring too low an object
detector noise rate makes the online learning of new
objects prohibitively expensive. In the active object localization
problem, where we typically have to move robotic platforms to
do the search in 3D space under occlusions, and we know a priori
the object we are searching for, any reduction in the total num-
ber of such mechanical movements would have a drastic effect
on the search time and the commercial viability of our solution.
Thus, a central tenet of our discussion in this section involves
efficient algorithms for locating objects in an environment. In
Chart 6 and Table 4 we present a comparison, along certain
dimensions, for a number of the papers surveyed in Sections
2.11 and 2.12.
Avraham and Lindenbaum present a number of interesting pa-
pers that use a stochastic bottom-up attention model for visual
search based on inner scene similarity [223–225]. Inner scene sim-
ilarity is based on the hypothesis that search task efficiency de-
pends on the similarities between scene objects and the
similarities between target models and objects located in the
scene. Assume, for example, that we are searching for people in a
scene containing people and trees. An initial segmentation
provides image regions containing either people or trees. An initial
application of the detector to a tree segment returns "no" as an
answer. We can thus place a lower priority on all the image segments
having features similar to those of the rejected segment, thereby
speeding up the search. These ideas were first put forward by Duncan
and Humphreys [226], who rejected the parallel vs. serial search idea
put forward by Treisman and Gelade and suggested that a hierarchical
bottom-up segmentation of the scene takes place first with similar
segments linked together. Suppression of one segment propagates
to all its linked segments, potentially offering another explanation
for the pop-out effect. Avraham and Lindenbaum demonstrate that
their algorithm leads to an improvement compared to random
search and they also provide measures/grades indicating how easy
it is to find an object. They define a metric based on the feature
space distances between the various features/regions that are dis-
covered in an image. Assuming that each such feature might corre-
spond to a target, they present some lower and upper bounds on
the number of regions that need to be queried before a target is
Fig. 21. (left) A grammar, its universal And-Or tree, and a corresponding parse-tree shown in shadow. A parse tree denotes a sequence of production rules associated with the
corresponding And-Or tree, which generate a corresponding string of terminals. (right) An And-Or tree showing how elements a, b, c could be bound into structure A using
two alternative ways. Adapted from Zhu and Mumford [205].
852 A. Andreopoulos, J.K. Tsotsos / Computer Vision and Image Understanding 117 (2013) 827–891
discovered. Three search algorithms are presented and some
bounds on their performance are derived:
1. FNN – Farthest Nearest Neighbor. Given the set of candidates’
feature vectors or segments $\{x_1, \ldots, x_n\}$, compute the
distance of each such feature vector/segment to its nearest neighbor,
and order the features by descending distance. Query the
object detector module in that order until the object of interest is
found. The idea is that the target object is usually different from
most of the rest of the image, so it should be close to the top of
the list.
2. FLNN – Farthest Labeled Nearest Neighbor. Given the set of
candidates’ feature vectors/segments $\{x_1, \ldots, x_n\}$, randomly
choose one of these feature vectors/segments and label it using the
object detector. Repeat until an object is detected. For each
unlabeled feature vector/segment, calculate its distance to the
nearest labeled neighbor. Choose the feature vector/segment
with maximum distance to query with the object detector and
get its label. Repeat.
3. VSLE – Visual Search Using Linear Estimation. Define the
covariance between two binary labels $l(x_i)$, $l(x_j)$ as
$\mathrm{cov}(l(x_i), l(x_j)) = c(d(x_i, x_j))$ for some function $c$
and distance function $d$. Since
the labels which denote the presence or absence of the target
are binary, their expected values denote the probability that
they take a value of 1. Given that we have estimated the labels
$l(x_1), \ldots, l(x_m)$ for $m$ feature vectors/segments, we seek to
obtain a linear estimate
$\hat{l}_k = a_0 + \sum_{i=1}^{m} a_i \, l(x_i)$ which minimizes the
Chart 6. Summary of the 1994–2006 papers from Table 4. We notice that search efficiency was not consistently a primary concern in the localization literature since many
algorithms tended to use an inefficient sliding window approach to localize objects. Furthermore content-based image retrieval systems mostly focused on the classification
of individual images and not on the localization problem within individual images. Nevertheless from Table 4 we see that search efficiency was identified as an important
topic in a number of papers. CBIR systems were focused on using a diverse set of efficient indexing primitives. However it is far from clear that they achieved the inference
scaling properties desired, since in order to make such systems responsive and user friendly, often accuracy was sacrificed in favor of query efficiency. We also notice very
little use of 3D in these systems.
Table 4
Comparing some of the more distinct algorithms of Sections 2.11 and 2.12 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are
used to indicate the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a
not-applicable label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity
increases. Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a
detection algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve
detection efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of
the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of
indexing primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is
used by the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.

| Papers (1994–2006) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Avraham and Lindenbaum [224] | ** | *** | N/A | N/A | N/A | *** | * | * |
| Draper et al. [228] | * | ** | ** | * | *** | *** | * | * |
| Paletta et al. [229] | * | *** | ** | * | ** | ** | * | * |
| Greindl et al. [231] | * | *** | ** | * | ** | ** | * | * |
| Bandera et al. [232] | * | *** | ** | N/A | N/A | *** | * | * |
| Darrell [233] | * | ** | ** | ** | ** | ** | * | * |
| Tagare et al. [234] | * | *** | ** | ** | ** | * | * | * |
| Torralba et al. [237] | *** | * | * | * | **** | * | * | ** |
| Opelt et al. [238,196] | *** | * | * | ** | *** | * | * | * |
| Amit and Geman [240] | ** | ** | *** | *** | ** | * | * | * |
| Piater [241] | ** | ** | *** | ** | *** | ** | * | * |
| Viola et al. [236] | *** | * | * | * | *** | * | * | * |
| Fleuret and Geman [242] | ** | *** | ** | *** | ** | * | * | * |
| Flickner et al. [244] | ** | N/A | ** | ** | *** | * | * | ** |
| Gupta and Jain [245] | ** | N/A | ** | ** | *** | * | * | ** |
| Mukherjea et al. [246] | ** | N/A | ** | ** | *** | * | * | ** |
| Pentland et al. [247] | ** | N/A | ** | ** | *** | * | * | ** |
| Smith and Chang [248] | ** | N/A | ** | ** | *** | * | * | ** |
| Wang et al. [249] | ** | N/A | ** | ** | *** | * | * | ** |
| Ma and Manjunath [250] | ** | N/A | ** | ** | *** | * | * | ** |
| Laaksonen et al. [251] | ** | ** | ** | ** | *** | * | * | ** |
mean square error $E((l(x_k) - \hat{l}_k)^2)$. It can be shown that an
optimization of this expected value depends on the covariance
of various pairs of labels. Given an image with $n$ feature
vectors/segments, calculate the covariance for each pair. Then,
select the first candidate randomly or based on prior knowledge.
Then, at iteration $m + 1$, estimate $\hat{l}_k$ for all
$k \geq m + 1$, based on the known labels, and query the oracle about
the candidate $k$ for which $\hat{l}_k$ is maximum. If enough targets
are found, abort; otherwise repeat.
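The first two strategies can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of [223–225]: candidates are plain feature tuples, the distance is Euclidean, the object detector is modeled as an oracle callback `is_target`, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def fnn_order(candidates: List[Vector]) -> List[int]:
    """Candidate indices sorted by descending distance to their nearest neighbor.

    Assumes at least two candidates.
    """
    def nn_dist(i: int) -> float:
        return min(math.dist(candidates[i], candidates[j])
                   for j in range(len(candidates)) if j != i)
    return sorted(range(len(candidates)), key=nn_dist, reverse=True)


def fnn_search(candidates: List[Vector],
               is_target: Callable[[int], bool]) -> List[int]:
    """FNN: query the detector in FNN order; return the indices queried."""
    queried = []
    for i in fnn_order(candidates):
        queried.append(i)
        if is_target(i):
            break
    return queried


def flnn_search(candidates: List[Vector],
                is_target: Callable[[int], bool],
                first: int = 0) -> List[int]:
    """FLNN: always query the unlabeled candidate farthest from every labeled one.

    `first` stands in for the random initial choice so the sketch is repeatable.
    """
    labeled = [first]
    if is_target(first):
        return labeled
    unlabeled = [i for i in range(len(candidates)) if i != first]
    while unlabeled:
        nxt = max(unlabeled,
                  key=lambda i: min(math.dist(candidates[i], candidates[j])
                                    for j in labeled))
        unlabeled.remove(nxt)
        labeled.append(nxt)
        if is_target(nxt):
            break
    return labeled
```

In a scene where four candidates cluster together and one lies far away in feature space, both functions reach the outlier after at most two detector queries, which is the behavior the inner scene similarity hypothesis predicts.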
The authors perform a number of tests and demonstrate that the
VSLE algorithm in particular leads to an improvement in the
number of detected objects and the speed at which they are
detected, compared to a random search using the Viola-Jones
detection algorithm [227].
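For reference, the coefficients of the VSLE linear estimate can be written out using the standard linear minimum mean-square-error solution. The notation below is introduced here for illustration and is not taken from [223–225]: $\mu_i = E[l(x_i)]$, $R$ is the $m \times m$ covariance matrix of the already labeled candidates (assumed invertible), and $r_k$ is the vector of covariances between the unknown label $l(x_k)$ and the known labels, with all covariances given by $c(d(\cdot,\cdot))$.

```latex
\hat{l}_k = a_0 + \sum_{i=1}^{m} a_i \, l(x_i), \qquad
(a_1, \ldots, a_m)^{\top} = R^{-1} r_k, \qquad
a_0 = \mu_k - \sum_{i=1}^{m} a_i \, \mu_i,
\]
\[
R_{ij} = \mathrm{cov}\big(l(x_i), l(x_j)\big) = c\big(d(x_i, x_j)\big), \qquad
(r_k)_i = \mathrm{cov}\big(l(x_k), l(x_i)\big) = c\big(d(x_k, x_i)\big).
```

Under these assumptions the estimate depends on the data only through pairwise covariances, which is consistent with the remark that the optimization depends on the covariance of pairs of labels.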
Draper et al. [228] present a Markov Decision Process (MDP)
based approach for performing a form of online feature selection,
where an MDP is used to determine which set of recognition
modules should be applied for detecting a set of houses in aerial
images. Paletta et al. have published a number of papers proposing
the use of reinforcement learning techniques based on Q-learning
for deciding which image sub-regions the recognition module
should attend to and extract features from in order to achieve
recognition [229,230]. Greindl et al. [231] present an attention
based mechanism using a sequence of hierarchical classifiers for
attending to image regions and recognizing objects from images. Bandera
et al. [232] present work that is similar to Paletta’s work in that
they also propose a Q-learning based approach for determining
the fixation regions that would lead to the greatest decrease in en-
tropy and object class discrimination. The features used in the
experiments are simple corner based features and are encoded in
a vector denoting the presence or absence of each feature in a
scene. Recognition is achieved using a neural network trained on
such example vectors. Note that in none of these papers do the
authors use active cameras. Darrell [233] presents a formulation
based on Partially Observable Markov Decision Processes (POMDP)
with reinforcement learning to decide where to look to discrimi-
nate a target from distractor patterns. The authors apply their ap-
proach to the problem of gesture recognition.
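The fixation-selection schemes above build on the tabular Q-learning update. The sketch below is a generic illustration rather than the formulation of [229,230,232]: the state and action encodings, the learning rate `alpha`, the discount `gamma`, and the use of the decrease in class-posterior entropy as the reward (suggested by the description of Bandera et al.) are all assumptions made here.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def entropy(posterior: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a posterior over object classes."""
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, next_actions: Iterable[Hashable],
             alpha: float = 0.5, gamma: float = 0.9) -> float:
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: fixating region "r3" sharpens the class posterior,
# and the entropy decrease is used as the reward for that fixation.
q_table: QTable = defaultdict(float)
before, after = [0.5, 0.5], [0.9, 0.1]
reward = entropy(before) - entropy(after)
q_update(q_table, "start", "r3", reward, "after_r3", ["r1", "r2"])
```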
Tagare et al. [234] present a maximum likelihood attention
algorithm. The algorithm identifies object parts and features with-
in each object part. They propose pairs of object parts and part fea-
tures which most likely come from the object. In many ways the
algorithm is an interpretation tree algorithm formulated using
attention-like terminology. Probability densities are defined for
the chance of occlusion—using a Poisson distribution—and a max-
imum likelihood estimation is performed to determine the most
probable part-feature pair, the second most probable part-feature
pair, etc. Features used in their experiments include corners and
edges. The algorithm ends up evaluating only about 2% of all
part-feature pairs on the test images used by the authors.
Torralba et al. [237] present a method for multiclass object
detection that learns which features are common in various dispa-
rate object classes. This allows the multiclass object detector to
share features across classes. Thus, the total number of features
needed to detect objects scales logarithmically with the number
of classes. Many object localization algorithms train a binary clas-
sifier for each object the algorithm is attempting to localize, and
slide a window across the image in order to detect the object, if
it exists. As the authors argue, the use of shared features improves
recognition performance and decreases the number of false posi-
tives. The idea is that for each object class c, a strong classifier H
is defined which is a summation of a number of weak classifiers.
Each of the weak classifiers is trained to optimally detect a subset
of the C object classes. Since in practice there are $2^C$ such subsets,
the authors suggest some heuristics for reducing the complexity.
Linear combinations of these weak classifiers under a boosting
framework are acquired which provide the strong classifiers H.
The authors’ working hypothesis is that by fitting each weak clas-
sifier on numerous classes simultaneously, they are effectively
training the classifiers to share features, thus, improving recogni-
tion performance and requiring a smaller number of features.
The authors present some results demonstrating that joint boost-
ing offers some improvements compared to independent boosting
under a ROC curve of false positives vs. detection rate. They also
present some results on multiview recognition, where they simply
use various views of each object class to train the classifiers. As
with the Viola and Jones algorithm, the authors demonstrate a
characteristic of the algorithm which makes it suitable for
localizing objects in scenes using the shifted template/window
approach, namely, its small number of false positives.
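The feature-sharing idea can be illustrated with a short sketch. It is not the joint boosting procedure of [237]: the weak classifiers are hand-written threshold functions rather than learned ones, and the names `strong_score` and `num_candidate_subsets` are hypothetical. It only shows how a strong classifier for class c sums the weak classifiers whose sharing subset contains c, and why exhaustive search over sharing subsets is impractical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Feature = Sequence[float]
# A weak classifier is paired with the subset of classes that share it.
WeakClassifier = Tuple[FrozenSet[str], Callable[[Feature], float]]


def num_candidate_subsets(num_classes: int) -> int:
    """Number of non-empty class subsets a weak classifier could be fitted to."""
    return 2 ** num_classes - 1


def strong_score(cls: str, v: Feature, weak: List[WeakClassifier]) -> float:
    """Sum the responses of every weak classifier whose subset contains `cls`."""
    return sum(h(v) for subset, h in weak if cls in subset)


weak_classifiers: List[WeakClassifier] = [
    # Shared by two classes: its feature is evaluated once for both.
    (frozenset({"face", "car"}), lambda v: 1.0 if v[0] > 0.5 else -1.0),
    # Specific to a single class.
    (frozenset({"face"}), lambda v: 0.5 if v[1] > 0.2 else -0.5),
]
```

With 20 classes there are already more than a million non-empty subsets to consider per boosting round, which is why heuristics are needed.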
Opelt et al. [238] present a boundary fragment model approach
for object localization in images. Edges are detected using a Canny
edge detector, from pre-segmented image regions containing the
object of interest and containing a manually annotated centroid
of each object. A brute force approach searches through the bound-
ary fragments and sub-fragments, and a matching score of the frag-
ment with a validation set is calculated. The matching score is
based on the Chamfer distance of the fragments from the frag-
ments located in each image of the validation set. The matching
score also depends on how close the centroids of each object are
to each other; each fragment is associated to its centroid. The
authors define a set of weak detectors which typically learn pairs
or triples of boundary fragments that lead to optimal classification.
Those weak detectors are joined into a strong detector which rec-
ognizes the desired object from an image. Overall, the use of
boundary fragments makes this algorithm quite robust under illu-
mination changes, and should be quite robust for solving simple
exemplar-like detection tasks. Its high complexity is a significant
drawback of the algorithm though. Simple approaches to multi-
view object localization are proposed. Their work is further ex-
panded upon in [196].
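The Chamfer-based matching score can be sketched as follows. This is an illustration under simplifying assumptions and not the scoring function of [238]: fragments and edge maps are lists of 2D points, only translations are searched, the centroid term is omitted, and the function names are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def chamfer_distance(fragment: List[Point], edges: List[Point]) -> float:
    """Mean distance from each fragment point to its closest edge point."""
    if not fragment or not edges:
        raise ValueError("fragment and edges must be non-empty")
    return sum(min(math.dist(p, e) for e in edges) for p in fragment) / len(fragment)


def best_offset(fragment: List[Point], edges: List[Point],
                offsets: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Brute-force search for the translation with the lowest Chamfer distance."""
    def shifted(dx: float, dy: float) -> List[Point]:
        return [(x + dx, y + dy) for x, y in fragment]
    return min(offsets, key=lambda o: chamfer_distance(shifted(*o), edges))
```

The nested minimum makes the cost of each evaluation proportional to the product of the fragment and edge-map sizes, which gives a sense of why brute-force fragment matching is expensive; in practice a distance transform of the edge map is commonly precomputed to avoid the inner loop.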
Amit and Geman [240] present a computational model for local-
izing object instances. Local edge fragments are grouped based on
known geometrical relationships of the desired object that were
learnt during the training phase. The authors note that even
though their search model is not meant to be biologically plausible,
it exhibits some of the characteristics of neurons in the Inferior
Temporal Cortex such as scale and translation invariance.
A local feature based method for object localization is also pre-
sented by Piater [241]. Steerable filters are used to extract features
at various scales. These features are used to extract blobs, corners
and edges from images. These provide salient image features. As
Amit and Geman [240] did, Piater clusters these features into com-
pound features based on their geometric relations. Piater’s thesis is
interesting in that it proposes an online algorithm for learning new
discriminative features and retraining the classifier. The algorithm
is composed mainly of a Bayesian network/classifier that
uses features to achieve recognition, and a feature learning system
that uses the Bayesian classifier to train on training images and de-
cide when a new discriminative feature needs to be added to the
Bayesian network.
Viola and Jones [227,235,236] present a robust approach for ob-
ject localization (see Fig. 22). Their approach is ideally suited for
localizing objects due to the low number of false negatives it pro-
duces. Search is done by the simple method of shifting a template
across space and scale, making it somewhat inefficient in its
search. A number of Haar-like features are extracted from the
image. In the original formulation by Viola and Jones, Haar-like
templates provide features similar to first and second order derivatives
acquired from a number of orientations. By extracting from each
image location p a number of such features at different scales
and neighborhoods close to p, it is easy to end up with thousands of
features for each image pixel p. The authors propose using a
cascade of classifiers, where each classifier is trained using AdaBoost
to minimize the number of false negatives by assigning a lower
weight to features which tend to produce such false
negatives. However, the method is only suited for detecting simple
objects characterized by a small number of salient features, such as
faces and door handles [239]. See Fig. 23 for an example of door
handle localization using the Viola and Jones algorithm [227].
The algorithm would have problems detecting highly textured ob-
jects. Furthermore, the method’s training phase is extremely slow
and care must be taken during the training phase as it is easy to
end up with a classifier producing many false positives.
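The two ingredients described above, rectangle features and a rejecting cascade applied to a shifted window, can be sketched as follows. The integral image is the device used in the Viola and Jones papers to evaluate rectangle sums with four lookups; everything else here is a simplification of our own: a single scale, one hand-set threshold per stage instead of AdaBoost-trained stages, and hypothetical function names.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]
FeatureFn = Callable[[List[List[float]], int, int, int, int], float]


def integral_image(img: Image) -> List[List[float]]:
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[float]], top: int, left: int,
             height: int, width: int) -> float:
    """Sum of the pixels inside a rectangle, using four table lookups."""
    b, r = top + height, left + width
    return ii[b][r] - ii[top][r] - ii[b][left] + ii[top][left]


def two_rect_feature(ii: List[List[float]], top: int, left: int,
                     height: int, width: int) -> float:
    """Horizontal two-rectangle feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


def passes_cascade(ii: List[List[float]], top: int, left: int, size: int,
                   stages: List[Tuple[FeatureFn, float]]) -> bool:
    """Reject the window at the first stage whose response is below threshold."""
    for feature_fn, threshold in stages:
        if feature_fn(ii, top, left, size, size) < threshold:
            return False
    return True


def detect(img: Image, size: int,
           stages: List[Tuple[FeatureFn, float]]) -> List[Tuple[int, int]]:
    """Shift a size x size window over the image; keep windows that pass all stages."""
    ii = integral_image(img)
    h, w = len(img), len(img[0])
    return [(t, l) for t in range(h - size + 1) for l in range(w - size + 1)
            if passes_cascade(ii, t, l, size, stages)]
```

Because most windows fail an early stage, the average cost per window is far below the cost of evaluating every feature, even though the window itself is still shifted exhaustively.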
Fleuret and Geman [242] present a coarse-to-fine approach to
object localization. The authors measure their algorithm’s perfor-
mance in terms of the number of false positives and the amount
of on-line computation needed to achieve a small false negative
rate. The approach is tested on the face detection problem. The ob-
ject class is represented by a large disjunction of conjunctions.
Each conjunction represents salient object features under strict
known conditions in lighting, scale, location and orientation and
the disjunctions account for a large number of variations of these
conditions. The authors also present an interesting approach for
measuring the efficiency of object localization tasks based on the
number of branches followed in a decision tree, where each branch
might correspond to a search in a different image location, or
different scale. They use such ideas to argue in support of a
coarse-to-fine approach for object localization. In other words, when
searching for a certain object, we should first search across all
scales and at the first failure to detect a necessary feature in one
of the scales, search in a different image location. The authors pres-
ent a rigorous proof of the optimality of such a search under a sim-
pler model where the non-existence of the target is declared upon
the first negative feature discovered. They also indicate that a
coarse-to-fine approach was found to be optimal in a number of
simulations they performed, even though the proof for the general
case still eludes them.
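The disjunction-of-conjunctions representation and the early-abort behavior can be sketched as below. This is only an illustration of the evaluation order and of counting the on-line computation, not the method of [242]: feature tests are modeled as zero-argument callables, and `evaluate_dnf` is a hypothetical name.

```python
from typing import Callable, List, Tuple

FeatureTest = Callable[[], bool]


def evaluate_dnf(conjunctions: List[List[FeatureTest]]) -> Tuple[bool, int]:
    """Evaluate a disjunction of conjunctions of feature tests.

    Returns (detected, number_of_feature_tests_run). Each conjunction is
    abandoned at its first negative test, so ordering the tests from coarse
    to fine controls how much work is spent on background regions.
    """
    tests_run = 0
    for conjunction in conjunctions:
        satisfied = True
        for test in conjunction:
            tests_run += 1
            if not test():
                satisfied = False
                break  # first negative feature rejects this conjunction
        if satisfied:
            return True, tests_run
    return False, tests_run
```

The returned test count plays the role of the number of branches followed, which is the efficiency measure discussed above.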
A number of Monte-Carlo approaches for object localization
have also been attempted in an effort to escape the inefficiency
of exhaustive search across scale-space for an object. Two such
approaches, both based on particle filters for performing inference,
were proposed by Sullivan et al. [243] and Isard [202].
An overall observation is that many papers described as
localization algorithms simply follow the sliding window approach due
to their low false-positive or low false-negative rate. Few of the
original algorithms attempted to focus on salient regions based on
prior knowledge of the object they were searching for. Cognizant
of these problems, more recent work has focused on using saliency
algorithms in conjunction with task-directed biases to speed up
the search process [252]. As bottom-up algorithms do not use
any task-directed biases, the benefits they offer during visual
search only become evident when dealing with low-complexity
scenes where the objects of interest pop-out easily with respect
to the background. While there has been some effort in incorporat-
ing top-down biases in such systems, it is not clear whether they
are capable of offering benefits for foreground and background re-
gions of arbitrary complexity. As Avraham and Lindenbaum
showed, proper attentional mechanisms can lead to better perfor-
mance, so more focus on the problem is worthwhile. Notice that
usually no cost/time constraints are included in the formulation
of such papers. In subsequent sections we discuss a number of pa-
pers demonstrating the effect that such mechanisms can have on
search efficiency. Thus, within the context of the recognition
framework of Fig. 1, we notice a gradual shift of the effort ex-
pended on the hypothesis generation and object verification phase.
The shift is towards a preference for the integration of ever more
powerful and complex inference algorithms that improve search
efficiency, moving us beyond the impediments of the sliding win-
dow approach to recognition.
2.12. Content-based image retrieval
Every day around 2.5 quintillion bytes of data is created. It is
estimated that 90% of the data in the world was created over the
last two years [253]. A significant portion of this data consists of
digital pictures and videos. Online video and multimedia content
has experienced annual double digit growth for the last few years
[254], precipitating the need for steady improvements in the auto-
mated mining of video and pictorial data for useful information.
Numerous content based image retrieval (CBIR) systems have been
proposed for addressing this problem. Proposed CBIR solutions
typically lie at the intersection of computer vision, databases,
information retrieval [168], HCI and visualization/computer-
graphics. Arguably, the first workshop on the topic took place in
Florence in 1979 [168,255]. This was followed in 1992 by a work-
shop organized by the US National Science Foundation [168,256]
where the need for interactive image understanding was empha-
sized. The need to include a non-trivial interaction component is
what differentiates research on CBIR systems from the more
Fig. 22. The algorithm by Viola and Jones [227,235,236]. On the left some of the two, three and four-rectangle features/kernels used are shown. Tens of thousands of these
features are used as simple classifiers in an Adaboost framework, in order to define a cascade of classifiers which progressively prune image regions which do not contain the
object of interest (see the right subfigure).
classical object recognition algorithms previously discussed. See
[168,257–259] for good surveys on the topic. Some early influential
example systems in the commercial domain [258] include IBM’s
QBIC [244], VIRAGE [245], and the NEC AMORE system [246], as
well as the MIT Photobook [247], the Columbia VisualSEEK/WebSEEK
[248], Stanford’s WBIIS system [249] and the UCSB NeTra system
[250] from academia.
CBIR has been a vibrant topic of research. Huijsmans and Sebe
[260] note that image search can be split into three main catego-
ries: (i) search by association, where an iterative process is used
to refine the browsed images, (ii) aimed search where the user
explicitly specifies the image he wishes to search for, and (iii) cat-
egory search where a more loosely defined semantic class is
searched for (which could be defined by a text string, an image
example, or a combination of both).
Similarly, Datta et al. [258] categorize CBIR systems based on
four axes: User Intent, Data Scope, Query Modalities and
Visualization. User Intent is characterized by the clarity of the
user’s intent: the user may be browsing for pictures with no clear
goal (in which case the user is a Browser), may have a moderate goal
that slowly becomes clearer (the user is a Surfer), or may have from
the very beginning a very clear understanding of what he or she is
looking for (the user is a Searcher). Identifying the user type and
adjusting the user interface accordingly could vastly improve the
user experience and potentially influence the commercial viability
of a system, exemplifying the importance of the HCI component in
the design of a CBIR system.
Clarifying the scope of the image and video data can also be
very important since this can influence the system design in terms
of how reliable the image search has to be, how fast and responsive
the underlying hardware architecture has to be as well as what
type of user interface to implement. These dimensions could be
particularly important in the case of social media websites for
example, where users tend to share photo and video albums using
a variety of data acquisition modalities (e.g., smartphones). Datta
et al. [258] classify the image and video data based on whether
it is intended for: (i) a personal collection expected to be stored lo-
cally, to be of relatively small size and to be accessible only to its
owner, (ii) a domain specific collection such as medical images or
images and videos acquired from a UAV, (iii) enterprise collection
for pictures and videos available in an intranet and potentially not
stored in a single central location, (iv) archives for images and vid-
eos of historical interest which could be distributed in multiple
disk arrays, accessible only via the internet and requiring different
levels of security/access controls for different users, (v) Web-based
archives that are available to everyone and should be able to sup-
port non-trivial user traffic volumes, store vast amounts of data,
and search semi-structured and non-homogeneous data.
Query modalities for CBIR systems can rely on keywords,
free-text (consisting of sentences, questions, phrases, etc.), images
(where the user requests that similar images, or images in the
same semantic category as the query image, be retrieved), graphics
(where the user draws the query image/shape) as well as
composite approaches based on combinations of the aforementioned
modalities.
Finally, visualization is another aspect of a CBIR system that can
influence its commercial success [258]. Relevance-ordered results
are presented based on some order of importance of the results,
and it is the approach adopted by most image search engines.
Time-ordered results are presented in chronological order and
are commonly used in social media websites such as Facebook. A
clustered presentation of the results can also provide an intuitive
way of browsing groups of images. A hierarchical approach to visu-
alizing the data could be implemented through the use of metadata
associated with the images and videos. As noted in Datta et al.
[258], a hierarchical representation could be useful for educational
purposes. Finally, combinations of the above visualization systems
could be a useful feature when designing personalized systems.
CBIR systems need to be efficient and scalable, resulting in a
preference towards the use of storage-wise scalable feature
extraction algorithms. However, such feature extraction algorithms
must also be powerful in terms of their indexing capabilities. The
so-called semantic gap, which we discussed in some detail in
Sections 1 and 2.1, is a topic which re-emerges in the CBIR literature. The
ability to integrate feature primitives which are also powerful
indexing mechanisms is typically moderated by the fact that
powerful indexing primitives tend to also be quite expensive.
Scalable features are either color-based, texture-based or shape-based.
See Veltkamp and Tanase [259] for a good overview. Color
features can rely on the use of an image’s dominant color, a re-
gion’s histogram, a color coherence vector, color moments, correla-
tion histograms, or a local image histogram. Common texture
features used include edge statistics, local binary patterns, random
field decomposition, atomic texture features and Wavelet, Gabor or
Fourier based features. Common shape features include the use of
bounding boxes such as ellipses, curvature scale space, elastic
models, Fourier descriptors, template matching and edge direction
Thus, within the CBIR context, we see that the general
recognition framework of Fig. 1 forms a submodule of a more complex
system lying at the confluence of HCI (user interfaces, user intent
prediction, visualization), database systems, and object recognition.
Flickner et al. developed the QBIC system which uses color, tex-
ture and shape features [244,259]. The RGB, YIQ, Lab and Munsell
color spaces are used to extract whole image color representations
or average color vectors. The shape features rely on algebraic
moment invariants (corresponding to eigenvalues of various
constructed matrices), major axis orientation, shape area and
eccentricity. Tamura’s texture features were the inspiration for
the coarseness, contrast and directionality features used [261].
Querying is based on example images, sketches drawn by the user,
and through the use of various query colors and texture patterns.
Matching relies on variants of the Euclidean distance between
Fig. 23. In [239] a vision-based computer-controlled wheelchair equipped with a 6DOF arm, actively searches an indoor environment to localize a door, approaches the door,
pushes down the door handle, opens the door, and enters the next room.
the extracted color, shape and texture features. Relevance feedback
enables the user to select retrieved images and use them as seeds
in subsequent queries.
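Query-by-example with a color feature and Euclidean matching can be sketched as follows. This is a generic illustration and not the QBIC implementation [244]: the joint RGB histogram with four bins per channel, the plain Euclidean distance, and the function names are assumptions chosen for brevity.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit R, G, B values


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins**3 cells."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 / bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Clamp so that the value 255 always falls in the last bin.
        ri, gi, bi = (min(int(c / step), bins - 1) for c in (r, g, b))
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    return [count / len(pixels) for count in hist]


def rank_by_similarity(query: List[float],
                       database: Dict[str, List[float]]) -> List[str]:
    """Database image names ordered by Euclidean distance to the query feature."""
    return sorted(database, key=lambda name: math.dist(query, database[name]))
```

Relevance feedback then amounts to re-running `rank_by_similarity` with the feature of an image the user selected from the previous result list.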
The Photobook from MIT’s Media Lab [262] is a system that
relies on the extraction of faces, 2D shapes and texture images. The
faces and 2D shape features rely on the eigenvectors of a covari-
ance matrix depending on pixel intensities and various feature
points defining the object’s shape [113]. The texture description
of the object depends on periodicity, directionality and random-
ness. For each image category a few characteristic prototypes for
the image category are selected. For each database image the aver-
age distance to these image prototypes is calculated. The distance
of the query image to these averages is used during query time in
order to match the query image to a category.
The PicSOM system [251,263–265] will be discussed in more
detail towards the end of the paper, as its features have also
been used successfully in the annual PASCAL competitions. Briefly,
a number of color features are extracted from the RGB channels,
and the YIQ channels. Also the image edges (extracted with a Sobel
operator) in conjunction with the low-passed Fourier spectrum,
provide another 128-dimensional vector which is useful for recog-
nition purposes. A Self-Organizing Map (SOM) is used to match the
images, where the distance between SOM units corresponds to the
Euclidean distance between the above described feature vectors.
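The basic operation behind SOM-based matching, finding the map unit closest to an image's feature vector, can be sketched as follows. This is the textbook best-matching-unit lookup and update step, not the PicSOM implementation [251]; the grid layout, the learning rate, and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[Sequence[float]]]  # som[row][col] is a unit's weight vector


def best_matching_unit(som: Grid, feature: Sequence[float]) -> Tuple[int, int]:
    """Grid coordinates of the unit whose weights are closest to `feature`."""
    coords = ((r, c) for r in range(len(som)) for c in range(len(som[r])))
    return min(coords, key=lambda rc: math.dist(som[rc[0]][rc[1]], feature))


def update_unit(weights: Sequence[float], feature: Sequence[float],
                rate: float) -> List[float]:
    """Move a unit's weights a fraction `rate` of the way towards `feature`."""
    return [w + rate * (f - w) for w, f in zip(weights, feature)]
```

Images whose feature vectors map to the same or to neighboring units are then treated as similar, so retrieval reduces to looking up the images stored around the query's best-matching unit.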
From the CBIR systems compared in Table 4, we notice that it is
difficult to compare and discriminate CBIR systems along the typ-
ical dimensions used to compare recognition systems. This is be-
cause few performance metrics are typically disclosed for such
systems. Early published work relied on fairly similar features
(mostly color, shape, texture, and sometimes text based) and made
little to no use of function, context, or 3D object representations.
More recent work on image classification (as reviewed in Sections
2.7, 2.9 for example) relies to a greater extent on the use of context.
The main differentiating factor amongst CBIR systems lies in the
user interface (how users enter their queries, the type of queries,
the use of relevance feedback), how data is visualized, and query
The discussion on classical approaches to object recognition has
provided the reader with an overview of the types of algorithms
that could be incorporated in a CBIR system. Practical CBIR
systems have to make certain compromises between the indexing
power of the feature extraction algorithms, their generality, and
their computational requirements. The practical success of a CBIR
system will also ultimately rely on its user interface, the power
of its relevance feedback mechanism, the system’s ability to pre-
dict what the user is searching for, as well as on the representa-
tional power of the system’s core recognition algorithms.
3. Active and dynamic vision
In the introduction we overviewed some of the advantages and
disadvantages of the active vision framework. The human visual
system has two main characteristics: the eyes can move, and visual
sensitivity is highly heterogeneous across visual space [33]. Curi-
ously, these characteristics are largely ignored by the vision
The human eyes exhibit four types of behaviors: saccades, fixa-
tion, smooth pursuit, and vergence. Saccades are ballistic move-
ments associated with visual search. Fixation is partially
associated with recognition tasks which do not require overt atten-
tion. Smooth pursuit is associated with tracking tasks and vergence
is associated with vision tasks which change the relative directions
of the optical axes. How do these behaviors fit within the active vi-
sion framework in computer vision? As discussed in Section 2.7, it
is believed that during early childhood development, the associa-
tion between the sight of an object and its function is primed by
manipulation, randomly at first, and then in a more and more re-
fined way. This hints that there exists a strong association between
active vision and learning. Humans are excellent at recognizing
and categorizing objects even from static images. It can thus be
argued that active vision research is at least as important for learning
object representations as it is for online recognition tasks.
Findlay and Gilchrist [33] make a compelling argument in sup-
port of more research in the active approach to human vision:
1. Vision is a difficult problem consisting of many building blocks
that can be characterized in isolation. Eye movements are one
such building block.
2. Since visual sensitivity is the highest in the fovea, in general,
eye movements are needed for recognizing small stimuli.
3. During a fixation, a number of things happen concurrently: the
visual information around the fixation is analyzed, and visual
information away from the current fixation is analyzed to help
select the next saccade target. The exact processes involved in
this are still largely unknown.
Findlay and Gilchrist [33] also pose a number of questions, in
order to demonstrate that numerous basic problems in vision still
remain open for research.
1. What visual information determines the target of the next eye movement?
2. What visual information determines when eyes move?
3. What information is combined across eye movements to form a
stable representation of the environment?
As discussed earlier [29], a brute force approach to object
localization subject to a cost constraint is often intractable as the
search space size increases. Furthermore, the human brain would
have to be hundreds of thousands of times larger than it currently
is if visual sensitivity across the visual space were the same
as that in the fovea [29]. Thus, active and attentive approaches to
the problem are usually proposed as a means of addressing these
We will show in this section that within the context of the gen-
eral framework for object recognition that was illustrated in Fig. 1,
previous work on active object recognition systems has conclu-
sively demonstrated that active vision systems are capable of
leading to significant improvements in both the learning and infer-
ence phases of object recognition. This includes improvements in
the robustness of all the components of the feature-extraction →
feature-grouping → object-hypothesis → object-verification →
object-recognition pipeline.
Some of the problems inherent in single view object recognition
include [266]:
1. The impossibility of inverting projection and the fragility of
3D inference. It is, in general, impossible to recover a three
dimensional world from its two dimensional projection on an
image, unless we make restrictive assumptions about the
2. Occlusion. Features necessary for recognition might be self-
occluded or occluded by other objects.
3. Detectability. Features necessary for recognition might be missing
due to low image contrast, illumination conditions and
incorrect camera placement [73].
4. View degeneracies. As discussed in [49], view degeneracies
that are caused by accidental alignments can easily lead to
wrong feature detection and bad model parameterizations.
see Biederman’s work at http://geon.usc.edu/~biederman/ObjectRSVP.mov.
It is straightforward to see how the above problems can adversely
influence the components of a typical object recognition
system shown in Fig. 1. Various attempts have been made to ad-
dress these problems. The various 3D active object recognition sys-
tems that have been proposed so far in the literature can be
compared based on the following four main characteristics [267]:
1. Nature of the next view planning strategy. Often the features
characterizing two views of two distinct objects are identical,
making single view recognition very difficult. A common goal
of many active recognition strategies is to plan camera move-
ments and adjust the camera’s intrinsic parameters in order
to obtain different views of the object that will enable the sys-
tem to escape from the single view ambiguities. While classical
research on active vision from the field of psychology has lar-
gely focused on ’eyes and head’ movements, the next-view
planning literature in computer vision and robotics assumes
more degrees of freedom since there are no constraints on
how the scene can be sensed or what types of actuators the
robotic platform can have.
2. Uncertainty handling capability of the hypothesis generation
mechanism. One can distinguish between Bayesian based and
non-Bayesian based approaches to the hypothesis generation
problem and the handling of uncertainty in inference.
3. Efficient representation of domain knowledge. The efficiency
of the mechanism used to represent domain knowledge and
form hypotheses is another feature distinguishing the recogni-
tion algorithms. This domain knowledge could emerge in the
form of common features—such as edges, moments, etc.—as
well as other features that are appropriate for using context
or an object’s function to perform recognition.
4. Speed and efficiency of algorithms for both hypothesis gen-
eration and next view planning. Complexity issues arise, for
example, in terms of the reasoning and next view planning
algorithm that is used, but also in terms of other components
of the recognition algorithm. The complexity of those subcomponents
can play a decisive role as to whether we will have a
real-time performing active object recognition algorithm, even
if we use a highly efficient representation scheme of the domain
knowledge from point 3.
As indicated in the introduction, the dynamic vision paradigm
subsumes the active vision paradigm, and is more focused on dy-
namic scenes where vision algorithms (such as recognition algo-
rithms) are applied concurrently to the actions being executed.
Within this context, a significant topic of research in dynamic vi-
sion systems is the incorporation of predictions of future develop-
ments and possibilities [50,268]. Historically, dynamic vision
systems have focused on the construction of vision systems that
are reliable in indoor and outdoor environments. Within this con-
text, dynamic vision systems are also more tightly coupled to the
research interests of the robotics community, as compared to clas-
sical computer vision research.
Historically, dynamic vision research emerged due to the need
to integrate recursive estimation algorithms (e.g., Kalman filters)
with spatio-temporal models of objects observed from moving
platforms. As pointed out by Dickmanns [50], applying vision algorithms
concurrently with the actions performed requires the following
(also see Fig. 24): (i) The computation of the expected visual
appearance from fast moving image sequences and the representation
of models for motion in 3-D space and time. (ii) Taking into
account the time delays of the different sensor modalities in order
to synchronize the image interpretation. (iii) The ability to
robustly fuse different elements
of perception (such as inertial information, visual information
and odometry information) whose strengths and weaknesses
might complement each other in different situations. For example,
visual feedback is better for addressing long-term stability drift
problems which might emerge from inertial signals, while inertial
signals are better for short-term stability when implementing
egomotion and gaze stabilization algorithms. (iv) Incorporating a
knowledge-base of manoeuvre elements for helping with situa-
tional assessment. (v) Incorporating a knowledge base of behav-
ioral capabilities for various scene objects, so that the objects’
behavior and identity could be identified more easily from small
temporal action elements. (vi) Taking into consideration the inter-
dependence of the perceptual and behavioral capabilities and ac-
tions across the system’s various levels, all the way down to the
actual hardware components.
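The complementary roles of inertial and visual signals in requirement (iii) can be illustrated with a scalar Kalman filter. This is a minimal sketch under stated assumptions and not Dickmanns' 4-D estimator: the state is a single heading angle, the gyro has a constant hypothetical bias, the visual measurement is noise-free, and the noise variances `q` and `r` are arbitrary values.

```python
def fuse_heading(gyro_rates, visual_headings, dt, q=0.01, r=0.04):
    """Scalar Kalman filter: predict with the gyro rate, correct with vision.

    gyro_rates      : angular rates in rad/s (may contain a bias)
    visual_headings : heading measurements in rad, or None when no image
                      measurement is available at that step
    q, r            : assumed process and measurement noise variances
    """
    x, p = 0.0, 1.0            # state estimate and its variance
    history = []
    for rate, z in zip(gyro_rates, visual_headings):
        # Prediction from the inertial signal (good short-term behaviour).
        x += rate * dt
        p += q
        # Correction from vision (removes long-term drift).
        if z is not None:
            k = p / (p + r)    # Kalman gain
            x += k * (z - x)
            p *= (1.0 - k)
        history.append(x)
    return history

# Hypothetical scenario: the platform does not rotate (true heading 0 rad),
# but the gyro reports a constant bias of 0.1 rad/s for 10 seconds.
steps, dt, bias = 100, 0.1, 0.1
gyro_rates = [bias] * steps
visual_headings = [0.0] * steps

dead_reckoning = sum(rate * dt for rate in gyro_rates)   # inertial only
fused = fuse_heading(gyro_rates, visual_headings, dt)[-1]
print(dead_reckoning, fused)
```

Integrating the biased gyro alone accumulates an error that grows linearly with time, whereas the fused estimate stays bounded because every visual measurement pulls the state back towards the observed heading.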
We see that dynamic vision systems incorporate what is often
also referred to as contextual information, thus taking a much
Fig. 24. Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50,268]).
broader and holistic approach to the vision problem. A significant
insight of Dickmanns’ spatio-temporal approach to vision was that
the modeling of objects and motion processes over time in 3-D
space (as compared to modeling them directly in the image plane)
and the subsequent perspective projection of those models in the
image plane, led to drastic improvements in the calculation of
the respective Jacobian matrices used in the recursive estimation
processes, and thus became necessary components of a dynamic
vision system. This approach led to the creation of robust vision
systems that were far more advanced than what had been consid-
ered to be the state-of-the-art up until then. Examples of such sys-
tems include pole balancing using an electro-cart [269], the first
high-speed road vehicle guidance by vision on a highway [50]
(which includes modules for road recognition [270–273], lane rec-
ognition, road curvature estimation, and lane switching [274,50],
obstacle detection and avoidance [275], recognition of vehicles
and humans [276], and autonomous off-road driving [50]) as well
as aircraft and helicopters that use vision to land autonomously
[277,278]. Within the context of the recognition
pipeline shown in Fig. 1, we see that the work by Dickmanns improved
the reliability of the measured features, of the predicted features,
and of the object hypotheses and their
subsequent grouping, when attempting to extract these features
under egomotion. These improvements led to innovations in vision that
were significant and surprising for their time, for
example the first demonstration of a self-driving vision-based vehicle.
In Section 1 we discussed some of the execution costs associ-
ated with an active vision system. These problems (such as the
problem of determining correspondences under an imperfect ste-
reo depth extraction algorithm and the problem of addressing
dead-reckoning errors) are further exacerbated in dynamic vision
systems where the actions are executed concurrently with the vision
algorithms. This is one major reason why these problems are
more prominently identified and addressed in the literature on
dynamic vision systems: addressing them usually
becomes a necessary component of any dynamic vision system.
At this point we need to make a small digression, and discuss
the difference between passive sensors, active sensors, active vision
and passive vision. While passive and active vision refer to
the use (or lack thereof) of intelligent control strategies applied
to the data acquisition process, an active sensor refers to a sensor
which provides its own energy for emitting radiation, which in
turn is used to sense the scene. In practice, active sensors are
meant to complement classical passive sensors such as light sensi-
tive cameras. The Kinect [295] is a popular example of a sensor that
combines a passive RGB sensor and an active sensor (an infrared
laser combined with a monochrome CMOS camera for interpreting
the active sensor data and extracting depth). One could classify vi-
sion systems into those which have access to depth information
(3D) and those that do not. One could argue that the use of active
sensors for extracting depth information is not essential in the ob-
ject recognition problem, since the human eyes are passive sensors
and stereo depth information is not an essential cue for the visual
cortex. In practice, however, active sensors are often superior for
extracting depth under variable illumination. Furthermore, depth
is a useful cue in the segmentation and object recognition process.
One of the earliest active recognition systems [286] made use of laser
range finders. Within the context of more recent work, the success
of Kinect-based systems [296–299] demonstrates how
combined active and passive sensing systems improve recognition.
For example, the work by [296] achieved top ranked performance
in a related recognition challenge, by leveraging the ability of the
Kinect to provide accurate depth information in order to build reli-
able 3D object models. Within the context of the recognition pipe-
line shown in Fig. 1, active sensors enable us to better register the
scene features with the scene depth. This enables the creation of
higher fidelity object models, which in turn are useful in improving
the feature grouping phase (e.g., determining the features which lie
at similar depths) as well as the object hypothesis and recognition
phases (by making 3D object model matching more reliable).
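As a small illustration of how registered depth can support feature grouping, the sketch below clusters image features whose depths are similar. It is an assumed toy procedure, not a method taken from the cited systems: the `(x, y, depth)` tuples, the depth values in metres and the `max_gap` threshold are hypothetical.

```python
def group_by_depth(features, max_gap=0.15):
    """Group (x, y, depth) features whose sorted depths differ by <= max_gap.

    Features are sorted by depth and a new group is started whenever the gap
    to the previous feature exceeds max_gap (single-linkage on depth only).
    """
    ordered = sorted(features, key=lambda f: f[2])
    groups = []
    for f in ordered:
        if groups and f[2] - groups[-1][-1][2] <= max_gap:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups

# Hypothetical features: pixel coordinates plus depth from an active sensor.
features = [
    (10, 12, 1.02), (40, 80, 2.50), (11, 15, 1.05),
    (42, 83, 2.58), (90, 20, 4.00),
]
groups = group_by_depth(features)
for g in groups:
    print(g)
```

A real system would typically combine depth with image-plane proximity and appearance cues; depth alone is used here only to show why a reliable depth channel simplifies the grouping step.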
3.1. Active object detection literature survey
With the rising popularity in the 1990s of machine learning
and Bayesian-based approaches for solving computer vision problems,
active vision approaches lost their popularity. The number of related
publications decreased significantly between the late
1990s and the next decade.
This lack of interest in active vision systems is partially attribut-
able to the fact that power efficiency is not a major factor in the de-
sign of vision algorithms. This is also evidenced by the evaluation
criteria of vision algorithms in popular conferences and journals,
where usually no power metrics are presented. Note that an algo-
rithm’s asymptotic space and time complexity is not necessarily a
sufficiently accurate predictor of power efficiency, since this does
not necessarily model well the degree of communication between
CPU and memory in a von Neumann architecture. One of the main
research interests of the object recognition community over the
last 10–15 years has been the interpretation of large datasets
containing images and video. This has been mainly motivated by
the growth of the internet, online video, and smartphones, which
make it extremely easy for anyone to capture high quality pictures
and video. As a result, most resources of the vision community
have been focused on addressing the industry’s need for good vision
algorithms to mine all this data, and research on active
approaches to vision was not a priority.
Recently, however, there has been a significant upsurge of inter-
est in active vision related research. This is evidenced by some of
the more recent publications on active vision, which are also dis-
cussed in Sections 3.1 and 3.2. In this section we focus on the active
object detection problem, which involves the use of intelligent data
acquisition strategies in order to robustly choose the correct value
of at least one binary label/classification associated with a small 3D
region. The main distinguishing characteristic of the active object
detection literature, as compared to the literature on active object
localization and recognition, is that in the detection problem we
are interested in improving the classification performance in some
small 3D region, and are not as interested in searching a large 3D
region to determine the positions of one or more objects. In Charts
7, 8 and Tables 5, 6 we compare, along certain dimensions, a num-
ber of the papers surveyed in Sections 3.1, 3.2. Notice that the ac-
tive object detection systems of Table 5 make little use of function
and context. In contrast to the non-active approaches, all the active
vision systems rely on 3D depth extraction mechanisms through
passive (stereo) or active sensors. From Tables 5, 6 we notice that
no active recognition system is capable of achieving consistently
good performance along all the compared dimensions. In this re-
spect it is evident that the state of the art in passive recognition
(Table 7) surpasses the capabilities of active recognition systems.
Wilkes and Tsotsos [266] published one of the first papers to
discuss the concept of active object detection (see Fig. 25) by pre-
senting an algorithm to actively determine whether a particular
object is present on a table. As the authors argue, single view object
recognition has many problems because of various ambiguities
that might arise in the image, and the inability of standard object
recognition algorithms to move the camera and obtain a more suit-
able viewpoint of the object and thus, escape from these ambigui-
ties. The paper describes a behavior based approach to camera
motion and describes some solutions to the above mentioned
ambiguities. These ambiguities are discussed in more detail by
Dickinson et al. [280]. Inspired by the arguments in [266], the
authors begin by presenting certain reasons as to why the problem
Chart 7. Summary of the 1989–2009 papers in Table 5 on active object detection. By definition search efficiency is not the primary concern in these systems, since by
assumption the object is always in the sensor’s field of view. However inference scalability constitutes a significant component of such systems. We notice very little use of
function and context in these systems. Furthermore, training such systems is often non-trivial.
Chart 8. Summary of the 1992–2012 papers on active object localization and recognition from Table 6. As expected, search efficiency and the role of 3D information is
significantly more prominent in these papers (as compared to Chart 7).
Table 5
Comparing some of the more distinct algorithms of Section 3.1 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are used to indicate
the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable
label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases.
Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection
algorithm, this refers to its localization efficiency within the context of a sliding-window/exhaustive approach (i.e., the degree of the use of intelligent strategies to improve
detection efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of
the object representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of
indexing primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is
used by the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.

| Papers (1989–2009) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Wilkes and Tsotsos [266] | * | ** | * | * | * | * | * | * |
| Callari and Ferrie [279] | * | ** | * | * | * | * | *** | * |
| Dickinson et al. [280] | ** | ** | * | *** | *** | * | * | * |
| Schiele and Crowley [281] | *** | ** | ** | ** | ** | * | * | ** |
| Borotschnig et al. [282] | *** | ** | ** | *** | ** | * | * | ** |
| Paletta and Prantl [283] | *** | ** | ** | *** | ** | * | * | ** |
| Roy et al. [284] | ** | ** | ** | ** | *** | * | ** | * |
| Andreopoulos and Tsotsos [239] | ** | ** | * | * | * | * | *** | * |
| Roy and Kulkarni [285] | ** | ** | ** | *** | *** | * | * | * |
| Hutchinson and Kak [286] | * | ** | * | * | *** | * | *** | * |
| Gremban and Ikeuchi [287] | ** | ** | * | * | * | * | ** | * |
| Herbin [288] | * | * | * | * | * | * | * | * |
| Kovacic et al. [289] | ** | ** | ** | ** | * | * | ** | * |
| Denzler and Brown [290] | ** | *** | ** | ** | * | * | * | * |
| Laporte and Arbel [291] | ** | *** | ** | ** | * | * | * | * |
| Mishra and Aloimonos [292] | *** | N/A | N/A | N/A | *** | ** | *** | *** |
| Mishra et al. [293] | *** | N/A | N/A | N/A | *** | ** | *** | *** |
| Zhou et al. [294] | *** | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
of recognizing objects from single images is so difficult. The reasons
were discussed in the previous section and include the impossibility
of inverting projection, occlusions, feature detectability
issues, the fragility of 3D inference, and view degeneracies. To address
these issues the authors define a special view as a view of the
object, optimizing some function f of the features extracted from
the image data. Let $P_0$, $P_1$, $P_2$ be three points on the object and $d_{ij}$
denote the distance of the projected line between points $P_i$ and $P_j$.
The authors try to locate a view of the object maximizing $d_{01}$
and $d_{02}$ subject to the constraint that the distance of the camera
from the center of the line joining $P_0$ and $P_1$ is at some constant
distance r. The authors argue that such a view will make it less
likely that they will end up in degeneracies involving points $P_i$, $P_j$
[49]. Once they have found this special view, the authors
suggest using any standard 2D pattern recognition algorithm to
do the recognition. Within the context of the standard recognition
Table 6
Comparing some of the more distinct algorithms of Section 3.2 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are used to indicate
the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable
label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases.
Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection
algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection
efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of the object
representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of indexing
primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is used by
the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.

| Papers (1992–2012) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Rimey and Brown [302] | * | ** | * | *** | * | **** | * | * |
| Wixson and Ballard [303] | * | *** | * | * | * | **** | * | * |
| Sjöö et al. [304] | ** | *** | * | ** | ** | *** | **** | * |
| Brunnström et al. [305,306] | * | ** | * | * | ** | ** | ** | * |
| Ye and Tsotsos [307] | * | *** | * | * | * | ** | **** | * |
| Minut and Mahadevan [308] | * | *** | * | * | * | *** | * | * |
| Kawanishi et al. [309] | * | ** | * | * | * | * | ** | * |
| Ekvall et al. [310] | ** | ** | *** | ** | ** | ** | *** | ** |
| Meger et al. [311] | ** | ** | *** | ** | ** | * | *** | * |
| Forssen et al. [312] | ** | ** | *** | ** | ** | * | *** | * |
| Saidi et al. [313] | * | *** | * | * | * | ** | *** | * |
| Masuzawa and Miura [314] | * | *** | ** | ** | ** | * | **** | * |
| Sjöö et al. [315] | * | ** | * | * | * | * | *** | * |
| Ma et al. [316] | ** | ** | ** | ** | ** | * | *** | * |
| Andreopoulos et al. [24] | ** | *** | *** | *** | *** | *** | **** | * |
Table 7
Comparing some of the more distinct algorithms of Section 4.2 along a number of dimensions. For each paper, and where applicable, 1–4 stars (*, **, ***, ****) are used to indicate
the strength/expended effort along the corresponding dimension. These often implicitly denote why a particular paper became well known. Where appropriate, a not-applicable
label (N/A) is used. Inference scalability: The focus of the paper on improving the robustness of the algorithm as the scene complexity or the object class complexity increases.
Search efficiency: The use of intelligent strategies to decrease the time spent localizing an object when the corresponding algorithm is used for localization. If it is a detection
algorithm, this refers to its localization efficiency within the context of a sliding-window approach (i.e., the degree of the use of intelligent strategies to improve detection
efficiency). Training efficiency: The level of automation in the training process, and the speed with which the training is done. Encoding scalability: The encoding length of the object
representations as the number of objects increases or as the object representational fidelity increases. Diversity of indexing primitives: The distinctiveness and number of indexing
primitives used. Uses function or context: The degree to which function and context influences the algorithm. Uses 3D: The degree to which depth/range/3D information is used by
the algorithm for inference or model representations. Uses texture: The degree to which texture discriminating features are used by the algorithm.

| Papers (2002–2011) | Inference scalability | Search efficiency | Training efficiency | Encoding scalability | Diversity of indexing primitives | Uses function or context | Uses 3D | Uses texture |
|---|---|---|---|---|---|---|---|---|
| Zhang et al. [142] | ** | * | ** | ** | ** | * | * | *** |
| Dalal and Triggs [335] | ** | * | ** | ** | ** | * | * | ** |
| Leibe et al. [199] | ** | ** | ** | *** | ** | * | * | * |
| Laaksonen et al. [263] | ** | ** | ** | ** | *** | * | * | ** |
| Perronnin and Dance [364] | ** | * | ** | ** | ** | * | * | ** |
| Chum and Zisserman [365] | *** | *** | ** | *** | ** | * | * | ** |
| Felzenszwalb et al. [366] | *** | *** | ** | ** | ** | * | * | ** |
| Ferrari et al. [367] | *** | ** | ** | *** | *** | * | * | * |
| van de Weijer and Schmid [368] | ** | * | ** | ** | *** | * | * | * |
| Viitaniemi and Laaksonen [265] | ** | * | * | * | *** | * | * | ** |
| Harzallah et al. [361] | *** | ** | ** | ** | *** | *** | * | ** |
| Tahir et al. [342] | ** | ** | *** | ** | *** | * | * | ** |
| Felzenszwalb et al. [363] | *** | *** | *** | ** | *** | *** | * | ** |
| Vedaldi et al. [356] | *** | * | * | ** | *** | ** | * | ** |
| Wang et al. [357] | ** | *** | *** | *** | *** | * | * | * |
| Khan et al. [358] | *** | ** | ** | ** | ** | * | * | * |
| van de Sande et al. [351] | ** | **** | ** | ** | *** | * | * | ** |
| Bourdev and Malik [352] | *** | ** | ** | *** | ** | * | *** | ** |
| Perronnin et al. [354] | *** | ** | **** | ** | ** | * | * | ** |
| Zhu et al. [348] | *** | ** | *** | *** | ** | * | * | * |
| Chen et al. [349] | *** | ** | *** | *** | ** | * | * | ** |
| Song et al. citesong2011 | *** | ** | *** | ** | **** | *** | * | *** |
pipeline in Fig. 1, we see that [266] showed how an active vision
system can escape from view-degeneracies, thus leading to more
reliable feature extraction and grouping.
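The search for such a special view can be sketched as a brute-force search over camera positions on a sphere. This is an illustrative simplification and not the behavior-based controller of [266]: the point labels `p0`, `p1`, `p2`, the orthographic projection, the sampling grid, and the choice to maximize the smaller of the two projected lengths are all assumptions of the example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projected_length(p, q, view_dir):
    """Length of segment pq after orthographic projection along unit view_dir."""
    s = sub(q, p)
    along = dot(s, view_dir)
    return math.sqrt(max(dot(s, s) - along * along, 0.0))

def special_view(p0, p1, p2, r, steps=36):
    """Sample viewing directions and keep the one that maximizes the smaller
    of the projected lengths of segments p0p1 and p0p2. The camera is placed
    at distance r from the midpoint of p0p1."""
    centre = tuple((a + b) / 2.0 for a, b in zip(p0, p1))
    best_score, best_dir = -1.0, None
    for i in range(steps + 1):                 # elevation: -90 .. +90 degrees
        el = -math.pi / 2 + math.pi * i / steps
        for j in range(2 * steps):             # azimuth: 0 .. 360 degrees
            az = math.pi * j / steps
            d = (math.cos(el) * math.cos(az),
                 math.cos(el) * math.sin(az),
                 math.sin(el))
            score = min(projected_length(p0, p1, d),
                        projected_length(p0, p2, d))
            if score > best_score:
                best_score, best_dir = score, d
    camera = tuple(c + r * x for c, x in zip(centre, best_dir))
    return camera, best_dir, best_score

# Hypothetical object points lying in the z = 0 plane.
p0, p1, p2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
camera, view_dir, score = special_view(p0, p1, p2, r=2.0)
print(camera, view_dir, score)
```

For three points in a plane the best direction found is perpendicular to that plane, which is the view in which neither segment is foreshortened and accidental alignments of the three points are least likely.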
Callari and Ferrie [300,279] introduce a method for view selec-
tion that uses prior information about the objects in the scene. The
work is an example of an active object detection system that incor-
porates contextual knowledge. This contextual knowledge is used
to select viewpoints that are optimal with respect to a criterion.
This constrains the gaze control loop, and leads to more reliable
object detection. The authors define contextual knowledge as the
join of a discrete set of prior hypotheses about the relative likeli-
hood of various model parameters s, given a set of object views,
with the likelihood of each object hypothesis as the agent explores
the scene. The active camera control mechanism is meant to aug-
ment this contextual knowledge and, thus, enable a reduction in
the amount of data needed to form hypotheses and provide us with
more reliable object recognition. The paper describes three main
operations that an agent must perform: (a) Data collection, regis-
tration with previous data and modeling using a pre-defined scene
class. (b) Classification of the scene models using a set of object
hypotheses. (c) Disambiguation of ambiguous hypotheses/classifi-
cations by collecting new object views/data to reduce ambiguity.
The paper does not discuss how to search through an arbitrary
3D region to discover the objects of interest. The paper assumes
that the sensor is focused on some object, and any motion along
the allowed degrees of freedom will simply sense the object from
a different viewpoint (i.e., it tackles a constrained version of the ob-
ject search problem). Thus, this active vision system provided a
methodology for improving the object hypothesis and verification
phases of the pipeline in Fig. 1.
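The disambiguation step (c) can be illustrated with a generic Bayesian sketch: maintain a posterior over object hypotheses and choose the next view that is expected to reduce its entropy the most. This is a textbook-style illustration under stated assumptions and not the formulation of Callari and Ferrie: the hypotheses, views, observations and likelihood values are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a distribution given as {hypothesis: prob}."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def posterior(prior, likelihood, view, obs):
    """Bayes update of the hypothesis distribution after seeing obs from view."""
    unnorm = {h: prior[h] * likelihood[view][h][obs] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def expected_entropy(prior, likelihood, view):
    """Expected posterior entropy if the sensor is moved to the given view."""
    observations = next(iter(likelihood[view].values())).keys()
    total = 0.0
    for obs in observations:
        p_obs = sum(prior[h] * likelihood[view][h][obs] for h in prior)
        if p_obs > 0:
            total += p_obs * entropy(posterior(prior, likelihood, view, obs))
    return total

def next_view(prior, likelihood):
    """Pick the view with the lowest expected posterior entropy."""
    return min(likelihood, key=lambda v: expected_entropy(prior, likelihood, v))

# likelihood[view][hypothesis][observation] = P(observation | hypothesis, view)
likelihood = {
    "front": {"mug":  {"round": 1.0, "handle": 0.0},
              "bowl": {"round": 1.0, "handle": 0.0}},
    "side":  {"mug":  {"round": 0.1, "handle": 0.9},
              "bowl": {"round": 0.9, "handle": 0.1}},
}
prior = {"mug": 0.5, "bowl": 0.5}

chosen = next_view(prior, likelihood)
updated = posterior(prior, likelihood, chosen, "handle")
print(chosen, updated)
```

In this toy scene the front view is uninformative because both objects look the same from it, so the side view is selected; a single observation from that view already shifts most of the probability mass to one hypothesis.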
Dickinson et al. [280] combine various computer vision techniques in a single
framework in order to achieve robust object recognition. The algo-
rithm is given the target object as its input. Notice that even
though the paper does deal with the problem of object search
and localization within a single image, its next viewpoint control-
ler deals mostly with verifying the object identity from a new
viewpoint, which is the reason we refer to this algorithm as an ac-
tive object detector.
The paper combines a Bayesian based attention mechanism,
with aspect graph based object recognition and viewpoint control,
in order to achieve robust recognition in the presence of ambigu-
ous views of the object. See Figs. 26, 27 for an overview of the var-
ious modules implemented in the system. The object
representation scheme is a combination of Object Centered Model-
ing and Viewer Centered Modeling. The object centered modeling
is accomplished by using 10 geons. These geons can be combined
to describe more complex types of objects. The Viewer Centered
modeling is accomplished by using aspects to represent a small
set of volumetric parts from which an object is constructed, rather
than directly representing an object. One obvious advantage of this
is the decrease in the size of the aspect hierarchy. However, if a
volumetric part is occluded, this could cause problems in the
recognition. To solve this problem, the authors extend the aspect graph
representation into an aspect graph hierarchy (see Fig. 12) which
consists of three levels: the set of aspects that model the chosen
volumes, the set of component faces of the aspects, and the set
of boundary groups representing all subsets of contours bounding
the faces. The idea is that if an aspect is occluded, they can use
some of these lower-level features to achieve the recognition.
From this hierarchy of geon primitives, aspects, faces and boundary
groups, the authors create a Bayesian network, and extract the
associated conditional probabilities. The probabilities are extracted
in a straightforward manner by uniformly sampling the geons
using a Gaussian sphere. For example, to estimate the probability
of face x occurring given that boundary group y is currently visible,
they use the sampled data to calculate the related probability.
From this data, the authors use a slight modification of Shannon’s
entropy formula to discover that for the geon based representation,
faces are more discriminative than boundary groups. Therefore,
they use faces as a focus feature for the recovery of volumetric
parts (see Fig. 28).
Using various segmentation algorithms described in the litera-
ture, the authors segment the images and create region topology
graphs (denoting region adjacencies), region boundary topology
graphs (denoting relations between partitioned segments of
bounding contours) and face topology graphs (indicating the la-
beled face hypothesis for all regions in the image). Each region’s
shape in the image is classified by matching its region boundary
graph to those graphs representing the faces in the augmented as-
pect hierarchy graph using interpretation tree search. This enables
the creation of face topology graphs labeling the current image.
They use this face topology graph labeling with attention driven
recognition in order to limit search in both the image and the mod-
el database. Given as input the object they wish to detect, the
authors define a utility function U that can be used in conjunction
with the previously defined conditional probabilities and the as-
pect graph, to determine the most likely face to start their search
with, given the object they are trying to find. The search uses
concepts inspired by game theory, and continues until there is
a good match between the face topology graph for the image and
the augmented aspect graph.
Then, a verification step is done, by using various metrics to see
if the aspects and volumes also match. If there is no match the
authors proceed with the next most likely matching face, and the
process continues like this. Extensions of this recognition algo-
rithm to multipart objects are also described and involve some ex-
tra steps in the verification phase searching for connectedness
among their part aspects. The final component of the recognition
algorithm involves viewpoint control. Viewpoint control makes it
possible to resolve viewpoint degeneracies. As already discussed
in this survey (also see the discussion towards the end of this sec-
tion), such degeneracies have been shown to frequently occur in
practice. The authors define an aspect prediction graph which is
a more compact version of the aspect graph and specifies
Fig. 25. A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266] actively recognizes an origami object.
transitions between topologically equivalent views of the object.
They use this graph to decide the direction of camera motion.
The main idea is to move the camera to the most likely aspect—
excluding the already viewed aspects—, based on the previously
calculated conditional probabilities and the most likely volume
currently viewed, in order to verify whether it is indeed this
hypothesized volume that is in the field of view. Then the algo-
rithm described above is repeated.
The main innovation of the paper is the combination of several
ideas in computer vision (attention, object recognition, viewpoint
control) in a single framework. Limitations of the paper include
the assumption that objects can be represented as constructions
of volumetric parts—which is difficult for various objects such as
clouds or trees—, and its reliance on salient homogeneous regions
in the image for the segmentation. Real objects contain a lot of de-
tails, and the segmentation is in general difficult. Notice that there
is room for improvement in the attentive mechanisms used. No
significant effort is made to create a model that adaptively adjusts
its expressive power during learning, potentially making proper
training of the model somewhat of an art and dependent on manual
intervention by the user. As is the case with many of the papers
described so far, the model relies heavily on the extraction of
edges and corners which might make it difficult to distinguish an
object based on its texture or color. Within the context of Fig. 1,
the work by Dickinson et al. [280] proposes an active vision frame-
work for improving all the components of the standard vision pipe-
line. This also includes the ‘object databases’ component, since the
use of a hierarchy is meant to provide a spacewise efficient repre-
sentation of the objects.
Schiele and Crowley [281] describe the use of a measure called
transinformation for building a robust recognition system. The
authors use this to describe a simple and robust algorithm for
determining the most discriminating viewpoint of an object.
Fig. 26. The face recovery and attention mechanism used in [280] (diagram
adapted from [280]).
Fig. 27. The object verification and next viewpoint selection algorithm used in
[280] (diagram adapted from [280]).
Spectacular recognition rates of nearly 100% are presented in the
paper. The main idea of the paper is to represent the 3D objects
by using the probability density function of local 2D image charac-
teristics acquired from different viewpoints. The authors use
Gaussian derivatives as the local characteristics of the object. These
derivatives allow them to build histograms (probability density
functions) of the image resulting after the application of the filter.
Assuming that a measurement set $M$ of some local characteristics
$\{m_k\}$ is acquired from the image (where the local characteristics
might be for example the x-coordinate derivatives or the image’s
Laplacian) they obtain a probability distribution
$p(M \mid o_n, R, T, P, L, N)$ for the object $o_n$ (where R, T, P, L, N
denote the rotation, translation, partial occlusion, light changes and
noise). The authors argue that for various reasons (the filters they
use, the use of histograms and so on) the distribution is conditionally
independent of various of these variables and it suffices to build
histograms for $p(M \mid o_n, S)$ where S denotes the rotation and one
of the translation parameters. The authors define the quantity
$$p(o_n, S_j \mid \{m_k\}) = \frac{p(\{m_k\} \mid o_n, S_j)\, p(o_n, S_j)}{\sum_{n', j'} p(\{m_k\} \mid o_{n'}, S_{j'})\, p(o_{n'}, S_{j'})}$$
which gives the probability of object $o_n$ in pose $S_j$ occurring
given that we know the resulting images under set $\{m_k\}$ of filters.
The probabilities on the right hand side are known through the
histogram based probability density estimation we described
above. We can use this probability to recognize the object we
currently view and its pose by simply maximizing the probability
over all values for variables n, j. Test results that the authors cite
indicate that this performs very well even in cases where only
40% of the object is visible. The authors then describe the object
recognition process in terms of the transmission of information.
The quantity
$$T(O; M) = \sum_{n} \sum_{k} p(o_n, m_k) \log \frac{p(o_n, m_k)}{p(o_n)\, p(m_k)}$$
(for the sets O, M of the objects and image features respectively) is
the transinformation. Intuitively, the lower the quantity, the
‘‘closer’’ the two sets are to being statistically independent, implying
that one set’s values do not affect the other set’s values. This is used
to choose the salient viewpoints of an object and thus, provide an
algorithm for active object detection. By rewriting the previous
equation for transinformation as
$$T(O; M) = \sum_{n} p(o_n)\, T(o_n; M)$$
we see that the transinformation can be interpreted as the average,
over all objects $o_n$, of the per-object transinformation
$T(o_n; M) = \sum_{k} p(m_k \mid o_n) \log \frac{p(m_k \mid o_n)}{p(m_k)}$.
By going one step further and incorporating the pose $S_j$ of an object
in the previous definition of transinformation we get
$$T(o_n, S_j; M) = \sum_{k} p(m_k \mid o_n, S_j) \log \frac{p(m_k \mid o_n, S_j)}{p(m_k)}$$
and we see that we can find the most significant viewpoints of an
object by finding the maximum over all j of this equation. The
authors use this last formula to hypothesize the object identity
and pose from an image. Then, they again use this last formula to
estimate the most discriminating viewpoint for the hypothesized
object, move the camera to that viewpoint, perform verification
and proceed until some threshold is passed, indicating that the ob-
ject has been identified.
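The viewpoint selection step described above can be illustrated with a minimal Python sketch. It assumes that the histograms p(m_k | o_n, S_j) have already been estimated and are stored as plain lists, one per (object, pose) pair. The function names, the toy histograms, the uniform priors and the use of the natural logarithm are assumptions made for illustration and are not taken from [281].

```python
import math


def marginal(histograms, priors):
    """p(m_k) = sum over (object, pose) of p(object, pose) * p(m_k | object, pose)."""
    n_bins = len(next(iter(histograms.values())))
    p_m = [0.0] * n_bins
    for key, hist in histograms.items():
        for k, p in enumerate(hist):
            p_m[k] += priors[key] * p
    return p_m


def view_transinformation(hist, p_m):
    """T(o_n, S_j; M) = sum_k p(m_k | o_n, S_j) * log(p(m_k | o_n, S_j) / p(m_k))."""
    # Bins with zero conditional probability contribute nothing to the sum.
    return sum(p * math.log(p / q) for p, q in zip(hist, p_m) if p > 0.0)


def most_discriminating_pose(histograms, priors, obj):
    """Return the (object, pose) key of `obj` with the largest transinformation."""
    p_m = marginal(histograms, priors)
    scores = {key: view_transinformation(hist, p_m)
              for key, hist in histograms.items() if key[0] == obj}
    return max(scores, key=scores.get)


# Hypothetical toy data: two objects, two poses, three histogram bins.
histograms = {
    ("cup", 0): [0.5, 0.5, 0.0],
    ("cup", 90): [0.1, 0.1, 0.8],   # only the cup at 90 degrees fills the third bin
    ("box", 0): [0.5, 0.5, 0.0],
    ("box", 90): [0.4, 0.6, 0.0],
}
priors = {key: 0.25 for key in histograms}
best = most_discriminating_pose(histograms, priors, "cup")
```

In this toy setting the 90 degree pose of the cup is selected, since it is the only view whose histogram differs markedly from the marginal distribution of the measurements.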
Overall, the main advantage of the paper is that it provides an
elegant and simple method to perform object recognition. The test
results provide strong evidence of the power of the active object
recognition framework. The more verification steps performed,
the lower the misclassification rate. A drawback of the method is
that it has not been tested on much larger datasets, and little work
has been done to see how it performs under non-uniform back-
grounds. Furthermore, a question arises on the algorithm’s perfor-
mance as the errors in the estimation of the camera position
increase. As discussed in [45], the implications could be significant.
Similarly to the above paper, Borotschnig et al. [282] use an
information theoretic based quantity (entropy) in order to decide
the next view of the object that the camera should take to recog-
nize the object and obtain more robust object recognition in the
presence of ambiguous viewpoints. The approach uses an appear-
ance based recognition system (inspired by Murase and Nayar’s
[141] popular PCA based recognition algorithm) that is augmented
by probability distributions. The paper begins by describing how to
obtain an eigenbasis of all the objects in our database from all
views. Then, given a new image, the algorithm can project that
new image on the eigenbasis to obtain a point g in that basis,
denoting the image. Denote by $p(g \mid o_i, \varphi_j)$ the probability
of point g occurring in the eigenspace of all objects that are
projecting an image of object $o_i$ with pose parameters $\varphi_j$.
Under ideal circumstances $p(g \mid o_i, \varphi_j)$ would be a spike
function. In other words, the function would be zero for all values of
g, except for one value for which it would be equal to 1. However, due
to various sources of error (fluctuations in imaging conditions, pan,
tilt, zoom errors, segmentation errors etc.) the authors estimate this
probability from a set of sample images with fixed $o_i$ and
$\varphi_j$ values. The probability density function is modeled as a
multivariate Gaussian with mean and standard deviation estimated from
the sample images. By Bayes’ theorem it can be shown that
$$p(o_i, \varphi_j \mid g) = \frac{p(g \mid o_i, \varphi_j)\, p(\varphi_j \mid o_i)\, p(o_i)}{p(g)}$$
In the experiments the authors assumed that $p(o_i)$ and $p(\varphi_j)$
are uniformly distributed. In their test cases the authors choose a
number of bins in which they will discretize the possible number of
viewpoints and use them to build these probability distribution
functions. Then, given some vector g in the eigenspace of shapes, the
conditional probability of seeing object $o_i$ is given by
$p(o_i \mid g) = \sum_{j} p(o_i, \varphi_j \mid g)$. By iterating over
all the objects in the database and finding the most likely object,
objects are recognized. The
authors then further expand on this idea and present an algorithm
for actively controlling the camera. They show that in cases where
the object database contains objects that share similar views, the ac-
tive object recognition framework leads to striking improvements.
The key to this is the use of planned camera movements that lead
Fig. 28. Graphical model for next-view-planning as proposed in [284,285].
to viewpoints from which the object appears distinct. Note also that
the authors use only one degree of freedom for rotating around the
object along a constant radius. However, extensions to arbitrary
rotations should be straightforward to implement. The authors define
a metric $s(\Delta\psi)$ which gives the average entropy reduction to
the object identity if the point of view is changed by $\Delta\psi$.
Since there is a discrete number of views, finding the optimal
$\Delta\psi$ is a simple linear search problem. The authors draw three
major conclusions based on their results: (a) The dimension of the
eigenspace can be lowered significantly if active recognition is
guiding the object classification. In other words, active recognition
might open the way to the use of very large object databases, suitable
for real world applications. (b) Even objects that share most views can
be successfully disambiguated. (c) The number of steps needed to obtain
good recognition results is much lower than with random camera
placement, again indicating the usefulness of the algorithm (2.6 vs.
12.8 steps on average). This last point is further supported in [24].
The three above points demonstrate how an active vision framework
might decrease the size of the
object database needed to represent an object, and help improve the
object hypotheses and verification phase, by improving the disam-
biguation of objects that share many views (see Fig. 1).
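The linear search over candidate viewpoint changes can be sketched as follows. This is a simplified illustration, not the implementation of [282]: the Gaussian eigenspace densities are replaced by a hypothetical discrete table `likelihood[(object, view)][symbol]`, the belief is a dictionary over (object, pose) hypotheses, and all names and toy values are assumptions.

```python
import math


def entropy(dist):
    """Shannon entropy (natural log) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def object_marginal(belief):
    """Sum the belief over poses to obtain the distribution over object identity."""
    marg = {}
    for (obj, _pose), p in belief.items():
        marg[obj] = marg.get(obj, 0.0) + p
    return marg


def expected_entropy_reduction(belief, likelihood, move, n_views, n_symbols):
    """Average reduction in object-identity entropy if the view changes by `move`."""
    h_now = entropy(object_marginal(belief))
    h_expected = 0.0
    for g in range(n_symbols):
        # Joint probability of each hypothesis and of observing symbol g after the move.
        joint = {(o, ph): p * likelihood[(o, (ph + move) % n_views)][g]
                 for (o, ph), p in belief.items()}
        p_g = sum(joint.values())
        if p_g == 0.0:
            continue  # this observation cannot occur under the current belief
        posterior = {h: p / p_g for h, p in joint.items()}
        h_expected += p_g * entropy(object_marginal(posterior))
    return h_now - h_expected


def best_move(belief, likelihood, moves, n_views, n_symbols):
    """Linear search over the discrete set of candidate viewpoint changes."""
    return max(moves, key=lambda d: expected_entropy_reduction(
        belief, likelihood, d, n_views, n_symbols))


# Hypothetical toy problem: objects A and B look identical except at view 2.
likelihood = {(obj, view): [1.0, 0.0] for obj in ("A", "B") for view in range(4)}
likelihood[("B", 2)] = [0.0, 1.0]
belief = {("A", 0): 0.5, ("B", 0): 0.5}
chosen = best_move(belief, likelihood, [1, 2, 3], n_views=4, n_symbols=2)
```

In the toy problem only a move of two view steps removes the ambiguity, so it is the move selected by the search; a random placement strategy would on average need more steps to reach that view.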
These ideas were further expanded upon by Paletta and Prantl
[283], where the authors incorporated temporal context as a
means of helping disambiguate initial object hypotheses. Notice
that in their previous work, the authors treated all the views as
‘‘bags of features’’ without taking advantage of the view/temporal
context. In [283] the authors work on this shortcoming by adding a
few constraints to their probabilistic quantities. They add temporal
context to their probabilistic formulation by encoding that the
probability of observing a view $(o_i, \varphi_j)$ due to a viewpoint
change must be equal to the probability of observing view
$(o_i, \varphi_j - \Delta\varphi)$. This leads to a slight change in the
Bayesian equations
used to fuse the data and leads to an improvement in recognition
performance. In [301] the authors use a radial-basis function based
network to learn object identity. The authors point out that the on-
line evaluation of the information gain, and most probabilistic
quantities as a matter of fact, are intractable, and therefore, learned
mappings of decision policies have to be applied in next view plan-
ning to achieve real-time performance.
Roy et al. [284] present an algorithm for pose estimation and
next-view planning. A novelty of this paper is that it presents an
active object recognition algorithm for objects that might not fit
entirely in the camera’s field of view and does not assume cali-
brated intrinsic parameters. In other words it improves the feature
grouping and object hypothesis modules of the standard recogni-
tion pipeline (see Fig. 1), through the use of a number of invariants
that enable the recognition of objects which do not fit in a camera’s
field of view, and thus are not recognizable using a passive ap-
proach to vision. It should be pointed out that this was the first ac-
tive recognition/detection system to tackle this important and
often encountered real world problem. The paper introduces the
use of inner camera invariants for pose estimation. These image
computable quantities, in general, do not depend on most intrinsic
camera parameters, but assume a zero skew. The authors use a
probabilistic reasoning framework that is expressed in terms of a
graphical model, and use this framework for next-view planning
to further help them with disambiguating the object. Andreopoulos
and Tsotsos [239] also present an active object localization algo-
rithm that can localize objects that might not fall entirely in the
sensor’s field of view (see Fig. 23). Overall this system was shown
to be robust in the case of occlusion/clutter. A drawback of the
method is that it was only tested with simple objects that contained
parallelograms. It would be interesting to see how the method
extends to objects containing more complicated features. Again, its
sensitivity to dead-reckoning errors is
not investigated.
Roy and Kulkarni [285] present a related paper with a few
important differences. First of all, the paper does not make use of
invariant features as [284] does. Furthermore, the graphical model
is used to describe an appearance based aspect graph: features $q$
represent the aspects of the various objects in our database, and
the classes $C_i$ represent the set of topologically equivalent aspects.
These aspects might belong to different parts of the same object, or
to different objects altogether, yet they are identical with respect
to the features we measure. For each class $C_i$ the authors build
an eigenspace $U_i$ of object appearances. Given any image I, they
find the eigenspace parameter c, and affine transformation parameter
a, that would minimize
$$\rho\big(I(x + f(x; a)) - [U_i c](x);\ \sigma\big) \qquad (14)$$
where $\rho$ is a robust error function, $\sigma$ is a scale parameter
and f is an affine transformation. They use this c to find the most
likely class $C_i$ corresponding to the object. The probabilities are
estimated by the reconstruction error induced by projecting the image I
on each one of the class eigenspaces $U_i$. The smaller the
reconstruction error, the more likely we have found the corresponding
class. Then, the a priori estimated probabilities $P(q)$ are used to
find the most likely object $O$ corresponding to the viewed image. If
the probability of
the most likely object is not high enough, we need to move to a next
view to disambiguate the currently viewed object. The view-plan-
ning is similar to that of paper [284], only that there is just 1 degree
of freedom in this paper (clockwise or counter clockwise rotation
around some axis). By using a heuristic that is very similar to the
one in paper [284] and based on knowledge from previously viewed
images of the object, the authors form a list of the camera move-
ments that we should make to disambiguate the object. This proce-
dure is repeated until the object is disambiguated.
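The role of the reconstruction error in selecting a class can be illustrated with a short Python sketch. It is a deliberately simplified stand-in for the procedure above: the affine warp f is omitted, the robust error function is replaced by a plain squared error, and each class eigenspace is assumed to be given as a list of orthonormal basis vectors. All names and the three-pixel toy "images" are hypothetical.

```python
def project(image, basis):
    """Coefficients c of the image in an orthonormal basis (one dot product per vector)."""
    return [sum(b * x for b, x in zip(vec, image)) for vec in basis]


def reconstruct(coeffs, basis):
    """Linear combination of the basis vectors weighted by the coefficients."""
    n = len(basis[0])
    return [sum(c * vec[i] for c, vec in zip(coeffs, basis)) for i in range(n)]


def reconstruction_error(image, basis):
    """Squared error between the image and its projection onto the class eigenspace."""
    approx = reconstruct(project(image, basis), basis)
    return sum((x - y) ** 2 for x, y in zip(image, approx))


def most_likely_class(image, class_bases):
    """The class whose eigenspace reconstructs the image with the smallest error."""
    errors = {name: reconstruction_error(image, basis)
              for name, basis in class_bases.items()}
    return min(errors, key=errors.get), errors


# Hypothetical toy eigenspaces over three-pixel "images".
class_bases = {
    "C1": [[1.0, 0.0, 0.0]],
    "C2": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
best_class, errors = most_likely_class([0.1, 2.0, 1.0], class_bases)
```

How the errors are turned into probabilities is not specified in the text above, so the sketch only ranks the classes by error and leaves that mapping open.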
The authors use the COIL-20 object database from Columbia
University to do their testing. The single-view based correct recog-
nition rate was 65.70% while the multi-view recognition rate in-
creased to 98.19%, indicating the usefulness of the recognition
results and the promise in general of the active object recognition
framework under good dead-reckoning. Furthermore, the average
number of camera movements to achieve recognition was 3.46
vs. 5.40 moves for the case of random camera movements, again
indicating the usefulness of the heuristic the authors defined for
deciding the next view. Notice that this is consistent with the re-
sults in [24,282]. Disadvantages of the paper include the testing
of the method on objects with only a black background (of appar-
ently little occlusion) and the use of only a single degree of free-
dom in moving the camera to disambiguate the object.
Hutchinson and Kak [286] present one of the earliest attempts
at active object detection. The authors generalize their work by
assuming that they can have at their disposal lots of different sen-
sors (monocular cameras, laser range-finders, manipulator fingers
etc.). Thus, within the context of the standard object recognition
pipeline (Fig. 1), this is an example of a system that combines mul-
tiple types of feature extractors. It also represents one of the earli-
est active approaches for object hypothesis and verification. Each
one of those sensors provides various surface features that can
be used to disambiguate the object. These features include surface
normal vectors, Gaussian and Mean curvatures, area of each poly-
hedral surface and orientation, amongst others. By creating an as-
pect graph for each object and by associating with each aspect the
features corresponding to the surfaces represented by that aspect,
the algorithm can formulate hypotheses as to the objects in a data-
base that might correspond to the observed object. The authors
then do a brute force search on all the aspects of each aspect graph
in the hypotheses, and move the camera to a viewpoint of the ob-
ject that will lead to the greatest reduction in the number of
hypotheses. In general, this is one of the first papers to address
the active object detection problem. A disadvantage of this paper
is the oversimplifying assumption of polyhedral objects. Another
disadvantage is the heuristic used to make the camera movements
since in general it gives no guarantees that this sensor movement
will be optimal in terms of the number of movements till
recognition takes place. Notice that complexity issues need to be
addressed, since in practice the aspect graphs of objects are quite
large and can make brute force search through the aspects of all
aspect graphs infeasible. Furthermore, as is the case with most
of the active object recognition algorithms described so far, the
issue of finding the optimal sequence of actions subject to a time
constraint is not addressed.
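A brute force viewpoint search of this kind can be sketched in a few lines of Python. The sketch assumes a hypothetical table `predicted[(hypothesis, view)]` giving the feature each hypothesis would produce from each view, and it scores a view by the worst-case number of hypotheses that would survive the observation. This worst-case criterion and all names are assumptions for illustration and are not claimed to be the exact rule used in [286].

```python
from collections import Counter


def worst_case_remaining(hypotheses, predicted, view):
    """Largest group of hypotheses that predict the same feature from `view`."""
    counts = Counter(predicted[(h, view)] for h in hypotheses)
    return max(counts.values())


def next_view(hypotheses, predicted, views):
    """Brute force search for the view leaving the fewest hypotheses in the worst case."""
    return min(views, key=lambda v: worst_case_remaining(hypotheses, predicted, v))


# Hypothetical toy table of predicted features.
hypotheses = ["cube", "wedge", "prism"]
predicted = {
    ("cube", "top"): "square", ("wedge", "top"): "square", ("prism", "top"): "triangle",
    ("cube", "side"): "square", ("wedge", "side"): "triangle", ("prism", "side"): "rectangle",
}
chosen_view = next_view(hypotheses, predicted, ["top", "side"])
```

The cost of the search grows with the number of hypotheses times the number of candidate views, which is the complexity concern raised above for large aspect graphs.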
Gremban and Ikeuchi [287] investigate the sensor planning
phase of object recognition, and thus their work constitutes an-
other effort in improving the object hypothesis and verification
of the standard recognition pipeline (see Fig. 1). Like many of the
papers described in this survey, the algorithm uses aspect graphs
to determine the next sensor movement. Similarly to [285] and
[284], the authors of this paper make use of so called congruent as-
pects. In a computer vision system, aspects can be defined in var-
ious ways. The most typical way of defining them is based on the
set of visible surfaces or the presence/absence of various features.
Adjacent viewpoints over a contiguous object region, for which the
features defining the aspect remain the same, give an aspect equiv-
alence class. In practice, however, researchers who work with as-
pect graphs have noticed that the measured features can be
identical over many disparate viewpoints of the object. This makes
it impossible to determine the exact aspect viewed. These indistin-
guishable aspects which share the same features are called congru-
ence classes. The authors argue that any given feature set will
consist of congruent aspects and this is responsible for the fact that
virtually every object recognition system uses a unique feature
set—in order to improve the performance of the algorithm on that
particular domain and distinguish between the congruent aspects.
Other reasons why congruent aspects might arise include noise
and occlusion. The authors argue that, since congruent aspects cannot
be avoided, sensing strategies are needed to discriminate them.
In Fig. 29 we give an example of the aspects of an object and its
congruence classes, where the feature used to define the aspects
is the topology of the viewed surfaces in terms of the visible edges.
The authors use Ikeuchi and Kanade’s aspect classification algo-
rithm [41] to find the congruence class corresponding to the aspect
viewed by the camera. The camera motion is used to decide the as-
pect that this particular class corresponds to. This is referred to as
aspect resolution. This enables the system to recognize whether the
image currently viewed contains the target object. The authors de-
fine a class restricted observation function X(w, h) that returns the
congruence class currently viewed by the camera. The variable w
defines the angle of rotation of the sensor around some axis in
the object’s coordinate system—the authors assume initially that
the only permissible motion is rotation around one axis—and h de-
notes the rotation of the object with respect to the world coordi-
nate frame. An observation function X(w, h) can be constructed
for the object model that is to be identified in the image. The
authors discuss in the paper only how to detect instances of a sin-
gle object, not how to perform image understanding. The authors
initially position the camera at w = 0—they assume that the object
they wish to recognize is manually positioned in front of the cam-
era with an appropriate pose—and estimate the congruence class c
that is currently viewed by investigating the extracted features
(see Fig. 30). By scanning through the function X(w, h) they find
the set of values of h, (if any), for which X(0, h) = c. If no values
of h satisfy this function, the object viewed is not the one they
are searching for. Otherwise, by using a heuristic, the authors move
the camera to a new value of w, estimate the congruence class cur-
rently viewed by the camera and use this new knowledge to fur-
ther constrain the values of h satisfying this new constraint (see
Fig. 30). If they end up with a single interval of values of h that sat-
isfy all these constraints, they have recognized an instance of the
object they are looking for. The authors can also use this knowl-
edge to extrapolate the aspect that the sensor is currently viewing,
and thus, achieve aspect resolution. The authors describe various
data structures for extending this idea to more than a single degree
of camera motion.
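The constraint propagation behind aspect resolution can be sketched as follows. The sketch assumes a discretized set of candidate object orientations h and a hypothetical observation function `observed_class(w, h)` standing in for X(w, h); the particular class layout and all names are invented for illustration.

```python
def observed_class(w, h):
    """Hypothetical observation function X(w, h): congruence class seen at sensor
    angle w when the object is rotated by h (both in degrees)."""
    rel = (w - h) % 360
    if rel < 90:
        return "A"
    if rel < 180:
        return "B"
    if rel < 270:
        return "A"  # a different aspect that is congruent to the first one
    return "C"


def consistent_orientations(observation, readings, orientations):
    """Keep every orientation h for which observation(w, h) == c for all readings (w, c)."""
    return [h for h in orientations
            if all(observation(w, h) == c for w, c in readings)]


orientations = range(0, 360, 30)

# One reading leaves several disjoint candidate orientations ...
after_first = consistent_orientations(observed_class, [(0, "A")], orientations)
# ... and a second reading from a new sensor angle narrows them down.
after_second = consistent_orientations(
    observed_class, [(0, "A"), (90, "B")], orientations)
# A sequence of readings that no orientation can explain rejects the object.
rejected = consistent_orientations(
    observed_class, [(0, "B"), (90, "B")], orientations)
```

An empty result corresponds to the case where the viewed object is not the target, while a result forming a single interval of h corresponds to a successful recognition.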
Dickinson et al. [49] quantify an observation that degenerate
views occupy a significant fraction of the viewing sphere surround-
ing an object and show how active and purposive control of the
sensor could enable such a system to escape from these degenera-
cies, thus leading to more reliable recognition. A view of an object
is considered degenerate if at least one of the two conditions below
holds (see Fig. 31): a zero-dimensional (point-like) object feature is
collinear with both the front nodal point of the lens and either:
(a) another zero-dimensional object feature, or
(b) some point on a line (finite or infinite) defined by two
zero-dimensional object features.
The paper gives various examples of when such degeneracies
might occur. An example of degeneracy is when we have two cubes
such that the vertex x of one cube is touching a point y on an edge
of the other cube. If the front nodal point of the lens lies on the line
defined by points x, y the authors say that this view of the object is
degenerate. Of course, in the case of infinite camera resolution, the
chances of this happening are virtually non-existent. However,
cameras have finite resolution. Therefore, the chances of degener-
acies occurring are no longer negligible.
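The effect of finite resolution on the first type of degeneracy can be made concrete with a small Python sketch. It treats two point features as degenerate when the angle they subtend at the front nodal point falls below an angular resolution threshold. Modelling finite resolution by a single angular threshold, as well as the function names and the numbers used, are assumptions for illustration and are not taken from [49].

```python
import math


def angular_separation(nodal, p, q):
    """Angle (radians) subtended at the nodal point by the 3D points p and q."""
    u = [a - b for a, b in zip(p, nodal)]
    v = [a - b for a, b in zip(q, nodal)]
    dot = sum(a * b for a, b in zip(u, v))
    cross = [u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0]]
    return math.atan2(math.sqrt(sum(c * c for c in cross)), dot)


def is_degenerate(nodal, p, q, resolution_rad):
    """True when the two features cannot be separated at the given resolution."""
    return angular_separation(nodal, p, q) < resolution_rad


nodal_point = (0.0, 0.0, 0.0)
vertex_x = (0.0, 0.0, 1.0)
near_collinear = is_degenerate(nodal_point, vertex_x, (0.001, 0.0, 2.0), 0.001)
well_separated = is_degenerate(nodal_point, vertex_x, (0.1, 0.0, 2.0), 0.001)
```

With an infinitely fine resolution (a threshold of zero) only exact collinearity would count, which is why the degenerate set has negligible measure in that idealized case.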
The authors conduct various experiments under realistic
assumptions and observe that for a typical computer vision setup
the chances of degenerate views are not negligible and can be as
high as 50%. They also tested a parameterization which partially
matched the human foveal acuity of 20 seconds of arc, and noticed that
the probability of degeneracies is then extremely small. The authors
argue that this is one reason why the importance of degenerate
views in computer vision has been traditionally underestimated.
Obviously an active vision system could be of immense help in dis-
ambiguating these degeneracies. The authors argue that if the goal
is to avoid the degenerate views in a viewer-centered object repre-
sentation or to avoid making inferences from such viewpoints, the
vision systems must have a system for detecting degeneracies and
actively controlling the sensor to move it out of the degeneracy.
One solution to the problem of reducing the probability of degen-
eracy—or reducing the chance of having to move the camera—is to
Fig. 29. The aspects of an object and its congruence classes (adapted from Gremban
and Ikeuchi [287]).
simply change the focal length of the camera to increase the reso-
lution in the region of interest. The analysis performed in the paper
indicates that it is important to compensate for degeneracies in
computer vision systems and also further motivates the benefits
of an active approach to vision. Intelligent solutions to the view-
degeneracy problem can decrease the probability of executing
expensive and unnecessary camera movement to recognize an ob-
ject. Within the context of the recognition pipeline in Fig. 1, we see
that these degeneracies could potentially affect all the modules in
the pipeline, from the quality of the low-level features extracted, to
the way the features are grouped, and to the reliability of the final
object verification.
Herbin [288] presents an active recognition system whose ac-
tions can influence the external environment (camera position)
or the internal recognition system. The author assumes the
processing of segmented images, and uses the silhouette of the
objects—chess pieces—to recognize the object. The objects are
encoded in aspect graphs, where each aspect contains the views
with identical singularities of the object’s contour. Each view is en-
coded by a constant vector indicating whether a convex point, a
concave point or no extremum was found. Three types of actions
are defined: A camera movement by 5 degrees upwards or down-
wards and a switch between two different feature detection scales.
The author defines a training phase for associating an action $a_t$ at
time t given the sequence of states up until time t. This simply
learns the permissible actions for a certain object. Standard Bayes-
ian methods determine whether there is high enough confidence
so far on the object identity, or whether more aspects should be
Kovacic et al. [289] present a method for planning view se-
quences to recognize objects. Given a set of objects and object
views, where the silhouette of each object view is characterized
by a vector of moment-based features, the feature vectors are clus-
tered. Given a detected silhouette, the corresponding cluster is
determined. For each candidate new viewpoint, the object vectors
in the cluster are mapped onto another feature set of the same ob-
jects but from the new viewpoint. A number of different mappings
are attempted—where each mapping depends on the next poten-
tial view—and each mapping’s points are clustered. The next view
which results in the greatest number of clusters is chosen, since
this will on average lead to the quickest disambiguation of the ob-
ject class. This procedure is repeated until clusters with only one
feature vector remain, at which point recognition is possible.
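The selection rule can be sketched in Python under simplifying assumptions: the feature vectors of the remaining candidate objects are assumed to be known for every candidate view, and clustering is done by a simple single-linkage grouping with a fixed distance threshold. The clustering method, the threshold and all names are illustrative choices and not necessarily those of [289].

```python
import math


def count_clusters(points, radius):
    """Number of single-linkage clusters: points closer than `radius` are linked."""
    unvisited = set(range(len(points)))
    clusters = 0
    while unvisited:
        stack = [unvisited.pop()]
        clusters += 1
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= radius}
            unvisited -= near
            stack.extend(near)
    return clusters


def best_next_view(features_by_view, candidates, radius):
    """Choose the view whose feature vectors split the candidates into most clusters."""
    return max(features_by_view, key=lambda view: count_clusters(
        [features_by_view[view][obj] for obj in candidates], radius))


# Hypothetical moment-based features of three candidate objects from three views.
features_by_view = {
    "left": {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.05, 0.05)},
    "top": {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.05, 0.0)},
    "right": {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0)},
}
next_view_choice = best_next_view(features_by_view, ["a", "b", "c"], radius=0.2)
```

When the chosen view yields one cluster per candidate, as the "right" view does here, a single further observation suffices to identify the object.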
Denzler and Brown [290] use a modification of mutual informa-
tion to determine optimal actions. They determine the action a
that leads to the greatest conditional mutual information between
the object identity X and the observed feature vector c. Laporte
and Arbel [291] build upon this work and choose the best next
viewpoint by calculating the symmetric KL divergence (Jeffrey
divergence) of the likelihood of the observed data given the
assumption that this data resulted from two views of two distinct
objects. By weighing each Jeffrey divergence by the product of the
Fig. 30. An aspect resolution tree used to determine if there is a single interval of values for h that satisfy certain constraints (adapted from [287]).
Fig. 31. The two types of view degeneracies proposed by Dickinson et al. [49].
probabilities of observing the two competing objects and their two
views, they can determine the next view which provides the object
identity hypothesis, thus again demonstrating the active vision
system’s direct applicability in the standard recognition pipeline
(see Fig. 1).
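The weighted Jeffrey divergence criterion can be illustrated with a brief Python sketch. It assumes discrete, strictly positive observation likelihoods stored in a hypothetical table `likelihood[(object, view)]`, a belief over (object, view) hypotheses, and integer view indices so that a camera move is an offset. These representational choices and all names are assumptions and not details of [291].

```python
import math


def jeffrey_divergence(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p) = sum (p_i - q_i) * log(p_i / q_i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def view_score(belief, likelihood, move):
    """Sum of Jeffrey divergences between competing hypotheses of distinct objects,
    each weighted by the product of the two hypothesis probabilities."""
    hyps = list(belief)
    score = 0.0
    for a in range(len(hyps)):
        for b in range(a + 1, len(hyps)):
            (obj1, view1), (obj2, view2) = hyps[a], hyps[b]
            if obj1 == obj2:
                continue  # only pairs of distinct objects compete
            score += (belief[hyps[a]] * belief[hyps[b]] *
                      jeffrey_divergence(likelihood[(obj1, view1 + move)],
                                         likelihood[(obj2, view2 + move)]))
    return score


# Hypothetical toy problem: the two objects look alike after move 1 but not after move 2.
likelihood = {
    ("A", 1): [0.5, 0.5], ("B", 1): [0.5, 0.5],
    ("A", 2): [0.9, 0.1], ("B", 2): [0.1, 0.9],
}
belief = {("A", 0): 0.5, ("B", 0): 0.5}
best = max([1, 2], key=lambda move: view_score(belief, likelihood, move))
```

Because the divergence is zero whenever two hypotheses predict the same observations, views from which the competing objects look alike receive no credit.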
Mishra and Aloimonos [292] and Mishra et al. [293] suggest
that recognition algorithms should always include an active seg-
mentation module. By combining monocular cues with motion or
stereo, they identify the boundary edges in the scene. This supports
the algorithm’s ability to trace the depth boundaries around the
fixation point, which in turn can help in challenging recognition
problems. These two papers provide an example of a different
approach to recognition, where the intrinsic recognition module
parameters are intelligently controlled and are more tightly cou-
pled to changes to the low-level feature cues and their grouping
in the standard recognition pipeline (see Fig. 1).
Finally, Zhou et al. [294] present an interesting paper on feature
selectivity. Even though the authors present the paper as having an
application to active recognition, and cite the relevant literature,
they limit their paper to the medical domain (Ultrasound) by
selecting the most likely feature(s) that would lead to accurate
diagnosis. The authors present three slight modifications to infor-
mation gain and demonstrate how to choose the feature y that
would lead to maximally reducing the uncertainty in classification,
given that a set of features X is used. They perform tests to deter-
mine the strengths and weaknesses of each approach and recom-
mend a hybrid approach based on the presented metrics as the
optimal approach to conditional feature selection. Within the con-
text of an active vision system, feature selection algorithms could
be used to choose the optimal next sensor action.
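The idea of conditional feature selection can be sketched with plain information gain estimated from labelled samples. The code below is a minimal illustration under stated assumptions: it uses the textbook information gain rather than any of the three modified criteria of [294], and the sample layout and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def conditional_gain(samples, observed, candidate):
    """Information gain about the class from `candidate`, given `observed`.

    samples   -- list of (features: dict, label) pairs (toy training data)
    observed  -- dict of feature name -> value that has already been measured
    candidate -- name of the feature being considered next
    """
    matching = [(f, y) for f, y in samples
                if all(f[k] == v for k, v in observed.items())]
    if not matching:
        return 0.0
    before = entropy([y for _, y in matching])
    by_value = {}
    for f, y in matching:
        by_value.setdefault(f[candidate], []).append(y)
    # Expected class entropy after the candidate feature has been measured.
    after = sum(len(ys) / len(matching) * entropy(ys)
                for ys in by_value.values())
    return before - after


def next_feature(samples, observed, candidates):
    """Pick the candidate feature with the largest conditional gain."""
    return max(candidates,
               key=lambda name: conditional_gain(samples, observed, name))
```

In an active vision setting, each candidate feature would correspond to a sensor action, so the same selection rule could rank candidate views.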
While most of the methods discussed in this section mainly
show that active image acquisition makes the problem easier, the
last few papers discussed offer a more general insight for object
recognition, where active image acquisition is tightly coupled
to the more classical vision and recognition modules. Another
general conclusion is that very few of the papers surveyed so far take
into consideration the effects of cost constraints, noise constraints
(e.g., dead-reckoning errors) or object representational power. As
was previously argued [26], taking such constraints into account is
important, since they can lead to a reassessment of proper
strategies for next-view planning and recognition.
3.2. Active object localization and recognition literature survey
We now present an overview of the literature on the active ob-
ject localization and recognition problems. In more recent litera-
ture, the problems are sometimes referred to under the title of
semantic object search. In Table 6 and Chart 8 we compare the algo-
rithms discussed in this subsection, along a number of dimensions.
A general conclusion one can reach is that, on average, the scalability
of inference for active object localization algorithms is worse
than the current state of the art in passive recognition (see Table 7
of Section 4.2 for example). This is partially attributable to the online
requirements of active localization/recognition mechanisms,
which make the construction of such real-time and online systems
a significant challenge.
Notice that in contrast to the Simultaneous Localization and
Mapping (SLAM) problem, in the active object localization problem
the vision system is tasked with determining an optimal sequence
of sensor movements that enable the system to determine the position
of the a priori specified object as quickly as possible. In contrast,
in the SLAM problem, the scene features/objects are usually
learnt/determined online during the map building process. Notice
that within the context of Section 1, the localization and recogni-
tion problems subsume the detection problem, since the detection
problem is a limited/constrained version of the localization and
recognition problems.
When dealing with the vision-based SLAM problem, the issue of
extracting scene structure from a moving platform and using this
information to build a map of the environment emerges. While this
problem also emerges in the active object localization and recogni-
tion problem, in practice, it is typically of secondary importance,
since the main research effort while constructing active object
localization and recognition systems is focused around the creation
of the object recognition module and the creation of the next-view-
point selection algorithm. As it was pointed out at the beginning of
Section 3, active object localization and recognition research on dy-
namic scenes is limited, and in this regard it is less developed than
the structure from motion and SLAM literature.
For example, Ozden et al. [317] indicate that the main requirements
for building a robust dynamic structure from motion framework
include:
• Constantly determining the number of independently moving objects.
• Segmenting the moving object tracks.
• Computing the object 3D structure and camera motion with
sufficient accuracy.
• Resolving geometric ambiguities.
• Achieving robustness against degeneracies caused by occlusion,
self-occlusion and motion blur.
• Scaling the system to non-trivial recording times.
It is straightforward to see that these also constitute important
requirements when constructing an active object localization and
recognition system, since making a recognition system robust to
these challenges would likely require changes to all the compo-
nents of the standard recognition pipeline (see Fig. 1). However,
none of the active localization and recognition systems that we
will survey is capable of dealing with dynamic scenes, demonstrat-
ing that the field is still evolving. Note that this last point differen-
tiates active vision research from dynamic vision research (see
Section 3).
In the active object localization and recognition problems, any
reduction in the total number of mechanical movements involved
would have a significant effect on the search time and the commer-
cial viability of the solution. Thus, a central tenet of the discussion
in this section involves efficient algorithms for locating objects in
an environment subject to various constraints [45,26]. The con-
straints include time constraints, noise rates, and object and scene
representation lengths amongst others. In Table 6 and Chart 8 we
present a comparison, along certain dimensions, for a number of
the papers surveyed in Section 3.2.
Rimey and Brown [302] present the TEA-1 vision system that
can search within a static image for a particular object and that
can also actively control a camera if the object is not within its field
of view. Within the context of Minsky’s frame theory [124] which
we discussed in Section 2.7, the authors define a knowledge repre-
sentation framework that uses ‘‘PART-OF’’, ‘‘IS-A’’ and ‘‘adjacent’’
relationships—a form of contextual knowledge—for guiding the
search. The authors [302] also focus on the decision making algo-
rithms that are used to control the current focus of attention dur-
ing the search for the object. A Bayesian network is used to encode
the confidences regarding the various hypotheses. As the authors
point out, a significant constraint in any vision system that purpo-
sively controls an active sensor, such as a camera, is resource allo-
cation and minimization of the time-consuming camera
movements. Purposiveness is necessary in any active vision system.
The system must attempt specific tasks. Open ended tasks such as
‘‘randomly move the camera around the entire room until the de-
sired object falls in our field of view’’ lack the purposiveness
constraint. A number of papers [282,?,24] have experimentally
demonstrated that random search exhibits a significantly worse
reliability and localization speed than purposive search, giving fur-
ther credence to the arguments given in [302]. This approach to vi-
sion is inspired by the apparent effect that task specification has on
human eye movements. As Yarbus demonstrated [318], human
foveation patterns depend on the task at hand and the fixated ob-
jects seem to be the ones relevant for solving a particular task.
Somehow, irrelevant features are ignored and humans do not
search through the entire scene. This is exactly what Rimey and
Brown are trying to accomplish in their paper, namely, to perform
sequential actions that extract the most useful information and
perform the task in the shortest period of time. Thus, within the
context of the standard recognition pipeline in Fig. 1, this consti-
tutes an effort in improving the object hypothesis generation mod-
ule. The authors provide a nice summary of the main differences in
the selective/attentive approach to vision and the reconstruction-
ist/non-active/non-attentive approach to vision (see Fig. 32).
The authors use two different Bayesian-network-like structures
for knowledge representation: composite nets and two-nets. The
composite net, as its name suggests, is composed of four kinds of
nets: PART-OF nets, IS-A trees, expected area nets and task nets
(see Figs. 34, 33). PART-OF nets are graphical models which use
PART-OF relations to model the feasible structure of the scene
and the associated conditional probabilities (see Fig. 33). Each node
is a Boolean variable indicating the presence or absence of a partic-
ular item. For example, a node might represent a tabletop, its chil-
dren might represent different kinds of tables, and each kind of
table might have nodes denoting the types of utensils located on
the particular table type. Expected area nets have the same struc-
ture as PART-OF nets and identify the area in the particular scene
where the object is expected to be located and the area it will take
up. These are typically represented using 2D discrete random vari-
ables representing the probability of the object being located in a
certain grid location. Also values for the height and width of ob-
jects are typically stored in the expected area net. A relation-map
is also defined which uses the expected area net to specify the rel-
ative location probability of one object given another object. An IS-
A tree is a taxonomic hierarchy representing mutually exclusive
subset relationships of objects (see Fig. 34).
For example, one path in the hierarchy might be object → table-object
→ bowl → black-bowl. A task-net specifies what kind of
scene information could help with solving a recognition problem
but it does not specify how to obtain that information. The two-
net is a simpler version of the composite net, and is useful for
experimental analysis. The authors then define a number of actions
such as moving the camera or applying a simple object detection
algorithm. By iteratively choosing the most appropriate action to
perform, and updating the probabilities based on the evidence pro-
vided by the actions, recognition is achieved. Each action has a cost
and profit associated with it. The cost might include the cost of
moving a camera and the profit increases if the next action is
consistent with the probability table’s likelihoods. Three different
methods for updating the probabilities are suggested. The dum-
my-evidence method sets a user specified node in the composite-
nets and two-nets to a constant value, specifying judgemental val-
ues about the node’s values. The instantiate-evidence method is set
when a specific value of a random variable is observed as true. Finally,
the IS-A evidence approach uses the values output by an action
to update the IS-A net’s probabilities using the likelihood
ratios for some evidence e: $\lambda = p(e \mid S)/p(e \mid \neg S)$, where S denotes
whether a specific set of nodes in the IS-A tree was detected or
not by the action. The cost and profits are used to define a goodness
function which is used to select the best next action. A depth first
search in the space of all action sequences is used to select the best
next action that would minimize the cost and lead to the most
likely estimation of the unknown object or variable. The authors
perform some tests on the problem of classifying whether a partic-
ular tabletop scene corresponds to a fancy or non-fancy meal and
present some results on the algorithm’s performance as the values
of the various costs were adjusted. The method is tested only for
recognizing a single 2D scene.
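The likelihood-ratio style of evidence update mentioned above can be illustrated with the standard odds form of Bayes' rule for a single binary hypothesis S: the posterior odds equal the likelihood ratio times the prior odds. This is a minimal sketch assuming one Boolean node and independent pieces of evidence; the function name is hypothetical and the code does not reproduce the full IS-A tree propagation of [302].

```python
def update_with_likelihood_ratio(prior, ratio):
    """Posterior P(S | e) from the prior P(S) and ratio = p(e | S) / p(e | not S).

    Uses the odds form of Bayes' rule: posterior odds = ratio * prior odds.
    """
    if prior >= 1.0:
        # The hypothesis is already certain; no evidence can change it.
        return 1.0
    odds = ratio * prior / (1.0 - prior)
    return odds / (1.0 + odds)


def update_sequence(prior, ratios):
    """Fold in several pieces of evidence, assumed conditionally independent."""
    belief = prior
    for ratio in ratios:
        belief = update_with_likelihood_ratio(belief, ratio)
    return belief
```

A ratio above one raises the belief that the nodes in question are present, a ratio below one lowers it, and a ratio of exactly one leaves the belief unchanged.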
Fig. 32. Reconstructionist vision vs. selective perception, after Rimey and Brown
Fig. 33. A PART-OF Bayes net for a table-top scenario, similar to what was proposed
by [302].
Fig. 34. An IS-A Bayes tree for a table-top scenario that was used by [302].
Wixson and Ballard [303] present an active object localization
algorithm that uses intermediate objects to maximize the efficiency
and accuracy of the recognition system (see Figs. 35 and 36).
The paper was quite influential and similar ideas are explored in
more recent work [304,319,320]. The system by Wixson and Ballard
[303] incorporates some sort of contextual knowledge about
the scene by encoding the relation between intermediate objects.
Such intermediate objects are usually easy to recognize at low res-
olutions and are, thus, located quickly. Since we typically have
some clues about the target object’s location relative to the inter-
mediate object’s location, we can use intermediate objects to speed
up the search for the target. The authors present a mathematical
model of search efficiency that estimates the factors which affect
search efficiency, and they use these factors to improve search effi-
ciency. They note that in their experiments, indirect search pro-
vided an 8-fold increase in efficiency. As the authors indicate, the
higher the resolution needed to accurately recognize an object,
the smaller the field of view of the camera has to be—because,
for example, we might need to bring the camera closer to the ob-
ject. However, this forces more mechanical movements of the cam-
era to acquire more views of the scene, which are typically quite
time consuming. This indicates a characteristic trade-off in the active
localization literature that many researchers in the field have
attempted to address, namely, search accuracy vs. total search time.
In this work the authors speed up the search through the use of
intermediate objects. An example is the task of searching for a pen-
cil by first locating a desk—since pencils are usually located on
desks. Thus within the context of the standard recognition pipeline
in Fig. 1, this constitutes an effort in improving the feature grouping
and object hypothesis generation module, by using intermediate
objects to influence the grouping probabilities and relevant
hypotheses of various objects or object parts. The authors demon-
strate the smaller number of images required to detect the pencil if
the intermediate object detected was the desk—an almost two-
thirds decrease. The efficiency of a search is defined as $\gamma/T$, where
$\gamma$ is the probability that the search finds the object and $T$ is the
expected time to do the search. The authors model direct and indirect
search. Direct search (see Figs. 35 and 36) is a brute force search
defined in terms of the random variable $R$ denoting the number of
objects detected by our object detection algorithm over a search
sequence spanning the search space, in terms of the probability $\beta$
of detecting a false positive, the number of possible views $V$ for
the intermediate object and $c_j$, the average cost for each view $j$.
Usually $c_j$ is a constant $c$ for all $j$. The success probability of direct
search is
$$\gamma_{\mathrm{direct}} = [1 - P(R = 0)](1 - \beta) \qquad (15)$$
and the expected cost for the direct search is
$$T_{\mathrm{direct}}[P(R), V, c] = \Big(P(R = 0)\,V + \sum_{r=1}^{V} P(R = r)\, s(1, r, V)\Big) \times c \qquad (16)$$
where $s(k, r, V)$ denotes the expected number of images that must
be examined before finding $k$ positive responses, given that $r$ positive
responses can occur in $V$ images. A close look at the underlying
parameters shows that $\beta$ and $P(R)$ are coupled: if everything else
remains constant, a greater number of positive responses (a smaller
value of $P(R = 0)$) causes the expected values of $R$ to be higher,
but it also increases $\beta$.
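To make Eqs. (15) and (16) concrete, the sketch below evaluates the success probability, expected cost and resulting efficiency of a direct search. It rests on one loudly stated assumption that is not taken from [303]: the views are examined in a uniformly random order, in which case the expected number of images examined before the first of r positive responses among V views is (V + 1)/(r + 1). The function and parameter names are hypothetical.

```python
def expected_images_to_first_hit(r, V):
    """s(1, r, V) under the assumption of a uniformly random viewing order.

    With r positive responses hidden among V views, the expected position
    of the first positive response is (V + 1) / (r + 1).
    """
    return (V + 1) / (r + 1)


def direct_search(p_R, V, cost_per_view, false_positive_prob):
    """Evaluate a direct search.

    p_R[r]              -- P(R = r) for r = 0 .. V
    V                   -- number of possible views
    cost_per_view       -- the constant per-view cost c
    false_positive_prob -- probability that a detection is a false positive
    Returns (success probability, expected cost, efficiency).
    """
    # Eq. (15): at least one response, and that response is not a false positive.
    success = (1.0 - p_R[0]) * (1.0 - false_positive_prob)
    # Eq. (16): all V views are examined if nothing responds; otherwise the
    # search stops at the first positive response.
    expected_views = p_R[0] * V + sum(
        p_R[r] * expected_images_to_first_hit(r, V)
        for r in range(1, len(p_R)))
    cost = expected_views * cost_per_view
    return success, cost, success / cost
```

Lowering the detector threshold reduces P(R = 0), which shortens the expected search, but it also raises the false-positive probability, which lowers the success term; the efficiency ratio captures the net effect of this coupling.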
An indirect search model (see Fig. 35) is defined recursively by
applying a direct search around the neighborhood indicated by
each indirectly detected object. The authors perform a number of
tests on some simple scenes using simple object detectors. One
type of test they perform, for example, is detecting plates by first
detecting tables as intermediate objects. An almost 8-fold increase
in detection speed is observed. These mathematical models exam-
ine the conditions under which spatial relationships between ob-
jects can provide more efficient searches. The models and
experiments demonstrate that indirect search may require fewer
images/foveations and increases the probability of detecting an ob-
ject, by making it less likely that we will process irrelevant infor-
mation. As with most early research, the work is not tested on
the large datasets that more recent papers usually are tested on.
Nevertheless, the results are consistent with the results presented
in [24], regarding the significant speed up of object search that is
achieved if we use a purposive search strategy, as compared to ran-
dom search. We should point out that this paper does not take into
account the effects of various cost constraints and dead-reckoning
errors. In contrast, it is mostly concentrated on the next-view-
planner while ignoring somewhat the possible effects due to the
next-view-planner’s synergy with an object detector, in terms of
simulating attentional priming effects to speed up the search for the target.
Brunnström et al. [305,306] present a set of computational
strategies for choosing fixation points in a contextual and task-
dependent manner. As shown in Fig. 37, a number of junctions
are specified, and a grouping strategy for these junctions is speci-
fied, where this grouping strategy is dependent on depth disconti-
nuities (determined by a stereo camera), and also affects the
sensor’s fixation strategy (see Fig. 1). The authors present a meth-
odology for determining the junction type present in the image,
and argue that this strategy could be quite useful for recognizing
an even larger variety of textureless objects.
Ye and Tsotsos [307] provide an early systematic study of the
problem of sensor planning for 3D object search. The authors pro-
pose a sensor planning strategy for a robot that is equipped with a
pan, tilt and zoom camera. The authors show that under a particu-
lar probability updating scheme, the brute force solution to the
problem of object search—maximizing the probability of detecting
the target with minimal cost—is NP-Complete and, thus, propose a
heuristic strategy for solving this problem. The special case of the
problem under Bayesian updating was discussed in [45,322]. The
search agent’s knowledge of object location is encoded as a discrete
probability density, and each sensing action is defined by a view-
point, a viewing direction, a field of view and the application of a
recognition algorithm. The most obvious approach to solving this
problem is by performing a 360° pan of the scene using wide angle
camera settings and searching for the object in this whole scene.
However, this might not work well if we are searching for a small
object that is relatively far away, since the object might be too
small to detect. The authors propose a greedy heuristic approach
to solving the problem, which consists of choosing the action that
maximizes the ratio of the expected object detection probability
to the expected cost of the action. Thus, within the context
of the recognition pipeline in Fig. 1, this constitutes an algorithm
for hypothesizing and verifying the objects present in the scene,
by adjusting the viewpoint parameters with which the object is sensed.
Fig. 35. The way various conditions affect the search for the target object and for
intermediate objects. Dashed entries represent conditions which according to the
model of Wixson and Ballard [303], do not affect the search efficiency. Adapted
from [303].
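The greedy rule described above, choosing the action with the best ratio of expected detection probability to cost, can be sketched as follows. This is a toy illustration under stated assumptions and not the implementation of [307]: the environment is a handful of named cells, each action lists a hypothetical per-cell detection probability and a cost, and the belief is renormalized with Bayes' rule after an unsuccessful action.

```python
def detection_probability(belief, action):
    """Expected probability that `action` detects the target.

    belief           -- dict: cell -> probability that the target is in the cell
    action["detect"] -- dict: cell -> P(detect | target in cell) for this action
    """
    return sum(belief[cell] * d for cell, d in action["detect"].items())


def choose_action(belief, actions):
    """Greedy choice: maximize expected detection probability per unit cost."""
    return max(actions,
               key=lambda name: detection_probability(belief, actions[name])
               / actions[name]["cost"])


def update_after_miss(belief, action):
    """Bayesian update of the belief after the action fails to detect the target.

    Assumes the action could have missed (its detection probability is below 1).
    """
    miss = 1.0 - detection_probability(belief, action)
    return {cell: p * (1.0 - action["detect"].get(cell, 0.0)) / miss
            for cell, p in belief.items()}
```

After a failed zoomed-in view of the most likely cell, probability mass shifts to the cells that were not inspected, so the next greedy choice naturally moves elsewhere.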
Minut and Mahadevan [308] present a reinforcement learning
approach to next viewpoint selection using a pan-tilt-zoom camera.
They use a Markov Decision Process (MDP) and the Q-learning algo-
rithm to determine the next saccade given the current state, where
states are defined as clusters of images representing the same re-
gion in the environment. A simple histogram intersection—using
color information—is used to match an image I with a template
M. If a match is found with a low resolution version of the image,
the camera zooms in, obtains a higher resolution image and verifies
the match. If no match is found (i.e., the desired object is
not found), they use the pan-tilt unit to direct the camera to the
most salient region (saliency is determined by a symmetry operator
defined in the paper) located within one of 8 subregions. Choosing the
subregion to search within is determined by the MDP and the prior
contextual knowledge it has about the room.
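The coarse-to-fine matching step can be sketched with the classic normalized histogram intersection: the sum of bin-wise minima divided by the size of the model histogram. The sketch is an illustration only. The acceptance threshold, the `zoom_in` callback that stands in for acquiring a higher resolution image, and all names are hypothetical, and the reinforcement learning component of [308] is not modelled.

```python
def histogram_intersection(image_hist, model_hist):
    """Normalized histogram intersection between an image and a template.

    Both arguments are equal-length lists of non-negative bin counts.
    The score is 1.0 when the image contains at least the template's
    count in every bin, and 0.0 when the histograms do not overlap.
    """
    overlap = sum(min(i, m) for i, m in zip(image_hist, model_hist))
    return overlap / sum(model_hist)


def coarse_to_fine_match(low_res_hist, model_hist, zoom_in, threshold=0.7):
    """Match at low resolution first; zoom in only to verify a candidate.

    zoom_in   -- hypothetical callback returning the high resolution histogram
    threshold -- hypothetical acceptance threshold on the intersection score
    """
    if histogram_intersection(low_res_hist, model_hist) < threshold:
        return False  # no coarse match, so the costly zoom is skipped
    return histogram_intersection(zoom_in(), model_hist) >= threshold
```

Deferring the zoom until a coarse match exists reflects the general goal of keeping time-consuming camera operations to a minimum.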
Kawanishi et al. [309] use multiple pan-tilt-zoom cameras to detect
known objects in 3D environments. They demonstrate that
with multiple cameras the object detection and localization problems
can become more efficient (2.5 times faster) and more accurate
than with a single camera. The system collects images under various
illumination conditions, object views, and zoom rates, which are
categorized as reference images for prediction (RIP) and verification
(RIV). RIP images are small images that are discriminative for
roughly predicting the existence of the object. RIV images are higher
resolution images for verifying the existence of objects. For each
image region in which a likely object was detected when using the RIP images,
the cameras zoom in, and pan and tilt, in order to verify whether the
object was indeed located at that image region.
More recently, Ekvall et al. [310] integrated a SLAM approach
with an object recognition algorithm based on receptive-field
co-occurrence histograms. Other recent algorithms combine image
saliency mechanisms with bags-of-features approaches [311,312].
Saidi et al. [313] present an implementation, on a humanoid robot,
of an active object localization system that uses SIFT features [72]
and is based on the next-view-planner described by Ye and Tsotsos [307].
Masuzawa and Miura [314] use a robot equipped with vision
and range sensors to localize objects. The range finder is used to
detect free space and vision is used to detect the objects. The
detection module is based on color histogram information and SIFT
features. Color features are used for coarse object detection, and
the SIFT features are used for verification of the candidate object’s
presence. Two planning strategies are proposed. One is for the
coarse object detection and one is for the object verification. The
object detection planner maximizes a utility function for the next
movement, which is based on the increase in the observed area di-
vided by the cost of making this movement. The verification plan-
ner proposes a sequence of observations that minimizes the total
cost while making it possible to verify all the relevant candidate
object detections. Thus, this paper makes certain proposals for
improving the object hypothesis and verification module of the
standard recognition pipeline (see Fig. 1) by using a utility function
to choose the optimal next viewpoint.
Sjöö et al. [315] present an active search algorithm that uses a
monocular camera with zoom capabilities. A robot that is equipped
with a camera and a range finder is used to create an occupancy
grid and a map of the relevant features present in the search
environment. The search environment consists of a number of rooms.
The closest unvisited room is searched next, where the constructed
occupancy grid is used to guide the robot. For each room, a greedy
algorithm is used to select the order in which the room’s viewpoints
are sensed, so that all possible object locations in the map
are sensed. The algorithm uses receptive field co-occurrence
histograms to detect potential objects. If potential objects are located,
the sensor’s zoom settings are appropriately adjusted so that SIFT
based recognition is possible. If recognition using SIFT features is
not possible, this viewpoint hypothesis is pruned (also see Fig. 1),
and the process is repeated until recognition has been possible
for all the possible positions in the room where an object might
be located.
Fig. 36. The direct-search model, which includes nodes that affect direct search efficiency (unboxed nodes) and explicit model parameters (boxed nodes). Adapted from
Wixson and Ballard [303].
Fig. 37. Junction types proposed by Malik [321] and used by Brunnström et al.
[306] for recognizing man-made objects.
Ma et al. [316] use a two-wheeled non-holonomic robot with an
actuated stereo camera mounted on a pan-tilt unit, to search for 3D
objects in an indoor environment. A global search based on color
histograms is used to perform coarse search, somewhat similar in
spirit to the idea of indirect search by Wixson and Ballard [303]
which we previously discussed. Subsequently, a more refined
search (based on SIFT features and a stereo depth extraction algo-
rithm) is used in order to determine the objects’ actual position
and pose. An Extended Kalman Filter is used for sustained tracking
and the A* graph search is used for navigation.
Andreopoulos et al. [24] present an implementation of an online
active object localization system, using an ASIMO humanoid robot
developed by Honda (see Figs. 38, 39). A normalized metric for tar-
get uniqueness within a single image but also across multiple
images of the scene that were captured from different viewpoints,
is introduced. This metric provides a robust probability updating
methodology. The paper makes certain proposals for building more
robust active visual search systems under the presence of various
errors. Imperfect disparity estimates, an imperfect recognition
algorithm, and dead-reckoning errors, place certain constraints
on the conditions chosen for determining when the object of interest
has been successfully localized. A combination of multiple-view
recognition and single-view recognition approaches is used to
achieve robust and real-time object search in an indoor environ-
ment. A hierarchical object recognition architecture, inspired by
human vision, is used [218]. The object training is done by in-hand
demonstration and the system is extensively tested on over four-
hundred test scenarios. The paper demonstrates the feasibility of
using state of the art vision-based robotic systems for efficient
and reliable object localization in an indoor 3D environment. This
constitutes an example of a neuromorphic vision system applied to
robotics, due to the use of (i) a humanoid robot that emulates hu-
man locomotion, (ii) the use of a hierarchical feed-forward recog-
nition system inspired by human vision, and (iii) the use of a
next-view planner that shares many of the behavioral properties
of the ideal searcher [323]. Within the context of the recognition
pipeline in Fig. 1, this constitutes a proposal for hypothesizing
and verifying the objects present in the scene (by adjusting the
viewpoint parameters with which the object is sensed) and for
extracting and grouping low-level features more reliably based
on contextual knowledge about the relative object scale.
As previously indicated, on average, the scalability of inference
for active object localization algorithms is worse than the current
state of the art in passive recognition. This is partially attributable
to the online requirements of active localization/recognition mech-
anisms, which make the construction of such real-time and online
systems a significant challenge. Furthermore, powerful vision sys-
tems implemented on current popular CPU architectures are extre-
mely expensive power-wise. This makes it difficult to achieve the
much coveted mobility threshold that is often a necessary require-
ment of active object localization algorithms.
4. Case studies from recognition challenges and the evolving landscape
In this section we present a number of case studies that exem-
plify the main characteristics of algorithms that have been proven
capable of addressing various facets of the recognition problem.
Based on this exposition we also provide a brief discussion as to
where the field appears to be headed.
4.1. Datasets and evaluation techniques
Early object recognition systems were for the most part tested
on a handful of images. With the exception of industrial inspection
related systems, basic research related publications tended to focus
on the exposition of novel recognition algorithms, with a lesser fo-
cus on actually quantifying the performance of these algorithms.
More recently, however, large annotated datasets of images con-
taining a significant number of object classes, have become readily
available, precipitating the use of more quantitative methodologies
for evaluating recognition systems. Everingham et al. [324] over-
view the PASCAL challenge dataset, which is updated annually
(see Fig. 40). Other popular datasets for testing the performance
of object/scene classification and object localization algorithms include
the Caltech-101 and Caltech-256 datasets (Fei-Fei et al.
[325], Griffin et al. [326]), Flickr groups, the TRECVID dataset
(Smeaton et al. [327]), the MediaMill challenge (Snoek et al. [328]), the
Lotus-Hill dataset (Yao et al. [329]), the ImageCLEF dataset (Sanderson
et al. [330]), the COIL-100 dataset (Nene et al. [331]), the ETH-80
dataset (Leibe and Schiele [332]), the Xerox7 dataset (Willamowski
et al. [333]), the KTH action dataset (Laptev and Lindeberg [334]), the
INRIA person dataset (Dalal and Triggs [335]), the Graz dataset (Opelt
et al. [336]), the LabelMe dataset (Russell et al. [337]), the TinyImages
dataset (Torralba et al. [338]), the ImageNet dataset (Deng et al.
[339]), and the Stanford action dataset (Yao et al. [340]). Notice that
such offline datasets have almost exclusively been applied to passive
recognition algorithms, since active vision systems cannot be easily
tested using offline batches of datasets. Testing an active vision system
using offline datasets would require an inordinate number of
images that sample the entire search space under all possible intrinsic
and extrinsic sensor and algorithm parameters. Typically, such
systems are initially tested using simple simulations, followed by a
significant amount of time that is spent field testing the system.
Fig. 38. An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively
search an indoor environment.
Fig. 39. An example of ASIMO pointing at an object once the target object is
successfully localized in a 3D environment [24].
A number of metrics are commonly used to provide succinct
descriptors of system performance. Receiver Operating Character-
istic (ROC) curves are often used to visualize the true positive rate
versus the false positive rate of an object detector (see Section 1) as
a class label threshold is changed, assuming of course that the algo-
rithm uses such a threshold (note that sometimes in the literature
the false positive rate is also referred to as the false accept rate, and
the false negative rate is referred to as the false reject rate). In cer-
tain cases Detection Error Tradeoff (DET) curves are used to pro-
vide a better visualization of an algorithm performance [341],
especially when small probabilities are involved. The equal error
rate (EER) corresponds to the false positive value FP achieved when
the corresponding ROC curve point maps to a true positive value TP
that satisfies FP = 1 − TP. This metric is convenient as it provides a
single value of algorithm quality (a lower EER value indicates a bet-
ter detector). The area under the curve of an ROC curve is also often
used as a metric of algorithm quality. The use of the average preci-
sion (AP) metric in the more recent instantiations of the PASCAL
challenge has also gained acceptance [324,342]. The average
precision (AP) is defined as
$$\mathrm{AP} = \frac{1}{|R|} \sum_{k=1}^{n} \frac{|R \cap M_k|}{k} \, \chi_k$$
where $R$ is the set of positive examples in the validation or test set
(with $|R|$ its size), $n$ is the number of ranked test samples,
$$\chi_k = \begin{cases} 1 & \text{if the algorithm is correct on the } k\text{th sample} \\ 0 & \text{otherwise} \end{cases}$$
and $M_k = \{i_1, \ldots, i_k\}$ is the list of the top k best performing test set
samples. Standard tests of statistical significance (e.g., t-tests, ANOVA
tests, Wilcoxon rank-sum tests, Friedman tests) are sometimes
used when comparing the performance of two or more algorithms
which output continuous values (e.g., comparing the percentage
of overlap between the automated object localization/segmentation
with the ground-truth segmentation). See [343–345] for a discus-
sion on good strategies for annotating datasets and evaluating rec-
ognition algorithms.
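As an illustration of the metrics above, the following minimal sketch computes the average precision and an approximate equal error rate from a list of detector scores and binary ground-truth labels. The function names and the threshold sweep over the observed scores are assumptions made for this example; they are not taken from the PASCAL evaluation code.

```python
def average_precision(scores, labels):
    """AP = (1/|R|) * sum over ranks k of precision-at-k, counted only
    at ranks where the k-th ranked sample is a true positive."""
    # Rank samples from highest to lowest detector score.
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda p: -p[0])]
    num_pos = sum(1 for lab in ranked if lab)
    if num_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, is_pos in enumerate(ranked, start=1):
        if is_pos:
            hits += 1            # positives among the top k samples
            total += hits / k    # precision at rank k
    return total / num_pos


def equal_error_rate(scores, labels):
    """Approximate EER: sweep the decision threshold over the observed
    scores and keep the point where FP rate is closest to 1 - TP rate.
    Assumes at least one positive and one negative sample."""
    pos = [s for s, lab in zip(scores, labels) if lab]
    neg = [s for s, lab in zip(scores, labels) if not lab]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        fnr = 1.0 - tpr
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2.0
    return eer
```

With a finite test set the ROC curve is a step function, so the EER returned here is only the closest sampled operating point; interpolating between neighbouring points is a common refinement.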
Our discussion on evaluation techniques for recognition algo-
rithms would be incomplete without the presentation of the crit-
icism associated with the use of such datasets. Such criticism is
sometimes encountered in the literature or in conferences on
vision research (see [193,73,194,346] for example). In other words,
the question arises as to how well these datasets and
their associated tests indicate whether progress is
being made in the field of object recognition. One argument is
that the current state-of-the-art algorithms in object recognition
identify correlations in images, and are unable to determine true
causality, leading to fragile recognition systems. An example of
this problem arose in early research on neural networks, where
the task was to train a neural network to determine the presence
or absence of a certain vehicle type in images. The neural
network was initially capable of reliably detecting the objects of
interest from the images of the original dataset. However, on a new
validation dataset of images, the performance dropped drastically.
On careful examination it was determined that in the original data-
set, the images containing the object of interest had on average a
higher intensity. During training, the neural network learned to de-
cide whether the object was present or absent from the image, by
calculating this average image intensity and thresholding this
intensity value. It is evident that in the original dataset there ex-
isted a correlation between average image intensity and the object
presence. However, in the new dataset this correlation was no longer
present, leaving the recognition system unable to generalize to
a new situation that the human visual system handles
almost effortlessly. It has been argued that only correlation
can be perceived from experience, and that determining true
causality is an impossibility. In medical research the mitigation of
such problems is often accomplished through the use of control
groups and the inclusion of placebo groups, which allow the scien-
tist to test the effect of a particular drug by also testing the effect
of the drug under an approximation of a counter-factual state of
the world. However, as experience has shown, and as is often
the case in computer vision research, the results of such
controlled experiments, whose conclusions ultimately rely on
correlations, are often wrong. Ref. [347] analyses the problem, and provides
a number of suggestions as to why this phenomenon occurs, which
we quote below:
• The smaller the case studies, the more likely the findings are false.
• The smaller the effect sizes in a research field, the less likely the
research findings are true. For example a study of the impact of
smoking on cardiovascular disease will more likely lead to correct
results than an epidemiological study that targets a small minority
of the population.
• The greater the number and the lesser the selection of tested
relationships in a scientific field, the less likely the research findings are
to be true. As a result, confirmatory designs such as large controlled
trials are more likely true, than the results of initial
hypothesis-generating experiments.
• The greater the flexibility in designs, definitions, outcomes and
analytical models in a scientific field, the less likely the research
findings are to be true. For example, flexibility increases the potential
of turning ‘‘negative’’ results into ‘‘positive’’ results. Similarly, fields
that use stereotyped and commonly agreed analytical methods,
typically result in a larger proportion of true findings.
Fig. 40. The twenty object classes that the 2011 PASCAL dataset contains. Some of
the earlier versions of the PASCAL dataset only used subsets of these object classes.
Adapted from [324].
Geoff Hinton, personal communication.
• The greater the financial and other interests and prejudices in a
scientific field, the less likely the research findings are to be true. As
empirical evidence shows, ‘‘expert opinion’’ is extremely unreliable.
• The hotter a scientific field, with more scientific teams involved, the
less likely the research findings are true.
The fact that usually only positive results supporting a particu-
lar hypothesis are submitted for publication, while negative results
not supporting a particular hypothesis are often not submitted for
publication, can make it more difficult to understand the limita-
tions of many methodologies [347]. Despite these potential limita-
tions, the hard reality is that the use of datasets currently
constitutes the most reliable means of testing recognition algo-
rithms. As Pinto et al. [193] indicate, an improvement in evaluation
methodologies might entail simulating environments and testing
recognition systems in these environments. But of course creating
environments that are acceptable to the vision community and
sufficiently realistic is a challenging task. As argued in
[73], typically offline datasets are pre-screened for good quality
in order to eliminate images with saturation effects, poor contrast,
or significant noise. Thus, this pre-screening introduces an implicit
bias in the imaging conditions of such datasets. In the case of active
and dynamic vision systems, which typically sense an environment
from a greater number of viewpoints and under more challenging
imaging conditions, it becomes more difficult to predict the perfor-
mance of a vision system by using exclusively such datasets.
4.2. Sampling the current state-of-the-art in the recognition literature
A survey of the object recognition literature that does not
attempt to determine what the state-of-the-art is in terms of
performance would be incomplete. To this end, we present in
detail some of the algorithms for which there is a degree of consensus
in the community that they belong to the top tier of
algorithms that reliably address the object detection, localization
and recognition problems (see Section 1). In Chart 9 and Table 7
we present a comparison, along certain dimensions, of a number
of the papers that are surveyed in this section (Section 4.2). For the reasons
earlier elaborated upon, determining the best performing algo-
rithms remains a difficult problem. In the active and dynamic vi-
sion literature there does not currently exist a standardized
methodology for evaluating the systems in terms of their perfor-
mance and search efficiency. However, sporadically, there have
been certain competitions (such as the semantic robot vision chal-
lenge) attempting to address these questions. Arguably the most
popular competition for evaluating passive recognition algorithms
is the annual PASCAL challenge. We thus focus our discussion in
this section on presenting in some detail the general approaches
taken by some of the best performing algorithms in the annual
PASCAL challenge for classifying and localizing the objects present
in images. In general, good performance on the PASCAL datasets is
a necessary condition of a solution to the recognition problem, but
it is not a sufficient condition. In other words, good performance on
a dataset does not guarantee that we have found a ‘‘solution’’, but it
can be used as a hint, or a simple guiding principle, for the con-
struction of vision systems, which is why we focus on these data-
sets in this section. For each annual PASCAL challenge, we discuss
some of the best performing algorithms and discuss the reasons as
to why the approaches from each year were able to achieve im-
proved performance. These annual improvements are always char-
acterized within the general setting described in Fig. 1.
From Table 7 and Chart 9 we notice that the top-ranked PASCAL
systems make very little use of 3D object representations. In mod-
ern work, 3D is mostly used within the context of robotics and ac-
tive vision systems (see Tables 5 and 6). In general, image
categorization/classification algorithms (which indicate whether
an image contains an instance of a particular object class), are sig-
nificantly more reliable than object localization algorithms whose
task is to localize (or segment) in an image all instances of the ob-
ject of interest. Good localization performance has been achieved
for restricted object classes: in general there still does not exist
an object localization algorithm that can consistently and reliably
localize arbitrary object classes. As Chum and Zisserman [365]
indicate, image classification algorithms have achieved significant
improvements since early 2000, and this is in general attributed
to the advent and popularity of powerful classifiers and feature
4.2.1. Pascal 2005
We now briefly discuss some of the best performing approaches
tested during the 2005 Pascal challenge for the image classification
and object localization problems (see Fig. 41). This is not meant to
be an exhaustive listing of the relevant approaches, but rather to
provide a sample of some relatively successful approaches tested
over the years. 2005 was the first year of the PASCAL Visual Object
Classes Challenge [370]. One of the best performing approaches was
presented by Leibe et al. [199], which we also overviewed in
Section 2.9.
Dalal and Triggs [335] tested their Histogram of Oriented Gradi-
ent (HOG) descriptors in this challenge. In their original paper, Da-
lal and Triggs focused on the pedestrian localization problem, but
over the years HOG-based approaches have become quite popular,
and constitute some of the most popular descriptors in the object
recognition literature. See Fig. 42 for an overview of the pipeline
Chart 9. Summary of the PASCAL Challenge papers from Table 7 which correspond to algorithms published between 2002 and 2011. Notice that the winning PASCAL challenge
algorithms typically make little use of function, context and 3D, and make moderate use of texture.
proposed by Dalal and Triggs. The authors’ experiments suggest
that the other best-performing keypoint-based approaches have
false positive rates that are at least 1–2 orders of magnitude great-
er than their presented HOG dense grid approach for human detec-
tion. As the authors indicate, the fine orientation sampling and the
strong photometric normalization used by their approach, consti-
tute the best strategy for improving the performance of pedestrian
detectors, because it enables limbs and body segments to change
their position and their appearance (see Fig. 43). The authors eval-
uated numerous pixel color representations such as greyscale, RGB
and LAB color spaces, with and without gamma equalization. The
authors also tested various approaches for evaluating gradients,
and based on their results the simplest scheme, which relied on
point derivatives without Gaussian smoothing, gave the best results.
The main constituent component of the HOG representation is
the orientation binning with normalization that is applied to vari-
ous descriptor blocks/cells. The cells tested are both rectangular
and radial. Orientation votes/histograms are accumulated in each
one of those cells. The orientation bins tested are both unsigned
(0–180 degrees) and signed (0–360 degrees). The authors choose
to use 9 orientation bins since more bins only lead to marginal
improvements at best. Furthermore, the authors note that the
use of signed orientations decreases performance. The authors also
tested various normalization schemes, which mostly entail divid-
ing the cell histograms by the orientation ‘‘energy’’ present in a lo-
cal neighborhood. The above-described combinations for
constructing histograms of orientation were then used in conjunction
with linear and non-linear SVMs, achieving state-of-the-art
performance for pedestrian detection. Note, however, that the
system was tested on images where the size of the pedestrians’
projection on the image was significant. A final observation that the
authors make is that any significant amount of smoothing before
gradient calculation degrades the system performance, demon-
strating that the most important discriminative information is
from sudden changes in the image at fine scales.
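The orientation binning step described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes a single rectangular cell given as a 2D list of grey levels, centred [-1, 0, 1] derivative masks with no smoothing, unsigned orientations, magnitude-weighted votes without interpolation between bins, and plain L2 block normalization. The names `cell_orientation_histogram` and `l2_normalize` are hypothetical.

```python
import math


def cell_orientation_histogram(cell, num_bins=9):
    """Unsigned (0-180 degree) gradient orientation histogram of one cell.

    `cell` is a 2D list of grey levels. Border pixels are skipped so that
    the centred [-1, 0, 1] mask always has both neighbours available.
    """
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Fold signed angles into [0, 180) to obtain unsigned orientations.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against rounding up to exactly 180.0.
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist


def l2_normalize(block, eps=1e-6):
    """L2-normalize a concatenated block of cell histograms."""
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]
```

A full descriptor would concatenate the histograms of neighbouring cells into overlapping blocks, normalize each block, and stack the results into the feature vector passed to the SVM.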
Zhang et al. [142] discuss a number of local-image-feature
extraction techniques for texture and object category classifica-
tion. In conjunction with powerful discriminative classifiers, these
approaches have led to top-tier performance in the VOC2005,
VOC2006 and VOC2007 competitions. Their work is mostly fo-
cused on the problem of classifying an image as containing an in-
stance of a particular object, and is not as much focused on the
object localization problem. As we discussed earlier, and as we
will discuss in more detail later in this section, a good classifier
does not necessarily lead to a good solution to the object localiza-
tion problem. This is due to the fact that simple brute-force slid-
ing-window approaches to the object localization problem are
extremely slow, due to the need to enumerate all possible posi-
tions, scales, and aspect ratios of a bounding-box for the object
Fig. 41. Documents describing some of the top-ranked algorithms for classifying and localizing objects in the PASCAL Visual Object Classes Challenges of 2005–2011. Note
that this is not an exhaustive list of the algorithms tested in the VOC challenges: it is simply meant to provide a sample of the most ‘‘distinct’’ approaches that have been
proven over the years to provide satisfactory results in these challenges. See [371] and [324] for an overview of the competition and a listing of all the algorithms tested over
the years.
Fig. 42. The pipeline used by [335].
Fig. 43. The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average gradient image over a set of registered training images. (b), (c): Each pixel
demonstrates the maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test image used in the rest of the subfigures. (e): The computed R-HOG
descriptor of the image in subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM weights respectively.
As Zhang et al. [142] indicate, in the texture recognition prob-
lem local features play the role of frequently repeated elements,
while in the object recognition problem, these local features play
the role of ‘‘words’’ which are often powerful predictors of a certain
object class. The authors show that using a combination of multi-
ple interest-point detectors and descriptors, usually achieves much
better results than the use of a single interest-point detector/
descriptor pair achieves. They also reach the conclusion that using
local features/descriptors with the highest possible degree of
invariance, does not necessarily lead to the optimal performance.
As a result, they suggest that when designing recognition algo-
rithms, only the minimum necessary degree of feature invariance
should be used. The authors note that many popular approaches
make use of both foreground and background features. They argue
that the use of background features could often be seen as a means
of providing contextual information for recognition. However, as
the authors discover during their evaluation, such background fea-
tures tend to aid when dealing with ‘‘easy’’ datasets, while for more
challenging datasets, the use of both foreground and background
features does not improve the recognition performance.
Zhang et al. [142] use affine-invariant versions of two interest
point detectors: the Harris-Laplace detector [102], which responds
to corners, and the Laplacian detector [372], which responds to blob
regions (see Fig. 44). These elliptical regions are normalized into
circular regions from which descriptors are subsequently ex-
tracted. The authors also test these interest-point detectors using
scale invariance only, using scale with rotation invariance, and
by using affine invariance. As descriptors, the authors investigated
the use of SIFT, SPIN, and RIFT descriptors [373,374]. The SIFT
descriptor was discussed in Section 2.9. The SPIN descriptor is a
two dimensional rotation invariant histogram of intensities in a
neighborhood surrounding an interest-point, where each histo-
gram cell (d, i) corresponds to the distance d from the center of
the region and the weight of intensity value i at that distance.
The RIFT descriptor is similar to SIFT and SPIN, where rotation
invariant histograms of orientation are created for a number of
concentric circular regions centered at each interest point. The
descriptors are made invariant to affine changes in illumination,
by assuming pixel intensity transformations of the form aI(x) + b
at pixel x, and by normalizing those regions with respect to the
mean and standard deviation. The authors use various combinations
of interest-point detectors, descriptors and classifiers to determine
the best performing combination. Given training and test
images, the authors create a more compact representation of the
extracted image features by clustering the descriptors in each image
to discover its signature $\{(p_1, u_1), \ldots, (p_m, u_m)\}$, where $m$ is the
number of clusters discovered by a clustering algorithm, $p_i$ is the
$i$th cluster’s center and $u_i$ is the fraction of image descriptors present
in that cluster. The authors discover that signatures of length 20–40
tend to provide the best results. The Earth Mover’s Distance
(EMD) [375] is used to define a ‘‘distance’’ $D(S_1, S_2)$ between two
signatures $S_1$, $S_2$. The authors also consider the use of mined
‘‘vocabularies/words’’ from training sets of images, corresponding
to ‘‘clusters’’ of common features. Two histograms $S_1 = (u_1, \ldots, u_m)$,
$S_2 = (w_1, \ldots, w_m)$ of such ‘‘words’’ can be compared to determine
if a given image belongs to a particular object class. The
authors use the $\chi^2$ distance to compare two such histograms:
$$D(S_1, S_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{(u_i - w_i)^2}{u_i + w_i}$$
Image classification is tested on SVMs with linear, quadratic,
Radial-Basis-Function, $\chi^2$ and EMD kernels, where the $\chi^2$ and EMD kernels
are given by
$$K(S_i, S_j) = \exp\left(-\frac{1}{A} D(S_i, S_j)\right)$$
where $D(\cdot, \cdot)$ can represent the EMD or $\chi^2$ distance and $A$ is a
normalization constant. The bias term of the SVM decision function is
varied to obtain ROC curves of the various tests performed. The sys-
tem is evaluated on texture and object datasets. As we have already
indicated, the authors discover that greater affine invariance does
not necessarily help improve the system performance. The Laplacian
detector tends to extract four to five times more regions per
image than the Harris-Laplace detector, leading to better performance
in the image categorization task, and overall a combination
of Harris-Laplace and Laplacian detectors with SIFT and SPIN
descriptors performs best. Both the EMD and $\chi^2$ kernels seem to provide good
and comparable performance. Furthermore, the authors notice that
randomly varying/shuffling the backgrounds during training, re-
sults in more robust classifiers.
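The histogram comparison and the resulting kernel can be illustrated with the short sketch below. It assumes two visual-word histograms of equal length, skips bins that are empty in both histograms to avoid division by zero, and leaves the choice of the normalization constant (called `scale` here) to the caller; these conventions and the function names are assumptions of this example.

```python
import math


def chi2_distance(s1, s2):
    """Chi-square distance: 0.5 * sum_i (u_i - w_i)^2 / (u_i + w_i)."""
    total = 0.0
    for u, w in zip(s1, s2):
        if u + w > 0:  # bins empty in both histograms contribute nothing
            total += (u - w) ** 2 / (u + w)
    return 0.5 * total


def distance_kernel(s1, s2, distance, scale):
    """Kernel value exp(-D(S_i, S_j) / A) for a given distance D and constant A."""
    return math.exp(-distance(s1, s2) / scale)
```

Any other distance between signatures, such as an EMD implementation, could be passed as the `distance` argument in the same way.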
Within the context of Fig. 1 (i.e., the feature-extraction →
feature-grouping → object-hypotheses → object-verification →
object-recognition pipeline), we see that the best performing
systems of PASCAL 2005 demonstrate how the careful pre-processing
during the low level feature extraction phase makes a significant
difference in system reliability. Small issues such as the number
of orientation bins, the number of scales, or whether to normalize
the respective histograms, make a significant difference in system
performance. This demonstrates the importance of carefully study-
ing the feature-processing strategies adopted by the winning sys-
tems. One could argue that vision systems should not be as
sensitive to these parameters. However, the fact remains that cur-
rent state-of-the-art systems have not reached the level of maturity
that would make them robust against such variations in the low-level
parameters. Another observation with respect to Fig. 1 is that
the object representations of the winning systems in PASCAL
2005, were for the most part ‘‘flat’’ and made little use of the object
hierarchies whose importance we have emphasized in this survey.
As we will see, in more recent work, winning systems have made
Fig. 44. Examples of the Harris-Laplace detector and the Laplacian detector, which were used extensively in [142] as interest-point/region detectors (figure reproduced from
[142] with permission).
greater use of such hierarchies. Finally, while one could argue that
Leibe et al. [199] made use of a generative object hypothesis and
verification phase, in general, winning algorithms of PASCAL 2005
were discriminative based, and did not make use of sophisticated
modules for implementing the full pipeline of Fig. 1.
4.2.2. Pascal 2006
In addition to the previously described methodologies, a combi-
nation of the approaches described in [263,264] was proven suc-
cessful for many of the object classes tested in VOC2006 (see
Fig. 41). The presented algorithm ([263,264,369]) is used both for
the VOC challenge’s image classification task as well as for the
object localization task. In testing their algorithm for the object
localization task, the authors consider an object as successfully
localized if $a_o > 0.5$, where
$$a_o = \frac{|B_{gt} \cap B_{p}|}{|B_{gt} \cup B_{p}|}$$
and $B_{gt}$, $B_{p}$ denote the ground truth and localized image regions. The
object classification and localization system tested relies to a large
extent on the PicSOM framework for creating self-organizing maps
(see Fig. 45). The authors in [264] take advantage of the topology
preserving nature of the SOM mapping to achieve an image’s classi-
fication by determining the distance of the image’s representation
on the grid, from positive and negative examples of the respective
object class hypothesis. For the classification task a greedy sequen-
tial forward search is performed to enlarge the set of features used
in determining the distance metric, until the classification perfor-
mance stops increasing on the test dataset. The feature descriptors
used, include many of the descriptors used in the MPEG-7 standard
as well as some non-standard descriptors. The authors experi-
mented with using numerous color descriptors. These include, for
example, color histograms in HSV and HMMD color spaces and their
moments, as well as color layout descriptors, where the image is
split in non-overlapping blocks and the dominant colors in YCbCr
space are determined for each block (the corresponding discrete co-
sine transform coefficients are used as the final descriptors). Fur-
thermore, Fourier descriptors of segment contours are used as
features, as well as histograms and co-occurrence matrices of Sobel
edge directions. The object localization algorithm relies to a large
extent on the use of a simple greedy hierarchical segmentation
algorithm that merges regions with high similarity. These regions
are provided as input to the classifier, which in turn enables the ob-
ject localization.
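The localization criterion used in the evaluation above, the ratio of the intersection to the union of the ground-truth and predicted regions, can be sketched for axis-aligned bounding boxes as follows. The `(x_min, y_min, x_max, y_max)` box convention and the function names are assumptions of this example.

```python
def overlap_ratio(box_a, box_b):
    """Area of intersection divided by area of union of two boxes,
    each given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localized(ground_truth, predicted, threshold=0.5):
    """An object counts as localized if the overlap ratio exceeds the threshold."""
    return overlap_ratio(ground_truth, predicted) > threshold
```

For example, two 2 × 2 boxes shifted horizontally by one unit share an area of 2 out of a union of 6, giving an overlap of 1/3, which falls below the 0.5 threshold.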
Thus, within the context of Fig. 1 we see that during PASCAL
2006, and as compared to PASCAL 2005, one of the winning sys-
tems evolved by making use of a significantly greater number of
low-level features. Furthermore, the use of a self-organizing map
by Viitaniemi and Laaksonen [264] demonstrated that the proper
grouping and representation of these features plays a conspicuous
role in the best performing algorithms.
4.2.3. Pascal 2007
During the 2007 challenge, the work by Felzenszwalb et al.
[366] was tested on a number of object localization challenges.
The algorithm’s ability to localize various object classes was fur-
ther demonstrated in subsequent years’ competitions, where it
consistently achieved good performance for various object classes
(see Fig. 46). The authors achieved a twofold improvement in the
person detection task (as compared to the best performing person
detection algorithm from the 2006 Pascal challenge) and for many
object classes it outperformed the best results from the 2007
challenge. As the authors point out, there appears to be a performance
gap between parts-based
Fig. 45. The distributions of various object classes corresponding to six feature classes. These results were generated by the self-organizing-map algorithm used in the
PicSOM framework [263]. Darker map regions represent SOM areas where images of the respective object class have been densely mapped based on the respective feature
(from [263] with permission).
methods and rigid-template or bags-of-features type of represen-
tations. The authors point out that a strong point of their paper
is the demonstration that parts-based methods are capable of
bridging this performance gap. The system is based on shifting a
scanning window over the input image in order to fit the target ob-
ject representation on the input image. The object representation
consists of a root and a single level of subparts. A deformation cost
is defined for the subpart windows’ deformations/positions with
respect to the root window position (see Fig. 47). The score of
the placement of an object representation is the sum
of the scores of all the windows. A latent variable SVM is used during
the training process, where the latent variable is used to learn a
set of filter parameters $(F_0, F_1, \ldots, F_n)$ and deformation parameters
$(a_1, b_1, \ldots, a_n, b_n)$. For each input image and any subpart
deformation $z$, a vector of HOG features ($H$) and subpart displacements
$\psi(H, z)$ is extracted. The score of positioning an object representation on
an image using arrangement $z$ is given by the dot product $\beta \cdot \psi(H, z)$,
where $\beta = (F_0, F_1, \ldots, F_n, a_1, b_1, \ldots, a_n, b_n)$.
In more detail, the authors define a latent SVM as
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \psi(x, z) \qquad (22)$$
where $\beta \cdot \psi(x, z)$ is the score of positioning the object representation
according to deformation $z$, and $Z(x)$ denotes all possible deformations
of the object representation. Given a training dataset
$D = (\langle x_1, y_1 \rangle, \ldots, \langle x_N, y_N \rangle)$ (where $x_i$ denotes the $i$th HOG pyramid
vector and $y_i \in \{-1, 1\}$ denotes a label), the authors attempt to find the
optimal vector $\beta^*(D)$ which is defined as
$$\beta^*(D) = \arg\min_{\beta} \; \lambda \|\beta\|^2 + \sum_{i=1}^{N} \max(0, 1 - y_i f_\beta(x_i)) \qquad (23)$$
Notice, however, that due to the existence of positive labeled examples
($y_i = 1$), this is not a convex optimization problem. As a result
the authors execute the following loop a number of times: (i) Keep
$\beta$ fixed and find the optimal latent variable $z_i$ for each positive
example. (ii) Then by holding the latent variables of positive examples
constant, optimize $\beta$ by solving the corresponding convex problem.
The authors try to ignore the ‘‘easy’’ negative training examples,
since these examples are not necessary to achieve good perfor-
mance. During the initial stage of the training process, a simple
SVM is trained for only the root filter. The optimal position of this
filter is then discovered in each training image. Since the training
data only contains a bounding box of the entire object and does
not specify the subpart-positions, during training the subparts are
initialized by finding high-energy subsets of the root-filter’s bound-
ing box. This results in a new training dataset that specifies object
subpart positions. This dataset is iteratively solved using the
methodologies above in order to find the filter representations for the
entire object and its subparts. The authors decide to use six subparts
since this leads to the best performance.
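The alternating optimization described above can be illustrated with a toy sketch. It is not the authors' training code: each example is assumed to be given directly as a short list of candidate feature vectors (one per latent placement), the regularization weight `lam` and step size `lr` are arbitrary illustrative values, and a few subgradient steps stand in for a proper convex solver in step (ii). All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def latent_score(beta, placements):
    """Score of the best placement: max over z of beta . psi(x, z).
    `placements` lists the feature vector psi(x, z) of every candidate z.
    Returns (best score, index of best placement)."""
    scores = [dot(beta, psi) for psi in placements]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def objective(beta, data, lam):
    """lam * ||beta||^2 plus the hinge loss of the latent scores."""
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, x)[0]) for x, y in data)
    return lam * dot(beta, beta) + hinge


def train(data, dim, lam=0.01, rounds=5, steps=200, lr=0.05):
    beta = [0.0] * dim
    for _ in range(rounds):
        # (i) Keep beta fixed; pick the best placement of each positive example.
        fixed = [latent_score(beta, x)[1] if y > 0 else None for x, y in data]
        # (ii) With those placements held constant the objective is convex
        # in beta; take subgradient steps on it.
        for _ in range(steps):
            grad = [2.0 * lam * b for b in beta]
            for (x, y), z in zip(data, fixed):
                # Negatives keep the max over placements, which stays convex.
                psi = x[z] if z is not None else x[latent_score(beta, x)[1]]
                if 1.0 - y * dot(beta, psi) > 0.0:  # margin violated
                    grad = [g - y * p for g, p in zip(grad, psi)]
            beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta
```

In this sketch the last component of every feature vector can be set to 1 to play the role of a bias term; without it, a placement whose features are all zero would force every score to be non-negative.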
Perronnin and Dance [364] use the Fisher kernel for image cat-
egorization. The authors extract a gradient vector from a genera-
tive probability of the extracted image features (local SIFT and
RGB statistics). These gradient vectors are then used in a discriminative
classifier. An SVM and a logistic regression classifier with
a Laplacian prior are tested, and both perform similarly. The
authors indicate that historically, even on databases containing
very few object classes, the best performance is achieved when
using large vocabularies with hundreds or thousands of visual
words. However, the use of such high-dimensional histogram com-
putations can have a high associated computational cost. Often the
vocabularies extracted from a training image dataset are not uni-
versal, since they tend to be tailored to the particular object catego-
ries being learnt. The authors indicate that an important goal in
vision research is to discover truly universal vocabularies, as we al-
ready discussed in Section 2. However, the lack of significant pro-
gress in this problem, has caused some researchers to abandon this
idea. In more detail, given a set of visual words $X = \{x_1, x_2, \ldots, x_T\}$
extracted from an image, a probability distribution function
$p(X|\lambda)$ with parameters $\lambda$ is calculated. In practice, this pdf is modeled
as a Gaussian Mixture Model. Given the Fisher information matrix
$$F_\lambda = E_X\left[\nabla_\lambda \log p(X|\lambda) \, \nabla_\lambda \log p(X|\lambda)^{\top}\right] \qquad (24)$$
the authors obtain the corresponding normalized gradient vectors
$F_\lambda^{-1/2} \nabla_\lambda \log p(X|\lambda)$. The authors derive analytical expressions for these
gradients with respect to the mean, variance and weight associated
with each one of the Gaussians in the mixture that model this prob-
ability. These gradients were used to train powerful classifiers,
which provided state-of-the-art image classification performance
on the Pascal datasets.
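To make the gradient computation concrete, the sketch below evaluates the gradient of the log-likelihood with respect to the component means of a Gaussian mixture. It is an illustration under simplifying assumptions: the descriptors are treated as independent, the covariances are diagonal, only the gradient with respect to the means is shown, and the normalization by the Fisher information matrix is omitted. The function names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_likelihood(X, weights, means, variances):
    """log p(X | lambda) for independent descriptors under the mixture."""
    total = 0.0
    for x in X:
        total += math.log(sum(w * math.exp(log_gauss(x, m, v))
                              for w, m, v in zip(weights, means, variances)))
    return total


def mean_gradients(X, weights, means, variances):
    """Gradient of log p(X | lambda) with respect to each component mean."""
    grads = [[0.0] * len(m) for m in means]
    for x in X:
        joint = [w * math.exp(log_gauss(x, m, v))
                 for w, m, v in zip(weights, means, variances)]
        evidence = sum(joint)
        for k, (m, v) in enumerate(zip(means, variances)):
            gamma = joint[k] / evidence  # soft assignment of x to component k
            for d in range(len(m)):
                grads[k][d] += gamma * (x[d] - m[d]) / v[d]
    return grads
```

Concatenating such gradients over all components (and, in a fuller version, over the variances and weights as well) yields a fixed-length vector per image regardless of how many descriptors were extracted, which is what allows it to be fed to a standard discriminative classifier.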
Viitaniemi and Laaksonen [265] overview a general approach
for image classification, object localization, and object segmenta-
tion. The methodology relies on the fusion of multiple classifiers.
The authors report the slightly counter-intuitive observation that
while their approach provides the best performing segmentation
results, and some of the best image classification results, the ap-
proach is unable to provide the best object localization results.
Fig. 46. Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse template representation and the higher resolution subpart templates of the
person (from [366] with permission).
van de Weijer and Schmid [368] expand local feature descrip-
tors by appending to the respective feature vectors photometric
invariant color descriptors. These descriptors were tested during
the 2007 Pascal competition. The authors survey some popular
photometric invariants and test the effects they have in recognition
performance. It is demonstrated that for images where color is a
highly discriminative feature, such color invariants can be quite
useful. However, there is no single color descriptor that consis-
tently gives good results. In other words, the optimal color descrip-
tor to use is application dependent.
Chum and Zisserman [365] introduced a model for learning
and generating a region of interest around instances of the object,
given labeled and unsegmented training images. The algorithm
achieves good localization performance in various PASCAL chal-
lenges it was tested on. In other words, the algorithm is given
as input only images of the object class in question, with no fur-
ther information on the position, scale or orientation of the ob-
ject in the image. From this data, an object representation is
learnt that is used to localize instances of the object of interest.
Given an input or training set of images, a hierarchical spatial
pyramidal histogram of edges is created. Also a set of highly dis-
criminative ‘‘words’’ is learned from a set of mined appearance
patches (see Fig. 48). A cost function that is the sum of the dis-
tances between all pairs of training examples is used to automat-
ically learn the object position from an input image. The cost
function takes into account the distances between the discrimi-
native words and the edge histograms. A similar procedure, with
a number of heuristics, is used to measure the similarity between
two images and localize any instances of the target object in an image.
Ferrari et al. [367] present a family of translation and scale-
invariant feature descriptors composed of chains of k-connected
approximately straight contours, referred to as kAS. See Fig. 49
for examples of kAS for k = 2. It is shown that for kAS of intermedi-
ate complexity, these fragments have significant repeatability and
provide a simple framework for simulating certain perceptual
grouping characteristics of the human visual system (see Fig. 7).
The authors show that kAS substantially outperform interest points
for detecting shape-based classes. Given a vocabulary of kAS, an in-
put image is split into cells, and a histogram of the kAS present in
each cell is calculated. An SVM is then used to classify the object
present in an image, by using a multiscale sliding window ap-
proach to extract the respective SVM input vector that is to be clas-
sified. Given an input image, the edges are calculated using the
Berkeley edge detector which takes into consideration texture
and color cues (in addition to brightness) when determining the
object boundaries present in an image. Two extracted edges are ‘‘connected’’
if they are only separated by a small gap or if they form a junction.
This results in a graph structure. For each edge, a depth-first search
is performed, in conjunction with the elimination of equivalent
paths, in order to mine candidate kAS. A simple clustering algo-
rithm is used in order to mine clusters of kAS and a characteristic
‘‘word’’/kAS for each cluster. In other words, each kAS is an ordered
list $P = (s_1, s_2, \ldots, s_k)$ of edges. For each kAS a ‘‘root’’ edge $s_1$ is determined,
and a vector $r_i = (r_i^x, r_i^y)$ of the distance from the midpoint
of $s_1$ to $s_i$ is determined. Similarly an orientation $\theta_i$ and length $l_i$ is
determined for each $s_i$. Thus, the measure used to determine the
similarity $D(a, b)$ between two kAS $P_a$, $P_b$ is given by
$$D(a, b) = w_r \sum_{i=2}^{k} \left\| r_i^a - r_i^b \right\| + w_\theta \sum_{i=1}^{k} D_\theta\left(\theta_i^a, \theta_i^b\right) + \sum_{i=1}^{k} \left| \log\left(l_i^a / l_i^b\right) \right| \quad (25)$$
where $D_\theta(\theta_i^a, \theta_i^b) \in [0, \pi/2]$ is the difference between the orientations
of the corresponding segments in kAS ‘a’ and ‘b’. As with many
algorithms in the literature, the algorithm focuses on building a
detector for a single viewpoint. An interesting observation of the
authors is that as the resolution of the tiles/cells used to split an in-
put image increases, the spatial localization ability of kAS grows
stronger, thus, accommodating for less spatial variability in the ob-
ject class. This implies that there exists an optimal number of cells,
suggesting a tradeoff between optimal localization and tolerance to
intraclass variation. The authors also observe that as k increases, the
optimal number of cells in which the image is split has to decrease.
Notice that this behavior on the part of recognition algorithms is
predicted in [45] and in [26] where the influence of object class
complexity, sensor noise, scene complexity and various time con-
straints on the capabilities of recognition algorithms are examined
Fig. 47. The HOG feature pyramid used in [366], showing the coarse root-level template and the higher resolution templates of the person’s subparts (from [366] with permission).
rigorously, thus proving that these factors place certain fundamen-
tal limits on what one can expect from recognition systems in terms
of their reliability. The authors of [367] conclude their paper by
comparing their object localization algorithm to the algorithm by
Dalal and Triggs [335], and demonstrating that their algorithm per-
forms favorably.
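To make the kAS dissimilarity measure of Eq. (25) more concrete, the sketch below evaluates it for two toy 2AS. The data layout (a list of (r, theta, length) tuples with the root segment first), the function names and the weight values are hypothetical choices for this example and are not taken from [367].

```python
import math

def orientation_difference(theta_a, theta_b):
    """Difference between two undirected orientations, folded into [0, pi/2]."""
    d = abs(theta_a - theta_b) % math.pi
    return min(d, math.pi - d)

def kas_dissimilarity(kas_a, kas_b, w_r=1.0, w_theta=1.0):
    """Dissimilarity in the spirit of Eq. (25) between two kAS with equal k.

    Each kAS is a list of (r, theta, length) tuples, root segment first;
    r is the (x, y) offset of a segment midpoint from the root midpoint.
    The default weights are placeholders, not values from the paper.
    """
    if len(kas_a) != len(kas_b):
        raise ValueError("both kAS must have the same number of segments")
    # Location term: the root offset is zero by construction, so skip it.
    location = sum(
        math.hypot(ra[0] - rb[0], ra[1] - rb[1])
        for (ra, _, _), (rb, _, _) in zip(kas_a[1:], kas_b[1:])
    )
    orientation = sum(
        orientation_difference(ta, tb)
        for (_, ta, _), (_, tb, _) in zip(kas_a, kas_b)
    )
    length = sum(
        abs(math.log(la / lb))
        for (_, _, la), (_, _, lb) in zip(kas_a, kas_b)
    )
    return w_r * location + w_theta * orientation + length

# Two toy 2AS: a root segment plus one attached segment each.
kas_a = [((0.0, 0.0), 0.0, 1.0), ((1.0, 0.0), 0.1, 2.0)]
kas_b = [((0.0, 0.0), math.pi - 0.1, 1.0), ((1.0, 1.0), 0.1, 4.0)]
d_ab = kas_dissimilarity(kas_a, kas_b, w_r=2.0, w_theta=3.0)
```

A codebook of kAS could then be obtained by clustering under this measure, with each cluster represented by a characteristic kAS, as the text describes.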
Compared to the Pascal competitions from previous years, a
push towards the use of more complex hierarchies is evident in
Pascal 2007. The use of these hierarchies resulted in improved per-
formance. Despite the belief on the part of many researchers that
finding truly universal words/part-based representations has pro-
ven a failure so far, their research indicates that for class specific
datasets these representations can be of help. Within the context
of Fig. 1, these hierarchies represent a more complex type of fea-
ture grouping. Effectively the authors are using similar low level
features (e.g., edges, color) and they are grouping them in more
complex ways in order to achieve more universal representations
of object parts. In terms of object verification and object hypothe-
sizing (see Fig. 1) the work by Felzenszwalb et al. [366] represents
the most successful approach tested in Pascal 2007, for using a
coarse generative model of object parts to improve recognition performance.
4.2.4. Pascal 2008
Harzallah et al. [360,361] present a framework in which the
outputs of object localization and classification algorithms are
combined to improve each other’s results. For example, knowing
the type of image can help improve the localization of certain ob-
jects (see Fig. 50). Motivated by the cascade of classifiers proposed
by Viola and Jones [227,235,236] (see Section 2.11) the authors
propose a low-computational cost linear SVM classifier for pre-selection
of regions, followed by a costly but more reliable non-linear
SVM (based on a $\chi^2$ kernel) for scoring the final localization
output, providing a good trade-off between speed and accuracy.
A winning image classifier from VOC 2007 is used for the image
classification algorithm. Objects are represented using a combina-
tion of shape and appearance descriptors. Shape descriptors consist
of HOG descriptors calculated over 40 and 350 overlapping or
non-overlapping tiles (the authors compare various approaches
for splitting the image into tiles). The appearance descriptors are
built using SIFT features that are quantized into ‘‘words’’ and calcu-
lated over multiple scales. These words are used to construct visual
word histograms summarizing the content of each one of the tiles.
The authors note that overlapping square tiles seem to give the
best performance. The number of positive training set examples
used by the linear SVM is artificially increased and a procedure
for retaining only the hard negative examples during training is
presented. The final image classification and localization probabil-
ities are combined via simple multiplication, to obtain the proba-
bility of having an object in an image given the window’s score
(localization) and the image’s score (classification). Various results
presented by the authors show that the combination of the two im-
proves in general the localization and classification results for both
VOC 2007 and VOC 2008.
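The two-stage scoring and the multiplicative combination described above can be sketched as follows. This is an illustrative toy, not the pipeline of [360,361]: the particular form of the chi-square kernel, the sigmoid used to turn SVM scores into pseudo-probabilities, and all names and numeric values are assumptions introduced for this example.

```python
import math

def linear_score(w, b, x):
    """Cheap first-stage score: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def chi2_kernel(h1, h2, gamma=1.0):
    """One common form of the chi-square kernel for non-negative histograms."""
    d = sum((p - q) ** 2 / (p + q) for p, q in zip(h1, h2) if p + q > 0)
    return math.exp(-gamma * d)

def kernel_score(x, support, alphas, b):
    """Costly second-stage score: sum_j alpha_j * k(x, s_j) + b."""
    return sum(a * chi2_kernel(x, s) for a, s in zip(alphas, support)) + b

def sigmoid(s):
    """Hypothetical mapping from an SVM score to a pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-s))

def localize(windows, w, b_lin, support, alphas, b_ker, keep, p_class):
    """Pre-select `keep` windows with the linear SVM, rescore them with the
    kernel SVM, and multiply by the image-level classification probability."""
    shortlist = sorted(
        windows, key=lambda x: linear_score(w, b_lin, x), reverse=True
    )[:keep]
    scored = [
        (sigmoid(kernel_score(x, support, alphas, b_ker)) * p_class, x)
        for x in shortlist
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)

# Toy window descriptors (2-bin histograms) and hypothetical model parameters.
windows = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
ranked = localize(
    windows, w=[1.0, -1.0], b_lin=0.0,
    support=[[0.8, 0.2], [0.2, 0.8]], alphas=[1.0, -1.0], b_ker=0.0,
    keep=2, p_class=0.8,
)
```

The point of the design is that the expensive kernel evaluation is only applied to the few windows that survive the linear pre-selection, while the image-level score acts as a prior on every surviving window.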
Tahir et al. [362,342] propose the use of Spectral Regression
combined with Kernel Discriminant Analysis (SR-KDA) for
Fig. 48. The distribution of edges and appearance patches of certain car model training images used by [365], with the learned regions of interest overlaid (from [365], with permission).
Fig. 49. The 35 most frequent 2AS constructed from 10 outdoor images (from [367]
with permission).
classifying images in a particular class. The authors show that this
classifier is appropriate for large scale visual category recognition,
since its training is much faster than the SVM-based approaches
that they tested, while at the same time achieving at least as good
performance as SVMs. This makes SR-KDA approaches a straight-
forward replacement of the SVM modules often used in the litera-
ture. The image representation is based on classical interest point
detection, combined with various extensions of the SIFT descriptor,
combined with a visual codebook extraction phase. The algorithm
achieves top-ranked performance on PASCAL VOC 2008 and the Mediamill
challenge. Within the context of Fig. 1, the main innovation
evident in the top-ranked algorithms of Pascal 2008 lies in their
use of more powerful discriminative classifiers which enabled an
improvement of the object verification modules.
4.2.5. Pascal 2009
Felzenszwalb et al. [363] present an extension of their previous
work [366]. In contrast to their earlier work, they now use stochas-
tic gradient descent to perform the latent SVM training. Further-
more, they investigate the use of PCA-based dimensionality
reduction techniques to transform the object representation
vectors and obtain lower dimensional vectors for representing
the image cells. They also introduce the use of contextual knowl-
edge to improve object localization performance. They achieve this
by obtaining the set of localizations from k detections, thus con-
structing a related ‘‘context’’ vector from these scores, and then
using this vector in conjunction with a quadratic-kernel based
SVM to rescore the images. The authors test their algorithm on various
PASCAL challenge datasets, achieving comparatively excellent results.
Vedaldi et al. [356] investigate the use of a combination of ker-
nels, where each kernel corresponds to a different feature channel
(such as bag of visual words, dense words, histograms of oriented
edges and self-similarity features). The use of combinations of mul-
tiple kernels results in excellent performance, demonstrating that
further research on kernel methods has a high likelihood of further
improving the performance of vision systems. Similarly to the
work in [360,361], the authors use a cascade of progressively more
costly but more accurate kernels (linear, quasi-linear and non-lin-
ear kernels) to efficiently localize the objects. However, as the
authors note, further work could be done to reduce the computa-
tional complexity of the framework. This algorithm also results
in comparatively excellent results on the PASCAL datasets it was
tested on.
Similarly, Wang et al. [357] present the Locality-constrained
Linear Coding (LLC) approach for obtaining sparse representations
of scenes. These sparse bases are obtained through the projection
of the data onto various local coordinate frames. Linear weight
combinations of these bases are used to reconstruct local descrip-
tors. The authors also propose a fast approximation to LLC which
speeds up the LLC computations significantly. An SVM is used to
classify the resulting images’ descriptors, achieving top-ranked
performance when tested with various benchmarks.
Khan et al. [358,359] attempt to bridge the gap between the
bottom-up bags-of-words paradigms which have been quite suc-
cessful in the PASCAL challenges, by incorporating a top-down
attention mechanism that can selectively bias the features ex-
tracted in an image based on their dominant color (see Fig. 51).
As the authors point out, the two main approaches for fusing color
and shape information into a bag-of-words representation are via
Fig. 50. It is easier to understand the left image’s contents (e.g., a busy road with mountains in the background) if the cars in the image have been firstly localized. Conversely,
in the right image, occlusions make the object localization problem difficult. Thus, prior knowledge that the image contains exclusively cars, can make the localization
problem easier (from [361] with permission).
Fig. 51. Demonstrating how top-down category-specific attentional biases can modulate the shape-words during the bag-of-words histogram construction (from [358] with permission).
early fusion (where joint shape-color descriptors are used) and via
late fusion (where histogram representations of color and shape
are simply concatenated). Given separate vocabularies for shape
and color, each training image’s corresponding color histogram is
estimated and a class specific posterior $p(\mathrm{class}|\mathrm{word})$ is estimated.
By concatenating the posteriors for all the color words of interest,
the corresponding low-level features are primed. Difference of
Gaussian detectors, Harris Laplace detectors and SIFT descriptors
are used to obtain the shape descriptors. The Color Name and
HUE descriptor are used as color descriptors [368,376,377]. A standard
$\chi^2$ SVM is used for classifying images. These top-down approaches
are compared to early-fusion based approaches that
combine SIFT descriptors with color descriptors, and which are
known to perform well [378]. It is shown that for certain types
of images the top-down priming can result in drastic classification improvements.
Within the context of Fig. 1, it is evident that during Pascal 2009
there was a significant shift towards more complex object repre-
sentations and more complex object inference and verification
algorithms. This is evidenced by the incorporation of top down
priming mechanisms, complex kernels that incorporate contextual
knowledge, as well as by novel local sparse descriptors which
achieved top-ranked performance. Consistent in all this work is
the preference in using SVMs for contextual classification, model
building during training, as well as object recognition, demonstrat-
ing that the use of SVMs has become more subtle and less mono-
lithic compared to early recognition algorithms.
4.2.6. Pascal 2010
Perronnin et al. [354] present an extension of their earlier work
[364], which we have already described in this section. The modifications
they introduce result in an increase of over 10% in the
average precision. An interesting aspect of this work is that during
the 2010 challenge the work was also trained on its own dataset
(non-VOC related) and subsequently tested successfully on various
tasks, demonstrating the algorithm’s ability to generalize. The
authors achieve these results by using linear classifiers. This last
point is important since linear SVMs have a training cost of $O(N)$,
while non-linear SVMs have a training cost of around $O(N^2)$ to
$O(N^3)$, where N is the number of training images. Thus, training
non-linear SVMs becomes impractical with tens of thousands of
training images. The authors achieve this improvement in their re-
sults by normalizing the respective gradient vectors first described
in [364]. Another problem with the gradient representation is the
sparsity of many vector dimensions. As a result the authors apply
to each dimension a function $f(z) = \mathrm{sign}(z)|z|^{\alpha}$ for some $\alpha \in [0, 1]$,
which results in a significant classification improvement.
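The per-dimension function and the vector normalization discussed above are simple to state in code. The following minimal sketch applies the signed power function to each dimension and then rescales the vector to unit Euclidean length. The function names, the choice of exponent 0.5, the use of the L2 norm and the order of the two steps are assumptions for illustration, since the text above does not fix them.

```python
import math

def signed_power(z, alpha):
    """f(z) = sign(z) * |z|^alpha, with f(0) = 0."""
    if z == 0:
        return 0.0
    return math.copysign(abs(z) ** alpha, z)

def power_normalize(vec, alpha=0.5):
    """Apply the signed power function to every dimension of vec."""
    return [signed_power(z, alpha) for z in vec]

def l2_normalize(vec):
    """Rescale vec to unit Euclidean norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(z * z for z in vec))
    if norm == 0:
        return list(vec)
    return [z / norm for z in vec]

# A toy sparse gradient vector: most mass sits in a few dimensions.
gradient = [-4.0, 0.0, 9.0, 0.0, 0.25]
normalized = l2_normalize(power_normalize(gradient, alpha=0.5))
```

With an exponent below 1, large entries are compressed more strongly than small ones, which makes a peaky, sparse vector less dominated by a handful of dimensions before it is passed to a linear classifier.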
A number of other novel ideas were tested within the context of
the 2010 Pascal challenge [351,352,353,355]. van de Sande et al.
[351] proposed a selective search algorithm for efficiently search-
ing within a single image, without having to exhaustively search
the entire image (see Sections 2.11, 3.2 for more related work).
They achieve this by adopting segmentation as a selective search
strategy, so that rather than aiming for a few accurate object local-
izations, they generate more approximate object localizations, thus
placing a higher emphasis on high recall rates. A novel object-class-specific
part representation was also introduced for human
pose estimation [352,353]. It achieved state-of-the-art perfor-
mance for localizing people, demonstrating the significance of
properly choosing the object representations.
Overall, in the top-ranked systems of Pascal 2010 there is evi-
dence of an effort to mitigate the effects of training set biases. This
has motivated Perronnin et al. [354] to test the generalization abil-
ity of their system even when trained on a non-Pascal related data-
set. Approaches proposed to improve the computational
complexity of training and online search algorithms include the
use of combinations of linear and non-linear SVMs as well as
various image search algorithms. Within the context of Fig. 1, this
Fig. 52. (a) The 3-layer tree-like object representation in [348]. (b) A reference template without any part displacement, showing the root-node bounding box (blue), the
centers of the 9 parts in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and (d) denote object localizations (from [348] with permission). (For
interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
corresponds to ways of improving the hypothesis generation and
object verification process.
4.2.7. Pascal 2011
Zhu et al. [348] present an incremental concave-convex proce-
dure (iCCP) which enables the authors to efficiently learn both two
and three layer object representations. The authors demonstrate
that their algorithm outperforms the model by Felzenszwalb
et al. [363]. These results are used by the authors as evidence that
deep-structures (3-layers) are better than 2-layer based object rep-
resentations (see Fig. 52). The authors begin their exposition by
describing the task of structural SVM learning. Let $(x_1, y_1, h_1), \ldots, (x_N, y_N, h_N) \in X \times Y \times H$
denote training samples where the $x_i$ are
training patches, the $y_i$ are class labels, and $h_i = (V_i, \vec{p}_i)$ with $V_i$
denoting a viewpoint and $\vec{p}_i$ denoting the positions of object parts.
In other words, the $h_i$ encode the spatial arrangement of the object
representation. In structural SVM learning the task is to learn a
function $F_w(x) = \arg\max_{y,h} [w \cdot \Phi(x, y, h)]$ where $\Phi$ is a joint feature
vector encoding the relation between the input x and the structure
(y, h). In practice $\Phi$ encodes spatial and appearance information
similarly to [363]. If the structure information h is not labeled in
the training set (as is usually the case, since in training data we
are customarily only given the bounding box of the object of interest
and not part-relation information) then we deal with the latent
structural SVM problem, where we need to solve
$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left[ \max_{y,h} \left[ w \cdot \Phi_{i,y,h} + L_{i,y,h} \right] - \max_{h} \left[ w \cdot \Phi_{i,y_i,h} \right] \right] \quad (26)$$
where C is a constant penalty value, $\Phi_{i,y,h} = \Phi(x_i, y, h)$ and $L_{i,y,h} = L(y_i, y, h)$
is the loss function which is equal to 1 iff $y_i \neq y$. The authors use
some previous results from the latent structural SVM training liter-
ature: By splitting the above expression in two terms, they itera-
tively find a hyperplane (a function of w) which bounds the last
max term (which is concave in terms of w), replace the max term
with this hyperplane, solve the resulting convex problem, and re-
peat the process. This trains the model and enables the authors to
use $F_w$ to localize objects in an image, achieving comparatively
excellent results. Chen et al. [349] present a similar latent hierarchi-
cal model which is also solved using a concave-convex procedure,
and whose results are comparable to other state of the art algo-
rithms. The latent-SVM procedure is again used to learn the hierar-
chical object representation. A top-down dynamic programming
algorithm is used to localize the objects.
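The latent structural SVM objective of Eq. (26) and the latent-variable step of a concave-convex procedure can be illustrated on a toy problem in which the labels and latent positions are enumerated exhaustively. This sketch is not the iCCP algorithm of [348]: the joint feature map `phi`, the toy data and all function names are hypothetical, and the convex sub-problem that would follow each imputation step is omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x, y, h):
    """Hypothetical joint feature map: response at latent position h plus a
    bias, both active only for the positive class (y == 1)."""
    return [x[h], 1.0] if y == 1 else [0.0, 0.0]

def predict(w, x, labels, latents):
    """F_w(x): the (y, h) pair that maximizes w . phi(x, y, h)."""
    candidates = [(y, h) for y in labels for h in latents]
    return max(candidates, key=lambda s: dot(w, phi(x, s[0], s[1])))

def objective(w, data, labels, latents, C):
    """Value of the latent structural SVM objective with a 0/1 loss."""
    total = 0.0
    for x, y_i in data:
        convex_part = max(
            dot(w, phi(x, y, h)) + (1.0 if y != y_i else 0.0)
            for y in labels for h in latents
        )
        concave_part = max(dot(w, phi(x, y_i, h)) for h in latents)
        total += convex_part - concave_part
    return 0.5 * dot(w, w) + C * total

def impute_latents(w, data, latents):
    """Fix the best latent value per sample for the current w. This makes the
    subtracted max term linear in w, so the remaining problem is convex."""
    return [max(latents, key=lambda h: dot(w, phi(x, y_i, h))) for x, y_i in data]

# Two toy samples: (part responses at 2 positions, class label).
data = [([0.1, 0.9], 1), ([0.2, 0.1], 0)]
labels, latents = (0, 1), (0, 1)
w = [1.0, -0.5]
value = objective(w, data, labels, latents, C=1.0)
best = predict(w, data[0][0], labels, latents)
```

A full training loop would alternate `impute_latents` with a convex structural SVM solve for w, repeating until the objective stops decreasing.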
Song et al. [350] present a paper on using context to improve
image classification and object localization performance when
we are dealing with ambiguous situations where methodologies
that do not use context tend to fail (see Fig. 53). The authors report
top-ranked results on the PASCAL VOC 2007 and 2010 datasets. In
more detail, the authors present the Contextualized Support Vector
Machine. In general, SVM based classification assumes the use of
a fixed hyperplane $w_0^T x + b = 0$. Given an image $X_i$ specific feature
vector $x_i^f$ and image specific contextual information vector $x_i^c$,
the authors adapt the vector $w_0$ into a vector $w_i = P x_i^c + w_0$, where $w_i$
is based on $x_i^c$ and a transformation matrix P. Matrix
$P = \sum_{r=1}^{R} u_r q_r^T$ is constrained as a low rank matrix with few
parameters and as a result $w_i = w_0 + \sum_{r=1}^{R} (q_r^T x_i^c)\, u_r$. Thus, the
SVM margin of image $X_i$ is $\gamma_i = y_i \left( w_0^T x_i^f + \sum_{r=1}^{R} (q_r^T x_i^c)(u_r^T x_i^f) + b \right)$,
where $y_i \in \{-1, 1\}$ is a class label. The authors define each vector
$u_r$ so that for unambiguous features $x_i^f$, the scalar $u_r^T x_i^f$ takes a
small value close to 0, in which case $\gamma_i \approx y_i (w_0^T x_i^f + b)$. Thus, only
for ambiguous images is contextual knowledge used. The context
vector length is equal to the number of objects we are searching
for, and is built using a straightforward search for the highest con-
fidence location in an image of each one of those objects. The
authors specify an iterative methodology for adapting these vectors $w_i$
until satisfactory performance is achieved. The significant
improvements that this methodology offers demonstrate how
important context is in the object recognition problem.
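The context-dependent hyperplane described above can be sketched in a few lines. The example assumes a rank-R adaptation in which the context vector shifts a base hyperplane along a small set of directions. The vector names (`us`, `qs`), the dimensions and the numeric values are hypothetical and chosen only to show that the context term vanishes when a feature vector is orthogonal to the adaptation directions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def adapted_weights(w0, us, qs, x_context):
    """Context-adapted hyperplane: w0 + sum_r (q_r . x_context) * u_r,
    i.e. w0 + P x_context with the low-rank matrix P = sum_r u_r q_r^T."""
    w = list(w0)
    for u, q in zip(us, qs):
        coeff = dot(q, x_context)
        w = [wi + coeff * ui for wi, ui in zip(w, u)]
    return w

def margin(y, w0, b, us, qs, x_feature, x_context):
    """Signed margin of one image under its context-adapted hyperplane."""
    w_i = adapted_weights(w0, us, qs, x_context)
    return y * (dot(w_i, x_feature) + b)

# Hypothetical rank-1 model: 2-D features, 3-D context (one entry per object).
w0, b = [1.0, -1.0], 0.5
us, qs = [[0.0, 1.0]], [[2.0, 0.0, 0.0]]
x_context = [0.5, 0.3, 0.1]

# Ambiguous feature: u . x_feature is large, so context changes the margin.
ambiguous = margin(1, w0, b, us, qs, [2.0, 3.0], x_context)
# Unambiguous feature: u . x_feature is 0, so context has no effect.
unambiguous = margin(1, w0, b, us, qs, [2.0, 0.0], x_context)
```

Because only R pairs of vectors are learned instead of a full matrix, the number of additional parameters stays small relative to the feature and context dimensions.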
Overall, in one of the top-ranked approaches of Pascal 2011, Zhu
et al. [348] demonstrated that even deeper hierarchies are achiev-
able. They showed that such hierarchies can provide even better
results than another top-ranked Pascal competition algorithm
[363]. Within the context of Fig. 1, the work by Zhu et al. [348] pro-
vides an approach for building deeper hierarchies which affect the
grouping, hypothesis generation and verification modules of the
standard recognition pipeline. Song et al. [350] provided an elegant
way for adaptively controlling the object hypothesis module, by
using context as an index that adaptively selects a different classi-
fier that is appropriate for the current context.
4.3. The evolving landscape
In 1965, Gordon Moore stated that the number of transistors
that could be incorporated per integrated circuit would increase
exponentially with time [379,380]. This provided one of the earli-
est technology roadmaps for semiconductors. Even earlier, Engel-
bart [381] made a similar prediction on the miniaturization of
circuitry. Engelbart would later join SRI and found the Augmentation
Research Center (ARC), which is widely credited as a pioneer in
the creation of modern Internet-era computing, due to the center’s
early proposals for the mouse, videoconferencing, interactive
text editing, hypertext and networking [382]. As Engelbart would
later point out, it was his early prediction on the rapid increase
of computational power that convinced him of the promise of
the research topics later pursued by his ARC laboratory. The early
identification of trends and shifts in technology can provide a
competitive edge for any individual or corporation. The question
arises as to whether we are currently entering a technological shift
of the same scope and importance as the one identified by Moore
and Engelbart fifty years ago.
For all intents and purposes, Moore’s law is coming to an end.
While Moore’s law is still technically valid, since multicore tech-
nologies have enabled circuit designers to inexpensively pack more
transistors on a single chip, this no longer leads to commensurate
increases in application performance. Moore’s law has historically
provided a vital technology roadmap that influenced the agendas
of diverse groups in academia and business. Today, fifty years after
the early research on object recognition systems, we are simulta-
neously confronted with the end to Moore’s law and with a gargan-
tuan explosion in multimedia data growth [253]. Fundamental
limits on processing speed, power consumption, reliability and
Fig. 53. On using context to mitigate the negative effects of ambiguous localizations
[350]. The greater the ambiguities, the greater the role contextual knowledge
plays (from [350] with permission).
programmability are placing severe constraints on the evolution of
the computing technologies that have driven economic growth
since the 1950s [383]. It is becoming clear that traditional von-
Neumann architectures are becoming unsuitable for human-level
intelligence tasks, such as vision, since the machine complexity
in terms of the number of gates and their power requirements
tends to grow exponentially with the size of the input and the
environment complexity [384,383]. The question for the near fu-
ture is that of determining to what extent the end to Moore’s
law will lead to a significant evolution in vision research that will
be capable of accommodating the shifting needs of industry. As the
wider vision community slowly begins to address this fact, it will
define the evolution of object recognition research, it will influence
the vision systems that remain relevant, and it will lead to signifi-
cant changes in vision and computer science education in general
by affecting other related research areas that are strongly depen-
dent on vision (such as robotics).
According to the experts responsible for the International Tech-
nology Roadmap for Semiconductors [384], the most promising fu-
ture strategy for chip and system design is that of complementing
current information technology with low-power computing sys-
tems inspired by the architecture of the brain [383]. How would
von-Neumann architectures compare to a non-von-Neumann
architecture that emulates the organization of the organic brain?
The two architectures should be suitable for complementary appli-
cations. The complexity of neuromorphic architectures should in-
crease more gradually with increasing environment complexity,
and it should tolerate noise and errors [383]. However, such neuro-
morphic architectures would likely not be suitable for high preci-
sion numerical analysis tasks. Modern von-Neumann computing
precipitates the need for a program that relies on synchronous, se-
rial, centralized, hardwired, general purpose and brittle circuits
[385]. The brain architecture on the other hand relies on neurons
and synapses operating in a mixed digital-analog mode, is asyn-
chronous, parallel, fault tolerant, distributed, slow, and with a
blurred distinction between CPU and memory (as compared to
von-Neumann architectures) since the memory is, to a large ex-
tent, represented by the synaptic weights.
How does our current understanding of the human brain differ-
entiate it from typical von-Neumann architectures? Turing made
the argument that since brains are computers, brains are computable
[386]. But if that is indeed the case, why do reliable image
understanding algorithms still elude us? Churchland [387] and
Hawkins [388] argue that general purpose AI is difficult because (i)
computers must have a large knowledge base which is difficult to
construct, and because (ii) it is difficult to extract the most relevant
and contextual information from such a knowledge base. As was
demonstrated throughout our discussion on object recognition
systems, the problem of efficient object representations and effi-
cient feature extraction constitutes a central tenet of any non-triv-
ial recognition system, which supports the viewpoint of
Churchland and Hawkins.
There is currently a significant research thrust towards the con-
struction of neuromorphic systems, both at the hardware and the
software level. This is evidenced by recent high-profile projects,
such as EU funding of the human brain project with over a billion
Euros over 10 years [383], US funding for the NIH BRAIN Initiative
[389], and by the growing interest in academia and industry for re-
lated projects [390–397]. The appeal of neuromorphic architec-
tures lies in [398] (i) the possibility of such architectures
achieving human-like intelligence by utilizing unreliable devices
that are similar to those found in neuronal tissue, (ii) the ability
of neuromorphic strategies to deal with anomalies, caused by noise
and hardware faults for example, and (iii) their low-power require-
ments, due to their lack of a power intensive bus and due to the
blurring of a distinction between CPU and memory.
Vision and object recognition should assume a central role in
any such research endeavor. About 40% of the neocortex is de-
voted to visual areas V1, V2 [388], which in turn are devoted just
to low-level feature extraction. It is thus reasonable to argue that
solving the general AI problem is similar in scope to solving the
image understanding problem (see Section 1). Current hardware
and software architectures for vision systems are unable to scale
to the massive computational resources required for this task. The
elegance of the solution to the vision problem is astounding. The
human neocortex constitutes 80% of the human brain, which has
around 100 billion neurons and 10
synapses, consumes just 20–
30 Watts, and is to a large extent self trained [399]. One of the
most astounding results in neuroscience is attributable to Mount-
castle [400,388,401]. By investigating the detailed anatomy of the
neocortex, he was able to show that the micro-architecture of the
regions looks extremely similar regardless of whether a region is
for vision, hearing or language. Mountcastle proposed that all
parts of the neocortex operate based on a common principle, with
the cortical column being the unit of computation. What distin-
guishes different regions is simply their input (whether their in-
put is vision based, auditory based etc.). From a machine
learning perspective this is a surprising and puzzling result, since
the no-free-lunch theorem, according to which it is best to use a
problem specific optimization/learning algorithm, permeates
much of the machine learning research. In contrast the neocortex
seems to rely on a single learning architecture for all its tasks and
input modalities. Looking back at the object recognition algo-
rithms surveyed in this paper, it becomes clear that no main-
stream vision system comes close to achieving the
generalization abilities of the neocortex. This sets the stage for
what may well become one of the most challenging and reward-
ing scientific endeavors of this century.
5. Conclusion
We have presented a critical overview of the object recognition
literature, pointed out some of the major challenges facing the
community and emphasized some of the characteristic approaches
attempted over the years for solving the recognition problem. We
began the survey by discussing how the needs of industry led to
some of the earliest industrial inspection and character recognition
systems. It is pleasantly surprising to note that despite severe lim-
itations in CPU speeds and sensor quality, such early systems were
astoundingly accurate, thus contributing to the creation of the field
of computer vision, with object recognition assuming a central
role. We pointed out that recognition systems perform well in con-
trolled environments but have been unable to generalize in less
controlled environments. Throughout the survey we have dis-
cussed various proposals set forth by the community on possible
causes and solutions to this problem. We continued by surveying
some of the characteristic classical approaches for solving the
problem. We then discussed how this led to the realization that
more control over the data acquisition process is needed. This real-
ization contributed to the popularity of active and attentive sys-
tems. We noted how this led to a stronger confluence between
the vision and robotics community and surveyed some relevant
systems. We continued the survey by discussing some common
testing strategies and fallacies that are associated with recognition
systems. We concluded by discussing in some depth some of the
most successful recognition systems that have been openly tested
in various object recognition challenges. As we alluded to in the previous
section, tantalizing evidence from neuroscience indicates that
there is an elegant solution to the vision problem that should also
be capable of spanning the full AI problem (e.g., voice recognition,
reasoning, etc.), thus providing the necessary motivation for a
radical rethinking of the strategies used by the community in tack-
ling the problem.
In Tables 1–7 we compared some of the more distinct recogni-
tion algorithms along a number of dimensions which characterize
each algorithm’s ability to bridge the so-called semantic gap: the
inability of less complex but easily extractable indexing primitives
to be grouped or organized into higher-level and
more powerful indexing primitives. This dilemma has directly or
indirectly influenced much of the literature (also see Fig. 1). This
is exemplified by CBIR systems, which rely on low-level
indexing primitives for efficiency reasons. It is also exemplified
by the fact that no recognition system has consistently
demonstrated graceful degradation as the scene complexity in-
creases, as the number of object classes increases, and as the com-
plexity of each object class increases [26]. While there is significant
success in building robust exemplar recognition systems, the suc-
cess in building generic recognition systems is questionable. Fur-
thermore, from Tables 1–7 we notice that few papers have
attempted to address to a large extent all the dimensions of robust
recognition systems. For example, in more recent systems, the role
of 3D parts-based representations has significantly diminished.
Within this context, active recognition systems were proposed as
an aid in bridging the semantic gap, by adding a greater level of
intelligence to the data acquisition process. However, in practice,
and as exemplified by Tables 5 and 6, very few such systems
currently address the full spectrum of recognition sub-problems.
As the role of the desktop computer is diminished and the role of
mobile computing becomes more important, a commensurate in-
crease in the importance of power-efficient systems emerges. A
power-efficient solution to the recognition problem precipitates
significant advances in all of the above-mentioned problems.
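The weakness of cheap, low-level indexing primitives can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions and is not taken from any of the surveyed systems: the helper names `gray_histogram` and `l1_distance`, the 4-bin quantization, and the two toy 4x4 images are all hypothetical. A global gray-level histogram is trivial to extract, yet it assigns the same index to two images whose spatial structure is entirely different.

```python
from collections import Counter


def gray_histogram(image, bins=4, max_value=256):
    """Low-level index: normalized histogram of gray levels (hypothetical helper)."""
    width = max_value // bins
    counts = Counter(
        min(pixel // width, bins - 1) for row in image for pixel in row
    )
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(bins)]


def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Two toy 4x4 images (assumed data): a vertical edge and a checkerboard.
vertical_edge = [[0, 0, 255, 255] for _ in range(4)]
checkerboard = [[0, 255, 0, 255], [255, 0, 255, 0]] * 2

h_edge = gray_histogram(vertical_edge)
h_check = gray_histogram(checkerboard)

# The images differ, but the low-level index cannot tell them apart.
print(vertical_edge != checkerboard)   # True
print(l1_distance(h_edge, h_check))    # 0.0
```

Distinguishing the two patterns requires a representation that encodes spatial relations between primitives, which is more powerful as an index but costlier to extract and to learn.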
Within the context of the vision problem, recognition consti-
tutes the most difficult but also the most rewarding problem, since
most vision problems can be reformulated in terms of the recognition
problem (albeit perhaps not as efficiently). Some general comments
are in order. The recognition and vision problem is highly
interdisciplinary, spanning the fields of machine learning and deci-
sion making under uncertainty, robotics, signal processing, mathe-
matics, statistics, psychology, neuroscience, HCI, databases,
supercomputing and visualization/graphics. The highly interdisci-
plinary nature of the problem is both an advantage and a disadvan-
tage. It is an advantage due to the vast research opportunities it
gives to the experienced vision practitioner. It is a disadvantage be-
cause the diversity of the field makes it all the more pertinent that
the practitioner is careful and sufficiently experienced in identify-
ing research that can advance the field.
Based on the above survey, we reach a number of conclusions:
(i) The solution to the recognition problem will require significant
advances in the representation of objects, the inference and learn-
ing algorithms used, as well as the hardware platforms used to
execute such systems. In general, artificial recognition systems
are still far removed from the elegance and generalization capabilities
with which solutions based on the organic brain are endowed.
(ii) The issue of bridging the semantic gap between low-level image
features and high-level object representations keeps re-emerging
in the literature. Such low-level indexing primitives are easy to extract
from images but are often not very powerful indexing primitives
(see Fig. 1). In contrast, high-level object representations are
significantly more powerful indexing primitives, but efficiently
learning object models based on such primitives, and extracting
such primitives from images, remains a difficult problem. The di-
lemma of indexing strength vs. system efficiency permeates the
recognition literature and plays a decisive role in the design of
commercial systems, such as Content Based Image Retrieval sys-
tems. (iii) A parts-based hierarchical modeling of objects will al-
most certainly play a central role in the problem’s solution and
the bridging of the semantic gap. While such models have shown
some success in distinguishing between a small number of classes,
they generally fail as the scene complexity increases, as the num-
ber of object classes increases and as the similarity between the
object classes increases. (iv) For each neuron in the neocortex there
correspond on average 10,000 synapses, thus demonstrating that
there is a significant gap in terms of the input size and the compu-
tational resources needed to reliably process the input. Active and
attentive approaches can help vision systems cope with many of
the intractable aspects of passive approaches to the vision problem
by reducing the complexity of the input space. An active approach
to vision can help solve real world problems such as degeneracies,
occlusions, varying illumination and extreme variations in object
A great deal of the research on passive recognition has focused,
to some extent, on the feature selection stage of the recognition
problem without taking into consideration the effects of various
cost constraints discussed in the paper. Virtually all the research
on active object recognition has only attempted to optimize a small
number of extrinsic camera parameters while assuming that the
recognition algorithm is a rather static black box. More work on
investigating the confluence of the two sets of parameters could
potentially lead to more efficient search strategies. Finally, the sur-
vey has supported the view that the computational complexity of
vision algorithms must constitute a central guiding principle dur-
ing the construction of such systems.
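As a rough illustration of how an active approach can sidestep a degenerate view, the following Python sketch selects the next camera viewpoint by minimizing the expected entropy of the posterior over object classes. This is a generic information-gain heuristic chosen for illustration, not the method of any particular surveyed system. The class names, view names, observation symbols and the likelihood table `LIKELIHOOD` are all made-up assumptions, and the recognizer is treated as a fixed black box that supplies P(observation | class, view).

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, likelihood, view, obs):
    """Bayes update of the class belief after seeing `obs` from `view`."""
    unnorm = {c: prior[c] * likelihood[view][c][obs] for c in prior}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


def expected_entropy(prior, likelihood, view):
    """Posterior entropy averaged over the observations that `view` may yield."""
    observations = next(iter(likelihood[view].values())).keys()
    total = 0.0
    for obs in observations:
        p_obs = sum(prior[c] * likelihood[view][c][obs] for c in prior)
        if p_obs > 0:
            total += p_obs * entropy(posterior(prior, likelihood, view, obs))
    return total


def next_best_view(prior, likelihood):
    """Greedy choice: the view with the lowest expected posterior entropy."""
    return min(likelihood, key=lambda v: expected_entropy(prior, likelihood, v))


# Assumed toy model: from the top, a mug and a bowl look alike (degenerate view);
# from the side, the handle of the mug is usually visible.
LIKELIHOOD = {
    "top": {"mug": {"round": 0.9, "handle": 0.1},
            "bowl": {"round": 0.9, "handle": 0.1}},
    "side": {"mug": {"round": 0.2, "handle": 0.8},
             "bowl": {"round": 0.95, "handle": 0.05}},
}

prior = {"mug": 0.5, "bowl": 0.5}
best = next_best_view(prior, LIKELIHOOD)
belief = posterior(prior, LIKELIHOOD, best, "handle")
print(best, belief)
```

In this toy model the top view leaves the belief unchanged whatever is observed, so the greedy rule prefers the side view. Jointly optimizing the viewpoint and the parameters of the recognizer itself, rather than treating the recognizer as static, is beyond this sketch.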
Acknowledgments
We thank the reviewers for their insightful comments and suggestions
that helped us improve the paper. A.A. first submitted this
paper while he was affiliated with York University.
References
[1] R. Graves, The Greek Myths: Complete Edition, Penguin, 1993.
[2] L.G. Roberts, Pattern recognition with an adaptive network, in: Proc. IRE
International Convention Record, 1960, pp. 66–70.
[3] J.T. Tippett, D.A. Berkowitz, L.C. Clapp, C.J. Koester, A.J. Vanderburgh (Eds.),
Optical and Electro-Optical Information Processing, MIT Press, 1965.
[4] L.G. Roberts, Machine Perception of Three Dimensional Solids, Ph.D. thesis,
Massachusetts Institute of Technology, 1963.
[5] M. Ejiri, Machine vision in early days: Japan’s pioneering contributions, in:
Proc. 8th Asian Conference on Computer Vision (ACCV), 2007.
[6] S. Kashioka, M. Ejiri, Y. Sakamoto, A transistor wire-bonding system utilizing
multiple local pattern matching techniques, IEEE Transactions on Systems,
Man and Cybernetics 6 (8) (1976) 562–570.
[7] G. Gallus, Contour analysis in pattern recognition for human chromosome
classification, Appl Biomed Calcolo Electronico 2 1968.
[8] G. Gallus, G. Regoliosi, A decisional model of recognition applied to the
chromosome boundaries, Journal of Histochemistry & Cytochemistry 22
[9] A. Jimenez, R. Ceres, J. Pons, A survey of computer vision methods for locating
fruits on trees, IEEE Transactions of the ASABE 43 (6) (2000) 1911–1920.
[10] E.N. Malamas, E.G.M. Petrakis, M. Zervakis, L. Petit, J-D. Legat, A survey on
industrial vision systems, applications and tools, Image and Vision
Computing 21 (2) (2003) 171–188.
[11] T. McInerney, D. Terzopoulos, Deformable models in medical image analysis:
a survey, Medical Image Analysis 1 (2) (1996) 91–108.
[12] A. Andreopoulos, J.K. Tsotsos, Efficient and generalizable statistical models of
shape and appearance for analysis of cardiac MRI, Medical Image Analysis 12
(3) (2008) 335–357.
[13] O.D. Trier, A.K. Jain, T. Taxt, Feature extraction methods for character
recognition – a survey, Pattern Recognition 29 (4) (1996) 641–662.
[14] S. Mori, H. Nishida, H. Yamada, Optical Character Recognition, John Wiley and
Sons, 1999.
[15] K. Takahashi, T. Kitamura, M. Takatoo, Y. Kobayashi, Y. Satoh, Traffic flow
measuring system by image processing, in: Proc. IAPR MVA, 1996, pp. 245–
[16] C.-N. Anagnostopoulos, I. Anagnostopoulos, I. Psoroulas, V. Loumos, E.
Kayafas, License plate recognition from still images and video sequences: a
survey, IEEE Transactions on Intelligent Transportation Systems 9 (3) (2008)
[17] D. Maltoni, D. Maio, A.K. Jain, S. Prabhakar, Handbook of Fingerprint
Recognition, 2nd ed., Springer Publishing Company, 2009.
[18] K.W. Bowyer, K. Hollingsworth, P.J. Flynn, Image understanding for iris
biometrics: a survey, Computer Vision and Image Understanding 110 (2)
(2008) 281–307.
[19] C.-L. Lin, K.-C. Fan, Biometric verification using thermal images of palm-dorsa
vein patterns, IEEE Transactions on Circuits and Systems for Video
Technology 14 (2) (2004) 199–213.
[20] N. Miura, A. Nagasaka, Extraction of finger-vein patterns using maximum
curvature points in image profiles, in: IAPR Conference on Machine Vision
Applications, 2005.
[21] J. Tsotsos, The Encyclopedia of Artificial Intelligence, John Wiley and Sons,
1992. pp. 641–663 (Chapter: Image Understanding).
[22] S. Dickinson, What is Cognitive Science?, Basil Blackwell Publishers, 1999. pp.
172–207 (Chapter: Object Representation and Recognition).
[23] D. Marr, Vision: A Computational Investigation into the Human
Representation and Processing of Visual Information, W.H. Freeman and
Company, 1982.
[24] A. Andreopoulos, S. Hasler, H. Wersing, H. Janssen, J.K. Tsotsos, E. Körner,
Active 3D Object Localization using a humanoid robot, IEEE Transactions on
Robotics 27 (1) (2011) 47–64.
[25] P. Perona, Object Categorization: Computer and Human Perspectives,
Cambridge University Press, 2009. pp. 55–68 (Chapter: Visual Recognition
Circa 2008).
[26] A. Andreopoulos, J.K. Tsotsos, A computational learning theory of active
object recognition under uncertainty, International Journal of Computer
Vision 101 (1) (2013) 95–142.
[27] S. Edelman, Object Categorization: Computer and Human Vision Perspectives,
Cambridge University Press, 2009. pp. 3–24 (Chapter: On what it means to
see, and what we can do about it).
[28] J.K. Tsotsos, On the relative complexity of active vs. passive visual search,
International Journal of Computer Vision 7 (2) (1992) 127–141.
[29] J.K. Tsotsos, A Computational Perspective on Visual Attention, MIT Press,
[30] Aristotle, On the Soul (De anima), translated by J.A. Smith, The Great Books,
Encyclopedia Britannica, Inc., Volume 8 1980 (Original publication 350 B.C.).
Editorial advice by the University of Chicago.
[31] R. Bajcsy, Active perception, Proceedings of the IEEE 76 (8) (1988) 966–
[32] J. Aloimonos, A. Bandopadhay, I. Weiss, Active vision, International Journal of
Computer Vision 1 (1988) 333–356.
[33] J.M. Findlay, I.D. Gilchrist, Active Vision: The Psychology of Looking and
Seeing, Oxford University Press, 2003.
[34] F. Brentano, Psychologie vom Empirischen Standpunkt, Meiner, Leipzig 1874.
[35] H. Barrow, R. Popplestone, Relational descriptions in picture processing,
Machine Intelligence 6 (1971) 377–396.
[36] T. Garvey, Perceptual strategies for purposive vision, Tech. Rep., Technical
Note 117, SRI Int’l., 1976.
[37] J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin,
Boston, 1979.
[38] R. Nevatia, T. Binford, Description and recognition of curved objects, Artificial
Intelligence 8 (1977) 77–98.
[39] R. Brooks, R. Greiner, T. Binford, The ACRONYM model-based vision system,
in: Proc. of 6th Int. Joint Conf. on Artificial Intelligence, 1979.
[40] I. Biederman, Recognition-by-components: a theory of human image
understanding, Psychological Review 94 (1987) 115–147.
[41] K. Ikeuchi, T. Kanade, Automatic generation of object recognition programs,
in: IEEE, vol. 76, 1988, pp. 1016–1035.
[42] R. Bajcsy, Active perception vs. passive perception, in: IEEE Workshop on
Computer Vision Representation and Control, Bellaire, Michigan, 1985.
[43] D. Ballard, Animate vision, Artificial Intelligence 48 (1991) 57–86.
[44] S. Soatto, Actionable information in vision, in: Proc. IEEE Int. Conf. on
Computer Vision, 2009.
[45] A. Andreopoulos, J.K. Tsotsos, A theory of active object localization, in: Proc.
IEEE Int. Conf. on Computer Vision, 2009.
[46] L. Valiant, Deductive learning, Philosophical Transactions of the Royal Society
of London 312 (1984) 441–446.
[47] L. Valiant, A theory of the learnable, Communications of the ACM 27 (11)
(1984) 1134–1142.
[48] L. Valiant, Learning disjunctions of conjunctions, in: Proc. 9th International
Joint Conference on Artificial Intelligence, 1985.
[49] S. Dickinson, D. Wilkes, J. Tsotsos, A computational model of view
degeneracy, IEEE Transactions on Pattern Analysis and Machine Intelligence
21 (8) (1999) 673–689.
[50] E. Dickmanns, Dynamic Vision for Perception and Control of Motion,
Springer-Verlag, London, 2007.
[51] S.J. Dickinson, A. Leonardis, B. Schiele, M.J. Tarr (Eds.), Object Categorization:
Computer and Human Vision Perspectives, Cambridge University Press, 2009.
[52] A. Pinz, Object categorization, Foundations and Trends in Computer Graphics
and Vision 1 (4) 2005.
[53] A.R. Hanson, E.M. Riseman, Computer Vision Systems, Academic Press, 1977.
[54] T. Binford, Visual perception by computer, in: IEEE Conference on Systems
and Control, Miami, FL, 1971.
[55] D. Marr, H. Nishihara, Representation and recognition of the spatial
organization of three dimensional shapes, in: Proceedings of the Royal
Society of London B, vol. 200, 1978, pp. 269–294.
[56] R. Brooks, Symbolic reasoning among 3-D models and 2-D images, Artificial
Intelligence Journal 17 (1–3) (1981) 285–348.
[57] I. Biederman, M. Bar, One-shot viewpoint invariance in matching novel
objects, Vision Research 39 (1999) 2885–2899.
[58] W.G. Hayward, M.J. Tarr, Differing views on views: comments on Biederman
and Bar (1999), Vision Research 40 (2000) 3895–3899.
[59] I. Biederman, M. Bar, Differing views on views: response to Hayward and Tarr
(2000), Vision Research 40 (2000) 3901–3905.
[60] M. Tarr, Q. Vuong, Stevens' Handbook of Experimental Psychology,
Sensation and Perception, third ed., vol. 1, John Wiley & Sons, 2002. pp. 287–
314 (Chapter: Visual object recognition).
[61] M. Zerroug, R. Nevatia, Three-dimensional descriptions based on the analysis
of the invariant and quasi-invariant properties of some curved-axis
generalized cylinders, IEEE Transactions on Pattern Analysis and Machine
Intelligence 18 (3) (1996) 237–253.
[62] R. Bolles, R. Horaud, 3DPO: a three-dimensional part orientation system,
International Journal of Robotics Research 5 (3) (1986) 26.
[63] C. Goad, From Pixels to Predicates, Ablex Publishing, 1986. pp. 371–391
(Chapter: Special Purpose Automatic Programming for 3D model-based
[64] D.G. Lowe, Three-dimensional object recognition from single two-
dimensional images, Artificial Intelligence 31 (3) (1987) 355–395.
[65] D. Huttenlocher, S. Ullman, Recognizing solid objects by alignment with an
image, International Journal of Computer Vision 5 (2) (1990) 195–212.
[66] S. Sarkar, K. Boyer, Integration, inference and management of spatial
information using Bayesian networks: perceptual organization, IEEE
Transactions on Pattern Analysis and Machine Intelligence 15 (3) (1993)
[67] W. Grimson, T. Lozano-Perez, Model based recognition and localization from
sparse range or tactile data, The International Journal of Robotics Research 3
(3) (1984) 3–35.
[68] T. Fan, G. Medioni, R. Nevatia, Recognizing 3-D objects using surface
descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence
11 (11) (1989) 1140–1157.
[69] D. Clemens, Region-based feature interpretation for recognizing 3D models in
2D images, Tech. Rep. 1307, MIT AI Laboratory, 1991.
[70] H. Blum, A transformation for extracting new descriptors of shape, in: Models
for the Perception of Speech and Visual Form, MIT Press, 1967.
[71] J. Koenderink, A. van Doorn, Internal representation of solid shape with
respect to vision, Biological Cybernetics 32 (4) (1979) 211–216.
[72] D. Lowe, Object recognition from local scale-invariant features, in: Proc. ICCV,
[73] A. Andreopoulos, J.K. Tsotsos, On sensor bias in experimental methods for
comparing interest point saliency and recognition algorithms, IEEE
Transactions on Pattern Analysis and Machine Intelligence 34 (1) (2012)
[74] M.J. Kearns, U.V. Vazirani, An Introduction to Computational Learning Theory,
MIT Press, 1994.
[75] D. Nistér, H. Stewénius, Scalable recognition with a vocabulary tree, in: Proc.
IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[76] M. Wertheimer, Untersuchungen zur Lehre von der Gestalt II, Psychologische
Forschung 4 (1923) 301–350.
[77] W. Köhler, Gestalt Psychology, Liveright, New York, 1929.
[78] K. Koffka, Principles of Gestalt Psychology, Harcourt, Brace, New York, 1935.
[79] S. Palmer, Vision Science: Photons to Phenomenology, MIT Press, 1999.
[80] D. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Prentice Hall,
[81] J. Elder, S. Zucker, A measure of closure, Vision Research 34 (1994) 3361–
[82] A. Berengolts, M. Lindenbaum, On the distribution of saliency, in: Computer
Vision and Pattern Recognition, 2004.
[83] A. Berengolts, M. Lindenbaum, On the distribution of saliency, IEEE
Transactions on Pattern Analysis and Machine Intelligence 28 (12) (2006)
[84] D. Lowe, The viewpoint consistency constraint, International Journal of
Computer Vision 1 (1) (1987) 57–72.
[85] J. Canny, A computational approach to edge detection, IEEE Transactions on
Pattern Analysis and Machine Intelligence 8 (1986) 679–714.
[86] S.X. Yu, J. Shi, Segmentation with pairwise attraction and repulsion, in:
International Conference on Computer Vision, 2001.
[87] S.X. Yu, J. Shi, Understanding popout through repulsion, in: Computer Vision
and Pattern Recognition, 2001.
[88] P. Verghese, D. Pelli, The information capacity of visual attention, Vision
Research 32 (5) (1992) 983–995.
[89] Y. Lamdan, J. Schwartz, H. Wolfson, Affine invariant model-based object
recognition, IEEE Transactions on Robotics and Automation 6 (5) (1990) 578–
[90] J. Schwartz, M. Sharir, Identification of partially obscured objects in two and
three dimensions by matching noisy characteristic curves, International
Journal of Robotics Research 6 (2) (1986) 29–44.
[91] A. Kalvin, E. Schonberg, J. Schwartz, M. Sharir, Two-dimensional model-based
boundary matching using footprints, International Journal of Robotics
Research 5 (4) (1986) 38–55.
[92] D. Forsyth, J. Mundy, A. Zisserman, C. Coelho, A. Heller, C. Rothwell, Invariant
descriptors for 3-D object recognition and pose, IEEE Transactions on Pattern
Analysis and Machine Intelligence 13 (10) (1991) 971–991.
[93] P. Flynn, A. Jain, 3D object recognition using invariant feature indexing of
interpretation tables, CVGIP 55 (2) (1992) 119–129.
[94] I. Rigoutsos, R. Hummel, A Bayesian approach to model matching with
geometric hashing, Computer Vision and Image Understanding 61 (7) (1995)
[95] H. Wolfson, I. Rigoutsos, Geometric hashing: an overview, IEEE Computer
Science and Engineering 4 (4) (1997) 10–21.
[96] A. Wallace, N. Borkakoti, J. Thornton, TESS: a geometric hashing algorithm for
deriving 3D coordinate templates for searching structural databases.
Application to enzyme active sites, Protein Science 6 (11) (1997) 2308–
[97] U. Grenander, General Pattern Theory, Oxford University Press, 1993.
[98] S. Agarwal, D. Roth, Learning a sparse representation for object detection, in:
European Conference on Computer Vision, vol. 4, 2002.
[99] M. Weber, M. Welling, P. Perona, Towards automatic discovery of object
categories, in: IEEE Conference on Computer Vision and Pattern Recognition,
[100] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised
scale-invariant learning, in: Computer Vision and Pattern Recognition, 2003.
[101] S. Lazebnik, C. Schmid, J. Ponce, Semi-local affine parts for object recognition,
in: British Machine Vision Conference, 2004.
[102] K. Mikolajczyk, C. Schmid, Scale and affine invariant interest point detectors,
International Journal of Computer Vision 60 (1) (2004) 63–86.
[103] D.G. Pelli, B. Farell, D.C. Moore, The remarkable inefficiency of word
recognition, Nature 423 (2003) 752–756.
[104] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition in
cortex, Nature Neuroscience 2 (11) (1999) 1019–1025.
[105] T. Serre, L. Wolf, T. Poggio, Object recognition with features inspired by visual
cortex, in: Computer Vision and Pattern Recognition, 2005.
[106] J. Mutch, D.G. Lowe, Multiclass object recognition with sparse localized
features, in: Computer Vision and Pattern Recognition, 2006.
[107] M.A. Fischler, R.A. Elschlager, The representation and matching of pictorial
structures, IEEE Transactions on Computers C-22 (1) (1973) 67–92.
[108] K. Tanaka, Neuronal mechanisms of object recognition, Science (1993) 685–
[109] A. Pentland, Perceptual organization and the representation of natural form,
Artificial Intelligence 28 (2) (1986) 293–331.
[110] A. Jaklic, A. Leonardis, F. Solina, Segmentation and Recovery of Superquadrics,
Springer, 2000.
[111] T. Heimann, H.-P. Meinzer, Statistical shape models for 3D medical image
segmentation: a review, Medical Image Analysis 13 (4) (2009) 543–563.
[112] S. Dickinson, D. Metaxas, Integrating qualitative and quantitative shape
recovery, International Journal of Computer Vision 13 (3) (1994) 1–20.
[113] S. Sclaroff, A. Pentland, Modal matching for correspondence and recognition,
IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (6) (1995)
[114] T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE Transactions
on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685.
[115] T. Cootes, C. Taylor, D. Cooper, J. Graham, Active shape models-their training
and application, Computer Vision and Image Understanding 61 (1) (1995)
[116] T. Strat, M. Fischler, Context-based vision: recognizing objects using
information from both 2D and 3D imagery, IEEE Transactions on Pattern
Analysis and Machine Intelligence 13 (10) (1991) 1050–1065.
[117] L. Stark, K. Bowyer, Function-based generic recognition for multiple object
categories, CVGIP 59 (1) (1994) 1–21.
[118] A. Torralba, P. Sinha, Statistical context priming for object detection, in:
Proceedings of the IEEE International Conference on Computer Vision, 2001,
pp. 763–770.
[119] A. Torralba, K. Murphy, W. Freeman, M. Rubin, Context-based vision system
for place and object recognition, in: ICCV, 2003.
[120] A. Torralba, Contextual priming for object detection, International Journal of
Computer Vision 53 (2) (2003) 169–191.
[121] D. Hoiem, A.A. Efros, M. Hebert, Putting objects in perspective, in: Computer
Vision and Pattern Recognition, 2006.
[122] C. Siagian, L. Itti, Gist: a mobile robotics application of context-based vision in
outdoor environment, in: Computer Vision and Pattern Recognition
Workshops, 2005.
[123] L. Wolf, S. Bileschi, A critical view of context, International Journal of
Computer Vision 69 (2) (2006) 251–261.
[124] M. Minsky, A Framework for Representing Knowledge, Tech. Rep. 306, MIT-AI
Laboratory Memo, 1974.
[125] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and
Techniques, MIT Press, 2009.
[126] A. Hanson, E. Riseman, The VISIONS Image-Understanding System, Lawrence
Erlbaum Associates, 1988. pp. 1–114 (Chapter 1).
[127] H. Grabner, J. Gall, L.V. Gool, What makes a chair a chair? in: Proc. CVPR,
[128] M. Stark, P. Lies, M. Zillich, J. Wyatt, B. Schiele, Functional object class
detection based on learned affordance cues, in: Proc. of the 6th International
Conference on Computer Vision Systems, 2008.
[129] J. Gibson, The Theory of Affordances, Erlbaum Associates, 1977.
[130] C. Castellini, T. Tommasi, N. Noceti, F. Odone, B. Caputo, Using object
affordances to improve object recognition, IEEE Transactions on Autonomous
Mental Development 3 (3) (2011) 207–215.
[131] B. Ridge, D. Skocaj, A. Leonardis, Unsupervised learning of basic object
affordances from object properties, in: Proc. Computer Vision Winter
Workshop, 2009.
[132] A. Saxena, J. Driemeyer, A. Ng, Robotic grasping of novel objects using vision,
The International Journal of Robotics Research 27 (2) (2008) 157–173.
[133] E.M. Riseman, A.R. Hanson, Computer vision research at the University of
Massachusetts, International Journal of Computer Vision 2 (1989) 199–207.
[134] S.Z. Li, Markov Random Field Modeling in Image Analysis, Springer-Verlag,
[135] S. Kumar, M. Hebert, Discriminative random fields: a discriminative
framework for contextual interaction in classification, in: International
Conference on Computer Vision, 2003.
[136] K.P. Murphy, A. Torralba, W.T. Freeman, Using the forest to see the trees: a
graphical model relating features, objects and scenes, in: NIPS, 2003.
[137] L. Li, L.F. Fei, What, where and who? Classifying events by scene and object
recognition, in: ICCV, 2007.
[138] J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image
categorization and segmentation, in: CVPR, 2008.
[139] G. Heitz, D. Koller, Learning spatial context: using stuff to find things, in:
ECCV, 2008.
[140] S. Divvala, D. Hoiem, J. Hays, A. Efros, M. Hebert, An empirical study of
context in object detection, in: CVPR, 2009.
[141] H. Murase, S. Nayar, Visual learning and recognition of 3-D objects from
appearance, IJCV 14 (1995) 5–24.
[142] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for
classification of texture and object categories: a comprehensive study,
International Journal of Computer Vision 73 (2) (2007) 213–238.
[143] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P.
Yanker, The QBIC project: querying images by content using color, texture
and shape, in: SPIE Conference on Geometric Methods in Computer Vision II,
[144] M. Pontil, A. Verri, Support vector machines for 3D object recognition, IEEE
Transactions on Pattern Analysis and Machine Intelligence 20 (6) (1998)
[145] B. Schiele, J. Crowley, Recognition without correspondence using
multidimensional receptive field histograms, International Journal of
Computer Vision 36 (1) (2000) 31–50.
[146] M. Turk, A. Pentland, Face recognition using eigenfaces, in: Proc. IEEE
Conference on Computer Vision and Pattern Recognition, 1991.
[147] C. Huang, O. Camps, T. Kanungo, Object recognition using appearance-based
parts and relations, in: Proceedings of the IEEE Computer Vision and Pattern
Recognition Conference, 1997, pp. 877–883.
[148] A. Leonardis, H. Bischof, Robust recognition using eigenimages, Computer
Vision and Image Understanding 78 (1) (2000) 99–118.
[149] S. Zhou, R. Chellappa, B. Moghaddam, Adaptive visual tracking and
recognition using particle filters, in: International Conference on
Multimedia and Expo, 2003.
[150] R.P.N. Rao, D.H. Ballard, An active vision architecture based on iconic
representations, Artificial Intelligence 78 (1) (1995) 461–505.
[151] C. Schmid, R. Mohr, Local grayvalue invariants for image retrieval, IEEE
Transactions on Pattern Analysis and Machine Intelligence 19 (5) (1997)
[152] G. Carneiro, A. Jepson, Phase-based local features, in: ECCV, 2002.
[153] F. Rothganger, S. Lazebnik, C. Schmid, J. Ponce, 3D Object modeling and
recognition using local affine-invariant image descriptors and multi-view
spatial constraints, International Journal of Computer Vision, 2006.
[154] R.C. Nelson, A. Selinger, A Cubist Approach to Object Recognition, in: Proc.
International Conference on Computer Vision, Bombay, India, 1998, pp. 614–
[155] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using
shape contexts, IEEE Transactions on Pattern Analysis and Machine
Intelligence 24 (4) (2002) 509–522.
[156] R.K. McConnell, Method of and apparatus for pattern recognition (US Patent
No. 4,567,610), 1986.
[157] W. Freeman, M. Roth, Orientation histograms for hand gesture recognition,
in: Proc. IEEE Intl. Workshop on Automatic Face and Gesture Recognition,
1995, pp. 296–301.
[158] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in:
European Conference on Computer Vision, 2002.
[159] P. Torr, A.W. Fitzgibbon, A. Zisserman, Maintaining multiple motion model
hypotheses over many views to recover matching and structure, in:
International Conference on Computer Vision, 1998.
[160] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, L.V. Gool, Towards
multi-view object class detection, in: Computer Vision and Pattern
Recognition, 2006.
[161] V. Ferrari, T. Tuytelaars, L.V. Gool, Integrating multiple model views for object
recognition, in: Computer Vision and Pattern Recognition, 2004.
[162] V. Ferrari, T. Tuytelaars, L.V. Gool, Wide-baseline multiple-view
correspondences, in: Computer Vision and Pattern Recognition, 2003.
[163] T. Tuytelaars, L.V. Gool, Wide baseline stereo matching based on local affinely
invariant regions, in: British Machine Vision Conference, 2000.
[164] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide baseline stereo from
maximally stable extremal regions, in: British Machine Vision Conference,
[165] S. Se, D. Lowe, J. Little, Mobile robot localization and mapping with
uncertainty using scale-invariant visual landmarks, The International
Journal of Robotics Research 21 (8) (2002) 735–758.
[166] F. Li, J. Kosecka, Probabilistic location recognition using reduced feature set,
in: IEEE International Conference on Robotics and Automation, 2006.
[167] W. Zhang, J. Kosecka, Image based localization in urban environments, in:
International Symposium on 3D Data Processing, Visualization and
Transmission, 2006.
[168] A. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image
retrieval at the end of the early years, IEEE Transactions on Pattern Analysis
and Machine Intelligence 22 (12) (2000) 1349–1380.
[169] G. Csurka, C.R. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization
with bags of keypoints, in: ECCV International Workshop on Statistical
Learning in Computer Vision, 2004.
[170] J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object
matching in videos, in: International Conference on Computer Vision, 2003.
[171] K. Grauman, T. Darrell, Unsupervised learning of categories from sets of
partially matching image features, in: Computer Vision and Pattern
Recognition, 2006.
[172] I. Kokkinos, A. Yuille, Scale invariance without scale selection, in: Proc. IEEE
Conf. on Computer Vision and Pattern Recognition, 2008.
[173] C. Lampert, M. Blaschko, T. Hofmann, Efficient subwindow search: a branch
and bound framework for object localization, IEEE Transactions on Pattern
Analysis and Machine Intelligence 31 (2009) 2129–2142.
[174] R. Fergus, P. Perona, A. Zisserman, A sparse object category model for efficient
learning and exhaustive recognition, in: Computer Vision and Pattern
Recognition, 2005.
[175] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, W.T. Freeman, Discovering
objects and their location in images, in: International Conference on
Computer Vision, 2005.
[176] S. Ullman, M. Vidal-Naquet, E. Sali, Visual features of intermediate complexity
and their use in classification, Nature Neuroscience 5 (7) (2002) 682–687.
[177] P.F. Felzenszwalb, D.P. Huttenlocher, Pictorial structures for object
recognition, International Journal of Computer Vision 61 (1) (2005) 55–79.
[178] B. Leibe, B. Schiele, Interleaved object categorization and segmentation, in:
British Machine Vision Conference, 2003.
[179] F. Li, J. Kosecka, H. Wechsler, Strangeness based feature selection for part
based recognition, in: Computer Vision and Pattern Recognition, 2006.
[180] V. Ferrari, T. Tuytelaars, L.V. Gool, Simultaneous object recognition and
segmentation by image exploration, in: European Conference on Computer
Vision, 2004.
[181] K. Siddiqi, A. Shokoufandeh, S. Dickinson, S. Zucker, Shock graphs and shape
matching, International Journal of Computer Vision 30 (1999) 1–24.
[182] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[183] B. Ommer, M. Sauter, J. Buhmann, Learning top-down grouping of
compositional hierarchies for recognition, in: CVPR, 2006.
[184] B. Ommer, J. Buhmann, Learning the compositional nature of visual objects,
in: CVPR, 2007.
[185] J. Deng, S. Satheesh, A. Berg, L. Fei-Fei, Fast and balanced: efficient label tree
learning for large scale object recognition, in: NIPS, 2011.
[186] J. Deng, A. Berg, L. Fei-Fei, Hierarchical semantic indexing for large scale image
retrieval, in: CVPR, 2011.
[187] E. Bart, I. Porteous, P. Perona, M. Welling, Unsupervised learning of visual
taxonomies, in: CVPR, 2008.
[188] E. Bart, M. Welling, P. Perona, Unsupervised organization of image
collections: taxonomies and beyond, IEEE Transactions on Pattern Analysis
and Machine Intelligence 33 (11) (2011) 2302–2315.
[189] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A. Ng,
Building high-level features using large scale unsupervised learning, in:
ICML, 2012.
[190] C. Lampert, M. Blaschko, A multiple kernel learning approach to joint multi-
class object detection, in: DAGM, 2008.
[191] C. Lampert, M. Blaschko, T. Hofmann, Beyond sliding windows: object
localization by efficient subwindow search, in: CVPR, 2008.
[192] C. Lampert, Detecting objects in large image collections and videos by
efficient subimage retrieval, in: ICCV, 2009.
[193] N. Pinto, D. Cox, J. DiCarlo, Why is real-world visual object recognition hard?,
PLoS Computational Biology 4 (1) (2008) 151–156.
[194] A. Torralba, A. Efros, Unbiased look at dataset bias, in: IEEE Conference on
Computer Vision and Pattern Recognition, 2011.
[195] R. Fergus, P. Perona, A. Zisserman, A visual category filter for Google images,
in: European Conference on Computer Vision, 2004.
[196] A. Opelt, A. Pinz, A. Zisserman, Incremental learning of object detectors using
a visual shape alphabet, in: Computer Vision and Pattern Recognition, 2006.
[197] F.-F. Li, R. Fergus, P. Perona, Learning generative visual models from few
training examples: an incremental Bayesian approach tested on 101 object
categories, in: Computer Vision and Pattern Recognition Workshops, 2004.
[198] B. Leibe, B. Schiele, Scale-invariant object categorization using a scale-
adaptive mean-shift search, in: DAGM, 2004.
[199] B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and
segmentation with an implicit shape model, in: ECCV Workshop on
Statistical Learning in Computer Vision, 2004.
[200] R. Fergus, L. Fei-Fei, P. Perona, A. Zisserman, Learning object categories from
Google’s image search, in: International Conference on Computer Vision,
[201] F. Jurie, B. Triggs, Creating efficient codebooks for visual recognition, in:
International Conference on Computer Vision, 2005.
[202] M. Isard, PAMPAS: Real-valued graphical models for computer vision, in:
Computer Vision and Pattern Recognition, 2003.
[203] R. Fergus, P. Perona, A. Zisserman, Weakly supervised scale-invariant learning
of models for visual recognition, International Journal of Computer Vision 71
(3) (2007) 273–303.
[204] E. Bienenstock, S. Geman, D. Potter, Compositionality, MDL priors, and object
recognition, in: NIPS, 1997.
[205] S.-C. Zhu, D. Mumford, A stochastic grammar of images, Foundations and
Trends in Computer Graphics and Vision 2 (4) (2007) 259–362.
[206] P. Laplace, Essai philosophique sur les probabilités, 1812.
[207] K. Fu, Syntactic Pattern Recognition and Applications, Prentice Hall, 1982.
[208] H. Blum, Biological shape and visual science, Journal of Theoretical Biology 38
(1973) 207–285.
[209] M. Leyton, A process grammar for shape, Artificial Intelligence 34 (1988)
[210] T.B. Sebastian, P. Klein, B. Kimia, Recognition of shapes by editing their shock
graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (5)
(2004) 550–571.
[211] M. Pelillo, K. Siddiqi, S. Zucker, Matching hierarchical structures using
association graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (11) (1999) 1105–1120.
[212] D. Macrini, A. Shokoufandeh, S. Dickinson, K. Siddiqi, S. Zucker, View-based 3-
D object recognition using shock graphs, in: Proc. International Conference
on Pattern Recognition, 2002.
[213] F. Demirci, A. Shokoufandeh, Y. Keselman, L. Bretzner, S. Dickinson, Object
recognition as many-to-many feature matching, International Journal of
Computer Vision.
[214] Y. Keselman, S. Dickinson, Generic model abstraction from examples, IEEE
Transactions on Pattern Analysis and Machine Intelligence: special issue on
Syntactic and Structural Pattern Recognition 27 (7) (2005).
[215] A. Shokoufandeh, D. Macrini, S. Dickinson, K. Siddiqi, S. Zucker, Indexing
hierarchical structures using graph spectra, IEEE Transactions on Pattern
Analysis and Machine Intelligence 27 (2005).
[216] K. Fukushima, Neocognitron: a self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position, Biological
Cybernetics 36 (4) (1980) 193–202.
[217] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L. Jackel,
Backpropagation applied to handwritten zip code recognition, Neural
Computation 1 (4) (1989) 541–551.
[218] H. Wersing, E. Körner, Learning optimized features for hierarchical models of
invariant object recognition, Neural Computation 15 (7) (2003) 1559–
[219] S. Fidler, G. Berginc, A. Leonardis, Hierarchical statistical learning of generic
parts of object structure, in: Computer Vision and Pattern Recognition, 2006.
[220] I. Kokkinos, A. Yuille, HOP: hierarchical object parsing, in: Proc. IEEE
Conference on Computer Vision and Pattern Recognition, 2009.
[221] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A
novel connectionist system for unconstrained handwriting recognition, IEEE
Transactions on Pattern Analysis and Machine Intelligence 31 (5) (2009)
[222] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time-
series, in: The Handbook of Brain Theory and Neural Networks, MIT Press,
[223] T. Avraham, M. Lindenbaum, Dynamic visual search using inner-scene
similarity: algorithms and inherent limitations, in: ECCV, 2004.
[224] T. Avraham, M. Lindenbaum, Attention-based dynamic visual search using
inner-scene similarity: algorithms and bounds, IEEE Transactions on Pattern
Analysis and Machine Intelligence 28 (2) (2006) 251–264.
[225] T. Avraham, M. Lindenbaum, Esaliency – a stochastic attention model
incorporating similarity information and knowledge-based preferences, in:
International Workshop on the Representation and Use of Prior Knowledge in
Vision, 2006.
[226] J. Duncan, G. Humphreys, Visual search and stimulus similarity, Psychological
Review 96 (1989) 433–458.
[227] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple
features, in: Computer Vision and Pattern Recognition, 2001.
[228] B.A. Draper, J. Bins, K. Baek, ADORE: adaptive object recognition, in: ICVS,
[229] L. Paletta, G. Fritz, C. Seifert, Cascaded sequential attention for object
recognition with informative local descriptors and q-learning of grouping
strategies, in: CVPR, 2005.
[230] L. Paletta, G. Fritz, C. Seifert, Q-learning of sequential attention for visual
object recognition from informative local descriptors, in: ICML, 2005.
[231] C. Greindl, A. Goyal, G. Ogris, L. Paletta, Cascaded attention and grouping for
object recognition from video, in: ICIAP, 2003.
[232] C. Bandera, F.J. Vico, J.M. Bravo, M.E. Harmon, L.C.B. III, Residual Q-learning
applied to visual attention, in: ICML, 1996.
[233] T. Darrell, Reinforcement learning of active recognition behaviors, in: NIPS,
[234] H.D. Tagare, K. Toyama, J.G. Wang, A maximum-likelihood strategy for
directing attention during visual search, IEEE Transactions on Pattern
Analysis and Machine Intelligence 23 (5) (2001) 490–500.
[235] P. Viola, M. Jones, Robust real-time object detection, in: Second International
Workshop on Statistical and Computational Theories of Vision – Modeling,
Learning, Computing and Sampling, 2001.
[236] P. Viola, M.J. Jones, D. Snow, Detecting pedestrians using patterns of motion
and appearance, International Journal of Computer Vision 63 (2) (2005) 153–
[237] A. Torralba, K.P. Murphy, W.T. Freeman, Sharing features: efficient boosting
procedures for multiclass object detection, in: Computer Vision and Pattern
Recognition, 2004.
[238] A. Opelt, A. Pinz, A. Zisserman, A boundary-fragment-model for object
detection, in: European Conference on Computer Vision, 2006.
[239] A. Andreopoulos, J.K. Tsotsos, Active vision for door localization and door
opening using playbot: a computer controlled wheelchair for people with
mobility impairments, in: Proc. 5th Canadian Conference on Computer and
Robot Vision, 2008.
[240] Y. Amit, D. Geman, A computational model for visual selection, Neural
Computation 11 (1999) 1691–1715.
[241] J.H. Piater, Visual Feature Learning, Ph.D. thesis, University of Massachusetts
Amherst, 2001.
[242] F. Fleuret, D. Geman, Coarse-to-fine face detection, International Journal of
Computer Vision 41 (1–2) (2001) 85–107.
[243] J. Sullivan, A. Blake, M. Isard, J. MacCormick, Bayesian object localisation in
images, International Journal of Computer Vision 44 (2) (2001) 111–
[244] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani,
J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video
content: the QBIC system, IEEE Computer 28 (9) (1995) 23–32.
[245] A. Gupta, R. Jain, Visual information retrieval, Communications of the ACM 40
(5) (1997) 70–79.
[246] S. Mukherjea, K. Hirata, Y. Hara, Amore: A World Wide Web image retrieval
engine, in: Proc. International World Wide Web Conference, 1999.
[247] A. Pentland, R. Picard, S. Sclaroff, Photobook: tools for content-based
manipulation of image databases, in: Proc. of the Conference on Storage
and Retrieval for Image and Video Databases II, SPIE, 1994.
[248] J. Smith, S.-F. Chang, Visualseek: a fully automated content-based image
query system, in: Proc. of the ACM International Conference on Multimedia,
[249] J. Wang, G. Wiederhold, O. Firschein, S. Wei, Content-based image indexing
and searching using Daubechies’ wavelets, International Journal of Digital
Libraries 1 (4) (1998) 311–328.
[250] W. Ma, B. Manjunath, Netra: a toolbox for navigating large image databases,
in: Proc. IEEE International Conference on Image Processing, 1997.
[251] J. Laaksonen, M. Koskela, S. Laakso, E. Oja, Picsom – content-based image
retrieval with self-organizing maps, Pattern Recognition Letters 21 (2000)
[252] T. Judd, F. Durand, A. Torralba, A Benchmark of Computational Models of
Saliency to Predict Human Fixations, Tech. Rep. TR-2012-001, MIT-CSAIL,
[253] P. Zikopoulos, D. deRoos, K.P. Corrigan, Harness the Power of Big Data: The
IBM Big Data Platform, McGraw-Hill, 2012.
[254] www.comscore.com.
[255] A. Blaser, Database techniques for pictorial applications, Lecture Notes in
Computer Science, vol. 81, Springer Verlag, 1979.
[256] R. Jain, Visual information management systems, in: Proc. US NSF Workshop,
[257] M.S. Lew, N. Sebe, C. Djeraba, R. Jain, Content-based multimedia information
retrieval: state of the art and challenges, ACM Transactions on Multimedia
Computing 2 (1) (2006) 1–19.
[258] R. Datta, D. Joshi, J. Li, J.Z. Wang, Image retrieval: ideas, influences, and trends
of the new age, ACM Computing Surveys 40 (2) (2008) 1–60.
[259] R.C. Veltkamp, M. Tanase, Content-based image retrieval systems: a survey,
Tech. Rep., Department of Computer Science, Utrecht University, 2002.
[260] D. Huijsmans, N. Sebe, How to complete performance graphs in content-
based image retrieval: add generality and normalize scope, IEEE Transactions
on Pattern Analysis and Machine Intelligence 27 (2) (2005) 245–251.
[261] H. Tamura, S. Mori, T. Yamawaki, Texture features corresponding to visual
perception, IEEE Transactions on Systems, Man and Cybernetics 8 (6) (1978)
[262] A. Pentland, R.W. Picard, S. Sclaroff, Photobook: content-based manipulation
of image databases, International Journal of Computer Vision 18 (3) (1996)
[263] J. Laaksonen, M. Koskela, E. Oja, PicSOM – self-organizing image retrieval
with MPEG-7 content descriptors, IEEE Transactions on Neural Networks 13
(4) (2002) 841–853.
[264] V. Viitaniemi, J. Laaksonen, Techniques for still image scene classification and
object detection, in: ICANN, 2006.
[265] V. Viitaniemi, J. Laaksonen, Techniques for image classification, object
detection and object segmentation, in: Visual Information Systems, Web-
Based Visual Information Search and Management, 2008.
[266] D. Wilkes, J. Tsotsos, Behaviours for active object recognition, in: SPIE
Conference, 1993, pp. 225–239.
[267] S.D. Roy, S. Chaudhury, S. Banerjee, Isolated 3D object recognition through
next view planning, IEEE Transactions on Systems, Man and Cybernetics, Part
A: Systems and Humans 30 (1) (2000) 67–76.
[268] E. Dickmanns, http://www.dyna-vision.de/.
[269] H. Meissner, E. Dickmanns, Control of an unstable plant by computer vision,
in: Image Sequence Processing and Dynamic Scene Analysis, Springer-Verlag,
Berlin, 1983, pp. 532–548.
[270] E. Dickmanns, A. Zapp, Guiding land vehicles along roadways by computer
vision, in: Proc. Congres Automatique, 1985.
[271] E. Dickmanns, A. Zapp, A curvature-based scheme for improving road vehicle
guidance by computer vision, in: Proc. Mobile Robots, SPIE, 1986.
[272] B. Mysliwetz, E. Dickmanns, A vision system with active gaze control for real-
time interpretation of well structured dynamic scenes, in: Proc. 1st
conference on intelligent autonomous systems (IAS-1), 1986.
[273] E. Dickmanns, B. Mysliwetz, Recursive 3-D road and relative ego-state
recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence
14 (2) (1992) 199–213.
[274] M. Lutzeler, E. Dickmanns, Road recognition with MarVEye, in: Proc. Intern.
Conf. on Intelligent Vehicles, 1998.
[275] F. Thomanek, E. Dickmanns, D. Dickmanns, Multiple object recognition and
scene interpretation for autonomous road vehicle guidance, in: Proc. of Int.
Symp. on Intelligent Vehicles, 1994.
[276] J. Schick, E. Dickmanns, Simultaneous estimation of 3-D shape and motion of
objects by computer vision, in: IEEE Workshop on Visual Motion, 1991.
[277] S. Werner, S. Furst, D. Dickmanns, E. Dickmanns, A vision-based multi-sensor
machine perception system for autonomous aircraft landing approach, in:
Enhanced and Synthetic Vision AeroSense, 1996.
[278] S. Furst, S. Werner, D. Dickmanns, E. Dickmanns, Landmark navigation and
autonomous landing approach with obstacle detection for aircraft, in: Proc.
AeroSense, 1997.
[279] F. Callari, F. Ferrie, Active recognition: looking for differences, International
Journal of Computer Vision 43 (3) (2001) 189–204.
[280] S. Dickinson, H. Christensen, J. Tsotsos, G. Olofsson, Active object recognition
integrating attention and viewpoint control, Computer Vision and Image
Understanding 67 (3) (1997) 239–260.
[281] B. Schiele, J. Crowley, Transinformation for active object recognition, in: Proc.
Int. Conf. on Computer Vision, 1998.
[282] H. Borotschnig, L. Paletta, M. Prantl, A. Pinz, Active object recognition in
parametric eigenspace, in: Proc. British Machine Vision Conference, 1998, pp.
[283] L. Paletta, M. Prantl, Learning temporal context in active object recognition
using Bayesian analysis, in: International Conference on Pattern Recognition,
[284] S.D. Roy, S. Chaudhury, S. Banerjee, Recognizing large 3D objects through next
view planning using an uncalibrated camera, in: Proc. ICCV, 2001.
[285] S.D. Roy, N. Kulkarni, Active 3D object recognition using appearance based
aspect graphs, in: Proc. ICVGIP, 2004, pp. 40–45.
[286] S.A. Hutchinson, A. Kak, Planning sensing strategies in a robot work cell with
multi-sensor capabilities, IEEE Transactions on Robotics and Automation 5
(6) (1989) 765–783.
[287] K. Gremban, K. Ikeuchi, Planning multiple observations for object recognition,
International Journal of Computer Vision 12 (2/3) (1994) 137–172.
[288] S. Herbin, Recognizing 3D objects by generating random actions, in: CVPR,
[289] S. Kovacic, A. Leonardis, F. Pernus, Planning sequences of views for 3D object
recognition and pose determination, Pattern Recognition 31 (10) (1998)
[290] J. Denzler, C.M. Brown, Information theoretic sensor data selection for active
object recognition and state estimation, IEEE Transactions on Pattern
Analysis and Machine Intelligence 24 (2) (2002) 145–157.
[291] C. Laporte, T. Arbel, Efficient discriminant viewpoint selection for active
Bayesian recognition, International Journal of Computer Vision 68 (3) (2006)
[292] A.K. Mishra, Y. Aloimonos, Active segmentation, International Journal of
Humanoid Robotics 6 (3) (2009) 361–386.
[293] A.K. Mishra, Y. Aloimonos, C. Fermüller, Active segmentation for robotics, in:
IROS, 2009.
[294] X. Zhou, D. Comaniciu, A. Krishnan, Conditional feature sensitivity: a unifying
view on active recognition and feature selection, in: ICCV, 2003.
[295] Microsoft Kinect, http://www.xbox.com/en-us/kinect.
[296] J. Tang, S. Miller, A. Singh, P. Abbeel, A textured object recognition pipeline for
color and depth image data, in: ICRA, 2012.
[297] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light
sensor, in: Proc ICCV Workshops, 2011.
[298] L. Xia, C.-C. Chen, J. Aggarwal, Human detection using depth information by
kinect, in: Proc. Computer Vision and Pattern Recognition Workshops, 2011.
[299] K. Lai, L. Bo, X. Ren, D. Fox, Sparse distance learning for object recognition
combining RGB and depth information, in: Proc. ICRA, 2011.
[300] F. Callari, F. Ferrie, Active recognition: using uncertainty to reduce ambiguity,
in: Proc. ICPR, 1996.
[301] L. Paletta, A. Pinz, Active object recognition by view integration and
reinforcement learning, Robotics and Autonomous Systems 31 (2000) 71–86.
[302] R.D. Rimey, C.M. Brown, Control of selective perception using bayes nets and
decision theory, International Journal of Computer Vision 12 (2/3) (1994)
[303] L.E. Wixson, D.H. Ballard, Using intermediate objects to improve the
efficiency of visual search, International Journal of Computer Vision 12 (2/
3) (1994) 209–230.
[304] K. Sjöö, A. Aydemir, P. Jensfelt, Topological spatial relations for active visual
search, Robotics and Autonomous Systems 60 (9) (2012) 1093–1107.
[305] K. Brunnström, T. Lindeberg, J.-O. Eklundh, Active detection and
classification of junctions by foveation with a head-eye system guided by
the scale-space primal sketch, in: ECCV, 1992.
[306] K. Brunnström, J.-O. Eklundh, T. Uhlin, Active fixation for scene exploration,
International Journal of Computer Vision 17 (2) (1996) 137–162.
[307] Y. Ye, J. Tsotsos, Sensor planning for 3D object search, Computer Vision and
Image Understanding 73 (2) (1999) 145–168.
[308] S. Minut, S. Mahadevan, A reinforcement learning model of selective visual
attention, in: International Conference on Autonomous Agents, 2001.
[309] T. Kawanishi, H. Murase, S. Takagi, Quick 3D object detection and localization
by dynamic active search with multiple active cameras, in: International
Conference on Pattern Recognition, 2002.
[310] S. Ekvall, P. Jensfelt, D. Kragic, Integrating active mobile robot object
recognition and SLAM in natural environments, in: Proc. Intelligent Robots
and Systems, 2006.
[311] D. Meger, P. Forssen, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J.
Little, D. Lowe, Curious George: An attentive semantic robot, in: Proc. Robot.
Auton. Syst., 2008.
[312] P. Forssen, D. Meger, K. Lai, S. Helmer, J. Little, D. Lowe, Informed visual
search: combining attention and object recognition, in: Proc. IEEE
International Conference on Robotics and Automation, 2008.
[313] F. Saidi, O. Stasse, K. Yokoi, F. Kanehiro, Online object search with a humanoid
robot, in: Proc. Intelligent Robots and Systems, 2007.
[314] H. Masuzawa, J. Miura, Observation planning for efficient environment
information summarization, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots
and Systems, 2009, pp. 5794–5800.
[315] K. Sjöö, D.G. Lopez, C. Paul, P. Jensfelt, D. Kragic, Object search and
localization for an indoor mobile robot, Journal of Computing and
Information Technology 1 (2009) 67–80.
[316] J. Ma, T.H. Chung, J. Burdick, A probabilistic framework for object search with
6-DOF pose estimation, The International Journal of Robotics Research 30
(10) (2011) 1209–1228.
[317] K. Ozden, K. Schindler, L.V. Gool, Multibody structure-from-motion in
practice, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6) (2010) 1134–1141.
[318] A. Yarbus, Eye Movements and Vision, Plenum, New York, 1967.
[319] D. Bruckner, M. Vincze, I. Hinterleitner, Towards reorientation with a
humanoid robot, in: Leveraging Applications of Formal Methods, Verification
and Validation (2012) 156–161.
[320] J.-K. Yoo, J.-H. Kim, Fuzzy integral-based gaze control architecture
incorporated with modified-univector field-based navigation for humanoid
robots, IEEE Transactions on Systems, Man and Cybernetics, Part B:
Cybernetics 42 (1) (2012) 125–139.
[321] J. Malik, Interpreting line drawings of curved objects, International Journal of
Computer Vision 1 (1) (1987) 73–104.
[322] A. Andreopoulos, Active Object Recognition in Theory and Practice, Ph.D.
thesis, York University, January 2011.
[323] J. Najemnik, W.S. Geisler, Optimal eye movement strategies in visual search,
Nature 434 (2005) 387–391.
[324] M. Everingham, L.V. Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL
visual object classes (VOC) challenge, International Journal of Computer
Vision 88 (2) (2010).
[325] L. Fei-Fei, R. Fergus, P. Perona, Caltech 101 dataset. <http://
[326] G. Griffin, A. Holub, P. Perona, Caltech 256 dataset. <http://
[327] A.F. Smeaton, P. Over, W. Kraaij, TRECVID dataset. <http://www-
[328] C.G. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W. Smeulders,
The challenge problem for automated detection of 101 semantic concepts in
multimedia, in: Proceedings of ACM Multimedia, 2006.
[329] B. Yao, X. Yang, S. Zhu, The Lotus Hill dataset. <http://www.
[330] M. Sanderson, P. Clough, H. Muller, J. Kalpathy-Cramer, M. Ruiz, D.D.
Fushman, S. Nowak, J. Liebetrau, T. Tsikrika, J. Kludas, A. Popescu, H. Goeau,
A. Joly, ImageCLEF dataset. <http://www.imageclef.org/>.
[331] S.A. Nene, S.K. Nayar, H. Murase, COIL-100 dataset.
<http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php>.
[332] B. Leibe, B. Schiele, The ETH-80 dataset. <http://www.mis.informatik.tu-
[333] J. Willamowski, D. Arregui, G. Csurka, C. Dance, L. Fan, Categorizing nine
visual classes using local appearance descriptors, in: ICPR Workshop on
Learning for Adaptive Visual Systems, 2004.
[334] I. Laptev, T. Lindeberg, KTH action dataset, 2004. <http://www.nada.kth.se/
[335] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in:
Computer Vision and Pattern Recognition, 2005.
[336] A. Opelt, M. Fussenegger, A. Pinz, P. Auer, Weak hypotheses and boosting for
generic object detection and recognition, in: European Conference on
Computer Vision, 2004.
[337] B. Russell, A. Torralba, K. Murphy, W.T. Freeman, LabelMe: a database and
web-based tool for image annotation, International Journal of Computer
Vision, 2007. <http://labelme2.csail.mit.edu/Release3.0/browserTools/php/
[338] A. Torralba, R. Fergus, W. Freeman, 80 Million tiny images: a large dataset for
non-parametric object and scene recognition, IEEE Transactions on Pattern
Analysis and Machine Intelligence 30 (11) (2008). <http://
[339] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale
hierarchical image database, IEEE Computer Vision and Pattern Recognition,
2009. <http://www.image-net.org/>.
[340] B. Yao, X. Jiang, A. Khosla, A. Lin, L. Guibas, L. Fei-Fei, Human action
recognition by learning bases of action attributes and parts, International
Conference on Computer Vision, 2011. <http://vision.stanford.edu/Datasets/
[341] A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, The DET
curve in assessment of detection task performance, in: 5th European
Conference on Speech Communication and Technology, 1997.
[342] M. Tahir, J. Kittler, K. Mikolajczyk, F. Yan, K. van de Sande, T. Gevers, Visual
category recognition using spectral regression and kernel discriminant
analysis, in: IEEE International Conference on Computer Vision Workshops,
[343] R. Kasturi, D.B. Goldgof, P. Soundararajan, V. Manohar, J.S. Garofolo, R.
Bowers, M. Boonstra, V.N. Korzhova, J. Zhang, Framework for performance
evaluation of face, text, and vehicle detection and tracking in video: data,
metrics, and protocol, IEEE Transactions on Pattern Analysis and Machine
Intelligence 31 (2) (2009) 319–336.
[344] V. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, D. Doermann, T. Drayer,
Performance evaluation of object detection algorithms, in: International
Conference on Pattern Recognition, 2002.
[345] D. Doermann, D. Mihalcik, Tools and techniques for video performances
evaluation, in: International Conference on Pattern Recognition, 2000.
[346] MIT, Frontiers in Computer Vision, 2011. <http://www.
[347] J.P.A. Ioannidis, Why most published research findings are false, PLoS
Medicine 2 (8) (2005) 696–701.
[348] L.L. Zhu, Y. Chen, A. Yuille, W. Freeman, Latent hierarchical structural learning
for object detection, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2010.
[349] Y. Chen, L.L. Zhu, A. Yuille, Active MASK hierarchies for object detection, in:
ECCV, 2010.
[350] Z. Song, Q. Chen, Z. Huang, Y. Hua, S. Yan, Contextualizing object detection
and classification, in: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2011.
[351] K.E.A. van de Sande, J.R.R. Uijlings, T. Gevers, A.W.M. Smeulders,
Segmentation as selective search for object recognition, in: IEEE
International Conference on Computer Vision, 2011.
[352] L. Bourdev, J. Malik, Poselets: Body part detectors trained using 3D human
pose annotations, in: IEEE 12th International Conference on Computer Vision,
[353] L. Bourdev, S. Maji, T. Brox, J. Malik, Detecting people using mutually
consistent poselet activations, in: ECCV, 2010.
[354] F. Perronnin, J. Sanchez, T. Mensink, Improving the fisher kernel for large-
scale image classification, in: ECCV, 2010.
[355] Q. Chen, Z. Song, S. Liu, X. Chen, X. Yuan, T.-S. Chua, S. Yan, Y. Hua, Z. Huang,
S. Shen, Boosting Classification with Exclusive Context. <http://pascallin.
[356] A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple kernels for object
detection, in: IEEE International Conference on Computer Vision, 2009.
[357] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear
coding for image classification, in: Computer Vision and Pattern Recognition,
[358] F.S. Khan, J. van de Weijer, M. Vanrell, Top-down color attention for
object recognition, in: IEEE International Conference on Computer Vision,
[359] F.S. Khan, J. van de Weijer, M. Vanrell, Modulating shape features by color
attention for object recognition, International Journal of Computer Vision
[360] H. Harzallah, C. Schmid, F. Jurie, A. Gaidon, Classification aided two
stage localization. <http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/
[361] H. Harzallah, F. Jurie, C. Schmid, Combining efficient object localization and
image classification, in: IEEE International Conference on Computer Vision,
[362] M.A. Tahir, K. van de Sande, J. Uijlings, F. Yan, X. Li, K. Mikolajczyk, J. Kittler, T.
Gevers, A. Smeulders, UvA & Surrey @ PASCAL VOC 2008. <http://pascallin.
[363] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection
with discriminatively trained part-based models, IEEE Transactions on
Pattern Analysis and Machine Intelligence 32 (9) (2010) 1627–1645.
[364] F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image
categorization, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
[365] O. Chum, A. Zisserman, An exemplar model for learning object classes, in:
IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[366] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained,
multiscale, deformable part model, in: IEEE Conference on Computer Vision
and Pattern Recognition, 2008.
[367] V. Ferrari, L. Fevrier, F. Jurie, C. Schmid, Groups of adjacent contour segments
for object detection, IEEE Transactions on Pattern Analysis and Machine
Intelligence 30 (1) (2008) 36–51.
[368] J. van de Weijer, C. Schmid, Coloring local feature extraction, in: ECCV, 2006.
[369] M. Everingham, A. Zisserman, C. Williams, L.V. Gool, The Pascal Visual Object
Classes Challenge 2006 (VOC2006) Results. <http://pascallin.ecs.soton.ac.uk/
[370] M. Everingham, L.V. Gool, C. Williams, A. Zisserman, Pascal Visual Object
Classes Challenge Results for 2005. <http://pascallin.ecs.soton.ac.uk/
[371] M. Everingham, L.V. Gool, C. Williams, A. Zisserman, Pascal Visual Object
Classes Challenge Website. <http://pascallin.ecs.soton.ac.uk/challenges/VOC/>.
[372] T. Lindeberg, Feature detection with automatic scale selection, International
Journal of Computer Vision 30 (2) (1998) 79–116.
[373] D. Lowe, Distinctive image features from scale-invariant keypoints,
International Journal of Computer Vision 60 (2) (2004) 91–110.
[374] S. Lazebnik, C. Schmid, J. Ponce, A sparse texture representation using local
affine regions, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27 (8) (2005) 1265–1278.
[375] Y. Rubner, C. Tomasi, L. Guibas, The earth mover’s distance as a metric for
image retrieval, International Journal of Computer Vision 40 (2) (2000) 99–
[376] J. van de Weijer, C. Schmid, Applying color names to image description, in:
ICIP, 2007.
[377] J. van de Weijer, C. Schmid, J. Verbeek, Learning color names from real-world
images, in: CVPR, 2007.
[378] K. van de Sande, T. Gevers, C. Snoek, Evaluation of color descriptors for object
and scene recognition, in: CVPR, 2008.
[379] G. Moore, Cramming more components onto integrated circuits, Electronics
38 (8) (1965).
[380] W. Arden, M. Brillouet, P. Cogez, M. Graef, B. Huizing, R. Mahnkopf, More than
Moore White Paper by the IRC, Tech. Rep., International Technology Roadmap
for Semiconductors, 2010.
[381] D. Engelbart, Microelectronics and the art of similitude, in: Proc. IEEE
International Solid-State Circuits Conference, 1960.
[382] J. Markoff, It’s Moore’s Law, but Another Had the Idea First, New York Times
April 18, 2005. <www.nytimes.com/2005/04/18/technology/18moore.html>.
[383] The Human Brain Project: A Report to the European Commission, 2012.
[384] International Technology Roadmap for Semiconductors, 2011. <www.itrs.
[385] R. Preissl, T.M. Wong, P. Datta, M. Flickner, R. Singh, S.K. Esser, W.P. Risk, H.D.
Simon, D.S. Modha, Compass: A scalable simulator for an architecture for
cognitive computing, in: Proc. of the International Conference on High
Performance Computing, Networking, Storage and Analysis, 2012.
[386] A.M. Turing, Computing machinery and intelligence, Mind 59 (1950) 433–
[387] P. Churchland, P. Churchland, Could a machine think?, Scientific American
262 (1) (1990) 32–37.
[388] J. Hawkins, On Intelligence, Times Books, 2004.
[389] Brain Research through Advancing Innovative Neurotechnologies (BRAIN)
Initiative. <http://www.nih.gov/science/brain/>.
[390] SpiNNaker. <http://apt.cs.man.ac.uk/projects/SpiNNaker/>.
[391] FACETS: Fast Analog Computing with Emergent Transient States. <http://
[392] IFAT 4G. <http://etienne.ece.jhu.edu/projects/ifat/index.html>.
[393] NEUROGRID, Stanford University. <http://www.stanford.edu/group/
[394] Brain Corporation. <http://www.braincorporation.com/>.
[395] DARPA Neovision2 project. <www.darpa.mil/Our_Work/DSO/Programs/
[396] DARPA SyNAPSE project. <www.darpa.mil/Our_Work/DSO/Programs/
[397] IBM Cognitive Computing. <www.ibm.com/smarterplanet/us/en/
[398] International Technology Roadmap for Semiconductors 2011 Edition:
Emerging Research Devices, 2011. <www.itrs.net/Links/2011ITRS/
[399] Y. Sugita, Face perception in monkeys reared with no exposure to faces, PNAS
105 (1) (2008) 394–398.
[400] V. Mountcastle, The Mindful Brain, MIT Press, 1978 (Chapter: An Organizing
Principle for Cerebral Function: The Unit Model and the Distributed System).
[401] R. Kurzweil, How To Create a Mind: The Secret of Human Thought Revealed,
Viking Penguin, 2012.
Alexander Andreopoulos received an Honours B.Sc. degree (2003) in Computer Science and Mathematics, with High Distinction, from the University of Toronto. In 2005 he received his M.Sc. degree and in January 2011 he completed his Ph.D. degree, both in Computer Science at York University, Toronto, Canada. During 2011 he worked on the DARPA Neovision2 project. Since January 2012 he has been a researcher at IBM-Almaden, working on the DARPA-SyNAPSE/Cognitive-Computing project. He has received the DEC award for the most outstanding student in Computer Science to graduate from the University of Toronto, a SONY science scholarship, NSERC PGS-M/PGS-D scholarships and a best paper award.
John K. Tsotsos received his Ph.D. in 1980 from the University of Toronto. He was on the faculty of Computer Science at the University of Toronto from 1980 to 1999. He then moved to York University, where he was appointed Director of York’s Centre for Vision Research (2000–2006), and is currently Distinguished Research Professor of Vision Science in the Dept. of Computer Science & Engineering. He is Adjunct Professor in both Ophthalmology and Computer Science at the University of Toronto. He has published many scientific papers, and six of his conference papers have received recognition. He currently holds the NSERC Tier I Canada Research Chair in Computational Vision and is a Fellow of the Royal Society of Canada. He has served on the editorial boards of Image & Vision Computing, Computer Vision and Image Understanding, Computational Intelligence, and Artificial Intelligence and Medicine, and on many conference committees. He served as General Chair for the IEEE International Conference on Computer Vision 1999.
