The position Gaussian kernel k_C(C_A, C_A') = exp(−γ_C ||C_A − C_A'||²) describes the spatial relationship between two patches, where C_A is the center position of patch A (normalized to [0, 1]). The patch Gaussian kernel k_F(F_A, F_A') = exp(−γ_F ||F_A − F_A'||²) = φ_F(F_A)^T φ_F(F_A') measures the similarity of two patch-level features, where F_A are gradient, shape or color kernel descriptors in our case. The linear kernel W_A W_A' weights the contribution of each patch-level feature, where W_A (offset by a small positive constant) is the average of gradient magnitudes for the gradient kernel descriptor, the average of standard deviations for the shape kernel descriptor, and is always 1 for the color kernel descriptor.
Note that although efficient match kernels [1] used match kernels to aggregate patch-level features, they do not consider spatial information in the match kernels, so a spatial pyramid is required to integrate spatial information. In addition, they do not weight the contribution of each patch, which can be suboptimal. The novel joint match kernel (5) provides a way to integrate patch-level features, patch variation, and spatial information jointly.
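As an illustration (a minimal sketch, not the authors' implementation), the joint match kernel between two patch sets can be evaluated as below; F, C and W are hypothetical numpy arrays of patch-level features, patch centers in [0, 1]² and patch weights, with gamma_F and gamma_C as the kernel bandwidths.

import numpy as np

def joint_match_kernel(F_P, C_P, W_P, F_Q, C_Q, W_Q, gamma_F=1.0, gamma_C=1.0):
    # K(P,Q) = sum_{A in P} sum_{A' in Q} W_A W_A' k_C(C_A, C_A') k_F(F_A, F_A')
    d_F = ((F_P[:, None, :] - F_Q[None, :, :]) ** 2).sum(-1)  # feature distances
    d_C = ((C_P[:, None, :] - C_Q[None, :, :]) ** 2).sum(-1)  # spatial distances
    k = np.exp(-gamma_F * d_F) * np.exp(-gamma_C * d_C)
    return float((W_P[:, None] * W_Q[None, :] * k).sum())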
Evaluating the match kernel (5) is expensive. Both for computational efficiency and for representational convenience, we again extract compact low-dimensional features from (5) using the idea from kernel descriptors.
The inner product representation of the two Gaussian kernels is given by k_F(F_A, F_A') k_C(C_A, C_A') = [φ_F(F_A) ⊗ φ_C(C_A)]^T [φ_F(F_A') ⊗ φ_C(C_A')], where ⊗ denotes the tensor product.
Following [1], we learn compact features by projecting the infinite-dimensional vector φ_F(F_A) ⊗ φ_C(C_A) to a set of basis vectors. Since C_A is a two-dimensional vector, we can generate the set {φ_C(X_1), …, φ_C(X_{d_C})} of basis vectors by sampling X on 5 × 5 regular grids (d_C = 25). However, patch-level features F_A lie in a high-dimensional space and it is infeasible to sample them on dense and uniform grids. Instead, we cluster patch-level features from training images using K-means, similar to the bag of visual words method, and take the resulting centers as the set {φ_F(Y_1), …, φ_F(Y_{d_F})}. If 5000 basis vectors are generated from patch-level features, the dimensionality of the second layer kernel descriptors is 5000 × 25 = 125,000. To obtain second layer kernel descriptors of reasonable size, we can reduce the number of basis vectors using KPCA. KPCA finds the linear combination of basis vectors that best preserves the variance of the original data. The first kernel principal component can be computed by maximizing the variance of the projected data under a normalization condition.
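A brief sketch of constructing the two basis sets described above, under illustrative assumptions (grid sampling for the spatial basis, scikit-learn's KMeans for the feature basis; the training data here is a random stand-in):

import numpy as np
from sklearn.cluster import KMeans

# Spatial basis: 5 x 5 regular grid over [0, 1]^2, giving d_C = 25.
g = np.linspace(0.0, 1.0, 5)
X_basis = np.array([(u, v) for u in g for v in g])

# Feature basis: K-means centers of patch-level features, as in bag of visual words.
patch_features = np.random.rand(10000, 200)  # stand-in for real training patches
Y_basis = KMeans(n_clusters=1000, n_init=3).fit(patch_features).cluster_centers_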
KPCA is performed on the joint kernel, the product of the spatial kernel k_C and the feature kernel k_F,
which can be written as a single Gaussian kernel. This procedure is optimal in the sense of minimizing the least squares approximation error. However, it is intractable to compute the eigenvectors of a 125,000 × 125,000 matrix on a modern personal computer. Here we propose a fast algorithm for finding the eigenvectors of the Kronecker product of kernel matrices. Since kernel matrices are symmetric positive definite, we have

K_F ⊗ K_C = (U_F ⊗ U_C)(S_F ⊗ S_C)(U_F ⊗ U_C)^T,

where K_F = U_F S_F U_F^T and K_C = U_C S_C U_C^T are the eigendecompositions of the two kernel matrices. This suggests that the top r eigenvectors of K_F ⊗ K_C can be chosen from the Kronecker products of the eigenvectors of K_F and those of K_C, which significantly reduces the computational cost. The second layer kernel descriptors are then obtained by projecting onto these top eigenvectors. Recursively applying kernel descriptors in a similar manner, we can get kernel descriptors of more layers, which represent features at different levels.
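A sketch of this fast eigendecomposition, assuming precomputed kernel matrices K_F and K_C: the top eigenvectors of the Kronecker product are assembled from the two small eigendecompositions, so the 125,000 × 125,000 problem is never formed explicitly.

import numpy as np

def top_joint_eigenvectors(K_F, K_C, r):
    s_F, U_F = np.linalg.eigh(K_F)  # eigendecompositions of the small matrices
    s_C, U_C = np.linalg.eigh(K_C)
    prod = np.outer(s_F, s_C)       # candidate eigenvalues of K_F kron K_C
    flat = np.argsort(prod, axis=None)[::-1][:r]
    rows, cols = np.unravel_index(flat, prod.shape)
    # Each joint eigenvector is the Kronecker product of a pair of eigenvectors.
    V = np.stack([np.kron(U_F[:, i], U_C[:, j]) for i, j in zip(rows, cols)], axis=1)
    return prod[rows, cols], V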
c) Everyday Object Recognition using RGB-D
We recorded with the camera mounted at three different heights relative to the turntable, giving viewing angles of approximately 30, 45 and 60 degrees with the horizon. One revolution of each object was recorded at each height. Each video sequence is recorded at 20 Hz and contains around 250 frames, giving a total of 250,000 RGB + depth frames. A combination of visual and depth cues (Mixture-of-Gaussian fitting on RGB, RANSAC plane fitting on depth) produces a segmentation for each frame separating the object of interest from the background. The objects are organized into a hierarchy taken from WordNet hypernym/hyponym relations and form a subset of the categories in ImageNet. Each of the 300 objects in the dataset belongs to one of 51 categories.
Our hierarchical kernel descriptors, being a generic approach based on kernels, have no trouble generalizing from color images to depth images. Treating a depth image as a grayscale image, i.e. using depth values as intensity, gradient and shape kernel descriptors can be directly extracted, and they capture edge and shape information in the depth channel. However, color kernel descriptors extracted over the raw depth image do not have any significant meaning. Instead, we make the observation that the distance d of an object from the camera is inversely proportional to the square root of its area s in RGB images, so for a given object the product d√s is roughly constant. Since we have the segmentation of objects, we can represent s using the number of pixels belonging to the object mask. Finally, we multiply depth values by √s before extracting color kernel descriptors over this normalized depth image. This yields a feature that is sensitive to the physical size of the object.
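A minimal sketch of this normalization, assuming a depth image and a boolean segmentation mask as inputs (the function name is illustrative):

import numpy as np

def size_normalized_depth(depth, mask):
    # s: object area in pixels, taken from the segmentation mask.
    s = float(mask.sum())
    # Scaling depth by sqrt(s) gives values roughly proportional to physical size.
    return depth * np.sqrt(s)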
In the experiments section, we will compare in detail the performance of our hierarchical kernel descriptors on RGB-D object recognition to that in [15]. Our approach consistently outperforms the state of the art in [15]. In particular, our hierarchical kernel descriptors on the depth image perform much better than the combination of depth features (including spin images) used in [15], increasing the depth-only object category recognition from 53.1% (linear SVMs) and 64.7% (nonlinear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). Moreover, our depth features served as the backbone in the object-aware situated interactive system that was successfully demonstrated at the Consumer Electronics Show 2011 despite adverse lighting conditions.
d) Experiments
In this section, we evaluate hierarchical kernel descriptors on CIFAR10 and the RGB-D Object Dataset. We also provide extensive comparisons with current state-of-the-art algorithms in terms of accuracy.

Features      KDES [1]    HKDES (this work)
Color         53.9        63.4
Shape         68.2        69.4
Gradient      66.3        71.2
Combination   76.0        80.0

Table 1. Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10 (accuracy in %).
In all experiments we use the same parameter settings as the original kernel descriptors for the first layer of hierarchical kernel descriptors. For SIFT as well as gradient and shape kernel descriptors, all images are transformed into grayscale ([0, 1]). Image intensity and RGB values are normalized to [0, 1]. Like HOG [5], we compute gradients using the mask [−1, 0, 1] for gradient kernel descriptors. We also evaluate the performance of the combination of the three hierarchical kernel descriptors by concatenating the image-level feature vectors. Our experiments suggest that this combination always improves accuracy.
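For concreteness, a sketch of the gradient computation with the centered mask [−1, 0, 1], using scipy.ndimage; the function name is illustrative.

import numpy as np
from scipy.ndimage import correlate1d

def image_gradients(gray):
    # gray: HxW array with intensities normalized to [0, 1].
    gx = correlate1d(gray, [-1.0, 0.0, 1.0], axis=1)  # horizontal gradient
    gy = correlate1d(gray, [-1.0, 0.0, 1.0], axis=0)  # vertical gradient
    return np.hypot(gx, gy), np.arctan2(gy, gx)       # magnitude, orientation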
i) CIFAR10
CIFAR10 is a subset of the 80 million tiny images dataset [26, 14]. These images are downsampled to 32 × 32 pixels. The training set contains 5,000 images per category, while the test set contains 1,000 images per category.
Due to the tiny image size, we use two-layer hierarchical kernel descriptors to obtain image-level features. We keep the first layer the same as kernel descriptors. Kernel descriptors are extracted over 8 × 8 image patches on dense regular grids with a spacing of 2 pixels. We split the whole training set into a 10,000/40,000 training/validation set, and optimize the kernel parameters of the second layer kernel descriptors on the validation set using grid search. Finally, we train linear SVMs on the full training set using the optimized kernel parameter setting. Our hierarchical model can handle large numbers of basis vectors. We tried both 1000 and 5000 basis vectors for the patch-level Gaussian kernel k_F, and found that a larger number of visual words is slightly better (0.5% to 1% improvement depending on the type of kernel descriptor). In the second layer, we use 1000 basis vectors, require KPCA to keep 97% of the energy for all kernel descriptors, and produce roughly 6000-dimensional image-level features. Note that the second layer of hierarchical kernel descriptors gives image-level features, and should be compared to image-level features formed by EMK, rather than to kernel descriptors over image patches. The dimensionality of EMK features [1] is 14,000, higher than that of hierarchical kernel descriptors.
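A small sketch of the 97% energy criterion, assuming the eigenvalues of the (centered) kernel matrix have already been computed:

import numpy as np

def components_for_energy(eigenvalues, energy=0.97):
    vals = np.sort(eigenvalues)[::-1]                 # largest first
    cumulative = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(cumulative, energy) + 1)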
We compare kernel descriptors and hierarchical kernel descriptors in Table 1. As we see, hierarchical kernel descriptors consistently outperform kernel descriptors. The shape hierarchical kernel descriptor is slightly better than the shape kernel descriptor. The other two hierarchical kernel descriptors are much better than their counterparts: the gradient hierarchical kernel descriptor is about 5 percent higher than the gradient kernel descriptor, and the color hierarchical kernel descriptor is 10 percent better than the color kernel descriptor. Finally, the combination of all three hierarchical kernel descriptors outperforms the combination of all three kernel descriptors by 4 percent. We were not able to run nonlinear SVMs with Laplacian kernels on the scale of this dataset in reasonable time, given the high dimensionality of the image-level features. Instead, we make comparisons on a subset of 5,000 training images, and our experiments suggest that nonlinear SVMs have similar performance to linear SVMs when hierarchical kernel descriptors are used.

Method                             Accuracy
Logistic regression                36.0
Support Vector Machines            39.5
GIST                               54.7
SIFT                               65.6
fine-tuning GRBM                   64.8
GRBM two layers                    56.6
mcRBM                              68.3
mcRBM-DBN                          71.0
Tiled CNNs                         73.1
improved LCC                       74.5
KDES + EMK + linear SVMs           76.0
Convolutional RBM                  78.9
K-means (Triangle, 4k features)    79.6
HKDES + linear SVMs (this work)    80.0

Table 2. Comparison with state-of-the-art algorithms on CIFAR10 (accuracy in %).
We compare hierarchical kernel descriptors with the current state-of-the-art feature learning algorithms in Table 2. Deep belief nets and sparse coding have been extensively evaluated on this dataset [25, 31]. mcRBM can model pixel intensities and pairwise dependencies between them jointly. A factorized third-order restricted Boltzmann machine, followed by deep belief nets, has an accuracy of 71.0%. Tiled CNNs have the best accuracy among deep networks. The improved LCC extends the original local coordinate coding by including local tangent directions and is able to integrate geometric information. As we have seen, sophisticated feature extraction can significantly boost accuracy and is much better than using raw pixel features. SIFT features have an accuracy of 65.6% and work reasonably well even on tiny images. The combination of three hierarchical kernel descriptors has an accuracy of 80.0%, higher than all other competing techniques; its accuracy is 14.4 percent higher than SIFT, 9.0 percent higher than mcRBM combined with DBNs, and 5.5 percent higher than the improved LCC. Hierarchical kernel descriptors slightly outperform the very recent work: the convolutional RBM and the triangle K-means with 4000 centers.
ii) RGB-D Object Dataset
We evaluated hierarchical kernel descriptors on the RGB-D Object Dataset. The goals of this experiment are to: 1) verify that hierarchical kernel descriptors work well for both RGB and depth images; 2) test whether using depth information can improve object recognition. We subsampled the turntable video data by taking every fifth frame, giving around 41,877 RGB-depth image pairs. To the best of our knowledge, the RGB-D Object Dataset presented here is the largest multi-view object dataset where both RGB and depth images are provided for each view.
We use two-layer hierarchical kernel descriptors to construct image-level features. We keep the first layer the same as kernel descriptors and tune the kernel parameters of the second layer kernel descriptors by cross-validation. We extract the first layer of kernel descriptors over 16 × 16 image patches on dense regular grids with a spacing of 8 pixels. In the second layer, we use 1000 basis vectors for the patch-level Gaussian kernel k_F, require KPCA to keep 97% of the energy for all kernel descriptors as mentioned in Section 4.1, and produce roughly 3000-dimensional image-level features. Finally, we train linear SVMs on the training set and apply them on the test set. We also tried three-layer kernel descriptors, but they gave similar performance to two-layer ones.
As in [15], we distinguish between two levels of object recognition: instance recognition and category recognition. Instance recognition is recognizing distinct objects, for example a coffee mug with a particular appearance and shape. Category recognition is determining the category name of an object (e.g. coffee mug). One category usually contains many different object instances.
To test the generalization ability of our approaches, for category recognition we train models on a set of objects and at test time present to the system objects that were not present in the training set [15]. At each trial, we randomly leave one object out from each category for testing and train classifiers on the remaining 300 − 51 = 249 objects. For instance recognition we also follow the experimental setting suggested by [15]: we train models on the video sequences of each object where the viewing angles are 30 and 60 degrees with the horizon, and test on the 45-degree video sequence.
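A sketch of the category-recognition split under these assumptions, with a hypothetical mapping from object instance to category:

import random

def leave_one_object_out(instance_to_category, seed=0):
    rng = random.Random(seed)
    by_category = {}
    for obj, cat in instance_to_category.items():
        by_category.setdefault(cat, []).append(obj)
    # Hold out one randomly chosen instance per category for testing.
    test = {rng.choice(objs) for objs in by_category.values()}
    train = set(instance_to_category) - test
    return train, test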
For category recognition, the average accuracy over 10 random train/test splits is reported in the second column of Table 2. For instance recognition, the accuracy on the test set is reported in the third column of Table 2. As we expect, the combination of hierarchical kernel descriptors is much better than any single descriptor. The underlying reason is that each depth descriptor captures different information, and the weights learned by linear SVMs using supervised information can automatically balance the importance of each descriptor across objects.
Method                          Category      Instance
Color HKDES (RGB)               60.1 ± 2.1    58.4
Shape HKDES (RGB)               72.6 ± 1.9    74.6
Gradient HKDES (RGB)            70.1 ± 2.9    75.9
Combination of HKDES (RGB)      76.1 ± 2.2    79.3
Color HKDES (depth)             61.8 ± 2.4    28.8
Shape HKDES (depth)             65.8 ± 1.8    36.7
Gradient HKDES (depth)          70.8 ± 2.7    39.3
Combination of HKDES (depth)    75.7 ± 2.6    46.8
Combination of all HKDES        84.1 ± 2.2    82.4

Table 2: Comparisons on the RGB-D Object Dataset (accuracy in %). RGB denotes features over RGB images and depth denotes features over depth images.
Approaches                    Category      Instance
Linear SVMs [15]              81.9 ± 2.8    73.9
Nonlinear SVMs [15]           83.8 ± 3.5    74.8
Random Forest [15]            79.6 ± 4.0    73.1
Combination of all HKDES      84.1 ± 2.2    82.4

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.
In Table 3, we compare hierarchical kernel descriptors with the rich feature set used in [15], where SIFT, color and texton features were extracted from RGB images, and 3-D bounding boxes and spin images over depth images. Hierarchical kernel descriptors are slightly better than this rich feature set for category recognition, and much better for instance recognition.
It is worth noting that, using depth alone, we improve the category recognition accuracy in [15] from 53.1% (linear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). This shows the power of our hierarchical kernel descriptor formulation when applied to a non-conventional domain. The depth-alone results are meaningful for many scenarios where color images are not used for privacy or robustness reasons.
As a comparison, we also extracted SIFT features on both RGB and depth images and trained linear SVMs over image-level features formed by spatial pyramid EMK. The resulting classifier has an accuracy of 71.9% for category recognition, much lower than the result of the combination of hierarchical kernel descriptors (84.1%). This is not surprising since SIFT fails to capture shape and object size information. Nevertheless, hierarchical kernel descriptors provide a unified way to generate rich feature sets over both RGB and depth images, giving significantly better accuracy.
4) OBJECT RECOGNITION METHODS BASED ON TRANSFORMATION
Recognition of general three-dimensional objects from 2D images and videos is a challenging task. The common formulation of the problem is essentially: given some knowledge of how certain objects may appear, plus an image of a scene possibly containing those objects, find which objects are present in the scene and where. Recognition is accomplished by matching features of an image and a model of an object. The two most important issues that a method must address are the definition of a feature, and how the matching is found.
What is the goal in designing an object recognition system? Achieving generality, i.e. the ability to recognise any object without hand-crafted adaptation to a specific task; robustness, the ability to recognise the objects in arbitrary conditions; and easy learning, i.e. avoiding special or demanding procedures to obtain the database of models. Obviously these requirements are impossible to achieve in full generality, as it is for example impossible to recognise objects in images taken in complete darkness. The challenge is then to develop a method with minimal constraints.
Object recognition methods can be classified according to a number of characteristics. We focus on model acquisition (learning) and invariance to image formation conditions. Historically, two main trends can be identified. In the so-called geometry- or model-based object recognition, the knowledge of an object's appearance is provided by the user as an explicit CAD-like model. Typically, such a model describes only the 3D shape, omitting other properties such as colour and texture. On the other end of the spectrum are the appearance-based methods, where no explicit user-provided model is required. The object representations are usually (but not necessarily) acquired through an automatic learning phase, and the model typically relies on surface reflectance (albedo) properties. Recently, methods which put local image patches into correspondence have emerged. Models are learned automatically, and objects are represented by the appearance of small local elements. The global arrangement of the representation is constrained by weak or strong geometric models.
The rest of the paper is structured as follows. In Section 2, an overview of classes of object recognition methods is given. A survey of methods based on matching of local features is presented in Section 3, and Section 4 describes some of their successful applications. Section 5 concludes the paper.
4) CLASSES OF OBJECT RECOGNITION METHODS
i) Appearance-Based Methods
The central idea behind appearance-based methods is the following. Having seen all possible appearances of an object, can recognition be achieved by just efficiently remembering all of them? Could recognition thus be implemented as an efficient visual (pictorial) memory? The answer obviously depends on what is meant by all appearances. The approach has been successfully demonstrated for scenes with unoccluded objects on a black background [34]. But remembering all possible object appearances in the case of arbitrary background, occlusion and illumination is currently computationally prohibitive.
Appearance-based methods [6, 70, 20, 3, 40, 33, 68, 21, 30, 34] typically include two phases. In the first phase, a model is constructed from a set of reference images. The set includes the appearance of the object under different orientations, different illuminants and potentially multiple instances of a class of objects, for example faces. The images are highly correlated and can be efficiently compressed using e.g. the Karhunen-Loeve transformation (also known as Principal Component Analysis, PCA).
In the second phase, recall, parts of the input image (subimages of the same size as the training images) are extracted, possibly by segmentation (by texture, colour, motion) or by exhaustive enumeration of image windows over the whole image. The recognition system then compares an extracted part of the input image with the reference images (e.g. by projecting the part to the Karhunen-Loeve space).
A major limitation of the appearance-based approaches is that they require isolation of the complete object of interest from the background. They are thus sensitive to occlusion and require good segmentation. A number of attempts have been made to address recognition with occluded or partial data [32, 30, 65, 5, 21, 4, 64, 20, 15, 19].
The family of appearance-based object recognition methods includes global histogram matching methods. In [66, 67], Swain and Ballard proposed to represent an object by a colour histogram. Objects are identified by matching histograms of image regions to histograms of a model image. While the technique is robust to object orientation, scaling, and occlusion, it is very sensitive to lighting conditions, and it is not suitable for recognition of objects that cannot be identified by colour alone. The approach was later modified by Healey and Slater [14] and Funt and Finlayson [12] to exploit illumination invariants. Recently, the concept of histogram matching was generalised by Schiele [52, 51, 50], where, instead of pixel colours, responses of various filters are used to form the histograms (then called receptive field histograms).
To summarise, appearance-based approaches are attractive since they do not require image features or geometric primitives to be detected and matched. But their limitations, i.e. the necessity of dense sampling of training views and the low robustness to occlusion and cluttered background, make them suitable mainly for applications with limited or controlled variations in the image formation conditions, e.g. for industrial inspection.
iii) Geometry-Based Methods
In geometry- (or shape-, or model-) based methods, the information about the objects is represented explicitly. Recognition can then be interpreted as deciding whether (a part of) a given image can be a projection of the known (usually 3D) model [41] of an object.
Generally, two representations are needed: one to represent the object model, and another to represent the image content. To facilitate finding a match between model and image, the two representations should be closely related. In the ideal case there will be a simple relation between primitives used to describe the model and those used to describe the image. Were the object, for example, described by a wireframe model, the image might be best described in terms of linear intensity edges. Each edge could then be matched directly to one of the model wires. However, the model and image representations often have distinctly different meanings. The model may describe the 3D shape of an object, while the image edges correspond only to visible manifestations of that shape, mixed together with false edges (discontinuities in surface albedo) and illumination effects (shadows).
To achieve pose and illumination invariance, it is preferable to employ model primitives that are at least somewhat invariant with respect to changes in these conditions. Considerable effort has been directed at identifying primitives that are invariant with respect to viewpoint change.
The main disadvantages of geometry-based methods are: the dependency on reliable extraction of geometric primitives (lines, circles, etc.), the ambiguity in interpretation of the detected primitives (presence of primitives that are not modelled), the restriction of modelling capabilities to a class of objects composed of a few easily detectable elements, and the need to create the models manually.
iv) Recognition as a Correspondence of Local Features
Neither the geometry-based nor the appearance-based methods discussed previously do well on the requirements stated at the beginning of the paper, i.e. generality, robustness, and easy learning. Geometry-based approaches require the user to specify the object models, and can usually handle only objects consisting of simple geometric primitives. They are not general, nor do they support easy learning. Appearance-based methods demand an exhaustive set of learning images, taken from densely distributed views and illuminations. Such a set is only available when the object can be observed in a controlled environment, e.g. placed on a turntable. The methods are also sensitive to occlusion of the objects and to unknown background, thus they are not robust.
As an attempt to address the above-mentioned issues, methods based on matching local features have been proposed. Objects are represented by a set of local features, which are automatically computed from the training images. The learned features are organised into a database. When recognising a query image, local features are extracted as in the training images. Similar features are then retrieved from the database and the presence of objects is assessed in terms of the number of local correspondences. Since it is not required that all local features match, the approaches are robust to occlusion and cluttered background.
To recognise objects from different views, it is necessary to handle all variations in object appearance. The variations might be complex in general, but at the scale of the local features they can be modelled by simple, e.g. affine, transformations. Thus, by allowing simple transformations at the local scale, significant viewpoint invariance is achieved even for objects with complicated shapes. As a result, it is possible to obtain models of objects from only a few views, taken e.g. 90 degrees apart.
The main advantages of the approaches based on matching local features are summarised below.
[1] Learning, i.e. the construction of internal models of known objects, is done automatically from images depicting the objects. No user intervention is required except for providing the training images.
[2] The local representation is based on appearance. There is no need to extract geometric primitives (e.g. lines), which are generally hard to detect reliably.
[3] Segmentation of objects from the background is not required prior to recognition, and yet objects are recognised on an unknown background.
[4] Objects of interest are recognised even if partially occluded by other unknown objects in the scene.
[5] Complex variations in object appearance caused by varying viewpoint and illumination conditions are approximated by simple transformations at a local scale.
[6] Measurements on both database and query images are obtained and represented in an identical way.
Putting local features into correspondence is an approach that is in principle robust to object occlusion and cluttered background. When a part of an object is occluded by other objects in the scene, only features of that part are missed. As long as enough features are detected in the unoccluded part, the object can be recognised. The problem of cluttered background is solved in the final step of the recognition process, when a hypothesised match is verified and confirmed, and false correspondences are rejected.
Several approaches based on local features have been proposed. Generally, they follow a common structure, which is summarised below.

Detectors. First, image elements of interest are detected. The elements serve as anchor locations in the images; descriptors of local appearance will be computed at these locations. Thus, an image element is of interest if it depicts a part of an object which can be repeatedly detected and localised in images taken over a large range of conditions. The challenge is to find a definition of interest that allows fast, reliable and precisely localised detection of such elements. The brute-force alternative to the detectors is to generate local descriptors at every point. This course is obviously infeasible due to its computational complexity.
Descriptors. Once the elements of interest are found, the local image appearance in their neighbourhood has to be encoded in a way that allows for searching of similar elements.
When designing a descriptor (also called a feature vector), several aspects have to be taken into account. First, the descriptors should be discriminative enough to distinguish between features of the objects stored in the database. Were we, for example, to distinguish between two or three objects, each described by some ten-odd features, the descriptions of local appearance could be as simple as e.g. four-bin colour histograms. On the other hand, handling thousands of database objects requires the ability to distinguish between a vast number of descriptors, thus demanding a highly discriminative representation. This problem can be partially alleviated by using grouping, i.e. simultaneous consistent matching of several detected elements.
Another aspect in designing a descriptor is that it has to be invariant, or at least to some degree robust, to variations in an object's appearance that are not reflected by the detector. If, for example, the detector detects circular or elliptical regions without assigning an orientation to them, the descriptor must be made invariant to the orientation (rotational invariants). Or if the detector is imprecise in locating the elements of interest, e.g. having a tolerance of a few pixels, the descriptor must be insensitive to these small misalignments. Such a descriptor might be based e.g. on colour moments (integral statistics over the whole region), or on local histograms.
It follows that the major factors that affect the discriminative potential of a method, and thus its ability to handle large object databases, are the repeatability and the localisation precision of the detector.
Indexing. During learning of object models, descriptors of local appearance are stored in a database. In the recognition phase, descriptors are computed on the query image, and the database is searched for similar descriptors (potential matches). The database should be organised (indexed) in a way that allows efficient retrieval of similar descriptors. The character of a suitable indexing structure depends generally on the properties of the descriptors (e.g. their dimensionality) and on the distance measure used to determine which are the similar ones (e.g. Euclidean distance). Generally, for optimal performance of the index (fast retrieval times), a combination of descriptor and distance measure should be sought that minimises the ratio of distances to correct and to false matches.

The choice of indexing scheme has a major effect on the speed of the recognition process, especially on how the speed scales to large object databases. Commonly, though, database searches are done simply by sequential scan, i.e. without using any indexing structure.
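As a sketch of one simple indexing structure, descriptors can be stored in a k-d tree (scipy's cKDTree) for nearest-neighbour retrieval; the descriptor dimensionality and sizes below are illustrative. For high-dimensional descriptors, approximate search methods are typically preferred in practice.

import numpy as np
from scipy.spatial import cKDTree

database = np.random.rand(100000, 128)   # stand-in database descriptors
index = cKDTree(database)

queries = np.random.rand(500, 128)       # descriptors from a query image
distances, match_ids = index.query(queries, k=1)  # closest database descriptor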
Matching. When recognising objects in an unknown query image, local features are computed in the same form as for the database images. None, one, or possibly more tentative correspondences are then established for every feature detected in the query image. Searching the database, the Euclidean or Mahalanobis distance is typically evaluated between the query feature and the features stored in the database. The closest match, if close enough, is retrieved. These tentative correspondences are based purely on the similarity of the descriptors. A database object which exhibits a high (non-random) number of established correspondences is considered a candidate match.
Verification. The similarity of descriptors, on its own, is not a measure reliable enough to guarantee that an established correspondence is correct. As a final step of the recognition process, a verification of the presence of the model in the query image is performed. A global transformation connecting the images is estimated in a robust way (e.g. by using the RANSAC algorithm). Typically, the global transformation has the form of an epipolar geometry constraint for general (but rigid) 3D objects, or of a homography for planar objects. More complex transformations can be derived for non-rigid or articulated (piecewise rigid) objects.
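A sketch of the verification step for the planar case, using OpenCV's RANSAC homography estimation; the point arrays are assumed to be Nx2 float32 arrays of tentatively matched locations, with at least four matches.

import numpy as np
import cv2

def verify_matches(model_pts, query_pts, reproj_thresh=3.0):
    # Robustly fit a homography; correspondences inconsistent with it are outliers.
    H, mask = cv2.findHomography(model_pts, query_pts, cv2.RANSAC, reproj_thresh)
    if mask is None:
        return None, np.zeros(len(model_pts), dtype=bool)
    inliers = mask.ravel().astype(bool)
    return H, inliers  # the number of inliers can serve as the match score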
As mentioned before, if a detector cannot recover certain parameters of the image transformations, the descriptor must be made invariant to them. It is preferable, though, to have a covariant detector rather than an invariant descriptor, as that allows for more powerful global consistency verification. If, for example, the detector does not provide the orientations of the image elements, rotational invariants have to be employed in the descriptor. In such a case, it is impossible to verify that all of the matched elements agree in their orientation.
Finally, tentative correspondences which are not consistent with the estimated global transformation are rejected, and only the remaining correspondences are used to estimate the final score of the match.
In the following, the main contributions to the field of object recognition based on local correspondences are reviewed. The approaches follow the aforementioned structure, but differ in the individual steps: in how the local features are obtained (detectors), and in what the features themselves are (descriptors).
5) RECOGNITION AS A CORRESPONDENCE OF LOCAL FEATURES - A
SURVEY
a) The Approach of David Lowe
David Lowe has developed an object recognition system with emphasis on efficiency, achieving real-time recognition times. Anchor points of interest are detected with invariance to scale, rotation and translation. Since local patches undergo more complicated transformations than similarities, a local-histogram based descriptor is proposed, which is robust to imprecisions in the alignment of the patches.
Detector. The detection of regions of interest proceeds as follows:
[1] Detection of scale-space peaks. Circular regions with maximal response of the difference-of-Gaussians (DoG) filter are detected at all scales and image locations. An efficient implementation exploits the scale-space pyramid. The initial image is repeatedly convolved with a Gaussian filter to produce a set of scale-space images. Adjacent scale-space images are then subtracted to produce a set of DoG images. In these images, local minima and maxima (i.e. extrema of the DoG filter response) are detected, both in the spatial and scale domains. The result of the first phase is thus a set of triplets (x, y, σ): image locations and characteristic scales.
[2] The location of the detected points is refined. The DoG responses are locally fitted with a 3D quadratic function and the location and characteristic scale of the circular regions are determined with subpixel accuracy. The refinement is necessary, as, at higher levels of the pyramid, a displacement by a single pixel might result in a large shift in the image domain. Unstable regions are then rejected; the stability is given by the magnitude of the DoG response. Regions with a response lower than a predefined threshold are removed. Further regions are discarded which were found along linear edges; these, although having a high DoG response, have unstable localisation in one direction.
[3] One or more orientations are assigned to each region. Local histograms of gradient orientations are formed and peaks in the histogram determine the characteristic orientations.
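A compact sketch of the scale-space peak detection in step [1], using scipy.ndimage; the scale sampling and threshold are illustrative, and the refinement and edge rejection of step [2] are omitted.

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(gray, sigmas=(1.0, 1.4, 2.0, 2.8, 4.0), thresh=0.02):
    blurred = np.stack([gaussian_filter(gray, s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]             # difference-of-Gaussians stack
    # Local extrema over a 3x3x3 neighbourhood in scale and space.
    peaks = ((dog == maximum_filter(dog, size=3)) |
             (dog == minimum_filter(dog, size=3))) & (np.abs(dog) > thresh)
    scale_idx, y, x = np.nonzero(peaks)
    return x, y, scale_idx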
The SIFT Descriptor. Local image gradients are measured at the region's characteristic scale, weighted by the distance from the region centre and combined into a set of orientation histograms. Using the histograms, small misalignments in the localisation do not affect the final description. The construction of the descriptors allows for approximately 20