
Antonio Torralba

Day 3
Object Recognition II

Summer 2010

3.1 Local descriptors

Categories

Find a bottle:

Can't do,
unless you don't
care about a few errors

Instances

Find these two objects

Can nail it

Building a Panorama

M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003

How do we build a panorama?


We need to match (align) images
Global methods sensitive to occlusion, lighting, parallax
effects. So look for local features that match well.
How would you do it by eye?

Matching with Features


Detect feature points in both images

Matching with Features


Detect feature points in both images
Find corresponding pairs

Matching with Features


Detect feature points in both images
Find corresponding pairs
Use these pairs to align images
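
The three steps above can be sketched in Python with OpenCV; this is an illustration rather than the exact recipe from the slides (it assumes OpenCV >= 4.4, where SIFT ships in the main module, and the image filenames are placeholders):

```python
# Sketch: detect features, match them, and estimate an alignment (homography).
import cv2
import numpy as np

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect feature points (and descriptors) in both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Find corresponding pairs (nearest neighbours + Lowe's ratio test)
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Use these pairs to align the images (homography fit with RANSAC)
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```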

Matching with Features


Problem 1:
Detect the same point independently in both
images

no chance to match!

We need a repeatable detector

Matching with Features


Problem 2:
For each point correctly recognize the
corresponding one

We need a reliable and distinctive descriptor

More motivation
Feature points are used also for:
Image alignment (homography, fundamental matrix)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other

Selecting Good Features


What's a good feature?
Satisfies brightness constancy: looks the same in both
images
Has sufficient texture variation
Does not have too much texture variation
Corresponds to a real surface patch (see below):

[Figure: left-eye and right-eye views illustrating a good feature (a real surface patch) and a bad feature (a point that does not correspond to a single surface patch)]

Does not deform too much over time

An introductory example:
Harris corner detector

C. Harris and M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988

The Basic Idea


We should easily localize the point by looking
through a small window
Shifting a window in any direction should give
a large change in intensity

Harris Detector: Basic Idea

flat region:
no change as shift
window in all
directions

edge:
no change as shift
window along the
edge direction

corner:
significant change as
shift window in all
directions

Harris Detector: Mathematics


Window-averaged change of intensity induced
by shifting the image data by [u,v]:

E(u,v) = \sum_{x,y} w(x,y)\,[\,I(x+u,\,y+v) - I(x,y)\,]^2

where w(x,y) is the window function (either 1 inside the window and 0 outside, or a Gaussian), I(x+u, y+v) is the shifted intensity, and I(x,y) is the intensity.

Taylor series approximation to the shifted image:

E(u,v) \approx \sum_{x,y} w(x,y)\,[\,I(x,y) + u I_x + v I_y - I(x,y)\,]^2

       = \sum_{x,y} w(x,y)\,[\,u I_x + v I_y\,]^2

       = \begin{pmatrix} u & v \end{pmatrix}
         \left( \sum_{x,y} w(x,y)
         \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right)
         \begin{pmatrix} u \\ v \end{pmatrix}

Harris Detector: Mathematics


Expanding I(x,y) in a Taylor series, we have, for small shifts [u,v], a
bilinear approximation:

E(u,v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}

where M is a 2x2 matrix computed from image derivatives:

M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

M is also called the structure tensor (or second moment matrix).

Harris Detector: Mathematics


Intensity change in shifting window: eigenvalue analysis

λ1, λ2 – eigenvalues of M

Iso-intensity contour of E(u,v): the ellipse E(u,v) = const has its axes along
the directions of fastest and slowest change, with half-lengths (λ_max)^{-1/2}
and (λ_min)^{-1/2}.
Selecting Good Features

λ1 and λ2 are large

Selecting Good Features

large λ1, small λ2

Selecting Good Features

small λ1, small λ2

Harris Detector: Mathematics


Classification of image
points using
eigenvalues of M:

Corner: λ1 and λ2 are large, λ1 ~ λ2;
E increases in all directions

Edge: λ1 >> λ2 (or λ2 >> λ1)

Flat region: λ1 and λ2 are small;
E is almost constant in all directions

Harris Detector: Workflow

Harris Detector: Workflow
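
A minimal sketch of this workflow, assuming NumPy/SciPy; the Gaussian window width, the constant k = 0.04, and the threshold/non-maximum-suppression details are conventional choices, not values given on the slides:

```python
import numpy as np
from scipy import ndimage

def harris_response(image, sigma=1.0, k=0.04):
    # Image derivatives Ix, Iy
    Ix = ndimage.sobel(image.astype(float), axis=1)
    Iy = ndimage.sobel(image.astype(float), axis=0)
    # Entries of the structure tensor M, averaged with a Gaussian window w(x, y)
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    # Corner response R = det(M) - k * trace(M)^2 (large at corners)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2

def harris_corners(image, threshold=0.01):
    # Keep points where R is above a threshold and is a local maximum
    R = harris_response(image)
    local_max = (R == ndimage.maximum_filter(R, size=5))
    return np.argwhere(local_max & (R > threshold * R.max()))
```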

Ideal feature detector


Would always find the same point on an
object, regardless of changes to the image.
I.e., insensitive to changes in:
Scale
Lighting
Perspective imaging
Partial occlusion

Harris Detector: Some Properties


Rotation invariance?

Harris Detector: Some Properties


Rotation invariance

The ellipse rotates but its shape (i.e. the eigenvalues) remains the same.
The corner response R is invariant to image rotation.

Harris Detector: Some Properties


Invariant to image scale?

Harris Detector: Some Properties


Not invariant to image scale!

At the fine scale, all points along the curve are classified as edges;
only at the coarser scale is the structure detected as a corner!

Harris Detector: Some Properties


Quality of Harris detector for different scale
changes
Repeatability rate:
(# correspondences) / (# possible correspondences)

C. Schmid et al. Evaluation of Interest Point Detectors. IJCV 2000

Evaluation plots are from this paper

We want to:
detect the same interest points
regardless of image changes

Models of Image Change


Geometry
Rotation
Similarity (rotation + uniform scale)
Affine (scale dependent on direction)
valid for: orthographic camera, locally planar
object

Photometry
Affine intensity change (I → a·I + b)

Scale Invariant Detection


Consider regions (e.g. circles) of different sizes
around a point
Regions of corresponding sizes will look the same
in both images

Scale Invariant Detection


The problem: how do we choose corresponding
circles independently in each image?

Scale Invariant Detection


Solution:
Design a function on the region (circle), which is scale
invariant (the same for corresponding regions, even if
they are at different scales)
Example: average intensity. For corresponding
regions (even of different sizes) it will be the same.
For a point in one image, we can consider it as a function of
region size (circle radius)

[Figure: the function plotted against region size for Image 1 and Image 2 (scale = 1/2)]

Scale Invariant Detection


Common approach:
Take a local maximum of this function

Observation: the region size for which the maximum is achieved should be
invariant to image scale.

Important: this scale-invariant region size is found in each image
independently!

[Figure: f vs. region size for Image 1 and Image 2 (scale = 1/2); the maxima occur at region sizes s1 and s2]

Scale Invariant Detection


A good function for scale detection:
has one stable, sharp peak

[Figure: f as a function of region size; functions that are flat or have many peaks are bad, a single sharp peak is good]

For typical images, a good function is one that responds to contrast (a sharp
local intensity change)

Scale Invariant Detection


Functions for determining scale
Kernels:

L = \sigma^2\,(G_{xx}(x,y,\sigma) + G_{yy}(x,y,\sigma))    (Laplacian)

DoG = G(x,y,k\sigma) - G(x,y,\sigma)    (Difference of Gaussians)

where the Gaussian is G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\,e^{-(x^2+y^2)/(2\sigma^2)}

Note: both kernels are invariant to scale and rotation.

[Plots from Lindeberg 1998: scale-space signatures of the trace and the determinant of the Hessian as a function of scale]

Blob detection: Marr 1982; Voorhees and Poggio 1987; Blostein and Ahuja 1989

Scale Invariant Detectors

Harris-Laplacian¹: find local maximum of:
Harris corner detector in space (image coordinates)
Laplacian in scale

SIFT (Lowe)²: find local maximum of:
Difference of Gaussians in space and scale

¹ K. Mikolajczyk, C. Schmid. Indexing Based on Scale Invariant Interest Points. ICCV 2001
² D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004
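
A rough sketch of the "difference of Gaussians in space and scale" idea (not Lowe's exact implementation; the sigma values and the threshold are illustrative assumptions):

```python
import numpy as np
from scipy import ndimage

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.56, 4.1, 6.55), thresh=0.02):
    image = image.astype(float) / 255.0          # assumes an 8-bit grayscale input
    # Gaussian scale space, then differences of adjacent scales
    gaussians = [ndimage.gaussian_filter(image, s) for s in sigmas]
    dogs = np.stack([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
    keypoints = []
    # A point is kept if it is an extremum among its 26 neighbours
    # in space (3x3) and scale (the two adjacent DoG levels)
    for s in range(1, dogs.shape[0] - 1):
        cube = dogs[s - 1:s + 2]
        max_f = ndimage.maximum_filter(cube, size=3)[1]
        min_f = ndimage.minimum_filter(cube, size=3)[1]
        d = dogs[s]
        mask = ((d == max_f) | (d == min_f)) & (np.abs(d) > thresh)
        for y, x in np.argwhere(mask):
            keypoints.append((x, y, sigmas[s]))
    return keypoints
```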

CVPR 2003 Tutorial


Recognition and Matching
Based on Local Invariant
Features
David Lowe
Computer Science Department
University of British Columbia

Invariant Local Features


Image content is transformed into local feature
coordinates that are invariant to translation, rotation,
scale, and other imaging parameters

SIFT Features

Advantages of invariant local features


Locality: features are local, so robust to
occlusion and clutter (no prior segmentation)
Distinctiveness: individual features can be
matched to a large database of objects
Quantity: many features can be generated for
even small objects
Efficiency: close to real-time performance
Extensibility: can easily be extended to wide
range of differing feature types, with each
adding robustness

Scale invariance
Requires a method to repeatably select points in location
and scale:
The only reasonable scale-space kernel is a Gaussian
(Koenderink, 1984; Lindeberg, 1994)
An efficient choice is to detect peaks in the difference-of-
Gaussian pyramid (Burt & Adelson, 1983; Crowley &
Parker, 1984), but examining more scales
Difference-of-Gaussian with a constant ratio of scales is a
close approximation to Lindeberg's scale-normalized
Laplacian (can be shown from the heat diffusion
equation)

Scale space processed one octave at a time

Key point localization


Detect maxima and minima of
difference-of-Gaussian in scale
space
Fit a quadratic to surrounding
values for sub-pixel and sub-scale
interpolation (Brown & Lowe,
2002)
Taylor expansion around point:

Offset of extremum (use finite differences for derivatives):
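
The Taylor expansion and extremum offset referenced above do not survive on the slide; in Lowe's 2004 formulation (following Brown & Lowe 2002) they are:

```latex
D(\mathbf{x}) \approx D
  + \frac{\partial D}{\partial \mathbf{x}}^{\top}\mathbf{x}
  + \frac{1}{2}\,\mathbf{x}^{\top}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}}
```

where x = (x, y, σ)ᵀ is the offset from the sample point and the derivatives are estimated with finite differences.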

Select canonical orientation


Create histogram of local
gradient directions computed
at selected scale
Assign canonical orientation
at peak of smoothed
histogram
Each key specifies stable 2D
coordinates (x, y, scale,
orientation)

Example of keypoint detection


Threshold on value at DoG peak and on ratio of principal
curvatures (Harris approach)
(a) 233x189 image
(b) 832 DoG extrema
(c) 729 left after peak value threshold
(d) 536 left after testing ratio of principal curvatures

SIFT vector formation


Thresholded image gradients are sampled over 16x16
array of locations in scale space
Create array of orientation histograms
8 orientations x 4x4 histogram array = 128 dimensions

Sensitivity to number of histogram orientations

Feature stability to noise


Match features after random change in image scale &
orientation, with differing levels of image noise
Find nearest neighbor in database of 30,000 features

Feature stability to affine change


Match features after random change in image scale &
orientation, with 2% image noise, and affine distortion
Find nearest neighbor in database of 30,000 features

Distinctiveness of features
Vary size of database of features, with 30 degree affine
change, 2% image noise
Measure % correct for single nearest neighbor match

Ratio of distances reliable for matching

RECOGNITION MODELS

Families of recognition algorithms


Voting models
Viola and Jones, ICCV 2001
Heisele, Poggio, et al., NIPS 01
Schneiderman, Kanade 2004
Vidal-Naquet, Ullman 2003

Bag of words models
Csurka, Dance, Fan, Willamowski, and Bray 2004
Sivic, Russell, Freeman, Zisserman, ICCV 2005

Constellation models
Fischler and Elschlager, 1973
Burl, Leung, and Perona, 1995
Weber, Welling, and Perona, 2000
Fergus, Perona, & Zisserman, CVPR 2003

Shape matching
Berg, Berg, Malik, 2005

Deformable models
Cootes, Edwards, Taylor, 2001

Rigid template models
Sirovich and Kirby 1987
Turk, Pentland, 1991
Dalal & Triggs, 2006

3.2 Bag of words models

Bag of Words
Independent features
Histogram representation

Detect patches
[Mikolajczyk and Schmid 02]
[Matas, Chum, Urban & Pajdla, 02]
[Sivic & Zisserman, 03]
Local interest operator, or regular grid

Normalize patch

Compute descriptor, e.g. SIFT [Lowe 99]

Sivic et al. 2005

Sivic et al. 2005

[128-D SIFT space — Sivic et al. 2005]

Codewords: vector quantization of the 128-D SIFT space (Sivic et al. 2005)

Histogram of features assigned to each cluster
(codewords on the x-axis, frequency on the y-axis)
Sivic et al. 2005
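
A minimal sketch of this quantization step, assuming precomputed SIFT descriptors and using scikit-learn's KMeans (the vocabulary size and other parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=1000):
    # all_descriptors: (N, 128) array of SIFT descriptors pooled over the training set
    kmeans = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans

def bow_histogram(descriptors, kmeans):
    # Assign each descriptor of one image to its nearest codeword,
    # then count codeword frequencies and normalize
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting histograms are the bag-of-words vectors used as input to the classifiers discussed next (e.g. an SVM, as in Csurka et al. 2004).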

Uses of BoW representation


Treat as feature vector for standard classifier
e.g SVM

Cluster BoW vectors over image collection


Discover visual themes

Hierarchical models
Decompose scene/object

BoW as input to classifier


SVM for object classification
Csurka, Bray, Dance & Fan, 2004

Naïve Bayes
See 2007 edition of this course

Early bag of words models: mostly texture recognition
Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik,
2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik,
Schmid & Ponce, 2003

Hierarchical Bayesian models for documents


(pLSA, LDA, etc.)
Hofmann 1999; Blei, Ng & Jordan, 2003; Teh, Jordan, Beal &
Blei, 2004

Object categorization
Csurka, Bray, Dance & Fan, 2004; Sivic, Russell, Efros,
Freeman & Zisserman, 2005; Sudderth, Torralba, Freeman &
Willsky, 2005;

Natural scene categorization


Vogel & Schiele, 2004; Fei-Fei & Perona, 2005; Bosch,
Zisserman & Munoz, 2006

Feature level
Spatial influence through correlogram features:
Savarese, Winn and Criminisi, CVPR 2006

Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Hierarchical model of scene/objects/parts

Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Niebles & Fei-Fei, CVPR 2007

Scenes of Fixed Sets of Objects


Pr(object | scene)
Pr(part | object)

Assumes a fixed number of object instances, with one reference position per
object (plus covariance and context terms).

[Graphical model; slide credit: Erik Sudderth]

Street Scene Segmentations

1-2 minutes Gibbs sampling per image

Slide credit: Erik Sudderth

TDP for 3D Scenes


Global Density
Object category
Part size & shape
Transformation prior

G0

Transformed Densities
Object category
Part size & shape
Transformed locations

Gj

3D Scene Features
Object category
3D Location

2D Image Features

Appearance Descriptors
2D Pixel Coordinates

N J

Single-Part Office Scene Model

[Learned parts: background, bookshelves, computer screen, desk. Slide credit: Erik Sudderth]

Feature level
Generative models
Discriminative methods
Lazebnik, Schmid & Ponce, 2006

3.3 Parts and structure models

Problem with bag of words

All have equal probability for bag-of-words methods
Location information is important
BoW + locations still doesn't give correspondence

Model: Parts and Structure

Representation
Object as set of parts
Generative representation
Model:
Relative locations between parts
Appearance of part
Issues:
How to model location
How to represent appearance
How to handle occlusion/clutter
Figure from [Fischler & Elschlager 73]

History of Parts and Structure approaches

Fischler & Elschlager 1973

Yuille 91
Brunelli & Poggio 93
Lades, v.d. Malsburg et al. 93
Cootes, Lanitis, Taylor et al. 95
Amit & Geman 95, 99
Perona et al. 95, 96, 98, 00, 03, 04, 05
Felzenszwalb & Huttenlocher 00, 04
Crandall & Huttenlocher 05, 06
Leibe & Schiele 03, 04

Many papers since 2000

Sparse representation
+ Computationally tractable (10^5 pixels → 10^1 – 10^2 parts)
+ Generative representation of class
+ Avoid modeling global variability
+ Success in specific object recognition

- Throw away most image information


- Parts need to be distinctive to separate from other classes

The correspondence problem

Model with P parts
Image with N possible assignments for each part
Consider the mapping to be 1-1

N^P combinations! (a worked example follows below)
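
To make the combinatorics concrete, a worked instance with illustrative numbers (not from the slides):

```latex
% P = 6 parts, N = 100 candidate locations per part (illustrative values)
N^{P} = 100^{6} = 10^{12}\ \text{possible assignments}
```

The connectivity structures on the next slide (star, tree, k-fan) cut this down to low-order polynomial cost in N.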

Different connectivity structures


Fergus et al. 03
Fei-Fei et al. 03

O(N^6)

Csurka 04
Vasconcelos 00

Crandall et al. 05
Fergus et al. 05

O(N^2)

Crandall et al. 05

Felzenszwalb &
Huttenlocher 00

O(N^2)
O(N^3)

Bouchard & Triggs 05

Carneiro & Lowe 06

From "Sparse Flexible Models of Local Features",
Gustavo Carneiro and David Lowe, ECCV 2006

How much does shape help?
Crandall, Felzenszwalb, Huttenlocher, CVPR 05
Shape variance increases with increasing model complexity
Do get some benefit from shape

Appearance representation

SIFT

Decision trees
[Lepetit and Fua, CVPR 2005]

PCA

Figure from Winn & Shotton, CVPR 06

Learn Appearance

Generative models of appearance
Can learn with little supervision
E.g. Fergus et al. 03

Discriminative training of part appearance model
SVM part detectors
Felzenszwalb, McAllester, Ramanan, CVPR 2008
Much better performance

Hierarchical Representations

Pixels → Pixel groupings → Parts → Object

Multi-scale approach increases the number of low-level features

Amit and Geman 98
Ullman et al.
Bouchard & Triggs 05
Zhu and Mumford
Jin & Geman 06
Zhu & Yuille 07
Fidler & Leonardis 07

Images from [Amit 98]

Stochastic Grammar of Images

S. C. Zhu et al. and D. Mumford

Context and Hierarchy in a Probabilistic Image Model


Jin & Geman (2006)

e.g. animals, trees,


rocks

e.g. contours,
intermediate objects
e.g. linelets,
curvelets, T-junctions
e.g. discontinuities,
gradient

animal head instantiated


by tiger head

animal head instantiated


by bear head

A Hierarchical Compositional System for Rapid Object Detection
Long Zhu, Alan L. Yuille, 2007.

Able to learn #parts at each level

Learning a Compositional Hierarchy of Object Structure

Parts model

The architecture
Learned parts

Parts and Structure models: Summary

Explicit notion of correspondence between image and model
Efficient methods for large numbers of parts and positions in the image
With powerful part detectors, can get state-of-the-art performance
Hierarchical models allow for more parts

3.4 Discriminative methods

Classifier: Nearest Neighbor


Shakhnarovich, Viola, Darrell, 2003

10^6 examples

Berg, Berg and Malik, 2005

Classifier: Neural Networks


Fukushima's Neocognitron, 1980
Rowley, Baluja, Kanade 1998
LeCun, Bottou, Bengio, Haffner 1998
Serre et al. 2005
Riesenhuber, M. and Poggio, T. 1999

LeNet convolutional architecture (LeCun 1998)

Classifier: Boosting
Viola & Jones 2001
Haar features via Integral Image
Cascade
Real-time performance

Torralba et al., 2004
Part-based Boosting
Each weak classifier is a part
Part location modeled by
offset mask

Classifier: Support Vector Machine


Guyon, Vapnik
Heisele, Serre, Poggio, 2001
..
Dalal & Triggs, CVPR 2005
HOG: Histogram of Oriented Gradients

Learn weighting of
descriptor with linear
SVM
Image

HOG
descriptor

HOG descriptor weighted by
positive and negative SVM weights
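
A minimal sketch of the Dalal & Triggs pipeline, assuming scikit-image and scikit-learn; the window size, cell size, and training arrays are placeholders, not values from the slides:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def window_descriptor(window):
    # window: 128x64 grayscale detection window (person-sized, as in the paper)
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def train_detector(X, y):
    # X: HOG descriptors of positive (person) and negative (background) windows
    # y: labels (1 = person, 0 = not a person)
    svm = LinearSVC(C=0.01)
    svm.fit(X, y)
    return svm  # svm.coef_ is the learned weighting of the HOG descriptor
```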

Histograms of oriented gradients


Dalal & Triggs, 2006

Not a person

person

Adding parts
Felzenszwalb, McAllester, Ramanan. 2008.

Adding parts

Felzenszwalb, McAllester, Ramanan. 2008.

Felzenszwalb, McAllester, Ramanan. 2008.

3.5 Multiclass methods

Multiclass object detection


the not so early days

Multiclass object detection


the not so early days
Using a set of independent binary classifiers was a common strategy:
Viola-Jones extension for dealing with rotations
- two cascades for each view

Schneiderman-Kanade multiclass object detection

(a) One detector for each class

There is nothing wrong with this approach if you have access to
lots of training data and you do not care about efficiency.

Some symptoms of one-vs-all multiclass approaches
What is the best representation to detect a traffic sign?

Very regular object: template matching will do the job

Parts derived from training a binary classifier.

~100%
detection rate
with 0 false alarms

Some of these parts cannot be used for anything else than this object.

Some symptoms of one-vs-all multiclass approaches
Part-based object representation (looking for meaningful parts):
A. Agarwal and D. Roth

M. Weber, M. Welling and P. Perona

These studies try to recover parts that are meaningful. But is this the
right thing to do? The derived parts may be too specific, and they are
not likely to be useful in a general system.

Some symptoms of one-vs-all multiclass approaches
Computational cost grows linearly with Nclasses * Nviews * Nstyles
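
As a rough illustration with made-up but plausible numbers:

```latex
N_{\text{classes}} \times N_{\text{views}} \times N_{\text{styles}}
  = 1000 \times 12 \times 10 = 120{,}000\ \text{independent detectors}
```

each of which must be evaluated over all positions and scales of every image.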

Shared features
Is learning the 1000th object class easier
than learning the first?

Can we transfer knowledge from one


object to another?
Are the shared properties interesting by
themselves?

Multitask learning
R. Caruana. Multitask Learning. ML 1997
MTL improves generalization by leveraging the domain-specific information
contained in the training signals of related tasks. It does this by training tasks in
parallel while using a shared representation.

vs.

Sejnowski & Rosenberg 1986; Hinton 1986; Le Cun et al. 1989; Suddarth &
Kergosien 1990; Pratt et al. 1991; Sharkey & Sharkey 1992;

Multitask learning
R. Caruana. Multitask Learning. ML 1997
Primary task: detect door knobs

Tasks used:
horizontal location of doorknob
single or double door
horizontal location of doorway center
width of doorway
horizontal location of left door jamb

horizontal location of right door jamb


width of left door jamb
width of right door jamb
horizontal location of left edge of door
horizontal location of right edge of door

Sharing invariances
S. Thrun. Is Learning the n-th Thing Any Easier Than Learning The First?
NIPS 1996

Knowledge is transferred between tasks via a learned model of the


invariances of the domain: object recognition is invariant to rotation,
translation, scaling, lighting, etc. These invariances are common to all
object recognition tasks.
Toy world

With sharing

Without sharing

Convolutional Neural Network

Le Cun et al, 98
Translation invariance is already built into the network
The output neurons share all the intermediate levels

Sharing transformations
Miller, E., Matsakis, N., and Viola, P. (2000). Learning from one example
through shared densities on transforms. In IEEE Computer Vision and
Pattern Recognition.

Transformations are shared


and can be learnt from other tasks.

Models of object recognition


I. Biederman, Recognition-by-components: A theory of human image
understanding, Psychological Review, 1987.
M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in
cortex, Nature Neuroscience 1999.

T. Serre, L. Wolf and T. Poggio. Object recognition with features inspired


by visual cortex. CVPR 2005

Sharing in constellation models


(next Wednesday)

Pictorial Structures
Fischler & Elschlager, IEEE Trans. Comp. 1973

SVM Detectors
Heisele, Poggio, et. al., NIPS 2001

Constellation Model

Model-Guided Segmentation

Fergus, Perona, & Zisserman, CVPR 2003

Mori, Ren, Efros, & Malik, CVPR 2004

Reusable Parts
Krempp, Geman, & Amit Sequential Learning of Reusable Parts for Object
Detection. TR 2002

Goal: Look for a vocabulary of edges that reduces the number of


features.

[Plot: number of features as a function of the number of classes, with examples of reused parts]

Additive models and boosting


Independent binary classifiers:
Screen detector
Car detector
Face detector
Binary classifiers that share features:
Screen detector
Car detector
Face detector
Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

[Figure: features learned for pedestrian, chair, traffic light, sign, face, and a background class]

Non-shared (class-specific) feature: this feature is too specific to faces.

Shared feature: useful across several of the classes.

50 training samples/class
29 object classes
2000 entries in the dictionary
Class-specific features

Results averaged on 20 runs


Error bars = 80% interval

Shared features

Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

Generalization as a function of
object similarities
[Plots of area under ROC vs. number of training samples per class:
12 viewpoints of the same object class, K = 2.1;
12 unrelated object classes, K = 4.8]

Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

Sharing patches
Bart and Ullman, 2004
For a new class, use only features similar to features that were good for other
classes:

Proposed Dog
features

Transfer Learning for Image Classification with Sparse


Prototype Representations
A. Quattoni, M. Collins, T. Darrell, CVPR 2008

Coefficients for feature 2

Coefficients for classifier 2

Hierarchical Topic Models


Topic models typically use a
bag of words approx.:
Learning topics allows transfer
of information within a corpus of
related documents
Mixing proportions capture the
distinctive features of particular
documents

Pr(topic | doc)
Pr(word | topic)

[LDA graphical model: per-document topic proportions generate a topic z for each of the N observed words x; plate over K topics]

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003

Hierarchical Topic Models


Pr(x = word | doc) = \sum_{topic} Pr(x = word | z = topic) \, Pr(z = topic | doc)

[Same LDA graphical model as above: Pr(topic | doc) and Pr(word | topic)]

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003

Hierarchical Topic Models


Bag-of-features models with the same structure:

Object Recognition (Sivic et al., ICCV 2005)
Scene Recognition (Fei-Fei et al., CVPR 2005)

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003
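
One way to make the LDA step concrete is a minimal sketch with scikit-learn on bag-of-words count vectors; the count matrix and the number of topics are assumptions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_topics(counts, n_topics=10):
    # counts: (n_images, n_codewords) matrix of codeword counts per image
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)   # Pr(topic | doc) for each image
    topic_word = lda.components_            # unnormalized Pr(word | topic)
    topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)
    return doc_topic, topic_word
```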

Hierarchical Sharing and Context


E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. ICCV 2005.

Scenes share objects

Objects share parts

Parts share features

Learning Shared Parts

Objects are often locally similar in appearance


Discover parts shared across categories
How many total parts should we share?
How many parts should each category use?

HDP Object Model


Parts are distributions
over appearances and
locations

We learn the
number of parts.
Each object
uses a different
number of parts.
The model
assumes a
known number
of object
categories.

HDP Object Model

There is no context, so the model is happy to create
impossible part combinations.

Sharing Parts: 16 Categories

Caltech 101 Dataset (Li & Perona)


Horses (Borenstein & Ullman)
Cat & dog faces (Vidal-Naquet & Ullman)

Bikes from Graz-02 (Opelt & Pinz)


Google

Visualization of Shared Parts

Pr(position | part)
Pr(appearance | part)

Visualization of Shared Parts

Pr(position | part)
Pr(appearance | part)

Detection Task

Detection Results

6 Training Images per Category

Detection vs. Training Set Size

Recognition Task

VS.

Recognition Results

6 Training Images per Category

Recognition vs. Training Set Size

Recognition performance decreases. By sharing features, the classes look more similar.

Some more references

Baxter 1996
Caruana 1997
Schapire, Singer, 2000
Thrun, Pratt 1997
Krempp, Geman, Amit, 2002
E.L.Miller, Matsakis, Viola, 2000
Mahamud, Hebert, Lafferty, 2001
Fink et al. 2003, 2004
LeCun, Huang, Bottou, 2004
Holub, Welling, Perona, 2005

3.6 3D models

2D frontal face detection

Amazing how far they have gotten with so little

People have the bad taste of not being


rotationally symmetric

Examples of un-collaborative subjects

Objects are not flat

Solution to deal with 3D variations:


do not deal with it
Not dealing with rotations and pose:
Train a different
model for each view.

The combined detector is invariant to pose variations without an explicit 3D model.

So, how many classifiers?


Object classes

viewpoints

And why should we stop with pose?


Let's do the same with styles,
lighting conditions, etc.

Need to detect Nclasses * Nviews * Nstyles, in clutter.


Lots of variability within classes, and across viewpoints.

Depth without objects


Random dot stereograms (Bela Julesz)

Julesz, 1971

3D is so important for humans that we


decided to grow two eyes in front of the
face instead of having one looking to the
front and another to the back.
(this is not something that Julesz said but he could, maybe
he did)

Objects 3D shape priors

by H. Bülthoff, Max-Planck-Institut für biologische Kybernetik in Tübingen

Video taken from http://www.michaelbach.de/ot/fcs_hollow-face/index.html

3D drives perception of important object attributes

by Roger Shepard (Turning the Tables)


Depth processing is automatic, and we cannot shut it down

3D drives perception of important object attributes

The two Towers of Pisa


Frederick Kingdom, Ali Yoonessi and Elena Gheorghiu of McGill Vision Research unit.

It is not all about objects

The 3D percept is driven by the scene, which imposes its ruling on the objects

Class experiment

Class experiment
Experiment 1: draw a horse (the entire
body, not just the head) on a white piece of
paper.
Do not look at your neighbor! You already
know what a horse looks like, so there is no need to
cheat.

Class experiment
Experiment 2: draw a horse (the entire
body, not just the head), but this time
choose a viewpoint as weird as possible.

Anonymous participant

3D object categorization
Wait: object categorization in humans is not
invariant to 3D pose

3D object categorization
Although we can categorize all three
pictures as views of a horse, they do
not look like equally typical views of
horses, and they do not seem to be
recognizable with the same ease.

by Greg Robbins

Canonical Perspective
Experiment (Palmer, Rosch & Chase 81):
participants are shown views of an object
and are asked to rate how much each one
looks like the object it depicts
(scale: 1 = very much like, 7 = very unlike)

From Vision Science, Palmer

Canonical Perspective
Examples of canonical perspective:
In a recognition task, reaction time
correlated with the ratings.
Canonical views are recognized faster
at the entry level.

Why?

From Vision Science, Palmer

Canonical Viewpoint
Frequency hypothesis

Maximal information hypothesis

Canonical Viewpoint
Frequency hypothesis: ease of recognition is
related to the number of times we have seen the
object from each viewpoint.
For a computer, using its Google memory, a horse
looks like:

It is not a uniform sampling of viewpoints

(some artificial datasets might contain non-natural statistics)

Canonical Viewpoint
Frequency hypothesis: ease of recognition is
related to the number of times we have seen the
object from each viewpoint.

Canonical Viewpoint
Maximal information hypothesis: Some views
provide more information than others about the
objects.
Best views tend to show
multiple sides of the
object.

From Vision Science, Palmer

Canonical Viewpoint
Maximal information hypothesis:
Clocks are preferred as purely frontal

Canonical Viewpoint
Frequency hypothesis
Maximal information hypothesis
Probably both are correct.
Edelman & Bülthoff 92: created new objects to control familiarity.

1- When presenting all viewpoints with the same frequency, observers had
preference for specific viewpoints.
2- When few viewpoints were presented, recognition was better for previously
seen viewpoints.

Object representations
Explicit 3D models: use volumetric
representation. Have an explicit model of
the 3D geometry of the object.

Appealing but hard to get it to work

Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.

Not very appealing but somewhat easy to get it to work

Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.
The object is represented as a collection of 2D
views (maybe the most frequent views seen in the
past).
Tarr & Pinker (89) show people are faster at
recognizing previously seen views, as if they were
storing them. People were also able to recognize
unseen views, so they also generalize to new
views. It is not just template matching.

Why do I explain all this?


As we build systems and develop
algorithms it is good to:
Get inspiration from what others have thought
Get intuitions about what can work, and how
things can fail.

Explicit 3D model

Object Recognition in the Geometric Era: a Retrospective, Joseph L. Mundy

Explicit 3D model
Not all explicit 3D models were disappointing.
For some object classes, with accurate
geometric and appearance models, it is
possible to get remarkable results.

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

We have not yet achieved the same level of
description for other object classes

Implicit 3D models

Aspect Graphs

The nodes of the graph represent object views that are adjacent to each other
on the unit sphere of viewing directions but differ in some significant way.
The most common view relationship in aspect graphs is based on the
topological structure of the view, i.e., edges in the aspect graph arise from
transitions in the graph structure relating vertices, edges and faces of the
projected object. Joseph L. Mundy

Aspect Graphs

Patch-based single view detector


Vidal-Naquet, Ullman (2003)

Car model

Screen model

For a single view


First we collect a set of part templates from a set of training
objects.
Vidal-Naquet, Ullman (2003)

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments
Extended patches are extracted using short sequences.
Use Lucas-Kanade motion estimation to track patches across the sequence.

Learning

Once a large pool of extended fragments is created, there is a training stage
to select the most informative fragments.

For each fragment, evaluate the mutual information I(C; F) between the class
label C and the fragment present/absent variable F. If C and F are
independent, then I(C, F) = 0.

Select the fragment B with the highest mutual information. In the subsequent
rounds, select fragments that add information given the ones already chosen.

All these operations are easy to compute. It is just counting.

Worked example (counting over 10 training images): P(C=1, F=1) = 3/10; the
remaining entries P(C=1, F=0), P(C=0, F=1) and P(C=0, F=0) are read off the
same table of counts.
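
A minimal sketch of that counting in Python; the full joint-count table below is hypothetical except for the P(C=1, F=1) = 3/10 entry taken from the slide:

```python
import numpy as np

def mutual_information(joint_counts):
    # joint_counts[c, f] = number of training images with class label c and
    # fragment presence f (both in {0, 1})
    p = joint_counts / joint_counts.sum()
    pc = p.sum(axis=1, keepdims=True)   # P(C)
    pf = p.sum(axis=0, keepdims=True)   # P(F)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (pc @ pf)[mask])).sum())

# 10 images; 3 of them contain both the class and the fragment, as on the
# slide; the remaining counts are hypothetical placeholders.
counts = np.array([[4.0, 1.0],    # C=0: fragment absent / present
                   [2.0, 3.0]])   # C=1: fragment absent / present
print(mutual_information(counts))
```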

Training without sequences


Challenges:
- We do not know which fragments are in
correspondence (we cannot use motion
estimation due to strong transformation)
Fragments that are in correspondence will have
detections that are correlated across viewpoints.
The same approach can be used for
arbitrary transformations

Bart & Ullman

Shared features for multi-view object detection

Training does not require having different views of the same object.

View
invariant
features

View
specific
features
Torralba, Murphy, Freeman. PAMI 07

Shared features for multi-view object detection
Sharing is not a tree. Depends also on 3D symmetries.

Torralba, Murphy, Freeman. PAMI 07

Multi-view object detection

Strong learner
H response for
car as function
of assumed
view angle

Torralba, Murphy, Freeman. PAMI 07

Voting schemes
Towards Multi-View Object Class
Detection
Alexander Thomas
Vittorio Ferrari
Bastian Leibe
Tinne Tuytelaars
Bernt Schiele
Luc Van Gool

Viewpoint-Independent Object Class Detection using 3D Feature Maps


Training dataset: synthetic objects

Features

Voting scheme and detection

Each cluster casts votes for the


voting bins of the discrete poses
contained in its internal list.

Liebelt, Schmid, Schertler. CVPR 2008

Lab
SIFT + visual words

http://people.csail.mit.edu/torralba/courses/seminarioUC3M/dia3/
Laboratorio3.m
