Day 3
Object Recognition II
Summer 2010
Categories
Find a bottle:
Can't do, unless you don't care about a few errors
Instances
Can nail it
Building a Panorama
no chance to match!
More motivation
Feature points are used also for:
Image alignment (homography, fundamental matrix)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other
Left eye
view
Good feature
An introductory example:
Harris corner detector
flat region:
no change as shift
window in all
directions
edge:
no change as shift
window along the
edge direction
corner:
significant change as
shift window in all
directions
E(u, v) = Σ_{x,y} w(x, y) [ I(x+u, y+v) − I(x, y) ]²
          (window function) × (shifted intensity − intensity)²
Window function w(x, y): either 1 in window, 0 outside, or a Gaussian.

For small shifts this is approximated by the second moment matrix M:

E(u, v) ≈ [u v] M [u; v],   M = Σ_{x,y} w(x, y) [ Ix²    Ix·Iy ]
                                                 [ Ix·Iy  Iy²   ]
λ1, λ2: eigenvalues of M.
The eigenvectors of M give the direction of the fastest change and the direction of the slowest change; the ellipse E(u, v) = const has axis lengths (λmax)^(−1/2) and (λmin)^(−1/2).

Classification of image points using the eigenvalues of M:
"Flat" region: λ1 and λ2 are small; E is almost constant in all directions.
"Edge": λ1 >> λ2 (or λ2 >> λ1); no change as the window shifts along the edge direction.
"Corner": λ1 and λ2 are both large, λ1 ~ λ2; E increases in all directions.
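The flat/edge/corner classification above can be collapsed into a single response function. The slides do not show it, but the standard Harris measure is R = det(M) − k·trace(M)², which is large and positive only when both eigenvalues are large. A minimal numpy sketch, assuming a uniform window and central-difference gradients (a real implementation would use Gaussian smoothing):

```python
import numpy as np

def harris_response(img, k=0.04, win=2):
    """Harris corner response R = det(M) - k*trace(M)^2 at every pixel.

    img: 2D float array.  win: half-width of the uniform window w(x, y).
    """
    # Image gradients (np.gradient returns d/d_axis0, d/d_axis1)
    Iy, Ix = np.gradient(img)
    # Products of gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Sum products of gradients over the local window
    def box_sum(a):
        out = np.zeros_like(a)
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    Sxx, Syy, Sxy = box_sum(Ixx), box_sum(Iyy), box_sum(Ixy)
    # R = lambda1*lambda2 - k*(lambda1 + lambda2)^2 = det(M) - k*trace(M)^2
    return Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
```

On a synthetic bright square, R is positive at the corners, near zero in flat regions, and negative along the edges, matching the eigenvalue picture above.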
We want to:
detect the same interest points
regardless of image changes
Photometry
Affine intensity change (I → a·I + b)
[Figure: a signature function f(region size) is computed in Image 1 and Image 2 (scale = 1/2); its peaks select corresponding region sizes s1 and s2. Flat or multi-peaked signature functions are bad choices; a function with a single sharp peak is good.]
Scale selection kernels:
Laplacian: L = σ² (Gxx(x, y, σ) + Gyy(x, y, σ))
Difference of Gaussians: DoG = G(x, y, kσ) − G(x, y, σ)
where Gaussian: G(x, y, σ) = (1 / 2πσ²) exp(−(x² + y²) / 2σ²)
Note: both kernels are invariant to scale and rotation.
blob detection; Marr 1982; Voorhees and Poggio 1987; Blostein and Ahuja 1989;
Harris-Laplacian¹: find local maximum of the Harris corner measure in space, and of the Laplacian in scale.
SIFT (Lowe)²: find local maximum of the difference of Gaussians (DoG) in space and scale.
¹ K. Mikolajczyk, C. Schmid (2001). ² D. Lowe (2004).
SIFT Features
Scale invariance
Requires a method to repeatably select points in location
and scale:
The only reasonable scale-space kernel is a Gaussian
(Koenderink, 1984; Lindeberg, 1994)
An efficient choice is to detect peaks in the difference of
Gaussian pyramid (Burt & Adelson, 1983; Crowley &
Parker, 1984 but examining more scales)
Difference-of-Gaussian with constant ratio of scales is a
close approximation to Lindeberg's scale-normalized
Laplacian (can be shown from the heat diffusion
equation)
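The DoG pyramid idea above can be sketched with a numpy-only separable Gaussian blur; a real SIFT implementation would also downsample into octaves and localize peaks jointly over space and scale, which this minimal sketch omits:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur, kernel truncated at 3 sigma, edge-padded."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    pad = np.pad(img, r, mode='edge')
    # Convolve every row, then every column (separability of the Gaussian)
    tmp = np.apply_along_axis(lambda m: np.convolve(m, g, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda m: np.convolve(m, g, mode='valid'), 0, tmp)

def dog_stack(img, sigma0=1.0, k=2 ** 0.5, n=4):
    """Difference-of-Gaussian stack with constant ratio of scales k:
    D_i = G(k^{i+1} sigma0) * img  -  G(k^i sigma0) * img."""
    blurred = [gaussian_blur(img, sigma0 * k ** i) for i in range(n + 1)]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(n)])
```

Applied to an impulse, each DoG level is just the DoG kernel itself, with its extremum at the impulse location, which is what peak detection in the pyramid exploits.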
Distinctiveness of features
Vary size of database of features, with 30 degree affine
change, 2% image noise
Measure % correct for single nearest neighbor match
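The single-nearest-neighbor matching used in this experiment can be sketched as follows. The ratio test against the second nearest neighbor (Lowe's rejection criterion) is included as an assumption beyond what the slide states:

```python
import numpy as np

def nn_match(query, database, ratio=0.8):
    """Match each query descriptor to its single nearest neighbor in the
    database.  `keep` flags matches passing the ratio test: the nearest
    distance must be below `ratio` times the second-nearest distance."""
    # All pairwise Euclidean distances, shape (num_query, num_database)
    d = np.sqrt(((query[:, None, :] - database[None, :, :]) ** 2).sum(axis=2))
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(query))
    keep = d[rows, nearest] < ratio * d[rows, second]
    return nearest, keep
```

An ambiguous query (nearly equidistant from two database descriptors) is exactly the case the ratio test rejects.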
RECOGNITION MODELS
Constellation models
Shape matching
Deformable models
Bag of Words
Independent features
Histogram representation
Compute
descriptor
e.g. SIFT [Lowe99]
Normalize
patch
Detect patches
[Mikolajczyk and Schmid 02]
[Mata, Chum, Urban & Pajdla, 02]
[Sivic & Zisserman, 03]
Codewords
+
+
+
Vector quantization
128-D SIFT space
frequency
Histogram of features
assigned to each cluster
codewords
Sivic et al. 2005
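The vector quantization step above can be sketched as follows; `bow_histogram` is a hypothetical helper name, and the codewords are assumed to come from a separate clustering step (e.g. k-means on training descriptors):

```python
import numpy as np

def bow_histogram(descriptors, codewords):
    """Assign each descriptor to its nearest codeword (vector quantization)
    and return the normalized frequency histogram over codewords.

    descriptors: (N, D) array, e.g. N SIFT descriptors with D = 128.
    codewords:   (K, D) array of cluster centers.
    """
    # Squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)  # index of nearest codeword
    hist = np.bincount(assignments, minlength=len(codewords)).astype(float)
    return hist / hist.sum()         # frequency of each codeword
```

The resulting histogram is the bag-of-words representation of the image: all spatial layout is discarded, only codeword frequencies remain.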
Hierarchical models
Decompose scene/object
Naïve Bayes
See 2007 edition of this course
Object categorization
Csurka, Bray, Dance & Fan, 2004; Sivic, Russell, Efros,
Freeman & Zisserman, 2005; Sudderth, Torralba, Freeman &
Willsky, 2005;
Feature level
Spatial influence through correlogram features:
Savarese, Winn and Criminisi, CVPR 2006
Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Hierarchical model of scene/objects/parts
Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Niebles & Fei-Fei, CVPR 2007
P1
P2
P3
P4
w
Image
Bg
Assumes a fixed
number of object
instances
Pr(part | object)
o
Reference positions
(ONE PER OBJECT)
covariance
context
K
J
Slide credit: Erik Sudderth
G0
Transformed Densities
Object category
Part size & shape
Transformed locations
Gj
3D Scene Features
Object category
3D Location
2D Image Features
Appearance Descriptors
2D Pixel Coordinates
N J
Background
Bookshelves
Computer Screen
Desk
Feature level
Generative models
Discriminative methods
Lazebnik, Schmid & Ponce, 2006
Problem with bag of words
All have equal probability for bag-of-words methods
Location information is important
BoW + location still doesn't give correspondence
Representation
Object as set of parts
Generative representation
Model:
Relative locations between parts
Appearance of part
Issues:
How to model location
How to represent appearance
How to handle occlusion/clutter
Figure from [Fischler & Elschlager 73]
Yuille 91
Brunelli & Poggio 93
Lades, v.d. Malsburg et al. 93
Cootes, Lanitis, Taylor et al. 95
Amit & Geman 95, 99
Perona et al. 95, 96, 98, 00, 03, 04, 05
Felzenszwalb & Huttenlocher 00, 04
Crandall & Huttenlocher 05, 06
Leibe & Schiele 03, 04
Sparse representation
+ Computationally tractable (10⁵ pixels → 10¹ -- 10² parts)
+ Generative representation of class
+ Avoid modeling global variability
+ Success in specific object recognition
The correspondence problem
Model with P parts
Image with N possible assignments for each part
Consider the mapping to be 1-to-1
N^P combinations!!!
O(N⁶)
Csurka 04
Vasconcelos 00
Crandall et al. 05
Fergus et al. 05
O(N²)
Crandall et al. 05
Felzenszwalb &
Huttenlocher 00
O(N²)
O(N³)
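The gap between the N^P brute force and the quadratic methods above can be illustrated with a star-shaped model: once the root location is fixed, every other part chooses its own location independently. A sketch, assuming arbitrary appearance and pairwise-geometry cost functions (hypothetical names, not from any specific paper's code):

```python
import numpy as np

def match_star_model(app_cost, pair_cost):
    """Best total cost for a star-shaped parts model.

    app_cost:  (P, N) appearance cost of placing part p at location n;
               part 0 is the root.
    pair_cost: function(root_loc, part_loc) -> geometric cost.

    Brute force over joint assignments is O(N^P); fixing the root and
    letting each leaf choose independently is O(P * N^2).
    """
    P, N = app_cost.shape
    best = np.inf
    for r in range(N):                    # root location
        total = app_cost[0, r]
        for p in range(1, P):             # each leaf chooses independently
            total += min(app_cost[p, n] + pair_cost(r, n) for n in range(N))
        best = min(best, total)
    return best
```

On small instances this agrees exactly with exhaustive enumeration, while scaling polynomially instead of exponentially in the number of parts.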
from "Sparse Flexible Models of Local Features",
Gustavo Carneiro and David Lowe, ECCV 2006
How much does shape help?
Crandall, Felzenszwalb, Huttenlocher CVPR 05
Shape variance increases with increasing model complexity
Do get some benefit from shape
Appearance representation
SIFT
Decisiontrees
[Lepetit and Fua CVPR 2005]
PCA
Learn Appearance
Generative models of appearance
Can learn with little supervision
E.g. Fergus et al. 03
Discriminative training of part appearance model
SVM part detectors
Felzenszwalb, McAllester, Ramanan, CVPR 2008
Much better performance
Hierarchical Representations
Pixels → Pixel groupings → Parts → Object
Multi-scale approach
increases number of
low-level features
Stochastic Grammar of Images
S.C. Zhu and D. Mumford
e.g. contours,
intermediate objects
e.g. linelets, curvelets, T-junctions
e.g. discontinuities,
gradient
Parts model
The architecture
Learned parts
Parts and Structure models
Summary
Explicit notion of correspondence between image and model
Efficient methods for large numbers of parts and positions in the image
With powerful part detectors, can get state-of-the-art performance
Hierarchical models allow for more parts
10⁶ examples
Classifier: Boosting
Viola & Jones 2001
Haar features via Integral Image
Cascade
Real-time performance
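The "Haar features via Integral Image" step above relies on the summed-area table, which turns the sum over any rectangle into an O(1) operation regardless of its size; this is what makes evaluating thousands of Haar features per window cheap enough for real time. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[0:y+1, 0:x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from the integral image using only
    four array references (the corners of the rectangle)."""
    s = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

A two-rectangle Haar feature is then just the difference of two `box_sum` calls, eight array references in total.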
Torralba et al., 2004
Part-based Boosting
Each weak classifier is a part
Part location modeled by
offset mask
Learn weighting of
descriptor with linear
SVM
Image
HOG
descriptor
Not a person
person
Adding parts
Felzenszwalb, McAllester, Ramanan. 2008.
Adding parts
~100%
detection rate
with 0 false alarms
Some of these parts cannot be used for anything other than this object.
These studies try to recover parts that are meaningful. But is this the
right thing to do? The derived parts may be too specific, and they are
not likely to be useful in a general system.
Shared features
Is learning the 1000th object class easier
than learning the first?
Multitask learning
R. Caruana. Multitask Learning. ML 1997
MTL improves generalization by leveraging the domain-specific information
contained in the training signals of related tasks. It does this by training tasks in
parallel while using a shared representation.
vs.
Sejnowski & Rosenberg 1986; Hinton 1986; Le Cun et al. 1989; Suddarth &
Kergosien 1990; Pratt et al. 1991; Sharkey & Sharkey 1992;
Multitask learning
R. Caruana. Multitask Learning. ML 1997
Primary task: detect door knobs
Tasks used:
horizontal location of doorknob
single or double door
horizontal location of doorway center
width of doorway
horizontal location of left door jamb
Sharing invariances
S. Thrun. Is Learning the n-th Thing Any Easier Than Learning The First?
NIPS 1996
With sharing
Without sharing
Le Cun et al, 98
Translation invariance is already built into the network
The output neurons share all the intermediate levels
Sharing transformations
Miller, E., Matsakis, N., and Viola, P. (2000). Learning from one example
through shared densities on transforms. In IEEE Computer Vision and
Pattern Recognition.
Pictorial Structures
Fischler & Elschlager, IEEE Trans. Comp. 1973
SVM Detectors
Heisele, Poggio, et. al., NIPS 2001
Constellation Model
Model-Guided Segmentation
Reusable Parts
Krempp, Geman, & Amit Sequential Learning of Reusable Parts for Object
Detection. TR 2002
Number of features
Number of classes
Specific feature
pedestrian
chair
Traffic light
sign
face
Background class
Shared feature
shared feature
50 training samples/class
29 object classes
2000 entries in the dictionary
Class-specific features
Shared features
Generalization as a function of
object similarities
12 viewpoints
K = 2.1
K = 4.8
Sharing patches
Bart and Ullman, 2004
For a new class, use only features similar to features that were good for other
classes:
Proposed Dog
features
Coefficients for feature 2
Coefficients for
classifier 2
Pr(topic | doc)
z
K
x
N
Pr(word | topic)
Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003
We learn the
number of parts.
Each object
uses a different
number of parts.
The model
assumes a
known number
of object
categories.
Pr(position | part)
Pr(appearance | part)
Detection Task
Detection Results
Recognition Task
VS.
Recognition Results
Recognition performance decreases. By sharing features, the classes look more similar.
Baxter 1996
Caruana 1997
Schapire, Singer, 2000
Thrun, Pratt 1997
Krempp, Geman, Amit, 2002
E.L.Miller, Matsakis, Viola, 2000
Mahamud, Hebert, Lafferty, 2001
Fink et al. 2003, 2004
LeCun, Huang, Bottou, 2004
Holub, Welling, Perona, 2005
3.6 3D Models
viewpoints
Julesz, 1971
The 3D percept is driven by the scene, which imposes its rules on the objects
Class experiment
Class experiment
Experiment 1: draw a horse (the entire
body, not just the head) on a white piece of
paper.
Do not look at your neighbor! You already
know what a horse looks like, no need to
cheat.
Class experiment
Experiment 2: draw a horse (the entire
body, not just the head), but this time
choose a viewpoint as weird as possible.
Anonymous participant
3D object categorization
Wait: object categorization in humans is not
invariant to 3D pose
3D object categorization
Although we can categorize all three
pictures as views of a horse,
the three pictures do not look like
equally typical views of horses.
And they do not seem to be
recognizable with the same ease.
by Greg Robbins
Canonical Perspective
Experiment (Palmer, Rosch & Chase 81):
participants are shown views of an object
and are asked to rate how much each one
looks like the object it depicts
(scale: 1 = very much like, 7 = very unlike)
From Vision Science, Palmer
Canonical Perspective
Examples of canonical perspective:
In a recognition task, reaction time
correlated with the ratings.
Canonical views are recognized faster
at the entry level.
Why?
Canonical Viewpoint
Frequency hypothesis
Canonical Viewpoint
Frequency hypothesis: ease of recognition is
related to the number of times we have seen the
object from each viewpoint.
For a computer, using its Google memory, a horse
looks like:
Canonical Viewpoint
Maximal information hypothesis: Some views
provide more information than others about the
objects.
Best views tend to show
multiple sides of the
object.
Canonical Viewpoint
Maximal information hypothesis:
Clocks are preferred as purely frontal
Canonical Viewpoint
Frequency hypothesis
Maximal information hypothesis
Probably both are correct.
Edelman & Bülthoff 92: created new objects to control familiarity.
1. When all viewpoints were presented with the same frequency, observers still
had preferences for specific viewpoints.
2. When only a few viewpoints were presented, recognition was better for previously
seen viewpoints.
Object representations
Explicit 3D models: use volumetric
representation. Have an explicit model of
the 3D geometry of the object.
Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.
Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.
The object is represented as a collection of 2D
views (maybe the most frequent views seen in the
past).
Tarr & Pinker (89) show people are faster at
recognizing previously seen views, as if they were
storing them. People were also able to recognize
unseen views, so they also generalize to new
views. It is not just template matching.
Explicit 3D model
Explicit 3D model
Not all explicit 3D models were disappointing.
For some object classes, with accurate
geometric and appearance models, it is
possible to get remarkable results.
Implicit 3D models
Aspect Graphs
The nodes of the graph represent object views that are adjacent to each other
on the unit sphere of viewing directions but differ in some significant way.
The most common view relationship in aspect graphs is based on the
topological structure of the view, i.e., edges in the aspect graph arise from
transitions in the graph structure relating vertices, edges and faces of the
projected object. Joseph L. Mundy
Aspect Graphs
Car model
Screen model
Extended fragments
Extended fragments
Extended fragments
Extended fragments
Extended patches are extracted using short sequences.
Use Lucas-Kanade motion estimation to track patches across the sequence.
Learning
Once a large pool of extended fragments is created, there is a training stage
to select the most informative fragments.
For each fragment evaluate the mutual information I(C, F) between the class
label C and whether the fragment is present or absent. If C and F are
independent, then I(C, F) = 0.
P(C=1, F=1) = 3 / 10
P(C=1, F=0) =
P(C=0, F=1) =
P(C=0, F=0) =
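The selection criterion above can be computed directly from the joint probabilities in such a table; `fragment_mutual_info` is a hypothetical helper name for this sketch:

```python
import math

def fragment_mutual_info(p):
    """Mutual information I(C, F) in bits between class label C and
    fragment presence F, given the joint distribution p[(c, f)].

    I(C, F) = sum over (c, f) of p(c, f) * log2( p(c, f) / (p(c) p(f)) ).
    Equals 0 when C and F are independent.
    """
    # Marginals p(c) and p(f) from the joint table
    pc = {c: p.get((c, 0), 0.0) + p.get((c, 1), 0.0) for c in (0, 1)}
    pf = {f: p.get((0, f), 0.0) + p.get((1, f), 0.0) for f in (0, 1)}
    mi = 0.0
    for (c, f), pcf in p.items():
        if pcf > 0:
            mi += pcf * math.log2(pcf / (pc[c] * pf[f]))
    return mi
```

A fragment whose presence is independent of the class scores 0 and is discarded; a fragment that perfectly predicts the class scores the full entropy of the label (1 bit for a balanced two-class problem).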
Training does not require having different views of the same object.
View
invariant
features
View
specific
features
Torralba, Murphy, Freeman. PAMI 07
Strong learner
H response for
car as function
of assumed
view angle
Voting schemes
Towards Multi-View Object Class
Detection
Alexander Thomas
Vittorio Ferrari
Bastian Leibe
Tinne Tuytelaars
Bernt Schiele
Luc Van Gool
Features
Lab
SIFT + visual words
http://people.csail.mit.edu/torralba/courses/seminarioUC3M/dia3/
Laboratorio3.m