
Antonio Torralba

Day 3
Object Recognition II

Summer 2010

3.1 Local descriptors

Categories

Find a bottle:

Can't do,
unless you don't
care about a few errors

Instances

Find these two objects

Can nail it

Building a Panorama

M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003

How do we build a panorama?


We need to match (align) images
Global methods sensitive to occlusion, lighting, parallax
effects. So look for local features that match well.
How would you do it by eye?

Matching with Features


Detect feature points in both images

Matching with Features


Detect feature points in both images
Find corresponding pairs

Matching with Features


Detect feature points in both images
Find corresponding pairs
Use these pairs to align images
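
The three steps above can be sketched in Python with OpenCV; this is an illustration rather than the exact recipe from the slides (it assumes OpenCV >= 4.4, where SIFT ships in the main module, and the image filenames are placeholders):

```python
# Sketch: detect features, match them, and estimate an alignment (homography).
import cv2
import numpy as np

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect feature points (and descriptors) in both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Find corresponding pairs (nearest neighbours + Lowe's ratio test)
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Use these pairs to align the images (homography fit with RANSAC)
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```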

Matching with Features


Problem 1:
Detect the same point independently in both
images

no chance to match!

We need a repeatable detector

Matching with Features


Problem 2:
For each point correctly recognize the
corresponding one

We need a reliable and distinctive descriptor

More motivation
Feature points are used also for:
Image alignment (homography, fundamental matrix)
3D reconstruction
Motion tracking
Object recognition
Indexing and database retrieval
Robot navigation
other

Selecting Good Features


What's a good feature?
Satisfies brightness constancy: looks the same in both
images
Has sufficient texture variation
Does not have too much texture variation
Corresponds to a real surface patch (see below):

[Figure: left-eye and right-eye views illustrating a good feature (a real surface patch) and a bad feature (a point that does not correspond to a single surface patch)]

Does not deform too much over time

An introductory example:
Harris corner detector

C. Harris and M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988

The Basic Idea


We should easily localize the point by looking
through a small window
Shifting a window in any direction should give
a large change in intensity

Harris Detector: Basic Idea

flat region:
no change as shift
window in all
directions

edge:
no change as shift
window along the
edge direction

corner:
significant change as
shift window in all
directions

Harris Detector: Mathematics


Window-averaged change of intensity induced
by shifting the image data by [u,v]:

E(u,v) = \sum_{x,y} w(x,y)\,[\,I(x+u,\,y+v) - I(x,y)\,]^2

where w(x,y) is the window function (either 1 inside the window and 0 outside, or a Gaussian), I(x+u, y+v) is the shifted intensity, and I(x,y) is the intensity.

Taylor series approximation to the shifted image:

E(u,v) \approx \sum_{x,y} w(x,y)\,[\,I(x,y) + u I_x + v I_y - I(x,y)\,]^2

       = \sum_{x,y} w(x,y)\,[\,u I_x + v I_y\,]^2

       = \begin{pmatrix} u & v \end{pmatrix}
         \left( \sum_{x,y} w(x,y)
         \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right)
         \begin{pmatrix} u \\ v \end{pmatrix}

Harris Detector: Mathematics


Expanding I(x,y) in a Taylor series, we have, for small shifts [u,v], a
bilinear approximation:

E(u,v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}

where M is a 2x2 matrix computed from image derivatives:

M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

M is also called the structure tensor (or second moment matrix).

Harris Detector: Mathematics


Intensity change in shifting window: eigenvalue analysis

λ1, λ2 – eigenvalues of M

Iso-intensity contour of E(u,v): the ellipse E(u,v) = const has its axes along
the directions of fastest and slowest change, with half-lengths (λ_max)^{-1/2}
and (λ_min)^{-1/2}.
Selecting Good Features

λ1 and λ2 are large

Selecting Good Features

large λ1, small λ2

Selecting Good Features

small λ1, small λ2

Harris Detector: Mathematics


Classification of image
points using
eigenvalues of M:

Corner: λ1 and λ2 are large, λ1 ~ λ2;
E increases in all directions

Edge: λ1 >> λ2 (or λ2 >> λ1)

Flat region: λ1 and λ2 are small;
E is almost constant in all directions

Harris Detector: Workflow

Harris Detector: Workflow
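
A minimal sketch of this workflow, assuming NumPy/SciPy; the Gaussian window width, the constant k = 0.04, and the threshold/non-maximum-suppression details are conventional choices, not values given on the slides:

```python
import numpy as np
from scipy import ndimage

def harris_response(image, sigma=1.0, k=0.04):
    # Image derivatives Ix, Iy
    Ix = ndimage.sobel(image.astype(float), axis=1)
    Iy = ndimage.sobel(image.astype(float), axis=0)
    # Entries of the structure tensor M, averaged with a Gaussian window w(x, y)
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    # Corner response R = det(M) - k * trace(M)^2 (large at corners)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2

def harris_corners(image, threshold=0.01):
    # Keep points where R is above a threshold and is a local maximum
    R = harris_response(image)
    local_max = (R == ndimage.maximum_filter(R, size=5))
    return np.argwhere(local_max & (R > threshold * R.max()))
```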

Ideal feature detector


Would always find the same point on an
object, regardless of changes to the image.
I.e., insensitive to changes in:
Scale
Lighting
Perspective imaging
Partial occlusion

Harris Detector: Some Properties


Rotation invariance?

Harris Detector: Some Properties


Rotation invariance

The ellipse rotates but its shape (i.e. the eigenvalues) remains the same.
The corner response R is invariant to image rotation.

Harris Detector: Some Properties


Invariant to image scale?

Harris Detector: Some Properties


Not invariant to image scale!

At the fine scale, all points along the curve are classified as edges;
only at the coarser scale is the structure detected as a corner!

Harris Detector: Some Properties


Quality of Harris detector for different scale
changes
Repeatability rate:
(# correspondences) / (# possible correspondences)

C. Schmid et al. Evaluation of Interest Point Detectors. IJCV 2000

Evaluation plots are from this paper

We want to:
detect the same interest points
regardless of image changes

Models of Image Change


Geometry
Rotation
Similarity (rotation + uniform scale)
Affine (scale dependent on direction)
valid for: orthographic camera, locally planar
object

Photometry
Affine intensity change (I → a·I + b)

Scale Invariant Detection


Consider regions (e.g. circles) of different sizes
around a point
Regions of corresponding sizes will look the same
in both images

Scale Invariant Detection


The problem: how do we choose corresponding
circles independently in each image?

Scale Invariant Detection


Solution:
Design a function on the region (circle), which is scale
invariant (the same for corresponding regions, even if
they are at different scales)
Example: average intensity. For corresponding
regions (even of different sizes) it will be the same.
For a point in one image, we can consider it as a function of
region size (circle radius)

[Figure: the function plotted against region size for Image 1 and Image 2 (scale = 1/2)]

Scale Invariant Detection


Common approach:
Take a local maximum of this function

Observation: the region size for which the maximum is achieved should be
invariant to image scale.

Important: this scale-invariant region size is found in each image
independently!

[Figure: f vs. region size for Image 1 and Image 2 (scale = 1/2); the maxima occur at region sizes s1 and s2]

Scale Invariant Detection


A good function for scale detection:
has one stable, sharp peak

[Figure: f as a function of region size; functions that are flat or have many peaks are bad, a single sharp peak is good]

For typical images, a good function is one that responds to contrast (a sharp
local intensity change)

Scale Invariant Detection


Functions for determining scale
Kernels:

L = \sigma^2\,(G_{xx}(x,y,\sigma) + G_{yy}(x,y,\sigma))    (Laplacian)

DoG = G(x,y,k\sigma) - G(x,y,\sigma)    (Difference of Gaussians)

where the Gaussian is G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\,e^{-(x^2+y^2)/(2\sigma^2)}

Note: both kernels are invariant to scale and rotation.

[Plots from Lindeberg 1998: scale-space signatures of the trace and the determinant of the Hessian as a function of scale]

Blob detection: Marr 1982; Voorhees and Poggio 1987; Blostein and Ahuja 1989

Scale Invariant Detectors

Harris-Laplacian¹: find local maximum of:
Harris corner detector in space (image coordinates)
Laplacian in scale

SIFT (Lowe)²: find local maximum of:
Difference of Gaussians in space and scale

¹ K. Mikolajczyk, C. Schmid. Indexing Based on Scale Invariant Interest Points. ICCV 2001
² D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004
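
A rough sketch of the "difference of Gaussians in space and scale" idea (not Lowe's exact implementation; the sigma values and the threshold are illustrative assumptions):

```python
import numpy as np
from scipy import ndimage

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.56, 4.1, 6.55), thresh=0.02):
    image = image.astype(float) / 255.0          # assumes an 8-bit grayscale input
    # Gaussian scale space, then differences of adjacent scales
    gaussians = [ndimage.gaussian_filter(image, s) for s in sigmas]
    dogs = np.stack([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
    keypoints = []
    # A point is kept if it is an extremum among its 26 neighbours
    # in space (3x3) and scale (the two adjacent DoG levels)
    for s in range(1, dogs.shape[0] - 1):
        cube = dogs[s - 1:s + 2]
        max_f = ndimage.maximum_filter(cube, size=3)[1]
        min_f = ndimage.minimum_filter(cube, size=3)[1]
        d = dogs[s]
        mask = ((d == max_f) | (d == min_f)) & (np.abs(d) > thresh)
        for y, x in np.argwhere(mask):
            keypoints.append((x, y, sigmas[s]))
    return keypoints
```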

CVPR 2003 Tutorial


Recognition and Matching
Based on Local Invariant
Features
David Lowe
Computer Science Department
University of British Columbia

Invariant Local Features


Image content is transformed into local feature
coordinates that are invariant to translation, rotation,
scale, and other imaging parameters

SIFT Features

Advantages of invariant local features


Locality: features are local, so robust to
occlusion and clutter (no prior segmentation)
Distinctiveness: individual features can be
matched to a large database of objects
Quantity: many features can be generated for
even small objects
Efficiency: close to real-time performance
Extensibility: can easily be extended to wide
range of differing feature types, with each
adding robustness

Scale invariance
Requires a method to repeatably select points in location
and scale:
The only reasonable scale-space kernel is a Gaussian
(Koenderink, 1984; Lindeberg, 1994)
An efficient choice is to detect peaks in the difference-of-
Gaussian pyramid (Burt & Adelson, 1983; Crowley &
Parker, 1984), but examining more scales
Difference-of-Gaussian with a constant ratio of scales is a
close approximation to Lindeberg's scale-normalized
Laplacian (can be shown from the heat diffusion
equation)

Scale space processed one octave at a time

Key point localization


Detect maxima and minima of
difference-of-Gaussian in scale
space
Fit a quadratic to surrounding
values for sub-pixel and sub-scale
interpolation (Brown & Lowe,
2002)
Taylor expansion around point:

Offset of extremum (use finite differences for derivatives):
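
The Taylor expansion and extremum offset referenced above do not survive on the slide; in Lowe's 2004 formulation (following Brown & Lowe 2002) they are:

```latex
D(\mathbf{x}) \approx D
  + \frac{\partial D}{\partial \mathbf{x}}^{\top}\mathbf{x}
  + \frac{1}{2}\,\mathbf{x}^{\top}\,\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}
  \frac{\partial D}{\partial \mathbf{x}}
```

where x = (x, y, σ)ᵀ is the offset from the sample point and the derivatives are estimated with finite differences.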

Select canonical orientation


Create histogram of local
gradient directions computed
at selected scale
Assign canonical orientation
at peak of smoothed
histogram
Each key specifies stable 2D
coordinates (x, y, scale,
orientation)

Example of keypoint detection


Threshold on value at DoG peak and on ratio of principal
curvatures (Harris approach)
(a) 233x189 image
(b) 832 DoG extrema
(c) 729 left after peak value threshold
(d) 536 left after testing ratio of principal curvatures

SIFT vector formation


Thresholded image gradients are sampled over 16x16
array of locations in scale space
Create array of orientation histograms
8 orientations x 4x4 histogram array = 128 dimensions

Sensitivity to number of histogram orientations

Feature stability to noise


Match features after random change in image scale &
orientation, with differing levels of image noise
Find nearest neighbor in database of 30,000 features

Feature stability to affine change


Match features after random change in image scale &
orientation, with 2% image noise, and affine distortion
Find nearest neighbor in database of 30,000 features

Distinctiveness of features
Vary size of database of features, with 30 degree affine
change, 2% image noise
Measure % correct for single nearest neighbor match

Ratio of distances reliable for matching

RECOGNITION MODELS

Families of recognition algorithms


Voting models
Viola and Jones, ICCV 2001
Heisele, Poggio, et al., NIPS 01
Schneiderman, Kanade 2004
Vidal-Naquet, Ullman 2003

Bag of words models
Csurka, Dance, Fan, Willamowski, and Bray 2004
Sivic, Russell, Freeman, Zisserman, ICCV 2005

Constellation models
Fischler and Elschlager, 1973
Burl, Leung, and Perona, 1995
Weber, Welling, and Perona, 2000
Fergus, Perona, & Zisserman, CVPR 2003

Shape matching
Berg, Berg, Malik, 2005

Deformable models
Cootes, Edwards, Taylor, 2001

Rigid template models
Sirovich and Kirby 1987
Turk, Pentland, 1991
Dalal & Triggs, 2006

3.2 Bag of words models

Bag of Words
Independent features
Histogram representation

Detect patches
[Mikolajczyk and Schmid 02]
[Matas, Chum, Urban & Pajdla, 02]
[Sivic & Zisserman, 03]
Local interest operator, or regular grid

Normalize patch

Compute descriptor, e.g. SIFT [Lowe 99]

Sivic et al. 2005

Sivic et al. 2005

[128-D SIFT space — Sivic et al. 2005]

Codewords: vector quantization of the 128-D SIFT space (Sivic et al. 2005)

Histogram of features assigned to each cluster
(codewords on the x-axis, frequency on the y-axis)
Sivic et al. 2005
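
A minimal sketch of this quantization step, assuming precomputed SIFT descriptors and using scikit-learn's KMeans (the vocabulary size and other parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=1000):
    # all_descriptors: (N, 128) array of SIFT descriptors pooled over the training set
    kmeans = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans

def bow_histogram(descriptors, kmeans):
    # Assign each descriptor of one image to its nearest codeword,
    # then count codeword frequencies and normalize
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting histograms are the bag-of-words vectors used as input to the classifiers discussed next (e.g. an SVM, as in Csurka et al. 2004).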

Uses of BoW representation


Treat as feature vector for standard classifier
e.g SVM

Cluster BoW vectors over image collection


Discover visual themes

Hierarchical models
Decompose scene/object

BoW as input to classifier


SVM for object classification
Csurka, Bray, Dance & Fan, 2004

Naïve Bayes
See 2007 edition of this course

Early bag of words models: mostly texture recognition
Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik,
2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik,
Schmid & Ponce, 2003

Hierarchical Bayesian models for documents


(pLSA, LDA, etc.)
Hofmann 1999; Blei, Ng & Jordan, 2003; Teh, Jordan, Beal &
Blei, 2004

Object categorization
Csurka, Bray, Dance & Fan, 2004; Sivic, Russell, Efros,
Freeman & Zisserman, 2005; Sudderth, Torralba, Freeman &
Willsky, 2005;

Natural scene categorization


Vogel & Schiele, 2004; Fei-Fei & Perona, 2005; Bosch,
Zisserman & Munoz, 2006

Feature level
Spatial influence through correlogram features:
Savarese, Winn and Criminisi, CVPR 2006

Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Hierarchical model of scene/objects/parts

Feature level
Generative models
Sudderth, Torralba, Freeman & Willsky, 2005, 2006
Niebles & Fei-Fei, CVPR 2007

Scenes of Fixed Sets of Objects


Pr(object | scene)
Pr(part | object)

Assumes a fixed number of object instances, with one reference position per
object (plus covariance and context terms).

[Graphical model; slide credit: Erik Sudderth]

Street Scene Segmentations

1-2 minutes Gibbs sampling per image

Slide credit: Erik Sudderth

TDP for 3D Scenes


Global Density
Object category
Part size & shape
Transformation prior

G0

Transformed Densities
Object category
Part size & shape
Transformed locations

Gj

3D Scene Features
Object category
3D Location

2D Image Features

Appearance Descriptors
2D Pixel Coordinates

N J

Single-Part Office Scene Model

[Learned parts: background, bookshelves, computer screen, desk. Slide credit: Erik Sudderth]

Feature level
Generative models
Discriminative methods
Lazebnik, Schmid & Ponce, 2006

3.3 Parts and structure models

Problem with bag of words

All have equal probability for bag-of-words methods
Location information is important
BoW + locations still doesn't give correspondence

Model: Parts and Structure

Representation
Object as set of parts
Generative representation
Model:
Relative locations between parts
Appearance of part
Issues:
How to model location
How to represent appearance
How to handle occlusion/clutter
Figure from [Fischler & Elschlager 73]

History of Parts and Structure approaches

Fischler & Elschlager 1973

Yuille 91
Brunelli & Poggio 93
Lades, v.d. Malsburg et al. 93
Cootes, Lanitis, Taylor et al. 95
Amit & Geman 95, 99
Perona et al. 95, 96, 98, 00, 03, 04, 05
Felzenszwalb & Huttenlocher 00, 04
Crandall & Huttenlocher 05, 06
Leibe & Schiele 03, 04

Many papers since 2000

Sparse representation
+ Computationally tractable (10^5 pixels → 10^1 – 10^2 parts)
+ Generative representation of class
+ Avoid modeling global variability
+ Success in specific object recognition

- Throw away most image information


- Parts need to be distinctive to separate from other classes

The correspondence problem

Model with P parts
Image with N possible assignments for each part
Consider the mapping to be 1-1

N^P combinations! (a worked example follows below)
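
To make the combinatorics concrete, a worked instance with illustrative numbers (not from the slides):

```latex
% P = 6 parts, N = 100 candidate locations per part (illustrative values)
N^{P} = 100^{6} = 10^{12}\ \text{possible assignments}
```

The connectivity structures on the next slide (star, tree, k-fan) cut this down to low-order polynomial cost in N.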

Different connectivity structures


Fergus et al. 03
Fei-Fei et al. 03

O(N^6)

Csurka 04
Vasconcelos 00

Crandall et al. 05
Fergus et al. 05

O(N^2)

Crandall et al. 05

Felzenszwalb &
Huttenlocher 00

O(N^2)
O(N^3)

Bouchard & Triggs 05

Carneiro & Lowe 06

From "Sparse Flexible Models of Local Features",
Gustavo Carneiro and David Lowe, ECCV 2006

How much does shape help?
Crandall, Felzenszwalb, Huttenlocher, CVPR 05
Shape variance increases with increasing model complexity
Do get some benefit from shape

Appearance representation

SIFT

Decision trees
[Lepetit and Fua, CVPR 2005]

PCA

Figure from Winn & Shotton, CVPR 06

Learn Appearance

Generative models of appearance
Can learn with little supervision
E.g. Fergus et al. 03

Discriminative training of part appearance model
SVM part detectors
Felzenszwalb, McAllester, Ramanan, CVPR 2008
Much better performance

Hierarchical Representations

Pixels → Pixel groupings → Parts → Object

Multi-scale approach increases the number of low-level features

Amit and Geman 98
Ullman et al.
Bouchard & Triggs 05
Zhu and Mumford
Jin & Geman 06
Zhu & Yuille 07
Fidler & Leonardis 07

Images from [Amit 98]

Stochastic Grammar of Images

S. C. Zhu et al. and D. Mumford

Context and Hierarchy in a Probabilistic Image Model


Jin & Geman (2006)

e.g. animals, trees,


rocks

e.g. contours,
intermediate objects
e.g. linelets,
curvelets, T-junctions
e.g. discontinuities,
gradient

animal head instantiated


by tiger head

animal head instantiated


by bear head

A Hierarchical Compositional System for Rapid Object Detection
Long Zhu, Alan L. Yuille, 2007.

Able to learn #parts at each level

Learning a Compositional Hierarchy of Object Structure

Parts model

The architecture
Learned parts

Parts and Structure models: Summary

Explicit notion of correspondence between image and model
Efficient methods for large numbers of parts and positions in the image
With powerful part detectors, can get state-of-the-art performance
Hierarchical models allow for more parts

3.4 Discriminative methods

Classifier: Nearest Neighbor


Shakhnarovich, Viola, Darrell, 2003

10^6 examples

Berg, Berg and Malik, 2005

Classifier: Neural Networks


Fukushima's Neocognitron, 1980
Rowley, Baluja, Kanade 1998
LeCun, Bottou, Bengio, Haffner 1998
Serre et al. 2005
Riesenhuber, M. and Poggio, T. 1999

LeNet convolutional architecture (LeCun 1998)

Classifier: Boosting
Viola & Jones 2001
Haar features via Integral Image
Cascade
Real-time performance

Torralba et al., 2004
Part-based Boosting
Each weak classifier is a part
Part location modeled by
offset mask

Classifier: Support Vector Machine


Guyon, Vapnik
Heisele, Serre, Poggio, 2001
..
Dalal & Triggs, CVPR 2005
HOG: Histogram of Oriented Gradients

Learn weighting of
descriptor with linear
SVM
Image

HOG
descriptor

HOG descriptor weighted by
positive and negative SVM weights
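
A minimal sketch of the Dalal & Triggs pipeline, assuming scikit-image and scikit-learn; the window size, cell size, and training arrays are placeholders, not values from the slides:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def window_descriptor(window):
    # window: 128x64 grayscale detection window (person-sized, as in the paper)
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def train_detector(X, y):
    # X: HOG descriptors of positive (person) and negative (background) windows
    # y: labels (1 = person, 0 = not a person)
    svm = LinearSVC(C=0.01)
    svm.fit(X, y)
    return svm  # svm.coef_ is the learned weighting of the HOG descriptor
```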

Histograms of oriented gradients


Dalal & Triggs, 2006

Not a person

person

Adding parts
Felzenszwalb, McAllester, Ramanan. 2008.

Adding parts

Felzenszwalb, McAllester, Ramanan. 2008.

Felzenszwalb, McAllester, Ramanan. 2008.

3.5 Multiclass methods

Multiclass object detection


the not so early days

Multiclass object detection


the not so early days
Using a set of independent binary classifiers was a common strategy:
Viola-Jones extension for dealing with rotations
- two cascades for each view

Schneiderman-Kanade multiclass object detection

(a) One detector for each class

There is nothing wrong with this approach if you have access to
lots of training data and you do not care about efficiency.

Some symptoms of one-vs-all multiclass approaches
What is the best representation to detect a traffic sign?

Very regular object: template matching will do the job

Parts derived from training a binary classifier.

~100%
detection rate
with 0 false alarms

Some of these parts cannot be used for anything else than this object.

Some symptoms of one-vs-all multiclass approaches
Part-based object representation (looking for meaningful parts):
A. Agarwal and D. Roth

M. Weber, M. Welling and P. Perona

These studies try to recover parts that are meaningful. But is this the
right thing to do? The derived parts may be too specific, and they are
not likely to be useful in a general system.

Some symptoms of one-vs-all multiclass approaches
Computational cost grows linearly with Nclasses * Nviews * Nstyles
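
As a rough illustration with made-up but plausible numbers:

```latex
N_{\text{classes}} \times N_{\text{views}} \times N_{\text{styles}}
  = 1000 \times 12 \times 10 = 120{,}000\ \text{independent detectors}
```

each of which must be evaluated over all positions and scales of every image.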

Shared features
Is learning the 1000th object class easier
than learning the first?

Can we transfer knowledge from one


object to another?
Are the shared properties interesting by
themselves?

Multitask learning
R. Caruana. Multitask Learning. ML 1997
MTL improves generalization by leveraging the domain-specific information
contained in the training signals of related tasks. It does this by training tasks in
parallel while using a shared representation.

vs.

Sejnowski & Rosenberg 1986; Hinton 1986; Le Cun et al. 1989; Suddarth &
Kergosien 1990; Pratt et al. 1991; Sharkey & Sharkey 1992;

Multitask learning
R. Caruana. Multitask Learning. ML 1997
Primary task: detect door knobs

Tasks used:
horizontal location of doorknob
single or double door
horizontal location of doorway center
width of doorway
horizontal location of left door jamb

horizontal location of right door jamb


width of left door jamb
width of right door jamb
horizontal location of left edge of door
horizontal location of right edge of door

Sharing invariances
S. Thrun. Is Learning the n-th Thing Any Easier Than Learning The First?
NIPS 1996

Knowledge is transferred between tasks via a learned model of the


invariances of the domain: object recognition is invariant to rotation,
translation, scaling, lighting, etc. These invariances are common to all
object recognition tasks.
Toy world

With sharing

Without sharing

Convolutional Neural Network

Le Cun et al, 98
Translation invariance is already built into the network
The output neurons share all the intermediate levels

Sharing transformations
Miller, E., Matsakis, N., and Viola, P. (2000). Learning from one example
through shared densities on transforms. In IEEE Computer Vision and
Pattern Recognition.

Transformations are shared


and can be learnt from other tasks.

Models of object recognition


I. Biederman, Recognition-by-components: A theory of human image
understanding, Psychological Review, 1987.
M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in
cortex, Nature Neuroscience 1999.

T. Serre, L. Wolf and T. Poggio. Object recognition with features inspired


by visual cortex. CVPR 2005

Sharing in constellation models


(next Wednesday)

Pictorial Structures
Fischler & Elschlager, IEEE Trans. Comp. 1973

SVM Detectors
Heisele, Poggio, et. al., NIPS 2001

Constellation Model

Model-Guided Segmentation

Fergus, Perona, & Zisserman, CVPR 2003

Mori, Ren, Efros, & Malik, CVPR 2004

Reusable Parts
Krempp, Geman, & Amit Sequential Learning of Reusable Parts for Object
Detection. TR 2002

Goal: Look for a vocabulary of edges that reduces the number of


features.

[Plot: number of features as a function of the number of classes, with examples of reused parts]

Additive models and boosting


Independent binary classifiers:
Screen detector
Car detector
Face detector
Binary classifiers that share features:
Screen detector
Car detector
Face detector
Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

[Figure: features learned for pedestrian, chair, traffic light, sign, face, and a background class]

Non-shared (class-specific) feature: this feature is too specific to faces.

Shared feature: useful across several of the classes.

50 training samples/class
29 object classes
2000 entries in the dictionary
Class-specific features

Results averaged on 20 runs


Error bars = 80% interval

Shared features

Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

Generalization as a function of
object similarities
[Plots of area under ROC vs. number of training samples per class:
12 viewpoints of the same object class, K = 2.1;
12 unrelated object classes, K = 4.8]

Torralba, Murphy, Freeman. CVPR 2004. PAMI 2007

Sharing patches
Bart and Ullman, 2004
For a new class, use only features similar to features that were good for other
classes:

Proposed Dog
features

Transfer Learning for Image Classification with Sparse


Prototype Representations
A. Quattoni, M. Collins, T. Darrell, CVPR 2008

Coefficients for feature 2

Coefficients for classifier 2

Hierarchical Topic Models


Topic models typically use a
bag of words approx.:
Learning topics allows transfer
of information within a corpus of
related documents
Mixing proportions capture the
distinctive features of particular
documents

Pr(topic | doc)
Pr(word | topic)

[LDA graphical model: per-document topic proportions generate a topic z for each of the N observed words x; plate over K topics]

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003

Hierarchical Topic Models


Pr(x = word | doc) = \sum_{topic} Pr(x = word | z = topic) \, Pr(z = topic | doc)

[Same LDA graphical model as above: Pr(topic | doc) and Pr(word | topic)]

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003

Hierarchical Topic Models


Bag-of-features models with the same structure:

Object Recognition (Sivic et al., ICCV 2005)
Scene Recognition (Fei-Fei et al., CVPR 2005)

Latent Dirichlet Allocation (LDA)
Blei, Ng, & Jordan, JMLR 2003
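
One way to make the LDA step concrete is a minimal sketch with scikit-learn on bag-of-words count vectors; the count matrix and the number of topics are assumptions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_topics(counts, n_topics=10):
    # counts: (n_images, n_codewords) matrix of codeword counts per image
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)   # Pr(topic | doc) for each image
    topic_word = lda.components_            # unnormalized Pr(word | topic)
    topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)
    return doc_topic, topic_word
```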

Hierarchical Sharing and Context


E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. ICCV 2005.

Scenes share objects

Objects share parts

Parts share features

Learning Shared Parts

Objects are often locally similar in appearance


Discover parts shared across categories
How many total parts should we share?
How many parts should each category use?

HDP Object Model


Parts are distributions
over appearances and
locations

We learn the
number of parts.
Each object
uses a different
number of parts.
The model
assumes a
known number
of object
categories.

HDP Object Model

There is no context, so the model is happy to create
impossible part combinations.

Sharing Parts: 16 Categories

Caltech 101 Dataset (Li & Perona)


Horses (Borenstein & Ullman)
Cat & dog faces (Vidal-Naquet & Ullman)

Bikes from Graz-02 (Opelt & Pinz)


Google

Visualization of Shared Parts

Pr(position | part)
Pr(appearance | part)

Visualization of Shared Parts

Pr(position | part)
Pr(appearance | part)

Detection Task

Detection Results

6 Training Images per Category

Detection vs. Training Set Size

Recognition Task

VS.

Recognition Results

6 Training Images per Category

Recognition vs. Training Set Size

Recognition performance decreases. By sharing features, the classes look more similar.

Some more references

Baxter 1996
Caruana 1997
Schapire, Singer, 2000
Thrun, Pratt 1997
Krempp, Geman, Amit, 2002
E.L.Miller, Matsakis, Viola, 2000
Mahamud, Hebert, Lafferty, 2001
Fink et al. 2003, 2004
LeCun, Huang, Bottou, 2004
Holub, Welling, Perona, 2005

3.6 3D models

2D frontal face detection

Amazing how far they have gotten with so little

People have the bad taste of not being


rotationally symmetric

Examples of un-collaborative subjects

Objects are not flat

Solution to deal with 3D variations:


do not deal with it
Not dealing with rotations and pose:
Train a different
model for each view.

The combined detector is invariant to pose variations without an explicit 3D model.

So, how many classifiers?


Object classes

viewpoints

And why should we stop with pose?


Let's do the same with styles,
lighting conditions, etc.

Need to detect Nclasses * Nviews * Nstyles, in clutter.


Lots of variability within classes, and across viewpoints.

Depth without objects


Random dot stereograms (Bela Julesz)

Julesz, 1971

3D is so important for humans that we


decided to grow two eyes in front of the
face instead of having one looking to the
front and another to the back.
(this is not something that Julesz said but he could, maybe
he did)

Objects 3D shape priors

by H. Bülthoff, Max-Planck-Institut für biologische Kybernetik in Tübingen

Video taken from http://www.michaelbach.de/ot/fcs_hollow-face/index.html

3D drives perception of important object attributes

by Roger Shepard (Turning the Tables)


Depth processing is automatic, and we cannot shut it down

3D drives perception of important object attributes

The two Towers of Pisa


Frederick Kingdom, Ali Yoonessi and Elena Gheorghiu of McGill Vision Research unit.

It is not all about objects

The 3D percept is driven by the scene, which imposes its ruling on the objects

Class experiment

Class experiment
Experiment 1: draw a horse (the entire
body, not just the head) on a white piece of
paper.
Do not look at your neighbor! You already
know what a horse looks like, so there is no need to
cheat.

Class experiment
Experiment 2: draw a horse (the entire
body, not just the head), but this time
choose a viewpoint as weird as possible.

Anonymous participant

3D object categorization
Wait: object categorization in humans is not
invariant to 3D pose

3D object categorization
Although we can categorize all three
pictures as views of a horse, they do
not look like equally typical views of
horses, and they do not seem to be
recognizable with the same ease.

by Greg Robbins

Canonical Perspective
Experiment (Palmer, Rosch & Chase 81):
participants are shown views of an object
and are asked to rate how much each one
looks like the object it depicts
(scale: 1 = very much like, 7 = very unlike)

From Vision Science, Palmer

Canonical Perspective
Examples of canonical perspective:
In a recognition task, reaction time
correlated with the ratings.
Canonical views are recognized faster
at the entry level.

Why?

From Vision Science, Palmer

Canonical Viewpoint
Frequency hypothesis

Maximal information hypothesis

Canonical Viewpoint
Frequency hypothesis: ease of recognition is
related to the number of times we have seen the
object from each viewpoint.
For a computer, using its Google memory, a horse
looks like:

It is not a uniform sampling of viewpoints

(some artificial datasets might contain non-natural statistics)

Canonical Viewpoint
Frequency hypothesis: ease of recognition is
related to the number of times we have seen the
object from each viewpoint.

Canonical Viewpoint
Maximal information hypothesis: Some views
provide more information than others about the
objects.
Best views tend to show
multiple sides of the
object.

From Vision Science, Palmer

Canonical Viewpoint
Maximal information hypothesis:
Clocks are preferred as purely frontal

Canonical Viewpoint
Frequency hypothesis
Maximal information hypothesis
Probably both are correct.
Edelman & Bülthoff 92: created new objects to control familiarity.

1- When presenting all viewpoints with the same frequency, observers had
preference for specific viewpoints.
2- When few viewpoints were presented, recognition was better for previously
seen viewpoints.

Object representations
Explicit 3D models: use volumetric
representation. Have an explicit model of
the 3D geometry of the object.

Appealing but hard to get it to work

Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.

Not very appealing but somewhat easy to get it to work

Object representations
Implicit 3D models: matching the input 2D
view to view-specific representations.
The object is represented as a collection of 2D
views (maybe the most frequent views seen in the
past).
Tarr & Pinker (89) show people are faster at
recognizing previously seen views, as if they were
storing them. People were also able to recognize
unseen views, so they also generalize to new
views. It is not just template matching.

Why do I explain all this?


As we build systems and develop
algorithms it is good to:
Get inspiration from what others have thought
Get intuitions about what can work, and how
things can fail.

Explicit 3D model

Object Recognition in the Geometric Era: a Retrospective, Joseph L. Mundy

Explicit 3D model
Not all explicit 3D models were disappointing.
For some object classes, with accurate
geometric and appearance models, it is
possible to get remarkable results.

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

We have not yet achieved the same level of
description for other object classes

Implicit 3D models

Aspect Graphs

The nodes of the graph represent object views that are adjacent to each other
on the unit sphere of viewing directions but differ in some significant way.
The most common view relationship in aspect graphs is based on the
topological structure of the view, i.e., edges in the aspect graph arise from
transitions in the graph structure relating vertices, edges and faces of the
projected object. Joseph L. Mundy

Aspect Graphs

Patch-based single view detector


Vidal-Naquet, Ullman (2003)

Car model

Screen model

For a single view


First we collect a set of part templates from a set of training
objects.
Vidal-Naquet, Ullman (2003)

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments


E. Bart, E. Byvatov, & S. Ullman

Extended fragments
Extended patches are extracted using short sequences.
Use Lucas-Kanade motion estimation to track patches across the sequence.

Learning

Once a large pool of extended fragments is created, there is a training stage
to select the most informative fragments.

For each fragment, evaluate the mutual information I(C; F) between the class
label C and the fragment present/absent variable F. If C and F are
independent, then I(C, F) = 0.

Select the fragment B with the highest mutual information. In the subsequent
rounds, select fragments that add information given the ones already chosen.

All these operations are easy to compute. It is just counting.

Worked example (counting over 10 training images): P(C=1, F=1) = 3/10; the
remaining entries P(C=1, F=0), P(C=0, F=1) and P(C=0, F=0) are read off the
same table of counts.
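
A minimal sketch of that counting in Python; the full joint-count table below is hypothetical except for the P(C=1, F=1) = 3/10 entry taken from the slide:

```python
import numpy as np

def mutual_information(joint_counts):
    # joint_counts[c, f] = number of training images with class label c and
    # fragment presence f (both in {0, 1})
    p = joint_counts / joint_counts.sum()
    pc = p.sum(axis=1, keepdims=True)   # P(C)
    pf = p.sum(axis=0, keepdims=True)   # P(F)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (pc @ pf)[mask])).sum())

# 10 images; 3 of them contain both the class and the fragment, as on the
# slide; the remaining counts are hypothetical placeholders.
counts = np.array([[4.0, 1.0],    # C=0: fragment absent / present
                   [2.0, 3.0]])   # C=1: fragment absent / present
print(mutual_information(counts))
```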

Training without sequences


Challenges:
- We do not know which fragments are in
correspondence (we cannot use motion
estimation due to strong transformation)
Fragments that are in correspondence will have
detections that are correlated across viewpoints.
The same approach can be used for
arbitrary transformations

Bart & Ullman

Shared features for multi-view object detection

Training does not require having different views of the same object.

View
invariant
features

View
specific
features
Torralba, Murphy, Freeman. PAMI 07

Shared features for multi-view object detection
Sharing is not a tree. Depends also on 3D symmetries.

Torralba, Murphy, Freeman. PAMI 07

Multi-view object detection

Strong learner
H response for
car as function
of assumed
view angle

Torralba, Murphy, Freeman. PAMI 07

Voting schemes
Towards Multi-View Object Class
Detection
Alexander Thomas
Vittorio Ferrari
Bastian Leibe
Tinne Tuytelaars
Bernt Schiele
Luc Van Gool

Viewpoint-Independent Object Class Detection using 3D Feature Maps


Training dataset: synthetic objects

Features

Voting scheme and detection

Each cluster casts votes for the


voting bins of the discrete poses
contained in its internal list.

Liebelt, Schmid, Schertler. CVPR 2008

Lab
SIFT + visual words

http://people.csail.mit.edu/torralba/courses/seminarioUC3M/dia3/
Laboratorio3.m
