Submission Essentials in the light of recent developments
ILSVRC Tutorial @ CVPR-2015, 7 June 2015
Karen Simonyan
Outline
- Architectures
  - Convolutional Networks: recap
  - The importance of depth in image representations
    - very deep ConvNets (VGG-Net and extensions)
    - Inception modules (GoogLeNet)
- Training
  - Optimisation
  - Data augmentation
- Evaluation
- References
Convolutional Networks
- State-of-the-art in image recognition
  - winner of ILSVRC since 2012
- ConvNet: hierarchical image representation [LeCun et al., 89, 98]
  - stack of conv. layers, interleaved with non-linearities
  - typically followed by fully-connected layers
[Figure: ConvNet schematic]
Convolutional Networks (2)
- Important conv. layer properties:
  - locality: objects/parts have local spatial support
  - weight sharing: translation equivariance
AlexNet
- Winner of ILSVRC-2012 [Krizhevsky et al.]
- ConvNet with 8 layers (5 conv. & 3 FC)
[Figure: layers and their output sizes]
Building Very Deep Nets (1)
- Stack several layers between pooling
  - #conv. layers >> #pooling layers
- #conv. layers does not affect resolution if each layer preserves spatial resolution:
  - conv. stride = 1 & input is padded
- More generally, interleave stacks of layers with resolution reduction
[Figure: pooling, conv, conv, conv, conv, pooling; the pooling layers are labelled "resolution reduction"]
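The resolution argument above can be checked with the usual output-size formula for conv. and pooling layers. This is a minimal sketch: the function names, the 2x2 stride-2 max-pool and the padding of `(kernel - 1) // 2` are assumptions made for the example, not details taken from the slide.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a conv. or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def stack_output_size(size, n_conv, kernel=3):
    """Size after n_conv padded stride-1 conv. layers and one 2x2/stride-2 pool."""
    pad = (kernel - 1) // 2  # assumed padding that keeps the resolution
    for _ in range(n_conv):
        size = conv_output_size(size, kernel, stride=1, padding=pad)
    return conv_output_size(size, 2, stride=2)


# The number of conv. layers in the stack does not change the output size.
sizes = [stack_output_size(224, n) for n in (1, 2, 4)]
print(sizes)
```

Under these assumptions only the pooling layer changes the resolution, so the depth of each stack can be chosen freely.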
Building Very Deep Nets (2)
- Stack of small (3x3) conv. layers
  - has a large receptive field
    - two 3x3 layers: 5x5 receptive field
    - three 3x3 layers: 7x7 receptive field
  - faster than a stack of large conv. layers
  - fewer parameters than a single layer with large kernels
[Figure: receptive field of stacked 3x3 layers]
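The receptive-field and parameter claims can be verified with a little arithmetic. The sketch below assumes stride-1 layers with the same number of input and output channels and ignores biases. The channel count `C = 256` is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(n_layers, kernel=3):
    """Receptive field of n stride-1 conv. layers sharing one kernel size."""
    return 1 + n_layers * (kernel - 1)


def conv_weights(kernel, channels, n_layers=1):
    """Weight count (biases ignored) with `channels` inputs and outputs per layer."""
    return n_layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel count
rf_two = stacked_receptive_field(2)            # two 3x3 layers
rf_three = stacked_receptive_field(3)          # three 3x3 layers
params_stack = conv_weights(3, C, n_layers=3)  # 27 * C^2
params_single = conv_weights(7, C)             # 49 * C^2
print(rf_two, rf_three, params_stack, params_single)
```

For the same 7x7 receptive field, the three-layer stack needs 27 C² weights against 49 C² for a single 7x7 layer.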
Very Deep Nets at ILSVRC
- Large depth and small filters are used in two top-performing ILSVRC-2014 submissions
  - GoogLeNet (Inception) [Szegedy et al., 2014]
  - VGG-Net [Simonyan & Zisserman, 2014]
- as well as the follow-up works
  - Delving deep into rectifiers (MSRA, [He et al., 2015])
  - Deep Image (Baidu, [Wu et al., 2015])
  - Inception v2 (Google, [Ioffe and Szegedy, 2015])
VGG-Net
- Other details are conventional:
  - 5 max-pool layers
[Figure: layer configurations of the 11-layer and 13-layer nets: conv-64 to conv-512 stacks separated by maxpool, followed by FC-4096, FC-4096, FC-1000 and softmax]
VGG-Net Incarnations
- Extra layers injected into deeper stacks
  - first layers capture lower-level primitives, don't need to be very discriminative
  - spatial resolution is higher in the first layers, adding extra layers there is computationally prohibitive
[Figure: layer configurations of the 11-layer and 13-layer nets]
[Figure: layer configurations of the 11-, 13-, 16- and 19-layer nets, built up over successive slides, with a plot over the four depths]
VGG-Net Extensions
- Deep Image (Baidu, [Wu et al., 2015])
  - VGG-16 and VGG-19 models with more channels
- Delving Deep Into Rectifiers (MSRA, [He et al., 2015])
  - aggressive downsampling: 7x7 conv. with stride 2 (cf. GoogLeNet)
  - 6-layer stacks instead of 4-layer
  - Spatial Pyramid pooling [He et al., 2014]
[Figure: layer stacks, written as #layers-conv/stride]
  VGG-19:  2-conv/1, pool/2, 2-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 4-conv/1, pool/2, 3-layer
  MSRA-22: 1-conv/2, pool/2, 6-conv/1, pool/2, 6-conv/1, pool/2, 6-conv/1, SP pool, 3-layer
Parametric ReLU
- Activation function: $f(x) = \max(0, x) + a \min(0, x)$, with a learnable slope $a$ [He et al., 2015]
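As a point of reference, Parametric ReLU as defined in [He et al., 2015] is the identity for positive inputs and a learned linear slope for negative ones. The sketch below is a scalar illustration only. The default `a = 0.25` follows the initial value reported in that paper, and in a real network `a` would be a trained parameter.

```python
def prelu(x, a=0.25):
    """Parametric ReLU: x for x > 0, a * x otherwise (a is learned in practice)."""
    return x if x > 0 else a * x


outputs = [prelu(v) for v in (-2.0, 0.0, 3.0)]
print(outputs)
```

Setting `a = 0` recovers the ordinary ReLU, and a small fixed `a` gives a leaky ReLU.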
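Prerequisite: 1x1 Convolution
- Doesn't capture spatial context, only operates across channels
- Performs linear projection of one pixel's features
- Can be used for dimensionality reduction:
$$W \in \mathbb{R}^{c_{out} \times c_{in}}, \quad F_{in} \in \mathbb{R}^{c_{in} \times wh}, \quad W \times F_{in} = F_{out} \in \mathbb{R}^{c_{out} \times wh}$$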
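The matrix form of the 1x1 convolution can be illustrated with plain lists: each column of F_in holds the features of one pixel, and W projects every column independently. The helper name `conv1x1` and the toy values are assumptions made for this example.

```python
def conv1x1(W, F_in):
    """F_out = W x F_in, with W of shape c_out x c_in and F_in of shape c_in x (w*h)."""
    c_in = len(F_in)
    n_pixels = len(F_in[0])
    return [
        [sum(row[c] * F_in[c][p] for c in range(c_in)) for p in range(n_pixels)]
        for row in W
    ]


# Toy example: reduce 3 channels to 2 on a 2x2 image (4 pixels, flattened).
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
F_in = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0]]
F_out = conv1x1(W, F_in)
print(F_out)
```

The number of pixels is unchanged. Only the channel dimension shrinks, from c_in to c_out.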
[Figure: Inception module, naive version]
Inception Module
- Computation time & number of parameters reduced by 1x1 convolution dimensionality reduction
  - also increases depth
- Allows for increasing #channels without large penalty
- Single Inception module depth: 3
[Figure: Inception module with dim. reduction]
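To see why the 1x1 reduction saves parameters, compare a 5x5 convolution applied directly to a wide input with the same convolution preceded by a 1x1 reduction. The channel counts below are hypothetical values picked for illustration, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a conv. layer (biases ignored)."""
    return kernel * kernel * c_in * c_out


c_in, c_mid, c_out = 192, 16, 32  # hypothetical channel counts

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, c_in, c_out)

# 1x1 reduction to c_mid channels, then the 5x5 convolution.
reduced = conv_weights(1, c_in, c_mid) + conv_weights(5, c_mid, c_out)

print(direct, reduced)
```

With these numbers the reduced branch uses roughly a tenth of the weights, and it adds one extra layer of depth.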
Inception Net v2
- depth: 34 (10 Inception modules, 3 conv., 1 FC)
- aggressive spatial downsampling
  - first layers quickly decrease resolution by a factor of 8
  - lots of depth in further stacks
[Figure: InceptionNet v2]
InceptionNet
- less deep in the first blocks, but deeper in the following ones
- Instead of pooling: Inception with stride 2 (pooling is inside)
Outline: Training
- Optimisation
- Regularisation
- Initialisation
- Batch normalisation
Optimisation
- Learning objective: multinomial logistic regression (softmax loss)
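For concreteness, the softmax loss for a single example is the negative log of the softmax probability assigned to the correct class. The sketch below uses made-up scores and subtracts the maximum score before exponentiating, a common numerical-stability trick that the slide does not mention.

```python
import math


def softmax(scores):
    """Convert class scores to probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    """Multinomial logistic regression loss for one example."""
    return -math.log(softmax(scores)[label])


probs = softmax([2.0, 1.0, 0.1])
loss = softmax_loss([2.0, 1.0, 0.1], 0)          # correct class has the top score
uniform_loss = softmax_loss([0.0, 0.0, 0.0], 1)  # equals log(3)
print(probs, loss, uniform_loss)
```

The loss falls below log(#classes) as soon as the correct class scores higher than the others.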
Learning Rate
- Very important to set it properly
  - too low: training is slow; too high: training diverges
- Conventional strategy
  - start with a reasonably high learning rate (e.g. 0.01)
  - divide it by a constant factor (e.g. 10) when the validation error plateaus
[Figure: val. error vs. iteration]
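The conventional strategy can be sketched as a small schedule function. The plateau test used here (no improvement larger than `min_delta` for `patience` consecutive evaluations) is an assumption, since the slide does not say how a plateau is detected. The validation errors are made-up numbers.

```python
def step_on_plateau(val_errors, lr=0.01, factor=10.0, patience=2, min_delta=1e-3):
    """Return the learning rate in effect at each evaluation step."""
    best = float("inf")
    stale = 0
    schedule = []
    for err in val_errors:
        schedule.append(lr)
        if err < best - min_delta:
            best = err       # validation error improved
            stale = 0
        else:
            stale += 1       # no meaningful improvement
            if stale >= patience:
                lr /= factor  # plateau: divide the learning rate
                stale = 0
    return schedule


errors = [0.50, 0.40, 0.35, 0.35, 0.35, 0.30, 0.30, 0.30]  # hypothetical
rates = step_on_plateau(errors)
print(rates)
```

In this run the rate stays at 0.01 for the first five evaluations and drops to 0.001 once the error has stalled twice in a row.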
Regularisation
- Training suffers from over-fitting, even on ILSVRC
Initialisation
- Sample from zero-mean normal distribution with fixed variance, e.g. N(0; 0.01)
  - works fine for shallow nets
  - deeper nets suffer from the vanishing/exploding gradient problem
- Adaptively choose variance for each layer
  - preserve gradient magnitude [Glorot & Bengio, 2010]: $\sigma^2 = \frac{1}{N_{in}}$
    - FC layers: $N_{in}$ = #input channels
    - conv. layers: $N_{in}$ = #input channels x size^2
  - compensate for ReLU [He et al., 2015]: $\sigma^2 = \frac{2}{N_{in}}$
[Figure: plot over iterations]
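The adaptive scheme can be sketched as follows, reading the two formulas as a variance of 1/N_in (Glorot-style) and 2/N_in (He et al.). That reading, the helper names and the layer sizes are assumptions for this example. Note that Glorot & Bengio also describe a variant that averages fan-in and fan-out, which is not shown here.

```python
import math
import random


def fan_in(n_input_channels, kernel_size=1):
    """N_in: #input channels for FC layers, #input channels * size^2 for conv. layers."""
    return n_input_channels * kernel_size ** 2


def init_std(n_in, relu=False):
    """Standard deviation sqrt(1 / N_in), or sqrt(2 / N_in) to compensate for ReLU."""
    return math.sqrt((2.0 if relu else 1.0) / n_in)


def sample_weights(n_out, n_in, relu=False, seed=0):
    """Draw an n_out x n_in weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    std = init_std(n_in, relu)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


n_in = fan_in(64, kernel_size=3)     # 3x3 conv. layer with 64 input channels
std_plain = init_std(n_in)           # sqrt(1 / 576)
std_relu = init_std(n_in, relu=True)  # sqrt(2 / 576)
weights = sample_weights(8, n_in, relu=True)
print(n_in, std_plain, std_relu)
```

The ReLU-compensated standard deviation is larger by a factor of sqrt(2), which offsets the half of the activations that ReLU sets to zero.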
Random Crop Augmentation
- Randomly sample a fixed-size sub-image (224x224)
  - the crop is a ConvNet input
  - essential component of most ImageNet submissions since AlexNet
- Original image is rescaled to a certain smallest side
  - affects the scale of image statistics seen by a ConvNet
  - single-scale: 256xN or 384xN
  - multi-scale: randomly sample the size for each image from 256xN to 512xN
- Random horizontal flips
[Figure: 224x224 crops taken from images rescaled to smallest side 256 and 384]
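The augmentation steps can be sketched on image dimensions alone, without touching pixels. The function names, the 640x480 source size and the 0.5 flip probability are assumptions for this example. An actual pipeline would also resize and crop the pixel data.

```python
import random


def rescale_size(width, height, smallest_side):
    """New (width, height) with the smaller side equal to smallest_side."""
    scale = smallest_side / min(width, height)
    return round(width * scale), round(height * scale)


def sample_crop(width, height, crop=224, rng=random):
    """Pick the top-left corner of a crop x crop window and a flip flag."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    flip = rng.random() < 0.5
    return left, top, flip


rng = random.Random(0)
S = rng.randint(256, 512)          # multi-scale: smallest side sampled per image
w, h = rescale_size(640, 480, S)   # hypothetical 640x480 source image
left, top, flip = sample_crop(w, h, rng=rng)
print(S, (w, h), (left, top, flip))
```

For single-scale training, `S` would simply be fixed to 256 or 384 instead of being sampled.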
Photometric Distortion Augmentation
- Random RGB shift [AlexNet]
Outline: Evaluation
- Multi-crop evaluation
- Dense evaluation
  - fully-convolutional nets
- Model ensembles
Multi-Crop Evaluation
- Network is trained on fixed-size (224x224) crops
- Full image is normally larger, so
  - tile the image with crops
  - evaluate the net and average predictions
- More crops: higher accuracy, but slower
- Single-scale: 5 crops x 2 flips = 10 crops [AlexNet]
- Multi-scale: rescale image to several sizes, sample crops in each
  - [Howard, 2013]: 3 scales, 90 crops; [GoogLeNet]: 4 scales, 144 crops
  - disadvantage: slow, as the ConvNet needs to be evaluated from scratch for each crop
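The single-scale 10-crop scheme can be sketched by listing the crop windows and averaging per-crop predictions. The four-corners-plus-centre layout follows the AlexNet protocol. The function names and the toy scores are assumptions for this example.

```python
def five_crop_boxes(width, height, crop=224):
    """Top-left corners of the four corner crops and the centre crop."""
    right, bottom = width - crop, height - crop
    return [(0, 0), (right, 0), (0, bottom), (right, bottom),
            (right // 2, bottom // 2)]


def ten_crop_views(width, height, crop=224):
    """Each crop as (left, top, flipped), once plain and once mirrored."""
    return [(left, top, flipped)
            for (left, top) in five_crop_boxes(width, height, crop)
            for flipped in (False, True)]


def average_predictions(per_crop_scores):
    """Average class-score vectors over all crops."""
    n = len(per_crop_scores)
    return [sum(col) / n for col in zip(*per_crop_scores)]


views = ten_crop_views(256, 256)
avg = average_predictions([[0.2, 0.8], [0.4, 0.6]])  # toy scores for two crops
print(len(views), avg)
```

Each view requires a separate forward pass, which is the cost the slide refers to.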
Dense Evaluation
- ConvNets can be applied to an image of any size
- Network should be fully-convolutional
  - fully-connected layers expect fixed-resolution input, so should be converted to conv.
- Conversion (on the example of VGG-Net)
  - assume FC layer has input 512x7x7, output is 4096-D
  - can be seen as conv. layer with 7x7 receptive field, 512 input channels & 4096 output: 512x7x7 -> 4096x1x1
- Output of full-conv. net is a class score map, should be pooled with global pooling to produce a vector of scores
- Used in OverFeat [Sermanet et al., 2013] & VGG-Net
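The effect of the FC-to-conv conversion can be sketched by computing the size of the class score map for different input sizes. The sketch assumes a VGG-style net with resolution-preserving conv. layers, five 2x2 stride-2 max-pools and no padding in the converted 7x7 layer. Average pooling is used as one possible choice of global pooling.

```python
def score_map_size(image_size, n_pools=5, fc_kernel=7):
    """Side of the class score map for a square input of side image_size."""
    feat = image_size
    for _ in range(n_pools):
        feat //= 2               # each pool halves the resolution
    return feat - fc_kernel + 1  # first FC layer applied as a 7x7 conv.


def global_average_pool(score_map):
    """Average per-location score vectors into a single vector of class scores."""
    n = len(score_map)
    return [sum(col) / n for col in zip(*score_map)]


size_224 = score_map_size(224)    # training resolution
size_384 = score_map_size(384)    # larger test image
fc_weights = 512 * 7 * 7 * 4096   # same weights, reshaped from FC to conv.
pooled = global_average_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
print(size_224, size_384, fc_weights, pooled)
```

At the training resolution the score map is a single location, so the converted net reproduces the original FC output. Larger inputs yield a grid of scores computed in one pass.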
Effect of Scale (VGG-Net)
[Figure: Top-5 Classification Error (Val. Set) for 13, 16 and 19 layers, with train/test scales single/single, multi/single and multi/multi]
Object Detection (In Brief)
- Common approach:
  - generate a large number of bounding box proposals
  - classify them using visual features
- ConvNet features work very well!
  - R-CNN [Girshick et al., 2013]
  - Fast R-CNN [Girshick, 2015]
    - for each proposal, predicts its class and precise bbox location
    - re-uses conv. features, no need to re-compute
- Proposals
  - Selective search
  - Multi-Box
  - Faster R-CNN
Infrastructure
- Good infrastructure is just as important
- A number of off-the-shelf deep learning packages
  - Torch, Caffe, Theano, MatConvNet
- Using GPUs is a must
  - most packages use the same low-level back-ends, e.g. cuDNN or cuBLAS, so speed is comparable
- Multi-GPU training helps a lot
  - available in packages above
Summary
- Deep ConvNets: an essential component of top ILSVRC submissions since 2012
- Depth is important
- Other essentials:
  - extensive augmentation at multiple scales
  - dropout, batch normalisation, weight decay
References
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation 1989.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
- M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR 2014.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
- A. G. Howard. Some improvements on deep convolutional neural network based image classification. ICLR 2014.
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable Object Detection using Deep Neural Networks. CVPR 2014.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
- K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
- R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. Arxiv 2015.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Arxiv 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper With Convolutions. CVPR 2015.
- S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
- R. Girshick. Fast R-CNN. Arxiv 2015.