
ILSVRC Submission Essentials
in the light of recent developments

ILSVRC Tutorial @ CVPR-2015
7 June 2015

Karen Simonyan

Outline
- Architectures
  - Convolutional Networks: recap
  - The importance of depth in image representations
    - very deep ConvNets (VGG-Net and extensions)
    - Inception modules (GoogLeNet)
- Training
  - Optimisation
  - Data augmentation
- Evaluation
- References
Convolutional Networks
- State-of-the-art in image recognition
  - winner of ILSVRC since 2012
- ConvNet: a hierarchical image representation [LeCun et al., 89, 98]
  - stack of conv. layers, interleaved with non-linearities
  - typically followed by fully-connected layers

[Figure: ConvNet schematic]
Convolutional Networks (2)
- Important conv. layer properties:
  - locality: objects/parts have local spatial support
  - weight sharing: translation equivariance
- Conv. layers operate across all channels, not just one
- Each layer is followed by a non-linearity (activation function), e.g. ReLU: max(W*x, 0)
- Some layers are followed by spatial pooling (sketch below)
  - max- or sum-pooling
  - invariance to local translation
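As a concrete illustration, here is a minimal numpy sketch of a ReLU non-linearity and 2x2 max-pooling (the shapes are illustrative, chosen to match the AlexNet table a few slides below; this is not code from the talk):

```python
import numpy as np

def relu(x):
    # element-wise non-linearity: max(x, 0)
    return np.maximum(x, 0.0)

def max_pool_2x2(fmap):
    # fmap: (channels, height, width); height and width assumed even
    c, h, w = fmap.shape
    # group each 2x2 neighbourhood and take its maximum
    return fmap.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.random.randn(96, 56, 56)
y = max_pool_2x2(relu(x))  # -> (96, 28, 28), invariant to small local shifts
```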
Convolutional Networks (3)
- Supervised training by back-propagation
  - gradient descent & chain rule
- End-to-end training
  - all layers learnt jointly, no hand-crafting
- But some engineering is still needed to put together an architecture
  - number of layers, feature channels, etc.
  - some guidelines will be provided in this talk
AlexNet
- Winner of ILSVRC-2012 [Krizhevsky et al.]
- ConvNet with 8 layers (5 conv. & 3 FC):

  layer             output size
  input image       3x224x224
  conv-96x11x11/4   96x56x56
  maxpool/2         96x28x28
  conv-256x5x5      256x28x28
  maxpool/2         256x14x14
  conv-384x3x3      384x14x14
  conv-384x3x3      384x14x14
  conv-256x3x3      256x14x14
  maxpool/2         256x7x7
  full-4096         4096
  full-4096         4096
  full-1000         1000

- With depth:
  - spatial resolution is gradually reduced
  - number of channels (feature dimension) is increased
  - higher-level representations, more spatial invariance
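The output sizes in the table follow the usual convolution arithmetic; a small sketch (the padding values are my assumption, chosen to reproduce the table's numbers):

```python
def out_size(in_size, kernel, stride, pad):
    # standard convolution/pooling output-size arithmetic
    return (in_size + 2 * pad - kernel) // stride + 1

# first conv. layer: 11x11 kernel, stride 4 (pad=4 assumed to match 56x56)
print(out_size(224, kernel=11, stride=4, pad=4))  # -> 56
# maxpool/2 halves the resolution
print(out_size(56, kernel=2, stride=2, pad=0))    # -> 28
# conv-256x5x5 preserves the resolution with pad=2
print(out_size(28, kernel=5, stride=1, pad=2))    # -> 28
```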
Deeper is Better
- Each weight layer performs a linear operation, followed by a non-linearity
  - a single layer can be seen as a linear classifier itself
- More layers -> more non-linearities
  - leads to a more discriminative model
- What limits the number of layers?
  - many models use pooling after each conv. layer
  - input image resolution sets the limit: log2(s) pooling layers for an sxs input
  - computational complexity
Building Very Deep Nets (1)
- Stack several conv. layers between pooling layers
  - #conv. layers >> #pooling layers
  - #conv. layers does not affect resolution if each layer preserves spatial resolution:
    conv. stride = 1 & input is padded
- More generally, interleave deep multi-layer processing blocks with resolution reduction layers

[Diagram: stacks of conv. layers (stride 1) separated by pooling / resolution reduction layers]
Building Very Deep Nets (2)
- A stack of small (3x3) conv. layers has a large receptive field
  - two 3x3 layers -> 5x5 receptive field
  - three 3x3 layers -> 7x7 receptive field
- faster than a stack of large conv. layers
- fewer parameters than a single layer with large kernels (see the sketch below)

[Diagram: two stacked 3x3 conv. layers covering a 5x5 receptive field]
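A quick check of both claims, assuming stride-1 layers and C input/output channels (C = 256 is illustrative):

```python
def stacked_rf(kernels, strides=None):
    # effective receptive field of a stack of conv. layers
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(stacked_rf([3, 3]))      # -> 5
print(stacked_rf([3, 3, 3]))   # -> 7

C = 256                        # illustrative channel count
print(3 * 3 * 3 * C * C)       # three 3x3 layers: 27*C^2 parameters
print(7 * 7 * C * C)           # one 7x7 layer:    49*C^2 parameters
```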
Very Deep Nets at ILSVRC
- Large depth and small filters are used in the two top-performing ILSVRC-2014 submissions
  - GoogLeNet (Inception) [Szegedy et al., 2014]
  - VGG-Net [Simonyan & Zisserman, 2014]
- as well as in the follow-up works
  - Delving deep into rectifiers (MSRA, [He et al., 2015])
  - Deep Image (Baidu, [Wu et al., 2015])
  - Inception v2 (Google, [Ioffe and Szegedy, 2015])
VGG-Net
- Straightforward implementation of very deep nets:
  - stacks of conv. layers w/o pooling in between
  - 3x3 conv. kernels: very small
  - conv. stride 1: no skipping
- Other details are conventional:
  - 5 max-pool layers
  - no normalisation layers
  - 3 fully-connected layers

13-layer configuration:
  image
  conv-64
  conv-64
  maxpool
  conv-128
  conv-128
  maxpool
  conv-256
  conv-256
  maxpool
  conv-512
  conv-512
  maxpool
  conv-512
  conv-512
  maxpool
  FC-4096
  FC-4096
  FC-1000
  softmax
VGG-Net Incarnations
- Started from 11 layers & injected more conv. layers
- Extra layers were injected into the deeper stacks:
  - first layers capture lower-level primitives and don't need to be very discriminative
  - spatial resolution is higher in the first layers, so adding extra layers there is computationally prohibitive
- 16- and 19-layer models are publicly available

  11-layer   13-layer   16-layer   19-layer
  image      image      image      image
  conv-64    conv-64    conv-64    conv-64
             conv-64    conv-64    conv-64
  maxpool    maxpool    maxpool    maxpool
  conv-128   conv-128   conv-128   conv-128
             conv-128   conv-128   conv-128
  maxpool    maxpool    maxpool    maxpool
  conv-256   conv-256   conv-256   conv-256
  conv-256   conv-256   conv-256   conv-256
                        conv-256   conv-256
                                   conv-256
  maxpool    maxpool    maxpool    maxpool
  conv-512   conv-512   conv-512   conv-512
  conv-512   conv-512   conv-512   conv-512
                        conv-512   conv-512
                                   conv-512
  maxpool    maxpool    maxpool    maxpool
  conv-512   conv-512   conv-512   conv-512
  conv-512   conv-512   conv-512   conv-512
                        conv-512   conv-512
                                   conv-512
  maxpool    maxpool    maxpool    maxpool
  FC-4096    FC-4096    FC-4096    FC-4096
  FC-4096    FC-4096    FC-4096    FC-4096
  FC-1000    FC-1000    FC-1000    FC-1000
  softmax    softmax    softmax    softmax

Effect of VGG-Net Depth

[Bar chart: top-5 classification error (val. set) for the 11-, 13-, 16- and 19-layer nets, decreasing from 10.4% (11 layers) through 9.4% (13 layers) to around 8.8-9.0% (16 and 19 layers)]

- Error decreases with depth
- Plateaus after 16 layers
  - could be due to training specifics
VGG-Net Layer Pattern
- Multi-layer stacks (conv. layers, stride=1) interleaved with resolution reduction (max-pooling, stride=2)
- Other very deep nets (incl. GoogLeNet) follow the same/similar pattern

Pattern of the 19-layer VGG-Net ("n-conv/s" = n conv. layers with stride s):
  2-conv/1
  pool/2
  2-conv/1
  pool/2
  4-conv/1
  pool/2
  4-conv/1
  pool/2
  4-conv/1
  pool/2
  3-fc
VGG-Net Extensions
- Deep Image (Baidu, [Wu et al., 2015])
  - VGG-16 and VGG-19 models with more channels
- Delving Deep Into Rectifiers (MSRA, [He et al., 2015])
  - aggressive downsampling: 7x7 conv. with stride 2 (cf. GoogLeNet)
  - 6-layer stacks instead of 4-layer
  - Spatial Pyramid pooling [He et al., 2014]

  VGG-19     MSRA-22
  2-conv/1   1-conv/2
  pool/2     pool/2
  2-conv/1
  pool/2
  4-conv/1   6-conv/1
  pool/2     pool/2
  4-conv/1   6-conv/1
  pool/2     pool/2
  4-conv/1   6-conv/1
  pool/2     SP pool
  3-layer    3-layer
Parametric ReLU
- Activation function: f(y_i) = y_i if y_i > 0, else a_i * y_i
- a_i is learnable with back-prop (sketch below)
  - per-channel or per-layer
  - a learnable activation function!
- Generalises
  - ReLU (a_i = 0)
  - leaky ReLU (a_i = 0.01)
- 0.5%/0.2% top-1/top-5 error reduction
- [He et al., 2015]

[Plot: ReLU vs. PReLU activation functions]
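A minimal numpy sketch of the PReLU forward pass and the gradient w.r.t. the learnable slope (the per-layer variant is shown; this is an illustration, not code from [He et al., 2015]):

```python
import numpy as np

def prelu(y, a):
    # identity for positive inputs, learnable slope a for negative ones
    return np.where(y > 0, y, a * y)

def prelu_grad_a(y, grad_out):
    # d f(y)/d a = y where y <= 0, 0 elsewhere; summed for a per-layer slope
    return np.sum(grad_out * np.where(y > 0, 0.0, y))
```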
GoogLeNet (Inception)
- Developed concurrently with VGG-Net
- Some design choices are similar:
  - very deep (22 layers)
  - small filters
    - 3x3, 5x5, 7x7 (1st layer only) in [Szegedy et al., 2014]
    - 3x3 and 7x7 (1st layer only) in [Ioffe & Szegedy, 2015]
- But more computationally and parameter-efficient, due to the multi-branch Inception modules
Prerequisite: 1x1 Convolution
- Doesn't capture spatial context, only operates across channels
- Performs a linear projection of each pixel's features (sketch below)
  - can be used for dimensionality reduction:
    W in R^(c_out x c_in), F_in in R^(c_in x w x h)  =>  F_out = W F_in in R^(c_out x w x h)
- Also increases the depth
- Computationally- and parameter-cheap
- Used in the Network in Network architecture [Lin et al., 2014]
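Viewed as a matrix product, a 1x1 convolution is just a per-pixel channel projection; a minimal numpy sketch (the shapes are illustrative):

```python
import numpy as np

def conv_1x1(f_in, w):
    # f_in: (c_in, h, w_px) feature map; w: (c_out, c_in) projection
    c_in, h, w_px = f_in.shape
    f_out = w @ f_in.reshape(c_in, h * w_px)  # project every pixel's channels
    return f_out.reshape(-1, h, w_px)

f = np.random.randn(256, 28, 28)
proj = np.random.randn(64, 256)        # reduce 256 channels to 64
print(conv_1x1(f, proj).shape)         # -> (64, 28, 28)
```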
Inception Module
- Conv. filters of different size operate alongside each other
  - the resulting feature maps are concatenated
- filter sizes: 1x1, 3x3, 5x5 & max/avg-pooling
  - in Inception v2 [Ioffe & Szegedy, 2015] the 5x5 is replaced with two 3x3
- most output channels are computed with fast layers (branches ordered from fast to slow), e.g.
  1024 (pool) + 352 (1x1 conv) + 320 (3x3 conv) + 224 (5x5 conv) = 1920 (out)

[Diagram: Inception module, naive version]
Inception Module (2)
- Computation time & number of parameters are reduced by 1x1-convolution dimensionality reduction (sketch below)
  - also increases depth
- Allows for increasing #channels without a large penalty
- single Inception module depth: 3

[Diagram: Inception module with dimensionality reduction]
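A back-of-the-envelope illustration of the saving (the channel counts are my assumption, not the actual GoogLeNet ones):

```python
# cost (multiply-adds) of a 5x5 conv. on a 28x28 map, 256 -> 320 channels,
# with and without a 1x1 reduction to 64 channels first
h = w = 28
direct  = 5 * 5 * 256 * 320 * h * w
reduced = (1 * 1 * 256 * 64 + 5 * 5 * 64 * 320) * h * w
print(direct, reduced, direct / reduced)   # roughly 4x cheaper
```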
Inception Net v2
- depth: 34 (10 Inception modules, 3 conv., 1 FC)
- aggressive spatial downsampling
  - first layers quickly decrease the resolution by a factor of 8
- lots of depth in the further stacks

[Ioffe & Szegedy, 2015]
Architectures: Comparison

  VGG-19     MSRA-22    InceptionNet v2
  2-conv/1   1-conv/2   1-conv/2
  pool/2     pool/2     pool/2
  2-conv/1              2-conv/1
  pool/2                pool/2
  4-conv/1   6-conv/1   2-Inception/1 (6-conv)
  pool/2     pool/2     1-Inception/2 (3-conv)
  4-conv/1   6-conv/1   4-Inception/1 (12-conv)
  pool/2     pool/2     1-Inception/2 (3-conv)
  4-conv/1   6-conv/1   2-Inception/1 (6-conv)
  pool/2     SP pool    pool/7
  3-layer    3-layer    1-layer

- InceptionNet is less deep in the first blocks, but deeper in the following ones
- Instead of pooling: an Inception module with stride 2 (the pooling is inside)
Outline: Training
- Optimisation
- Regularisation
- Initialisation
- Batch normalisation
Optimisation
- Learning objective
  - multinomial logistic regression (softmax loss)
- A plethora of gradient-based optimisation methods
  - in common: gradients are computed with back-prop
  - then, weights can be updated in different ways: SGD, ADAGRAD, RMSPROP, etc.
- SGD with momentum works very well in practice (sketch below)
  - but it is important to get the hyper-parameters right
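For concreteness, a minimal sketch of an SGD-with-momentum update; the hyper-parameter values are illustrative, and the weight decay term anticipates the regularisation slide below:

```python
import numpy as np

lr, momentum, weight_decay = 0.01, 0.9, 5e-4   # illustrative values

def sgd_momentum_step(w, grad, velocity):
    grad = grad + weight_decay * w             # L2 penalty folded into the gradient
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```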
Learning Rate
- Very important to set it properly
  - too low: training is slow; too high: training diverges
- Conventional strategy (sketch below)
  - start with a reasonably high learning rate (e.g. 0.01)
  - divide it by a constant factor (e.g. 10) when the validation error plateaus

[Plot: validation error vs. iteration, dropping at each learning rate reduction]
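One way to implement "divide when the validation error plateaus"; the patience criterion here is my assumption, not a prescription from the talk:

```python
def update_lr(lr, val_errors, patience=3, factor=10.0):
    # drop the learning rate when the best recent val. error is no better
    # than the best error seen before the last `patience` evaluations
    if len(val_errors) > patience and \
            min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return lr / factor
    return lr
```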
Regularisation
- Training suffers from over-fitting, even on ILSVRC
- Two simple and effective techniques in most submissions since AlexNet
  - weight decay (L2 norm penalty)
  - dropout (sketch below)
- Batch normalisation [Ioffe & Szegedy, 2015]
  - regularises and speeds-up training
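A minimal dropout sketch in the "inverted" formulation, which rescales at training time; note that AlexNet itself instead scaled the activations at test time:

```python
import numpy as np

def dropout(x, rate=0.5, train=True):
    if not train:
        return x                 # identity at test time (inverted scaling)
    mask = (np.random.rand(*x.shape) >= rate) / (1.0 - rate)
    return x * mask              # randomly zero units, rescale the rest
```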
Initialisation
- Sample from a zero-mean normal distribution with fixed variance, e.g. N(0, 0.01)
  - works fine for shallow nets
  - deeper nets suffer from the vanishing/exploding gradient problem
- Adaptively choose the variance for each layer (sketch below)
  - preserve gradient magnitude [Glorot & Bengio, 2010]: sigma^2 = 1/N_in
    - FC layers: N_in = #input channels
    - conv. layers: N_in = #input channels * size^2
  - compensate for ReLU [He et al., 2015]: sigma^2 = 2/N_in ("MSRA" init)
- Supervised pre-training
  - init deep with shallow [VGG-Net]
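Both schemes in one numpy sketch:

```python
import numpy as np

def init_conv(c_out, c_in, k, mode="he"):
    n_in = c_in * k * k                      # fan-in: #input channels * size^2
    var = (2.0 if mode == "he" else 1.0) / n_in
    return np.random.randn(c_out, c_in, k, k) * np.sqrt(var)

w = init_conv(64, 3, 3)          # e.g. a first 3x3 layer on RGB input
```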
Batch Normalisation
- The distribution of activations changes during training, making training harder
- Whitening of neural net inputs is a standard pre-processing technique
- Batch normalisation [Ioffe & Szegedy, 2015] normalises the outputs of each layer to zero mean and unit variance (sketch below)
  - can be seen as diagonal whitening
  - performed after each weight layer, before the ReLU
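A minimal sketch of the batchnorm forward pass for an FC layer (for conv. layers the statistics would be computed per channel over the batch and spatial dimensions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalise each feature over the mini-batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta   # learnt scale and shift
```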
Batch Normalisation (2)

[Plot: accuracy vs. iteration]

- scale and shift parameters are learnt
  - doing backprop through batchnorm is important
- nets with batchnorm need less regularisation
  - smaller/zero dropout & weight decay
Data Augmentation
- ILSVRC is still too small for large ConvNets
  - over-fitting in spite of regularisation
- Data augmentation (jittering) increases the amount of training data
- Transforms original images in a way which
  - preserves their label
  - is realistic
- Helpful for both training and evaluation
Random Crop Augmentation
- Randomly sample a fixed-size sub-image (224x224); sketch below
  - the crop is the ConvNet input
  - essential component of most ImageNet submissions since AlexNet
- Original image is rescaled to a certain smallest side
  - affects the scale of image statistics seen by a ConvNet
  - single-scale: 256xN or 384xN
  - multi-scale: randomly sample the size for each image from 256xN to 512xN
- Random horizontal flips
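A minimal numpy sketch of random cropping plus horizontal flipping:

```python
import numpy as np

def random_crop_flip(img, crop=224):
    # img: (h, w, 3), already rescaled so that min(h, w) >= crop
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        out = out[:, ::-1]       # horizontal flip
    return out
```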
Photometric Distortion Augmentation
- Random RGB shift [AlexNet] (sketch below)
- Randomly adjust contrast, brightness, and colour [Howard, 2013]
- Vignetting and lens distortion [Deep Image]
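The AlexNet RGB shift perturbs each image along the principal components of the training-set RGB distribution; a sketch, assuming the PCA eigenvectors and eigenvalues have been precomputed elsewhere:

```python
import numpy as np

def rgb_pca_shift(img, eigvecs, eigvals, sigma=0.1):
    # eigvecs: (3, 3), eigvals: (3,), from PCA over training-set RGB values
    alpha = np.random.normal(0.0, sigma, size=3)   # resampled per image
    shift = eigvecs @ (alpha * eigvals)            # a single RGB offset
    return img + shift                             # broadcast over (h, w, 3)
```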
Outline: Evaluation
- Multi-crop evaluation
- Dense evaluation
  - fully-convolutional nets
- Model ensembles
Multi-Crop Evaluation
- Network is trained on fixed-size (224x224) crops
- Full image is normally larger, so
  - tile the image with crops
  - evaluate the net on each and average the predictions (sketch below)
- More crops: higher accuracy, but slower
- Single-scale: 5 crops x 2 flips = 10 crops [AlexNet]
- Multi-scale
  - rescale the image to several sizes, sample crops in each
  - [Howard, 2013]: 3 scales, 90 crops; [GoogLeNet]: 4 scales, 144 crops
  - disadvantage: slow, as the ConvNet needs to be evaluated from scratch on every crop
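Averaging predictions over crops, as a sketch (`net` is a stand-in for any forward function returning soft-max posteriors):

```python
import numpy as np

def predict_multicrop(net, crops):
    # crops: list of (3, 224, 224) arrays
    probs = np.stack([net(c) for c in crops])
    return probs.mean(axis=0)    # averaged class posteriors
```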
Dense Evaluation
- ConvNets can be applied to an image of any size
- Network should be fully-convolutional
  - fully-connected layers expect a fixed-resolution input, so they should be converted to conv. layers (sketch below)
- Conversion (on the example of VGG-Net)
  - assume the FC layer has a 512x7x7 input and a 4096-D output
  - it can be seen as a conv. layer with a 7x7 receptive field, 512 input channels & 4096 output channels: 512x7x7 -> 4096x1x1
- Output of a fully-conv. net is a class score map, which should be pooled with global pooling to produce a vector of scores
- Used in OverFeat [Sermanet et al., 2013] & VGG-Net
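The conversion itself is just a reshape of the trained FC weights:

```python
import numpy as np

w_fc = np.random.randn(4096, 512 * 7 * 7)   # stand-in for trained FC weights
w_conv = w_fc.reshape(4096, 512, 7, 7)      # 7x7 conv. kernel, 512 -> 4096
# applied with stride 1 to a larger feature map, this layer produces a
# spatial map of 4096-D descriptors instead of a single vector
```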
Effect of Scale (VGG-Net)

[Bar chart: top-5 classification error (val. set) for the 13-, 16- and 19-layer nets under single/single, multi/single and multi/multi train/test scales; errors range from 9.4% down to 7.5%]

- Dense evaluation results
- Using multiple scales is important
  - multi-scale training outperforms single-scale
  - multi-scale testing further improves the results
Evaluation: Dense vs Multi-Crop
Top-5 classification error (val. set):

  networks        dense   150 crops   dense & 150 crops
  16-layer        7.5     7.5         7.2
  19-layer        7.5     7.4         7.1
  16 & 19-layer   7.2     7.1         6.8

- Dense evaluation is on par with multi-crop
- Dense & multi-crop are complementary
- Combining predictions from the 2 nets is beneficial, but slow
Model Ensembles
- Training multiple models and combining their predictions improves the accuracy
  - average the soft-max posteriors (sketch below)
- Used in all top-performing submissions to ILSVRC
- Models don't need to be the same
  - can simply combine your best models developed by the submission time
- Examples of ensemble improvements:
  - VGG-Net: error decreases from 7.1% (1 net) to 6.8% (2 nets)
  - GoogLeNet: from 7.9% (1 net) to 6.7% (7 nets)
  - InceptionNet v2 (batchnorm): from 5.8% to 4.8% (6 nets)
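Ensembling is the same averaging as for crops, but across models; a sketch (each `net` is again a stand-in forward function):

```python
import numpy as np

def ensemble_top5(nets, image):
    # average soft-max posteriors over the ensemble, then take the top-5 classes
    probs = np.mean([net(image) for net in nets], axis=0)
    return np.argsort(probs)[::-1][:5]
```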
Object Localisation (In Brief)
- ILSVRC localisation task: classify and localise a single object (which is guaranteed to be in the image)
- Object detection approaches would require adaptation
  - not all the objects are annotated in the training set
- Object bounding box regression with ConvNets [OverFeat]
  - the last layer predicts a bounding box
    - class-agnostic [OverFeat]
    - for each class [VGG]
  - initialised with classification nets
  - fine-tuning of all layers

[Diagram: 224x224 crop with the regressed object bounding box]
Object Detection (In Brief)
- Common approach:
  - generate a large number of bounding box proposals
  - classify them using visual features
  - ConvNet features work very well!
- R-CNN [Girshick et al., 2013]
- Fast R-CNN [Girshick et al., 2015]
  - for each proposal, predicts its class and a precise bbox location
  - re-uses conv. features, no need to re-compute them
- Proposals
  - Selective Search
  - MultiBox
  - Faster R-CNN
Infrastructure
- Good infrastructure is just as important
- A number of off-the-shelf deep learning packages
  - Torch, Caffe, Theano, MatConvNet
- Using GPUs is a must
  - most packages use the same low-level back-ends, e.g. cuDNN or cuBLAS, so speed is comparable
- Multi-GPU training helps a lot
  - available in the packages above
Summary
- Deep ConvNets: an essential component of top ILSVRC submissions since 2012
- Depth is important
- Other essentials:
  - extensive augmentation at multiple scales
  - dropout, batch normalisation, weight decay
- The next talk will cover the implementation side
References
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation 1989.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
- M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR 2014.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
- A. G. Howard. Some improvements on deep convolutional neural network based image classification. ICLR 2014.
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable Object Detection using Deep Neural Networks. CVPR 2014.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
- K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
- R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up Image Recognition. arXiv 2015.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper With Convolutions. CVPR 2015.
- S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
- R. Girshick. Fast R-CNN. arXiv 2015.
