
Deep Learning within

Computer Vision
a whirlwind tour of the key principles

Toby Breckon
Engineering and Computer Science
Durham University
www.durham.ac.uk/toby.breckon/mltutorial/ toby.breckon@durham.ac.uk
Slide material acknowledgements (some material): Lee (UC Davis), Grauman (UT Austin), Lazebnik (Illinois), Fei-Fei (Stanford),
Fergus (Stony Brook), Huang (Illinois), Lee (Michigan), Ranzato (Facebook A.I. Research), Sermanet (Google), Vedaldi (Oxford), Hinton (Toronto), Fisher (HIPR2, Edinburgh) + additional URL/acknowledgement on individual slides

BMVA Computer Vision Deep Learning : 1


Summer School 2018
Let’s start at the very beginning ...

BMVA Computer Vision


Summer School 2018 Deep Learning : 2
Machine Learning ?

Why Machine Learning?
– we cannot program everything
– some tasks are difficult to define algorithmically

– especially in computer vision


…. visual sensing has few rules


Well-defined learning problems ?
– easy to learn Vs. difficult to learn
..... varying complexity of visual patterns


An example: learning to recognise objects ...
Image: DK

BMVA Computer Vision


Summer School 2018 Deep Learning : 3
Learning ? - in humans

BMVA Computer Vision


Summer School 2018 Deep Learning : 4
Learning ? - in computers

BMVA Computer Vision


Summer School 2018 Deep Learning : 5
Machine Learning

Definition:


A set of methods for the automated analysis of structure in
data. …. two main strands of
work, (i) unsupervised learning ….
and (ii) supervised learning.

….similar to ... data mining, but ... focus .. more on


autonomous machine performance, ….
rather than enabling humans to learn from the data.

[Dictionary of Image Processing & Computer Vision, Fisher et al., 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 6
Supervised Vs. Unsupervised

Supervised
– knowledge of output - learning with the presence of an “expert” / teacher
• data is labelled with a class or value
• Goal: predict class or value label

e.g. Neural Network, Support Vector Machines, Decision Trees, Bayesian Classifiers .... ….

Unsupervised
– no knowledge of output class or value
• data is unlabelled or value un-known
• Goal: determine data patterns/groupings
– Self-guided learning algorithm
(internal self-evaluation against some criteria)

e.g. k-means, genetic algorithms, clustering approaches ... …. ?
BMVA Computer Vision
Summer School 2018 Deep Learning : 7
(some) Feature Representation
(e.g. SIFT, HOG, histogram, Bag of Words, PCA ...)

… in the big picture:

Pixels / Voxels / Samples → (some) Feature Representation → Machine Learning = “Decision OR Prediction”
→ class labels: person, cat, dog, cow, car, rhino, ….
   or values: position, “style”, depth, ….

BMVA Computer Vision


Summer School 2018 Deep Learning : 8
Common Machine Learning Tasks


Object Classification
what object ?
http://pascallin.ecs.soton.ac.uk/challenges/VOC/


Object Detection
{people | vehicle | … intruder ….}
object or no-object ?


Instance Recognition ? {face | vehicle plate| gait …. → biometrics}
who (or what) is it ?


Sub-category analysis
which object type ?
{gender | type | species | age …...}


Sequence { Recognition | Classification } ?
what is happening / occurring ?

BMVA Computer Vision


Summer School 2018 Deep Learning : 9
Types of Machine Learning Problem

Classification

Predict (classify) sample → discrete set of class labels

e.g. classes {object 1, object 2 … } for recognition task
…. ?

e.g. classes {object, !object} for detection task


Regression (traditionally less common in comp. vis.)


Predict sample → associated numerical value (variable)

e.g. distance to target based on shape features

Linear and non-linear attribute to value relationships


Association & clustering

grouping a set of instances by attribute similarity

e.g. image segmentation
[Ess et al, 2009]

BMVA Computer Vision


Summer School 2018 Deep Learning : 10
Simple Regression Example – Head Pose Estimation

[ video ]


Input: image features (HOG)

Output: { yaw | pitch }


varying illumination + vibration
http://www.youtube.com/embed/UcF_otQSMEc?rel=0
[Walger / Breckon, 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 11
Complex Regression Example – Full-Body Pose Estimation


Input: raw image

Output: 17 pose keypoints
PoseNet (Google Research):
https://github.com/tensorflow/tfjs-models/tree/master/posenet

Live demo (browser):


https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html

BMVA Computer Vision [Papandreou et al. 2018]


Summer School 2018 Deep Learning : 12
The move from shallow to deep …

pre deep learning / pre abyssi (< ~2013)   →   post deep learning / post abyssi (> ~2013)

BMVA Computer Vision


Summer School 2018 Deep Learning : 13
Traditional (shallow) Approaches

Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class

• Features are not learned

vectors of shape measures, edge distributions, colour distributions, feature points, HOG, visual words, ... etc.

… i.e. calculated summary “numerical” descriptors
.. see extra slides

• Trainable classifier is often generic


(e.g. SVM kernel, Decision Forest)
BMVA Computer Vision
Summer School 2018 Deep Learning : 14
Deep Learning – end to end approaches

• Learn a feature hierarchy all the way from pixels (or voxels)
to classifier
• Each layer extracts “features” from the output of previous
layer
• Layers have similar structure, performing varying functions
• Train (i.e. optimize) all layers jointly

Image / Video Pixels → Layer 1 → Layer 2 → Layer 3 → Classifier

BMVA Computer Vision


Summer School 2018 Deep Learning : 15
“Shallow” vs. “deep” architectures

Traditional recognition: “Shallow” architecture

Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class (or output prediction)

(modern) Deep learning: “Deep” architecture

Image / Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class (or output prediction)

BMVA Computer Vision


Summer School 2018 Deep Learning : 16
A typical end-to-end deep learning
convolutional neural network ….

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

• “AlexNet”: seminal ImageNet Challenge winner (2012)


• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• More data (10^6 vs. 10^3 images)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Algorithms: better regularization for training (DropOut)
BMVA Computer Vision
Summer School 2018 Deep Learning : 17
The Image Classification Challenge (ImageNet):
1,000 object classes
1,431,167 images

[ figure: % error of challenge entries over the years vs. human performance – Russakovsky et al. IJCV 2015; example class: “Steel drum” ]

[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 18
AlexNet’s performance on this
benchmark task was the research
event (“discovery”) that led to the
shallow to deep transformation in
computer vision ...*

*
although CNN were not created overnight, and have their origins in [LeCun, 1998] among others

BMVA Computer Vision


Summer School 2018 Deep Learning : 19
pre abyssi → post abyssi*

*
here meaning post (after) the “discovery” of deep learning by the computer vision research community

Note: we are currently in what may become known as the “deep-age”, if not perhaps the “dark-age” (?) of computer vision (see final slides)
BMVA Computer Vision
Summer School 2018 Deep Learning : 20
This talk ...
[ itself a shallow overview of deep learning approaches ]

Is about …
– understanding the core concepts, well
– bringing everyone up to speed on where we are and how we got here
  • re: deep learning
– understanding the limitations of current understanding
  + the challenges that lie ahead

Is not about …
– how to use {tensorflow | pytorch | keras | mxnet …. tensor-py-flow-net (!?) … }
– hyper-parameter tuning
– specific advanced concepts that have been published on arXiv in the time I have been talking …

… sorry.

BMVA Computer Vision
Summer School 2018 Deep Learning : 21
Deep learning can do some clever stuff ...
[ figure: network architecture – generators GA->B and GB->C with discriminators DA, DB, DC, trained with adversarial (ladv) and reconstruction (lrec) losses; at test time Input RGB → Restyled RGB → Output Depth ]
[ video ]

e.g. Monocular depth prediction via style transfer
[Atapour / Breckon, CVPR 2018] - https://github.com/atapour/monocularDepth-Inference
BMVA Computer Vision
Summer School 2018 Deep Learning : 22
Key question – why does this stuff
work so well ?
(let’s examine some of the fundamentals)

BMVA Computer Vision


Summer School 2018 Deep Learning : 23
Key Principle

Each layer in a deep network performs a different transformation (function) to map from input to output – these vary from {convolution, pooling, sub-sampling, non-linear mapping (fully connected), ….}.

Within a traditional (shallow) Neural Network, the perceptron activation functions are all the same (but we vary the weights).

This provides the network with a larger parameterization space to represent (complex) input to output relationships.

[ diagram: Input Image → Convolution (Learned) → Non-linearity → Pooling → Feature maps → …. → Final Classification ]

BMVA Computer Vision


Summer School 2018 Deep Learning : 24
The rise of the ….

Convolutional Neural Networks (CNN)


• Multi-layer Neural network with:
  - Local connectivity
  - Shared weight parameters across spatial positions
• Stack multiple stages of feature extractors operating directly on the image
• Higher stages compute more global, more invariant feature representation
• Final classification layer at the end

Task: digit recognition

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.

BMVA Computer Vision


Summer School 2018 Deep Learning : 25
Clarification:

Convolutional Neural Networks (CNN)


(i.e. networks using convolution layers)

are a subset of the generalized


Deep Learning extension to
Neural Networks

(i.e. Deep Neural Networks)


CNNs are specific to images (or similar densely sampled signals)**
** we'll concentrate on those here
A deep multi-layer architecture ...

BMVA Computer Vision


Summer School 2018 Deep Learning : 27
Convolutional Neural Networks (CNN)

• Feed-forward network:
– Convolve input (feature extraction)
– Non-linearity (rectified linear)
– Pooling (local max)

Supervised learning (with labels)

Train convolution filters by backpropagation

[ diagram: Input Image → Convolution (Learned) → Non-linearity → Pooling → Feature maps → …. → Final Classification ]
Example: LeNet—5 (http://book.paddlepaddle.org/03.image_classification/ )

BMVA Computer Vision


Summer School 2018 Deep Learning : 28
Convolution Layer(s)

Convolution in the first layer


immediately reduces the
complexity of the input in a
structured and meaningful way
(based on learnt weights)

Can share parameters across filters to reduce the size of the parameter set

BMVA Computer Vision


Summer School 2018 Deep Learning : 29
Convolution Layer(s)

• Convolutional
– dependencies are local
– translation invariance
– few parameters (filter weights)
– filter stride can be > 1 (faster, less memory)

Input (image) → Intermediate Feature Map


BMVA Computer Vision
Summer School 2018 Deep Learning : 30
RECAP: from Low Level Vision

Aside : image convolution


(in general image processing)


… is essentially the localised weighted sum of the image and a convolution kernel (mask weights) over an N x M pixel neighbourhood, at a given location within the image (x,y).

[ figure: input image → convolution (smoothing kernel) → output image – used in image filtering operations (e.g. smoothing); image source: developer.apple.com ]

BMVA Computer Vision


Summer School 2018 Deep Learning : 31
RECAP: from Low Level Features

Convolution is very powerful ...


original filter (3 x 3)
blur

1 1 1
1 1 1
1 1 1
sharpen

0 -1 0
-1 5 -1
0 -1 0
edges

1 2 1
0 0 0
-1 -2 -1

Different weights have a wide range of effects on input data ….


BMVA Computer Vision
Summer School 2018 Deep Learning : 32
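A minimal numpy sketch of this weighted-sum convolution, using the blur / sharpen / edge kernels above (an assumed toy example; strictly a cross-correlation, since CNNs usually drop the kernel flip of true convolution):

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))   # "valid" region, no padding
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)       # localised weighted sum
    return out

image = np.random.rand(8, 8)                                             # toy 8x8 grey-scale image
blur    = np.ones((3, 3)) / 9.0                                          # normalised box blur
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
edges   = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])                 # Sobel (horizontal edges)

for name, k in [("blur", blur), ("sharpen", sharpen), ("edges", edges)]:
    print(name, convolve2d(image, k).shape)                              # each yields a 6x6 filtered output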
hence → multiple layers of
convolution
(“convolutions upon convolutions”)

can approximate and provide most


feature extraction and image pre-
filtering (de-noising) approaches ...
BMVA Computer Vision
Summer School 2018 Deep Learning : 33
Convolution Layer(s)

Produces a structured intermediate feature map from the input image (or previous layer in the network)

[ diagram: input (image / layer) → convolution filter → output (feature map), repeated for each learned filter ]

BMVA Computer Vision


Summer School 2018 Deep Learning : 34
Non-linearity Layer(s)
(a.k.a. fully connected)


Provides a non-linear input to output mapping via a (traditional) activation function approach

[ diagram: previous layer (size N) → layer of neurons → next layer (size M) ]

– may be either sub-sampling (N → M) or fully connected (N → M, N=M) via a per-element activation function, e.g.
• Tanh
• Sigmoid: 1/(1+exp(-x))
• Rectified linear
(most common)
» Simplifies backprop
» Makes learning faster
» Avoids output saturation
issues
BMVA Computer Vision
Summer School 2018 Deep Learning : 35
Pooling layer(s)

Pools the input layer to form new intermediate output layer.

By “pooling” (e.g., taking max / sum) filter responses at different


locations we gain robustness to the variance of the spatial
location of features and reduce input dimensionality.

BMVA Computer Vision


Summer School 2018 Deep Learning : 36
Pooling layer(s)

• Performs localized sum() or max() over sub windows/regions

non-overlapping vs. overlapping regions

Role of pooling:
– Invariance to small transformations
– Larger receptive fields (see more of input)

[ diagram: max() and sum() pooling over feature map regions ]

BMVA Computer Vision


Summer School 2018 Deep Learning : 37
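A minimal numpy sketch of non-overlapping 2x2 max() pooling (an assumed toy example):

import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                           # crop to a multiple of the pool size
    fm = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return fm.max(axis=(1, 3))                                  # max over each non-overlapping window

fm = np.arange(16, dtype=float).reshape(4, 4)                   # toy 4x4 feature map
print(max_pool(fm))                                             # -> [[ 5.  7.] [13. 15.]] : spatial size halved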
CNN – example architecture (various layers)

Example CNN design:


https://sites.google.com/site/5kk73gpu2013/assignment/cnn


Number of layers, type of layer, number of nodes/maps per
layer – all down to the (human) designer
– many variants have emerged

CNN training – efficient backpropagation
as per traditional neural network approaches (with Dropout)
BMVA Computer Vision
Summer School 2018 Deep Learning : 38
Seminal
Deep Learning Architectures

ImageNet Classification with Deep Convolutional Neural Networks


A. Krizhevsky, I. Sutskever, G. Hinton (2012) - “AlexNet”

Going Deeper with Convolutions, C Szegedy et al (2014) - “GoogLeNet”

BMVA Computer Vision


Summer School 2018 Deep Learning : 39
→ Contemporary architectures ….

VGG -16 … (deeper, smaller 3x3 convolutions throughout)

Simonyan & Zisserman, 2014: http://www.robots.ox.ac.uk/~vgg/research/very_deep/


ResNet …(residual blocks connecting input to the output)

K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
Figures: http://book.paddlepaddle.org/03.image_classification/

BMVA Computer Vision


…. and more and more ...
Summer School 2018 Deep Learning : 40
Varying Deep CNN Architectures
(and applications) all based on ...

• Feed-forward network:
– Convolve input (feature extraction)
– Non-linearity (rectified linear)
– Pooling (local max)

… all trained by backpropagation

Repeated multiple times over network architecture

[ diagram: Input → Convolution (Learned) → Non-linearity → Pooling → Feature maps → …. → Final Classification ]

Example: LeNet—5 (http://book.paddlepaddle.org/03.image_classification/ )

BMVA Computer Vision


Summer School 2018 Deep Learning : 41
ASIDE: deep learners should know this

Train via Backpropagation


Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.


Neural Network Training: weight modifications are made in
the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence
“Backpropagation”


Key Algorithmic Steps
– Initialize weights (to small random values*) in the network
– Propagate the inputs forward
• (by applying activation function) at each node
– Backpropagate the error backwards
• (by updating weights and biases)
– Terminating condition
• when validation error is very small or enough iterations
Backpropagation details beyond scope/time (see accompanying reading)
.. see extra slides
BMVA Computer Vision
Summer School 2018 Deep Learning : 42
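A minimal numpy sketch of the backpropagation steps above, for a tiny one-hidden-layer network on assumed toy data (illustration only, not the lecture's own code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                      # 100 toy samples, 2 features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]     # toy binary labels

W1, b1 = 0.1 * rng.standard_normal((2, 8)), np.zeros(8)    # 1. initialise weights to small random values
W2, b2 = 0.1 * rng.standard_normal((8, 1)), np.zeros(1)
lr = 0.5

for it in range(2000):
    h = np.maximum(0, X @ W1 + b1)                     # 2. propagate forward: hidden layer (ReLU)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))           #    output node (sigmoid)
    d_out = (p - y) / len(X)                           # 3. backpropagate the error (chain rule)
    dW2, db2 = h.T @ d_out, d_out.sum(0)
    d_h = (d_out @ W2.T) * (h > 0)                     #    through the ReLU
    dW1, db1 = X.T @ d_h, d_h.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1                     #    update weights and biases
    W2 -= lr * dW2; b2 -= lr * db2
    # 4. a terminating condition would normally watch a *validation* loss here

print("final training accuracy:", ((p > 0.5) == y).mean())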
So overall deep network layers
provide …
feature extraction
data de-noising
dimensionality reduction
feature pooling
spatial invariance
non-linear input → output mapping

… specifically optimized towards a


given problem *
BMVA Computer Vision (* that is represented by a given set of defined examples)
Summer School 2018 Deep Learning : 43
… which really covers most desirable
aspects for most computer vision
problems we encounter.
[if you think about it]

BMVA Computer Vision


Summer School 2018 Deep Learning : 44
Key question – is it really that
easy ?
(images in → results out ?!?)

BMVA Computer Vision


Summer School 2018 Deep Learning : 45
Is it really that simple ?

https://www.youtube.com/watch?v=mxKlUO_tjcg [ video ]

BMVA Computer Vision Cao et al, CVPR, 2017


Summer School 2018 https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation Deep Learning : 46
Remember, it's all about ..

BMVA Computer Vision


Summer School 2018 Deep Learning : 47
Learning from the Data

Training data: used to train the system
– i.e. build the rules / learnt target function
– split into training (back-propagated) and validation (when to stop back-propagating)
– specific examples (used to learn)

Test data: used to test performance of the system
– unseen by the system during training
– specific examples (used to evaluate)

BMVA Computer Vision


Summer School 2018 e.g. face gender classification Deep Learning : 48
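A minimal sketch of such a training / validation / test split (the 70/15/15 proportions are an assumption for illustration):

import numpy as np

def split_data(X, y, train=0.7, val=0.15, seed=42):
    idx = np.random.default_rng(seed).permutation(len(X))    # shuffle once, reproducibly
    n_train, n_val = int(train * len(X)), int(val * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X, y = np.random.rand(100, 8), np.random.randint(0, 2, 100)   # stand-in data
train_set, val_set, test_set = split_data(X, y)
# training set   -> back-propagated
# validation set -> decides when to stop back-propagating
# test set       -> unseen during training, used only for final evaluation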
Simple ? - Well almost …..

provided we avoid
the pitfalls on the way

(i.e. follow good practice and do good science)


.. see extra slides

BMVA Computer Vision


Summer School 2018 Deep Learning : 49
We must avoid over-fitting …..

(i.e. over-learning)

BMVA Computer Vision


Summer School 2018 Deep Learning : 50
Principle of Occam's Razor

Occam's Razor

“entia non sunt multiplicanda praeter
necessitatem” (latin!)

“entities should not be multiplied beyond
necessity” (english)


“All things being equal, the simplest
solution tends to be the best one”


For Machine Learning : prefer the
simplest {model | hypothesis | …. tree | 14th-century English logician
projection | network } that fits the data William of Ockham

BMVA Computer Vision


Summer School 2018 Deep Learning : 51
Graphical Example: function approximation (via regression)

Degree of Polynomial Model

Function f()

Learning Model
(approximation of f())

Training Samples
(from function)

Source: [PRML, Bishop, 2006]

BMVA Computer Vision


Summer School 2018 Deep Learning : 52
Increased Complexity

Function f()

Learning Model
(approximation of f())

Training Samples
(from function)

Source: [PRML, Bishop, 2006]

BMVA Computer Vision


Summer School 2018 Deep Learning : 53
Increased Complexity
Good Approximation

Function f()

Learning Model
(approximation of f())

Training Samples
(from function)

Source: [PRML, Bishop, 2006]

BMVA Computer Vision


Summer School 2018 Deep Learning : 54
Over-fitting!

Function f()

Learning Model
(approximation of f())

Training Samples
(from function)

Poor approximation
Source: [PRML, Bishop, 2006]
(as the model M=9 is not the simplest that fits the data!)
BMVA Computer Vision
Summer School 2018 Deep Learning : 55
How to spot over-fitting ...

As model complexity or the number of training iterations increases:
– performance on the training data improves
– performance on the unseen test data decreases


BMVA Computer Vision
Summer School 2018 Deep Learning : 56
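A minimal numpy sketch of spotting over-fitting, mirroring the polynomial example above – as the degree M grows, training error keeps falling while error on unseen data rises:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)    # 10 noisy training samples
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                           # the true underlying f()

for degree in [1, 3, 9]:                                      # increasing model complexity
    coeffs = np.polyfit(x, y, degree)                         # fit polynomial of degree M
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"M={degree}:  train error {train_err:.4f}   test error {test_err:.4f}")
# typically: M=9 drives the training error to ~0 while the unseen-data error blows up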
Key question – what about all this
terminology ?
( the “language” of the abyssi era )

(the basis of deep learning; things to know + understand)

BMVA Computer Vision


Summer School 2018 Deep Learning : 57
Loss what ?

BMVA Computer Vision


Summer School 2018 Deep Learning : 58
Loss Functions: how good is our net ?

A loss function tells how good our current classifier is.

Suppose: 3 training examples, 3 classes. With some parameters W, the scores f(x, W) are
(rows = class scores for cat / car / frog, columns = the 3 training examples):

cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1

Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is its (integer) label,
the loss over the dataset is a sum (mean) of the loss over the examples:

L = (1/N) Σ_i L_i( f(x_i, W), y_i )

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 59
Loss Functions: how good is our net ?

Multi-class Support Vector Machine (SVM) loss:

Suppose: 3 training examples, 3 classes. With some parameters W, the scores are:

cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label,
and using the shorthand s = f(x_i, W) for the scores vector,
the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 60
Loss Functions: how good is our net ?

Multi-class SVM loss = “Hinge loss”

Suppose: 3 training examples, 3 classes. With some parameters W, the scores are:

cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1

the hinge loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 61
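A minimal numpy sketch of the multi-class SVM (hinge) loss above, applied to the three example score columns on the slide:

import numpy as np

scores = np.array([[3.2, 1.3, 2.2],      # class scores f(x, W): rows = classes (cat/car/frog),
                   [5.1, 4.9, 2.5],      # columns = the 3 training examples
                   [-1.7, 2.0, -3.1]])
labels = np.array([0, 1, 2])             # correct class per example: cat, car, frog

def svm_loss(scores, labels, margin=1.0):
    correct = scores[labels, np.arange(scores.shape[1])]         # s_{y_i} for each example
    margins = np.maximum(0, scores - correct + margin)            # max(0, s_j - s_{y_i} + 1)
    margins[labels, np.arange(scores.shape[1])] = 0               # skip the j == y_i term
    return margins.sum(axis=0)                                    # L_i per example

print(svm_loss(scores, labels))           # -> [ 2.9  0.  12.9 ]  (mean over examples ≈ 5.27)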
Regularization: prevent the net over-fitting

L(W) = (1/N) Σ_i L_i( f(x_i, W), y_i )  +  λ R(W)

λ = regularization strength (hyperparameter)

Data loss: model predictions should match training data.
Regularization: prevent the model from doing too well on training data (i.e. overfitting)
→ Occam’s Razor

Simple examples:
L2 regularization:        R(W) = Σ_k Σ_l (W_{k,l})²
L1 regularization:        R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2):    R(W) = Σ_k Σ_l ( β (W_{k,l})² + |W_{k,l}| )

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 62
Regularization: prevent the net over-fitting

L(W) = (1/N) Σ_i L_i( f(x_i, W), y_i )  +  λ R(W)

λ = regularization strength (hyperparameter)

Data loss: model predictions should match training data.
Regularization: prevent the model from doing too well on training data (i.e. overfitting)

Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 63
e.g. L2 Regularization: how and why ?

L2 regularization:  R(W) = Σ_k Σ_l (W_{k,l})²

Expresses a preference for weight equality: L2 regularization likes to “spread out” the weights,
where several W may have the same classification result.

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 64
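A small numpy illustration of this “spread the weights out” preference (the particular x, w1, w2 vectors are an assumed toy example – both weight vectors give the same score, but L2 regularization prefers the spread-out one):

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])           # puts all the weight on one input
w2 = np.array([0.25, 0.25, 0.25, 0.25])       # spreads the weight out

print(w1 @ x, w2 @ x)                         # same score (1.0, 1.0) -> same data loss
print((w1 ** 2).sum(), (w2 ** 2).sum())       # R(W): 1.0 vs. 0.25 -> L2 prefers w2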
Softmax(): mapping output activation scores to probabilities

Want to interpret raw CNN output scores as probabilities: scores → probabilities, via the softmax() function:

P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j),   with s = f(x_i, W)

Probabilities must be >= 0 and must sum to 1:

        scores     exp      normalize → probabilities
cat      3.2       24.5     0.13
car      5.1      164.0     0.87
frog    -1.7        0.18    0.00

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 65
Softmax(): mapping output activation scores to probabilities

        scores     exp      normalize → probabilities     correct probabilities
cat      3.2       24.5     0.13                           1.00
car      5.1      164.0     0.87                           0.00
frog    -1.7        0.18    0.00                           0.00

Compare the predicted probabilities against the correct (one-hot) probabilities:

Cross-entropy loss (difference in correct class probability):   L_i = -log P(Y = y_i | X = x_i)

[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
BMVA Computer Vision
Summer School 2018 Deep Learning : 66
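A minimal numpy sketch of softmax() plus cross-entropy for the cat/car/frog scores above (taking the first example, correct class “cat”):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])                  # raw CNN output scores
exp_scores = np.exp(scores)                          # -> [ 24.5, 164.0, 0.18 ]
probs = exp_scores / exp_scores.sum()                # normalize -> [0.13, 0.87, 0.00]
correct_class = 0                                    # "cat"
loss = -np.log(probs[correct_class])                 # cross-entropy loss ≈ 2.04
print(probs, loss)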
… in every node: activation functions

BMVA Computer Vision


Summer School 2018 Deep Learning : 67
[ diagram: output of layer N-1 → activation function → input to layer N+1 ]

Common choices: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

BMVA Computer Vision


Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 68
… in every node: activation functions
[ diagram: output of layer N-1 → activation function → input to layer N+1 ]

- Use ReLU. Be careful with your learning rates


- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

BMVA Computer Vision


Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 69
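Minimal numpy definitions of the activation functions listed above (a sketch for illustration):

import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0.0, x)
def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0):  return np.where(x > 0, x, a * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x))      # note how sigmoid/tanh saturate while ReLU simply zeroes negatives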
Activation Functions – ReLU variations

Leaky ReLU:  f(x) = max(0.01·x, x)    [Maas et al., 2013]

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not “die” (like ReLU → 0 output)

also Parametric Rectifier (PReLU):  f(x) = max(α·x, x)    [He et al., 2015]
– where α is a backprop-learned parameter
BMVA Computer Vision
Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 70
… but how many network
architectures are we training?

(are all the node weights updated in each backpropagation cycle?)

BMVA Computer Vision


Summer School 2018 Deep Learning : 71
Dropout: efficient and robust training
for large neural nets (Hinton et al., 2012 http://arxiv.org/abs/1207.0580)


Consider a neural net with H hidden units


Each time we present a training example within backpropagation, we randomly omit each hidden unit in {all | some} layers with probability 0.5.

→ we are randomly sampling from 2^H different architectures

At test time – use all hidden units but halve all the outgoing weights
– computes an approximate mean of the predictions of all 2^H models.

BMVA Computer Vision


Summer School 2018 [ This slide – adapted from: G. Hinton] Deep Learning : 72
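A minimal numpy sketch of dropout on one hidden layer – shown here in the common “inverted” variant, which scales at training time rather than halving the weights at test time:

import numpy as np

p = 0.5                                                # keep probability
h = np.random.rand(1, 8)                               # some hidden layer activations

# training: randomly omit each hidden unit with probability 0.5
mask = (np.random.rand(*h.shape) < p) / p              # dividing by p = "inverted" dropout scaling
h_train = h * mask                                     # roughly half the units silenced per example

# test: use all hidden units unchanged (the 1/p scaling above already
# approximates the mean over the 2^H sampled architectures)
h_test = h
print(h_train, h_test)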
Get the setup correct ...

BMVA Computer Vision


Summer School 2018 Deep Learning : 73
Input pre-processing – image data

e.g. consider the CIFAR-10 dataset with [32,32,3] images

Zero-centre all the image pixel data inputs.
Otherwise – major issue! (backpropagation gradients are both +ve and -ve)

How:
- Subtract the mean image (e.g. AlexNet)
  (CIFAR - mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
  (CIFAR - mean along each channel = 3 numbers)

Not common to normalize variance, or to do PCA / whitening.

Remember to zero-centre inputs at run-time (test) also!
BMVA Computer Vision
Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 74
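A minimal numpy sketch of zero-centring CIFAR-10-shaped image data (array sizes reduced here purely for illustration):

import numpy as np

X_train = np.random.rand(1000, 32, 32, 3) * 255.0     # stand-in for CIFAR-10 style training images
X_test  = np.random.rand(200, 32, 32, 3) * 255.0

mean_image   = X_train.mean(axis=0)                   # [32,32,3] mean image   (AlexNet style)
channel_mean = X_train.mean(axis=(0, 1, 2))           # 3 per-channel means    (VGGNet style)

X_train_zc = X_train - mean_image                     # zero-centred training data
X_test_zc  = X_test - mean_image                      # subtract the *training* mean at test time too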
Data Augmentation – generate more data!

[ diagram: load image and label (“cat”) → transform image → CNN → compute loss ]

Use random mix/combinations of: flipping, translation, rotation, stretching, shearing, illumination changes (log/exp/gamma transform), lens distortions, … (go crazy!)

BMVA Computer Vision


Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 75
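A minimal numpy sketch of simple augmentation (random flip + illumination/gamma change; a real pipeline would add the other transforms listed above):

import numpy as np

rng = np.random.default_rng()

def augment(image):
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                      # random horizontal flip
    gamma = rng.uniform(0.7, 1.4)                      # random gamma (illumination) change
    return np.clip(image, 0.0, 1.0) ** gamma

img = np.random.rand(32, 32, 3)                        # stand-in for one training image in [0,1]
batch = [augment(img) for _ in range(8)]               # 8 randomised variants of the same image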
And finally ...

BMVA Computer Vision


Summer School 2018 Deep Learning : 76
Hyper-parameters: within deep learning

choices about the algorithm that we set rather than learn
– e.g.
• drop-out rate
• weight initialization
• backprop parameters ...
• …
• cross-validation folds (?)

https://chrisalbon.com/machine_learning/model_selection/hyperparameter_tuning_using_grid_search/


Highly problem-dependent.
– must explore the hyper-parameter
space (via glorified “trial and error”)
to find optimal set

BMVA Computer Vision


Summer School 2018 Deep Learning : 77
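A minimal sketch of hyper-parameter search by “glorified trial and error”; train_and_validate() is a hypothetical placeholder for your own training/validation loop:

import itertools

learning_rates = [1e-2, 1e-3, 1e-4]
dropout_rates  = [0.3, 0.5]

def train_and_validate(lr, dropout):                   # hypothetical placeholder: train a model ...
    return 0.0                                          # ... and return its validation accuracy

results = {(lr, dr): train_and_validate(lr, dr)
           for lr, dr in itertools.product(learning_rates, dropout_rates)}
best = max(results, key=results.get)
print("best (learning rate, drop-out):", best, "validation accuracy:", results[best])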
Key question – why didn’t we wake
up and realize this deep learning
stuff sooner ?
(I mean, look at all that pre abyssi work – ??? )

BMVA Computer Vision


Summer School 2018 Deep Learning : 78
Earlier shallow neural nets had some limitations


When do we terminate backpropagation ?

How do we select the parameters ?
– learning rate (weight up-dates)
– network topology (number of hidden nodes / number of layers)
– choice of activation function

How can we be sure what the network is learning ?
– How can we be sure the correct (classification) function is being learned ?
c.f. AI folk-lore “the tanks story”

Are the network weights optimal ?
– may be stuck in a local minimum in the weight space
BMVA Computer Vision
Summer School 2018 Deep Learning : 79
Key Enablers
(i.e. what changed to make all this happen)

As of now (~2012 → 2016+) we have three key things that made deep learning possible:

– data (lots of it available)
  ~14 million labeled images, 20k classes

– low-cost, high-performance GPU hardware
  (to train larger networks than before)

– key algorithmic insights to backpropagation training
  (to train networks more efficiently + guard against local minima and overfitting)

BMVA Computer Vision


Summer School 2018 Deep Learning : 80
But what about those over-fitting and
local minima problems?
(for deep networks)

BMVA Computer Vision


Summer School 2018 Deep Learning : 81
Neural Networks – a deep reprise

Recent work shows local minima are less of a problem than thought
(Pascanu et al., 2014, Dauphin et. al., 2014, Choromanska et al., 2015)

– local minima dominate low dimensions but saddle points (ridges)


dominate high dimensions
– most local minima are close to global minima in high dimensions
• … and deep neural networks use a very high dimensional weight space


Advances in training: use of drop-out to regularize the weights in the globally
connected layers (which contain most of the parameters)
– dropout: half of the hidden units in a layer are randomly removed for each
training example.
– effect: stops hidden units from relying too much on other hidden units
(hence reduces likelihood of over-fitting)

BMVA Computer Vision


Summer School 2018 Deep Learning : 82
But what if I have limited data
examples for my problem … ?
(“deep learning” needs “big data”)

BMVA Computer Vision


Summer School 2018 Deep Learning : 83
Transfer Learning

First train the network on a related task where sufficient data is available
– e.g. train on ImageNet for image classification

[Figure: Akcay / Breckon, 2017]


Use these weights from this training cycle as the initialization for a
second training cycle with the limited data from your task
– adjusting output layer for the number of classes
– e.g. limited images of rare tropical fish / x-ray images of guns

→ Essentially we transfer the knowledge from one task to the other


BMVA Computer Vision
Summer School 2018 Deep Learning : 84
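A minimal PyTorch / torchvision sketch of this recipe (assuming a recent torchvision; the class count of 6 and the choice of ResNet-18 are assumptions for illustration, not the lecture's own setup):

import torch
import torch.nn as nn
from torchvision import models

num_classes = 6                                                  # e.g. a small set of rare object categories
model = models.resnet18(weights="IMAGENET1K_V1")                 # 1. start from ImageNet pre-trained weights
model.fc = nn.Linear(model.fc.in_features, num_classes)          # 2. adjust the output layer for our classes

for name, param in model.named_parameters():                     # 3. (optionally) freeze early, generic layers
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=1e-3, momentum=0.9)
# ... then run a normal training loop on the (limited) data from your own task ...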
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015.

Image Captioning: CNN + RNN – CNN pretrained on ImageNet, word vectors pretrained with word2vec
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015.

BMVA Computer Vision


Summer School 2018 [ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 85
Example: X-ray image object detection

              Camera    Laptop     Gun    Gun Component    Knives    Ceramic Knives     mAP
AlexNet        97.23     99.70    97.30        89.64        93.19         94.50        95.26
GoogLeNet      97.14     92.56    99.50        97.70        95.50         98.40        98.40

Deep CNN (via transfer learning): Features → Classification (end to end)
– 95% (True+) over 6 object categories, FP (see above)
mAP = mean Average Precision (over all classes)

Transfer Learning Using Convolutional Neural Networks For Object Classification Within X-Ray Baggage Security Imagery (S. Akcay,
M.E. Kundegorski, M. Devereux, T.P. Breckon), In Proc. International Conference on Image Processing, IEEE, 2016.
But these deep networks seem to
take the whole image for
classification ?

What about detection ?

BMVA Computer Vision


Summer School 2018 Deep Learning : 87
Region-based CNN (R-CNN)

[Figure: Akcay / Breckon, 2017]


Learn both a Region Proposal Network (RPN)
– likely object locations, given the image

… and then classify those regions via existing CNN architecture (jointly trained)

References:
RCNN [Girshick et al. CVPR 2014]
Fast RCNN [Girshick, ICCV 2015]
Faster RCNN [Ren et al., 2015]
BMVA Computer Vision
Summer School 2018 https://www.youtube.com/watch?v=WZmSMkK9VuA Deep Learning : 88
Pedestrian detection with CNN

https://www.youtube.com/watch?v=uKU2pzpGUlM [Sermanet et al., 2013]


BMVA Computer Vision
Summer School 2018 Deep Learning : 89
Key question – how can we be sure
of what the network is learning ?
(remember – this was a problem in the good old days*)

*henceforth known as the pre abyssi era of computer vision

BMVA Computer Vision


Summer School 2018 Deep Learning : 90
… we can map network activation back to the input pixel space
• What input pattern originally caused a given activation in the feature maps?
– hence trace back through the network

A de-Convolution Network layer (left) is attached to a Convolution Neural Network layer (right)

A de-Convolution Network layer can be used to reconstruct an approximate version of the features from the layer beneath.
- hence we can recover an approximate feature visualization

Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 91
De-convolving CNNs
This allows us to “transpose” the architecture to go from activations back to the input image

[ diagram: forward network c1 → c2 → c3 → c4 → c5 → f6 → f7 → f8 → output, mirrored by transposed layers fT8 → fT7 → fT6 → cT5 → cT4 → cT3 → cT2 → cT1 back to the image ]

Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 92
Example CNN : Layer 2

Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 93
Example CNN : Layer 3

Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 94
So … we can see some internal
representations that may convince us that
deep networks are doing a good job

(after all they appear to be learning “the right stuff”**)

[** but we could be suffering from confirmation bias - “a tendency to search for or interpret
information in a way that confirms one's preconceptions” ]
(CNNs: we already think they are working, so therefore we look for evidence to confirm this)

BMVA Computer Vision


Summer School 2018 Deep Learning : 95
Are they always right ? - fooling CNNs

[ figure: original images (correctly predicted by CNN) + small changes (via JPEG DCT changes) → new images (incorrectly predicted by CNN) ]

Reference: Intriguing properties of neural networks [Szegedy ICLR 2014]

Press article: http://www.i-programmer.info/news/105-artificial-intelligence/7352-the-flaw-lurking-in-every-deep-neural-net.html
BMVA Computer Vision
Summer School 2018 Deep Learning : 96
What is happening here ?

Decision boundary and feature space representation that we
(like to) think the CNN has learned is in fact not optimal
– and in fact far from perfect

Vs.

Example Data (2 labels) Multiple decision boundaries exist – how do we know which we have ?


Key questions remain : what is being learnt ?
– and how confident can we be it is being learnt ?

BMVA Computer Vision


Summer School 2018 Source: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html Deep Learning : 97
… in today’s terminology we call these
adversarial examples

… which brings us on to a whole new sub-topic of deep learning

BMVA Computer Vision


Summer School 2018 Deep Learning : 98
Generative Adversarial Networks (GAN)
[so much to say, so little time]

[Goodfellow et al., 2014


https://arxiv.org/abs/1406.2661]

train two models:
– one to generate some sort of fake examples from random noise
(or some conditioned distribution)
– one to discern fake model examples from real examples

Many applications in improving deep network performance
BMVA Computer Vision
Summer School 2018 Deep Learning : 99
Many researchers increasingly see CNNs
(and wider deep learning techniques) as a
“black box” approach

We are only beginning to understand how they


work and why*

BMVA Computer Vision


Summer School 2018
* so are we perhaps still in the “dark age” of deep learning Deep Learning : 100
Hence network visualization is an
important research topic.
(we try to ascertain the decision boundary and feature representation in use)

CNN visualization attempts to


understand what is being learnt.

BMVA Computer Vision


Summer School 2018 Deep Learning : 101
Core message: CNNs, and deep learning,
have led the resurgence of Neural Networks
and generally now outperform other
methods in complex image classification
and many other tasks

However – clearly some limitations remain

BMVA Computer Vision


Summer School 2018 Deep Learning : 102
Are they the answer to all our
(computer vision) problems ?

BMVA Computer Vision


Summer School 2018 Deep Learning : 103
Beyond Classification ...
(applications in computer vision beyond classification include:
Detection, Segmentation, Regression, Pose estimation, Image Synthesis ...)

BMVA Computer Vision


Summer School 2018 Deep Learning : 104
Example: semantic pixel labelling via
SegNet ….


Application to Semantic Image Segmentation via CNN
– use of a complex network to perform per-pixel classification by
object type (i.e. semantic pixel labelling)
– Encoder ↔ Decoder architecture (but same concepts of layer types)
http://www.youtube.com/embed/e9bHTlYFwhg?rel=0

Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation." arXiv preprint arXiv:1511.00561, 2015. http://arxiv.org/abs/1511.00561

BMVA Computer Vision


Summer School 2018
http://mi.eng.cam.ac.uk/projects/segnet/ Deep Learning : 105
Labeling Pixels: Edge Detection

DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection


BMVA Computer Vision [Bertasius et al. CVPR 2015]
Summer School 2018 Deep Learning : 106
CNN as a Similarity Measure for Matching

Stereo vision matching [Zbontar and LeCun CVPR 2015]


Compare image patches [Zagoruyko and Komodakis 2015] Face detection/FaceNet
[Schroff et al. 2015]

Optic Flow - FlowNet [Fischer et al 2015]


Match ground and aerial images
BMVA Computer Vision [Lin et al. CVPR 2015]
Summer School 2018 Deep Learning : 107
CNN for Image Restoration/Enhancement

Image super-resolution Non-blind deconvolution (de-blurring)


[Dong et al. ECCV 2014] [Xu et al. NIPS 2014]

Non-uniform blur estimation - [Sun et al. CVPR 2015]


BMVA Computer Vision
Summer School 2018 Deep Learning : 108
CNN for Image Generation (synthesis)

Video: http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15/Generate_Chairs_mov_morphing.avi

Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]

BMVA Computer Vision


Summer School 2018 Deep Learning : 109
Using CNN activation(s) as features ….

[Donahue et al. ICML 2013]

CNN Features off-the-shelf:


an Astounding Baseline for Recognition
[Razavian et al. 2014]

BMVA Computer Vision


Summer School 2018 Deep Learning : 110
Recent trends ...

BMVA Computer Vision


Summer School 2018 Deep Learning : 111
Images ↔ text ….

[ figure: example image captions generated by a CNN + RNN model, grouped as “No errors”, “Minor errors” and “Somewhat related”:
“A white teddy bear sitting in the grass” / “A man in a baseball uniform throwing a ball” / “A woman is holding a cat in her hand” /
“A man riding a wave on top of a surfboard” / “A cat sitting on a suitcase on the floor” / “A woman standing on a beach holding a surfboard” ]

[Vinyals et al., 2015]  [Karpathy and Fei-Fei, 2015]

Captions generated by Justin Johnson using Neuraltalk2
All images are CC0 Public domain:
https://pixabay.com/en/luggage-antique-cat-1643010/
https://pixabay.com/en/teddy-plush-bears-cute-teddy-bear-1623436/
https://pixabay.com/en/surf-wave-summer-sport-litoral-1668716/
https://pixabay.com/en/woman-female-model-portrait-adult-983967/
https://pixabay.com/en/handstand-lake-meditation-496008/
https://pixabay.com/en/baseball-player-shortstop-infield-1045263/

BMVA Computer Vision


Summer School 2018
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 112
Style Transfer ….

BMVA Computer Vision


Summer School 2018
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung] Deep Learning : 113
Efficient Networks – MobileNets (et al.)

http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/


MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam
(Google)
https://arxiv.org/abs/1704.04861


MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
Liang-Chieh Chen
CVPR 2018
https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/3427.pdf

BMVA Computer Vision


Summer School 2018 Deep Learning : 114
So - Why does deep learning work
so well ?
(compared to other pre-abyssi techniques)

BMVA Computer Vision


Summer School 2018 Deep Learning : 115
My answer:

larger parameter space


optimized with more data
trained to avoid over-fitting

(a ~10 billion+ parameter space can possibly represent any other ML technique as a subset)

BMVA Computer Vision


Summer School 2018 Deep Learning : 116
Further Reading – post abyssi textbooks

Deep Learning - http://www.deeplearningbook.org/

Goodfellow / Bengio / Courville

MIT Press
2016

Available as HTML online


(free)

BMVA Computer Vision


Summer School 2018 Deep Learning : 117
Further Reading – pre abyssi textbooks

Bayesian Reasoning and Machine
Learning
– David Barber
http://www.cs.ucl.ac.uk/staff/d.barber/brml/
(Cambs. Univ. Press, 2012)


Computer Vision: Models, Learning,
and Inference
– Simon Prince
(Springer, 2012)
http://www.computervisionmodels.com/

… both very probability driven, both available as free PDF online


(woo, hoo!)

BMVA Computer Vision


Summer School 2018 Deep Learning : 118
Further Reading – key papers

Y. LeCun, Y. Bengio, and G. Hinton. "Deep learning." Nature 521.7553 (2015): 436.
http://www.csri.utoronto.ca/~hinton/absps/NatureDeepReview.pdf

Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural Networks 61
(2015): 85-117. (via http://arxiv.org/pdf/1404.7828 )

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional
Neural Networks, NIPS 2012 (http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf )

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks."
Computer vision–ECCV 2014. Springer International Publishing, 2014. 818-833.
http://ftp.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf


Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and
Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing
systems (pp. 2672-2680).
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region
proposal networks. InAdvances in neural information processing systems 2015 (pp. 91-99).
http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

BMVA Computer Vision


Summer School 2018 Deep Learning : 119
That's all folks ...
Slides, examples, demo code, links + extra supporting slides @ www.durham.ac.uk/toby.breckon/mltutorial/
BMVA Computer Vision Deep Learning : 120
Summer School 2018
