Computer Vision
a whirlwind tour of the key principles
Toby Breckon
Engineering and Computer Science
Durham University
www.durham.ac.uk/toby.breckon/mltutorial/ toby.breckon@durham.ac.uk
Slide material acknowledgements (some material): Lee (UC Davis), Grauman (UT Austin), Lazebnik (Illinois), Fei-Fei (Stanford),
Fergus (Stony Brook), Huang (Illinois), Lee (Michigan), Ranzato (Facebook A.I. Research), Sermanet (Google), Vedaldi (Oxford), Hinton (Toronto), Fisher (HIPR2, Edinburgh) + additional URL/acknowledgement on individual slides
Well-defined learning problems ?
– easy to learn vs. difficult to learn
… varying complexity of visual patterns
An example: learning to recognise objects ...
Image: DK
● A set of methods for the automated analysis of structure in data …. two main strands of work: (i) unsupervised learning and (ii) supervised learning.
Unsupervised
– no knowledge of output class or value
• data is unlabelled or value unknown
• Goal: determine data patterns/groupings
– Self-guided learning algorithm
• (internal self-evaluation against some criteria)
• e.g. k-means, genetic algorithms, clustering approaches …
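To make the unsupervised idea concrete, here is a minimal k-means sketch: the algorithm sees no output labels and self-evaluates via within-cluster distance. The "first and last sample" initialisation is a simplification assumed for this k = 2 illustration, not the standard random initialisation:

```python
import numpy as np

def kmeans2(X, iters=10):
    """Toy k-means for k = 2 clusters over rows of X."""
    centres = np.array([X[0], X[-1]], dtype=float)  # naive init for this sketch
    for _ in range(iters):
        # assignment step: each sample joins its nearest centre
        labels = np.argmin(((X[:, None, :] - centres) ** 2).sum(-1), axis=1)
        # update step: move each centre to the mean of its assigned samples
        centres = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels, centres

rng = np.random.default_rng(0)
# two well-separated 2-D groups (unlabelled, as far as the algorithm knows)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(10, 0.5, (5, 2))])
labels, centres = kmeans2(X)
print(labels)
```

The two groups are recovered purely from attribute similarity, with no class information supplied.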
BMVA Computer Vision
Summer School 2018 Deep Learning : 7
Pixels / Voxels / Samples → (some) Feature Representation (e.g. SIFT, HOG, histogram, Bag of Words, PCA ...) → Machine Learning = “Decision OR Prediction”
– classification outputs: { cat | dog | cow | car | rhino | … }
– regression outputs: { position | “style” | depth | … }
Object Classification
what object ?
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
Object Detection
{people | vehicle | … intruder ….}
object or no-object ?
Instance Recognition ? {face | vehicle plate| gait …. → biometrics}
who (or what) is it ?
Sub-category analysis
which object type ?
{gender | type | species | age …...}
Sequence { Recognition | Classification } ?
what is happening / occurring ?
Regression (traditionally less common in comp. vis.)
● Predict sample → associated numerical value (variable)
● e.g. distance to target based on shape features
● Linear and non-linear attribute-to-value relationships
Association & clustering
● grouping a set of instances by attribute similarity
● e.g. image segmentation
[Ess et al, 2009]
[ video ]
Input: image features (HOG)
Output: { yaw | pitch }
varying illumination + vibration
http://www.youtube.com/embed/UcF_otQSMEc?rel=0
[Walger / Breckon, 2014]
Input: raw image
Output: 17 pose keypoints
PoseNet (Google Research):
https://github.com/tensorflow/tfjs-models/tree/master/posenet
Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Learn a feature hierarchy all the way from pixels (or voxels)
to classifier
• Each layer extracts “features” from the output of previous
layer
• Layers have similar structure, performing varying functions
• Train (i.e. optimize) all layers jointly
Image / Video Pixels → Layer 1 → Layer 2 → Layer 3 → Classifier

Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class (or output prediction)

Image / Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class (or output prediction)
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]
AlexNet’s performance on this benchmark task was the research event (“discovery”) that led to the shallow-to-deep transformation* in computer vision ...
* although CNNs were not created overnight, and have their origins in [LeCun, 1998] among others
* here meaning post (after) the “discovery” of deep learning by the computer vision research community
Note: we are currently in what may become known as the “deep-age”, if not perhaps the “dark-age” (?) of computer vision (see final slides)
This talk ...
[ itself a shallow overview of deep learning approaches ]
Is about …
– understanding the core concepts well
– bringing everyone up to speed on where we are and how we got here
• re: deep learning
– understanding the limitations of current understanding

Is not about …
– how to use { tensorflow | pytorch | keras | mxnet | tensor-py-flow-net (!?) … }
– hyper-parameter tuning
– specific advanced concepts that have been published on arXiv in the time I have been talking …
[ Figure: image-to-image translation architecture for depth prediction – generators GA->B and GB->C with discriminators DA, DC; adversarial loss ladv and reconstruction loss lrec; training and testing pipelines shown with layer channel sizes 32 → 1024 ]
[ video ]
e.g. Monocular depth prediction via style transfer
[Atapour / Breckon, CVPR 2018] - https://github.com/atapour/monocularDepth-Inference
Key question – why does this stuff
work so well ?
(let’s examine some of the fundamentals)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Supervised learning (with labels): Input Image → Convolution (Learned) → Pooling → ….
Example: LeNet-5 (http://book.paddlepaddle.org/03.image_classification/ )
• Convolutional
– dependencies are local
– translation invariance
– few parameters (filter weights)
– filter stride can be > 1 (faster, less memory)
… is essentially the localised weighted sum of the image and a convolution kernel (mask weights) over an N × M pixel neighbourhood, at a given location within the image (x, y).
Input Image
smoothing:
 1  1  1
 1  1  1
 1  1  1
sharpen:
 0 -1  0
-1  5 -1
 0 -1  0
edges (horizontal):
 1  2  1
 0  0  0
-1 -2 -1
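The localised weighted sum above can be written directly as a few lines of NumPy. This is a sketch of valid-mode 2-D convolution (the loop form is for clarity, not speed):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution: localised weighted sum of the image
    and the kernel over an N x M neighbourhood at each location (x, y)."""
    k = kernel[::-1, ::-1]  # true convolution flips the kernel
    kh, kw = k.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # weighted sum over the N x M neighbourhood
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * k)
    return out

smooth = np.ones((3, 3)) / 9.0  # the mean-smoothing kernel above, normalised
image = np.arange(25, dtype=float).reshape(5, 5)
print(convolve2d(image, smooth))
```

In a CNN the kernel weights are not hand-designed like these but learned during training.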
Produces a structured intermediate feature map from the input image (or previous layer in the network)
Input: image / layer → Output: Feature Map
Provides a non-linear input-to-output mapping via a (traditional) activation function approach
Previous Layer (size N) → layer of neurons → Next Layer (size M)
– may be either sub-sampling (N → M) or fully connected (N → M, N=M) via per-element activation function, e.g.
• Tanh
• Sigmoid: 1/(1+exp(-x))
• Rectified linear
(most common)
» Simplifies backprop
» Makes learning faster
» Avoids output saturation
issues
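The three per-element activation functions listed above are one-liners in NumPy (a sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)         # rectified linear: cheap gradient, no saturation for x > 0

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # tanh squashes to (-1, 1)
print(sigmoid(x))
print(relu(x))
```

ReLU's constant gradient for positive inputs is why it simplifies backprop and avoids the output-saturation issues of tanh and sigmoid.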
Pooling layer(s)
max()
sum()
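A minimal max() pooling sketch over 2 × 2 regions (stride 2), assuming an input feature map whose sides divide evenly by the pool size:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling of a 2-D feature map."""
    h, w = fmap.shape
    # reshape into (h/size, size, w/size, size) blocks, take the max per block
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [4, 5, 6, 7]], dtype=float)
print(max_pool(fmap))  # -> [[4. 8.] [9. 7.]]
```

Replacing `max` with `sum` (or `mean`) in the last line of the function gives the sum/average-pooling variant.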
Number of layers, type of layer, number of nodes/maps per
layer – all down to the (human) designer
– many variants have emerged
CNN training – efficient backpropagation
as per traditional neural network approaches (with Dropout)
Seminal
Deep Learning Architectures
ResNet …(residual blocks connecting input to the output)
K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
Figures: http://book.paddlepaddle.org/03.image_classification/
Input → Convolution (Learned) → Non-linearity → ….
(repeated multiple times over the network architecture)
Neural Network Training: weight modifications are made in
the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence
“Backpropagation”
Key Algorithmic Steps
– Initialize weights (to small random values*) in the network
– Propagate the inputs forward
• (by applying activation function) at each node
– Backpropagate the error backwards
• (by updating weights and biases)
– Terminating condition
• when validation error is very small or enough iterations
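The algorithmic steps above can be sketched end-to-end on a toy problem. This is a one-hidden-layer network trained by backpropagation on XOR; all sizes and learning rates here are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

# 1. Initialise weights to small random values
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

initial_error = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for _ in range(10000):
    # 2. Propagate the inputs forward (applying the activation at each node)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 3. Backpropagate the error backwards (updating weights and biases)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

# 4. Terminate when the error is very small or after enough iterations
final_error = np.mean((out - y) ** 2)
print(initial_error, final_error)
```

Real CNN training follows the same loop, just with convolutional layers, far more parameters, and mini-batch stochastic gradient descent.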
Backpropagation details beyond scope/time (see accompanying reading)
.. see extra slides
So overall deep network layers
provide …
feature extraction
data de-noising
dimensionality reduction
feature pooling
spatial invariance
non-linear input → output mapping
https://www.youtube.com/watch?v=mxKlUO_tjcg [ video ]
Test data: used to test performance of the system
– unseen by the system during training
– specific examples (used to evaluate)
…. provided we avoid the pitfalls on the way (i.e. over-learning)
● “All things being equal, the simplest solution tends to be the best one”
– William of Ockham, 14th-century English logician
For Machine Learning: prefer the simplest { model | hypothesis | tree | projection | network | … } that fits the data
[ Figure sequence: curve fitting with models of increasing order; legend: Function f(), Learning Model (approximation of f()), Training Samples (from function) ]
Poor approximation
Source: [PRML, Bishop, 2006]
(as the model M=9 is not the simplest that fits the data!)
How to spot over-fitting ...
Performance on the training data improves
Performance on the unseen test data decreases
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes. With some parameters W, the scores are s = f(x, W).
Multi-class Support Vector Machine (SVM) loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, the loss is
L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes. With some parameters W, the scores are s = f(x, W).
Multi-class SVM loss = “Hinge loss”: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
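The hinge loss for a single example is a few lines of NumPy. The scores below are the illustrative three-class values from the CS231n material this slide adapts:

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multi-class SVM ('hinge') loss for one example with true class y."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class contributes no loss
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # class scores s = f(x, W)
print(svm_loss(scores, y=0))  # -> max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```

The loss is zero only once the correct class score exceeds every other score by at least the margin.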
Regularization: prevent the net over-fitting
L = (1/N) Σ_i L_i + λ R(W), where λ = regularization strength (hyperparameter)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Regularization: prevent the net over-fitting
L = (1/N) Σ_i L_i + λ R(W), where λ = regularization strength (hyperparameter)
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
e.g. L2 Regularization: how and why ?
Expresses a preference for weight equality:
L2 Regularization
L2 regularization likes to
“spread out” the weights
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
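The "spread out" preference is easy to demonstrate: two weight vectors giving the same score can have very different L2 penalties (the values below follow the standard CS231n illustration):

```python
import numpy as np

def l2_penalty(W):
    """L2 regularization term R(W): sum of squared weights."""
    return np.sum(W ** 2)

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # concentrated weights
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # spread-out weights

print(w1 @ x, w2 @ x)                   # identical scores: 1.0 and 1.0
print(l2_penalty(w1), l2_penalty(w2))   # -> 1.0 vs 0.25: L2 prefers w2
```

Since both classifiers score the data identically, the regularizer alone breaks the tie, towards the weights that use all the input dimensions a little rather than one dimension a lot.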
Softmax(): mapping output activation
scores to probabilities
Want to interpret raw CNN output scores as probabilities:
scores → probabilities
use softmax()
function
Probabilities Probabilities
must be >= 0 must sum to 1
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
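The softmax() mapping itself is two lines; shifting by the maximum score first is a standard numerical-stability trick (it does not change the result):

```python
import numpy as np

def softmax(scores):
    """Map raw output scores to probabilities (>= 0, summing to 1)."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([3.2, 5.1, -1.7]))
print(p)  # all entries >= 0 and they sum to 1
```

Exponentiation makes every output positive; dividing by the sum normalises them into a probability distribution over the classes.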
… in every node: activation functions
[ Figure: output of layer N-1 → activation function → to layer N+1; example activations: tanh, Maxout, ReLU, ELU (some with a learned backprop parameter) ]
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
… but how many network
architectures are we training?
Consider a neural net with H hidden units
Each time we present a training example within backpropagation, we randomly omit each hidden unit in { all | some } layers with probability 0.5.
At test time – use all hidden units but halve all the outgoing weights
– computes an approximate mean of the predictions of all 2^H models.
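A minimal sketch of classic dropout as described above (activations and weights here are illustrative ones, chosen to make the arithmetic obvious):

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.ones(8)            # hidden-unit activations (H = 8)
W = np.full((8, 1), 1.0)  # outgoing weights

# Training: a random binary mask omits each hidden unit with probability 0.5
mask = rng.random(8) < 0.5
train_out = (h * mask) @ W

# Testing: all units active, outgoing weights halved, approximating the
# mean prediction of the 2^H thinned networks
test_out = h @ (W * 0.5)
print(train_out, test_out)  # test_out = 4.0 here (8 units x 1.0 x 0.5)
```

Modern frameworks usually implement the equivalent "inverted dropout", scaling activations up at training time instead so that test-time code is unchanged.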
Transform image
Use random mix/combinations of : flipping, translation,
rotation, stretching, shearing, illumination changes
(log/exp/gamma transform), lens distortions, …(go crazy!)
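A few of the transforms above, applied to a raw image array with plain NumPy (a sketch; real pipelines randomise the parameters per training example):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32))  # stand-in for a normalised grey-scale image

flipped = image[:, ::-1]                   # horizontal flip
shifted = np.roll(image, shift=3, axis=1)  # crude translation (wrap-around)
gamma   = np.clip(image, 0.0, 1.0) ** 0.5  # gamma (power-law) illumination change

print(flipped.shape, shifted.shape, gamma.shape)
```

Each transform yields a plausible extra training example with the same label, multiplying the effective size of a limited dataset.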
Highly problem-dependent.
– must explore the hyper-parameter
space (via glorified “trial and error”)
to find optimal set
When do we terminate backpropagation ?
How do we select the parameters ?
– learning rate (weight up-dates)
– network topology (number of hidden nodes / number of layers)
– choice of activation function
How can we be sure what the network is learning ?
– How can we be sure the correct (classification) function is being learned ?
c.f. AI folk-lore “the tanks story”
Are the network weights optimal ?
– may be in a local minimum in the weight space
Key Enablers
(i.e. what changed to make all this happens)
As of now (~2012 → 2016+) we have three key things that made deep learning possible:
Advances in training: use of drop-out to regularize the weights in the globally
connected layers (which contain most of the parameters)
– dropout: half of the hidden units in a layer are randomly removed for each
training example.
– effect: stops hidden units from relying too much on other hidden units
(hence reduces likelihood of over-fitting)
Use these weights from this training cycle as the initialization for a
second training cycle with the limited data from your task
– adjusting output layer for the number of classes
– e.g. limited images of rare tropical fish / x-ray images of guns
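The recipe above amounts to keeping the pretrained feature layers and re-initialising only the output layer for the new task's class count. A shape-level sketch (all sizes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights from the first (ImageNet) training cycle
pretrained = {
    "features_W": rng.normal(0, 0.01, (512, 256)),  # reused as-is, or fine-tuned
    "head_W":     rng.normal(0, 0.01, (256, 1000)), # old head: 1000 ImageNet classes
}

# Adjust the output layer for the number of classes in the limited-data task
n_new_classes = 6
pretrained["head_W"] = rng.normal(0, 0.01, (256, n_new_classes))

x = rng.normal(size=(1, 512))  # stand-in feature vector for one image
scores = (x @ pretrained["features_W"]) @ pretrained["head_W"]
print(scores.shape)  # one score per new class
```

The second training cycle then updates (at least) the new head using the small task-specific dataset, starting from the pretrained features rather than from random weights.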
Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning: CNN + RNN with word2vec
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015.
● Deep CNN (via transfer learning): Features → Classification (end to end)
– 95% (True+) over 6 object categories, FP (see above)
mAP = mean Average Precision (over all classes)
Transfer Learning Using Convolutional Neural Networks For Object Classification Within X-Ray Baggage Security Imagery (S. Akcay,
M.E. Kundegorski, M. Devereux, T.P. Breckon), In Proc. International Conference on Image Processing, IEEE, 2016.
But these deep networks seem to
take the whole image for
classification ?
Learn both a Region Proposal Network (RPN)
– likely object locations, given the image
… and then classify those regions via existing CNN architecture (jointly trained)
References:
– RCNN [Girshick et al., CVPR 2014]
– Fast RCNN [Girshick, ICCV 2015]
– Faster RCNN [Ren et al., 2015]
https://www.youtube.com/watch?v=WZmSMkK9VuA
Pedestrian detection with CNN
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
[ Figure: convolutional layers c1 … c5 and fully-connected layers f6 … f8 → output, mirrored by deconvolutional layers cT1 … cT5 and fT6 … fT8 for feature visualisation ]
[** but we could be suffering from confirmation bias - “a tendency to search for or interpret
information in a way that confirms one's preconceptions” ]
(CNNs: we already think they are working, so therefore we look for evidence to confirm this)
Example Data (2 labels): multiple decision boundaries exist – how do we know which we have ?
Key questions remain : what is being learnt ?
– and how confident can we be it is being learnt ?
Application to Semantic Image Segmentation via CNN
– use of a complex network to perform per-pixel classification by object type (i.e. semantic pixel labelling)
– Encoder ↔ Decoder architecture (but same concepts of layer types)
http://www.youtube.com/embed/e9bHTlYFwhg?rel=0
Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation." arXiv preprint arXiv:1511.00561, 2015. http://arxiv.org/abs/1511.00561
Video: http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15/Generate_Chairs_mov_morphing.avi
Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam
(Google)
https://arxiv.org/abs/1704.04861
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
Liang-Chieh Chen
CVPR 2018
https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/3427.pdf
(a ~10 billion+ parameter space can possibly represent any other ML technique as a subset)
MIT Press
2016
Computer Vision: Models, Learning,
and Inference
– Simon Prince
(Springer, 2012)
http://www.computervisionmodels.com/
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and
Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing
systems (pp. 2672-2680).
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region
proposal networks. InAdvances in neural information processing systems 2015 (pp. 91-99).
http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf