Computer Vision
a whirlwind tour of the key principles
Toby Breckon
Engineering and Computer Science
Durham University
www.durham.ac.uk/toby.breckon/mltutorial/ toby.breckon@durham.ac.uk
Slide material acknowledgements (some material): Lee (UC Davis), Grauman (UT Austin), Lazebnik (Illinois), Fei-Fei (Stanford),
Fergus (Stony Brook), Huang (Illinois), Lee (Michigan), Ranzato (Facebook A.I. Research), Sermanet (Google), Vedaldi (Oxford), Hinton (Toronto), Fisher (HIPR2, Edinburgh) + additional URL/acknowledgement on individual slides
Well-defined learning problems ?
– easy to learn vs. difficult to learn
… varying complexity of visual patterns
An example: learning to recognise objects ...
Image: DK
● A set of methods for the automated analysis of structure in data …. two main strands of work: (i) unsupervised learning and (ii) supervised learning.
Unsupervised
– no knowledge of output class or value
• data is unlabelled or value unknown
• Goal: determine data patterns/groupings
– Self-guided learning algorithm
• (internal self-evaluation against some criteria)
• e.g. k-means, genetic algorithms, clustering approaches …
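To make the unsupervised idea concrete, here is a minimal k-means sketch: the algorithm sees no output labels and self-evaluates via within-cluster distance. The "first and last sample" initialisation is a simplification assumed for this k = 2 illustration, not the standard random initialisation:

```python
import numpy as np

def kmeans2(X, iters=10):
    """Toy k-means for k = 2 clusters over rows of X."""
    centres = np.array([X[0], X[-1]], dtype=float)  # naive init for this sketch
    for _ in range(iters):
        # assignment step: each sample joins its nearest centre
        labels = np.argmin(((X[:, None, :] - centres) ** 2).sum(-1), axis=1)
        # update step: move each centre to the mean of its assigned samples
        centres = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels, centres

rng = np.random.default_rng(0)
# two well-separated 2-D groups (unlabelled, as far as the algorithm knows)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(10, 0.5, (5, 2))])
labels, centres = kmeans2(X)
print(labels)
```

The two groups are recovered purely from attribute similarity, with no class information supplied.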
BMVA Computer Vision
Summer School 2018 Deep Learning : 7
Pixels / Voxels / Samples → (some) Feature Representation (e.g. SIFT, HOG, histogram, Bag of Words, PCA ...) → Machine Learning = “Decision OR Prediction”
– classification outputs: { cat | dog | cow | car | rhino | … }
– regression outputs: { position | “style” | depth | … }
Object Classification
what object ?
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
Object Detection
{people | vehicle | … intruder ….}
object or no-object ?
Instance Recognition ? {face | vehicle plate| gait …. → biometrics}
who (or what) is it ?
Sub-category analysis
which object type ?
{gender | type | species | age …...}
Sequence { Recognition | Classification } ?
what is happening / occurring ?
Regression (traditionally less common in comp. vis.)
● Predict sample → associated numerical value (variable)
● e.g. distance to target based on shape features
● Linear and non-linear attribute-to-value relationships
Association & clustering
● grouping a set of instances by attribute similarity
● e.g. image segmentation
[Ess et al, 2009]
[ video ]
Input: image features (HOG)
Output: { yaw | pitch }
varying illumination + vibration
http://www.youtube.com/embed/UcF_otQSMEc?rel=0
[Walger / Breckon, 2014]
Input: raw image
Output: 17 pose keypoints
PoseNet (Google Research):
https://github.com/tensorflow/tfjs-models/tree/master/posenet
Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Learn a feature hierarchy all the way from pixels (or voxels)
to classifier
• Each layer extracts “features” from the output of previous
layer
• Layers have similar structure, performing varying functions
• Train (i.e. optimize) all layers jointly
Image / Video Pixels → Layer 1 → Layer 2 → Layer 3 → Classifier

Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class (or output prediction)

Image / Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class (or output prediction)
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]
AlexNet’s performance on this benchmark task was the research event (“discovery”) that led to the shallow-to-deep transformation* in computer vision ...
* although CNNs were not created overnight, and have their origins in [LeCun, 1998] among others
* here meaning post (after) the “discovery” of deep learning by the computer vision research community
Note: we are currently in what may become known as the “deep-age”, if not perhaps the “dark-age” (?) of computer vision (see final slides)
This talk ...
[ itself a shallow overview of deep learning approaches ]
Is about …
– understanding the core concepts well
– bringing everyone up to speed on where we are and how we got here
• re: deep learning
– understanding the limitations of current understanding

Is not about …
– how to use { tensorflow | pytorch | keras | mxnet | tensor-py-flow-net (!?) … }
– hyper-parameter tuning
– specific advanced concepts that have been published on arXiv in the time I have been talking …
[ Figure: image-to-image translation architecture for depth prediction – generators GA->B and GB->C with discriminators DA, DC; adversarial loss ladv and reconstruction loss lrec; training and testing pipelines shown with layer channel sizes 32 → 1024 ]
[ video ]
e.g. Monocular depth prediction via style transfer
[Atapour / Breckon, CVPR 2018] - https://github.com/atapour/monocularDepth-Inference
Key question – why does this stuff
work so well ?
(let’s examine some of the fundamentals)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Supervised learning (with labels): Input Image → Convolution (Learned) → Pooling → ….
Example: LeNet-5 (http://book.paddlepaddle.org/03.image_classification/ )
• Convolutional
– dependencies are local
– translation invariance
– few parameters (filter weights)
– filter stride can be > 1 (faster, less memory)
… is essentially the localised weighted sum of the image and a convolution kernel (mask weights) over an N × M pixel neighbourhood, at a given location within the image (x, y).
Input Image
smoothing:
 1  1  1
 1  1  1
 1  1  1
sharpen:
 0 -1  0
-1  5 -1
 0 -1  0
edges (horizontal):
 1  2  1
 0  0  0
-1 -2 -1
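The localised weighted sum above can be written directly as a few lines of NumPy. This is a sketch of valid-mode 2-D convolution (the loop form is for clarity, not speed):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution: localised weighted sum of the image
    and the kernel over an N x M neighbourhood at each location (x, y)."""
    k = kernel[::-1, ::-1]  # true convolution flips the kernel
    kh, kw = k.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # weighted sum over the N x M neighbourhood
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * k)
    return out

smooth = np.ones((3, 3)) / 9.0  # the mean-smoothing kernel above, normalised
image = np.arange(25, dtype=float).reshape(5, 5)
print(convolve2d(image, smooth))
```

In a CNN the kernel weights are not hand-designed like these but learned during training.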
Produces a structured intermediate feature map from the input image (or previous layer in the network)
Input: image / layer → Output: Feature Map
Provides a non-linear input-to-output mapping via a (traditional) activation function approach
Previous Layer (size N) → layer of neurons → Next Layer (size M)
– may be either sub-sampling (N → M) or fully connected (N → M, N=M) via per-element activation function, e.g.
• Tanh
• Sigmoid: 1/(1+exp(-x))
• Rectified linear
(most common)
» Simplifies backprop
» Makes learning faster
» Avoids output saturation
issues
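The three per-element activation functions listed above are one-liners in NumPy (a sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)         # rectified linear: cheap gradient, no saturation for x > 0

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # tanh squashes to (-1, 1)
print(sigmoid(x))
print(relu(x))
```

ReLU's constant gradient for positive inputs is why it simplifies backprop and avoids the output-saturation issues of tanh and sigmoid.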
Pooling layer(s)
max()
sum()
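A minimal max() pooling sketch over 2 × 2 regions (stride 2), assuming an input feature map whose sides divide evenly by the pool size:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling of a 2-D feature map."""
    h, w = fmap.shape
    # reshape into (h/size, size, w/size, size) blocks, take the max per block
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [4, 5, 6, 7]], dtype=float)
print(max_pool(fmap))  # -> [[4. 8.] [9. 7.]]
```

Replacing `max` with `sum` (or `mean`) in the last line of the function gives the sum/average-pooling variant.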
Number of layers, type of layer, number of nodes/maps per
layer – all down to the (human) designer
– many variants have emerged
CNN training – efficient backpropagation
as per traditional neural network approaches (with Dropout)
Seminal
Deep Learning Architectures
ResNet …(residual blocks connecting input to the output)
K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
Figures: http://book.paddlepaddle.org/03.image_classification/
Input → Convolution (Learned) → Non-linearity → ….
(repeated multiple times over the network architecture)
Neural Network Training: weight modifications are made in
the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence
“Backpropagation”
Key Algorithmic Steps
– Initialize weights (to small random values*) in the network
– Propagate the inputs forward
• (by applying activation function) at each node
– Backpropagate the error backwards
• (by updating weights and biases)
– Terminating condition
• when validation error is very small or enough iterations
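The algorithmic steps above can be sketched end-to-end on a toy problem. This is a one-hidden-layer network trained by backpropagation on XOR; all sizes and learning rates here are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

# 1. Initialise weights to small random values
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

initial_error = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for _ in range(10000):
    # 2. Propagate the inputs forward (applying the activation at each node)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 3. Backpropagate the error backwards (updating weights and biases)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

# 4. Terminate when the error is very small or after enough iterations
final_error = np.mean((out - y) ** 2)
print(initial_error, final_error)
```

Real CNN training follows the same loop, just with convolutional layers, far more parameters, and mini-batch stochastic gradient descent.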
Backpropagation details beyond scope/time (see accompanying reading)
.. see extra slides
So overall deep network layers
provide …
feature extraction
data de-noising
dimensionality reduction
feature pooling
spatial invariance
non-linear input → output mapping
https://www.youtube.com/watch?v=mxKlUO_tjcg [ video ]
Test data: used to test performance of the system
– unseen by the system during training
– specific examples (used to evaluate)
…. provided we avoid the pitfalls on the way (i.e. over-learning)
● “All things being equal, the simplest solution tends to be the best one”
– William of Ockham, 14th-century English logician
For Machine Learning: prefer the simplest { model | hypothesis | tree | projection | network | … } that fits the data
[ Figure sequence: curve fitting with models of increasing order; legend: Function f(), Learning Model (approximation of f()), Training Samples (from function) ]
Poor approximation
Source: [PRML, Bishop, 2006]
(as the model M=9 is not the simplest that fits the data!)
How to spot over-fitting ...
Performance on the training data improves
Performance on the unseen test data decreases
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes. With some parameters W, the scores are s = f(x, W).
Multi-class Support Vector Machine (SVM) loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, the loss is
L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes. With some parameters W, the scores are s = f(x, W).
Multi-class SVM loss = “Hinge loss”: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
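The hinge loss for a single example is a few lines of NumPy. The scores below are the illustrative three-class values from the CS231n material this slide adapts:

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multi-class SVM ('hinge') loss for one example with true class y."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class contributes no loss
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # class scores s = f(x, W)
print(svm_loss(scores, y=0))  # -> max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```

The loss is zero only once the correct class score exceeds every other score by at least the margin.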
Regularization: prevent the net over-fitting
L = (1/N) Σ_i L_i + λ R(W), where λ = regularization strength (hyperparameter)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
Regularization: prevent the net over-fitting
L = (1/N) Σ_i L_i + λ R(W), where λ = regularization strength (hyperparameter)
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
- Improve optimization by adding curvature
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
e.g. L2 Regularization: how and why ?
Expresses a preference for weight equality:
L2 Regularization
L2 regularization likes to
“spread out” the weights
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
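The "spread out" preference is easy to demonstrate: two weight vectors giving the same score can have very different L2 penalties (the values below follow the standard CS231n illustration):

```python
import numpy as np

def l2_penalty(W):
    """L2 regularization term R(W): sum of squared weights."""
    return np.sum(W ** 2)

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # concentrated weights
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # spread-out weights

print(w1 @ x, w2 @ x)                   # identical scores: 1.0 and 1.0
print(l2_penalty(w1), l2_penalty(w2))   # -> 1.0 vs 0.25: L2 prefers w2
```

Since both classifiers score the data identically, the regularizer alone breaks the tie, towards the weights that use all the input dimensions a little rather than one dimension a lot.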
Softmax(): mapping output activation
scores to probabilities
Want to interpret raw CNN output scores as probabilities:
scores → probabilities
use softmax()
function
Probabilities Probabilities
must be >= 0 must sum to 1
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
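The softmax() mapping itself is two lines; shifting by the maximum score first is a standard numerical-stability trick (it does not change the result):

```python
import numpy as np

def softmax(scores):
    """Map raw output scores to probabilities (>= 0, summing to 1)."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([3.2, 5.1, -1.7]))
print(p)  # all entries >= 0 and they sum to 1
```

Exponentiation makes every output positive; dividing by the sum normalises them into a probability distribution over the classes.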
… in every node: activation functions
[ Figure: output of layer N-1 → activation function → to layer N+1; example activations: tanh, Maxout, ReLU, ELU (some with a learned backprop parameter) ]
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
… but how many network
architectures are we training?
Consider a neural net with H hidden units
Each time we present a training example within backpropagation, we randomly omit each hidden unit in { all | some } layers with probability 0.5.
At test time – use all hidden units but halve all the outgoing weights
– computes an approximate mean of the predictions of all 2^H models.
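A minimal sketch of classic dropout as described above (activations and weights here are illustrative ones, chosen to make the arithmetic obvious):

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.ones(8)            # hidden-unit activations (H = 8)
W = np.full((8, 1), 1.0)  # outgoing weights

# Training: a random binary mask omits each hidden unit with probability 0.5
mask = rng.random(8) < 0.5
train_out = (h * mask) @ W

# Testing: all units active, outgoing weights halved, approximating the
# mean prediction of the 2^H thinned networks
test_out = h @ (W * 0.5)
print(train_out, test_out)  # test_out = 4.0 here (8 units x 1.0 x 0.5)
```

Modern frameworks usually implement the equivalent "inverted dropout", scaling activations up at training time instead so that test-time code is unchanged.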
Transform image
Use random mix/combinations of : flipping, translation,
rotation, stretching, shearing, illumination changes
(log/exp/gamma transform), lens distortions, …(go crazy!)
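A few of the transforms above, applied to a raw image array with plain NumPy (a sketch; real pipelines randomise the parameters per training example):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32))  # stand-in for a normalised grey-scale image

flipped = image[:, ::-1]                   # horizontal flip
shifted = np.roll(image, shift=3, axis=1)  # crude translation (wrap-around)
gamma   = np.clip(image, 0.0, 1.0) ** 0.5  # gamma (power-law) illumination change

print(flipped.shape, shifted.shape, gamma.shape)
```

Each transform yields a plausible extra training example with the same label, multiplying the effective size of a limited dataset.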
Highly problem-dependent.
– must explore the hyper-parameter
space (via glorified “trial and error”)
to find optimal set
When do we terminate backpropagation ?
How do we select the parameters ?
– learning rate (weight up-dates)
– network topology (number of hidden nodes / number of layers)
– choice of activation function
How can we be sure what the network is learning ?
– How can we be sure the correct (classification) function is being learned ?
c.f. AI folk-lore “the tanks story”
Are the network weights optimal ?
– may be in a local minimum in the weight space
Key Enablers
(i.e. what changed to make all this happens)
As of now (~2012 → 2016+) we have three key things that made deep learning possible:
Advances in training: use of drop-out to regularize the weights in the globally
connected layers (which contain most of the parameters)
– dropout: half of the hidden units in a layer are randomly removed for each
training example.
– effect: stops hidden units from relying too much on other hidden units
(hence reduces likelihood of over-fitting)
Use these weights from this training cycle as the initialization for a
second training cycle with the limited data from your task
– adjusting output layer for the number of classes
– e.g. limited images of rare tropical fish / x-ray images of guns
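The recipe above amounts to keeping the pretrained feature layers and re-initialising only the output layer for the new task's class count. A shape-level sketch (all sizes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights from the first (ImageNet) training cycle
pretrained = {
    "features_W": rng.normal(0, 0.01, (512, 256)),  # reused as-is, or fine-tuned
    "head_W":     rng.normal(0, 0.01, (256, 1000)), # old head: 1000 ImageNet classes
}

# Adjust the output layer for the number of classes in the limited-data task
n_new_classes = 6
pretrained["head_W"] = rng.normal(0, 0.01, (256, n_new_classes))

x = rng.normal(size=(1, 512))  # stand-in feature vector for one image
scores = (x @ pretrained["features_W"]) @ pretrained["head_W"]
print(scores.shape)  # one score per new class
```

The second training cycle then updates (at least) the new head using the small task-specific dataset, starting from the pretrained features rather than from random weights.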
Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning: CNN + RNN with word2vec
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015.
● Deep CNN (via transfer learning): Features → Classification (end to end)
– 95% (True+) over 6 object categories, FP (see above)
mAP = mean Average Precision (over all classes)
Transfer Learning Using Convolutional Neural Networks For Object Classification Within X-Ray Baggage Security Imagery (S. Akcay,
M.E. Kundegorski, M. Devereux, T.P. Breckon), In Proc. International Conference on Image Processing, IEEE, 2016.
But these deep networks seem to
take the whole image for
classification ?
Learn both a Region Proposal Network (RPN)
– likely object locations, given the image
… and then classify those regions via existing CNN architecture (jointly trained)
References:
– RCNN [Girshick et al., CVPR 2014]
– Fast RCNN [Girshick, ICCV 2015]
– Faster RCNN [Ren et al., 2015]
https://www.youtube.com/watch?v=WZmSMkK9VuA
Pedestrian detection with CNN
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
[ Figure: convolutional layers c1 … c5 and fully-connected layers f6 … f8 → output, mirrored by deconvolutional layers cT1 … cT5 and fT6 … fT8 for feature visualisation ]
[** but we could be suffering from confirmation bias - “a tendency to search for or interpret
information in a way that confirms one's preconceptions” ]
(CNNs: we already think they are working, so therefore we look for evidence to confirm this)
Example Data (2 labels): multiple decision boundaries exist – how do we know which we have ?
Key questions remain : what is being learnt ?
– and how confident can we be it is being learnt ?
Application to Semantic Image Segmentation via CNN
– use of a complex network to perform per-pixel classification by object type (i.e. semantic pixel labelling)
– Encoder ↔ Decoder architecture (but same concepts of layer types)
http://www.youtube.com/embed/e9bHTlYFwhg?rel=0
Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation." arXiv preprint arXiv:1511.00561, 2015. http://arxiv.org/abs/1511.00561
Video: http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15/Generate_Chairs_mov_morphing.avi
Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam
(Google)
https://arxiv.org/abs/1704.04861
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
Liang-Chieh Chen
CVPR 2018
https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/3427.pdf
(a ~10 billion+ parameter space can possibly represent any other ML technique as a subset)
MIT Press
2016
Computer Vision: Models, Learning,
and Inference
– Simon Prince
(Springer, 2012)
http://www.computervisionmodels.com/
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and
Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing
systems (pp. 2672-2680).
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region
proposal networks. InAdvances in neural information processing systems 2015 (pp. 91-99).
http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf