Deep Learning
Sargur Srihari
Machine Learning Srihari
Topics
ML Technology
Limitations of Conventional ML
Representation learning
Image Example
• Input is an array of pixel values
• First stage detects the presence or absence of edges at particular locations and orientations in the image
• Second layer detects motifs by spotting particular arrangements of edges, regardless of small variations in edge positions
• Third layer assembles motifs into larger combinations that correspond to parts of familiar objects
• Subsequent layers would detect objects as combinations of these parts
• Key aspect of deep learning:
  • These layers of features are not designed by human engineers
  • They are learned from data using a general-purpose learning procedure
Supervised Learning
Gradient Descent
• The gradient vector ∇E(w^(τ)) consists of the vector of derivatives evaluated using back-propagation:

  ∇E(w) = dE/dw = [ ∂E/∂w^(1)_11, ..., ∂E/∂w^(1)_MD, ∂E/∂w^(2)_11, ..., ∂E/∂w^(2)_KM ]^T

• There are W = M(D+1) + K(M+1) elements in the vector, so ∇E(w^(τ)) is a W × 1 vector.
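As a check on the count above, W can be computed for hypothetical layer sizes (D = 4 inputs, M = 3 hidden units, K = 2 outputs are made-up values for illustration):

```python
def weight_count(D, M, K):
    """Parameters of a two-layer network with D inputs, M hidden units,
    K outputs, and a bias on every hidden and output unit."""
    # First layer: M units x (D weights + 1 bias); second layer: K x (M + 1).
    return M * (D + 1) + K * (M + 1)

# Hypothetical sizes: D=4 inputs, M=3 hidden units, K=2 outputs.
print(weight_count(D=4, M=3, K=2))  # 3*5 + 2*4 = 23
```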
Usually finds a good set of weights quickly compared to elaborate optimization techniques.
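The gradient descent update being referred to, w^(τ+1) = w^(τ) − η ∇E(w^(τ)), can be sketched as follows; the quadratic objective and learning rate are hypothetical choices for illustration:

```python
def sgd_step(w, grad, lr=0.1):
    """One gradient-descent update: w <- w - lr * dE/dw.
    w and grad are same-length lists of parameters and derivatives."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Toy objective E(w) = (w - 3)^2 with derivative dE/dw = 2(w - 3);
# repeated updates converge to the minimiser w = 3.
w = [0.0]
for _ in range(200):
    w = sgd_step(w, [2.0 * (w[0] - 3.0)], lr=0.1)
print(round(w[0], 4))  # 3.0
```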
Forward propagation: a_j = ∑_i w_ji z_i and z_j = h(a_j)
A Simple Example
• Two-layer network
• Standard sum-of-squared error:
  E_n = (1/2) ∑_{k=1..K} (y_k − t_k)^2
  where y_k is the activation of output unit k and t_k is the corresponding target for input x_n
• Output units: linear activation functions, i.e., multiple regression: y_k = a_k
• Hidden units have sigmoidal activation function h(a) = tanh(a), where
  tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
  with a simple form for the derivative: h′(a) = 1 − h(a)^2
• Forward propagation:
  z_j = tanh(a_j),   y_k = ∑_{j=0..M} w^(2)_kj z_j
• Output differences:
  δ_k = y_k − t_k
• Backward propagation (δ's for hidden units):
  δ_j = h′(a_j) ∑_k w_kj δ_k = (1 − z_j^2) ∑_{k=1..K} w_kj δ_k,   using h′(a) = 1 − h(a)^2
• Derivatives w.r.t. first-layer and second-layer weights:
  ∂E_n/∂w^(1)_ji = δ_j x_i,   ∂E_n/∂w^(2)_kj = δ_k z_j
• Batch method:
  ∂E/∂w_ji = ∑_n ∂E_n/∂w_ji
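The forward and backward passes above can be sketched in NumPy; the array shapes and names (W1, b1, W2, b2) are assumptions for illustration, not from the slides:

```python
import numpy as np

def forward_backward(x, t, W1, b1, W2, b2):
    """One forward/backward pass of the two-layer tanh network.
    Hypothetical shapes: x (D,), t (K,), W1 (M, D), b1 (M,), W2 (K, M), b2 (K,)."""
    # Forward propagation: a_j = sum_i w_ji x_i, z_j = tanh(a_j), y_k = sum_j w_kj z_j
    a = W1 @ x + b1
    z = np.tanh(a)
    y = W2 @ z + b2                              # linear output units
    E = 0.5 * np.sum((y - t) ** 2)               # sum-of-squares error E_n

    # Backward propagation
    delta_k = y - t                              # output differences
    delta_j = (1.0 - z ** 2) * (W2.T @ delta_k)  # hidden deltas, h'(a) = 1 - tanh(a)^2

    # Derivatives: dE_n/dw(2)_kj = delta_k * z_j, dE_n/dw(1)_ji = delta_j * x_i
    dW2, db2 = np.outer(delta_k, z), delta_k
    dW1, db1 = np.outer(delta_j, x), delta_j
    return E, dW1, db1, dW2, db2
```

Correctness of such a sketch can be verified with a finite-difference check on any single weight.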
• hyperbolic tangent,
  f(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)), and
• logistic function,
  f(z) = 1 / (1 + exp(−z))
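Both activation functions can be written directly from these formulas; the standard identity tanh(z) = 2·logistic(2z) − 1 shows that tanh is a shifted, rescaled (zero-centred) version of the logistic function:

```python
import math

def logistic(z):
    # f(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # f(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# Standard identity relating the two: tanh(z) = 2 * logistic(2z) - 1
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(tanh(z) - (2.0 * logistic(2.0 * z) - 1.0)) < 1e-12
```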
Neural Network with 2 Inputs, 2 Hidden Units, 1 Output Unit
[Figure: inputs x1, x2 feed hidden units y1, y2, which feed output z]
• Input data (red and blue curves) are not linearly separable
Selectivity-Invariance dilemma
Pre-training
• Unsupervised learning to create layers of feature detectors
• No need for labelled data
• Objective of learning in each layer of feature detectors:
  • Ability to reconstruct or model activities of feature detectors (or raw inputs) in the layer below
• By pre-training, the weights of a deep network could be initialized to sensible values
• A final layer of output units is added at the top of the network, and the whole deep system is fine-tuned using back-propagation
• Worked well in handwritten digit recognition
  • When data was limited
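A minimal sketch of the layer-wise objective, assuming a tied-weight autoencoder as the unsupervised model (the slides do not specify the model; RBMs were also commonly used for this):

```python
import numpy as np

def pretrain_layer(X, hidden, epochs=300, lr=0.05, seed=0):
    """Greedy layer-wise pre-training sketched as a tied-weight autoencoder:
    learn W so that the code h = tanh(X W) can reconstruct the activities
    of the layer below via X_hat = h W^T. (Hypothetical minimal model.)"""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    for _ in range(epochs):
        h = np.tanh(X @ W)        # encode
        X_hat = h @ W.T           # decode / reconstruct layer below
        err = X_hat - X           # reconstruction error
        # Gradient of 0.5 * ||X_hat - X||^2 w.r.t. the tied weights W
        # (encoder path plus decoder path).
        dW = X.T @ ((err @ W) * (1.0 - h ** 2)) + err.T @ h
        W -= lr * dW / len(X)
    return W
```

The learned W would initialize one layer of the deep network before supervised fine-tuning; the next layer would then be pre-trained on the codes h.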
Two types of layers are used:
1. Layer of convolutional units
   • which consider overlapping 5 × 5 pixel regions
2. Layer of subsampling units (2 × 2)
• Several feature maps and sub-sampling
• Gradual reduction of spatial resolution compensated by increasing number of features
• Final layer has softmax output
• Whole network trained using backpropagation
[Figure: each 5 × 5 pixel patch feeds one unit of a 10 × 10 = 100-unit plane (called a feature map). Weights are the same for different positions, so only 25 weights are needed; due to this weight sharing the operation is equivalent to convolution. Different features have different feature maps.]
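A minimal NumPy sketch of the two layer types, assuming a hypothetical 14 × 14 input so that a 5 × 5 kernel yields a 10 × 10 feature map, which 2 × 2 subsampling reduces to 5 × 5:

```python
import numpy as np

def feature_map(image, kernel):
    """Convolutional layer, one feature map: slide a single shared kernel
    over all overlapping regions of the image. Weight sharing means only
    kernel.size weights (25 for 5x5) regardless of image size."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def subsample(fmap):
    """Subsampling layer: average non-overlapping 2x2 blocks,
    halving the spatial resolution."""
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.normal(size=(14, 14))       # hypothetical 14x14 input image
k = rng.normal(size=(5, 5))           # one shared 5x5 kernel
fm = feature_map(img, k)
print(fm.shape, subsample(fm).shape)  # (10, 10) (5, 5)
```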
ConvNet History
• Neocognitron
• Similar architecture
• Did not have end-to-end supervised learning using Backprop
• ConvNet with probabilistic model was used for OCR
and handwritten check reading
Applications of ConvNets
• Applied with great success to images: detection, segmentation
and recognition of objects and regions
• Tasks where abundant data was available
• Traffic sign recognition
• Segmentation of biological images
• Connectomics
• Detection of faces, text, pedestrians, human bodies in natural images
• A major recent practical success is face recognition
• Images can be labeled at pixel level
• Applications in autonomous mobile robots, and self-driving cars
• Other applications gaining importance
• Natural language understanding
• Speech recognition
Distributed Representations
Boltzmann Machine
• Named after the Boltzmann distribution (or Gibbs distribution)
• Gives the probability that a system will be in a certain state given its temperature: p(state) ∝ exp(−E / kT)
• Where E is the energy of the state and kT (the constant of the distribution) is the product of Boltzmann's constant and the thermodynamic temperature
• Training
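The distribution itself can be sketched directly; the three state energies below are made-up values for illustration:

```python
import math

def boltzmann_probs(energies, kT):
    """Boltzmann distribution: p(state) proportional to exp(-E / kT),
    where E is the state's energy and kT is the product of Boltzmann's
    constant and the thermodynamic temperature."""
    weights = [math.exp(-E / kT) for E in energies]
    Z = sum(weights)                  # partition function (normaliser)
    return [w / Z for w in weights]

# Hypothetical three-state system: lower-energy states are exponentially
# more probable, and raising kT flattens the distribution toward uniform.
print(boltzmann_probs([0.0, 1.0, 2.0], kT=1.0))
print(boltzmann_probs([0.0, 1.0, 2.0], kT=1000.0))
```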