Ambedkar Dukkipati
Department of Computer Science and Automation
Indian Institute of Science, Bangalore
March 28, 2019
Introduction
A Snapshot of Deep Learning
Features
A Snapshot of Deep Learning (cont...)
Popular Models
A Snapshot of Deep Learning (cont...)
Tools
- PyTorch
- Theano
- Caffe
- TensorFlow
A Snapshot of Deep Learning (cont...)
Consequences
Shallow Vs. Deep
Can we also emulate the same?
What has changed now?
Perceptron: History
History:
Deep Learning: Advantages
- Nonlinearity
- Input-Output mapping
- Adaptivity
- Fault Tolerance
- VLSI implementability
Where do we start?
- Perceptron
- Feedforward deep networks and the backpropagation algorithm
- Later: CNNs, LSTMs, RBMs (if time permits)
Hyperplane based classifiers and Perceptron
Linear as Optimization
Linear Regression
Problem Set-Up
Logistic Regression (contd...)
Problem
Find a function f such that
\mu_n = f(x_n)
Model
\mu_n = f(x_n) = \sigma(w^\top x_n) = \frac{1}{1 + \exp(-w^\top x_n)} = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}
Logistic Regression (contd...)
Sigmoid Function
Logistic Regression (contd...)
Loss Function
Here we use the cross-entropy loss instead of the squared loss.
- L(y_n, f(x_n)) = \begin{cases} -\log(\mu_n) & \text{when } y_n = 1 \\ -\log(1 - \mu_n) & \text{when } y_n = 0 \end{cases}
  = -y_n \log(\mu_n) - (1 - y_n) \log(1 - \mu_n)
  = -y_n \log(\sigma(w^\top x_n)) - (1 - y_n) \log(1 - \sigma(w^\top x_n))
- L(w) = -\sum_{n=1}^{N} [y_n w^\top x_n - \log(1 + \exp(w^\top x_n))]
  Here L(w) is the loss over all the data.
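The model and loss above can be checked numerically. A minimal NumPy sketch (illustrative code, not from the slides; the names sigmoid and cross_entropy_loss are our own) that computes mu_n = sigma(w^T x_n) and the total loss L(w):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, X, y):
    # X: (N, D) inputs, y: (N,) binary labels in {0, 1}, w: (D,) weights.
    mu = sigmoid(X @ w)  # mu_n = sigma(w^T x_n)
    # L(w) = -sum_n [ y_n log(mu_n) + (1 - y_n) log(1 - mu_n) ]
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Tiny made-up example: with w = 0, every mu_n = 0.5, so L(w) = 3 log 2.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]])
y = np.array([1, 0, 1])
w = np.zeros(2)
print(cross_entropy_loss(w, X, y))  # ~2.079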
Logistic Regression (contd...)
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} (\mu_n - y_n) x_n
- Also: replace the predicted label probability \mu_n^{(t)} by the predicted binary label \hat{y}_n^{(t)}, where
  \hat{y}_n^{(t)} = \begin{cases} 1 & \text{if } \mu_n^{(t)} > 0.5 \text{ or } w^{(t)\top} x_n > 0 \\ 0 & \text{if } \mu_n^{(t)} < 0.5 \text{ or } w^{(t)\top} x_n < 0 \end{cases}
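A minimal sketch of one gradient step and the thresholded prediction rule above (illustrative code, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y):
    # dL/dw = sum_n (mu_n - y_n) x_n
    mu = sigmoid(X @ w)
    return X.T @ (mu - y)

def predict(w, X):
    # y_hat_n = 1 if mu_n > 0.5 (equivalently w^T x_n > 0), else 0.
    return (X @ w > 0).astype(int)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]])
y = np.array([1, 0, 1])
w, eta = np.zeros(2), 0.1
w = w - eta * gradient(w, X, y)   # one full-batch gradient step
print(w, predict(w, X))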
Stochastic Gradient Descent (contd...)
w^\top x = 0
- By adding a bias b \in \mathbb{R}: w^\top x + b = 0. For b > 0 the hyperplane moves parallel to itself along w; for b < 0, in the opposite direction.
Hyperplane based classification
- Classification rule:
  y = \mathrm{sign}(w^\top x + b)
  w^\top x + b > 0 \implies y = +1
  w^\top x + b < 0 \implies y = -1
The Perceptron Algorithm (Rosenblatt, 1958)
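The algorithm itself is not reproduced in this extract; the following is a minimal sketch of the standard Rosenblatt perceptron update, assuming labels y_n in {-1, +1} (function and variable names are illustrative):

import numpy as np

def perceptron_train(X, y, epochs=100):
    # X: (N, D) inputs, y: (N,) labels in {-1, +1}.
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        mistakes = 0
        for n in range(N):
            # A point is misclassified when y_n (w^T x_n + b) <= 0.
            if y[n] * (X[n] @ w + b) <= 0:
                # Move the hyperplane towards the misclassified point.
                w = w + y[n] * X[n]
                b = b + y[n]
                mistakes += 1
        if mistakes == 0:   # converged: every point is classified correctly
            break
    return w, b

# Linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # matches y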
Perceptron Convergence Theorem (Block and Novikoff)
Feed Forward Neural Networks
Some Basic Features of Multilayer Perceptrons (or Feedforward Deep Neural Networks)
Why hidden Layers?
Two important steps in training a neural network
1. Forward step (sketched below):
   - Input is fed to the first layer.
   - The input signal is propagated through the network, layer by layer.
   - The synaptic weights of the network are fixed, i.e., no learning happens in this step.
   - The error is calculated at the output layer by comparing the observed output with the "desired output" (ground truth).
2. Backward step:
   - The observed error at the output layer is propagated "backwards", layer by layer. (How?)
   - In this step, successive adjustments are made to the synaptic weights.
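A minimal sketch of the forward step for a network with one hidden layer (the layer sizes, the sigmoid activation, and the names W1, W2 are illustrative choices, not specified in the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, W2, b2):
    # Forward step: weights stay fixed; the signal is propagated layer by layer.
    v1 = W1 @ x + b1      # local fields of the hidden layer
    y1 = sigmoid(v1)      # hidden-layer outputs
    v2 = W2 @ y1 + b2     # local fields of the output layer
    y2 = sigmoid(v2)      # network outputs
    return y1, y2

# Error at the output layer: compare the observed output with the desired output z.
x, z = np.array([0.5, -1.0]), np.array([1.0])
W1, b1 = 0.1 * np.ones((3, 2)), np.zeros(3)
W2, b2 = 0.1 * np.ones((1, 3)), np.zeros(1)
_, y2 = forward(x, W1, b1, W2, b2)
e = z - y2                 # this error drives the backward step
print(y2, e)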
Propagation of information in a neural network
1. Computation of signals
2. Computation of gradients
   - Gradients of the "error surface" w.r.t. the weights (we will see later how).
Error
- Let D = \{(x(n), z(n))\}_{n=1}^{N} be a training sample, where x(n) is an input and z(n) is the desired output.
- x(n) \in \mathbb{R}^D; we write x(n) = (x_1(n), \ldots, x_D(n)).
- z(n) \in \mathbb{R}^M; we write z(n) = (z_1(n), \ldots, z_M(n)).
- Suppose the output of the network is y(n) = (y_1(n), \ldots, y_M(n)) when x(n) is the input.
- The error at the j-th output neuron is e_j(n) = z_j(n) - y_j(n).
Error (contd...)
The Backpropagation Algorithm
The Backpropagation Algorithm (cont...)
The derivative
- E is a function of e_j
- e_j is a function of y_j (y_j is the output)
- y_j is a function of v_j (v_j is the local field)
- v_j is a function of w_{ji}
The Backpropagation Algorithm (cont...)
- The derivative is, by the chain rule,
  \frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}
- \implies
  \frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n) \varphi'_j(v_j(n)) x_i(n)
The Backpropagation Algorithm (contd...)
- We have
  \frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n) \varphi'_j(v_j(n)) x_i(n)
- Hence, the update rule is
  w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)} = w_{ji}(n) + \eta e_j(n) \varphi'_j(v_j(n)) x_i(n)
The Backpropagation Algorithm (contd...)
Local Gradient
\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n) \varphi'_j(v_j(n))
BPA: Case 1: Neuron j is an output node
- Update rule
  w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)}
             = w_{ji}(n) + \eta \underbrace{e_j(n) \varphi'_j(v_j(n))}_{\text{local gradient}} x_i(n)
BPA: Case 2: Neuron j is a hidden node
BPA: Case 2: Neuron j is a hidden node (contd...)
Strategy
- First compute the local gradient \delta_j(n) for the j-th hidden neuron. (How, we will see shortly.)
- Then use an update similar to that of an output neuron:
  \Delta w = \text{learning rate} \times \text{local gradient} \times \text{input} = \eta \delta_j(n) x_i(n)
BPA: Case 2: Neuron j is a hidden node (contd...)
Note: If this had been an output neuron, we would have had \delta_j(n) = e_j(n) \varphi'_j(v_j(n)); for a hidden neuron, however, the error e_j(n) is not directly available.
BPA: Case 2: Neuron j is a hidden node (contd...)
\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \varphi'_j(v_j(n))
- Let us compute \frac{\partial E(n)}{\partial y_j(n)}.
- We have E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n), where the summation is over all the output neurons.
- Then
  \frac{\partial E(n)}{\partial y_j(n)} = \sum_{k \in C} e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_{k \in C} e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}
BPA: Case 2: Neuron j is a hidden node (contd...)
- We have \frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n)).
- We have v_k(n) = \sum_{l=1}^{m} w_{kl}(n) y_l(n).
- Hence \frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n).
BPA: Case 2: Neuron j is a hidden node (contd...)
\frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k \in C} e_k(n) \varphi'_k(v_k(n)) w_{kj}(n) = -\sum_{k \in C} \delta_k(n) w_{kj}(n)
BPA: Case 2: Neuron j is a hidden node (contd...)
We have \frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k} \delta_k(n) w_{kj}(n)
and \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n)).
Hence, \delta_j(n) = \varphi'_j(v_j(n)) \sum_{k} \delta_k(n) w_{kj}(n).
BPA: Update Rule Summary
Online Vs Batch Learning
Batch Learning
Online Vs Batch Learning
Online Learning
E = \frac{1}{2} \sum_{j=1}^{M} e_j^2(n) = \frac{1}{2} \sum_{j=1}^{M} (z_j(n) - y_j(n))^2
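A minimal sketch contrasting the two regimes (illustrative; a linear output unit y(n) = w^T x(n) is assumed here only to keep the gradient of E(n) = (1/2)(z(n) - y(n))^2 short):

import numpy as np

def grad_E(w, x, z):
    # Gradient of E(n) = 1/2 (z(n) - y(n))^2 with y(n) = w^T x(n).
    return -(z - w @ x) * x

def online_epoch(w, X, Z, eta):
    # Online learning: update the weights after every training example.
    for x, z in zip(X, Z):
        w = w - eta * grad_E(w, x, z)
    return w

def batch_epoch(w, X, Z, eta):
    # Batch learning: accumulate the gradient over the whole epoch, then update once.
    g = sum(grad_E(w, x, z) for x, z in zip(X, Z)) / len(X)
    return w - eta * g

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z = np.array([1.0, -1.0, 0.0])
w0, eta = np.zeros(2), 0.1
print(online_epoch(w0.copy(), X, Z, eta))
print(batch_epoch(w0.copy(), X, Z, eta))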
Activation Functions
Sigmoid Function
Activation Functions (contd...)
- Shares some disadvantages of the sigmoid.
Activation Functions (contd...)
A linear function satisfies f(x + y) = f(x) + f(y).
But \varphi(-1) + \varphi(+1) \neq \varphi(0), so \varphi is not linear.
Activation Functions (contd...)
ReLU (contd...)
- It is an unbounded function.
- Changing from sigmoid to ReLU as the activation function in hidden layers is possible: hidden neurons need not have bounded values.
- The issue with the sigmoid function is that its derivative is very small (near zero) everywhere except near 0.
Activation Functions (contd...)
ReLU (contd...)
- The reason is that, in the case of the sigmoid, \varphi'(\cdot) is always less than 1, with most values being close to 0.
  ⟹ This imbalance in gradient magnitudes makes it difficult to change the parameters of the neural network with stochastic gradient descent.
Activation Functions (contd...)
ReLU (contd...)
- This problem can be mitigated by using the rectified linear activation function, because the derivative of ReLU can have many non-zero values.
  ⟹ This in turn means that the magnitude of the gradient is more balanced throughout the network.
- Dying ReLU: a neuron in the network is permanently dead due to its inability to fire in the forward pass.
  ⟹ When the activation is zero in the forward pass, all the incoming weights get zero gradient.
  ⟹ In backpropagation, the weights of such neurons never get updated.
- Using ReLU in RNNs can blow up the computations to infinity, as activations are not bounded.
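A small numerical sketch of the activations discussed above and their derivatives (illustrative): the sigmoid derivative is at most 0.25 and is close to zero away from the origin, while the ReLU derivative is 1 on the positive side and exactly 0 on the negative side (the source of dying ReLU).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # <= 0.25, and near 0 for |z| large

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)    # 1 for z > 0, 0 for z <= 0

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(d_sigmoid(z))  # approx [0.0066 0.1966 0.25 0.1966 0.0066]
print(d_relu(z))     # [0. 0. 0. 1. 1.]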
Rate of Learning
Stopping Criteria
Summary of Backpropagation Learning Algorithm
The weight update is \Delta w_{ji}(n) = \eta \delta_j(n) x_i(n), where
- \eta : learning rate
- \delta_j(n) : local gradient
- x_i(n) : input to the j-th neuron
Local gradient:
- If neuron j is an output node: \delta_j(n) = e_j(n) \varphi'_j(v_j(n))
- If neuron j is a hidden node: \delta_j(n) = \varphi'_j(v_j(n)) \sum_{k} \delta_k(n) w_{kj}(n)
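A minimal NumPy sketch of one backpropagation update for a network with a single hidden layer, following the update rule and local gradients summarized above (the layer sizes, the sigmoid activation, and the variable names are illustrative assumptions):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, z, W1, W2, eta):
    # Forward pass: local fields and outputs, layer by layer.
    v1 = W1 @ x;  y1 = sigmoid(v1)        # hidden layer
    v2 = W2 @ y1; y2 = sigmoid(v2)        # output layer
    e = z - y2                            # e_j(n) = z_j(n) - y_j(n)

    # Local gradients (for the sigmoid, phi'(v) = phi(v)(1 - phi(v))).
    delta2 = e * y2 * (1 - y2)                 # output node: delta_j = e_j phi'_j(v_j)
    delta1 = y1 * (1 - y1) * (W2.T @ delta2)   # hidden node: delta_j = phi'_j(v_j) sum_k delta_k w_kj

    # Update rule: Delta w_ji = eta * delta_j * (input to neuron j).
    W2 = W2 + eta * np.outer(delta2, y1)
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2

# Tiny example: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, z = np.array([0.5, -1.0]), np.array([1.0])
W1, W2 = backprop_step(x, z, W1, W2, eta=0.5)
print(W1, W2)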
XOR Problem
0 ⊕ 0 = 0
0 ⊕ 1 = 1
1 ⊕ 0 = 1
1 ⊕ 1 = 0
- \{(0,0), (1,1)\} and \{(0,1), (1,0)\} are not linearly separable. Hence a single-layer neural network cannot solve this problem.
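That a network with one hidden layer can solve XOR can be checked directly. A hedged sketch with two hidden threshold units and hand-picked weights (the construction below, computing (x1 OR x2) AND NOT (x1 AND x2), is one standard choice; the slides' own construction is not in this extract):

import numpy as np

def step(v):
    # Threshold (Heaviside) activation: 1 if v > 0, else 0.
    return (v > 0).astype(int)

def xor_net(x):
    # Hidden layer: h1 fires for x1 OR x2, h2 fires for x1 AND x2.
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: fires when h1 = 1 and h2 = 0, i.e., exactly one input is 1.
    return step(np.array([h @ np.array([1.0, -1.0]) - 0.5]))[0]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # 0, 1, 1, 0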
Universal Approximation Theorem