Deep PDF

ML Lecture Slides
Ambedkar Dukkipati
Department of Computer Science and Automation
Indian Institute of Science, Bangalore
March 28, 2019
A.Dukkipati 1
Introduction
A Snap Shot of Deep Learning
Features
I Go beyond the curve fitting...

I Amazing results with raw data...
I Pay and get the tagged data...
A.Dukkipati 2
A Snap Shot of Deep Learning (cont...)
Popular Models
I Feed Forward Neural Networks and Convolutional Neural

Networks
I Recurrent Neural Networks and Long Short Term Memory
Networks
I Restricted Boltzmann Machines and Deep Boltzmann
Machines
I Autoencoders
I Generative Adversarial Networks, Variational Autoencoders
I Memory networks and Neural Turing machines
I a lot more....
A.Dukkipati 3
Tools
I PyTorch
I Theano
I Caffe
I TensorFlow
A.Dukkipati 4
Consequences
I Brought “AI” back into computer science ***forey***

I If it works, we accept....we do not mind waiting for “why”
A.Dukkipati 5
Shallow Vs. Deep
Until recently most machine learning and signal processing

techniques had exploited “Shallow-Structured Architectures”
I Shallow: typically contain at most one or two layers of

nonlinear function transformations
I Gaussian mixture models
I Conditional random fields
I Linear or nonlinear dynamical systems
I Maximum entropy models
I Support vector machines
I Logistic regression
I Kernel regression
I Multilayer perceptron with single hidden layers.
I Deep: More nonlinear “hidden” layers
A.Dukkipati 6
Shallow Vs. Deep (Cont...)
I Shallow architectures have been effective in solving many

simple or well constrained problems
I Shallow architectures have limited modeling and
“representation power” can cause difficulties when dealing
with more complicated real world applications involving
natural signals:
I human speech,
I natural language,
I natural images and scenes
I These shallow architectures work well given very good
“hand crafted features” (may require signal processing
techniques)
I Advantage is that training is easy and may end up mostly
with a “convex optimization problem”.
A.Dukkipati 7
Human Perception and Evidence for layered hierarchical
systems
Human information processing mechanism (eg. vision and

audio) suggest the need of deep architectures for extracting
complex structures and building internal representations from
rich sensory inputs.
I Human speech production and perception systems are

equipped with “layered hierarchical structures” in
transforming the information from the waveform level to
the linguistic level.
I Human visual systems on the perception side is hierarchical
but also “generative.”
A.Dukkipati 8
Can we also emulate the same?
I The concept of deep learning is obtained from neural

networks
I Feedforward neural networks or MLPs with many hidden
layers, which are often referred to as deep neural networks
(DNNs)
I Back propagation is popularized in 1980 and has been well
known algorithm for learning parameters.
A.Dukkipati 9
Can we also emulate the same? (Cont...)
I Unfortunately BP alone did not work because nonconvex

nature of resulting optimization problems
I The bigger problem: Vanishing gradient problem
I This has steered away most of the ML researchers from
natural networks to shallow models that have convex loss
functions like
I support vector machines (SVM),
I conditional random fields (CRF) and
I maximum entropy models (MaxEnt)
for which global optimum can be efficiently obtained at the
cost of reduced modeling power.
A.Dukkipati 10
What has changed now?
The optimization difficulties associated with the deep models

was empirically alleviated when a “reasonably efficient”
unsupervised learning algorithms were introduced by Hinton
(2006)
A.Dukkipati 11
What has changed now? (Cont...)
I DBN is composed of a stack of restricted Boltzmann

machines (RBM)
I A greedy, layer by layer algorithm optimizes DBM weights
at the at the time complexity linear to the size and depth
of the networks.
I DBMs can be used to initialize the training of Deep Neural
Networks (DNN)
I Advantages of DBMs:
I Supply of good initialization for DNNs
I Learning algorithm makes effective use of unlabeled data
A.Dukkipati 12
What has changed now? (Cont...)
Most importantly GPUs and Tools like Torch and

TensorFlow
A.Dukkipati 13
Perceptron: History
History:
I McCulloch and Pitts (1943) introduced idea of neural

networks as computing machines
I Hebb (1949) postulated the first rule for self organizing
maps
I Rosenblatt (1958) invented perceptron that algorithmically
described neural networks
A.Dukkipati 14
Perceptron: History
A.Dukkipati 15
Deep Learning: Advantages
I Nonlinearity
I Input-Output mapping
I Adaptivity
I Fault Tolerance
I VLSI implementability
A.Dukkipati 16
Where do we start?
I Perceptron
I Feed forward deep networks and Back propagation
algorithm
I later CNNs, LSTMs, RBMs (if time permits)
A.Dukkipati 17
Hyperplane based classifiers and
Perceptron
Linear as Optimization
Supervised Learning Problem
I Given data {(xn , yn )}N

n=1 find f : X → Y that best
approximates the relation between X and Y .
I Determine f such a way that loss l(y, f (x)) is minimum.
I f and l are specific to the problem and the method that we
choose.
A.Dukkipati 18
Linear Regression
I Data: {(xn , yn )}N

n=1
I xn ∈ RD is a D dimensional input
I yn ∈ R is the output
Aim is to find a hyperplane that fits best these points.
I Here hyperplane is a model of choice i.e.,
D
X
f (x) = x j wj + b = w | x + b
j=1
I Here w1 , . . . , wd and b are model parameters

I Best is determined by some loss function
N
X
Loss(w) = [yn − f (xn )]2
n=1
I Aim : Determine the model parameters that minimize the
loss.
A.Dukkipati 19
Logistic Regression
Problem Set-Up
I Two class classification

I Instead of the exact labels estimate the probabilities of the
labels i.e.
Predict P (yn = 1|xn , w) = µn

P (yn = 0|xn , w) = 1 − µn
I Here (xn , yn ) is the input output pair.
A.Dukkipati 20
Logistic Regression(Contd...)
Problem
Find a function f such that,
µ = f (xn )
Model
1
µn = f (xn ) = σ(w| xn ) =
1 + exp(−w| xn )
exp(w| xn )
=
1 + exp(w| xn )
A.Dukkipati 21
Logistic Regression(Contd...)
Sigmoid Function
I Here σ(.) is the sigmoid function.
I The model first computes a real valued score

w| x = D
P
i=1 wi xi and then nonlinearly “squashes” it
between (0,1) to turn into a probability.
A.Dukkipati 22
Logistic Regression(contd...)
Loss Function
Here we use cross entropy loss instead of squared loss.
I

− log(µ ) when yn = 1
n
L(yn , f (xn )) =
− log(1 − µn ) when yn = 0
= −yn log(µn ) − (1 − yn ) log(1 − µn )
= −yn log(σ(w| xn )) − (1 − yn ) log(1 − σ(w| xn ))
PN |x
I L(w) = − n=1 [yn w n − log(1 + exp(w| xn ))]
Here L(w) is the loss over all the data.
A.Dukkipati 23
Logistic Regression(contd...)
By taking the derivative w.r.t w
N
∂L X
= (µn − yn )xn
∂w
n=1
I Here the Gradient Descent Algorithm is

1 Initialize w(1) ∈ RD randomly
2 Iterate until the convergence
N
X
(t+1) (t)
w = |{z}
w − η (µ(t) − yn )xn
| {z } |{z}
n=1
| n {z }
New learned previous Learning Gradient at
parameter or value rate previous value
weights
|
I Note: Here µ(t) = σ(w(t) xn )
I Problem: Calculating gradient in each iteration requires
all the data. When N is large this may not be feasible.
A.Dukkipati 24
Stochastic Gradient Decent
I Note: Gradient decent requires all the data to calculate the

gradients.
I Strategy: Approximate gradient using randomly chosen
data point (xn , yn )
w(t+1) = w(t) − ηt (µ(t)

n − yn )xn
.
(t)
I Also: Replace predicted label probability µn by predicted
(t)
binary label ŷn , where

1 if µ(t) > 0.5 or w(t)| x > 0
(t) n n
ŷn =
0 if µ(t)
n < 0.5 or w
(t) |
xn < 0
A.Dukkipati 25
Stochastic Gradient Decent(contd...)
I Hence: Update rule becomes

w(t+1) = w(t) − ηt (ŷn(t) − yn )xn
I This is mistake driven update rule

I w(t) gets updated only when there is a misclassification i.e.
(t)
ŷn 6= yn
I Change: the class labels to {-1,+1}

−2y (t) (t)
n if ŷn = 6 yn
=⇒ ŷn(t) − yn = (t) (t)
0 if ŷn = yn
I Hence: Whenever there is a misclassification.
w(t+1) = w(t) − 2η(t) yn xn
=⇒ This is a perceptron learning algorithm which is a
A.Dukkipati
hyperplane based learning algorithm. 26
Hyperplanes
I Seperates a d-dimensional space into two half

spaces(positive and negative)
I Equation of the hyperplane is
w| x = 0
I By adding bias b ∈ R
w| x + b = 0 b > 0 moving the
hyperplane parallely along w
b<0 opposite direction
A.Dukkipati 27
Hyperplane based classification
I Classification rule
y = sign(w| x + b)
w| x + b > 0 =⇒ y = +1
w| x + b < 0 =⇒ y = −1
A.Dukkipati 28
Hyperplane based classification
A.Dukkipati 29
The Perceptron Algorithm (Rosenblatt, 1958)
I Aim is to learn a linear hyperplane to separate two classes.

I Mistake drives online learning algorithm
I Guaranteed to find a separating hyperplane if data is
linearly separable.
I If data is not linearly separable
I Make linearly separable using kernel methods.
I (Or) Use multilayer perceptron.
A.Dukkipati 30
Perceptron Algorithm
I Given training data D = {(x1 , y1 ), ..., (xn , yn )}

I Initialize wold = [0, ..., 0], bold = 0
I Repeat until convergence.
I For a random (xn , yn ) ∈ D
I If yn (w| xn + b) ≤ 0
[Or sign(w| x + b) 6= yn i.e mistake mode]
wnew = wold + yn xn
bnew = bold + yn
A.Dukkipati 31
Perceptron Convergence Theorem (Block and Novikoff )
"Roughly" : If the data is linearly separable perceptron

algorithm converges.
A.Dukkipati 32
Feed Forward Neural Networks
Some Basic Features of Multilayer Perceptrons (or Feed-
forward Deep Neural Networks
I Network will have hidden layers.

I Since perceptron works only for linearly separable data,
each neuron has a non-linear activation function:
Activation function is differentiable.
I Network exhibit a high degree of connectivity.
A.Dukkipati 33
Why hidden Layers?
A.Dukkipati 34
Why hidden Layers? (Cont...)
I Hidden layers can automatically learn features from data

I The bottom-most hidden layer captures very low level
features (e.g., edges). Subsequent hidden layers learn
progressively more high-level features (e.g., parts of
objects) that are composed of previous layerâĂŹs features
A.Dukkipati 35
Two important steps in training neural network
1 Forward step:
I Input is fed to the first layer.
I Input signal is propagated through the network layer by
layer.
I Synaptic weights of the network are fixed i.e. no learning
happens in this step.
I Error is calculated at the output layer by comparing the
observed output with "desired output" (Ground truth)
2 Backward step:
I The observed error at the output layer is propagated
"backwards", layer by layer. (How?)
I Error is propagated "backwards", layer by layer.
I In this step, successive adjustments are made to the
synaptic weights.
A.Dukkipati 36
Propagation of information in neural network
Two kinds of signals:
1 Function signal (leads to observed error)

2 Error signal (leads to updation of weights or parameters)
Figure 2: Propagation of information in neural network
A.Dukkipati 37
Computation of signals
1 Computation of function signal (in the forward step)
2 Computation of gradient
I gradients of the "error surface" w.r.t weights (we will see
this later how)
A.Dukkipati 38
Error
I D = {(x(n), z(n))}N
n=1 be a training sample where x(n) is
an input and z(n) is the desired output.
I x(n) ∈ RD , we write x(n) = (x1 (n), ..., xD (n).
I z(n) ∈ RM , we write z(n) = (z1 (n), ..., zM (n)).
I Suppose the output of the network is
y(n) = (y1 (n), ..., yM (n) when x(n) is the input.
I Error at the j th output neuron is
ej (n) = |zj (n) − yj (n)|, j = 1, 2, ..., M
A.Dukkipati 39
Error (contd. . . )
I The total error per sample is

M
1X
E(n) = (zj (n) − yj (n))2
2
j=1
I Average error for the training data or empirical risk

N N M
1 X 1 XX
E= E(n) = (zj (n) − yj (n))2
N 2N
n=1 n=1 j=1
A.Dukkipati 40
The Backpropagation Algorithm
A.Dukkipati 41
The Backpropagation Algorithm
I x1 (n), . . . , xi (n), . . . , xm (n): Function signals that are

produced by the previous layer which is an input to the j th
neuron.
vj (n) = m
P
i=0 wji (n)xi (n)
I
I vj (n) is the induced local field.
I m is the size of the input (i.e. in the previous layer there
are m neurons)
I yj (n) = ϕ(vj (n))
I Function signal appearing at the output of neuron j.
A.Dukkipati 42
The Backpropagation Algorithm (cont...)
I BPA applies a correction ∆wji (n) to the synaptic weight

∂E(n)
proportional to , i = 1, 2, . . . , m.
∂wji (n)
I Note that we are trying to update j th neuron out pf all M
neurons.
I For nth data point, the error is
M M
1X 2 1X 2
E(n) = (zj (n) − yj (n)) = e
2 j=1 2 j=1 j
A.Dukkipati 43
We compute the derivative of

M
1X
E(n) = (zj (n) − yj (n))2
2
j=1
w.r.t wji (x) (apply chain rule)
A.Dukkipati 44
The derivative
∂E(n) ∂E(n) ∂ej (n) ∂yj (n) ∂vj (n)

=
∂wji (n) ∂ej (n) ∂yj (n) ∂vj (n) ∂wji (n)
Since
I E is a function of ej
I ej is a function of yj (yj is the output)
I yj is a function of vj (vj is the local field)
I vj is a function of wji
A.Dukkipati 45
I The derivative is
∂E(n) ∂E(n) ∂ej (n) ∂yj (n) ∂vj (n)

=
∂wji (n) ∂ej (n) ∂yj (n) ∂vj (n) ∂wji (n)
1
PM 2 ∂E(n)
I E(n) = 2 j=1 ej (n) =⇒ ∂ej (n) = ej (n)
∂e (n)
I ej (n) = zj (n) − yj (n) =⇒ ∂yjj (n) = −1
∂y (n)
I yj (n) = ϕ(vj (n)) =⇒ ∂vjj (n) = ϕ0j (vj (n))
Pm ∂vj (n)
I vj (n) = i=0 wji (n)xi (n) =⇒ ∂wji = xi (n)
I =⇒
∂E(n)
= −ej (n)ϕ0j (vj (n))xi (n)
∂wji (n)
A.Dukkipati 46
The Backpropagation Algorithm (contd. . . )
Update rule for jth output neuron
I We have
∂E(n)
= −ej (n)ϕ0j (vj (n))xi (n)
∂wji (n)
I Hence, the update rule is
∂E(n)
wji (n + 1) = wji (n) − η
∂wji (n)
= wji (n) + ηej (n)ϕ0j (vj (n))xi (n)
A.Dukkipati 47
The Backpropagation Algorithm (contd. . . )
Local Gradient
I Define local gradient δj (n) for j th neuron as
∂E(n)
δj (n) = −
∂vj (n)
∂E(n) ∂ej (n) ∂yj (n)
=− = ej (n)ϕ0j (vj (n))
∂ej (n) ∂yj (n) ∂vj (n)
I wji (n + 1) = wji (n) + η δj (n) xi (n)

| {z }
Local Gradient
A.Dukkipati 48
BPA: Case 1: Neuron j is an output node
I Output neuron has an “easy” access to the error
ej (n) = dj (n) − yj (n)

M
1X 2
E(n) = ej (n)
2
j=1
I Update rule
∂E(n)
wji (n + 1) = wji (n) − η
∂wji (n)
= wji (n) − η ej (n)ϕ0 (vj (n)) xi (n)
| {z }
Local Gradient
= wji (n) − ηδj (n)xi (n)
A.Dukkipati 49
BPA: Case 2: Neuron j is a hidden node
I Unlike in the case of output neuron, hidden neuron does

not have a direct access to the "error".
I TRICK
I Error signal for a hidden neuron will be determined
recursively.
I They expect the next hidden neuron (or output neuron),
they are connected to, to share "some" error.
I Error propagates by working backwards.
A.Dukkipati 50
BPA: Case 2: Neuron j is a hidden node (contd. . . )
Strategy
I First compute the local gradient δj (n) for j th hidden
neuron.
I How we will see ...
I Then use the update that is similar to output neuron
∆w = Learning rate × Local gradient × Input
= ηδj (n)xi (n)
A.Dukkipati 51
Local field of j th hidden neuron:
∂E(n) ∂E(n) ∂yj (n)

δj (n) = − =−
∂vj (n) ∂yj (n) ∂vj (n)
∂E(n)
=− ϕ0j (vj (n)) ∵ yj (n) = ϕj (vj (n))
∂yj (n)
Note: If this had been the output neuron, we would have had
∂E(n) ∂E(n) ∂ej (n)

= = −ej (n)
∂yj (n) ∂ej (n) ∂yj (n)
Since j is hidden, it does not have access to the error.
A.Dukkipati 52
I We are trying to compute local gradient
∂E(n) ∂E(n) 0
δj (n) = − =− ϕ (vj (n))
∂vj (n) ∂yj (n) j
∂E(n)
I Let us compute ∂yj (n)
X
I We have E(n) = 1
2 e2k (n)
k∈C
| {z }
Summation over all the output neurons
I Then
∂E(n) X ∂ek (n)
= ek (n)
∂yj (n) ∂yj (n)
k∈C
X ∂ek (n) ∂vk (n)
= ek (n)
∂vk (n) ∂yj (n)
k∈C
A.Dukkipati 53
∂E(n) P ∂ek (n) ∂vk (n)

We are computing ∂yj (n) = k∈C ek (n) ∂vk (n) ∂yj (n)
I We have
ek (n) = zk (n) − yk (n)

= zk (n) − ϕk (vk (n))
∂ek (n)
I
∂vk (n)= −ϕ0 (vk (n))
We have vk (n) = m
P
l=1 wkl (n)yl (n)
I
Note that j ∈ {1, 2, ..., m} and j th neuron output along

with the other neurons in that layer are fed to the k th
output neuron.
=⇒ ∂v k (n)
∂yj (n) = wkj (n)
A.Dukkipati 54
∂E(n) P ∂ek (n) ∂vk (n)

We are computing ∂yj (n) = k∈C ek (n) ∂vk (n) ∂yj (n)
∂E(n) X
=− ek (n)ϕk (vk (n))wkj (n)
∂yj (n)
k∈C
X
=− δk (n)wkj (n)
k∈C
where δk (n) = ek (n)ϕk (vk (n)) is the local gradient of the k th

neuron.
A.Dukkipati 55
∂E(n) X
We have =− δk (n)wkj (n)
∂yj (n)
k
∂yj (n)
and = ϕ0j (vj (n))
∂vj (n)
X
Hence, δj (n) = ϕ0j (vj (n)) δk (n)wkj (n)
k
A.Dukkipati 56
I Now we have local gradient at the j th hidden node

i.e. δj (n) = ϕ0j (vj (n)) k δk (n)wkj (n)
P
where δk (n) = ek (n)ϕk (vk (n)) is the local field at k th

output layer.
I Hence,
wji (n + 1) = wji (n) + ηδj (n)xi (n)

" #
X
= wji (n) + η ϕ0j (vj (n)) δk (n)wkj (n) xi (n)
k∈C
A.Dukkipati 57
BPA: Update Rule Summary
Case 1: j th neuron us a output neuron
∆wji (n) = η ej (n)ϕ0j (vj (n)) xi (n)

| {z }
Local gradient at the j th neuron
Case 2: j th neuron is a hidden neuron

!
X
∆wji (n) = η δk (n)wkj (n) ϕ0j (vj (n)) xi (n)
k∈C
| {z }
A.Dukkipati 58
BPA: Update Rule Summary
Case 1: j th neuron us a output neuron
∆wji (n) = η ej (n)ϕ0j (vj (n)) xi (n)

| {z }
Case 2: j th neuron is a hidden neuron

!
X
∆wji (n) = η δk (n)wkj (n) ϕ0j (vj (n)) xi (n)
k∈C
| {z }
A.Dukkipati 59
Online Vs Batch Learning
Batch Learning
I Each adjustment to the weights is performed after the

presentation of all the N examples in the training samples
are presented.
I That is, cost function is average error or empirical risk.
N N M
1 X 1 XX 2
E= E(n) = ej (n)
N 2N
n=1 n=1 j=1
N M
1 XX
= (zj (n) − yj (n))2
2N
n=1 j=1
A.Dukkipati 60
Online Vs Batch Learning (Cont...)
Batch Learning
I This constitutes one epoch of training.

I In each epoch of training, samples are randomly shuffled.
I The learning curve in this case is E vs epoch number.
I Advantage: It can be easily parallelized.
I Disadvantage: Memory requirements are very high.
A.Dukkipati 61
Online Vs Batch Learning
Online Learning
I Each adjustment to the weights is performed example by

example in the training data.
I The cost function is error obtained in each sample.
M M
1X 2 1X
E= ej (n) = (zj (n) − yj (n))2
2 2
j=1 j=1
I The learning curve in this case is E(n) vs epoch.

I Learning curve is significantly different from that of batch
learning.
I Online learning take advantage of redundant data (multiple
copies of data).
I Online learning is simple to implement.
A.Dukkipati 62
Activation Functions
Heaviside step function:

ϕ(x) = 0 if x<0
ϕ(x) = 1 if x>0
A.Dukkipati
I 63
Activation Functions
Heaviside step function (contd.)

The reasons why we cannot use Heaviside step function in
feedforward neural networks:
I We train neural network using backpropagation algorithm

which requires differential activation function. For
Heaviside step function, it is not differentiable at x=0 and
it has 0 derivation everywhere else.
=⇒ The gradient descent will not be able to make
progress in updating weights.
I We want our neural network weights to be modified
continuously so that predictions can be as close as real
values. Having a function that can only generate either 0 or
1 will not help to achieve this objective.
A.Dukkipati 64
Activation Functions (contd...)
Sigmoid Function
I Sigmoid function is also known as the logistic function.

I It non-linearly squashes a number to a value between 0 and
1.
1
I Sigmoid(x) = 1+e−z
I Activations are bounded between 0 and 1.
A.Dukkipati 65
Activation Functions (contd. . . )
Sigmoid Function (contd...)

Disadvantages:
1 When the input is too small (towards -∞) the gradient is

zero.
I Hence while executing the backpropagation algorithm
weights will not get updated i.e. there is no learning.
I Vanishing gradient problem.
2 Though computing activation functions are less

computationally expensive than matrix multiplication or
convolutions, still computing exponential is expensive.
A.Dukkipati 66
Tanh Function (or Hyperbolic tangent)

I It is similar to sigmoid function but squashes the values,
non-linearly between -1 and 1.
ez − e−z
tanh(z) = +1
ez + e−z
I
A.Dukkipati
Shares some disadvantages of sigmoid. 67
Rectified Linear Unit (ReLU)
I Given an input if it is negative or zero, it outputs zero.

Otherwise it outputs the number same as input.
ReLU (x) = max(0, x)
A.Dukkipati 68
Rectified Linear Unit (ReLU)

I Is it non-linear? Yes.
I A linear function should satisfy the property that
f (x + y) = f (x) + f (y)
But ϕ(−1) + ϕ(+1) 6= ϕ(0)
I But it is piece-wise linear.
A.Dukkipati 69
ReLU (contd...)
I It is a non-bounded function.
I Change from sigmoid to ReLU as an activation function in
hidden layer is possible - hidden neurons need not have
bounded values.
I The issue with sigmoid function is that it has very small
values (near zero) everywhere except near 0.
A.Dukkipati 70
ReLU (contd...)
I At the j th neuron which is a hidden layer

!
X
∆wji (n) = η δk (n)wkj (n) ϕ0j (vj (n))
k∈C
| {z }
| {z } Derivation of
Local gradient from above sigmoid function
We multiply the gradient from the layer above with partial

derivative of the sigmoid function.
I Since in the case of sigmoid, it has very small values
everywhere except for when the input has values close to 0.
=⇒ Lower layers will likely have smaller gradients in
terms of magnitude compared to higher layers.
A.Dukkipati 71
ReLU (contd...)
I The reason is in the case of sigmoid ϕ0 (.) is always less than
1 with most values being 0.
=⇒ This imbalance in gradient magnitude makes it
difficult to change the parameters of the neural networks
with stochastic descent.
A.Dukkipati 72
ReLU (contd...)
I This problem can be adjusted by the use of rectified linear
activation function because dervative of ReLU can have
many non-zero values.
=⇒ Which in turn means that magnitude of the gradient
is more balanced throughout the network.
I Dying ReLU: A neuron in the network is permanently dead
due to inability to fire in forward pass.
=⇒ When activation is zero in the forward pass all the
weights will get zero gradient.
=⇒ In backpropagation, the weights of neurons never get
updated.
I Using ReLU in RNNs can blow up the computations to
infinity as activations are not bounded.
A.Dukkipati 73
Rate of Learning
"Backpropagation algorithm provides an ’approximation’ to the

trajectory in weight space computed by the method of steepest
gradient."
Case 1:- Smaller the learning rate ⇒ smaller the change to the
synoptic weights in the network
⇓
Smoother the learning will be
Case 2:- Larger learning rate ⇒ The network may become
unstable or oscillatory
A.Dukkipati 74
Rate of Learning (contd. . . )
∆wji (n) = α∆wji (n − 1) + ηδj (n)yi (n)
where α is the momentum constant

α = 0 gives us original delta rule
A.Dukkipati 75
Stopping Criteria
"In general backpropagation algorithm cannot be shown to

converge"
⇓
Hence this is not a well defined criteria for stopping its
operation. Some good Criteria:
∂E
1. Euclidean norm of ∂w reaches a sufficiently small gradient
threshold.
∵ The necessary condition for w∗ to be global maximum or local
∂E

minimum is ∂w ∗ =0
w
A.Dukkipati 76
Stopping Criteria (contd. . . )
2. Absolute rate of change is the average squared error per

epoch is sufficiently small.
3. Stop when "generalizing performance" is adequate or it
apparent that generalization performance is peaked.
A.Dukkipati 77
Summary of Backpropagation Learning Algorithm
wji (n + 1) = wji (n) + ηδj (n)xi (n)
I η : learning rate
I δj (n) : local gradient
I xi (n) : Input to the j th neuron
Local gradient:
δj (n) = ej (n)ϕ0j (vj (n)) if j is a output neuron

hX i
= δk (n) wkj (n) ϕ0j (vj (n)) if j is a hidden neuron
| {z }
k local gradient
of kth neuron
at the output
A.Dukkipati 78
XOR Problem
I Rosenblatt’s single layer perceptron has no hidden layer

hence it cannot classify input pattern that are NOT
linearly separable.
Consider XOR Problem :-
0⊕0=0
1⊕0=0
0⊕1=1
1⊕0=1
n o n o
I (0,0),(1,1) and (0,1),(1,0) are not linearly separable.
Hence single layer neural network cannot solve this
problem.
A.Dukkipati 79
XOR Problem (contd. . . )
I We use a hidden layer
I Consider the following neural network
A.Dukkipati 80
I The function of output neuron : construct a linear

combination of decision boundaries formed by the two
hidden neurons.
For various inputs :
I For input (1,1): v1 = (1)(+1) + (1)(+1) + (+1)(-1.5)
= 1 + 1 - 1.5 = 0.5 ⇒ ϕ(v1 ) =1
v2 = (1)(+1) + 1(+1) + (+1)(-0.5)
= 1 + 1 - 0.5 = 1.5 ⇒ ϕ(v2 )=1
v3 = 1(-2)+ 1(+1) +(+1)(-0.5)
= -2 + 1 - 0.5 = -1.5 ⇒ ϕ(v3 ) = 0
A.Dukkipati 81
For various inputs :

I For input (0,0): v1 = (+1)(-1.5) = -1.5 ⇒ ϕ(v1 ) = 0
v2 = (+1)(-0.5) = -0.5 ⇒ ϕ(v2 ) = 0
v3 = (+1)(-0.5) = -0.5 ⇒ ϕ(v3 ) = 0
I For input (1,0): v1 = (1)(+1) + (+1)(-0.5)
= 1 - 1.5 = -0.5 ⇒ ϕ(v1 ) =0

v2 = 1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2 ) = 1
v3 =1(+1)+(+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v3 ) =1
I For input (0,1): v1 = (1)(+1) + (+1)(-1.5)
= 1 - 1.5 = -0.5 ⇒ ϕ(v1 ) =0

v2 = 1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2 ) = 1
A.Dukkipati v =1(+1) + (+1)(-0.5) 82
I Decision Boundary of neuron 1
A.Dukkipati 83
Universal Approximation Theorem
Let ϕ(.) be a non-constant, bounded and monotonic-increasing

function. Let Im0 denotes the m0 -dimensional unit hypercube
[0, 1]m0 . Let the space of continuous functions on Im0 is denoted
by C(Im0 ). Given any function f ∈ C(Im0 ) and > 0, there
exists an integer m1 and sets of real numbers αi , bi and wi ,
where i = 1, 2, ...m and j = 1, 2...m0 .
Such that
m1
X m0
X
F (x1 , ...xm0 ) = αi ϕ( wij xj + bj )
i=1 j=1
and F arbitrarily approximates f(.). That is
|F (x1 , ..xm0 ) − f (x1 , ...xm0 )| < 0 ∀x1 , ...xm0 ∈ Im0
A.Dukkipati 84

Deep PDF

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Deep PDF

Transféré par

Droits d'auteur :

Formats disponibles

ML Lecture Slides

I Go beyond the curve fitting...

I Feed Forward Neural Networks and Convolutional Neural

I Brought “AI” back into computer science ***forey***

Until recently most machine learning and signal processing

I Shallow: typically contain at most one or two layers of

I Shallow architectures have been effective in solving many

Human information processing mechanism (eg. vision and

I Human speech production and perception systems are

I The concept of deep learning is obtained from neural

I Unfortunately BP alone did not work because nonconvex

The optimization difficulties associated with the deep models

I DBN is composed of a stack of restricted Boltzmann

Most importantly GPUs and Tools like Torch and

I McCulloch and Pitts (1943) introduced idea of neural

Supervised Learning Problem

I Given data {(xn , yn )}N

I Data: {(xn , yn )}N

I Here w1 , . . . , wd and b are model parameters

I Two class classification

Predict P (yn = 1|xn , w) = µn

I Here (xn , yn ) is the input output pair.

I Here σ(.) is the sigmoid function.

I The model first computes a real valued score

By taking the derivative w.r.t w

I Here the Gradient Descent Algorithm is

I Note: Gradient decent requires all the data to calculate the

w(t+1) = w(t) − ηt (µ(t)

I Hence: Update rule becomes

I This is mistake driven update rule

I Seperates a d-dimensional space into two half

I Aim is to learn a linear hyperplane to separate two classes.

I Given training data D = {(x1 , y1 ), ..., (xn , yn )}

"Roughly" : If the data is linearly separable perceptron

I Network will have hidden layers.

I Hidden layers can automatically learn features from data

Two kinds of signals:

1 Function signal (leads to observed error)

Figure 2: Propagation of information in neural network

1 Computation of function signal (in the forward step)

ej (n) = |zj (n) − yj (n)|, j = 1, 2, ..., M

I The total error per sample is

I Average error for the training data or empirical risk

I x1 (n), . . . , xi (n), . . . , xm (n): Function signals that are

I BPA applies a correction ∆wji (n) to the synaptic weight

We compute the derivative of

w.r.t wji (x) (apply chain rule)

∂E(n) ∂E(n) ∂ej (n) ∂yj (n) ∂vj (n)

∂E(n) ∂E(n) ∂ej (n) ∂yj (n) ∂vj (n)

Update rule for jth output neuron

I Define local gradient δj (n) for j th neuron as

I wji (n + 1) = wji (n) + η δj (n) xi (n)

I Output neuron has an “easy” access to the error

ej (n) = dj (n) − yj (n)

= wji (n) − ηδj (n)xi (n)

I Unlike in the case of output neuron, hidden neuron does

Local field of j th hidden neuron:

∂E(n) ∂E(n) ∂yj (n)

∂E(n) ∂E(n) ∂ej (n)

Since j is hidden, it does not have access to the error.

I We are trying to compute local gradient

∂E(n) P ∂ek (n) ∂vk (n)

ek (n) = zk (n) − yk (n)

I Brought “AI” back into computer science forey

|F (x1 , ..xm0 ) − f (x1 , ...xm0 )| < 0 ∀x1 , ...xm0 ∈ Im0