
Introduction to Deep Learning

MIT 6.S191
Alexander Amini
January 28, 2019
The Rise of Deep Learning

What is Deep Learning?

• ARTIFICIAL INTELLIGENCE: any technique that enables computers to mimic human behavior
• MACHINE LEARNING: ability to learn without explicitly being programmed
• DEEP LEARNING: extract patterns from data using neural networks
Lecture Schedule

• Mon Jan 28 – Fri Feb 1


• 1:00 pm – 4:00 pm
• Lecture + Lab Breakdown
• Graded P/D/F; 3 Units
• 1 Final Assignment
Final Class Project

Option 1: Proposal Presentation
• Groups of 3 or 4
• Present a novel deep learning research idea or application
• 3 minutes (strict)
• List of example proposals on the website: introtodeeplearning.com
• Presentations on Friday, Feb 1
• Submit groups by Wednesday 5pm to be eligible
• Submit slide by Thursday 9pm to be eligible
• Judged by a panel of industry judges
• Top winners are awarded:
  • 3x NVIDIA RTX 2080 Ti (MSRP: $4000)
  • 4x Google Home (MSRP: $400)

Option 2: Write a 1-page review of a deep learning paper
• Grade is based on clarity of writing and technical communication of the main ideas
• Due Friday 1:00pm (before lecture)
Class Support

• Piazza: http://piazza.com/mit/spring2019/6s191
• Useful for discussing labs
• Course Website: http://introtodeeplearning.com
• Lecture schedule
• Slides and lecture recordings
• Software labs
• Grading policy
• Email us: introtodeeplearning-staff@mit.edu
• Office Hours by request

Course Staff

Alexander Amini (Lead Organizer)
Ava Soleimany (Lead Organizer)
Thomas, Mauri, Harini, Houssam, Julia, Felix, Jacob, Rohil, Gilbert, and Ravi A.

Contact: introtodeeplearning-staff@mit.edu
Thanks to Sponsors!

Why Deep Learning and Why Now?
Why Deep Learning?
Hand-engineered features are time-consuming, brittle, and not scalable in practice.
Can we learn the underlying features directly from data?

Low-level features: lines & edges
Mid-level features: eyes, nose & ears
High-level features: facial structure
Why Now?
Neural Networks date back decades, so why the resurgence?
Timeline:
• 1952: Stochastic Gradient Descent
• 1958: Perceptron (learnable weights)
• 1986: Backpropagation (multi-layer perceptron)
• 1995: Deep convolutional neural networks (digit recognition)

Three reasons why now:
1. Big Data: larger datasets; easier collection & storage
2. Hardware: Graphics Processing Units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes
The Perceptron
The structural building block of deep learning
The Perceptron: Forward Propagation

The output is a non-linear activation function applied to a linear combination of the inputs:

$$\hat{y} = g\left(\sum_{i=1}^{m} x_i\, w_i\right)$$

Inputs → Weights → Sum → Non-Linearity → Output
The Perceptron: Forward Propagation

Adding a bias term $w_0$:

$$\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i\, w_i\right)$$

Inputs → Weights → Sum → Non-Linearity → Output
The Perceptron: Forward Propagation

In vector form:

$$\hat{y} = g\left(w_0 + X^T W\right), \quad \text{where } X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \text{ and } W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$$
The Perceptron: Forward Propagation

Activation Functions

$$\hat{y} = g\left(w_0 + X^T W\right)$$

• Example: sigmoid function
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
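A minimal sketch of this forward pass in NumPy (not from the slides); the example inputs and weights are arbitrary:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, w0):
    # linear combination of inputs plus bias, passed through the non-linearity
    z = w0 + np.dot(x, w)
    return sigmoid(z)

x = np.array([1.0, -2.0, 0.5])   # three inputs
w = np.array([0.3, 0.1, -0.4])   # three weights
print(perceptron_forward(x, w, w0=0.2))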
Common Activation Functions
Sigmoid: $g(z) = \dfrac{1}{1 + e^{-z}}$, with $g'(z) = g(z)\,(1 - g(z))$ — tf.nn.sigmoid(z)

Hyperbolic tangent: $g(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with $g'(z) = 1 - g(z)^2$ — tf.nn.tanh(z)

Rectified Linear Unit (ReLU): $g(z) = \max(0, z)$, with $g'(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$ — tf.nn.relu(z)

NOTE: All activation functions are non-linear
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network

What if we wanted to build a neural network to distinguish green vs. red points?

Linear activation functions produce linear decisions no matter the network size. Non-linearities allow us to approximate arbitrarily complex functions.
The Perceptron: Example

We have: $w_0 = 1$ and $W = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$

$$\hat{y} = g\!\left(w_0 + X^T W\right) = g\!\left(1 + \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 3 \\ -2 \end{bmatrix}\right) = g\left(1 + 3x_1 - 2x_2\right)$$

This is just a line in 2D!
The decision boundary is the line $1 + 3x_1 - 2x_2 = 0$ in the $(x_1, x_2)$ plane.

Assume we have the input $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$:

$$\hat{y} = g\big(1 + (3)(-1) - (2)(2)\big) = g(-6) \approx 0.002$$

In general, the sign of $1 + 3x_1 - 2x_2$ tells us which side of the line a point falls on:
• $1 + 3x_1 - 2x_2 < 0 \;\Rightarrow\; \hat{y} < 0.5$
• $1 + 3x_1 - 2x_2 > 0 \;\Rightarrow\; \hat{y} > 0.5$
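A quick sanity check of this arithmetic (a sketch, not from the slides), using a NumPy sigmoid:

import numpy as np

w0, w = 1.0, np.array([3.0, -2.0])
x = np.array([-1.0, 2.0])

z = w0 + np.dot(x, w)              # 1 + 3*(-1) - 2*2 = -6
y_hat = 1.0 / (1.0 + np.exp(-z))   # sigmoid(-6)
print(z, y_hat)                    # -6.0, ~0.0025 (< 0.5, so the point lies on the "negative" side)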
Building Neural Networks with Perceptrons
The Perceptron: Simplified

Writing the weighted sum as $z$, the perceptron simplifies to:

$$z = w_0 + \sum_{j=1}^{m} x_j\, w_j, \qquad \hat{y} = g(z)$$

Inputs → Weights → Sum → Non-Linearity → Output
Multi Output Perceptron

With multiple outputs, every output $i$ gets its own bias and weights:

$$z_i = w_{0,i} + \sum_{j=1}^{m} x_j\, w_{j,i}, \qquad \hat{y}_i = g(z_i)$$
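To make the "weights per output" idea concrete, here is a sketch (not from the slides) of such a fully connected layer written as a custom tf.keras layer; the class name and the default sigmoid activation are illustrative choices:

import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    """A fully connected layer: z = W^T x + b, y = g(z)."""
    def __init__(self, units, activation=tf.nn.sigmoid):
        super().__init__()
        self.units = units
        self.activation = activation

    def build(self, input_shape):
        # one weight per (input, output) pair, plus one bias per output
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal")
        self.b = self.add_weight(shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        z = tf.matmul(inputs, self.w) + self.b
        return self.activation(z)

# Usage: layer = MyDenseLayer(3); y = layer(tf.constant([[1.0, 2.0]]))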
Single Layer Neural Network

With weight matrices $W^{(1)}$ and $W^{(2)}$, a hidden layer of $d_1$ units sits between the inputs and the outputs:

$$z_i = w_{0,i}^{(1)} + \sum_{j=1}^{m} x_j\, w_{j,i}^{(1)} \qquad\qquad \hat{y}_i = g\!\left(w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j)\, w_{j,i}^{(2)}\right)$$

Inputs → Hidden → Final Output
Single Layer Neural Network

Zooming in on one hidden unit, e.g. $z_2$:

$$z_2 = w_{0,2}^{(1)} + \sum_{j=1}^{m} x_j\, w_{j,2}^{(1)} = w_{0,2}^{(1)} + x_1 w_{1,2}^{(1)} + x_2 w_{2,2}^{(1)} + x_3 w_{3,2}^{(1)}$$
Multi Output Perceptron
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))     # m input features
hidden = Dense(d1)(inputs)     # hidden layer with d1 units
outputs = Dense(2)(hidden)     # 2 outputs
model = Model(inputs, outputs)

Inputs → Hidden → Output
Deep Neural Network

Stacking hidden layers gives a deep network. For hidden layer $k$ with $n_{k-1}$ units in the previous layer:

$$z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g\!\left(z_{k-1,j}\right) w_{j,i}^{(k)}$$

Inputs → Hidden → ⋯ → Hidden → Output
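As an illustration (a sketch, not from the slides) of stacking such hidden layers with tf.keras; the layer widths of 64 and the ReLU activation are arbitrary choices:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(2),                       # output layer
])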
Applying Neural Networks
Example Problem

Will I pass this class?

Let's start with a simple two-feature model:

$x_1$ = number of lectures you attend
$x_2$ = hours spent on the final project
Example Problem: Will I pass this class?

[Figure: past students plotted with $x_1$ = number of lectures you attend on one axis and $x_2$ = hours spent on the final project on the other; legend: Pass / Fail. A new student with $x_1 = 4$, $x_2 = 5$ is marked with a "?".]
Feeding the new student's features $x^{(1)} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$ through the network gives a prediction of 0.1, while the actual outcome is 1.
Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions

$$\mathcal{L}\left(f\left(x^{(i)}; W\right),\; y^{(i)}\right)$$

where $f\!\left(x^{(i)}; W\right)$ is the predicted value and $y^{(i)}$ is the actual value.
Empirical Loss

The empirical loss measures the total loss over our entire dataset

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\left(x^{(i)}; W\right),\; y^{(i)}\right)$$

Also known as: objective function, cost function, empirical risk.

Example data:
  x        f(x)   y
  (4, 5)   0.1    1
  (2, 1)   0.8    0
  (5, 8)   0.6    1
  ...      ...    ...
Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1
$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y^{(i)} \log f\!\left(x^{(i)}; W\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - f\!\left(x^{(i)}; W\right)\right) \right]$$

loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(labels=model.y, logits=model.pred) )


Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f\!\left(x^{(i)}; W\right)\right)^2$$

Example data (final grades, in percent):
  x        f(x)   y
  (4, 5)   30     90
  (2, 1)   80     20
  (5, 8)   85     95
  ...      ...    ...

loss = tf.reduce_mean( tf.square(tf.subtract(model.y, model.pred)) )
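As a side note (a sketch, not from the slides), both losses are also available as built-in tf.keras.losses objects in TensorFlow 2.x; the toy values below reuse the numbers from the two tables above, purely to show the API:

import tensorflow as tf

# binary cross entropy on the predicted probabilities vs. the labels
bce = tf.keras.losses.BinaryCrossentropy()
print(bce([1.0, 0.0, 1.0], [0.1, 0.8, 0.6]).numpy())

# mean squared error on the predicted vs. actual final grades
mse = tf.keras.losses.MeanSquaredError()
print(mse([90.0, 20.0, 95.0], [30.0, 80.0, 85.0]).numpy())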


Training Neural Networks
Loss Optimization

We want to find the network weights that achieve the lowest loss

$$W^{*} = \underset{W}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\left(x^{(i)}; W\right),\; y^{(i)}\right) = \underset{W}{\operatorname{argmin}} \; J(W)$$

Remember: $W = \left\{ W^{(0)}, W^{(1)}, \cdots \right\}$
Remember: our loss $J(w_0, w_1)$ is a function of the network weights. Gradient descent on the loss landscape:
• Randomly pick an initial $(w_0, w_1)$
• Compute the gradient, $\frac{\partial J(W)}{\partial W}$
• Take a small step in the opposite direction of the gradient
• Repeat until convergence
Gradient Descent

Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$          weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$             grads = tf.gradients(ys=loss, xs=weights)
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$     weights_new = weights.assign(weights - lr * grads)
5. Return weights
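The snippets in the algorithm above use the TensorFlow 1.x API. As an aside (a sketch, not from the slides), the same loop in TensorFlow 2.x style with tf.GradientTape, using a toy quadratic loss purely for illustration:

import tensorflow as tf

# toy loss just for illustration: J(W) = ||W - 3||^2
def compute_loss(w):
    return tf.reduce_sum(tf.square(w - 3.0))

weights = tf.Variable(tf.random.normal(shape=[2]))   # 1. initialize weights randomly
lr = 0.1                                             # learning rate (eta)

for _ in range(100):                                 # 2. loop (fixed number of steps in this sketch)
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)
    grads = tape.gradient(loss, weights)             # 3. compute gradient dJ(W)/dW
    weights.assign_sub(lr * grads)                   # 4. update W <- W - eta * dJ(W)/dW

print(weights.numpy())                               # 5. return weights (should be close to 3)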
Computing Gradients: Backpropagation

Consider a simple network: input $x$, a hidden activation $z_1$ with weight $w_1$, and an output $\hat{y}$ with weight $w_2$, feeding the loss $J(W)$.

How does a small change in one weight (e.g. $w_2$) affect the final loss $J(W)$? We want $\frac{\partial J(W)}{\partial w_2}$. Let's use the chain rule!

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}$$

Applying the chain rule again, for the earlier weight $w_1$ we also propagate through $z_1$:

$$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Repeat this for every weight in the network, using gradients from later layers.
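To make the chain rule concrete, here is a sketch (not from the slides) for this two-weight network, assuming sigmoid activations and a squared-error loss purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0            # single input and target
w1, w2 = 0.3, -0.2         # the two weights

# forward pass
z1 = w1 * x
h1 = sigmoid(z1)
y_hat = sigmoid(w2 * h1)
J = 0.5 * (y - y_hat) ** 2

# backward pass (chain rule)
dJ_dyhat = -(y - y_hat)                                 # dJ/dy_hat
dyhat_dw2 = y_hat * (1 - y_hat) * h1                    # dy_hat/dw2
dJ_dw2 = dJ_dyhat * dyhat_dw2                           # dJ/dw2 = dJ/dy_hat * dy_hat/dw2

dyhat_dz1 = y_hat * (1 - y_hat) * w2 * h1 * (1 - h1)    # dy_hat/dz1 (through h1 = g(z1))
dz1_dw1 = x                                             # dz1/dw1
dJ_dw1 = dJ_dyhat * dyhat_dz1 * dz1_dw1                 # dJ/dw1 = dJ/dy_hat * dy_hat/dz1 * dz1/dw1

print(dJ_dw1, dJ_dw2)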
Neural Networks in Practice:
Optimization
Training Neural Networks is Difficult

Li et al., “Visualizing the Loss Landscape of Neural Nets.” Dec 2017.
Loss Functions Can Be Difficult to Optimize

Remember: optimization through gradient descent

$$W \leftarrow W - \eta \, \frac{\partial J(W)}{\partial W}$$

How can we set the learning rate $\eta$?
Setting the Learning Rate

• Small learning rates converge slowly and get stuck in false local minima
• Large learning rates overshoot, become unstable, and diverge
• Stable learning rates converge smoothly and avoid local minima
How to deal with this?

Idea 1: Try lots of different learning rates and see what works “just right”

Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape
Adaptive Learning Rates

• Learning rates are no longer fixed
• Can be made larger or smaller depending on:
  • how large the gradient is
  • how fast learning is happening
  • size of particular weights
  • etc.
Adaptive Learning Rate Algorithms
• Momentum     tf.train.MomentumOptimizer     Qian, “On the momentum term in gradient descent learning algorithms.” 1999.
• Adagrad      tf.train.AdagradOptimizer      Duchi et al., “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” 2011.
• Adadelta     tf.train.AdadeltaOptimizer     Zeiler, “ADADELTA: An Adaptive Learning Rate Method.” 2012.
• Adam         tf.train.AdamOptimizer         Kingma & Ba, “Adam: A Method for Stochastic Optimization.” 2014.
• RMSProp      tf.train.RMSPropOptimizer

Additional details: http://ruder.io/optimizing-gradient-descent/
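As an aside (a sketch, not from the slides), the TensorFlow 2.x equivalents live under tf.keras.optimizers and can be swapped with a one-line change when compiling a model; the tiny model below is only there to make the example self-contained:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])

# swap in Adam, or e.g. SGD(momentum=0.9), Adagrad(), Adadelta(), RMSprop()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=optimizer, loss="binary_crossentropy")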
Neural Networks in Practice:
Mini-batches
Gradient Descent

Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights

Computing the gradient over the entire dataset can be very computationally expensive!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick single data point $i$
4.     Compute gradient, $\frac{\partial J_i(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights

Easy to compute, but very noisy (stochastic)!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick batch of $B$ data points
4.     Compute gradient, $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights

Fast to compute and a much better estimate of the true gradient!
Mini-batches while training

More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates

Mini-batches lead to fast training!
• Can parallelize computation and achieve significant speed increases on GPUs
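A minimal sketch (not from the slides) of mini-batch training with tf.data; the toy data, model, and batch size of 32 are arbitrary illustrations:

import tensorflow as tf

# toy dataset: 1000 examples with 2 features and a binary label
x = tf.random.normal([1000, 2])
y = tf.cast(tf.reduce_sum(x, axis=1) > 0, tf.float32)

dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=5)   # each gradient step uses one mini-batch of 32 examples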
Neural Networks in Practice:
Overfitting
The Problem of Overfitting

Underfitting: model does not have capacity to fully learn the data
Ideal fit
Overfitting: too complex, extra parameters, does not generalize well
Regularization

What is it?
A technique that constrains our optimization problem to discourage complex models

Why do we need it?
To improve generalization of our model on unseen data
Regularization 1: Dropout
• During training, randomly set some activations to 0
• Typically ‘drop’ 50% of activations in a layer: tf.keras.layers.Dropout(rate=0.5)
• Forces the network to not rely on any single node
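For illustration, a sketch (not from the slides) of inserting dropout between dense layers; the layer sizes are arbitrary and the 0.5 rate follows the bullet above:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),   # randomly zero 50% of activations, during training only
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])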
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit
• Monitor the training loss and the testing loss over training iterations
• Stop training at the point where the testing loss begins to increase even though the training loss keeps decreasing
• Before that point the model is under-fitting; after it, the model is over-fitting
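In tf.keras this pattern is available as a callback; a sketch (not from the slides), where the patience value and validation split are arbitrary choices:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",           # watch the held-out (testing) loss
    patience=3,                   # stop after 3 epochs with no improvement
    restore_best_weights=True)    # roll back to the weights from the best epoch

# Usage: model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])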
Core Foundation Review

The Perceptron
• Structural building blocks
• Nonlinear activation functions

Neural Networks
• Stacking perceptrons to form neural networks
• Optimization through backpropagation

Training in Practice
• Adaptive learning
• Batching
• Regularization
Questions?
