Loss Functions and Optimization
Fei-Fei Li & Justin Johnson & Serena Yeung, CS231n Lecture 3, April 11, 2017
Administrative
Assignment 1 is released:
http://cs231n.github.io/assignments2017/assignment1/
Recall from last time: Challenges of recognition
Viewpoint, Illumination, Deformation, Occlusion
Recall from last time: data-driven approach, kNN
[Figure: decision boundaries of a 1-NN classifier vs. a 5-NN classifier on training and test points.]
Recall from last time: Linear Classifier
f(x,W) = Wx + b
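For concreteness, a tiny numpy sketch of this score function; the CIFAR-10-style shapes here are illustrative assumptions, not the assignment's reference code:

```python
import numpy as np

def f(x, W, b):
    """Linear classifier: one score per class.
    x: (3072,) flattened 32x32x3 image, W: (10, 3072) weights, b: (10,) biases."""
    return W.dot(x) + b
```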
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
Suppose: 3 training examples, 3 classes. With some W the scores f(x, W) = Wx + b are:

        cat    3.2    1.3    2.2
        car    5.1    4.9    2.5
        frog  -1.7    2.0   -3.1
A loss function tells how good our current classifier is. Given a dataset of examples {(x_i, y_i)}, where x_i is an image and y_i is an (integer) label, the loss over the dataset is the average of the per-example losses:

L = (1/N) Σ_i L_i(f(x_i, W), y_i)
Multiclass SVM loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
The max(0, ·) form of this loss is called the hinge loss.
Plugging in the scores above:

cat:  L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
          = max(0, 2.9) + max(0, -3.9)
          = 2.9 + 0 = 2.9

car:  L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
          = max(0, -2.6) + max(0, -1.9)
          = 0 + 0 = 0

frog: L_3 = max(0, 2.2 + 3.1 + 1) + max(0, 2.5 + 3.1 + 1)
          = max(0, 6.3) + max(0, 6.6)
          = 6.3 + 6.6 = 12.9

The loss over the full dataset is the average: L = (2.9 + 0 + 12.9) / 3 = 5.27
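A minimal numpy sketch of the per-example hinge loss that reproduces the cat column above (the function name is mine, not the assignment's):

```python
import numpy as np

def svm_loss_from_scores(s, y):
    """Multiclass SVM (hinge) loss given a score vector s and correct class y."""
    margins = np.maximum(0, s - s[y] + 1)
    margins[y] = 0                 # the sum runs over j != y_i only
    return margins.sum()

print(svm_loss_from_scores(np.array([3.2, 5.1, -1.7]), 0))  # cat column -> 2.9
```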
E.g. suppose that we found a W such that L = 0. Is this W unique? No: 2W also gives L = 0, as the next example shows.
With the same scores as before, check the car example (second column):

Before:
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0 = 0

With W twice as large (all scores double):
L_2 = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
    = max(0, -6.2) + max(0, -4.8)
    = 0 + 0 = 0
Data loss: model predictions should match the training data:

L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i)
Regularization: the model should be simple, so it works on test data:

L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
Occam's Razor: "Among competing hypotheses, the simplest is the best." (William of Ockham, 1285-1347)
Regularization

λ = regularization strength (a hyperparameter). In common use:
- L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2
- L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
- Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|)
- Max norm regularization (might see later)
- Dropout (will see later)
- Fancier: batch normalization, stochastic depth
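As a minimal sketch of the full loss (data loss plus L2 regularization) over a batch; the shapes and the name `reg` (playing the role of λ) are my conventions:

```python
import numpy as np

def svm_full_loss(W, X, y, reg):
    """Average multiclass SVM loss plus L2 regularization.
    X: (N, D) data, y: (N,) integer labels, W: (D, C) weights, reg: lambda."""
    N = X.shape[0]
    scores = X.dot(W)                              # (N, C) class scores
    correct = scores[np.arange(N), y][:, None]     # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(N), y] = 0                   # don't count j == y_i
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)                 # lambda * sum of squared weights
    return data_loss + reg_loss
```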
L2 Regularization (Weight Decay)

Example: for x = [1, 1, 1, 1], the weights w_1 = [1, 0, 0, 0] and w_2 = [0.25, 0.25, 0.25, 0.25] produce the same score (w^T x = 1), but L2 regularization prefers the spread-out w_2.
Softmax Classifier (Multinomial Logistic Regression)

Interpret the scores as unnormalized log probabilities of the classes:

P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j},  where s = f(x_i, W)

We want to maximize the log likelihood of the correct class, or, for a loss function, minimize the negative log likelihood:

L_i = -log P(Y = y_i | X = x_i) = -log( e^{s_{y_i}} / Σ_j e^{s_j} )

Example (unnormalized log probabilities):
cat   3.2
car   5.1
frog -1.7
Exponentiate to get unnormalized probabilities, then normalize so they sum to one:

        scores    exp (unnormalized probs)    normalized probs
cat      3.2          24.5                        0.13
car      5.1         164.0                        0.87
frog    -1.7           0.18                       0.00

L_i = -log(0.13)
Q2: Usually at initialization W is small, so all s ≈ 0. What is the loss?
A: With all scores equal, every class gets probability 1/C, so L_i = -log(1/C) = log C. Checking this value at initialization is a useful sanity check.
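A minimal numpy sketch of the softmax loss over a batch; the shapes and the max-subtraction stability trick are standard practice, not taken from the slides:

```python
import numpy as np

def softmax_loss(W, X, y):
    """Average cross-entropy (softmax) loss.
    X: (N, D) data, y: (N,) integer labels, W: (D, C) weights."""
    scores = X.dot(W)                              # (N, C) unnormalized log probs
    scores -= scores.max(axis=1, keepdims=True)    # subtract max for stability
    probs = np.exp(scores)                         # unnormalized probabilities
    probs /= probs.sum(axis=1, keepdims=True)      # normalize to probabilities
    N = X.shape[0]
    return -np.log(probs[np.arange(N), y]).mean()  # -log P(correct class)
```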
Softmax vs. SVM

Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
SVM:     L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)

The SVM loss is zero once the correct class beats the others by the margin and then stops changing; the softmax loss always wants the correct class's probability pushed higher.
Recap:
- We have a score function s = f(x; W) = Wx
- We have a loss function (Softmax or SVM), plus regularization for the full loss:
  L = (1/N) Σ_i L_i + λ R(W)

How do we find the best W?
Optimization
Strategy #1: A first, very bad, idea: random search
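A minimal sketch of random search; `X_train`, `Y_train`, and the loss function `L` are assumed to be defined elsewhere, and the CIFAR-10-sized shape is illustrative:

```python
import numpy as np

# Random search: try random weight matrices, keep the best one seen so far.
# X_train, Y_train, and the loss function L are assumed to exist already.
bestloss = float('inf')
bestW = None
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001   # a random candidate W
    loss = L(X_train, Y_train, W)            # loss over the training set
    if loss < bestloss:
        bestloss, bestW = loss, W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
```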
Let's see how well this works on the test set...
Strategy #2: Follow the slope

In one dimension, the derivative of a function is

df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.
current W:
[0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

gradient dW:
[?, ?, ?, ?, ?, ?, ?, ?, ?, ...]
To evaluate the gradient numerically, perturb one dimension at a time. For example, add h = 0.0001 to the first dimension of W and recompute the loss; it drops from 1.25347 to 1.25322, giving

dW[0] ≈ (1.25322 - 1.25347) / 0.0001 = -2.5

Then repeat for the second dimension, the third dimension, and so on: the finite-difference approximation costs one loss evaluation per dimension.
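A minimal numpy sketch of the numerical gradient; the function name, the centered-difference formula, and the default h are my choices rather than the slides':

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Estimate df/dW at W with centered finite differences.
    f maps an array W to a scalar loss; W is perturbed in place and restored."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h
        fp = f(W)                        # loss at W + h in this dimension
        W[ix] = old - h
        fm = f(W)                        # loss at W - h
        W[ix] = old                      # restore the original entry
        grad[ix] = (fp - fm) / (2 * h)
        it.iternext()
    return grad
```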
This is silly. The loss is just a function of W, and what we want is its gradient ∇_W L.

Calculus! Rather than perturbing every dimension, use calculus to derive an analytic expression for the gradient.
current W:
[0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

dW = ... (some function of the data and W)

gradient dW:
[-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]
In summary:
- Numerical gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone

In practice: always use the analytic gradient, but check your implementation against the numerical gradient. This is called a gradient check; a sketch follows.
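A toy gradient check, reusing the `numerical_gradient` sketch defined above; the objective f(w) = Σ w² and its analytic gradient 2w are illustrative:

```python
import numpy as np

# Compare the analytic gradient of f(W) = sum(W^2), which is 2W,
# against the centered-difference estimate.
W = np.random.randn(3, 4)
num = numerical_gradient(lambda w: np.sum(w ** 2), W)
ana = 2 * W
print(np.max(np.abs(num - ana)))  # should be tiny, ~1e-9
```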
Gradient Descent
[Figure: loss contours over (W_1, W_2); starting from the original W, repeated steps in the negative gradient direction walk downhill toward the minimum.]
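A minimal, self-contained sketch of vanilla gradient descent; the quadratic objective and step size are illustrative assumptions:

```python
def gradient_descent(grad_fun, w, step_size=0.1, num_steps=100):
    """Repeatedly step along the negative gradient direction."""
    for _ in range(num_steps):
        w = w - step_size * grad_fun(w)  # parameter update
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
print(w_star)  # converges toward 3
```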
Stochastic Gradient Descent (SGD)

L(W) = (1/N) Σ_{i=1}^N L_i(x_i, y_i, W) + λ R(W)
∇_W L(W) = (1/N) Σ_{i=1}^N ∇_W L_i(x_i, y_i, W) + λ ∇_W R(W)

The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
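A self-contained minibatch SGD sketch for a linear softmax classifier; the synthetic data, shapes, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 1000, 50, 10                             # examples, features, classes
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.0001 * rng.standard_normal((D, C))

step_size, batch_size = 1e-2, 128
for it in range(200):
    idx = rng.integers(0, N, size=batch_size)      # sample a minibatch
    Xb, yb = X[idx], y[idx]
    scores = Xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)    # numeric stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()
    if it % 50 == 0:
        print(it, loss)
    dscores = probs
    dscores[np.arange(batch_size), yb] -= 1        # grad of softmax loss wrt scores
    dW = Xb.T.dot(dscores) / batch_size
    W -= step_size * dW                            # parameter update
```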
Interactive Web Demo:
http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Aside: Image Features
Image Features: Motivation

A linear classifier on raw pixels often cannot separate the classes, but after a suitable feature transform, e.g. f(x, y) = (r(x, y), θ(x, y)) to polar coordinates, the same points can become linearly separable.
Example: Color Histogram

Quantize the hue spectrum into bins and, for each pixel, add +1 to its bin; the bin counts form the feature vector.
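A toy numpy sketch of that counting step; the function name, hue range, and bin count are illustrative assumptions:

```python
import numpy as np

def hue_histogram(hue, nbins=16):
    """Toy color-histogram feature: count pixels per hue bin.
    hue: 2D array of per-pixel hue values in [0, 1)."""
    bins = (hue * nbins).astype(int) % nbins      # quantize hue, +1 per pixel
    return np.bincount(bins.ravel(), minlength=nbins)
```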
Example: Histogram of Oriented Gradients (HoG)

Divide the image into 8x8 pixel regions; within each region, quantize the edge direction into 9 bins.

Example: a 320x240 image gets divided into 40x30 bins; each bin holds 9 numbers, so the feature vector has 30*40*9 = 10,800 numbers.

Lowe, "Object recognition from local scale-invariant features," ICCV 1999
Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005
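A toy HoG-style descriptor in numpy: per-cell histograms of gradient orientation, with no block normalization, so a sketch rather than the full Dalal-Triggs pipeline:

```python
import numpy as np

def hog_features(img, cell=8, nbins=9):
    """Toy HoG-style feature: magnitude-weighted orientation histogram per cell.
    img: 2D grayscale array with height/width divisible by `cell`."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                            # edge strength
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180        # unsigned orientation
    bins = (ang / (180 / nbins)).astype(int) % nbins  # quantize into 9 bins
    H, W = img.shape
    feats = np.zeros((H // cell, W // cell, nbins))
    for i in range(H // cell):
        for j in range(W // cell):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            feats[i, j] = np.bincount(b, weights=m, minlength=nbins)
    return feats.ravel()  # e.g. a 240x320 image -> 30*40*9 = 10,800 numbers
```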
Example: Bag of Words

Step 1: Build the codebook: extract random patches from training images, then cluster the patches to form a codebook of "visual words".
Step 2: Encode each image as a histogram over the visual words.

Fei-Fei and Perona, "A Bayesian hierarchical model for learning natural scene categories," CVPR 2005
Image features vs ConvNets

Feature pipeline: image → fixed feature extraction → linear classifier f → 10 numbers giving scores for classes; during training, only the classifier is learned.
ConvNet: image → convolutional network → 10 numbers giving scores for classes; during training, the whole network is learned end to end.
Next time:
Backpropagation