Experimental Analysis of SGD Variants

Siddharth Prusty, Sudeep Raja Putta, Steven Yin

COMS 4995 Course Project

December 3rd, 2019

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Motivation for SGD Variants

SGD is widely used in practice, by both academics and practitioners

Key SGD subtleties:
Convergence guarantees (in most cases) only exist for outputting the average of the iterates
The sampling scheme in each iteration of an epoch should be uniformly random, with replacement, from the n points

What happens in practice: the above subtleties are often ignored, and the following variants are used instead:
Output the last iterate of SGD
Permute all the data in an epoch and read it sequentially, without replacement

These discrepancies can lead to arbitrarily bad suboptimality rates

Motivation: Can we keep these 'discrepancies' and still expect good theoretical results?

Proposed SGD Variants

SGD without replacement (SGDo)
Paper: Dheeraj Nagaraj, Praneeth Netrapalli, Prateek Jain; SGD without Replacement: Sharper Rates for General Smooth Convex Functions; ICML 2019
Contribution: Gives a sequence of learning rates η_t (and output rules) that theoretically ensures sampling without replacement achieves a good suboptimality rate

SGD Last-Iterate Optimal (liSGD)
Paper: Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli; Making the Last Iterate of SGD Information-Theoretically Optimal; COLT 2019
Contribution: Gives a sequence of learning rates η_t that theoretically ensures outputting the last iterate achieves a good suboptimality rate

Our Project: To experimentally validate the claims of both papers on real data sets

SGD without Replacement (SGDo)

Claims of the authors:

Case 1.1: For smooth, strongly convex functions with constant stepsize $\eta_{k,i} = \frac{4\log(nK)}{\alpha nK}$ and K large, the suboptimality rate is $O(1/K^2)$, better than the $O(1/K)$ rate for SGD.
Case 1.2: For smooth, strongly convex functions with constant stepsize $\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{4\log(nK)}{\alpha nK}\right\}$ and K arbitrary, the suboptimality rate is $O(1/K)$, the same as for SGD.
Case 1.3: For smooth, L-Lipschitz convex functions with constant stepsize $\eta_{k,i} = \eta = \min\left\{\frac{2}{\beta}, \frac{D}{L\sqrt{Kn}}\right\}$, the suboptimality rate is $O(1/\sqrt{K})$, the same as for SGD.
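A minimal Python sketch of SGDo as used for these claims: each epoch visits the data in a fresh random permutation, with a constant stepsize and an optional projection. The names (`sgdo`, `grad_i`, `project`) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sgdo(grad_i, w0, n, K, eta, project=None, rng=None):
    """SGD without replacement: K epochs, each over a fresh random
    permutation of the n examples.

    grad_i(w, i) -- stochastic (sub)gradient of the i-th example at w
    eta(k, i)    -- stepsize at step i of epoch k (constant in Cases 1.1-1.3)
    project      -- optional Euclidean projection onto the constraint set
    """
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    epoch_starts = [w.copy()]              # x_0^k: iterate at the start of epoch k
    for k in range(1, K + 1):
        perm = rng.permutation(n)          # sample without replacement within the epoch
        for i, idx in enumerate(perm):
            w = w - eta(k, i) * grad_i(w, idx)
            if project is not None:
                w = project(w)
        epoch_starts.append(w.copy())
    return w, epoch_starts
```

For Case 1.2, for example, the constant stepsize would be `eta = lambda k, i: min(2 / beta, 4 * np.log(n * K) / (alpha * n * K))`, with `alpha` and `beta` estimated as in the experimental setup.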

SGD Last-Iterate Optimal (liSGD)

Claims of the authors:

Case 2.1: For L-Lipschitz convex functions with stepsize defined stage-wise as $\eta_t = \frac{C \cdot 2^{-i}}{\sqrt{T}}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, the suboptimality rate is $O(1/\sqrt{T})$, the same as for SGD.
Case 2.2: For L-Lipschitz and α-strongly convex functions with stepsize defined stage-wise as $\eta_t = \frac{2^{-i}}{\alpha t}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, the suboptimality rate is $O(1/T)$, the same as for SGD.
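A minimal sketch of the stage-wise schedule, assuming phase boundaries of the form $T_i = T - T/2^i$ (each phase halves the remaining horizon); this phase construction, and the names below, are our reading for illustration only, not necessarily the papers' exact definition.

```python
import numpy as np

def lisgd_stepsize(t, T, C=1.0, alpha=None, k=None):
    """Stage-wise stepsize eta_t for liSGD.

    Assumed phases: phase i covers T_i < t <= T_{i+1}, with T_i = T - T / 2**i,
    i = 0, ..., k (our illustrative construction).

    alpha is None  -> Case 2.1: eta_t = C * 2**(-i) / sqrt(T)
    alpha given    -> Case 2.2: eta_t = 2**(-i) / (alpha * t)
    """
    if k is None:
        k = int(np.ceil(np.log2(T)))
    i = 0                                   # find the phase index containing step t
    while i < k and t > T - T / 2 ** (i + 1):
        i += 1
    if alpha is None:
        return C * 2.0 ** (-i) / np.sqrt(T)
    return 2.0 ** (-i) / (alpha * t)
```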

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Experimental Setup

Dataset Details: Diabetes regression dataset
n = 442 data points
10 dimensions, feature range [−0.2, 0.2]
Target range: [25, 346]

Optimization Function: Sum-of-squares loss with a linear predictor, together with a mix of $\ell_2$ and $\ell_1$ regularization:

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \left(w^\top x_i - y_i\right)^2 + \lambda \frac{\|w\|_2^2}{2} + \gamma \|w\|_1$$

Other Considerations:
We constrain the optimization to the region $\{w : \|w\|_2^2 \le 10\}$.
Whenever the function is smooth, we compute the Hessian to estimate α and β, and we estimate L and σ² in our code
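A minimal sketch of this setup in Python, assuming scikit-learn's diabetes loader; α and β are taken as the extreme eigenvalues of the Hessian of the smooth part (the $\gamma\|w\|_1$ term is nonsmooth and excluded from that estimate). Function names here are ours, not from the papers' code.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)      # n = 442 points, d = 10 features
n, d = X.shape
lam, gam, R2 = 10.0, 0.0, 10.0             # one experimental setting (lambda = 10, gamma = 0); R2 = radius^2

def F(w):
    """Objective: mean squared error + (lam/2)*||w||_2^2 + gam*||w||_1."""
    return np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w + gam * np.abs(w).sum()

def grad_i(w, i):
    """(Sub)gradient of the i-th summand of F."""
    return 2.0 * (X[i] @ w - y[i]) * X[i] + lam * w + gam * np.sign(w)

def project(w):
    """Euclidean projection onto {w : ||w||_2^2 <= R2}."""
    sq = w @ w
    return w if sq <= R2 else w * np.sqrt(R2 / sq)

# Strong convexity (alpha) and smoothness (beta) constants of the smooth part,
# from the Hessian of (1/n)*||Xw - y||^2 + (lam/2)*||w||^2.
H = 2.0 * X.T @ X / n + lam * np.eye(d)
eigs = np.linalg.eigvalsh(H)
alpha, beta = eigs.min(), eigs.max()
```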

SGDo vs. Vanilla SGD (Case 1.2)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 10, γ = 0)

Details:
$\eta_{k,i} = \frac{4\log(nK)}{\alpha nK}$ for SGDo, while $\eta_{k,i} = \frac{\log(nK)}{\alpha nK}$ for vanilla SGD
Tail average: $\hat{x} = \frac{1}{K - \lceil K/2 \rceil + 1}\sum_{k=\lceil K/2 \rceil}^{K} x_0^k$

Comments:
The loss of the last iterate of SGDo stays high, while the loss of the tail average converges to the optimal loss
The SGDo tail average and the vanilla SGD last iterate both reach the optimum within a small number of epochs
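A short sketch of the tail average as written above, assuming `epoch_starts[k-1]` holds $x_0^k$ (the iterate recorded at the start of epoch k, as in the SGDo sketch earlier); the lower limit $\lceil K/2 \rceil$ is how we read the normalization $K - \lceil K/2 \rceil + 1$.

```python
import math
import numpy as np

def tail_average(epoch_starts, K):
    """hat{x} = (1 / (K - ceil(K/2) + 1)) * sum_{k = ceil(K/2)}^{K} x_0^k."""
    start = math.ceil(K / 2)
    tail = epoch_starts[start - 1:K]       # x_0^{ceil(K/2)}, ..., x_0^K
    return np.mean(tail, axis=0)
```
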
SGDo vs. Vanilla SGD (Case 1.1)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 10, γ = 0)

Details:
$\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{4\log(nK)}{\alpha nK}\right\}$ for SGDo, while $\eta_{k,i} = \frac{\log(nK)}{\alpha nK}$ for vanilla SGD
Tail average: $\hat{x} = \frac{1}{K - \lceil K/2 \rceil + 1}\sum_{k=\lceil K/2 \rceil}^{K} x_0^k$

Comments:
SGDo converges at a rate of $1/K^2$, whereas vanilla SGD converges at $1/K$. This difference turned out to be very dramatic in our experiments

SGDo vs. Vanilla SGD (Case 1.3)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 0.01, γ = 0)

Details:
$\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{D}{L\sqrt{Kn}}\right\}$ for SGDo and $\eta_t = \frac{1}{\beta + c\sqrt{nK}}$ for vanilla SGD
Simple averaging of the iterates

Comments:
SGDo eventually matches the performance of vanilla SGD
The variance of SGDo appears to be lower than that of vanilla SGD

liSGD vs. Vanilla SGD (Case 2.1)

Scenario: Minimizing an L-Lipschitz convex function (λ = 10, γ = 0)

Details:
Stage-wise, choose $\eta_t = \frac{C \cdot 2^{-i}}{\sqrt{T}}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, for liSGD (see the sketch below)
$\eta = \frac{D}{\sqrt{\sigma^2 + L^2}\,\sqrt{T}}$ for vanilla SGD

Comments:
Outputting the last iterate of vanilla SGD does not converge
Outputting the last iterate of liSGD eventually converges to the SGD average iterate
The kinks in the liSGD curve correspond to the learning-rate changes
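A minimal sketch of how such a run could be structured: i.i.d. with-replacement sampling, the stage-wise schedule from the earlier `lisgd_stepsize` sketch, and `grad_i`/`project` as in the setup sketch. This is our reconstruction for illustration, not the authors' code; `C`, `D`, `sigma2`, and `L` below are placeholders for constants estimated in our code.

```python
import numpy as np

def sgd_with_replacement(grad_i, w0, n, T, stepsize, project=None, rng=None):
    """SGD with i.i.d. with-replacement sampling; returns the last iterate
    and the running uniform average of the iterates for comparison."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    avg = w0.copy()
    for t in range(1, T + 1):
        idx = rng.integers(n)              # draw one example with replacement
        w = w - stepsize(t) * grad_i(w, idx)
        if project is not None:
            w = project(w)
        avg += (w - avg) / t               # running average of the iterates
    return w, avg

# Case 2.1 schedules (placeholders for C, D, sigma2, L):
# eta_lisgd   = lambda t: lisgd_stepsize(t, T, C=1.0)
# eta_vanilla = lambda t: D / (np.sqrt(sigma2 + L ** 2) * np.sqrt(T))
```
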
liSGD vs. Vanilla SGD (Case 2.2)

Scenario: Minimizing an L-Lipschitz & α-strongly convex function (λ = 1, γ = 10)

Details:
Stage-wise, choose $\eta_t = \frac{2^{-i}}{\alpha t}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, for liSGD (see the short snippet below)
$\eta_t = \frac{1}{\alpha(t+1)}$ for vanilla SGD

Comments:
liSGD does better than the average iterate of vanilla SGD
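For completeness, a short usage snippet for the two Case 2.2 schedules, assuming `alpha` from the setup sketch and the hypothetical `lisgd_stepsize` helper defined earlier:

```python
# Case 2.2 stepsizes: stage-wise decay for liSGD vs. the standard 1/(alpha*(t+1)) decay
eta_lisgd   = lambda t: lisgd_stepsize(t, T, alpha=alpha)   # 2**(-i) / (alpha * t)
eta_vanilla = lambda t: 1.0 / (alpha * (t + 1))
```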

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Conclusion

SGDo:
Using a novel step-size sequence, it provides theoretical guarantees on convergence when the iterates are sampled without replacement (smoothness required).
SGDo loses the last-iterate guarantee that vanilla SGD enjoys for strongly convex and smooth functions, and hence may not be heavily used in practice.

liSGD:
Using a novel step-size sequence, it provides theoretical guarantees on the suboptimality rate when outputting the last iterate.
The most interesting results are in the non-(strongly convex and smooth) scenarios.
Our experimental results suggest that it could be practical.

An interesting future direction:
IID with-replacement sampling + last iterate.
This could give a complete theoretical result for how SGD is used in practice.

THANK YOU!
