Experimental Analysis of SGD Variants

Siddharth Prusty, Sudeep Raja Putta, Steven Yin

COMS 4995 Course Project

December 3rd, 2019

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Motivation for SGD Variants

SGD is widely used in practice, by both academics and practitioners

Key SGD subtleties:
Convergence guarantees (in most cases) only exist for outputting the average of the iterates
The sampling scheme in each iteration of an epoch should be uniformly random, with replacement, from the n points

What happens in practice: the above subtleties are often ignored, and the following variants are used instead:
Output the last iterate of SGD
Permute all the data in an epoch and read it sequentially, without replacement

These discrepancies can lead to arbitrarily bad suboptimality rates

Motivation: Can we keep these 'discrepancies' and still expect good theoretical results?

Proposed SGD Variants

SGD without replacement (SGDo)
Paper: Dheeraj Nagaraj, Praneeth Netrapalli, Prateek Jain; SGD without Replacement: Sharper Rates for General Smooth Convex Functions; ICML 2019
Contribution: Gives a sequence of learning rates η_t (and output rules) that theoretically ensures sampling without replacement achieves a good suboptimality rate

SGD Last-Iterate Optimal (liSGD)
Paper: Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli; Making the Last Iterate of SGD Information-Theoretically Optimal; COLT 2019
Contribution: Gives a sequence of learning rates η_t that theoretically ensures outputting the last iterate achieves a good suboptimality rate

Our Project: To experimentally validate the claims of both papers on real data sets

SGD without Replacement (SGDo)

Claims of the authors:

Case 1.1: For smooth, strongly convex functions with constant stepsize $\eta_{k,i} = \frac{4\log(nK)}{\alpha nK}$ and K large, the suboptimality rate is $O(1/K^2)$, better than the $O(1/K)$ rate for SGD.
Case 1.2: For smooth, strongly convex functions with constant stepsize $\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{4\log(nK)}{\alpha nK}\right\}$ and K arbitrary, the suboptimality rate is $O(1/K)$, the same as for SGD.
Case 1.3: For smooth, L-Lipschitz convex functions with constant stepsize $\eta_{k,i} = \eta = \min\left\{\frac{2}{\beta}, \frac{D}{L\sqrt{Kn}}\right\}$, the suboptimality rate is $O(1/\sqrt{K})$, the same as for SGD.
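A minimal Python sketch of SGDo as used for these claims: each epoch visits the data in a fresh random permutation, with a constant stepsize and an optional projection. The names (`sgdo`, `grad_i`, `project`) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sgdo(grad_i, w0, n, K, eta, project=None, rng=None):
    """SGD without replacement: K epochs, each over a fresh random
    permutation of the n examples.

    grad_i(w, i) -- stochastic (sub)gradient of the i-th example at w
    eta(k, i)    -- stepsize at step i of epoch k (constant in Cases 1.1-1.3)
    project      -- optional Euclidean projection onto the constraint set
    """
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    epoch_starts = [w.copy()]              # x_0^k: iterate at the start of epoch k
    for k in range(1, K + 1):
        perm = rng.permutation(n)          # sample without replacement within the epoch
        for i, idx in enumerate(perm):
            w = w - eta(k, i) * grad_i(w, idx)
            if project is not None:
                w = project(w)
        epoch_starts.append(w.copy())
    return w, epoch_starts
```

For Case 1.2, for example, the constant stepsize would be `eta = lambda k, i: min(2 / beta, 4 * np.log(n * K) / (alpha * n * K))`, with `alpha` and `beta` estimated as in the experimental setup.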

SGD Last-Iterate Optimal (liSGD)

Claims of the authors:

Case 2.1: For L-Lipschitz convex functions with stepsize defined stage-wise as $\eta_t = \frac{C \cdot 2^{-i}}{\sqrt{T}}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, the suboptimality rate is $O(1/\sqrt{T})$, the same as for SGD.
Case 2.2: For L-Lipschitz and α-strongly convex functions with stepsize defined stage-wise as $\eta_t = \frac{2^{-i}}{\alpha t}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, the suboptimality rate is $O(1/T)$, the same as for SGD.
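A minimal sketch of the stage-wise schedule, assuming phase boundaries of the form $T_i = T - T/2^i$ (each phase halves the remaining horizon); this phase construction, and the names below, are our reading for illustration only, not necessarily the papers' exact definition.

```python
import numpy as np

def lisgd_stepsize(t, T, C=1.0, alpha=None, k=None):
    """Stage-wise stepsize eta_t for liSGD.

    Assumed phases: phase i covers T_i < t <= T_{i+1}, with T_i = T - T / 2**i,
    i = 0, ..., k (our illustrative construction).

    alpha is None  -> Case 2.1: eta_t = C * 2**(-i) / sqrt(T)
    alpha given    -> Case 2.2: eta_t = 2**(-i) / (alpha * t)
    """
    if k is None:
        k = int(np.ceil(np.log2(T)))
    i = 0                                   # find the phase index containing step t
    while i < k and t > T - T / 2 ** (i + 1):
        i += 1
    if alpha is None:
        return C * 2.0 ** (-i) / np.sqrt(T)
    return 2.0 ** (-i) / (alpha * t)
```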

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Experimental Setup

Dataset Details: Diabetes regression dataset
n = 442 data points
10 dimensions, feature range [−0.2, 0.2]
Target range: [25, 346]

Optimization Function: Sum-of-squares loss with a linear predictor, together with a mix of $\ell_2$ and $\ell_1$ regularization:

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \left(w^\top x_i - y_i\right)^2 + \lambda \frac{\|w\|_2^2}{2} + \gamma \|w\|_1$$

Other Considerations:
We constrain the optimization to the region $\{w : \|w\|_2^2 \le 10\}$.
Whenever the function is smooth, we compute the Hessian to estimate α and β, and we estimate L and σ² in our code
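A minimal sketch of this setup in Python, assuming scikit-learn's diabetes loader; α and β are taken as the extreme eigenvalues of the Hessian of the smooth part (the $\gamma\|w\|_1$ term is nonsmooth and excluded from that estimate). Function names here are ours, not from the papers' code.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)      # n = 442 points, d = 10 features
n, d = X.shape
lam, gam, R2 = 10.0, 0.0, 10.0             # one experimental setting (lambda = 10, gamma = 0); R2 = radius^2

def F(w):
    """Objective: mean squared error + (lam/2)*||w||_2^2 + gam*||w||_1."""
    return np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w + gam * np.abs(w).sum()

def grad_i(w, i):
    """(Sub)gradient of the i-th summand of F."""
    return 2.0 * (X[i] @ w - y[i]) * X[i] + lam * w + gam * np.sign(w)

def project(w):
    """Euclidean projection onto {w : ||w||_2^2 <= R2}."""
    sq = w @ w
    return w if sq <= R2 else w * np.sqrt(R2 / sq)

# Strong convexity (alpha) and smoothness (beta) constants of the smooth part,
# from the Hessian of (1/n)*||Xw - y||^2 + (lam/2)*||w||^2.
H = 2.0 * X.T @ X / n + lam * np.eye(d)
eigs = np.linalg.eigvalsh(H)
alpha, beta = eigs.min(), eigs.max()
```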

SGDo vs. Vanilla SGD (Case 1.2)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 10, γ = 0)

Details:
$\eta_{k,i} = \frac{4\log(nK)}{\alpha nK}$ for SGDo, while $\eta_{k,i} = \frac{\log(nK)}{\alpha nK}$ for vanilla SGD
Tail average: $\hat{x} = \frac{1}{K - \lceil K/2 \rceil + 1}\sum_{k=\lceil K/2 \rceil}^{K} x_0^k$

Comments:
The loss of the last iterate of SGDo stays high, while the loss of the tail average converges to the optimal loss
The SGDo tail average and the vanilla SGD last iterate both reach the optimum within a small number of epochs
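A short sketch of the tail average as written above, assuming `epoch_starts[k-1]` holds $x_0^k$ (the iterate recorded at the start of epoch k, as in the SGDo sketch earlier); the lower limit $\lceil K/2 \rceil$ is how we read the normalization $K - \lceil K/2 \rceil + 1$.

```python
import math
import numpy as np

def tail_average(epoch_starts, K):
    """hat{x} = (1 / (K - ceil(K/2) + 1)) * sum_{k = ceil(K/2)}^{K} x_0^k."""
    start = math.ceil(K / 2)
    tail = epoch_starts[start - 1:K]       # x_0^{ceil(K/2)}, ..., x_0^K
    return np.mean(tail, axis=0)
```
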
SGDo vs. Vanilla SGD (Case 1.1)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 10, γ = 0)

Details:
$\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{4\log(nK)}{\alpha nK}\right\}$ for SGDo, while $\eta_{k,i} = \frac{\log(nK)}{\alpha nK}$ for vanilla SGD
Tail average: $\hat{x} = \frac{1}{K - \lceil K/2 \rceil + 1}\sum_{k=\lceil K/2 \rceil}^{K} x_0^k$

Comments:
SGDo converges at a rate of $1/K^2$, whereas vanilla SGD converges at $1/K$. This difference turned out to be very dramatic in our experiments

SGDo vs. Vanilla SGD (Case 1.3)

Scenario: Minimizing a β-smooth & L-Lipschitz convex function (λ = 0.01, γ = 0)

Details:
$\eta_{k,i} = \min\left\{\frac{2}{\beta}, \frac{D}{L\sqrt{Kn}}\right\}$ for SGDo and $\eta_t = \frac{1}{\beta + c\sqrt{nK}}$ for vanilla SGD
Simple averaging of the iterates

Comments:
SGDo eventually matches the performance of vanilla SGD
The variance of SGDo appears to be lower than that of vanilla SGD

liSGD vs. Vanilla SGD (Case 2.1)

Scenario: Minimizing an L-Lipschitz convex function (λ = 10, γ = 0)

Details:
Stage-wise, choose $\eta_t = \frac{C \cdot 2^{-i}}{\sqrt{T}}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, for liSGD (see the sketch below)
$\eta = \frac{D}{\sqrt{\sigma^2 + L^2}\,\sqrt{T}}$ for vanilla SGD

Comments:
Outputting the last iterate of vanilla SGD does not converge
Outputting the last iterate of liSGD eventually converges to the SGD average iterate
The kinks in the liSGD curve correspond to the learning-rate changes
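A minimal sketch of how such a run could be structured: i.i.d. with-replacement sampling, the stage-wise schedule from the earlier `lisgd_stepsize` sketch, and `grad_i`/`project` as in the setup sketch. This is our reconstruction for illustration, not the authors' code; `C`, `D`, `sigma2`, and `L` below are placeholders for constants estimated in our code.

```python
import numpy as np

def sgd_with_replacement(grad_i, w0, n, T, stepsize, project=None, rng=None):
    """SGD with i.i.d. with-replacement sampling; returns the last iterate
    and the running uniform average of the iterates for comparison."""
    rng = rng or np.random.default_rng(0)
    w = w0.copy()
    avg = w0.copy()
    for t in range(1, T + 1):
        idx = rng.integers(n)              # draw one example with replacement
        w = w - stepsize(t) * grad_i(w, idx)
        if project is not None:
            w = project(w)
        avg += (w - avg) / t               # running average of the iterates
    return w, avg

# Case 2.1 schedules (placeholders for C, D, sigma2, L):
# eta_lisgd   = lambda t: lisgd_stepsize(t, T, C=1.0)
# eta_vanilla = lambda t: D / (np.sqrt(sigma2 + L ** 2) * np.sqrt(T))
```
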
liSGD vs. Vanilla SGD (Case 2.2)

Scenario: Minimizing an L-Lipschitz & α-strongly convex function (λ = 1, γ = 10)

Details:
Stage-wise, choose $\eta_t = \frac{2^{-i}}{\alpha t}$ when $T_i < t < T_{i+1}$, $0 \le i \le k$, for liSGD (see the short snippet below)
$\eta_t = \frac{1}{\alpha(t+1)}$ for vanilla SGD

Comments:
liSGD does better than the average iterate of vanilla SGD
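For completeness, a short usage snippet for the two Case 2.2 schedules, assuming `alpha` from the setup sketch and the hypothetical `lisgd_stepsize` helper defined earlier:

```python
# Case 2.2 stepsizes: stage-wise decay for liSGD vs. the standard 1/(alpha*(t+1)) decay
eta_lisgd   = lambda t: lisgd_stepsize(t, T, alpha=alpha)   # 2**(-i) / (alpha * t)
eta_vanilla = lambda t: 1.0 / (alpha * (t + 1))
```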

Contents

1 The Optimization Technique

2 Experiments & Critical Review

3 Conclusions

Conclusion

SGDo:
Using a novel step-size sequence, it provides theoretical guarantees on convergence when the iterates are sampled without replacement (smoothness required).
SGDo loses the last-iterate guarantee that vanilla SGD enjoys for strongly convex and smooth functions, and hence may not be heavily used in practice.

liSGD:
Using a novel step-size sequence, it provides theoretical guarantees on the suboptimality rate when outputting the last iterate.
The most interesting results are in the non-(strongly convex and smooth) scenarios.
Our experimental results suggest that it could be practical.

An interesting future direction:
IID with-replacement sampling + last iterate.
This could give a complete theoretical result for how SGD is used in practice.

THANK YOU!
