Should we do learning-rate decay for the Adam optimizer?

























I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try it, because the Adam optimizer itself decays the learning rate. But he insists, and he says he has done it before. So should I do it, and is there any theory behind your suggestion?









































      neural-network tensorflow
















      asked Sep 15 '16 at 17:54









meng lin























4 Answers
































It depends. Adam updates every parameter with an individual learning rate, so each parameter in the network has its own learning rate associated with it.

But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update).

It's true that the learning rates adapt themselves during training steps, but if you want to be sure that every update step does not exceed lambda, then you can lower lambda using exponential decay or whatever. This can help to reduce the loss during the final stage of training, when the loss computed with the previous lambda has stopped decreasing.






answered Sep 16 '16 at 7:50 by nessuno
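To make the "upper limit" point concrete, here is a minimal sketch of one Adam update for a single parameter (bias correction omitted for brevity); lam plays the role of lambda above, and all names are mine, not from the answer:

    import numpy as np

    def adam_step(param, grad, m, v, lam, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and of its square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # In typical (non-sparse) settings |m| / sqrt(v) is at most about 1,
        # so the magnitude of the step is roughly capped by lam.
        step = lam * m / (np.sqrt(v) + eps)
        return param - step, m, v

Lowering lam over training (e.g. with exponential decay) shrinks that cap, which is what the decay suggestion buys you.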












































In my experience it does not make sense (and does not work well) to use learning-rate decay with the Adam optimizer.

The theory is that Adam already handles learning-rate optimization (see the reference):

"We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation."

As with any deep-learning problem, YMMV: one size does not fit all, so you should try different approaches and see what works for you.






answered Sep 15 '16 at 19:24 by j314erre
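For contrast with the decay-based answers, the no-schedule setup recommended here is just Adam with a constant base rate; a minimal sketch in the TF 1.x API used elsewhere on this page (the 1e-3 value and the loss name are only illustrative):

    import tensorflow as tf  # TF 1.x API

    # Constant base rate; Adam's per-parameter adaptation does the rest.
    opt = tf.train.AdamOptimizer(learning_rate=1e-3)
    # train_op = opt.minimize(loss)  # 'loss' is whatever your model defines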












































Yes, absolutely. From my own experience, it's very useful to use Adam with learning-rate decay. Without decay, you have to set a very small learning rate so that the loss does not start to diverge after it has decreased to a point. Here is code that uses Adam with learning-rate decay in TensorFlow; I hope it is helpful to someone.



    import tensorflow as tf  # TF 1.x API

    learning_rate, adam_epsilon = 1e-3, 1e-8  # example values for the original placeholders
    global_step = tf.train.get_or_create_global_step()
    # Multiply the base rate by 0.95 every 10,000 training steps.
    decayed_lr = tf.train.exponential_decay(learning_rate, global_step,
                                            10000, 0.95, staircase=True)
    opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)





answered Nov 14 '18 at 11:33 by Wenmin-Wu
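One detail the snippet above leaves implicit: in the TF 1.x API the decayed rate only changes if the same global_step is handed to the optimizer so that it gets incremented. Assuming a loss tensor named loss (not defined in the original snippet), that would look like:

    train_op = opt.minimize(loss, global_step=global_step)  # increments global_step every update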












































Adam has a single learning rate, but it is a maximum rate that is adapted per parameter, so I don't think many people use learning-rate scheduling with it.

Due to its adaptive nature the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find a good base rate beforehand: start with a very small rate and increase it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it comes from the Cyclical Learning Rates paper.






answered Jun 13 '18 at 15:06 by Austin
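A minimal sketch of that range test, assuming a hypothetical train_step(lr) callback that runs one mini-batch update at the given rate and returns the loss (the function name, defaults, and bounds are illustrative, not from the answer or the paper):

    import numpy as np

    def lr_range_test(train_step, lr_min=1e-7, lr_max=1.0, num_steps=100):
        # Sweep learning rates on a log scale, one mini-batch each.
        lrs = np.exp(np.linspace(np.log(lr_min), np.log(lr_max), num_steps))
        losses = np.array([train_step(lr) for lr in lrs])
        # Pick the rate where the loss is falling fastest, not where it is lowest.
        slopes = np.gradient(losses, np.log(lrs))
        return lrs[int(np.argmin(slopes))]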





















