Should we do learning rate decay for the Adam optimizer?
I'm training a network for image localization with the Adam optimizer, and someone suggested that I use exponential decay. I don't want to try that, because the Adam optimizer itself decays the learning rate. But he insists, and he says he has done this before. So should I do that, and is there any theory behind your suggestion?
      neural-network tensorflow






      asked Sep 15 '16 at 17:54









meng lin

4 Answers






It depends. Adam updates each parameter with an individual learning rate, so every parameter in the network has its own learning rate associated with it.

But each per-parameter learning rate is computed using lambda (the initial learning rate) as an upper limit. This means that every individual learning rate can vary from 0 (no update) to lambda (maximum update).

It's true that the learning rates adapt themselves during training, but if you want to be sure that no update step exceeds lambda, you can lower lambda itself using exponential decay or whatever schedule you like. This can help reduce the loss in the last stage of training, once the loss has stopped decreasing with the previous (larger) lambda.
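
As a rough sketch of that bound, here is a single textbook Adam update in NumPy with made-up gradients and the paper's default hyperparameters (the names and values are purely illustrative, not taken from any framework). Each per-parameter step comes out close to the base rate lr, so shrinking lr with a decay schedule shrinks the largest possible step:

import numpy as np

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8   # Adam defaults
grad = np.array([0.5, -2.0, 0.01])               # toy gradients for 3 parameters
m = np.zeros_like(grad)                          # first-moment estimate
v = np.zeros_like(grad)                          # second-moment estimate
t = 1                                            # first update step

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
m_hat = m / (1 - beta1 ** t)                     # bias-corrected moments
v_hat = v / (1 - beta2 ** t)

step = lr * m_hat / (np.sqrt(v_hat) + eps)
print(np.abs(step))                              # each entry is roughly lr = 1e-3

# Lowering lr (e.g. lr * 0.95 ** (t // 10000), as exponential decay with
# staircase=True would do) lowers this cap on every parameter's update.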






answered Sep 16 '16 at 7:50 by nessuno
In my experience it does not make sense (and does not work well) to do learning rate decay with the Adam optimizer.

The theory is that Adam already handles adapting the learning rate (see the reference):

"We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation."

As with any deep learning problem, YMMV: one size does not fit all, so try different approaches and see what works for you.






answered Sep 15 '16 at 19:24 by j314erre
Yes, absolutely. From my own experience, it's very useful to use Adam with learning rate decay. Without decay, you have to set a very small learning rate so that the loss doesn't start to diverge after it has decreased to a point. Here I post the code to use Adam with learning rate decay in TensorFlow. Hope it is helpful to someone.



# decay the base rate by a factor of 0.95 every 10,000 global steps
decayed_lr = tf.train.exponential_decay(learning_rate, global_step,
                                        10000, 0.95, staircase=True)
opt = tf.train.AdamOptimizer(decayed_lr, epsilon=adam_epsilon)
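
If you are on TensorFlow 2.x, where tf.train.exponential_decay and tf.train.AdamOptimizer are no longer available, a roughly equivalent setup (a sketch using the tf.keras API; the 1e-3 base rate and 1e-8 epsilon are just example values) would be:

import tensorflow as tf

# decay the base rate by a factor of 0.95 every 10,000 steps, as above
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10000,
    decay_rate=0.95,
    staircase=True)
opt = tf.keras.optimizers.Adam(learning_rate=lr_schedule, epsilon=1e-8)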





answered Nov 14 '18 at 11:33 by Wenmin-Wu
Adam has a single base learning rate, but it is a maximum rate that is adapted per parameter, so I don't think many people use learning rate scheduling with it.

Due to the adaptive nature, the default rate is fairly robust, but there may be times when you want to optimize it. What you can do is find a good base rate beforehand: start with a very small rate and increase it until the loss stops decreasing, then look at the slope of the loss curve and pick the learning rate associated with the fastest decrease in loss (not the point where the loss is actually lowest). Jeremy Howard mentions this in the fast.ai deep learning course, and it comes from the Cyclical Learning Rates paper.
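
A minimal sketch of that range test on a toy problem (plain gradient steps on a one-parameter quadratic, just to show the mechanics; a real run would use your network and a smoothed loss curve):

import numpy as np

w = 0.0                      # single toy parameter; loss is (w - 3)^2
lr = 1e-5                    # start with a very small rate
history = []                 # (lr, loss) pairs

for step in range(60):
    loss = (w - 3.0) ** 2
    grad = 2.0 * (w - 3.0)
    history.append((lr, loss))
    w -= lr * grad           # one gradient step at the current rate
    lr *= 1.3                # increase the rate exponentially each step

# pick the rate where the loss fell fastest, not where it was lowest
drops = [(history[i][0], history[i][1] - history[i + 1][1])
         for i in range(len(history) - 1)]
best_lr, _ = max(drops, key=lambda p: p[1])
print("suggested base learning rate:", best_lr)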






answered Jun 13 '18 at 15:06 by Austin