Promoting Exploration in Memory-Augmented Adam using Critical Momenta
July 18, 2023
Authors: Pranshu Malviya, Gonçalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar
cs.AI
Abstract
Adaptive gradient-based optimizers, particularly Adam, have left their mark
in training large-scale deep learning models. The strength of such optimizers
is that they exhibit fast convergence while being more robust to hyperparameter
choice. However, they often generalize worse than non-adaptive methods. Recent
studies have tied this performance gap to flat minima selection: adaptive
methods tend to find solutions in sharper basins of the loss landscape, which
in turn hurts generalization. To overcome this issue, we propose a new
memory-augmented version of Adam that promotes exploration towards flatter
minima by using a buffer of critical momentum terms during training.
Intuitively, the use of the buffer makes the optimizer overshoot outside the
basin of attraction if it is not wide enough. We empirically show that our
method improves the performance of several variants of Adam on standard
supervised language modelling and image classification tasks.
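
The abstract only describes the mechanism at a high level. As a rough illustration of how a buffer of past momentum terms could be folded into an Adam-style update, the sketch below keeps a small FIFO buffer of recent momenta and averages it into the step. The buffer policy, the aggregation rule, and all names (adam_cm_step, buffer_size) are assumptions made here for illustration, not the authors' actual algorithm for selecting "critical" momenta.

import numpy as np

def adam_cm_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, buffer_size=5):
    """One illustrative step of Adam augmented with a momentum buffer.

    A minimal sketch only: the paper's criterion for which momenta are
    "critical" and how they are aggregated is not specified here, so this
    uses a plain FIFO buffer and a mean as stand-ins.
    """
    beta1, beta2 = betas
    state.setdefault("step", 0)
    state.setdefault("m", np.zeros_like(param))
    state.setdefault("v", np.zeros_like(param))
    state.setdefault("buffer", [])  # buffer of past momentum terms

    state["step"] += 1
    t = state["step"]

    # Standard Adam first- and second-moment estimates.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2

    # Keep a bounded buffer of recent momentum terms.
    state["buffer"].append(state["m"].copy())
    if len(state["buffer"]) > buffer_size:
        state["buffer"].pop(0)

    # Aggregate the buffered momenta with the current one; the extra
    # accumulated momentum is what can make the step overshoot a basin
    # that is too narrow, encouraging exploration of flatter regions.
    m_agg = np.mean(state["buffer"], axis=0)

    # Bias correction and parameter update, as in standard Adam.
    m_hat = m_agg / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

Usage would mirror any hand-rolled optimizer loop: initialize state = {} once, then call param = adam_cm_step(param, grad, state) after each gradient evaluation.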