Promoting Exploration in Memory-Augmented Adam using Critical Momenta
July 18, 2023
Authors: Pranshu Malviya, Gonçalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar
cs.AI
Abstract
Adaptive gradient-based optimizers, particularly Adam, have left their mark
in training large-scale deep learning models. The strength of such optimizers
is that they exhibit fast convergence while being more robust to hyperparameter
choice. However, they often generalize worse than non-adaptive methods. Recent
studies have tied this performance gap to flat minima selection: adaptive
methods tend to find solutions in sharper basins of the loss landscape, which
in turn hurts generalization. To overcome this issue, we propose a new
memory-augmented version of Adam that promotes exploration towards flatter
minima by using a buffer of critical momentum terms during training.
Intuitively, the use of the buffer makes the optimizer overshoot outside the
basin of attraction if it is not wide enough. We empirically show that our
method improves the performance of several variants of Adam on standard
supervised language modelling and image classification tasks.
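The abstract does not spell out the update rule, so the following is only a minimal NumPy sketch of the idea it describes: an Adam loop that keeps a small buffer of past momentum vectors (here selected by largest norm, an assumed criterion) and averages them into the step, so that stale momenta can carry the iterate past the bottom of a narrow basin. The class name `AdamWithCriticalMomenta`, the buffer size, and the aggregation rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of a memory-augmented Adam with a buffer of "critical" momenta.
# Selection criterion (largest momentum norm) and plain averaging are assumptions
# made for illustration; the paper's actual rules may differ.
import numpy as np

class AdamWithCriticalMomenta:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, buffer_size=5):
        self.params = params              # flat NumPy array of parameters
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.m = np.zeros_like(params)    # first-moment estimate
        self.v = np.zeros_like(params)    # second-moment estimate
        self.buffer = []                  # list of (norm, momentum) pairs
        self.buffer_size = buffer_size
        self.t = 0

    def step(self, grad):
        self.t += 1
        # Standard Adam moment updates with bias correction.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)

        # Keep the buffer_size momenta with the largest norm seen so far
        # (assumed "criticality" criterion).
        self.buffer.append((np.linalg.norm(m_hat), m_hat.copy()))
        self.buffer.sort(key=lambda x: x[0], reverse=True)
        self.buffer = self.buffer[:self.buffer_size]

        # Aggregate the current momentum with the buffered critical momenta;
        # in a narrow basin, the buffered momenta make the update overshoot.
        m_agg = np.mean([m for _, m in self.buffer] + [m_hat], axis=0)

        self.params -= self.lr * m_agg / (np.sqrt(v_hat) + self.eps)
        return self.params

# Toy usage: a few steps on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
opt = AdamWithCriticalMomenta(w, lr=0.1)
for _ in range(10):
    w = opt.step(grad=w.copy())
```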