Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical l_2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM reaches the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
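
The three components listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `StableSpamSketch`, the decay constants, the reset interval, the use of bias-corrected exponential moving averages for the "historical" statistics, and the choice to rescale the gradient to its running norm are all hypothetical choices made here for clarity. The released code at the URL above should be treated as the reference.

```python
import math


class StableSpamSketch:
    """Illustrative sketch only; names, constants, and formulas are assumptions."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 max_decay=0.99, norm_decay=0.9, reset_interval=500):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.max_decay = max_decay            # assumed decay for the running max |g|
        self.norm_decay = norm_decay          # assumed decay for the running l_2 norm
        self.reset_interval = reset_interval  # assumed steps between moment resets
        self.max_ema = 0.0
        self.norm_ema = 0.0
        self.step_count = 0   # gradients seen so far (for the running statistics)
        self.adam_step = 0    # steps since the last moment reset
        self.m = None
        self.v = None

    def filter_gradient(self, grad):
        """Apply components (1) and (2) to a flat list of gradient entries."""
        self.step_count += 1
        t = self.step_count
        # (1) Clip entries to a threshold that tracks the historical maximum.
        current_max = max(abs(g) for g in grad)
        self.max_ema = (self.max_decay * self.max_ema
                        + (1 - self.max_decay) * current_max)
        threshold = self.max_ema / (1 - self.max_decay ** t)  # bias-corrected
        clipped = [max(-threshold, min(threshold, g)) for g in grad]
        # (2) Rescale the whole gradient toward its historical l_2 norm.
        norm = math.sqrt(sum(g * g for g in clipped))
        self.norm_ema = (self.norm_decay * self.norm_ema
                         + (1 - self.norm_decay) * norm)
        target = self.norm_ema / (1 - self.norm_decay ** t)  # bias-corrected
        scale = target / (norm + self.eps)
        return [g * scale for g in clipped]

    def step(self, params, grad):
        """One Adam-style update on a flat parameter list, updated in place."""
        # (3) Periodically reset Adam's first and second moments.
        if self.m is None or self.step_count % self.reset_interval == 0:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
            self.adam_step = 0
        self.adam_step += 1
        g = self.filter_gradient(grad)
        for i, gi in enumerate(g):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gi
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gi * gi
            m_hat = self.m[i] / (1 - self.beta1 ** self.adam_step)
            v_hat = self.v[i] / (1 - self.beta2 ** self.adam_step)
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        return params
```

In this sketch, a gradient entry that suddenly jumps far above its recent history is clipped and then the whole gradient is rescaled before it reaches Adam's moment estimates, and the periodic reset discards whatever spike residue has still accumulated in those estimates. Plain Python lists are used only to keep the example dependency-free; a practical version would operate on per-layer tensors.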
