

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

February 24, 2025
Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
cs.AI

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical l_2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
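For readers who want a concrete picture of the three mechanisms listed above, the following is a minimal PyTorch sketch based only on the abstract's description. The hyperparameter names (theta, gamma, reset_interval) and the exact update rules for the clipping threshold and the norm statistic are illustrative assumptions, not the authors' released implementation; see the linked repository for the real code.

import torch


class StableSPAMSketch(torch.optim.Optimizer):
    """Adam-style optimizer sketching (1) adaptive spike clipping,
    (2) normalization from historical l2-norm statistics, and
    (3) periodic momentum reset, as outlined in the abstract."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 theta=0.999, gamma=0.999, reset_interval=500):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))
        self.theta = theta                  # EMA factor for the spike threshold (assumed)
        self.gamma = gamma                  # EMA factor for the l2-norm statistic (assumed)
        self.reset_interval = reset_interval
        self.global_step = 0

    @torch.no_grad()
    def step(self):
        self.global_step += 1
        for group in self.param_groups:
            lr, (beta1, beta2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad.clone()
                state = self.state[p]
                if not state:
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                    state["max_ema"] = torch.zeros((), device=p.device)
                    state["norm_ema"] = torch.zeros((), device=p.device)

                # (1) Adaptive spike clipping: track an EMA of the largest
                # gradient magnitude and clamp entries that exceed it.
                state["max_ema"].mul_(self.theta).add_(g.abs().max(), alpha=1 - self.theta)
                if state["max_ema"] > 0:
                    bound = state["max_ema"].item()
                    g.clamp_(min=-bound, max=bound)

                # (2) Norm-based rescaling: track an EMA of the gradient's l2 norm
                # and shrink the whole matrix when the current norm exceeds it.
                g_norm = g.norm()
                state["norm_ema"].mul_(self.gamma).add_(g_norm, alpha=1 - self.gamma)
                if g_norm > 0 and g_norm > state["norm_ema"]:
                    g.mul_(state["norm_ema"] / (g_norm + eps))

                # (3) Momentum reset (inherited from SPAM): periodically zero
                # Adam's first and second moments to flush accumulated spikes.
                if self.global_step % self.reset_interval == 0:
                    state["m"].zero_()
                    state["v"].zero_()

                # Standard Adam update on the processed gradient.
                state["m"].mul_(beta1).add_(g, alpha=1 - beta1)
                state["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = state["m"] / (1 - beta1 ** self.global_step)
                v_hat = state["v"] / (1 - beta2 ** self.global_step)
                p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

Used like any other optimizer (e.g. opt = StableSPAMSketch(model.parameters(), lr=1e-3)), this differs from plain Adam only in the per-step gradient processing and the periodic moment reset.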

