Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but it struggles to stabilize gradient norms and requires careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical ℓ2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM reaches the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
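The three ingredients listed in the abstract can be illustrated with a small pure-Python sketch. This is not the authors' implementation (see the linked repository for that); it is a simplified reading of the abstract under stated assumptions. The class name `StableSpamSketch`, the hyperparameters `beta_max`, `beta_norm`, and `reset_interval`, the use of bias-corrected exponential moving averages for the "historical" statistics, and the choice to rescale toward the averaged norm are all hypothetical choices made for illustration.

```python
import math


class StableSpamSketch:
    """Illustrative only: a simplified reading of the abstract, not the paper's code."""

    def __init__(self, beta_max=0.9, beta_norm=0.9, reset_interval=1000, eps=1e-8):
        self.beta_max = beta_max          # hypothetical decay for the tracked maxima
        self.beta_norm = beta_norm        # hypothetical decay for the tracked norms
        self.reset_interval = reset_interval
        self.eps = eps
        self.step = 0
        self.max_ema = 0.0                # running average of max |g|
        self.norm_ema = 0.0               # running average of the l2 norm
        self.m = None                     # Adam first moment
        self.v = None                     # Adam second moment
        self.since_reset = 0

    def preprocess(self, grad):
        """Clip spikes, then rescale the whole gradient (a flat list of floats)."""
        self.step += 1
        # (1) Adaptive clipping: the threshold follows the history of max |g|.
        g_max = max(abs(g) for g in grad)
        self.max_ema = self.beta_max * self.max_ema + (1 - self.beta_max) * g_max
        threshold = self.max_ema / (1 - self.beta_max ** self.step)
        clipped = [max(-threshold, min(threshold, g)) for g in grad]
        # (2) Norm normalization: rescale toward the historical l2 norm.
        norm = math.sqrt(sum(g * g for g in clipped))
        self.norm_ema = self.beta_norm * self.norm_ema + (1 - self.beta_norm) * norm
        target = self.norm_ema / (1 - self.beta_norm ** self.step)
        scale = target / (norm + self.eps)
        return [g * scale for g in clipped]

    def update(self, params, grad, lr=1e-3, beta1=0.9, beta2=0.999):
        """One Adam step on the preprocessed gradient, with periodic moment reset."""
        g = self.preprocess(grad)
        # (3) Momentum reset: zero both Adam moments every `reset_interval` steps.
        if self.m is None or self.since_reset >= self.reset_interval:
            self.m = [0.0] * len(g)
            self.v = [0.0] * len(g)
            self.since_reset = 0
        self.since_reset += 1
        t = self.since_reset              # bias correction restarts after a reset
        new_params = []
        for i, (p, gi) in enumerate(zip(params, g)):
            self.m[i] = beta1 * self.m[i] + (1 - beta1) * gi
            self.v[i] = beta2 * self.v[i] + (1 - beta2) * gi * gi
            m_hat = self.m[i] / (1 - beta1 ** t)
            v_hat = self.v[i] / (1 - beta2 ** t)
            new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params
```

In this sketch, the first gradient passes through essentially unchanged because the bias-corrected averages equal the current values. A later spike is first clipped to the tracked threshold and then shrunk further by the norm rescaling, so a single outlier step has a bounded effect on the Adam moments.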
