Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training and finds that low-bit precision amplifies sensitivity to the learning rate and often destabilizes gradient norms, leading to divergence at higher learning rates. Among these optimizers, SPAM, which features momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but it struggles to stabilize gradient norms and therefore requires careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical l_2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity points. Furthermore, when both models are trained in 4-bit, Stable-SPAM reaches the same loss as Adam in only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
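The three components listed in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not give the update rules, so everything here is an assumption made for illustration: the class name `StableSpamSketch`, the use of bias-corrected exponential moving averages for the historical maximum and the norm statistics, the simple clamp applied to spiked entries, and all hyperparameter defaults (`clip_decay`, `norm_beta1`, `norm_beta2`, `reset_interval`). It should not be read as the authors' implementation; the linked repository is the reference.

```python
import math


class StableSpamSketch:
    """Adam-style update for one flattened parameter tensor (illustrative only)."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 clip_decay=0.999, norm_beta1=0.7, norm_beta2=0.9,
                 reset_interval=1000):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.clip_decay = clip_decay          # assumed EMA decay for max |g|
        self.norm_beta1 = norm_beta1          # assumed EMA decay for the norm
        self.norm_beta2 = norm_beta2          # assumed EMA decay for norm^2
        self.reset_interval = reset_interval  # assumed reset period (steps)
        self.max_ema = self.norm_m = self.norm_v = 0.0
        self.t = 0        # total steps, drives the gradient statistics
        self.adam_t = 0   # steps since the last momentum reset
        self.m = self.v = None

    def condition(self, grad):
        """Apply (1) adaptive spike clipping, then (2) norm normalization."""
        self.t += 1
        # (1) Track an EMA of the largest |g| and clamp entries above it.
        d = self.clip_decay
        self.max_ema = d * self.max_ema + (1 - d) * max(abs(g) for g in grad)
        threshold = self.max_ema / (1 - d ** self.t)  # bias-corrected
        grad = [math.copysign(min(abs(g), threshold), g) for g in grad]
        # (2) Rescale the whole tensor using EMAs of its l_2 norm.
        norm = math.sqrt(sum(g * g for g in grad))
        b1, b2 = self.norm_beta1, self.norm_beta2
        self.norm_m = b1 * self.norm_m + (1 - b1) * norm
        self.norm_v = b2 * self.norm_v + (1 - b2) * norm * norm
        m_hat = self.norm_m / (1 - b1 ** self.t)
        v_hat = self.norm_v / (1 - b2 ** self.t)
        scale = m_hat / ((math.sqrt(v_hat) + self.eps) * (norm + self.eps))
        return [g * scale for g in grad]

    def step(self, params, grad):
        """Return updated parameters; params and grad are equal-length lists."""
        # (3) Periodically zero Adam's first and second moments.
        if self.m is None or self.adam_t >= self.reset_interval:
            self.m = [0.0] * len(grad)
            self.v = [0.0] * len(grad)
            self.adam_t = 0
        self.adam_t += 1
        grad = self.condition(grad)
        c1 = 1 - self.beta1 ** self.adam_t
        c2 = 1 - self.beta2 ** self.adam_t
        new_params = []
        for i, (p, g) in enumerate(zip(params, grad)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            update = (self.m[i] / c1) / (math.sqrt(self.v[i] / c2) + self.eps)
            new_params.append(p - self.lr * update)
        return new_params


# Usage: ten ordinary steps, then inspect how a spiked gradient is conditioned.
opt = StableSpamSketch(reset_interval=5)
params = [0.0, 0.0, 0.0]
for _ in range(10):
    params = opt.step(params, [0.1, -0.1, 0.1])
spiked = opt.condition([100.0, -0.1, 0.1])
```

In this sketch the clipping and normalization statistics persist across momentum resets (they use the global counter `t`), while Adam's bias correction restarts with each reset (it uses `adam_t`). That separation is a design choice of the sketch, not something the abstract specifies.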
