QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

February 7, 2025
Authors: Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
cs.AI

Abstract

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
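To make the first ingredient concrete, the sketch below shows, in PyTorch, one way to combine a Hadamard rotation with MSE-optimal symmetric quantization. It is a minimal illustration under stated assumptions (per-tensor scale, power-of-two last dimension, grid-searched clipping threshold); the names `hadamard_transform`, `mse_optimal_quantize`, and `quest_style_fake_quant` are illustrative and do not come from the QuEST repository, which should be consulted for the reference implementation.

```python
# Minimal sketch: Hadamard normalization + MSE-optimal fake quantization.
# Assumptions: per-tensor scale, symmetric grid, power-of-two last dimension.
import torch


def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Normalized fast Walsh-Hadamard transform along the last dimension.

    The normalized Hadamard matrix is orthogonal and symmetric, so this
    function is its own inverse.
    """
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dimension must be a power of two"
    orig_shape = x.shape
    y = x.reshape(-1, n)
    h = 1
    while h < n:
        y = y.reshape(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    return (y / n ** 0.5).reshape(orig_shape)


def mse_optimal_quantize(x: torch.Tensor, bits: int, n_grid: int = 64) -> torch.Tensor:
    """Symmetric uniform fake-quantization; the clipping threshold is chosen
    by grid search to minimize MSE against the full-precision input."""
    if bits == 1:
        # Sign quantization; the MSE-optimal per-tensor scale is mean(|x|).
        return torch.sign(x) * x.abs().mean()
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit symmetric quantization
    max_abs = x.abs().max()
    if max_abs == 0:
        return x.clone()
    best_err, best_q = float("inf"), x
    for frac in torch.linspace(0.3, 1.0, n_grid):
        scale = frac * max_abs / levels
        q = torch.clamp(torch.round(x / scale), -levels, levels) * scale
        err = (q - x).pow(2).mean().item()
        if err < best_err:
            best_err, best_q = err, q
    return best_q


def quest_style_fake_quant(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Rotate with the Hadamard transform, quantize, and rotate back."""
    return hadamard_transform(mse_optimal_quantize(hadamard_transform(x), bits))
```

The second ingredient, the trust gradient estimator, can be pictured as a straight-through-style estimator whose backward pass only propagates gradients through coordinates where the quantized value stays close to the full-precision one. The thresholding rule below is an assumption made for illustration, not the estimator derived in the paper.

```python
class TrustFakeQuant(torch.autograd.Function):
    """Illustrative trust-masked straight-through estimator (simplified)."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, bits: int, trust: float):
        q = quest_style_fake_quant(x, bits)
        err = (q - x).abs()
        # Trust a coordinate if its quantization error is at most `trust`
        # times the RMS quantization error (assumed rule, for illustration).
        mask = (err <= trust * err.pow(2).mean().sqrt()).to(x.dtype)
        ctx.save_for_backward(mask)
        return q

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        (mask,) = ctx.saved_tensors
        # Gradients flow only through trusted coordinates; no gradients are
        # returned for the `bits` and `trust` arguments.
        return grad_output * mask, None, None


# Example usage inside a linear layer's forward pass:
#   w_q = TrustFakeQuant.apply(self.weight, 4, 1.0)
#   out = torch.nn.functional.linear(inputs, w_q)
```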

