Optimizing Large Language Model Training Using FP4 Quantization
January 28, 2025
Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
cs.AI
Abstract
The growing computational demands of training large language models (LLMs)
necessitate more efficient methods. Quantized training presents a promising
solution by enabling low-bit arithmetic operations to reduce these costs. While
FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge
due to significant quantization errors and limited representational capacity.
This work introduces the first FP4 training framework for LLMs, addressing
these challenges with two key innovations: a differentiable quantization
estimator for precise weight updates and an outlier clamping and compensation
strategy to prevent activation collapse. To ensure stability, the framework
integrates a mixed-precision training scheme and vector-wise quantization.
Experimental results demonstrate that our FP4 framework achieves accuracy
comparable to BF16 and FP8, with minimal degradation, scaling effectively to
13B-parameter LLMs trained on up to 100B tokens. With the emergence of
next-generation hardware supporting FP4, our framework sets a foundation for
efficient ultra-low precision training.
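
The abstract mentions vector-wise quantization together with a differentiable quantization estimator for weight updates, but gives no implementation details. The sketch below is only a rough illustration in PyTorch of what per-row FP4 (E2M1) quantization can look like, with a plain straight-through gradient standing in for the paper's differentiable estimator; the grid, the scaling rule, and the names FP4_E2M1_GRID, fp4_quantize, and FP4QuantSTE are assumptions, not the authors' code.

# Illustrative sketch only (assumed details, not the paper's implementation).
import torch

# Non-negative magnitudes representable in the E2M1 FP4 format; signs are handled separately.
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x: torch.Tensor) -> torch.Tensor:
    """Quantize each row of x to FP4 values using a per-row (vector-wise) scale."""
    grid = FP4_E2M1_GRID.to(device=x.device, dtype=x.dtype)
    # Vector-wise scaling: map each row's max magnitude onto the largest FP4 value (6.0).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / grid[-1]
    x_scaled = x / scale
    # Round every element to the nearest representable FP4 magnitude, preserving its sign.
    idx = (x_scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return torch.sign(x_scaled) * grid[idx] * scale

class FP4QuantSTE(torch.autograd.Function):
    """Straight-through stand-in for the paper's differentiable quantization
    estimator: quantize in the forward pass, pass gradients through unchanged."""
    @staticmethod
    def forward(ctx, x):
        return fp4_quantize(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Usage sketch: w_q = FP4QuantSTE.apply(weight)  # FP4-quantized weights, pass-through gradients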
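For the outlier clamping and compensation strategy on activations, one possible reading, again only a sketch under assumed details (the 0.999 quantile, the per-tensor threshold, and the function name are not from the paper): clamp activations to a high-quantile magnitude threshold before low-bit quantization and keep the clipped residual so the error can be compensated in higher precision.

# Illustrative sketch only (assumed details, not the paper's implementation).
import torch

def clamp_and_compensate(x: torch.Tensor, q: float = 0.999):
    """Clamp outliers in x and return (clamped tensor, residual for compensation)."""
    # Per-tensor magnitude threshold taken from a high quantile of |x| (assumption).
    threshold = torch.quantile(x.abs().float().flatten(), q).item()
    x_clamped = x.clamp(min=-threshold, max=threshold)
    # The residual carries only the clipped-off outlier mass; applying it separately
    # in higher precision compensates the clamping error.
    residual = x - x_clamped
    return x_clamped, residual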