Quartet: Native FP4 Training Can Be Optimal for Large Language Models
May 20, 2025
Authors: Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture supports extremely low-precision operations, notably FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training in which all the major computations (e.g., in linear layers) are performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy vs. computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
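The abstract states that Quartet keeps all major computations, such as the matrix multiplications in linear layers, in FP4. As a rough illustration of what quantizing a linear layer's operands to the hardware FP4 format (E2M1) involves, here is a minimal PyTorch sketch. The E2M1 grid values are standard; the group size of 16, the symmetric absmax scaling, and the fake-quantized matmul are illustrative assumptions, not the paper's actual recipe (see the repository above for the real Blackwell kernels).

```python
import torch

# The representable magnitudes of FP4 (E2M1): signed {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Fake-quantize x to the FP4 grid with one absmax scale per group.

    Assumes x.numel() is divisible by group_size (illustration only).
    """
    groups = x.reshape(-1, group_size)
    # Scale each group so its largest magnitude lands on the grid maximum (6.0).
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    scaled = groups / scales
    # Round-to-nearest onto the grid; the sign is handled separately.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    dequant = FP4_GRID[idx] * scaled.sign() * scales
    return dequant.reshape_as(x)

def fp4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Linear layer with both operands fake-quantized to FP4.

    The matmul itself runs in high precision here; on Blackwell, the quantized
    operands would instead feed a native FP4 tensor-core matmul.
    """
    return quantize_fp4(x) @ quantize_fp4(w).T

# Example: activations (8, 64) times weights (32, 64) -> output (8, 32).
x = torch.randn(8, 64)
w = torch.randn(32, 64)
y = fp4_linear(x, w)
```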
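The "low-precision scaling law" mentioned above relates model quality to precision alongside model and data scale. The abstract does not give the law's functional form, so the equation below is only an illustrative assumption: a Chinchilla-style fit with precision-dependent efficiency factors, not necessarily the paper's exact parameterization.

```latex
% Illustrative form (assumed): loss as a function of parameter count N and
% training tokens D, where the efficiency factors e_param(P), e_data(P) in
% (0, 1] shrink as the training precision P decreases.
\mathcal{L}(N, D; P) = A \bigl(e_{\mathrm{param}}(P)\, N\bigr)^{-\alpha}
  + B \bigl(e_{\mathrm{data}}(P)\, D\bigr)^{-\beta} + E
```

Under a form like this, a precision is "near-optimal" when the throughput gained from fewer bits outweighs the loss in effective parameters and data, which is the accuracy-vs-computation trade-off the abstract describes.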