Quartet: Native FP4 Training Can Be Optimal for Large Language Models
May 20, 2025
Authors: Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture supports extremely low-precision operations, notably FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training in which all the major computations (e.g., in linear layers) are performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy vs. computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
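The abstract states that Quartet keeps all major computations, such as the matrix multiplications in linear layers, in FP4. As a rough illustration of what quantizing a linear layer's operands to the hardware FP4 format (E2M1) involves, here is a minimal PyTorch sketch. The E2M1 grid values are standard; the group size of 16, the symmetric absmax scaling, and the fake-quantized matmul are illustrative assumptions, not the paper's actual recipe (see the repository above for the real Blackwell kernels).

```python
import torch

# The representable magnitudes of FP4 (E2M1): signed {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Fake-quantize x to the FP4 grid with one absmax scale per group.

    Assumes x.numel() is divisible by group_size (illustration only).
    """
    groups = x.reshape(-1, group_size)
    # Scale each group so its largest magnitude lands on the grid maximum (6.0).
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    scaled = groups / scales
    # Round-to-nearest onto the grid; the sign is handled separately.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    dequant = FP4_GRID[idx] * scaled.sign() * scales
    return dequant.reshape_as(x)

def fp4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Linear layer with both operands fake-quantized to FP4.

    The matmul itself runs in high precision here; on Blackwell, the quantized
    operands would instead feed a native FP4 tensor-core matmul.
    """
    return quantize_fp4(x) @ quantize_fp4(w).T

# Example: activations (8, 64) times weights (32, 64) -> output (8, 32).
x = torch.randn(8, 64)
w = torch.randn(32, 64)
y = fp4_linear(x, w)
```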
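The "low-precision scaling law" mentioned above relates model quality to precision alongside model and data scale. The abstract does not give the law's functional form, so the equation below is only an illustrative assumption: a Chinchilla-style fit with precision-dependent efficiency factors, not necessarily the paper's exact parameterization.

```latex
% Illustrative form (assumed): loss as a function of parameter count N and
% training tokens D, where the efficiency factors e_param(P), e_data(P) in
% (0, 1] shrink as the training precision P decreases.
\mathcal{L}(N, D; P) = A \bigl(e_{\mathrm{param}}(P)\, N\bigr)^{-\alpha}
  + B \bigl(e_{\mathrm{data}}(P)\, D\bigr)^{-\beta} + E
```

Under a form like this, a precision is "near-optimal" when the throughput gained from fewer bits outweighs the loss in effective parameters and data, which is the accuracy-vs-computation trade-off the abstract describes.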