Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
January 30, 2026
Authors: Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh
cs.AI
Abstract
The NVFP4 low-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation via stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, which has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, on both the forward and backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training of models with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II.
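To make the unbiasedness property the abstract refers to concrete, below is a minimal sketch of stochastic rounding onto an NVFP4-style micro-scaled grid: FP4 E2M1 element values sharing one scale per 16-element block. This is an illustrative toy under stated assumptions, not the paper's MS-EDEN routine; the block scale is kept as a plain float rather than FP8 E4M3, and the helper `sr_quantize_block` is a hypothetical name introduced here for demonstration.

```python
# Sketch: stochastic rounding (SR) onto an NVFP4-style micro-block grid.
# NOT the paper's MS-EDEN routine; block scale is a plain float for simplicity.
import numpy as np

# Non-negative representable magnitudes of FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def sr_quantize_block(x, rng):
    """Stochastically round one 16-element block onto the scaled E2M1 grid."""
    scale = np.max(np.abs(x)) / E2M1_GRID[-1]  # map the block max to the grid max
    if scale == 0.0:
        return np.zeros_like(x)
    s = x / scale
    sign, mag = np.sign(s), np.abs(s)
    # For each value, locate the two neighboring grid points lo <= mag <= hi.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag), 1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    # Round up with probability proportional to the distance from lo,
    # which makes the estimator unbiased: E[q] = mag.
    p_up = (mag - lo) / (hi - lo)
    q = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return sign * q * scale

rng = np.random.default_rng(0)
block = rng.standard_normal(16)
# Averaging many independent SR draws recovers the original values.
est = np.mean([sr_quantize_block(block, rng) for _ in range(20000)], axis=0)
print("max bias:", np.max(np.abs(est - block)))  # small, shrinking with more draws
```

Unbiasedness comes from the rounding probability alone: a value between grid points lo and hi rounds up with probability (mag - lo)/(hi - lo), so its expectation equals mag exactly. The paper's contribution, per the abstract, is an alternative unbiased routine (MS-EDEN) whose per-draw quantization error is more than 2x lower than this plain SR baseline.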