Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
September 27, 2025
Authors: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
cs.AI
Abstract
Recently introduced hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing a gap between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier-mitigation techniques, and (2) MXFP4's power-of-two scale quantization severely degrades accuracy because of the high error it induces. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties through block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that implement the MR-GPTQ format with negligible overhead by fusing the rotations into the weights and computing the activation rotations quickly online. This yields speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and 6x layer-wise and 4x end-to-end on RTX 5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where its accuracy approaches that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
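
To make the format details and the rotation idea concrete, below is a minimal, purely illustrative NumPy sketch (not the paper's implementation). It quantizes weight groups onto the FP4 (E2M1) value grid using either power-of-two scales over groups of 32 (MXFP4-style) or finer-grained scales over groups of 16 (NVFP4-style), and optionally applies a block-wise Hadamard rotation before quantization, the core idea behind MR-GPTQ. All function names, the scale-rounding choice, and the toy error comparison are assumptions made for illustration.

```python
# Illustrative sketch of micro-scaled FP4 group quantization with an optional
# block-wise Hadamard rotation. Not the authors' implementation; group sizes,
# scale handling, and rounding are simplified.
import numpy as np

# Representable magnitudes of the FP4 (E2M1) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_group(x, power_of_two_scale):
    """Quantize one group to FP4 with a shared scale.

    power_of_two_scale=True mimics MXFP4's power-of-two (E8M0) scales;
    False mimics NVFP4's finer scales (kept as plain floats here; real
    NVFP4 stores group scales in a narrow FP8 format).
    """
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    scale = amax / FP4_GRID[-1]                    # map the largest value onto +/-6
    if power_of_two_scale:
        scale = 2.0 ** np.ceil(np.log2(scale))     # rounded up to a power of two (simplification)
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)  # nearest FP4 magnitude
    return np.sign(x) * FP4_GRID[idx] * scale

def hadamard(n):
    """Normalized Sylvester Hadamard matrix of size n (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_rows(W, group, power_of_two_scale, rotate=False):
    """Group-quantize each row; optionally rotate each group by a Hadamard block first.

    For simplicity the rotation block size equals the quantization group size,
    which need not hold in the actual method.
    """
    H = hadamard(group)
    out = np.empty_like(W)
    for i in range(W.shape[0]):
        for j in range(0, W.shape[1], group):
            g = W[i, j:j + group]
            if rotate:
                g = H @ g               # block-wise rotation before quantization
            q = quantize_fp4_group(g, power_of_two_scale)
            if rotate:
                q = H.T @ q             # rotate back only to measure error in the original basis
            out[i, j:j + group] = q
    return out

# Toy comparison on a weight matrix with a few injected outlier columns.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W[:, ::32] *= 8.0

for name, group, po2 in [("MXFP4-like", 32, True), ("NVFP4-like", 16, False)]:
    for rotate in (False, True):
        err = np.linalg.norm(W - quantize_rows(W, group, po2, rotate)) / np.linalg.norm(W)
        print(f"{name:10s} rotate={rotate}: relative error {err:.4f}")
```

In a deployed kernel one would not rotate back: because the Hadamard blocks H are orthogonal, the rotation can be fused into the quantized weights offline and its inverse applied to the activations on the fly, since (W H^T)(H x) = W x, which is what keeps the overhead of the rotated format negligible.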