Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
September 27, 2025
作者: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
cs.AI
Abstract
The recent hardware-accelerated microscaling 4-bit floating-point formats
such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to
revolutionize large language model (LLM) inference. Yet, their practical
benefits remain unproven. We present the first comprehensive study of MXFP4 and
NVFP4 for post-training quantization, revealing gaps between their promise and
real-world performance. Our analysis shows that state-of-the-art methods
struggle with FP4, due to two key issues: (1) NVFP4's small group size provably
neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two
scale quantization severely degrades accuracy due to high induced error. To
bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the
classic GPTQ quantization algorithm that tailors the quantization process to
FP4's unique properties, by using block-wise Hadamard transforms and
format-specific optimizations. We support our proposal with a set of
high-performance GPU kernels that implement the MR-GPTQ format with negligible
overhead, by fusing the rotations into the weights and applying them to the
activations with a fast online transform. This yields speedups over FP16 of up
to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on
RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches
or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the
point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an
automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock
a new frontier of accuracy-performance trade-offs.
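To make the two formats the abstract contrasts concrete, here is a minimal NumPy sketch of group-wise FP4 (E2M1) quantization with the two kinds of shared scales: power-of-two scales over groups of 32 (MXFP4-like, E8M0 exponents in the real format) and finer-grained scales over groups of 16 (NVFP4-like, FP8 E4M3 scales in the real format). The scale-selection rules and function names below are simplified illustrative assumptions, not the paper's method or the exact OCP/NVIDIA rounding rules.

```python
import numpy as np

# Illustrative sketch only: helper names and scale rules are ours, not the paper's.
# Non-negative values representable in the FP4 (E2M1) element format shared by
# MXFP4 and NVFP4; the sign is handled separately below.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(g, scale):
    """Divide one group by its shared scale and round to the nearest FP4 value."""
    y = g / scale
    idx = np.abs(np.abs(y)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(y) * FP4_GRID[idx] * scale

def fp4_quantize(x, group_size, power_of_two_scale):
    """Group-wise FP4 quantization with either MXFP4-like power-of-two scales
    (groups of 32) or NVFP4-like finer-grained scales (groups of 16)."""
    out = np.empty_like(x)
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size]
        scale = (np.abs(g).max() + 1e-12) / FP4_GRID[-1]  # map group max onto 6.0
        if power_of_two_scale:
            # Round the scale up to a power of two so no value clips; the MX spec
            # instead stores an 8-bit power-of-two exponent (E8M0) per group.
            scale = 2.0 ** np.ceil(np.log2(scale))
        out[i:i + group_size] = quantize_group(g, scale)
    return out

x = np.random.default_rng(0).standard_normal(4096)
for name, q in [("MXFP4-like (g=32, 2^k scale)", fp4_quantize(x, 32, True)),
                ("NVFP4-like (g=16, fine scale)", fp4_quantize(x, 16, False))]:
    print(name, "relative L2 error:", np.linalg.norm(x - q) / np.linalg.norm(x))
```

Running this shows the power-of-two scale restriction and the coarser group size of the MXFP4-like scheme inducing noticeably higher quantization error than the NVFP4-like scheme, which is the gap the abstract attributes to MXFP4.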
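The rotation-fusion trick mentioned in the abstract relies on a standard identity: for an orthogonal, symmetric, block-diagonal Hadamard matrix H, Wx = (WH)(Hx), so the rotation can be folded into the (quantized) weights offline while activations are rotated on the fly with a fast Walsh-Hadamard transform. The NumPy sketch below only illustrates that identity under assumed names (fwht, block_hadamard) and an assumed block size; it is not the paper's GPU kernels.

```python
import numpy as np

# Illustrative sketch only: names and block size are assumptions, not the paper's code.
def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform of a 1-D array whose length
    is a power of two (equivalent to multiplying by a symmetric Hadamard H)."""
    v = v.astype(np.float64).copy()
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(n)  # normalization makes the transform orthogonal

def block_hadamard(m, block):
    """Apply the Hadamard rotation independently to each run of `block`
    consecutive entries along the last axis (a block-diagonal rotation)."""
    flat = m.reshape(-1, block).astype(np.float64).copy()
    for row in flat:
        row[:] = fwht(row)
    return flat.reshape(m.shape)

rng = np.random.default_rng(0)
out_dim, in_dim, block = 8, 128, 32
W = rng.standard_normal((out_dim, in_dim))   # a linear layer's weight
x = rng.standard_normal(in_dim)              # an activation vector

W_rot = block_hadamard(W, block)   # folded into the weights offline, then quantized
x_rot = block_hadamard(x, block)   # cheap online FWHT at inference time
print(np.allclose(W @ x, W_rot @ x_rot))     # True: the rotation cancels exactly
```

Because the rotation is exact in floating point, the only approximation comes from quantizing W_rot instead of W; the block-wise rotation spreads outliers within each quantization group before the FP4 rounding is applied.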