Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
September 27, 2025
作者: Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
cs.AI
Abstract
The recent hardware-accelerated microscaling 4-bit floating-point formats
such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to
revolutionize large language model (LLM) inference. Yet, their practical
benefits remain unproven. We present the first comprehensive study of MXFP4 and
NVFP4 for post-training quantization, revealing gaps between their promise and
real-world performance. Our analysis shows that state-of-the-art methods
struggle with FP4, due to two key issues: (1) NVFP4's small group size provably
neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two
scale quantization severely degrades accuracy due to high induced error. To
bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the
classic GPTQ quantization algorithm that tailors the quantization process to
FP4's unique properties, by using block-wise Hadamard transforms and
format-specific optimizations. We support our proposal with a set of
high-performance GPU kernels that implement the MR-GPTQ format with negligible
overhead, by fusing the rotations into the weights and applying them to the
activations with a fast online transform. This yields speedups over FP16 of up
to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on
RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches
or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the
point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an
automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock
a new frontier of accuracy-performance trade-offs.
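To make the two formats the abstract contrasts concrete, here is a minimal NumPy sketch of group-wise FP4 (E2M1) quantization with the two kinds of shared scales: power-of-two scales over groups of 32 (MXFP4-like, E8M0 exponents in the real format) and finer-grained scales over groups of 16 (NVFP4-like, FP8 E4M3 scales in the real format). The scale-selection rules and function names below are simplified illustrative assumptions, not the paper's method or the exact OCP/NVIDIA rounding rules.

```python
import numpy as np

# Illustrative sketch only: helper names and scale rules are ours, not the paper's.
# Non-negative values representable in the FP4 (E2M1) element format shared by
# MXFP4 and NVFP4; the sign is handled separately below.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group(g, scale):
    """Divide one group by its shared scale and round to the nearest FP4 value."""
    y = g / scale
    idx = np.abs(np.abs(y)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(y) * FP4_GRID[idx] * scale

def fp4_quantize(x, group_size, power_of_two_scale):
    """Group-wise FP4 quantization with either MXFP4-like power-of-two scales
    (groups of 32) or NVFP4-like finer-grained scales (groups of 16)."""
    out = np.empty_like(x)
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size]
        scale = (np.abs(g).max() + 1e-12) / FP4_GRID[-1]  # map group max onto 6.0
        if power_of_two_scale:
            # Round the scale up to a power of two so no value clips; the MX spec
            # instead stores an 8-bit power-of-two exponent (E8M0) per group.
            scale = 2.0 ** np.ceil(np.log2(scale))
        out[i:i + group_size] = quantize_group(g, scale)
    return out

x = np.random.default_rng(0).standard_normal(4096)
for name, q in [("MXFP4-like (g=32, 2^k scale)", fp4_quantize(x, 32, True)),
                ("NVFP4-like (g=16, fine scale)", fp4_quantize(x, 16, False))]:
    print(name, "relative L2 error:", np.linalg.norm(x - q) / np.linalg.norm(x))
```

Running this shows the power-of-two scale restriction and the coarser group size of the MXFP4-like scheme inducing noticeably higher quantization error than the NVFP4-like scheme, which is the gap the abstract attributes to MXFP4.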
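The rotation-fusion trick mentioned in the abstract relies on a standard identity: for an orthogonal, symmetric, block-diagonal Hadamard matrix H, Wx = (WH)(Hx), so the rotation can be folded into the (quantized) weights offline while activations are rotated on the fly with a fast Walsh-Hadamard transform. The NumPy sketch below only illustrates that identity under assumed names (fwht, block_hadamard) and an assumed block size; it is not the paper's GPU kernels.

```python
import numpy as np

# Illustrative sketch only: names and block size are assumptions, not the paper's code.
def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform of a 1-D array whose length
    is a power of two (equivalent to multiplying by a symmetric Hadamard H)."""
    v = v.astype(np.float64).copy()
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(n)  # normalization makes the transform orthogonal

def block_hadamard(m, block):
    """Apply the Hadamard rotation independently to each run of `block`
    consecutive entries along the last axis (a block-diagonal rotation)."""
    flat = m.reshape(-1, block).astype(np.float64).copy()
    for row in flat:
        row[:] = fwht(row)
    return flat.reshape(m.shape)

rng = np.random.default_rng(0)
out_dim, in_dim, block = 8, 128, 32
W = rng.standard_normal((out_dim, in_dim))   # a linear layer's weight
x = rng.standard_normal(in_dim)              # an activation vector

W_rot = block_hadamard(W, block)   # folded into the weights offline, then quantized
x_rot = block_hadamard(x, block)   # cheap online FWHT at inference time
print(np.allclose(W @ x, W_rot @ x_rot))     # True: the rotation cancels exactly
```

Because the rotation is exact in floating point, the only approximation comes from quantizing W_rot instead of W; the block-wise rotation spreads outliers within each quantization group before the FP4 rounding is applied.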