Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Abstract

Recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 because of two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, through rotation fusion into the weights and fast online computation of the activations. This leads to speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX 5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where it nears the accuracy of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
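The two ingredients the abstract names, group-wise FP4 quantization and block-wise Hadamard rotation, can be illustrated with a small pure-Python sketch. It is not the MR-GPTQ algorithm: there is no GPTQ error compensation, and every function name is hypothetical. The group sizes (32 for the MXFP4-like setting, 16 for the NVFP4-like one) and the E2M1 value grid follow the public format descriptions rather than the abstract. The scale rule is also a simplifying assumption: the power-of-two scale is rounded up so nothing clips, and the NVFP4-like scale is kept in full precision instead of FP8.

```python
import math
import random

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round a scaled value to the nearest FP4 magnitude, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)

def quantize_group(group, power_of_two_scale):
    """Quantize then dequantize one group that shares a single scale."""
    absmax = max(abs(v) for v in group)
    if absmax == 0.0:
        return list(group)
    scale = absmax / FP4_GRID[-1]  # map the largest magnitude to 6.0
    if power_of_two_scale:
        # Assumption: round the scale up to a power of two to avoid clipping.
        scale = 2.0 ** math.ceil(math.log2(scale))
    return [round_to_fp4(v / scale) * scale for v in group]

def quantize(vec, group_size, power_of_two_scale):
    """Apply quantize_group to each contiguous group of the vector."""
    out = []
    for start in range(0, len(vec), group_size):
        out.extend(quantize_group(vec[start:start + group_size], power_of_two_scale))
    return out

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = math.sqrt(n)
    return [[v / norm for v in row] for row in h]

def rotate_blocks(vec, block):
    """Rotate each contiguous block of the vector by the same Hadamard matrix."""
    h = hadamard(block)
    out = []
    for start in range(0, len(vec), block):
        chunk = vec[start:start + block]
        out.extend(sum(h[i][j] * chunk[j] for j in range(block)) for i in range(block))
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(64)]
weights[5] = 20.0  # one injected outlier
acts = [random.gauss(0.0, 1.0) for _ in range(64)]

for name, group, pow2 in [("MXFP4-like", 32, True), ("NVFP4-like", 16, False)]:
    w_rot = rotate_blocks(weights, group)
    plain_err = mse(weights, quantize(weights, group, pow2))
    rot_err = mse(w_rot, quantize(w_rot, group, pow2))
    print(f"{name}: plain MSE={plain_err:.4f}, rotated MSE={rot_err:.4f}")

# Rotating weights and activations with the same orthonormal blocks leaves the
# dot product unchanged, which is why the weight rotation can be applied offline.
print(dot(weights, acts), dot(rotate_blocks(weights, 16), rotate_blocks(acts, 16)))
```

Because each Hadamard block is orthonormal, the error measured in the rotated domain equals the error after rotating back, so the two MSE columns are directly comparable. The printed numbers depend on the random seed and on the simplified scale rule, so they should not be read as reproducing the paper's results.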
