Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Abstract

Hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, recently supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference, yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization (PTQ), revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 because of two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy because of the high error it induces. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, through rotation fusion into the weights and fast online computation of the activations. This yields speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy and significantly boosts MXFP4, to the point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
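
The ingredients named in the abstract (small per-group scales, power-of-two scale rounding, and block-wise Hadamard rotation) can be illustrated with a minimal pure-Python sketch. This is not the paper's MR-GPTQ implementation or its GPU kernels, and it relies on assumptions the abstract does not state: FP4 elements use the E2M1 value grid, MXFP4 uses groups of 32 with a power-of-two scale, and NVFP4 uses groups of 16 with a finer-grained scale. It also simplifies: the NVFP4-like scale is left as an unquantized float, the power-of-two scale is rounded up to avoid clipping, and the GPTQ error-compensation step is omitted. All function names are hypothetical.

```python
import math
import random

# Magnitudes representable by an FP4 E2M1 element (assumed grid; sign is separate).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(x):
    """Round a scalar to the nearest signed grid value, saturating at +/-6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def quantize_group(group, power_of_two_scale):
    """Fake-quantize one group of values that share a single scale."""
    absmax = max(abs(v) for v in group)
    if absmax == 0.0:
        return list(group)
    scale = absmax / 6.0  # map the largest magnitude onto the top of the grid
    if power_of_two_scale:
        # MXFP4-like restriction: the scale must be a power of two.
        # Rounding up is a simplification chosen here to avoid clipping.
        scale = 2.0 ** math.ceil(math.log2(scale))
    return [round_to_fp4(v / scale) * scale for v in group]


def hadamard(block):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


def fake_quantize(values, group_size, power_of_two_scale, rotate=False):
    """Quantize group by group, optionally inside a block-wise Hadamard rotation."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        if rotate:
            group = hadamard(group)
        quantized = quantize_group(group, power_of_two_scale)
        if rotate:
            # Rotate back only so the error is measured in the original basis.
            quantized = hadamard(quantized)
        out.extend(quantized)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
for name, size, pow2 in [("MXFP4-like", 32, True), ("NVFP4-like", 16, False)]:
    for rotate in (False, True):
        err = mse(weights, fake_quantize(weights, size, pow2, rotate))
        print(f"{name:11s} rotate={rotate}: MSE={err:.5f}")
```

Running the script prints the reconstruction error for each format-like configuration with and without rotation. Synthetic Gaussian weights are only a stand-in, so the printed numbers should not be read as reproducing the paper's results.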
