Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Abstract

Recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 because of two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by fusing the rotation into the weights and computing the activations quickly online. This leads to speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
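As a rough illustration of two ingredients named in the abstract, the sketch below shows a block-wise Hadamard rotation and a group quantizer with a power-of-two scale. It is not MR-GPTQ and not the authors' kernels. It assumes the E2M1 value grid for FP4 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), uses a simplified scale rule chosen here to avoid clipping, and all function names (`hadamard`, `rotate_blocks`, `round_to_fp4`, `quantize_group_pow2`) are hypothetical.

```python
import math

# Assumed E2M1 magnitudes for a 4-bit float element (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction.

    n must be a power of two.
    """
    H = [[1.0]]
    while len(H) < n:
        # H_2k = [[H_k, H_k], [H_k, -H_k]]
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]


def rotate_blocks(x, block):
    """Apply the same Hadamard rotation to each consecutive block of x.

    len(x) must be a multiple of block.
    """
    H = hadamard(block)
    out = []
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        out.extend(sum(h * v for h, v in zip(row, chunk)) for row in H)
    return out


def round_to_fp4(v):
    """Round v to the nearest value on the assumed FP4 grid, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(v)))
    return math.copysign(mag, v)


def quantize_group_pow2(group):
    """Quantize and dequantize one group with a shared power-of-two scale.

    Simplification (an assumption, not a format specification): the scale is
    the smallest power of two that maps the group maximum into [0, 6].
    """
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return [0.0] * len(group)
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))
    return [round_to_fp4(v / scale) * scale for v in group]


if __name__ == "__main__":
    # One outlier in a block of 16: the rotation spreads it over all entries.
    x = [8.0] + [0.0] * 15
    y = rotate_blocks(x, 16)
    print(max(abs(v) for v in x), max(abs(v) for v in y))  # 8.0 2.0
    print(quantize_group_pow2(y))
```

Because the normalized Sylvester matrix is symmetric and orthogonal, applying `rotate_blocks` twice returns the original vector, which is what makes it possible in principle to fold such a rotation into the weights ahead of time. How MR-GPTQ actually chooses block sizes, scales, and kernel fusion is not specified in this abstract.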
