INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Abstract

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
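As a rough illustration of what fine-grained (block-wise) INT8 quantization involves, the following minimal Python sketch quantizes a list of values in blocks of 32, each with its own shared power-of-two scale. It is a simplified sketch under stated assumptions, not the bit-level MXINT8 encoding and not the paper's implementation. The function names and constants are hypothetical, and "symmetric clipping" is assumed here to mean restricting the integer range to [-127, 127] so that -128 is never produced.

```python
import math

BLOCK_SIZE = 32  # block size quoted for MX formats in the abstract
QMAX = 127       # assumption: symmetric INT8 range [-127, 127]


def quantize_block(block, qmax=QMAX):
    """Quantize one block to integers sharing a single power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0] * len(block), 1.0
    # Smallest power-of-two scale for which amax / scale fits within qmax.
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    # Round to nearest, then clip symmetrically (assumed reading of the
    # "symmetric clipping" mentioned in the abstract).
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return q, scale


def quantize_blockwise(values, block_size=BLOCK_SIZE):
    """Split values into consecutive blocks and quantize each independently."""
    q_blocks, scales = [], []
    for start in range(0, len(values), block_size):
        q, scale = quantize_block(values[start:start + block_size])
        q_blocks.append(q)
        scales.append(scale)
    return q_blocks, scales


def dequantize_blockwise(q_blocks, scales):
    """Map the integers back to real values using each block's scale."""
    return [qi * s for q, s in zip(q_blocks, scales) for qi in q]


# Hypothetical data: smooth values plus one outlier in the first block.
values = [math.sin(i) for i in range(64)]
values[5] = 40.0
q_blocks, scales = quantize_blockwise(values)
restored = dequantize_blockwise(q_blocks, scales)
```

In this sketch the outlier enlarges the scale of its own block only, so the second block keeps a finer step size. That locality is the general motivation for block-wise scaling, whichever element format is used.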
