INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Abstract

Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
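To make the idea of fine-grained (block-wise) INT8 quantization concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes a block size of 32 (the MX block size named in the abstract), a power-of-two scale shared per block, and a symmetric integer range of [-127, 127] as one plausible reading of "symmetric clipping". The function names `quantize_blockwise_int8` and `dequantize` are hypothetical.

```python
import numpy as np

BLOCK = 32   # block size the abstract gives for MX formats
QMAX = 127   # assumption: symmetric code range [-127, 127], dropping -128


def quantize_blockwise_int8(x, block=BLOCK):
    """Quantize a 1-D array to int8 codes with one power-of-two scale per block."""
    x = np.asarray(x, dtype=np.float32)
    if x.size % block != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = x.reshape(-1, block)

    # Per-block absolute maximum; all-zero blocks fall back to 1.0.
    amax = np.abs(blocks).max(axis=1, keepdims=True).astype(np.float64)
    safe = np.where(amax > 0, amax, 1.0)

    # Smallest power-of-two scale such that amax / scale <= QMAX.
    exponent = np.ceil(np.log2(safe / QMAX))
    scale = np.exp2(exponent).astype(np.float32)

    # Round to nearest, then clip to the symmetric range.
    codes = np.clip(np.rint(blocks / scale), -QMAX, QMAX).astype(np.int8)
    return codes, scale


def dequantize(codes, scale, shape):
    """Map int8 codes back to float32 using the per-block scales."""
    return (codes.astype(np.float32) * scale).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(128).astype(np.float32)
    x[5] = 50.0  # a single outlier only affects the scale of its own block
    codes, scale = quantize_blockwise_int8(x)
    x_hat = dequantize(codes, scale, x.shape)
    print("per-block scales:", scale.ravel())
    print("max abs error:", float(np.abs(x - x_hat).max()))
```

Because each block carries its own scale, the outlier in the first block coarsens only those 32 values, while the remaining blocks keep a finer step. With the symmetric range, negating the input negates the codes exactly, so the quantizer has no built-in sign asymmetry.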
