INT vs. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
October 29, 2025
Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
cs.AI
Abstract
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly
embracing low-precision floating-point (FP) formats to handle the pervasive
activation outliers in Large Language Models (LLMs). Despite this industry
trend, a unified comparison of FP and integer (INT) quantization across varying
granularities has been missing, leaving algorithm and hardware co-design
without clear guidance. This paper fills that gap by systematically
investigating the trade-offs between FP and INT formats. We reveal a critical
performance crossover: while FP excels in coarse-grained quantization, the
comparison at fine-grained (block-wise) levels is more nuanced. Our
comprehensive comparison demonstrates that for popular 8-bit fine-grained
formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart
in both algorithmic accuracy and hardware efficiency. However, for 4-bit
formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we
show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like
Hadamard rotation are applied. We also introduce a symmetric clipping method
that resolves gradient bias in fine-grained low-bit INT training, enabling
nearly lossless performance for MXINT8 training. These findings challenge the
current hardware trajectory, demonstrating that a one-size-fits-all FP approach
is suboptimal and advocating that fine-grained INT formats, particularly
MXINT8, offer a better balance of accuracy, power, and efficiency for future AI
accelerators.
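
To make the fine-grained (block-wise) formats discussed above concrete, the following is a minimal Python/NumPy sketch of MXINT8-style quantization with block size 32: each block of 32 values shares one power-of-two scale, and the elements are stored as signed 8-bit integers on a symmetric grid. The function name, the exact scale-selection rule, and the [-127, 127] clipping range are illustrative assumptions, not the OCP MX specification or the method evaluated in the paper.

```python
import numpy as np

def mxint8_quantize(x, block_size=32):
    """Toy MXINT8-style block quantization (illustrative sketch only)."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # One shared power-of-two scale per block, chosen so the largest
    # magnitude in the block fits the signed 8-bit range without
    # saturating (the real MX scale rule may differ; this is an assumption).
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0.0, 1.0, absmax)
    scale = 2.0 ** np.ceil(np.log2(absmax / 127.0))

    # Symmetric integer grid: clip to [-127, 127] rather than [-128, 127].
    q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)

    # Dequantize to inspect the round-trip error.
    x_hat = (q.astype(np.float32) * scale).reshape(-1)[: len(x)]
    return q, scale, x_hat

# Example: an activation outlier only inflates the scale of its own block.
x = np.random.randn(64).astype(np.float32)
x[3] = 20.0
q, scale, x_hat = mxint8_quantize(x)
print("per-block scales:", scale.ravel())
print("max abs round-trip error:", np.abs(x - x_hat).max())
```

Because the scale is shared only within a 32-element block, a single activation outlier degrades precision for its own block rather than the whole tensor, which is the usual intuition for why fine-grained INT formats can remain competitive with FP at 8 bits.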