ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
December 14, 2023
Authors: Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
cs.AI
Abstract
This study examines 4-bit quantization methods such as GPTQ in large language
models (LLMs), highlighting GPTQ's overfitting and limited gains on zero-shot
tasks. While prior works focus merely on zero-shot measurements, we extend the
task scope to more generative categories such as code generation and
abstractive summarization, where we find that INT4 quantization can
significantly underperform. Simply shifting to higher-precision formats such as
FP6, however, has been particularly challenging, and thus overlooked, because
of the poor performance caused by the lack of sophisticated integration and
system acceleration strategies on current AI hardware. Our results show that
FP6, even with a coarse-grained quantization scheme, performs robustly across
various algorithms and tasks, demonstrating its superiority in accuracy and
versatility. Notably, with FP6 quantization, the StarCoder-15B model performs
comparably to its FP16 counterpart in code generation, and smaller models such
as the 406M one closely match their baselines in summarization; neither result
can be achieved with INT4. To better accommodate various AI hardware and
achieve the best system performance, we propose a novel 4+2 design for FP6
that achieves latency similar to state-of-the-art INT4 fine-grained
quantization. With our design, FP6 can become a promising alternative to the
4-bit quantization methods currently used in LLMs.
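
Two ideas in the abstract lend themselves to a small illustration: round-to-nearest quantization into an FP6 format with a coarse-grained (per-row) scale, and the "4+2" idea of splitting each 6-bit code into a 4-bit and a 2-bit segment that can be stored in separately aligned buffers. The sketch below is not the authors' GPU kernel; it assumes an E3M2-style FP6 layout (1 sign, 3 exponent, 2 mantissa bits, bias 3), and names such as `quantize_fp6` and `split_4_plus_2` are illustrative only.

```python
# Minimal sketch (not the paper's implementation): coarse-grained FP6-style
# weight quantization plus a "4+2" bit split, assuming an E3M2-like layout.
import numpy as np

SIGN_BITS, EXP_BITS, MAN_BITS = 1, 3, 2
EXP_BIAS = 3  # assumed bias for the 3-bit exponent

def quantize_fp6(w: np.ndarray):
    """Round a 1-D FP16/FP32 weight row to the nearest representable FP6 value."""
    # Enumerate all 6-bit codes once and decode them to real values.
    codes = np.arange(2 ** (SIGN_BITS + EXP_BITS + MAN_BITS), dtype=np.uint8)
    sign = np.where((codes >> (EXP_BITS + MAN_BITS)) & 1, -1.0, 1.0)
    exp = (codes >> MAN_BITS) & (2 ** EXP_BITS - 1)
    man = codes & (2 ** MAN_BITS - 1)
    # Subnormal values when exp == 0, normal values otherwise.
    value = np.where(
        exp == 0,
        sign * (man / 2 ** MAN_BITS) * 2.0 ** (1 - EXP_BIAS),
        sign * (1.0 + man / 2 ** MAN_BITS) * 2.0 ** (exp.astype(float) - EXP_BIAS),
    )
    # Coarse-grained (per-row) scale: map the largest weight to the largest FP6 value.
    scale = np.abs(w).max() / np.abs(value).max()
    # For every weight, pick the code whose decoded value is closest (round-to-nearest).
    idx = np.abs(w[:, None] / scale - value[None, :]).argmin(axis=1)
    return codes[idx], scale, value[idx] * scale  # codes, scale, dequantized weights

def split_4_plus_2(codes: np.ndarray):
    """Split each 6-bit code into a 4-bit and a 2-bit segment (the '4+2' idea)."""
    high4 = (codes >> 2) & 0xF   # upper 4 bits
    low2 = codes & 0x3           # lower 2 bits
    return high4, low2

def merge_4_plus_2(high4: np.ndarray, low2: np.ndarray) -> np.ndarray:
    """Recombine the two segments into the original 6-bit code."""
    return (high4 << 2) | low2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(16).astype(np.float32)
    codes, scale, w_hat = quantize_fp6(w)
    h, l = split_4_plus_2(codes)
    assert np.array_equal(merge_4_plus_2(h, l), codes)
    print("max abs quantization error:", np.abs(w - w_hat).max())
```

The per-row scale stands in for the coarse-grained scheme mentioned in the abstract; a system-level implementation would fuse the split/merge step into the dequantization kernel rather than materializing separate arrays in memory.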