

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

December 14, 2023
作者: Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
cs.AI

Abstract

This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited improvement on zero-shot tasks. While prior works focused mainly on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. However, simply shifting to a higher-precision format such as FP6 has been particularly challenging, and thus overlooked, because the lack of sophisticated integration and system acceleration strategies on current AI hardware leads to poor performance. Our results show that FP6, even with a coarse-grained quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models such as the 406M one closely match their baselines in summarization; neither is achievable with INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grained quantization. With this design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
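To give a concrete picture of what coarse-grained FP6 weight quantization involves, below is a minimal NumPy sketch under stated assumptions: the E3M2 bit layout (1 sign, 3 exponent, 2 mantissa bits, bias 3), per-output-row scaling, and the helper names `fake_quant_fp6` and `split_4_plus_2` are illustrative choices made here, not the paper's actual FP6 format, kernels, or API.

```python
import numpy as np

def _e3m2_grid():
    # Enumerate the non-negative magnitudes representable with 3 exponent and
    # 2 mantissa bits (exponent bias 3); exponent code 0 encodes subnormals.
    # NOTE: E3M2 is an assumption; the paper's exact FP6 layout may differ.
    vals = [0.0]
    for e in range(8):          # 3-bit exponent field
        for m in range(4):      # 2-bit mantissa field
            if e == 0:          # subnormal: 0.mm * 2^(1 - bias)
                vals.append((m / 4.0) * 2.0 ** (1 - 3))
            else:               # normal: 1.mm * 2^(e - bias)
                vals.append((1.0 + m / 4.0) * 2.0 ** (e - 3))
    return np.unique(np.array(vals))

_E3M2 = _e3m2_grid()            # largest magnitude: 1.75 * 2^4 = 28.0

def fake_quant_fp6(w: np.ndarray) -> np.ndarray:
    """Coarse-grained FP6 fake quantization: one scale per output row
    (channel), then snap every scaled weight to the nearest FP6 value."""
    amax = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / _E3M2[-1], 1.0)
    x = w / scale
    # Nearest-neighbour rounding onto the sorted FP6 magnitude grid.
    idx = np.clip(np.searchsorted(_E3M2, np.abs(x)), 1, len(_E3M2) - 1)
    lo, hi = _E3M2[idx - 1], _E3M2[idx]
    snapped = np.where(np.abs(x) - lo < hi - np.abs(x), lo, hi)
    return np.sign(x) * snapped * scale

def split_4_plus_2(codes6: np.ndarray):
    """Illustrative '4+2' bit split for already-encoded 6-bit integer codes
    (values 0..63): keep a 4-bit low stream and a 2-bit high stream so both
    stay byte/word aligned. A sketch of the general idea only, not the
    paper's GPU kernel layout."""
    codes6 = codes6.astype(np.uint8)
    return codes6 & 0xF, (codes6 >> 4) & 0x3

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 16)).astype(np.float32)
    print("max abs error:", np.max(np.abs(w - fake_quant_fp6(w))))
```

Splitting each 6-bit code into aligned 4-bit and 2-bit streams hints at why such a design can approach INT4-like memory bandwidth and latency, though the system-level implementation described in the paper involves GPU kernel details well beyond this sketch.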