ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

Abstract

This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs) and shows that GPTQ overfits and yields only limited improvement on zero-shot tasks. Whereas prior work focused mainly on zero-shot evaluation, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. However, simply moving to a higher-precision format such as FP6 has been particularly challenging, and therefore overlooked, because the lack of sophisticated integration and system acceleration strategies on current AI hardware leads to poor performance. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its advantages in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models such as the 406M model closely match their baselines in summarization. Neither result can be achieved with INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising alternative to the 4-bit quantization methods currently used in LLMs.
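The abstract names a 4+2 design for FP6 but does not describe its layout. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is that each 6-bit weight code is split into a 4-bit part and a 2-bit part, so that each part packs evenly into bytes (two 4-bit fields or four 2-bit fields per byte). The function names, the choice of high bits for the 4-bit part, and the low-bits-first packing order below are all assumptions made for illustration.

```python
# Illustrative sketch only: the split point, bit order, and all names here
# are assumptions; the abstract does not specify the 4+2 layout.

def split_4_plus_2(codes):
    """Split each 6-bit code into its top 4 bits and bottom 2 bits."""
    hi, lo = [], []
    for c in codes:
        if not 0 <= c < 64:
            raise ValueError(f"not a 6-bit code: {c}")
        hi.append(c >> 2)      # upper 4 bits
        lo.append(c & 0b11)    # lower 2 bits
    return hi, lo


def pack(values, bits):
    """Pack fixed-width fields into bytes, first field in the lowest bits."""
    per_byte = 8 // bits
    out = bytearray()
    for i in range(0, len(values), per_byte):
        byte = 0
        for j, v in enumerate(values[i:i + per_byte]):
            byte |= v << (j * bits)
        out.append(byte)
    return bytes(out)


def unpack(data, bits, count):
    """Inverse of pack: recover `count` fields of width `bits`."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    values = []
    for byte in data:
        for j in range(per_byte):
            values.append((byte >> (j * bits)) & mask)
    return values[:count]


def merge_4_plus_2(hi, lo):
    """Reassemble 6-bit codes from their 4-bit and 2-bit parts."""
    return [(h << 2) | l for h, l in zip(hi, lo)]


codes = [0, 63, 37, 5]                    # four arbitrary 6-bit codes
hi, lo = split_4_plus_2(codes)
packed_hi, packed_lo = pack(hi, 4), pack(lo, 2)
restored = merge_4_plus_2(unpack(packed_hi, 4, len(codes)),
                          unpack(packed_lo, 2, len(codes)))
```

In this sketch, four 6-bit codes occupy 2 bytes of 4-bit fields plus 1 byte of 2-bit fields, which is the same 24 bits as dense 6-bit packing, but no field straddles a byte boundary.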
