Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

Abstract

Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques approach their respective limits, relying on a single method for further compression has become increasingly difficult. In this work, we explore an alternative: combining quantization with sparsity. This joint approach is promising, but it introduces new difficulties because the two techniques place inherently conflicting requirements on the weight distribution: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general, training-free framework that aligns pruning and quantization by compensating errors between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is reformulated into a tractable problem through surrogate approximation and ultimately solved in closed form via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization (4-bit weights, activations, and KV cache) with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
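To make the problem setting concrete, the following is a minimal NumPy sketch of the naive baseline that joint compression starts from: prune half of the weights, then quantize the survivors to 4 bits, with no error compensation at all. It is not OBR and not the authors' code. The function names (`magnitude_prune`, `quantize_int4`), the per-row magnitude criterion, and the symmetric round-to-nearest quantizer are all assumptions chosen for simplicity.

```python
import numpy as np


def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest-magnitude entries in each row (hypothetical helper)."""
    k = int(w.shape[1] * sparsity)
    mask = np.ones_like(w, dtype=bool)
    if k > 0:
        # Indices of the k smallest |w| values in every row.
        idx = np.argpartition(np.abs(w), k - 1, axis=1)[:, :k]
        np.put_along_axis(mask, idx, False, axis=1)
    return w * mask, mask


def quantize_int4(w):
    """Symmetric per-row round-to-nearest to integers in [-8, 7] (assumed scheme)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Guard against all-zero rows so the division below is always defined.
    scale = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)   # toy weight matrix
x = rng.normal(size=(64, 16)).astype(np.float32)  # toy calibration inputs

w_sparse, mask = magnitude_prune(w, sparsity=0.5)
q, scale = quantize_int4(w_sparse)
w_joint = q * scale  # dequantized sparse weights; pruned entries stay zero

# Relative error of the layer output under naive prune-then-quantize.
rel_err = np.linalg.norm(w @ x - w_joint @ x) / np.linalg.norm(w @ x)
print(f"relative output error: {rel_err:.3f}")
```

The printed error is the quantity a compensation scheme would try to reduce. Here nothing adjusts the surviving weights to absorb the pruning and rounding errors, which is the gap the abstract says OBR targets.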
