Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
September 14, 2025
Authors: Hang Guo, Yawei Li, Luca Benini
cs.AI
Abstract
Recent advances in Large Language Model (LLM) compression, such as
quantization and pruning, have achieved notable success. However, as these
techniques gradually approach their respective limits, relying on a single
method for further compression has become increasingly challenging. In this
work, we explore an alternative solution by combining quantization and
sparsity. This joint approach, though promising, introduces new difficulties
due to the inherently conflicting requirements on weight distributions:
quantization favors compact ranges, while pruning benefits from high variance.
To address this problem, we propose Optimal Brain Restoration (OBR), a general
and training-free framework that aligns pruning and quantization through error
compensation between the two. OBR minimizes performance degradation on
downstream tasks by building on a second-order Hessian objective, which is then
reformulated into a tractable problem through a surrogate approximation and
ultimately admits a closed-form solution via group error compensation.
Experiments show that OBR enables aggressive W4A4KV4 quantization with 50%
sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory
reduction compared to the FP16-dense baseline.
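For context, here is a minimal sketch of the kind of second-order, layer-wise objective the abstract refers to, written in the classical Optimal Brain Surgeon / GPTQ form. The abstract does not spell out OBR's exact objective or its group compensation, so the formulas below are an illustrative assumption rather than the paper's derivation. Perturbing a weight row w by \delta w changes the layer-wise reconstruction loss approximately as

\Delta \mathcal{L} \approx \tfrac{1}{2}\, \delta w^{\top} H\, \delta w, \qquad H = 2 X X^{\top},

where X collects calibration inputs. Forcing coordinate q to a fixed target t_q (t_q = 0 when pruned, t_q = \mathrm{quant}(w_q) when quantized) and minimizing \Delta \mathcal{L} over the remaining coordinates gives the standard closed-form compensation

\delta w = -\,\frac{w_q - t_q}{[H^{-1}]_{qq}}\; H^{-1} e_q .

The toy NumPy sketch below, assuming this OBS/GPTQ-style formulation, walks one weight row left to right, prunes or quantizes each entry, and spreads the resulting error onto the not-yet-processed entries via the inverse Hessian. It is a simplification (a single fixed inverse Hessian is reused instead of being updated after each step), the names obs_compensate_row, prune_mask, and scale are hypothetical, and this is not OBR's group error compensation.

import numpy as np

def obs_compensate_row(w, H_inv, prune_mask, scale):
    # Process one weight row left to right: force each entry to zero (pruned)
    # or to the nearest quantization grid point, then absorb the induced error
    # into the not-yet-processed entries using the inverse-Hessian column.
    w = w.astype(np.float64).copy()
    for q in range(w.shape[0]):
        target = 0.0 if prune_mask[q] else np.round(w[q] / scale) * scale
        err = (w[q] - target) / H_inv[q, q]
        w[q:] -= err * H_inv[q:, q]   # also drives w[q] onto its target
    return w

# Toy usage on random data: damped Hessian from calibration inputs,
# 50% magnitude pruning plus uniform quantization with step `scale`.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 256))
H_inv = np.linalg.inv(2.0 * X @ X.T + 1e-2 * np.eye(16))
w = rng.standard_normal(16)
prune_mask = np.abs(w) < np.quantile(np.abs(w), 0.5)
w_hat = obs_compensate_row(w, H_inv, prune_mask, scale=0.1)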