Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
September 14, 2025
Authors: Hang Guo, Yawei Li, Luca Benini
cs.AI
Abstract
Recent advances in Large Language Model (LLM) compression, such as
quantization and pruning, have achieved notable success. However, as these
techniques gradually approach their respective limits, relying on a single
method for further compression has become increasingly challenging. In this
work, we explore an alternative solution by combining quantization and
sparsity. This joint approach, though promising, introduces new difficulties
due to the inherently conflicting requirements on weight distributions:
quantization favors compact ranges, while pruning benefits from high variance.
To address this problem, we propose Optimal Brain Restoration (OBR), a general
and training-free framework that aligns pruning and quantization through error
compensation between the two. OBR minimizes performance degradation on
downstream tasks by building on a second-order Hessian objective, which is then
reformulated into a tractable problem through a surrogate approximation and
ultimately admits a closed-form solution via group error compensation.
Experiments show that OBR enables aggressive W4A4KV4 quantization with 50%
sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory
reduction compared to the FP16-dense baseline.
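For context, here is a minimal sketch of the kind of second-order, layer-wise objective the abstract refers to, written in the classical Optimal Brain Surgeon / GPTQ form. The abstract does not spell out OBR's exact objective or its group compensation, so the formulas below are an illustrative assumption rather than the paper's derivation. Perturbing a weight row w by \delta w changes the layer-wise reconstruction loss approximately as

\Delta \mathcal{L} \approx \tfrac{1}{2}\, \delta w^{\top} H\, \delta w, \qquad H = 2 X X^{\top},

where X collects calibration inputs. Forcing coordinate q to a fixed target t_q (t_q = 0 when pruned, t_q = \mathrm{quant}(w_q) when quantized) and minimizing \Delta \mathcal{L} over the remaining coordinates gives the standard closed-form compensation

\delta w = -\,\frac{w_q - t_q}{[H^{-1}]_{qq}}\; H^{-1} e_q .

The toy NumPy sketch below, assuming this OBS/GPTQ-style formulation, walks one weight row left to right, prunes or quantizes each entry, and spreads the resulting error onto the not-yet-processed entries via the inverse Hessian. It is a simplification (a single fixed inverse Hessian is reused instead of being updated after each step), the names obs_compensate_row, prune_mask, and scale are hypothetical, and this is not OBR's group error compensation.

import numpy as np

def obs_compensate_row(w, H_inv, prune_mask, scale):
    # Process one weight row left to right: force each entry to zero (pruned)
    # or to the nearest quantization grid point, then absorb the induced error
    # into the not-yet-processed entries using the inverse-Hessian column.
    w = w.astype(np.float64).copy()
    for q in range(w.shape[0]):
        target = 0.0 if prune_mask[q] else np.round(w[q] / scale) * scale
        err = (w[q] - target) / H_inv[q, q]
        w[q:] -= err * H_inv[q:, q]   # also drives w[q] onto its target
    return w

# Toy usage on random data: damped Hessian from calibration inputs,
# 50% magnitude pruning plus uniform quantization with step `scale`.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 256))
H_inv = np.linalg.inv(2.0 * X @ X.T + 1e-2 * np.eye(16))
w = rng.standard_normal(16)
prune_mask = np.abs(w) < np.quantile(np.abs(w), 0.5)
w_hat = obs_compensate_row(w, H_inv, prune_mask, scale=0.1)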