Model-Preserving Adaptive Rounding
May 29, 2025
Authors: Albert Tseng, Zhaofeng Sun, Christopher De Sa
cs.AI
Abstract
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
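For intuition, here is a minimal, hypothetical sketch (not the authors' code) contrasting the usual local activation-error proxy with a Kronecker-factored objective of the kind the abstract describes. It assumes the two factors are estimated K-FAC-style from layer input activations and from gradients of the full-model KL divergence with respect to the layer's outputs; the function names and estimator details are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical illustration (not the authors' code) of the two objectives
# discussed in the abstract. Names and estimator details are assumptions.
import torch

def local_activation_hessian(X):
    """Standard local PTQ proxy: H = E[x x^T] over calibration inputs X (n, d_in).
    Minimizing tr(dW @ H @ dW.T) = E[||dW x||^2] ignores subsequent layers."""
    return X.T @ X / X.shape[0]

def kronecker_sketch(X, G):
    """K-FAC-style Kronecker-factored sketch of a layer's Hessian with respect
    to the full-model KL divergence: H ~ A (x) B, with
      A = E[x x^T] from layer inputs X (n, d_in), and
      B = E[g g^T] from gradients G (n, d_out) of the KL w.r.t. layer outputs."""
    A = X.T @ X / X.shape[0]
    B = G.T @ G / G.shape[0]
    return A, B

def rounding_objective(dW, A, B):
    """Quadratic rounding cost under the Kronecker approximation:
    vec(dW)^T (A (x) B) vec(dW) = tr(B @ dW @ A @ dW.T) for dW of shape (d_out, d_in)."""
    return torch.trace(B @ dW @ A @ dW.T)
```

Keeping only the two factors A and B lets a rounding step evaluate this cost without ever materializing the full layerwise Hessian, which is what makes a sketch of this form plausible at the hundred-billion-parameter scale mentioned in the abstract.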