Model-Preserving Adaptive Rounding
May 29, 2025
Authors: Albert Tseng, Zhaofeng Sun, Christopher De Sa
cs.AI
Abstract
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
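For intuition, here is a minimal, hypothetical sketch (not the authors' code) contrasting the usual local activation-error proxy with a Kronecker-factored objective of the kind the abstract describes. It assumes the two factors are estimated K-FAC-style from layer input activations and from gradients of the full-model KL divergence with respect to the layer's outputs; the function names and estimator details are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical illustration (not the authors' code) of the two objectives
# discussed in the abstract. Names and estimator details are assumptions.
import torch

def local_activation_hessian(X):
    """Standard local PTQ proxy: H = E[x x^T] over calibration inputs X (n, d_in).
    Minimizing tr(dW @ H @ dW.T) = E[||dW x||^2] ignores subsequent layers."""
    return X.T @ X / X.shape[0]

def kronecker_sketch(X, G):
    """K-FAC-style Kronecker-factored sketch of a layer's Hessian with respect
    to the full-model KL divergence: H ~ A (x) B, with
      A = E[x x^T] from layer inputs X (n, d_in), and
      B = E[g g^T] from gradients G (n, d_out) of the KL w.r.t. layer outputs."""
    A = X.T @ X / X.shape[0]
    B = G.T @ G / G.shape[0]
    return A, B

def rounding_objective(dW, A, B):
    """Quadratic rounding cost under the Kronecker approximation:
    vec(dW)^T (A (x) B) vec(dW) = tr(B @ dW @ A @ dW.T) for dW of shape (d_out, d_in)."""
    return torch.trace(B @ dW @ A @ dW.T)
```

Keeping only the two factors A and B lets a rounding step evaluate this cost without ever materializing the full layerwise Hessian, which is what makes a sketch of this form plausible at the hundred-billion-parameter scale mentioned in the abstract.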