Model-Preserving Adaptive Rounding

Abstract

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
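To make the role of a Kronecker-factored Hessian more concrete, the following is a minimal sketch, not YAQA's sketching procedure or rounding algorithm. It assumes a layer Hessian of the form h_out ⊗ h_in (both factors are random positive semi-definite stand-ins here) and a plain round-to-nearest quantizer, and only illustrates that the second-order error of a weight perturbation can be evaluated from the two small factors without forming the full matrix. All function and variable names are hypothetical.

```python
import numpy as np


def random_psd(dim, rng):
    """Return a random symmetric PSD matrix (a stand-in for a Hessian factor)."""
    a = rng.standard_normal((dim, dim))
    return a @ a.T / dim


def quadratic_full(delta, h_out, h_in):
    """Evaluate vec(delta)^T (h_out kron h_in) vec(delta) with the dense Kronecker product."""
    v = delta.reshape(-1)  # row-major vectorization of the (m, n) perturbation
    return float(v @ np.kron(h_out, h_in) @ v)


def quadratic_factored(delta, h_out, h_in):
    """Same quantity, computed as tr(delta^T h_out delta h_in^T) from the factors only."""
    return float(np.sum(delta * (h_out @ delta @ h_in.T)))


rng = np.random.default_rng(0)
m, n = 6, 8                      # output and input dimensions of a toy linear layer
h_out = random_psd(m, rng)       # assumed output-side factor (m x m)
h_in = random_psd(n, rng)        # assumed input-side factor (n x n)

w = rng.standard_normal((m, n))
w_hat = np.round(w * 4) / 4      # round-to-nearest on a uniform grid with step 0.25
delta = w_hat - w                # quantization perturbation

full = quadratic_full(delta, h_out, h_in)
factored = quadratic_factored(delta, h_out, h_in)
assert np.isclose(full, factored)

# Storage comparison: dense Hessian entries versus the two factors.
print(f"dense entries: {(m * n) ** 2}, factored entries: {m * m + n * n}")
```

The dense form needs (m·n)² entries, while the factored form needs only m² + n², which is why factored approximations of this kind remain feasible at scales where the full layerwise Hessian cannot be stored.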
