Identifying Sensitive Weights via Post-quantization Integral

Abstract

Serving Large Language Models (LLMs) is costly. Post-training weight quantization can reduce this cost in two ways: it compresses model size to fit limited memory, and it saves bandwidth, which speeds up inference. Because not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of the weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate the impact of quantization on the loss function by orders of magnitude, mainly because of the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
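The gap between a local second-order estimate and the true loss change can be illustrated on a toy problem. The sketch below is not the paper's implementation: the abstract does not give the definition of PQI, so the path-integral form used here (averaging the gradient along the straight line from the original weights to the quantized weights) is an assumption suggested by the metric's name. The separable cosh loss, the round-to-nearest quantizer, and all function names are hypothetical and chosen only so the example runs with the Python standard library.

```python
import math

K = 20.0  # curvature scale of the toy loss (hypothetical)

def loss(w, centers):
    """Toy separable loss: one cosh bowl per weight, minimized at `centers`."""
    return sum(math.cosh(K * (wi - ci)) - 1.0 for wi, ci in zip(w, centers))

def grad(w, centers):
    """Element-wise gradient of the toy loss."""
    return [K * math.sinh(K * (wi - ci)) for wi, ci in zip(w, centers)]

def hess_diag(w, centers):
    """Diagonal of the Hessian (the toy loss is separable, so it is diagonal)."""
    return [K * K * math.cosh(K * (wi - ci)) for wi, ci in zip(w, centers)]

def quantize(w, step=0.5):
    """Round-to-nearest uniform quantizer, used only for illustration."""
    return [step * round(wi / step) for wi in w]

def taylor_sensitivity(w, wq, centers):
    """Per-weight second-order Taylor estimate of the loss change at w."""
    g, h = grad(w, centers), hess_diag(w, centers)
    return [gi * (q - wi) + 0.5 * hi * (q - wi) ** 2
            for gi, hi, wi, q in zip(g, h, w, wq)]

def path_integral_sensitivity(w, wq, centers, n=64):
    """Per-weight loss change from integrating the gradient along the
    straight path w -> wq (midpoint rule with n sample points)."""
    delta = [q - wi for wi, q in zip(w, wq)]
    acc = [0.0] * len(w)
    for j in range(n):
        t = (j + 0.5) / n
        point = [wi + t * di for wi, di in zip(w, delta)]
        for i, (gi, di) in enumerate(zip(grad(point, centers), delta)):
            acc[i] += gi * di / n
    return acc

centers = [0.24, -0.26, 0.76, 0.02]
w = list(centers)            # pretend the trained weights sit at the minimum
wq = quantize(w)

true_delta = loss(wq, centers) - loss(w, centers)
taylor_total = sum(taylor_sensitivity(w, wq, centers))
integral_total = sum(path_integral_sensitivity(w, wq, centers))

print(f"true loss change:      {true_delta:.3f}")
print(f"2nd-order Taylor:      {taylor_total:.3f}")
print(f"path-integral estimate: {integral_total:.3f}")
```

In this toy setting the Taylor estimate falls well short of the true loss change because the quantization step moves the weights outside the region where the quadratic model is accurate, while the path integral recovers the true change up to numerical integration error. How large the gap is on a real LLM, and how the paper discretizes the integral, cannot be inferred from this sketch.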
