Identifying Sensitive Weights via Post-quantization Integral

February 28, 2025
Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
cs.AI

Abstract

Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by compressing model size to fit limited memory and by saving bandwidth for acceleration. As not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose the Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
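
To make the abstract's central contrast concrete, here is a minimal PyTorch sketch (hypothetical names: loss_fn, w, w_q). The first function computes the local second-order Taylor estimate of the quantization-induced loss change that the paper finds inaccurate; the second numerically integrates the gradient along the straight path from the original to the quantized weights, which by the fundamental theorem of calculus recovers the true loss change up to quadrature error. This path-integral reading is an illustrative interpretation of PQI's motivation, not the paper's exact formulation.

```python
import torch

def taylor_estimate(loss_fn, w, w_q):
    """Local 2nd-order Taylor estimate of the loss change from quantization:
    dL ~= g^T dw + 0.5 * dw^T H dw -- the kind of gradient/Hessian metric
    the paper finds underestimates the true impact. Uses a Hessian-vector
    product instead of forming H explicitly."""
    dw = (w_q - w).detach()
    w = w.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_fn(w), w, create_graph=True)
    (hvp,) = torch.autograd.grad((g * dw).sum(), w)  # H @ dw
    return ((g.detach() * dw).sum() + 0.5 * (dw * hvp).sum()).item()

def path_integral_estimate(loss_fn, w, w_q, steps=16):
    """Midpoint-rule line integral of the gradient along the straight path
    from w to w_q. This equals the exact loss change L(w_q) - L(w) up to
    quadrature error, with no small-convergence-radius assumption.
    Illustrative reading of the PQI idea only."""
    dw = (w_q - w).detach()
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        wt = (w + t * dw).detach().requires_grad_(True)
        (g,) = torch.autograd.grad(loss_fn(wt), wt)
        total += (g * dw).sum().item() / steps
    return total
```

Likewise, the Dense-and-Sparse detach that ReQuant builds on can be sketched as holding out the largest-magnitude weights in a sparse, full-precision matrix while the dense remainder goes to the quantizer. The fixed magnitude threshold below is a stand-in; the paper's selection is self-adaptive and PQI-guided.

```python
import torch

def dense_and_sparse_detach(w, outlier_frac=0.005):
    """Split a weight tensor into a dense part (sent to the quantizer) and
    a sparse, full-precision part holding the largest-magnitude entries.
    The fixed threshold is a placeholder for ReQuant's adaptive selection."""
    k = max(1, int(outlier_frac * w.numel()))
    thresh = w.abs().flatten().topk(k).values.min()
    mask = w.abs() >= thresh
    sparse = (w * mask).to_sparse()  # outliers kept at original precision
    dense = w * (~mask)              # remainder to be quantized
    return dense, sparse
```

In the usual Dense-and-Sparse design, inference applies the quantized dense matrix and the sparse outlier matrix separately and sums their outputs.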
