Identifying Sensitive Weights via Post-quantization Integral

Abstract

Serving Large Language Models (LLMs) is costly. Post-training weight quantization can address this problem by compressing model size to fit limited memory and by saving bandwidth for acceleration. As not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
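The gap between a local second-order estimate and the actual loss change can be illustrated on a toy problem. The sketch below is not the authors' implementation: the abstract does not give the formula for PQI, so the code assumes a path-integral form in which each weight's contribution is its perturbation times the gradient averaged along the straight line from the original to the perturbed weights. The loss function, the perturbation size, and all function names are hypothetical.

```python
import math

def loss(w):
    # Toy non-quadratic loss standing in for a model's loss (hypothetical).
    return math.exp(w[0] + 2.0 * w[1])

def grad(w):
    v = math.exp(w[0] + 2.0 * w[1])
    return [v, 2.0 * v]

def hessian(w):
    v = math.exp(w[0] + 2.0 * w[1])
    return [[v, 2.0 * v], [2.0 * v, 4.0 * v]]

def taylor_estimate(w, dw):
    # Local second-order estimate: g^T dw + 0.5 * dw^T H dw, evaluated at w only.
    g, h = grad(w), hessian(w)
    n = len(w)
    first = sum(g[i] * dw[i] for i in range(n))
    second = 0.5 * sum(dw[i] * h[i][j] * dw[j] for i in range(n) for j in range(n))
    return first + second

def path_integral_estimate(w, dw, steps=64):
    # Assumed path-integral form: per-weight contribution is
    # dw_i * integral_0^1 dL/dw_i(w + t * dw) dt, computed with the midpoint rule.
    contrib = [0.0] * len(w)
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [wi + t * di for wi, di in zip(w, dw)]
        g = grad(point)
        for i in range(len(w)):
            contrib[i] += g[i] * dw[i] / steps
    return contrib

w = [0.0, 0.0]
dw = [1.0, 1.0]  # deliberately large perturbation, playing the role of quantization error

exact = loss([wi + di for wi, di in zip(w, dw)]) - loss(w)
taylor = taylor_estimate(w, dw)
per_weight = path_integral_estimate(w, dw)

print("exact change:       ", exact)
print("second-order Taylor:", taylor)
print("path integral (sum):", sum(per_weight))
print("per-weight split:   ", per_weight)
```

On this toy loss the second-order estimate falls well short of the true change, while the summed path-integral contributions match it up to discretization error and also split the change across individual weights.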
