Model-Preserving Adaptive Rounding

Abstract

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
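To make the role of a Kronecker-factored Hessian concrete, the following is a minimal NumPy sketch, not the YAQA algorithm itself. It assumes a layer Hessian of the form H_out ⊗ H_in and scores a rounding error ΔW by the quadratic form vec(ΔW)^T (H_out ⊗ H_in) vec(ΔW), evaluated without materializing the Kronecker product. The names `h_out`, `h_in`, and `kron_quadratic_form` are hypothetical, the factors are random positive semidefinite stand-ins rather than real Hessian sketches, and the rounding step is plain round-to-nearest on a toy grid.

```python
import numpy as np


def kron_quadratic_form(delta_w, h_out, h_in):
    """Return vec(dW)^T (h_out kron h_in) vec(dW) using row-major vec.

    Uses the identity (A kron B) vec(X) = vec(A X B^T), so the full
    (m*n) x (m*n) matrix is never formed.
    """
    return float(np.sum(delta_w * (h_out @ delta_w @ h_in.T)))


def random_psd(n, rng):
    """Random symmetric positive semidefinite matrix (illustrative stand-in)."""
    m = rng.standard_normal((n, n))
    return m @ m.T / n


rng = np.random.default_rng(0)
m_out, n_in = 4, 6                      # toy layer: 4 outputs, 6 inputs

h_out = random_psd(m_out, rng)          # hypothetical output-side factor
h_in = random_psd(n_in, rng)            # hypothetical input-side factor

w = rng.standard_normal((m_out, n_in))  # original weights
w_q = np.round(w * 4) / 4               # toy round-to-nearest, grid step 0.25
delta_w = w_q - w                       # rounding error

# Proxy error under the Kronecker-factored Hessian, computed two ways.
fast = kron_quadratic_form(delta_w, h_out, h_in)
dense = float(delta_w.ravel() @ np.kron(h_out, h_in) @ delta_w.ravel())

# With an identity output factor the same form reduces to
# tr(dW h_in dW^T), the shape of a purely local activation-error objective.
local = kron_quadratic_form(delta_w, np.eye(m_out), h_in)

print(f"factored: {fast:.6f}  dense: {dense:.6f}  local-only: {local:.6f}")
```

The factored evaluation costs two small matrix products instead of a product with an (m·n)×(m·n) matrix, which is what makes this kind of approximation usable at the scale of real linear layers.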
