RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

Abstract

Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. In this paper, however, we demonstrate that better quantized models can be obtained by prioritizing learning from important tokens (e.g., tokens with large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformations) to the model to mitigate outliers (values with exceptionally large magnitude), (2) scales each token's feature based on its importance, and (3) quantizes the model using the GPTQ framework with second-order statistics computed from the scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses the attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ generalizes across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
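Steps (2) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the min-max mapping of importance into a range [r_min, r_max], and the default values of r_min and r_max are assumptions made for illustration. The sketch takes the attention each token receives as its importance, scales each token feature by that importance, and accumulates the second-order statistics that a GPTQ-style solver would consume. The rotation step and the GPTQ weight update itself are omitted.

```python
def attention_concentration(attn, r_min=0.1, r_max=1.0):
    """Per-token importance from an attention map (illustrative assumption).

    attn[i][j] is the attention that query token i pays to key token j.
    A token's raw importance is the total attention it receives, which is
    then min-max mapped into [r_min, r_max] (hypothetical default range).
    """
    n = len(attn)
    received = [sum(attn[i][j] for i in range(n)) for j in range(n)]
    lo, hi = min(received), max(received)
    if hi == lo:
        # All tokens receive equal attention: fall back to uniform weights.
        return [r_max] * n
    return [r_min + (r_max - r_min) * (c - lo) / (hi - lo) for c in received]


def scaled_second_moment(tokens, importance):
    """Second-order statistics of importance-scaled token features.

    Each token feature x_t is scaled by its importance r_t, and the result
    is the sum over tokens of (r_t * x_t)(r_t * x_t)^T, a d x d matrix.
    """
    d = len(tokens[0])
    H = [[0.0] * d for _ in range(d)]
    for x, r in zip(tokens, importance):
        s = [r * v for v in x]  # scaled token feature
        for a in range(d):
            for b in range(d):
                H[a][b] += s[a] * s[b]
    return H


if __name__ == "__main__":
    # Toy causal attention map over three tokens (each row sums to 1).
    attn = [[1.0, 0.0, 0.0],
            [0.5, 0.5, 0.0],
            [0.4, 0.3, 0.3]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    r = attention_concentration(attn)
    print(r)
    print(scaled_second_moment(tokens, r))
```

Setting every importance to the same value recovers the uniform statistics used by standard layer-wise quantization up to a constant factor, so the scaling only changes which tokens dominate the reconstruction objective.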
