RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

March 3, 2025
Authors: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
cs.AI

Abstract

Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g., tokens with large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformations) to the model to mitigate outliers (values with exceptionally large magnitude), (2) scales each token's features based on its importance, and (3) quantizes the model using the GPTQ framework with second-order statistics computed from the scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses each token's attention score as its importance, as the best strategy. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
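
A minimal sketch of the scaling step described above, assuming PyTorch-style tensors: per-token importance is taken from attention scores (the "attention concentration" strategy), each token's features are weighted accordingly, and the resulting second-order statistic stands in for the unweighted one a GPTQ-style quantizer would normally accumulate. Function names, tensor shapes, and the normalization are illustrative assumptions, not the paper's released implementation.

```python
# Hedged sketch of RSQ-style token scaling; names and shapes are assumptions.
import torch

def attention_concentration(attn: torch.Tensor) -> torch.Tensor:
    """Per-token importance as the attention mass each token receives.

    attn: (num_heads, seq_len, seq_len) attention probabilities for one layer.
    Returns a (seq_len,) importance vector, normalized to mean 1.
    """
    # Average over heads, then sum over queries to get attention flowing into each key token.
    importance = attn.mean(dim=0).sum(dim=0)            # (seq_len,)
    return importance / importance.mean().clamp_min(1e-8)

def scaled_second_order_stats(x: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    """Second-order statistic H = sum_i s_i * x_i x_i^T, a weighted GPTQ Hessian proxy.

    x: (seq_len, hidden_dim) layer inputs (after any rotation / orthogonal transform).
    importance: (seq_len,) per-token weights from attention concentration.
    """
    x_scaled = x * importance.sqrt().unsqueeze(-1)      # weight each token's features
    return x_scaled.T @ x_scaled                        # (hidden_dim, hidden_dim)

# Usage (handles are hypothetical):
# attn = layer_attention_probs                          # (heads, T, T)
# x    = layer_inputs                                   # (T, hidden_dim)
# H    = scaled_second_order_stats(x, attention_concentration(attn))
# H would then replace the unweighted X^T X passed to a GPTQ-style solver.
```

Scaling by the square root of the importance means each token contributes to the quadratic statistic in proportion to its importance, which is what lets the quantizer prioritize reconstruction error on the tokens deemed important.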
