

QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

January 31, 2026
作者: Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang
cs.AI

Abstract

Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term "protecting both ends". Upon validating this hypothesis, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying each channel's average quadratic value with its count of zero weight updates, we compute a channel importance score that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (including supervised, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers a consistent improvement for LRM quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. QuantLRM also supports non-fine-tuned LRMs by gathering effective signals via pseudo-fine-tuning, which greatly enhances its applicability.
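The channel-importance computation the abstract describes can be sketched roughly as below. The abstract does not specify the exact restricted-quadratic form, so this sketch uses a U-shaped quadratic centered at the median update magnitude (high for both the smallest and largest updates, i.e. "protecting both ends"); the function name `channel_importance` and the zero-update threshold `eps` are likewise assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def channel_importance(w_base, w_ft, eps=1e-8):
    """Sketch of a QuantLRM-style channel importance score.

    Assumption: the "restricted quadratic" is modeled here as
    (|delta| - median|delta|)^2, which is large at both ends of the
    update-magnitude range and small for intermediate magnitudes.
    """
    delta = w_ft - w_base                 # per-weight fine-tuning updates
    mag = np.abs(delta)
    mid = np.median(mag)                  # center of the U-shaped quadratic
    quad = (mag - mid) ** 2               # high at both ends, low in the middle
    mean_quad = quad.mean(axis=1)         # per-channel average quadratic value
    zero_count = (mag < eps).sum(axis=1)  # per-channel count of zero updates
    return mean_quad * zero_count         # abstract: multiply the two signals

# toy usage: 4 output channels, 8 weights each
rng = np.random.default_rng(0)
w_base = rng.normal(size=(4, 8))
w_ft = w_base + rng.normal(scale=0.01, size=(4, 8))
w_ft[0, :4] = w_base[0, :4]               # channel 0 keeps some weights unchanged
scores = channel_importance(w_base, w_ft)
```

In this toy run only channel 0 has exactly-zero updates, so only its score is nonzero; in a real fine-tuned checkpoint many weights remain untouched, which is presumably what makes the zero-update count an informative factor.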
PDF · March 16, 2026