QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals
January 31, 2026
Authors: Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang
cs.AI
Abstract
Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term "protecting both ends". Upon validating this hypothesis, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying each channel's average quadratic value with its count of zero weight updates, we compute a channel importance that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (covering supervised fine-tuning, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers consistent improvements for LRM quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. QuantLRM also supports non-fine-tuned LRMs by gathering effective signals via pseudo-fine-tuning, which greatly enhances its applicability.
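The channel-importance computation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the exact form of the restricted quadratic and its fitting procedure are not given in the abstract, so the U-shaped function `f(d) = a * (|d| - m)**2` (high at both the smallest and largest update magnitudes, minimal at an intermediate magnitude `m`) and the choice of `m` as the mean magnitude are assumptions for illustration.

```python
import numpy as np

def channel_importance(delta_w, a=1.0, m=None):
    """Hypothetical sketch of the "protect both ends" channel score.

    delta_w: array of shape (out_channels, in_channels) holding the
    weight updates accumulated during fine-tuning for one layer.

    A restricted quadratic f(d) = a * (|d| - m)**2 assigns high values
    to both the smallest and largest update magnitudes (U-shape), with
    its minimum at an intermediate magnitude m; the quadratic form and
    the default m are illustrative assumptions, not the paper's fit.
    """
    mag = np.abs(delta_w)
    if m is None:
        m = mag.mean()  # assumption: center the quadratic at the mean magnitude
    quad = a * (mag - m) ** 2                 # U-shaped per-weight importance
    avg_quad = quad.mean(axis=1)              # per-channel average quadratic value
    zero_count = (delta_w == 0).sum(axis=1)   # weights left unchanged by fine-tuning
    return avg_quad * zero_count              # channel importance score
```

Under this sketch, channels whose updates concentrate at both extremes, and which contain many untouched weights, receive the highest scores and would be protected most strongly during quantization.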