QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

Abstract

Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term "protecting both ends". After validating this hypothesis, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. To protect both ends, we fit simple restricted quadratic functions to the weight updates. By multiplying each channel's average quadratic value by its count of zero weight updates, we compute a channel importance score that is more effective than scores based on activations or second-order information. We apply QuantLRM to quantize various fine-tuned models (covering supervised, direct preference optimization, and reinforcement learning fine-tuning) and evaluate on four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond). Empirically, QuantLRM delivers a consistent improvement for LRM quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. QuantLRM also supports non-fine-tuned LRMs by gathering effective signals via pseudo-fine-tuning, which greatly broadens its applicability.
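
The channel-importance computation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the form of the restricted quadratic or how it is fitted, so the sketch assumes a fixed convex parabola over min-max-scaled update magnitudes (value 1 at both ends, 0 at the midpoint) in place of a fitted one. The function name `channel_importance`, the choice of columns as channels, and the scaling step are all hypothetical.

```python
import numpy as np


def channel_importance(w_base, w_tuned, axis=0):
    """Toy channel-importance score from fine-tuning weight updates.

    Assumptions (not taken from the paper): the quadratic is a fixed
    convex parabola rather than a fitted one, and a "channel" is a
    column of the weight matrix (axis=0 reduces over rows).
    """
    delta = np.abs(w_tuned - w_base)  # per-weight update magnitude
    lo, hi = delta.min(), delta.max()
    if hi == lo:
        # Degenerate case: every update has the same magnitude.
        scaled = np.zeros_like(delta)
    else:
        scaled = (delta - lo) / (hi - lo)  # map magnitudes to [0, 1]

    # Convex quadratic that "protects both ends": it equals 1 for the
    # smallest and largest updates and 0 for mid-sized ones.
    quad = 4.0 * (scaled - 0.5) ** 2

    mean_quad = quad.mean(axis=axis)           # average quadratic value per channel
    zero_count = (delta == 0).sum(axis=axis)   # number of unchanged weights per channel
    return mean_quad * zero_count


# Small usage example with three channels (columns).
w_base = np.zeros((4, 3))
w_tuned = np.array([
    [0.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [1.0, 0.5, 0.5],
])
scores = channel_importance(w_base, w_tuned)
print(scores)
```

In this toy setting, a channel whose updates sit only at the extremes scores highest, while a channel with only mid-sized updates and no unchanged weights scores zero. Taking the product literally, as the abstract states it, means any channel without zero updates receives zero importance; whether the actual method smooths this case is not stated in the abstract.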
