Reinforcement-aware Knowledge Distillation for LLM Reasoning

Abstract

Reinforcement learning (RL) post-training has recently driven major gains in large language models (LLMs) that perform long chain-of-thought reasoning, but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT) and rely on fixed teacher traces or on regularization based on the teacher-student Kullback-Leibler (KL) divergence. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when doing so improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a mixture of the teacher and the old policy. This yields advantage-aware, trust-region-bounded distillation on student rollouts and naturally balances exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
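
To make the idea of a likelihood-ratio objective anchored to a teacher and old-policy mixture more concrete, the following is a minimal sketch of one possible reading, not the paper's implementation. The abstract does not give the exact form of the mixture, so the arithmetic mixture, the weight `alpha`, the clipping range `clip_eps`, and both function names are assumptions made here for illustration. The group-standardized advantage follows the usual GRPO convention.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trrd_token_objective(logp_student, logp_old, logp_teacher, advantage,
                         alpha=0.5, clip_eps=0.2):
    """Clipped surrogate for a single token of a student rollout.

    ASSUMPTION: the anchor is the arithmetic mixture
    alpha * pi_teacher + (1 - alpha) * pi_old. The abstract only says the
    ratio is anchored to a teacher and old-policy mixture.
    """
    # Anchor probability of the sampled token under the assumed mixture.
    anchor = alpha * math.exp(logp_teacher) + (1.0 - alpha) * math.exp(logp_old)
    # Likelihood ratio of the current student against the anchor.
    ratio = math.exp(logp_student) / anchor
    # PPO-style clipping keeps the update inside a trust region around the anchor.
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Pessimistic (minimum) surrogate, as in PPO; this value is maximized.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    value = trrd_token_objective(
        logp_student=math.log(0.2),
        logp_old=math.log(0.2),
        logp_teacher=math.log(0.8),
        advantage=advantages[0],
    )
    print(advantages, value)
```

Under these assumptions, setting `alpha=0` reduces the ratio to the standard PPO/GRPO ratio against the old policy, so the teacher influences the update only through the anchor and no separate KL loss weight has to be tuned.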
