Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
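The routing and weighting ideas summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the correctness threshold, the weight `1 - H / log(V)`, the KL direction, and all function names below are assumptions chosen only to make the idea concrete. Student probabilities are assumed strictly positive.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]

def route_samples(rewards, threshold=1.0):
    """Split rollout indices: correct ones go to the reward-based branch,
    failed ones to the distillation branch (threshold is an assumption)."""
    correct = [i for i, r in enumerate(rewards) if r >= threshold]
    failed = [i for i, r in enumerate(rewards) if r < threshold]
    return correct, failed

def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weight(teacher_probs):
    """Hypothetical weight in [0, 1]: 1 for a one-hot target, 0 for a uniform one."""
    vocab = len(teacher_probs)
    if vocab < 2:
        return 1.0
    return 1.0 - entropy(teacher_probs) / math.log(vocab)

def weighted_distillation_loss(student_token_probs, teacher_token_probs):
    """Entropy-weighted KL(teacher || student), averaged over token positions.
    The KL direction is an assumption made for this sketch."""
    total = 0.0
    for s, t in zip(student_token_probs, teacher_token_probs):
        kl = sum(tp * math.log(tp / sp) for tp, sp in zip(t, s) if tp > 0.0)
        total += entropy_weight(t) * kl
    return total / len(teacher_token_probs)
```

In this sketch, a group with rewards `[1.0, 0.0, 1.0, 0.0]` would send rollouts 0 and 2 to the advantage-based update and rollouts 1 and 3 to the distillation loss, where token positions with a near-uniform teacher distribution contribute almost nothing.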
