Learning from Language Feedback via Variational Policy Distillation

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by using language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this limitation, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies. In the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
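
To make the E-step/M-step alternation concrete, the following is a minimal toy sketch in pure Python. It is not the VPD algorithm: it assumes a single decision over four candidate tokens, a known scalar reward per token in place of language feedback, and a fixed temperature `eta` in place of the adaptive trust region. All names (`e_step`, `m_step`, `run`, `eta`, `lr`) are hypothetical. The E-step builds a target distribution as the closed-form maximizer of expected reward minus a KL penalty to the student, and the M-step takes one gradient step on the student logits toward that target.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def e_step(student_probs, rewards, eta):
    """Toy teacher update (assumption, not the paper's update).

    Returns q(y) proportional to pi(y) * exp(r(y) / eta), the closed-form
    maximizer of E_q[r] - eta * KL(q || pi). Larger eta keeps q closer to pi.
    """
    weights = [p * math.exp(r / eta) for p, r in zip(student_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def m_step(logits, target, lr):
    """Toy student update: one gradient step on KL(target || softmax(logits)).

    The gradient of that KL with respect to the logits is softmax(logits) - target.
    """
    probs = softmax(logits)
    return [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]


def run(steps=50, eta=1.0, lr=0.5):
    """Alternate E- and M-steps; return expected-reward history and final policy."""
    rewards = [0.0, 0.0, 1.0, 0.0]  # hypothetical: only token 2 is "correct"
    logits = [0.0, 0.0, 0.0, 0.0]   # student starts uniform
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        history.append(sum(p * r for p, r in zip(probs, rewards)))
        target = e_step(probs, rewards, eta)   # E-step: improved target distribution
        logits = m_step(logits, target, lr)    # M-step: distill target into student
    return history, softmax(logits)


if __name__ == "__main__":
    history, final_probs = run()
    print("expected reward, first vs last step:", history[0], history[-1])
    print("final student distribution:", final_probs)
```

Because the target is recomputed from the current student at every iteration, it keeps moving ahead of the student instead of staying fixed, which is the co-evolution idea in miniature. A real system would replace the scalar rewards with feedback-conditioned teacher predictions over full rollouts.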