Reinforcement Learning via Distributional DAgger with Rich Feedback

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, in which the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fails to guarantee monotonic policy improvement: even when the expert has higher reward, its updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
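To make the contrast between the two objective families concrete, the following is a minimal per-state sketch, not the paper's implementation. It assumes a small discrete action space, a tabular expert distribution queried at states the learner itself visited, and hypothetical names (`forward_cross_entropy`, `reverse_kl`, `expert_probs`). It treats the visited states as fixed, so it does not reproduce the sequence-level gradient described in the abstract, which also accounts for how earlier decisions change later states.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def forward_cross_entropy(learner_logits, expert_probs):
    """Mean over states of -sum_a p_E(a|s) * log pi(a|s).

    Only expert probabilities are needed, so the expert can be a black box.
    """
    logp = log_softmax(learner_logits)
    return float(-(expert_probs * logp).sum(axis=-1).mean())


def reverse_kl(learner_logits, expert_probs, eps=1e-12):
    """Mean over states of KL(pi || p_E), shown only for comparison."""
    logp = log_softmax(learner_logits)
    pi = np.exp(logp)
    return float((pi * (logp - np.log(expert_probs + eps))).sum(axis=-1).mean())


# Toy data (assumption): 3 states visited by the learner's rollout, 4 actions.
rng = np.random.default_rng(0)
learner_logits = rng.normal(size=(3, 4))
expert_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
])

fce = forward_cross_entropy(learner_logits, expert_probs)
rkl = reverse_kl(learner_logits, expert_probs)

# With states held fixed, the gradient of the forward cross-entropy with
# respect to the logits is (pi - p_E) per state, averaged over states.
pi = np.exp(log_softmax(learner_logits))
grad = (pi - expert_probs) / learner_logits.shape[0]

# Mean expert entropy: the minimum of the forward cross-entropy,
# reached when the learner matches the expert at every state.
expert_entropy = float(-(expert_probs * np.log(expert_probs)).sum(axis=-1).mean())
```

In this sketch the forward cross-entropy is minimized exactly when the learner matches the expert at each visited state, where it equals the mean expert entropy. The reverse KL is included only to show the alternative that the abstract argues against.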