Reinforcement Learning from Rich Feedback with Distributional DAgger

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, in which the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient performs rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior objectives for RL with self-distillation, based on reverse KL or Jensen-Shannon divergence, fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys regret guarantees. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
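To make the forward cross-entropy idea concrete, here is a minimal sketch of the per-state term only: the cross-entropy from an expert action distribution to the student's distribution, averaged over states the student itself visited. It is not the DistIL implementation. The function names, the toy `expert_fn` and `student_fn` callables, and the tabular setup are all assumptions made for illustration, and the sketch deliberately omits the sequence-level gradient term that propagates future disagreement back to earlier decisions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_cross_entropy(expert_probs, student_logits):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    student = softmax(student_logits)
    return -sum(p * math.log(q) for p, q in zip(expert_probs, student) if p > 0)


def forward_ce_grad(expert_probs, student_logits):
    """Gradient of the forward CE w.r.t. the student logits.

    For a softmax student and expert probabilities that sum to 1,
    this is simply student_probs - expert_probs.
    """
    student = softmax(student_logits)
    return [q - p for p, q in zip(expert_probs, student)]


def on_policy_ce_loss(states, expert_fn, student_fn):
    """Average forward CE over states visited by the student's own rollouts.

    expert_fn(state)  -> list of action probabilities (hypothetical black-box expert)
    student_fn(state) -> list of action logits (hypothetical student policy)
    """
    losses = [forward_cross_entropy(expert_fn(s), student_fn(s)) for s in states]
    return sum(losses) / len(losses)
```

For instance, with an expert that puts all mass on the first of two actions and a uniform student, `forward_cross_entropy([1.0, 0.0], [0.0, 0.0])` equals `math.log(2)`, and one gradient step along `-forward_ce_grad(...)` moves probability toward the expert's action. The expert only needs to return probabilities at queried states, which is what allows it to be treated as a black box in this sketch.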