Reinforcement Learning from Rich Feedback with Distributional DAgger

June 3, 2026
Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad
cs.AI

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, in which the learner has local access to an expert distribution on the states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient performs rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL-with-self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys regret guarantees. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL-with-self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
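
To make the distinction between the two families of objectives concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact DistIL objective, so this assumes the standard per-state definitions: forward cross-entropy H(p_E, pi) = -sum_a p_E(a|s) log pi(a|s) and reverse KL KL(pi || p_E) = sum_a pi(a|s) log(pi(a|s) / p_E(a|s)), each averaged over states visited by the student. All names here (`forward_cross_entropy`, `reverse_kl`, `on_policy_loss`, and the toy lookup tables) are hypothetical and chosen only for illustration.

```python
import math

def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """H(p_E, pi) = -sum_a p_E(a) * log pi(a) for a single state."""
    return -sum(p * math.log(max(q, eps))
                for p, q in zip(expert_probs, student_probs))

def reverse_kl(expert_probs, student_probs, eps=1e-12):
    """KL(pi || p_E) = sum_a pi(a) * log(pi(a) / p_E(a)) for a single state."""
    return sum(q * math.log(q / max(p, eps))
               for p, q in zip(expert_probs, student_probs) if q > 0)

def on_policy_loss(visited_states, expert, student, per_state_loss):
    """Average a per-state loss over states visited by the student.

    `expert` and `student` are callables mapping a state to a list of
    action probabilities. The expert is only queried, never differentiated,
    which is the sense in which it can be treated as a black box here.
    """
    losses = [per_state_loss(expert(s), student(s)) for s in visited_states]
    return sum(losses) / len(losses)

# Toy example (assumed data): two states, three actions each.
EXPERT_TABLE = {"s0": [0.7, 0.2, 0.1], "s1": [0.1, 0.1, 0.8]}
STUDENT_TABLE = {"s0": [0.2, 0.5, 0.3], "s1": [0.3, 0.3, 0.4]}

# In DAgger-style training these states would come from student rollouts.
visited = ["s0", "s1"]

fce = on_policy_loss(visited, EXPERT_TABLE.get, STUDENT_TABLE.get,
                     forward_cross_entropy)
rkl = on_policy_loss(visited, EXPERT_TABLE.get, STUDENT_TABLE.get,
                     reverse_kl)
print(f"forward cross-entropy: {fce:.4f}")
print(f"reverse KL:            {rkl:.4f}")
```

In this sketch the forward cross-entropy weights each log-probability by the expert's distribution, so it needs only the expert's probabilities at the queried states, whereas the reverse KL weights by the student's own distribution. The sketch illustrates the definitions only and does not reproduce the paper's monotonic-improvement or regret analysis.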