Reinforcement Learning from Rich Feedback with Distributional DAgger
June 3, 2026
Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad
cs.AI
Abstract
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior objectives for RL with self-distillation, based on reverse KL or Jensen-Shannon divergence, fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
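As a rough illustration of the three divergences the abstract contrasts, the sketch below computes forward cross-entropy, reverse KL, and Jensen-Shannon divergence between an expert and a student next-token distribution, then averages the forward cross-entropy over states from a student rollout. This is not the paper's DistIL implementation: the function names, the toy distributions, and the plain per-state average are all assumptions made for illustration, and the sketch does not reproduce the sequence-level gradient the abstract describes.

```python
import math

# Hypothetical helpers for illustration only. All distributions are lists of
# probabilities over the same action set and are assumed strictly positive.

def forward_cross_entropy(expert, student):
    """H(expert, student) = -sum_a expert[a] * log student[a]."""
    return -sum(p * math.log(q) for p, q in zip(expert, student) if p > 0)

def _kl(a, b):
    """KL(a || b) = sum_x a[x] * log(a[x] / b[x])."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def reverse_kl(expert, student):
    """KL(student || expert): the expectation is taken under the student."""
    return _kl(student, expert)

def js_divergence(expert, student):
    """Jensen-Shannon divergence: average KL of each side to their mixture."""
    mix = [(p + q) / 2 for p, q in zip(expert, student)]
    return 0.5 * _kl(expert, mix) + 0.5 * _kl(student, mix)

def imitation_loss(expert_dists, student_dists):
    """Mean forward cross-entropy over states visited by the student's rollout."""
    losses = [forward_cross_entropy(e, s)
              for e, s in zip(expert_dists, student_dists)]
    return sum(losses) / len(losses)

# Toy rollout with two visited states and three actions per state (made-up numbers).
expert_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_dists = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3]]

loss = imitation_loss(expert_dists, student_dists)
print("forward cross-entropy loss:", loss)
print("reverse KL at state 0:", reverse_kl(expert_dists[0], student_dists[0]))
print("JS divergence at state 0:", js_divergence(expert_dists[0], student_dists[0]))
```

In this sketch the forward cross-entropy only needs the expert's probabilities as fixed targets, which is consistent with the abstract's point that the objective admits a black-box expert. The monotonic-improvement and regret claims are the paper's results and are not demonstrated by this toy example.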