

Reinforcement Learning via Self-Distillation

January 28, 2026
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
cs.AI

Abstract

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
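The abstract describes SDPO's mechanism only at a high level: the same model, conditioned on the rich textual feedback, acts as a self-teacher, and its feedback-informed next-token predictions are distilled back into the policy. The following minimal sketch illustrates one plausible form of such a self-distillation loss. It is not the paper's implementation; it assumes PyTorch, per-position next-token logits of shape [T, V] over the rollout tokens, and a forward-KL objective with the detached self-teacher as the target.

```python
# Minimal sketch (not the paper's code) of a feedback-informed
# self-distillation loss, assuming PyTorch and per-position
# next-token logits over the rollout tokens.
import torch
import torch.nn.functional as F


def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Distill the self-teacher's next-token predictions into the policy.

    Args:
        student_logits: [T, V] logits for the T rollout tokens, produced by
            the policy with a context that does NOT contain the feedback.
        teacher_logits: [T, V] logits at the same positions, produced by the
            SAME model but with the rich textual feedback (e.g. a runtime
            error or judge evaluation) included in the context.
    """
    # The self-teacher is treated as a fixed target: no gradient flows
    # through the feedback-conditioned pass.
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    # Forward KL(teacher || student), averaged over rollout positions.
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")


# Toy usage: random logits stand in for two forward passes of the same
# language model (vocabulary size 8, rollout length 5).
if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(5, 8, requires_grad=True)
    teacher = torch.randn(5, 8)
    loss = self_distillation_loss(student, teacher)
    loss.backward()
    print(float(loss))
```

The abstract does not specify the exact divergence or any weighting over positions; the sketch only makes concrete the idea that teacher and student are the same network, differing solely in whether the feedback appears in the context.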