Reinforcement Learning via Self-Distillation
January 28, 2026
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
cs.AI
Abstract
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explains why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
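The abstract describes the self-distillation step only at a high level. The minimal PyTorch sketch below illustrates one plausible reading of it: the same model, when additionally conditioned on the textual feedback, acts as a self-teacher whose next-token distribution over the failed attempt is distilled back into the feedback-free policy, yielding a dense per-token signal instead of a single scalar reward. All names (`TinyCausalLM`, `sdpo_distillation_loss`), the toy model, the KL direction, and the context layout (feedback inserted between prompt and attempt) are illustrative assumptions, not the paper's specification.

```python
# A self-contained sketch, assuming a simple instantiation of the self-distillation
# idea described in the abstract; the paper's actual objective may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCausalLM(nn.Module):
    """Toy causal LM standing in for the policy model (illustrative only)."""

    def __init__(self, vocab_size: int = 100, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:  # ids: (batch, seq)
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)  # logits: (batch, seq, vocab)


def sdpo_distillation_loss(model, prompt, attempt, feedback):
    """Distill feedback-informed predictions back into the feedback-free policy.

    Teacher: the same model conditioned on [prompt, feedback, attempt],
             i.e. it sees why the attempt failed (assumed context layout).
    Student: the model conditioned on [prompt, attempt] only (the deployed policy).
    The teacher is held fixed (no_grad); gradients flow only through the student.
    """
    student_ctx = torch.cat([prompt, attempt], dim=1)
    teacher_ctx = torch.cat([prompt, feedback, attempt], dim=1)

    # Logits at the positions that predict each attempt token (causal shift by one).
    n = attempt.size(1)
    student_logits = model(student_ctx)[:, -n - 1:-1, :]
    with torch.no_grad():  # the self-teacher only provides targets
        teacher_logits = model(teacher_ctx)[:, -n - 1:-1, :]

    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(teacher || student): a dense, per-token signal rather than one scalar reward.
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyCausalLM()
    prompt = torch.randint(0, 100, (2, 16))    # tokenized question
    attempt = torch.randint(0, 100, (2, 24))   # failed rollout
    feedback = torch.randint(0, 100, (2, 12))  # e.g. runtime error or judge text
    loss = sdpo_distillation_loss(model, prompt, attempt, feedback)
    loss.backward()  # a policy optimizer step would follow in training
    print(float(loss))
```

The same template would cover the scalar-feedback case mentioned in the abstract by substituting a successful rollout for the explicit feedback tokens, again as an assumption about how "implicit feedback" is supplied rather than a statement of the paper's procedure.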