Learning from Language Feedback via Variational Policy Distillation

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
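The E-step/M-step alternation described above can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: it assumes a single-state, three-token toy problem, and it replaces the feedback-conditioned teacher and its adaptive trust-region update with the textbook closed-form solution of a KL-regularized reward maximization. All function names, the `temperature` and `lr` parameters, and the reward vector are hypothetical choices made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def e_step(student_probs, rewards, temperature):
    """Toy teacher update (assumption, not the paper's update rule).

    Closed-form maximizer of E_q[r] - temperature * KL(q || student):
    q(a) is proportional to student(a) * exp(r(a) / temperature).
    The KL term plays the role of a fixed (non-adaptive) trust region.
    """
    weights = [p * math.exp(r / temperature)
               for p, r in zip(student_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def m_step(student_logits, teacher_probs, lr):
    """One gradient step that reduces KL(teacher || student).

    For a softmax student, the gradient of the cross-entropy with respect
    to the logits is (student - teacher), so we step along (teacher - student).
    """
    student_probs = softmax(student_logits)
    return [l + lr * (q - p)
            for l, q, p in zip(student_logits, teacher_probs, student_probs)]


def run_toy_em(rewards, steps=50, temperature=1.0, lr=0.5):
    """Alternate E- and M-steps; return the student distribution per step."""
    logits = [0.0] * len(rewards)  # uniform initial student
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        history.append(probs)
        teacher = e_step(probs, rewards, temperature)  # teacher tracks the student
        logits = m_step(logits, teacher, lr)           # student distills the teacher
    history.append(softmax(logits))
    return history


if __name__ == "__main__":
    history = run_toy_em([0.0, 0.0, 1.0])
    print("initial student:", history[0])
    print("final student:  ", history[-1])
```

In this toy setting the teacher is recomputed from the current student at every iteration, so the distillation target keeps moving as the student improves. That is the property the abstract contrasts with a fixed, passive teacher; everything else about the sketch is a simplification.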