Learning from Language Feedback via Variational Policy Distillation

Abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
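The E-step/M-step alternation described above can be illustrated with a deliberately tiny sketch. It is not the method from this work: the abstract does not specify the teacher's trust-region update, so the sketch swaps in a REPS/MPO-style exponential tilt whose temperature is chosen by bisection to satisfy a KL budget. The "policy" is a single categorical distribution over four tokens, a fixed `reward` vector stands in for the outcome signal (no textual feedback is modelled), and all names (`e_step`, `m_step`, `eps`, `lr`) are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    z = [math.exp(x - m) for x in logits]
    s = sum(z)
    return [v / s for v in z]

def kl(p, q):
    """KL(p || q) for categorical distributions; q must have full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def e_step(prior, reward, eps, iters=60):
    """Assumed E-step: tilt `prior` toward reward with KL(q || prior) <= eps.

    q is proportional to prior * exp(reward / eta). The temperature eta is
    found by bisection, so the step adapts to the current prior and reward.
    """
    def tilt(eta):
        return softmax([math.log(p) + r / eta for p, r in zip(prior, reward)])

    lo, hi = 1e-2, 1e3                    # small eta = aggressive, large = cautious
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisection in log space
        if kl(tilt(mid), prior) > eps:
            lo = mid                      # moved too far: raise the temperature
        else:
            hi = mid
    return tilt(hi)                       # hi always satisfies the KL budget

def m_step(logits, teacher, lr=1.0, steps=20):
    """Assumed M-step: gradient descent on cross-entropy to the teacher."""
    for _ in range(steps):
        student = softmax(logits)
        # Gradient of the cross-entropy w.r.t. logits is (student - teacher).
        logits = [l + lr * (t - s) for l, t, s in zip(logits, teacher, student)]
    return logits

reward = [0.0, 0.0, 1.0, 0.0]             # only token 2 counts as correct
logits = [0.0] * 4                        # student starts uniform
for _ in range(10):
    teacher = e_step(softmax(logits), reward, eps=0.05)
    logits = m_step(logits, teacher)
final = softmax(logits)
print([round(p, 3) for p in final])
```

In this toy the teacher is re-derived from the current student each round, so its target keeps moving as the student improves; a fixed teacher would correspond to calling `e_step` once before the loop.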