OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Abstract

Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards offer little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses a step-level skill when a critical decision is identified and otherwise falls back to the episode-level skill as default guidance. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both the original and the skill-augmented context. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop, and search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at https://github.com/jinyangwu/OPID/tree/main.
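To make the routing and advantage construction concrete, the following is a minimal sketch under stated assumptions, not the released implementation. The abstract fixes neither the sign convention of the log-probability shift nor how the two advantages are combined. Here the shift is assumed to be the skill-augmented log-probability minus the original one, and the combination is assumed to be additive with a hypothetical weight `beta`. The `Skills` container and all function names are illustrative. The per-token log-probabilities stand in for the old policy's two re-scoring passes over the same sampled response.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Skills:
    """Hindsight skills from one completed trajectory (hypothetical container)."""
    episode_skill: str  # global workflow or failure-avoidance rule
    step_skills: Dict[int, str] = field(default_factory=dict)  # critical step -> local hint


def route_skill(skills: Skills, step: int) -> str:
    """Critical-first routing: use the step-level skill if this step is
    marked critical, otherwise fall back to the episode-level skill."""
    return skills.step_skills.get(step, skills.episode_skill)


def distill_advantage(logp_plain: List[float], logp_skill: List[float]) -> List[float]:
    """Token-level log-probability shift of the SAME sampled response.
    Assumed convention: skill-augmented minus original context."""
    if len(logp_plain) != len(logp_skill):
        raise ValueError("both scoring passes must cover the same tokens")
    return [s - p for p, s in zip(logp_plain, logp_skill)]


def combine(outcome_adv: float, distill_adv: List[float], beta: float = 0.1) -> List[float]:
    """Broadcast the trajectory-level outcome advantage to every token and add
    the weighted distillation term (additive form and beta are assumptions)."""
    return [outcome_adv + beta * d for d in distill_adv]


# Toy example with made-up numbers.
skills = Skills(
    episode_skill="Check every receptacle before giving up.",
    step_skills={3: "Open the drawer before trying to take the key."},
)
chosen_critical = route_skill(skills, step=3)  # step 3 is critical
chosen_default = route_skill(skills, step=1)   # no step-level skill here

logp_plain = [-2.0, -1.0, -0.5]  # old policy, original context
logp_skill = [-1.0, -1.0, -1.5]  # old policy, skill-augmented context
shift = distill_advantage(logp_plain, logp_skill)
token_adv = combine(outcome_adv=0.5, distill_adv=shift, beta=0.5)
print(shift)      # [1.0, 0.0, -1.0]
print(token_adv)  # [1.0, 0.5, 0.0]
```

In this toy case the skill makes the first token more likely, leaves the second unchanged, and makes the third less likely, so the per-token advantage is raised, kept, and lowered relative to the shared outcome advantage of 0.5. The outcome term remains the baseline signal for every token, consistent with RL staying the primary objective.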