TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
April 27, 2026
Authors: Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
cs.AI
Abstract
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
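The core mechanism described in the abstract, limiting how many turns of a trajectory the student is distilled on and expanding that limit over training, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear schedule, and the depth bounds are all assumptions for exposition.

```python
# Hypothetical sketch of a temporal curriculum over trajectory depth,
# in the spirit of TCOD. The linear schedule and all names/parameters
# are illustrative assumptions, not the paper's exact formulation.

def curriculum_depth(step: int, total_steps: int,
                     min_depth: int = 2, max_depth: int = 20) -> int:
    """Maximum number of agent turns the student is trained on at `step`.

    Expands linearly from `min_depth` to `max_depth` over training, so
    early updates distill only short trajectory prefixes, where the
    student is less likely to drift outside the teacher's effective
    support through compounded errors.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return min_depth + round(frac * (max_depth - min_depth))


def truncate_trajectory(trajectory: list, step: int, total_steps: int) -> list:
    """Keep only the first `curriculum_depth(...)` turns for distillation.

    `trajectory` is a list of per-turn records; later turns are dropped
    until the curriculum expands far enough to include them.
    """
    return trajectory[: curriculum_depth(step, total_steps)]
```

With these defaults, the student sees only 2-turn prefixes at the start of training and full 20-turn trajectories by the end; the per-turn distillation loss (e.g. a KL between student and teacher token distributions) would then be computed only on the retained turns.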