TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
April 27, 2026
Authors: Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
cs.AI
Abstract
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller student models. While it is effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that the KL divergence rises as the success rate drops, and that it remains high even after training converges, leaving training unstable. This instability arises from inter-turn error compounding: as errors accumulate over turns, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long according to a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and improves KL stability throughout training, raising agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
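The abstract does not specify the curriculum schedule, so the following is only a minimal sketch of what "progressively expanding trajectory depth from short to long" could look like. It assumes a linear ramp over training steps; the function names, the default turn limits, and the `warmup_frac` parameter are all hypothetical and not taken from the paper.

```python
from typing import List, TypeVar

T = TypeVar("T")


def max_turns_at_step(
    step: int,
    total_steps: int,
    min_turns: int = 2,
    max_turns: int = 30,
    warmup_frac: float = 0.5,
) -> int:
    """Hypothetical linear curriculum on trajectory depth.

    Returns how many turns of a rollout the student is exposed to at a
    given training step. Depth grows linearly from `min_turns` to
    `max_turns` over the first `warmup_frac` of training, then stays at
    `max_turns`. All defaults are illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 < warmup_frac <= 1.0:
        raise ValueError("warmup_frac must be in (0, 1]")
    ramp_steps = max(1, int(total_steps * warmup_frac))
    progress = min(1.0, max(0, step) / ramp_steps)
    return min_turns + int(round(progress * (max_turns - min_turns)))


def truncate_trajectory(turns: List[T], depth: int) -> List[T]:
    """Keep only the first `depth` turns of a rollout (assumed truncation rule)."""
    return turns[: max(0, depth)]


if __name__ == "__main__":
    total = 100
    for step in (0, 25, 50, 100):
        print(step, max_turns_at_step(step, total))
```

Under this assumed schedule, early training steps would compute the distillation loss only on the first few turns, where the student is less likely to have drifted outside the teacher's effective support, and later steps would cover the full trajectory.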