TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Abstract

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence rises as the success rate drops and remains high even after convergence, which destabilizes training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long according to a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
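As a rough illustration of the temporal curriculum idea, the sketch below caps the number of turns exposed to the student and raises that cap as training progresses. The abstract does not specify the form of the schedule, so the linear ramp, the function names (max_depth, truncate_trajectory), and the parameters (start_depth, final_depth) are all assumptions made for this example, not the authors' implementation.

```python
from typing import List, TypeVar

T = TypeVar("T")


def max_depth(step: int, total_steps: int,
              start_depth: int = 2, final_depth: int = 30) -> int:
    """Hypothetical linear curriculum: turn budget at a given training step.

    ASSUMPTION: a linear ramp from start_depth to final_depth. The source
    only says depth grows "from short to long with a curriculum schedule".
    """
    if total_steps <= 1:
        return final_depth
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_depth + round(frac * (final_depth - start_depth))


def truncate_trajectory(turns: List[T], step: int, total_steps: int,
                        start_depth: int = 2, final_depth: int = 30) -> List[T]:
    """Keep only the first max_depth(...) turns of a student rollout.

    ASSUMPTION: the distillation loss would then be computed on the
    retained turns only; how the loss is applied is not stated in the source.
    """
    depth = max_depth(step, total_steps, start_depth, final_depth)
    return turns[:depth]
```

Early in training, truncate_trajectory(rollout, step=0, total_steps=101) would keep only the first two turns, and by the final step it would keep up to thirty. Shallow rollouts leave less room for inter-turn errors to compound, which is the intuition the abstract gives for keeping the student within the teacher's effective support.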
