TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Abstract

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller student models. Although it is effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that the KL divergence rises as the success rate drops, and that the KL remains high even after convergence, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate across turns, the student is driven beyond the teacher's effective support, which makes the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long according to a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and improves KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
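The idea of expanding the exposed trajectory depth from short to long can be illustrated with a minimal sketch. The abstract does not specify the shape of the curriculum schedule, so the linear ramp below is an assumption, and the names `curriculum_depth`, `truncate_trajectory`, `min_depth`, and `max_depth` are hypothetical helpers introduced only for illustration; they are not the authors' implementation.

```python
from typing import List, TypeVar

T = TypeVar("T")


def curriculum_depth(step: int, total_steps: int, min_depth: int, max_depth: int) -> int:
    """Return the number of turns exposed to the student at a training step.

    Assumption: a linear ramp from min_depth to max_depth. The source only
    says the depth grows "from short to long with a curriculum schedule".
    """
    if total_steps <= 1:
        return max_depth
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_depth + round(frac * (max_depth - min_depth))


def truncate_trajectory(turns: List[T], depth: int) -> List[T]:
    """Keep only the first `depth` turns of a rollout (hypothetical helper)."""
    return turns[:depth]


if __name__ == "__main__":
    rollout = [f"turn_{i}" for i in range(30)]  # placeholder multi-turn rollout
    for step in (0, 250, 500, 999):
        depth = curriculum_depth(step, total_steps=1000, min_depth=2, max_depth=30)
        exposed = truncate_trajectory(rollout, depth)
        print(step, depth, len(exposed))
```

Under this assumed schedule, early updates see only the first few turns, where the student is more likely to remain within the teacher's effective support, and later updates see full-length trajectories.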
