Self-Distilled Agentic Reinforcement Learning

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic for two reasons. First, instability compounds across turns and destabilizes the supervision signal. Second, skill-conditioned privileged guidance requires asymmetric treatment of negative teacher rejections, which may arise from imperfect skill retrieval or utilization rather than from a genuinely bad student token. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
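
The gating idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the gap is taken as the teacher log-probability minus the student log-probability of the sampled token, the gate is sigmoid(beta * gap) with a hypothetical sharpness parameter beta, the per-token distillation term is the student's negative log-probability, and the auxiliary weight lam is likewise hypothetical. In a real training loop the gap would be detached from the computation graph so that the gate acts as a constant weight. Plain Python has no autograd, so this is only noted in a comment.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_distill_loss(student_logp, teacher_logp, beta=2.0):
    """Illustrative gated token-level distillation term (not the paper's exact loss).

    student_logp, teacher_logp: per-token log-probabilities that the student
    and the privileged teacher branch assign to the sampled tokens.
    beta: hypothetical gate sharpness (assumption).
    Returns (mean gated loss, list of per-token gates).
    """
    if len(student_logp) != len(teacher_logp) or not student_logp:
        raise ValueError("inputs must be non-empty and of equal length")

    gates = []
    total = 0.0
    for s, t in zip(student_logp, teacher_logp):
        # Assumed gap definition: positive when the teacher endorses the token
        # more strongly than the student. Would be detached in an autograd setting.
        gap = t - s
        gate = sigmoid(beta * gap)
        gates.append(gate)
        # Assumed per-token term: push up the student's log-prob, scaled by the gate.
        total += gate * (-s)
    return total / len(student_logp), gates


def combined_objective(rl_loss, distill_loss, lam=0.1):
    """RL stays the primary objective; distillation enters as a weighted auxiliary term."""
    return rl_loss + lam * distill_loss


if __name__ == "__main__":
    # Token 0: teacher endorses (positive gap). Token 1: teacher rejects (negative gap).
    loss, gates = gated_distill_loss([-2.0, -0.5], [-0.5, -2.0], beta=2.0)
    print("gates:", gates)
    print("total:", combined_objective(rl_loss=1.0, distill_loss=loss))
```

Under these assumptions, a positive-gap token receives a gate close to 1 and contributes nearly its full distillation signal, while a negative-gap token receives a gate close to 0. Its contribution is damped rather than dropped, which is one way to read "softly attenuating".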