Self-Distilled Agentic Reinforcement Learning

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic for two reasons. First, instability compounds across turns and destabilizes the supervision signal. Second, skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization rather than from a genuinely poor student action. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
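
The gating idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the gap is taken as the teacher log-probability minus the student log-probability of each sampled token, and the names `gate_weights`, `gated_objective`, `temperature`, and `aux_coef` are hypothetical. In a real training loop the gate would be computed from detached (stop-gradient) values so that it only reweights the distillation term.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(teacher_logp, student_logp, temperature=1.0):
    """Map per-token log-prob gaps to gate weights in (0, 1).

    Assumption: gap = teacher_logp - student_logp for each sampled token.
    A positive gap (teacher endorses the token more than the student)
    gives a weight above 0.5; a negative gap (teacher rejection) gives a
    weight below 0.5, so it is attenuated smoothly rather than dropped.
    These values are plain floats here; in an autodiff framework they
    would be detached before use.
    """
    return [
        sigmoid((t - s) / temperature)
        for t, s in zip(teacher_logp, student_logp)
    ]


def gated_objective(rl_loss, distill_losses, gates, aux_coef=0.1):
    """Combine the RL loss with a gated auxiliary distillation term.

    Assumption: the auxiliary term is the mean of gate * per-token
    distillation loss, scaled by a hypothetical coefficient aux_coef.
    """
    gated = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return rl_loss + aux_coef * gated


if __name__ == "__main__":
    # Toy values: token 0 is teacher-endorsed, token 1 is teacher-rejected.
    teacher = [-0.2, -3.0]
    student = [-1.5, -0.5]
    gates = gate_weights(teacher, student)
    print(gates)
    print(gated_objective(rl_loss=1.0, distill_losses=[1.3, 2.5], gates=gates))
```

In this sketch the RL loss enters the total unweighted, which mirrors the abstract's framing of RL as the primary objective and distillation as an auxiliary term; the relative scale set by `aux_coef` is illustrative only.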