Self-Distilled Agentic Reinforcement Learning
May 14, 2026
Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic for two reasons. First, instability compounds across turns and destabilizes the supervision signal. Second, skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization rather than from a genuinely poor student token. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
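The gating idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the "gap" is taken to be the teacher log-probability minus the student log-probability of each sampled token, the gate is `sigmoid(beta * gap)`, and the auxiliary term is added to the RL loss with a weight `lam`. The names `token_gates`, `combined_loss`, `beta`, and `lam` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_gates(
    teacher_logps: Sequence[float],
    student_logps: Sequence[float],
    beta: float = 1.0,  # hypothetical gate sharpness, not from the paper
) -> List[float]:
    """Map per-token gaps to gates in (0, 1).

    Assumption: gap = teacher_logp - student_logp. A positive gap
    (teacher endorses the token more than the student) yields a gate
    above 0.5; a negative gap (teacher rejection) is softly attenuated
    toward 0. In an autograd framework the gap would be detached
    (stop-gradient) so the gate only reweights the distillation term.
    """
    return [sigmoid(beta * (t - s)) for t, s in zip(teacher_logps, student_logps)]


def combined_loss(
    rl_loss: float,
    distill_losses: Sequence[float],
    gates: Sequence[float],
    lam: float = 0.1,  # hypothetical weight of the auxiliary term
) -> float:
    """RL loss plus the gate-weighted mean of per-token distillation losses."""
    gated = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return rl_loss + lam * gated


if __name__ == "__main__":
    # Three tokens: teacher-endorsed, neutral, teacher-rejected.
    gates = token_gates([-0.5, -1.0, -3.0], [-2.5, -1.0, -1.0])
    print(gates)
    print(combined_loss(1.0, [1.0, 1.0, 1.0], gates))
```

Under these assumptions the RL term is left untouched and the gate only rescales the token-level distillation signal, which mirrors the description of OPSD as an auxiliary objective with RL as the backbone.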