Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

September 26, 2025
Authors: Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
cs.AI

Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, an auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distribution of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations such as the clipping of tokens with high covariance between probability and advantage are introduced for trajectory-level entropy control to curb over-confidence.
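
The abstract names the main moving parts (a replay buffer of self-generated wins, a curriculum that shifts weight from an intrinsic tool-call reward toward self-imitation, advantage recalibration under policy drift, and covariance-based token clipping) but gives no pseudocode. The sketch below is one possible reading of how those pieces could fit together; every name and schedule in it (SILReplayBuffer, curriculum_weights, shaped_return, recalibrate_advantage, cov_clip_mask, sil_token_loss, the linear curriculum, the clipped importance ratio, the quantile-based covariance mask) is an illustrative assumption, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of a SPEAR-style SIL recipe as described in the
# abstract. All class/function names and hyperparameters are assumptions.
import torch
from collections import deque


class SILReplayBuffer:
    """Stores self-generated promising trajectories for off-policy SIL updates."""

    def __init__(self, capacity=512, return_threshold=0.0):
        self.buffer = deque(maxlen=capacity)
        self.return_threshold = return_threshold

    def add(self, tokens, old_logprobs, ret):
        # Keep only trajectories whose return beats the threshold: "trust the wins".
        if ret > self.return_threshold:
            self.buffer.append((tokens, old_logprobs, ret))

    def sample(self, k):
        if not self.buffer:
            return []
        idx = torch.randint(len(self.buffer), (min(k, len(self.buffer)),))
        return [self.buffer[i] for i in idx]


def curriculum_weights(step, total_steps):
    """Early training emphasizes the intrinsic tool-call reward (skill-level
    exploration); later training emphasizes self-imitation of replayed wins."""
    progress = step / max(total_steps, 1)
    w_intrinsic = max(0.0, 1.0 - progress)   # decays over training
    w_sil = min(1.0, progress)               # ramps up over training
    return w_intrinsic, w_sil


def shaped_return(env_return, tool_call_bonus, step, total_steps):
    """Combine the sparse task return with the auxiliary tool-call reward,
    whose weight decays as the curriculum advances."""
    w_intrinsic, _ = curriculum_weights(step, total_steps)
    return env_return + w_intrinsic * tool_call_bonus


def recalibrate_advantage(ret, value_now, old_logprobs, new_logprobs, clip=5.0):
    """Re-center a replayed return against the *current* value estimate and
    temper it with a clipped importance ratio to account for policy drift."""
    ratio = torch.exp((new_logprobs - old_logprobs).sum()).clamp(max=clip)
    return ratio * torch.clamp(torch.as_tensor(ret - value_now), min=0.0)


def cov_clip_mask(logprobs, advantages, quantile=0.98):
    """Mask tokens whose (log-prob, advantage) covariance contribution is
    extreme, a simple proxy for the over-confidence clipping in the abstract."""
    contrib = (logprobs - logprobs.mean()) * (advantages - advantages.mean())
    threshold = torch.quantile(contrib, quantile)
    return (contrib <= threshold).float()


def sil_token_loss(new_logprobs, advantages, mask):
    """Self-imitation policy-gradient term: only positive-advantage tokens
    contribute, and high-covariance tokens are masked out."""
    adv = torch.clamp(advantages, min=0.0)
    return -(mask * new_logprobs * adv).sum() / mask.sum().clamp(min=1.0)
```

In this reading, curriculum_weights drives the staged shift the abstract describes, while recalibrate_advantage and cov_clip_mask correspond to the two stabilizers mentioned at the end: re-scoring stale replayed experiences against the current policy, and suppressing over-confident tokens during trajectory-level entropy control. The SIL term would be added to the usual on-policy RL objective with weight w_sil.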