Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
September 26, 2025
Authors: Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
cs.AI
Abstract
Reinforcement learning (RL) is the dominant paradigm for sharpening the
strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded
agentic tasks, yet it faces the fundamental challenge of the
exploration-exploitation trade-off.
Existing studies stimulate exploration through the lens of policy entropy, but
such mechanical entropy maximization is prone to RL training instability due
to multi-turn distribution shift. In this paper, we target a progressive
exploration-exploitation balance guided by the agent's own experiences,
without succumbing to either entropy collapse or runaway
divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL)
recipe for training agentic LLMs. It extends the vanilla SIL framework, where a
replay buffer stores self-generated promising trajectories for off-policy
updates, by gradually steering policy evolution within a well-balanced
entropy range across stages. Specifically, our approach incorporates a curriculum
to manage the exploration process, utilizing intrinsic rewards to foster
skill-level exploration and facilitating action-level exploration through SIL.
Initially, an auxiliary tool-call reward plays a critical role in the
accumulation of tool-use skills, enabling broad exposure to the unfamiliar
distribution of environment feedback, accompanied by an upward entropy trend.
As training progresses, self-imitation is strengthened to exploit existing
successful patterns from replayed experiences for comparative action-level
exploration, accelerating solution iteration without unbounded entropy growth.
To further stabilize training, we recalibrate the advantages of experiences in
the replay buffer to address potential policy drift. Regularizations, such as
the clipping of tokens with high covariance between probability and advantage,
are introduced into the trajectory-level entropy control to curb
over-confidence.
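
To make the last two points concrete, the following is a minimal, illustrative PyTorch sketch of one plausible instantiation of the advantage recalibration and covariance-based token clipping described above; it is not the authors' released implementation. The function names (recalibrate_advantages, covariance_clip_mask, sil_loss), the clipped importance-ratio reweighting, and the quantile threshold are assumptions introduced here for illustration.

import torch


def recalibrate_advantages(stored_adv, behavior_logp, current_logp, ratio_clip=5.0):
    # Importance ratio between the current policy and the behavior policy that
    # originally generated the replayed trajectory; clipping keeps the
    # off-policy correction bounded when the policy has drifted far.
    ratio = torch.exp(current_logp - behavior_logp)
    ratio = torch.clamp(ratio, max=ratio_clip)
    return stored_adv * ratio


def covariance_clip_mask(token_logp, token_adv, quantile=0.98):
    # Per-token contribution to the covariance between log-probability and
    # advantage; tokens in the top tail are those where high confidence
    # coincides with high advantage and most strongly push entropy down,
    # so they are dropped from the loss.
    cov_term = (token_logp - token_logp.mean()) * (token_adv - token_adv.mean())
    threshold = torch.quantile(cov_term, quantile)
    return cov_term <= threshold  # True = keep this token in the update


def sil_loss(current_logp, behavior_logp, stored_adv):
    # Self-imitation term over one replayed trajectory: only positive
    # (recalibrated) advantages contribute, so the policy reinforces its own
    # past successes rather than its failures.
    adv = recalibrate_advantages(stored_adv, behavior_logp, current_logp)
    keep = covariance_clip_mask(current_logp, adv)
    positive_adv = torch.clamp(adv, min=0.0)
    return -(current_logp * positive_adv)[keep].mean()


if __name__ == "__main__":
    T = 32  # toy number of tokens in one replayed trajectory
    behavior_logp = torch.randn(T) - 2.0
    current_logp = behavior_logp + 0.1 * torch.randn(T)  # mild policy drift
    stored_adv = torch.randn(T)
    print("self-imitation loss:", sil_loss(current_logp, behavior_logp, stored_adv).item())

The positive-advantage filter in sil_loss reflects the basic self-imitation principle of replaying only promising trajectories, while the mask and ratio clipping sketch how off-policy updates from the buffer can be kept stable under policy drift.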