Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability because of multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, in which a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering policy evolution within a well-balanced entropy range across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, using intrinsic rewards to foster skill-level exploration and SIL to facilitate action-level exploration. Early in training, the auxiliary tool-call reward plays a critical role in accumulating tool-use skills, enabling broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations such as clipping tokens with high covariance between probability and advantage are introduced for trajectory-level entropy control to curb over-confidence.
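
The abstract gives no formulas, so the following is only a toy sketch of the three mechanisms it names: a replay buffer of promising self-generated trajectories with advantage recalibration, a curriculum that shifts weight from the tool-call reward to self-imitation, and clipping of high-covariance tokens. Every name here (`Trajectory`, `ReplayBuffer`, `curriculum_weights`, `high_covariance_mask`), the linear schedule, the scalar baseline, and the top-fraction clipping rule are assumptions made for illustration, not SPEAR's actual implementation.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Trajectory:
    tokens: list        # hypothetical: actions/tokens of one episode
    reward: float       # outcome reward observed at collection time
    advantage: float    # advantage estimated at collection time


@dataclass
class ReplayBuffer:
    capacity: int = 4
    items: list = field(default_factory=list)

    def add(self, traj):
        # Keep only trajectories that looked promising when collected.
        if traj.advantage > 0:
            self.items.append(traj)
            self.items = self.items[-self.capacity:]  # drop the oldest

    def recalibrated(self, current_baseline):
        # Assumption: re-estimate each stored advantage against a baseline
        # of the *current* policy, and skip trajectories that no longer
        # beat it. This is one simple way to limit policy drift.
        out = []
        for traj in self.items:
            adv = traj.reward - current_baseline
            if adv > 0:
                out.append((traj, adv))
        return out


def curriculum_weights(step, total_steps, warmup_frac=0.5):
    # Assumed linear schedule: the tool-call reward weight decays from 1 to 0
    # while the self-imitation weight ramps from 0 to 1 over the warmup.
    progress = min(1.0, step / (warmup_frac * total_steps))
    return {"tool_call_reward": 1.0 - progress, "sil_loss": progress}


def high_covariance_mask(logprobs, advantages, clip_frac=0.2):
    # Per-token covariance term between log-probability and advantage.
    mu_lp, mu_adv = mean(logprobs), mean(advantages)
    cov = [(lp - mu_lp) * (a - mu_adv) for lp, a in zip(logprobs, advantages)]
    # Assumption: exclude the top `clip_frac` of tokens by covariance.
    k = int(len(cov) * clip_frac)
    top = set(sorted(range(len(cov)), key=lambda i: cov[i], reverse=True)[:k])
    # True means the token still contributes to the update.
    return [i not in top for i in range(len(cov))]


buf = ReplayBuffer(capacity=2)
buf.add(Trajectory(["search", "answer"], reward=1.0, advantage=0.5))
buf.add(Trajectory(["answer"], reward=0.0, advantage=-0.5))  # not stored
buf.add(Trajectory(["search"], reward=0.6, advantage=0.1))
kept = buf.recalibrated(current_baseline=0.8)
weights = curriculum_weights(step=25, total_steps=100)
mask = high_covariance_mask([-0.1, -2.0, -1.0, -0.5], [1.0, -1.0, 0.0, 0.0], clip_frac=0.25)
```

In this sketch, a trajectory that was promising when collected can fall out of the self-imitation batch once the current baseline overtakes its reward, which is the intuition behind recalibrating stored advantages rather than reusing them as-is.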
