Agentic Reinforced Policy Optimization

Abstract

Large-scale reinforcement learning with verifiable rewards (RLVR) has proven effective at harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often use external tools to assist in solving a task. However, current RL algorithms do not adequately balance a model's intrinsic long-horizon reasoning capability against its proficiency in multi-turn tool interaction. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. In preliminary experiments, we observe that LLMs tend to behave with high uncertainty immediately after interacting with external tools, as reflected in an increase in the entropy of the generated tokens. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling and step-level sampling, thereby promoting exploration at the high-uncertainty steps that follow tool use. By integrating advantage attribution estimation, ARPO also enables LLMs to internalize the advantage differences in stepwise tool-use interactions. Experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO.
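
To make the entropy observation in the abstract concrete, the following is a minimal sketch of how token-level entropy after a tool call could be compared with a baseline to decide whether extra step-level rollouts are worth sampling. It is an illustration under stated assumptions, not the ARPO implementation: the function names, the windowed averaging, and the fixed `threshold` are all hypothetical, and the released code may compute and use entropy differently.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(window: Sequence[Sequence[float]]) -> float:
    """Average token entropy over a short window of generated tokens."""
    return sum(token_entropy(p) for p in window) / len(window)


def should_branch(
    baseline_window: Sequence[Sequence[float]],
    post_tool_window: Sequence[Sequence[float]],
    threshold: float = 0.2,  # hypothetical value, chosen only for illustration
) -> bool:
    """Hypothetical rule: sample extra step-level rollouts when the mean
    entropy after a tool call exceeds the baseline by more than `threshold`."""
    delta = mean_entropy(post_tool_window) - mean_entropy(baseline_window)
    return delta > threshold


# Toy distributions over a 4-token vocabulary (made-up numbers).
confident = [[0.97, 0.01, 0.01, 0.01]] * 3  # low entropy before the tool call
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3  # high entropy after the tool call

branch_after_tool = should_branch(confident, uncertain)
branch_without_rise = should_branch(uncertain, confident)
```

In this toy setup the first call reports a large entropy rise and the second does not, which mirrors the idea of spending sampling budget on the uncertain steps that follow tool use while otherwise continuing with whole-trajectory sampling.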