Agentic Reinforced Policy Optimization
July 26, 2025
Authors: Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
cs.AI
Abstract
Large-scale reinforcement learning with verifiable rewards (RLVR) has
demonstrated its effectiveness in harnessing the potential of large language
models (LLMs) for single-turn reasoning tasks. In realistic reasoning
scenarios, LLMs can often utilize external tools to assist in task-solving
processes. However, current RL algorithms inadequately balance the models'
intrinsic long-horizon reasoning capabilities and their proficiency in
multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced
Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training
multi-turn LLM-based agents. Through preliminary experiments, we observe that
LLMs tend to exhibit highly uncertain behavior, characterized by an increase in
the entropy distribution of generated tokens, immediately following
interactions with external tools. Motivated by this observation, ARPO
incorporates an entropy-based adaptive rollout mechanism, dynamically balancing
global trajectory sampling and step-level sampling, thereby promoting
exploration at steps with high uncertainty after tool usage. By integrating an
advantage attribution estimation, ARPO enables LLMs to internalize advantage
differences in stepwise tool-use interactions. Our experiments across 13
challenging benchmarks in computational reasoning, knowledge reasoning, and
deep search domains demonstrate ARPO's superiority over trajectory-level RL
algorithms. Remarkably, ARPO achieves improved performance using only half of
the tool-use budget required by existing methods, offering a scalable solution
for aligning LLM-based agents with real-time dynamic environments. Our code and
datasets are released at https://github.com/dongguanting/ARPO.
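
To make the entropy-based adaptive rollout described above more concrete, the sketch below shows one way such a mechanism could be organized: generate a reasoning segment, call a tool, and fork additional partial rollouts whenever the token entropy of the post-tool segment rises sharply above the pre-tool baseline. This is a minimal illustration under assumptions, not the released ARPO implementation; `generate`, `call_tool`, `Trajectory`, and the hyperparameters `entropy_delta`, `branch_factor`, and `max_tool_calls` are hypothetical placeholders, not identifiers from the authors' code.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

def mean_token_entropy(token_dists: List[List[float]]) -> float:
    """Average Shannon entropy (nats) over a segment's per-token distributions."""
    ent = lambda p: -sum(x * math.log(x) for x in p if x > 0)
    return sum(ent(p) for p in token_dists) / max(len(token_dists), 1)

@dataclass
class Trajectory:
    tokens: List[str]
    branch_points: List[int] = field(default_factory=list)  # steps where we forked

def adaptive_rollout(
    generate: Callable[[List[str]], Tuple[List[str], List[List[float]]]],  # placeholder LLM step
    call_tool: Callable[[List[str]], str],                                 # placeholder tool call
    prompt: List[str],
    max_tool_calls: int = 6,
    entropy_delta: float = 0.3,
    branch_factor: int = 2,
) -> List[Trajectory]:
    """Entropy-based adaptive rollout (illustrative sketch only).

    Sample a global trajectory; whenever the segment produced right after a tool
    call is noticeably more uncertain than the pre-call segment, fork extra
    partial rollouts from that step so later updates see more samples there.
    """
    active = [Trajectory(tokens=list(prompt))]
    for step in range(max_tool_calls):
        next_active = []
        for traj in active:
            pre_seg, pre_dists = generate(traj.tokens)    # reasoning before the tool call
            traj.tokens += pre_seg
            traj.tokens.append(call_tool(pre_seg))        # tool feedback, e.g. a search result
            post_seg, post_dists = generate(traj.tokens)  # reasoning after the tool call
            traj.tokens += post_seg

            # Branch at high-uncertainty steps; a real system would also cap the total budget.
            if mean_token_entropy(post_dists) - mean_token_entropy(pre_dists) > entropy_delta:
                for _ in range(branch_factor - 1):
                    next_active.append(Trajectory(tokens=list(traj.tokens),
                                                  branch_points=traj.branch_points + [step]))
            next_active.append(traj)
        active = next_active
    return active
```

The recorded branch points mark where forked continuations share a common prefix; as the abstract describes, a stepwise advantage-attribution scheme would then assign credit separately to the shared prefix and to each branch rather than to whole trajectories only.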