APPO: Agentic Procedural Policy Optimization
June 10, 2026
Authors: Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu
cs.AI
Abstract
Recent advances in agentic reinforcement learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, which makes it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, and that token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with the policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and preserving behavioral interpretability.
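To make the branching idea concrete, the following is a minimal sketch of how a score that combines token uncertainty with a likelihood gain might rank branching positions. The abstract does not give the Branching Score formula, so everything here is an assumption for illustration: the multiplicative form, the `alpha` exponent, the function names, and the premise that a per-position likelihood gain has already been computed elsewhere.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def branching_scores(token_probs, likelihood_gains, alpha=1.0):
    """Hypothetical score per position: entropy * max(gain, 0) ** alpha.

    token_probs: one next-token distribution per sequence position.
    likelihood_gains: assumed precomputed gain in the likelihood of the
        subsequent continuation at each position (not defined in the abstract).
    alpha: illustrative weighting exponent, not taken from the paper.
    """
    if len(token_probs) != len(likelihood_gains):
        raise ValueError("token_probs and likelihood_gains must align")
    return [
        token_entropy(p) * max(g, 0.0) ** alpha
        for p, g in zip(token_probs, likelihood_gains)
    ]


def select_branch_points(scores, k):
    """Return the indices of the k highest strictly positive scores, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(i for i in ranked[:k] if scores[i] > 0.0)


# Toy inputs: position 0 has high entropy but no likelihood gain,
# position 3 has a large gain but no uncertainty.
token_probs = [
    [0.5, 0.5],
    [0.9, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0],
]
likelihood_gains = [0.0, 0.8, 0.6, 1.2]

scores = branching_scores(token_probs, likelihood_gains)
branch_points = select_branch_points(scores, k=2)
```

With these toy inputs, positions 1 and 2 are selected. Position 0 is dropped despite its high entropy because its gain is zero, which mirrors the abstract's point about filtering out spurious high-entropy positions. An additive or learned combination would be an equally plausible reading of the abstract.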