APPO: Agentic Procedural Policy Optimization
June 10, 2026
Authors: Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu
cs.AI
Abstract
Recent advances in agentic reinforcement learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, which makes it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch, and how to assign credit after branching. Our pilot analysis shows that influential decision points are distributed broadly throughout the generated sequence rather than concentrated at tool calls, and that token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with the policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and preserving behavioral interpretability.
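The abstract says that the Branching Score combines token uncertainty with likelihood gains, but it does not give the formula. The sketch below is therefore only an illustration of the general idea, not the authors' method: the product form, the clipping of negative gains, the top-k selection, and the names `token_entropy`, `branching_scores`, and `select_branch_points` are all assumptions made here. The likelihood gains are taken as given inputs, since the abstract does not describe how they are computed.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    entropies: Sequence[float],
    likelihood_gains: Sequence[float],
) -> List[float]:
    """Hypothetical score: entropy times the non-negative part of the gain.

    ASSUMPTION: this product form is illustrative only. The abstract states
    that the two signals are combined, but not how.
    """
    if len(entropies) != len(likelihood_gains):
        raise ValueError("inputs must have the same length")
    return [h * max(g, 0.0) for h, g in zip(entropies, likelihood_gains)]


def select_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up for illustration): three positions in a generated sequence.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # maximal entropy, but no downstream gain
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic token
    [0.40, 0.30, 0.20, 0.10],  # uncertain token with a large gain
]
gains = [0.0, 0.5, 0.8]  # hypothetical likelihood gains per position

entropies = [token_entropy(d) for d in dists]
scores = branching_scores(entropies, gains)
picked = select_branch_points(scores, k=1)
print(picked)
```

In this toy setup, position 0 has the highest entropy but a zero gain, so its score is zero and it is not selected. That mirrors the abstract's point that entropy alone can flag spurious positions, though the real scoring rule may differ substantially.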