APPO: Agentic Procedural Policy Optimization

Abstract

Recent advances in agentic reinforcement learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, which makes it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch, and how to assign credit after branching. Our pilot analysis shows that influential decision points are distributed broadly throughout the generated sequence rather than concentrated at tool calls, and that token entropy alone does not reliably reflect a position's impact on the final outcome. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points within the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with the policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and preserving behavioral interpretability.
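
To make the branching idea concrete, the following is a minimal sketch and not the formulation used by APPO: the abstract does not give the exact form of the Branching Score, so the sketch assumes a simple product of token entropy and the positive part of a per-position likelihood gain. The function names, the product form, and the treatment of the likelihood gain as a precomputed input are all illustrative assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(entropies, likelihood_gains):
    """Hypothetical score: entropy times the positive part of the gain.

    ASSUMPTION: the product form is illustrative only. likelihood_gains
    are taken as given, one number per position; how they are computed
    is not specified in the abstract.
    """
    return [h * max(g, 0.0) for h, g in zip(entropies, likelihood_gains)]


def select_branch_points(scores, k):
    """Return the indices of the k highest-scoring positions, in sequence order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy example: position 0 has the highest entropy but zero likelihood gain,
# so it is filtered out despite being the most uncertain position.
distributions = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.5, 0.5],
]
entropies = [token_entropy(p) for p in distributions]
gains = [0.0, 2.0, 1.0]  # hypothetical per-position likelihood gains
scores = branching_scores(entropies, gains)
branch_points = select_branch_points(scores, k=1)
```

In this toy setting the only selected branch point is position 2, which illustrates how combining uncertainty with a second signal can discard a high-entropy position that entropy-only selection would have chosen.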