AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
March 15, 2026
Authors: Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin
cs.AI
Abstract
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.