ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

January 18, 2026
Authors: Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo
cs.AI

Abstract

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. At their core, these methods use process reward models (PRMs) to provide step-level rewards, enabling finer-grained monitoring. However, systematic and reliable evaluation benchmarks for PRMs in tool-using settings are lacking. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. We also propose a multi-LLM verification pipeline to reduce label noise and ensure data quality. We conduct extensive experiments on ToolPRMBench across large language models, general PRMs, and tool-specialized PRMs. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.
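As a rough illustration of the step-level test cases described in the abstract, the sketch below models one case as a small record and scores a reward model by how often it ranks the correct action above the incorrect alternative. Everything here is an assumption made for illustration: the field names, the `StepCase` and `pairwise_accuracy` helpers, the toy scorer, and the pairwise metric itself are hypothetical and are not taken from the ToolPRMBench code or data format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepCase:
    """One step-level test case (hypothetical field names, not the benchmark schema)."""
    history: List[str]       # interaction history up to this step
    correct_action: str      # the action labeled correct
    incorrect_action: str    # the plausible but incorrect alternative
    tool_metadata: Dict[str, str] = field(default_factory=dict)


# A scorer maps (case, candidate action) to a scalar step-level reward.
Scorer = Callable[[StepCase, str], float]


def pairwise_accuracy(cases: List[StepCase], score: Scorer) -> float:
    """Fraction of cases where the correct action gets the strictly higher score."""
    if not cases:
        return 0.0
    wins = sum(
        score(c, c.correct_action) > score(c, c.incorrect_action) for c in cases
    )
    return wins / len(cases)


def toy_scorer(case: StepCase, action: str) -> float:
    """Stand-in for a PRM: rewards actions whose tool name is a known tool."""
    tool_name = action.split("(", 1)[0]
    return 1.0 if tool_name in case.tool_metadata else 0.0


cases = [
    StepCase(
        history=["user: what is the weather in Paris?"],
        correct_action='get_weather(city="Paris")',
        incorrect_action='get_wether(city="Paris")',  # misspelled tool name
        tool_metadata={"get_weather": "Return current weather for a city."},
    ),
    StepCase(
        history=["user: book a table for two"],
        correct_action="book_table(party_size=2)",
        incorrect_action="book_table(party_size=4)",  # wrong argument value
        tool_metadata={"book_table": "Reserve a restaurant table."},
    ),
]

accuracy = pairwise_accuracy(cases, toy_scorer)
print(accuracy)
```

The toy scorer only checks tool names, so it catches the misspelled tool in the first case but cannot separate the two actions in the second, where only an argument differs. A real PRM would replace `toy_scorer` with a model call that takes the history and tool metadata into account.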