ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
January 18, 2026
Authors: Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo
cs.AI
Abstract
Reward-guided search methods have shown strong potential for improving tool-using agents by guiding sampling and exploration over complex action spaces. At their core, these methods rely on process reward models (PRMs) to provide step-level rewards, which enables finer-grained monitoring. However, systematic and reliable benchmarks for evaluating PRMs in tool-using settings are still lacking. In this paper, we introduce ToolPRMBench, a large-scale benchmark designed specifically to evaluate PRMs for tool-using agents. ToolPRMBench is built on several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. We also propose a multi-LLM verification pipeline to reduce label noise and ensure data quality. We conduct extensive experiments with large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.
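To make the step-level test case concrete, the following is a minimal Python sketch of how one such case might be represented and how a PRM could be scored on it. It is an illustration only, not the benchmark's actual schema or metric: the class name `StepCase`, its field names, the `pairwise_accuracy` function, and the toy scorer are all hypothetical, and the assumption that a PRM is judged by whether it scores the correct action above the incorrect alternative is ours, since the abstract does not specify the evaluation protocol.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepCase:
    """One step-level test case (hypothetical field names)."""
    history: List[str]   # interaction history up to this step
    chosen: str          # the correct action
    rejected: str        # a plausible but incorrect alternative
    tool_metadata: Dict[str, str] = field(default_factory=dict)


# A scorer maps (history, candidate action, tool metadata) to a scalar reward.
Scorer = Callable[[List[str], str, Dict[str, str]], float]


def pairwise_accuracy(cases: List[StepCase], score: Scorer) -> float:
    """Fraction of cases where the correct action gets a strictly higher score.

    This metric is an assumption made for illustration.
    """
    if not cases:
        return 0.0
    wins = sum(
        score(c.history, c.chosen, c.tool_metadata)
        > score(c.history, c.rejected, c.tool_metadata)
        for c in cases
    )
    return wins / len(cases)


def toy_scorer(history: List[str], action: str, tools: Dict[str, str]) -> float:
    """Stand-in for a PRM: rewards actions whose tool name exists in the metadata."""
    tool_name = action.split("(", 1)[0]
    return 1.0 if tool_name in tools else 0.0


cases = [
    # The alternative misspells the tool name, so the toy scorer catches it.
    StepCase(
        history=["user: What is the weather in Paris?"],
        chosen="get_weather(city='Paris')",
        rejected="get_wether(city='Paris')",
        tool_metadata={"get_weather": "Return current weather for a city."},
    ),
    # The alternative uses a valid tool with a wrong argument; the toy scorer ties.
    StepCase(
        history=["user: Find flights to Rome."],
        chosen="search_flights(dest='ROM')",
        rejected="search_flights(dest='LON')",
        tool_metadata={"search_flights": "Search flights by destination code."},
    ),
]

print(pairwise_accuracy(cases, toy_scorer))
```

The second case shows why a name-matching heuristic is not enough: the incorrect alternative calls a valid tool with the wrong argument, so telling the two actions apart requires reading the interaction history, which is the kind of judgment a PRM is expected to make.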