

ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

September 15, 2025
Authors: Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
cs.AI

Abstract

As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
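To make the reward-guided filtering idea concrete, the sketch below scores candidate tool-calling trajectories with an outcome reward model and keeps only the high-scoring ones for fine-tuning. This is a minimal illustration, not the authors' implementation: the checkpoint name "toolrm-placeholder", the scalar-classifier head, and the score threshold are all assumptions for the example.

```python
# Minimal sketch of reward-guided filtering for tool-calling fine-tuning data.
# Assumption: the reward model is a sequence classifier emitting one scalar score;
# "toolrm-placeholder" is a hypothetical checkpoint identifier, not a real model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "toolrm-placeholder"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
reward_model.eval()


def score(prompt: str, tool_call: str) -> float:
    """Return a scalar reward for a (prompt, tool call) pair."""
    inputs = tokenizer(prompt, tool_call, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()


def filter_trajectories(samples, threshold: float = 0.0):
    """Keep candidate trajectories whose reward exceeds the threshold."""
    return [s for s in samples if score(s["prompt"], s["tool_call"]) > threshold]


# Usage: sample several candidate tool calls per prompt, then keep the
# high-reward ones as the fine-tuning set.
candidates = [
    {"prompt": "What is the weather in Paris?",
     "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}'},
    {"prompt": "What is the weather in Paris?",
     "tool_call": '{"name": "get_weather", "arguments": {}}'},  # missing argument
]
finetune_set = filter_trajectories(candidates)
```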