

ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

September 15, 2025
Authors: Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
cs.AI

Abstract

As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
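To make the reward-guided filtering idea concrete, the sketch below scores candidate tool-calling trajectories with an outcome reward model and keeps only the high-scoring ones for fine-tuning. This is a minimal illustration, not the authors' implementation: the checkpoint name "toolrm-placeholder", the scalar-classifier head, and the score threshold are all assumptions for the example.

```python
# Minimal sketch of reward-guided filtering for tool-calling fine-tuning data.
# Assumption: the reward model is a sequence classifier emitting one scalar score;
# "toolrm-placeholder" is a hypothetical checkpoint identifier, not a real model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "toolrm-placeholder"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
reward_model.eval()


def score(prompt: str, tool_call: str) -> float:
    """Return a scalar reward for a (prompt, tool call) pair."""
    inputs = tokenizer(prompt, tool_call, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()


def filter_trajectories(samples, threshold: float = 0.0):
    """Keep candidate trajectories whose reward exceeds the threshold."""
    return [s for s in samples if score(s["prompt"], s["tool_call"]) > threshold]


# Usage: sample several candidate tool calls per prompt, then keep the
# high-reward ones as the fine-tuning set.
candidates = [
    {"prompt": "What is the weather in Paris?",
     "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}'},
    {"prompt": "What is the weather in Paris?",
     "tool_call": '{"name": "get_weather", "arguments": {}}'},  # missing argument
]
finetune_set = filter_trajectories(candidates)
```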