ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
September 15, 2025
Authors: Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
cs.AI
Abstract
As large language models (LLMs) increasingly interact with external tools,
reward modeling for tool use has become a critical yet underexplored area.
Existing reward models, trained primarily on natural language outputs, struggle
to evaluate tool-based reasoning and execution. To quantify this gap, we
introduce FC-RewardBench, the first benchmark designed to systematically assess
reward models' performance in tool-calling scenarios. Our analysis shows that
current reward models often miss key signals of effective tool use,
highlighting the need for domain-specific modeling. To address this, we propose
a training framework for outcome-based reward models using data synthesized
from permissively licensed, open-weight LLMs. We train models ranging from 1.7B
to 14B parameters and evaluate them across seven out-of-domain benchmarks.
These models consistently outperform general-purpose baselines, achieving up to
25% average improvement in downstream task performance and enabling
data-efficient fine-tuning through reward-guided filtering.
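
The abstract mentions reward-guided filtering as a route to data-efficient fine-tuning. As an illustration only, the Python sketch below shows one plausible form such filtering could take: an outcome reward model scores candidate tool calls, and only the top-scoring fraction is kept as fine-tuning data. The `reward_guided_filter` helper and `toy_reward_model` are hypothetical names introduced here, not the paper's implementation.

```python
# Hypothetical sketch of reward-guided filtering: score candidate tool-call
# examples with an outcome reward model and keep only the highest-scoring
# fraction for fine-tuning. The reward model below is a toy stand-in.
from typing import Callable


def reward_guided_filter(
    candidates: list[dict],                 # each dict: {"prompt": ..., "tool_call": ...}
    reward_model: Callable[[dict], float],  # maps an example to a scalar reward
    keep_fraction: float = 0.5,
) -> list[dict]:
    """Score every candidate and keep the top `keep_fraction` by reward."""
    scored = sorted(candidates, key=reward_model, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]


# Toy stand-in: prefers tool calls with non-empty arguments. A trained outcome
# reward model would instead score the full prompt / tool-call pair.
def toy_reward_model(example: dict) -> float:
    return float(len(example["tool_call"].get("arguments", {})))


if __name__ == "__main__":
    data = [
        {"prompt": "weather in Paris?",
         "tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}},
        {"prompt": "weather in Paris?",
         "tool_call": {"name": "get_weather", "arguments": {}}},
    ]
    kept = reward_guided_filter(data, toy_reward_model, keep_fraction=0.5)
    print(kept)  # retains the candidate whose arguments are filled in
```

The same scoring function can also drive best-of-N selection at inference time, where the highest-reward tool call among N sampled candidates is executed.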