

ToolRL: Reward is All Tool Learning Needs

April 16, 2025
Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
cs.AI

Abstract

Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.
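
The abstract names two technical ingredients: a fine-grained reward that scores tool calls at the level of tool names and individual parameters (rather than coarse answer matching), and GRPO, which normalizes each rollout's reward against its own sampled group. The short Python sketch below illustrates both ideas; the reward decomposition (format + tool name + per-parameter credit), the weights, the function names, and the example tools are all assumptions for illustration, not the paper's released code.

# Illustrative sketch (not the paper's released code): a fine-grained
# tool-use reward plus GRPO-style group-relative advantages. The
# decomposition, weights, and all names here are hypothetical.
from statistics import mean, pstdev

def tool_call_reward(pred_calls, gold_calls, format_ok):
    """Score predicted tool calls against gold calls.

    Each call is a (tool_name, params_dict) pair. Unlike coarse
    answer matching, this gives partial credit for selecting the
    right tool and for each correctly filled parameter.
    """
    reward = 1.0 if format_ok else 0.0  # reward well-formed output
    # zip truncates on length mismatch; extra or missing predicted
    # calls simply earn no credit in this sketch
    for (p_name, p_args), (g_name, g_args) in zip(pred_calls, gold_calls):
        if p_name == g_name:
            reward += 1.0  # correct tool selected
            if g_args:
                # fraction of gold parameters reproduced exactly
                hits = sum(p_args.get(k) == v for k, v in g_args.items())
                reward += hits / len(g_args)
    return reward

def grpo_advantages(rewards):
    """GRPO's core step: normalize each rollout's reward by the
    mean/std of its own sampled group (no learned critic)."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled rollouts for one prompt, scored then normalized.
gold = [("get_weather", {"city": "Paris", "unit": "C"})]
rollouts = [
    [("get_weather", {"city": "Paris", "unit": "C"})],  # exact match
    [("get_weather", {"city": "Paris"})],               # missing param
    [("search_web", {"query": "Paris weather"})],       # wrong tool
    [("get_weather", {"city": "Paris", "unit": "F"})],  # wrong value
]
rewards = [tool_call_reward(r, gold, format_ok=True) for r in rollouts]
print(rewards)                   # [3.0, 2.5, 1.0, 2.5]
print(grpo_advantages(rewards))  # zero-mean, unit-variance within group

In this toy example the exact-match rollout earns the highest reward (3.0), partially correct calls earn fractional parameter credit (2.5), and the wrong-tool call earns only the format reward (1.0); the group-relative normalization then yields the graded learning signal that coarse answer matching cannot provide.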

