OTC: Optimal Tool Calls via Reinforcement Learning

Abstract

Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost of tool usage. This can lead to suboptimal behavior: excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.
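
The abstract does not spell out the reward or the productivity metric, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a hypothetical reward that scales a binary correctness signal by a factor that shrinks as tool calls exceed an assumed minimum (`min_calls`), and it assumes tool productivity means correct answers per tool call. The names `Rollout`, `tool_integrated_reward`, `tool_productivity`, `min_calls`, and `decay` are all illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of tool invocations in the trajectory


def tool_integrated_reward(rollout: Rollout, min_calls: int, decay: float = 0.5) -> float:
    """Hypothetical reward: correctness scaled by a tool-efficiency factor.

    An incorrect answer earns nothing, so skipping needed tools is not rewarded.
    A correct answer earns 1.0 at `min_calls` and less for each extra call.
    """
    if not rollout.correct:
        return 0.0
    extra = max(0, rollout.tool_calls - min_calls)
    return 1.0 / (1.0 + decay * extra)


def tool_productivity(rollouts: Sequence[Rollout]) -> float:
    """Assumed definition: correct answers per tool call across a batch."""
    total_calls = sum(r.tool_calls for r in rollouts)
    correct = sum(r.correct for r in rollouts)
    # Guard against division by zero when no rollout used a tool.
    return correct / max(total_calls, 1)


if __name__ == "__main__":
    batch = [Rollout(True, 1), Rollout(True, 3), Rollout(False, 0)]
    for r in batch:
        print(r, tool_integrated_reward(r, min_calls=1))
    print("tool productivity:", tool_productivity(batch))
```

Under these assumptions, a correct rollout that uses two calls more than needed receives a lower reward than an equally correct one that uses the minimum, which is the kind of pressure toward fewer calls that the abstract describes. In a PPO or GRPO training loop, a scalar like this would stand in for a correctness-only reward.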
