Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Abstract

Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose Tool-R0, a framework for training general-purpose tool-calling agents from scratch with self-play RL under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: the Generator proposes challenging tasks targeted at the Solver's competence frontier, and the Solver learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluations on a range of tool-use benchmarks show that Tool-R0 yields a 92.5% relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.
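As a rough illustration of the Generator/Solver co-evolution described in the abstract, the toy sketch below reduces both roles to single numbers. It is not the Tool-R0 training procedure: the scalar `skill` and `difficulty`, the logistic `solve_rate`, the "succeed about half the time" shape of `generator_reward`, and the hill-climbing updates are all assumptions made for illustration, and no real tool calls or RL updates are involved.

```python
import math


def solve_rate(skill: float, difficulty: float) -> float:
    """Hypothetical chance that the Solver completes a task of this difficulty."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))


def generator_reward(rate: float) -> float:
    """Assumed frontier reward: largest when the Solver succeeds about half the time."""
    return 1.0 - abs(2.0 * rate - 1.0)


def self_play(rounds: int = 50, lr: float = 0.5):
    # Both roles start from the same state, mirroring "initialized from the same base LLM".
    skill, difficulty = 0.0, 0.0
    history = []
    for _ in range(rounds):
        # Generator step: pick the nearby difficulty with the highest frontier reward.
        candidates = (difficulty - lr, difficulty, difficulty + lr)
        difficulty = max(
            candidates,
            key=lambda d: generator_reward(solve_rate(skill, d)),
        )
        # Solver step: a stand-in for an RL update; failures drive larger gains.
        rate = solve_rate(skill, difficulty)
        skill += lr * (1.0 - rate)
        history.append((skill, difficulty, rate))
    return history


if __name__ == "__main__":
    for step, (skill, difficulty, rate) in enumerate(self_play(rounds=10), start=1):
        print(f"round {step:2d}  skill={skill:.2f}  difficulty={difficulty:.2f}  rate={rate:.2f}")
```

In this toy setting the proposed difficulty keeps tracking the Solver's skill, so the success rate stays near one half while both quantities rise. That is the curriculum-like behavior a frontier-seeking reward is meant to encourage, under the assumptions stated above.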
