

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

February 24, 2026
Authors: Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
cs.AI

Abstract

Large language models (LLMs) are becoming the foundation for autonomous agents that use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups: it often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose the Tool-R0 framework for training general-purpose tool-calling agents from scratch with self-play RL under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted, challenging tasks at the other's competence frontier, and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluation on diverse tool-use benchmarks shows that Tool-R0 yields a 92.5% relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.
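The Generator–Solver co-evolution loop described in the abstract can be illustrated with a minimal toy simulation. Everything below is an assumption for illustration only: the two policies are reduced to scalar "abilities," and the reward rules and numeric parameters are invented; the paper's actual method co-trains two LLM policies with RL over real tool calls.

```python
import random

random.seed(0)

def solver_succeeds(skill: float, difficulty: float) -> bool:
    """Simulate one Solver attempt: tasks harder relative to the
    Solver's skill fail more often (toy success model)."""
    p = max(0.0, min(1.0, skill - difficulty + 0.5))
    return random.random() < p

skill, difficulty, lr = 0.2, 0.1, 0.05
for step in range(200):
    if solver_succeeds(skill, difficulty):
        # Solver "reward": its skill improves on solved tasks.
        # Generator "reward": push the next task toward the Solver's
        # competence frontier by raising difficulty.
        skill += lr * (1.0 - skill)
        difficulty = min(1.0, difficulty + lr)
    else:
        # Task was beyond the frontier: the Generator backs off slightly,
        # keeping proposals near what the Solver can almost do.
        difficulty = max(0.0, difficulty - lr)

print(f"final skill={skill:.2f}, difficulty={difficulty:.2f}")
```

Running the loop shows the intended dynamic: difficulty tracks just behind the Solver's growing skill, producing an emergent curriculum without any pre-existing task set.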
PDF | March 4, 2026