TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
October 1, 2025
Authors: Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, Rameswar Panda
cs.AI
Abstract
Large Language Model (LLM) agents are rapidly emerging as powerful systems
for automating tasks across domains. Yet progress in the open-source community
is constrained by the lack of high-quality, permissively licensed tool-agentic
training data. Existing datasets are often limited in diversity, realism, and
complexity, particularly regarding multi-tool and multi-turn interactions. To
address this gap, we introduce Toucan, the largest publicly available
tool-agentic dataset to date, containing 1.5 million trajectories synthesized
from nearly 500 real-world Model Context Protocol (MCP) servers. Unlike prior work,
Toucan leverages authentic MCP environments to generate diverse, realistic, and
challenging tasks with trajectories involving real tool execution. Our pipeline
first produces a broad spectrum of tool-use queries using five distinct models,
applies model-based quality filtering, and then generates agentic trajectories
with three teacher models using two agentic frameworks. Rigorous rule-based and
model-based validation ensures high-quality outputs. We also introduce three
extension mechanisms to further diversify tasks and simulate multi-turn
conversations. Models fine-tuned on Toucan outperform larger closed-source
counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on
MCP-Universe Bench.
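
The four pipeline stages named in the abstract (query generation, model-based filtering, trajectory generation, and rule-based plus model-based validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` type, the `passes_rules` check, the `min_score` threshold, the full teacher-by-framework pairing, and all callback names (`gen_query`, `score_query`, `run_agent`, `judge`) are hypothetical stand-ins for components the abstract does not specify.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical record of one agent run over a synthesized query."""
    query: str
    teacher: str
    framework: str
    # Each step records one tool call, e.g. {"tool": "search", "ok": True}.
    steps: List[Dict] = field(default_factory=list)


def passes_rules(traj: Trajectory) -> bool:
    """Assumed rule-based check: at least one tool call, and none failed."""
    return bool(traj.steps) and all(step.get("ok", False) for step in traj.steps)


def synthesize(servers, query_models, teachers, frameworks,
               gen_query, score_query, run_agent, judge, min_score=0.5):
    """Run the four stages; every callback is supplied by the caller."""
    # Stage 1: each query model proposes a query for each MCP server.
    queries = [gen_query(m, s) for s, m in product(servers, query_models)]
    # Stage 2: model-based quality filter (the threshold is an assumption).
    queries = [q for q in queries if score_query(q) >= min_score]
    # Stage 3: generate trajectories; pairing every teacher with every
    # framework is an assumption made for this sketch.
    trajectories = [run_agent(q, t, f)
                    for q in queries
                    for t, f in product(teachers, frameworks)]
    # Stage 4: keep only trajectories that pass both kinds of validation.
    return [t for t in trajectories if passes_rules(t) and judge(t)]
```

Passing the models and frameworks in as plain callbacks keeps the sketch independent of any particular LLM client or MCP library; in practice each callback would wrap a real model call or tool execution.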