TOUCAN: Synthesizing 1.5 Million Tool-Agentic Data Instances from Real-World MCP Environments

Abstract

Large language model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly for multi-tool and multi-turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool-agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocol (MCP) servers. Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks whose trajectories involve real tool execution. Our pipeline first produces a broad spectrum of tool-use queries using five distinct models and applies model-based quality filtering to them. It then generates agentic trajectories with three teacher models running in two agentic frameworks. Rigorous rule-based and model-based validation ensures high-quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi-turn conversations. Models fine-tuned on Toucan outperform larger closed-source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP-Universe Bench.
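
The four pipeline stages named in the abstract (query generation, model-based query filtering, trajectory generation, and rule-based plus model-based validation) can be pictured with a minimal Python sketch. This is not the authors' implementation. Every name here (`Trajectory`, `passes_rules`, `synthesize`, and the callables passed in) is hypothetical, the rule-based check is an assumed example, and pairing every teacher with every framework is a simplification the abstract does not confirm.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    """One candidate agent run (hypothetical record layout)."""
    query: str
    teacher: str
    framework: str
    tool_calls: List[str]
    answer: str


def passes_rules(traj: Trajectory) -> bool:
    # Assumed rule-based check: at least one tool call and a non-empty answer.
    return bool(traj.tool_calls) and bool(traj.answer.strip())


def synthesize(
    servers: Iterable[str],
    query_models: Iterable[str],
    teachers: Iterable[str],
    frameworks: Iterable[str],
    generate_query: Callable[[str, str], str],
    query_is_good: Callable[[str], bool],
    run_agent: Callable[[str, str, str], Trajectory],
    judge_is_good: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    # Stage 1: each query model proposes one task per MCP server.
    queries = [generate_query(m, s) for s, m in product(servers, query_models)]
    # Stage 2: model-based quality filtering of the queries.
    queries = [q for q in queries if query_is_good(q)]
    # Stage 3: run each surviving query with every teacher/framework pair
    # (a simplification; the real pairing may differ).
    candidates = [
        run_agent(q, t, f) for q, t, f in product(queries, teachers, frameworks)
    ]
    # Stage 4: keep trajectories that pass both the rule-based and model-based checks.
    return [c for c in candidates if passes_rules(c) and judge_is_good(c)]
```

Passing the model calls in as plain callables keeps the sketch independent of any particular LLM client or agent framework, so each stage can be stubbed out and tested on its own.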