TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
October 22, 2025
Authors: Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue
cs.AI
Abstract
Since the introduction of the Model Context Protocol (MCP), the number of
available tools for Large Language Models (LLMs) has increased significantly.
These task-specific tool sets offer an alternative to general-purpose tools
such as web browsers, while being easier to develop and maintain than GUIs.
However, current general-purpose agents predominantly rely on web browsers for
interacting with the environment. Here, we introduce TheMCPCompany, a benchmark
for evaluating tool-calling agents on tasks that involve interacting with
various real-world services. We use the REST APIs of these services to create
MCP servers, which include over 18,000 tools. We also provide manually
annotated ground-truth tools for each task. In our experiments, we use the
ground-truth tools to show the potential of tool-calling agents for both
improving performance and reducing costs, assuming perfect tool retrieval. Next,
we explore agent performance using tool retrieval to study the real-world
practicality of tool-based agents. While all models with tool retrieval perform
similarly to or better than browser-based agents, smaller models cannot take full
advantage of the available tools through retrieval. On the other hand, GPT-5's
performance with tool retrieval is very close to its performance with
ground-truth tools. Overall, our work shows that the most advanced reasoning
models are effective at discovering tools in simpler environments, but
seriously struggle with navigating complex enterprise environments.
TheMCPCompany reveals that navigating tens of thousands of tools and combining
them in non-trivial ways to solve complex problems is still a challenging task
for current models and requires both better reasoning and better retrieval
models.
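
To make the idea of tool retrieval concrete, the following is a minimal sketch of how an agent might narrow a large tool catalog down to a few candidates before calling any of them. It is not the retrieval method used in the benchmark, which the abstract does not specify. It assumes a simple token-overlap score, and the `Tool` class, the `retrieve_tools` function, and the three catalog entries are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record: a name plus a natural-language description."""
    name: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tools(query: str, catalog: list[Tool], k: int = 2) -> list[Tool]:
    """Return up to k tools whose name and description share the most tokens with the query.

    Assumption: plain token overlap stands in for whatever retriever a real
    agent would use (for example, an embedding model).
    """
    query_tokens = tokenize(query)
    scored = []
    for tool in catalog:
        tool_tokens = tokenize(tool.name + " " + tool.description)
        overlap = len(query_tokens & tool_tokens)
        if overlap > 0:
            scored.append((overlap, tool))
    # Stable sort by score only, so ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


# Hypothetical three-entry catalog; the benchmark itself has over 18,000 tools.
CATALOG = [
    Tool("create_issue", "Create a new issue in a project tracker"),
    Tool("list_virtual_machines", "List virtual machines in a cloud subscription"),
    Tool("send_message", "Send a chat message to a user"),
]

if __name__ == "__main__":
    for tool in retrieve_tools("list the virtual machines in my subscription", CATALOG):
        print(tool.name)
```

With a catalog of tens of thousands of entries, a scorer this simple would likely surface many near-duplicates, which is one way to see why the abstract calls for better retrieval models.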