TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Abstract

Since the introduction of the Model Context Protocol (MCP), the number of tools available to Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers to interact with their environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs, assuming perfect tool retrieval. Next, we explore agent performance with tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. In contrast, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments but struggle seriously to navigate complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems remains a challenging task for current models and requires both better reasoning and better retrieval models.
