TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Abstract

Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents to both improve performance and reduce costs, assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
