LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Abstract

With the rapid development of the Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, which hinders effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
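As a quick arithmetic check on the headline figure, the sketch below assumes the success rate is simply the share of the 95 tasks judged successful; the abstract does not define the metric, so this is an assumption, and the helper name `success_rate` is hypothetical. Under that assumption, the reported 78.95% corresponds to 75 of 95 tasks.

```python
def success_rate(passed: int, total: int) -> float:
    """Percentage of tasks judged successful, rounded to two decimals.

    Assumes an unweighted per-task metric (not confirmed by the abstract).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 2)


TOTAL_TASKS = 95  # task count stated in the abstract

# Search for the task counts whose rate rounds to the reported 78.95%.
matches = [k for k in range(TOTAL_TASKS + 1)
           if success_rate(k, TOTAL_TASKS) == 78.95]
print(matches)
```

With 95 tasks, each additional solved task moves the rate by roughly 1.05 percentage points, so small differences between models on this benchmark amount to only a few tasks.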
