LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
August 3, 2025
Authors: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI
Abstract
With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
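
The abstract describes the MCP Copilot Agent only at a high level (routing tools for dynamic planning, then executing them for API interaction). As an illustrative aid, the sketch below shows one way such a multi-step route-then-execute loop could be structured; it is not the authors' implementation, and the names (`Tool`, `copilot_loop`, `plan_step`) are hypothetical placeholders.

```python
# Hypothetical sketch of a multi-step tool-routing agent loop in the spirit of
# the MCP Copilot Agent described in the abstract (not the paper's code).
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """A single MCP tool exposed by some server."""
    name: str
    description: str
    run: Callable[[dict], Any]  # wraps the underlying MCP tool invocation


def copilot_loop(
    task: str,
    tools: dict[str, Tool],
    plan_step: Callable[[str, list, dict[str, Tool]], dict],
    max_steps: int = 10,
) -> str:
    """Iteratively route and execute tools until the planner emits an answer.

    `plan_step` stands in for an LLM call that, given the task, the interaction
    history, and the available tools, returns either
    {"tool": <name>, "args": {...}} or {"answer": <text>}.
    """
    history: list[dict] = []
    for _ in range(max_steps):
        decision = plan_step(task, history, tools)  # routing: pick a tool or stop
        if "answer" in decision:
            return decision["answer"]
        tool = tools[decision["tool"]]
        result = tool.run(decision["args"])          # execution: MCP API interaction
        history.append({"call": decision, "result": result})
    return "No answer within the step budget."
```

Under this framing, scaling to the 70 servers and 527 tools of LiveMCPTool is primarily a routing problem: the planner must select the right tool from a large catalog at each step, which is exactly the capability the benchmark is designed to stress.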