
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

August 3, 2025
Authors: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI

Abstract

With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
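The abstract describes the MCP Copilot Agent as first routing tools for dynamic planning and then executing them for API interaction across a large pool of MCP servers. The following is a minimal, hypothetical sketch of such a route-then-execute loop; `ToolSpec`, `route_tools`, `call_mcp_tool`, and `run_agent` are illustrative names introduced here, the keyword-overlap router stands in for the LLM- or embedding-based routing the paper implies, and the tool call is a stub rather than a real MCP client call. It is not the paper's actual implementation.

```python
"""Sketch of a route-then-execute agent loop over a large MCP-style tool pool.

All identifiers below are hypothetical illustrations, not LiveMCPBench code.
"""
from dataclasses import dataclass


@dataclass
class ToolSpec:
    server: str        # which MCP server exposes the tool
    name: str          # tool identifier
    description: str   # natural-language description used for routing


def route_tools(task: str, tools: list[ToolSpec], k: int = 5) -> list[ToolSpec]:
    """Keep the k tools whose descriptions overlap most with the task wording.

    A real router would use an LLM or embedding similarity; keyword overlap
    is only a stand-in so the sketch runs without external services.
    """
    task_words = set(task.lower().split())
    scored = [(len(task_words & set(t.description.lower().split())), t) for t in tools]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def call_mcp_tool(tool: ToolSpec, arguments: dict) -> str:
    """Placeholder for an MCP tool invocation (e.g. via an MCP client SDK)."""
    return f"[{tool.server}/{tool.name}] executed with {arguments}"


def run_agent(task: str, tool_pool: list[ToolSpec], max_steps: int = 3) -> list[str]:
    """Multi-step loop: route a small tool subset, then execute step by step."""
    candidates = route_tools(task, tool_pool)
    trajectory = []
    for step in range(max_steps):
        # In a real agent, an LLM would choose the tool and its arguments here.
        tool = candidates[step % len(candidates)]
        trajectory.append(call_mcp_tool(tool, {"query": task, "step": step}))
    return trajectory


if __name__ == "__main__":
    pool = [
        ToolSpec("weather-server", "get_forecast", "look up the weather forecast for a city"),
        ToolSpec("calendar-server", "create_event", "create a calendar event with a title and time"),
        ToolSpec("files-server", "search_files", "search local files by keyword"),
    ]
    for line in run_agent("check the weather forecast for Beijing", pool):
        print(line)
```

Separating routing from execution keeps the per-step context small: the agent only reasons over a handful of candidate tools per step instead of all 527 tools in the LiveMCPTool pool.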