
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

August 3, 2025
Authors: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI

Abstract

With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
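The abstract describes the MCP Copilot Agent as first routing tools for dynamic planning and then executing them for API interaction across a large pool of MCP servers. The following is a minimal, hypothetical sketch of such a route-then-execute loop; `ToolSpec`, `route_tools`, `call_mcp_tool`, and `run_agent` are illustrative names introduced here, the keyword-overlap router stands in for the LLM- or embedding-based routing the paper implies, and the tool call is a stub rather than a real MCP client call. It is not the paper's actual implementation.

```python
"""Sketch of a route-then-execute agent loop over a large MCP-style tool pool.

All identifiers below are hypothetical illustrations, not LiveMCPBench code.
"""
from dataclasses import dataclass


@dataclass
class ToolSpec:
    server: str        # which MCP server exposes the tool
    name: str          # tool identifier
    description: str   # natural-language description used for routing


def route_tools(task: str, tools: list[ToolSpec], k: int = 5) -> list[ToolSpec]:
    """Keep the k tools whose descriptions overlap most with the task wording.

    A real router would use an LLM or embedding similarity; keyword overlap
    is only a stand-in so the sketch runs without external services.
    """
    task_words = set(task.lower().split())
    scored = [(len(task_words & set(t.description.lower().split())), t) for t in tools]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def call_mcp_tool(tool: ToolSpec, arguments: dict) -> str:
    """Placeholder for an MCP tool invocation (e.g. via an MCP client SDK)."""
    return f"[{tool.server}/{tool.name}] executed with {arguments}"


def run_agent(task: str, tool_pool: list[ToolSpec], max_steps: int = 3) -> list[str]:
    """Multi-step loop: route a small tool subset, then execute step by step."""
    candidates = route_tools(task, tool_pool)
    trajectory = []
    for step in range(max_steps):
        # In a real agent, an LLM would choose the tool and its arguments here.
        tool = candidates[step % len(candidates)]
        trajectory.append(call_mcp_tool(tool, {"query": task, "step": step}))
    return trajectory


if __name__ == "__main__":
    pool = [
        ToolSpec("weather-server", "get_forecast", "look up the weather forecast for a city"),
        ToolSpec("calendar-server", "create_event", "create a calendar event with a title and time"),
        ToolSpec("files-server", "search_files", "search local files by keyword"),
    ]
    for line in run_agent("check the weather forecast for Beijing", pool):
        print(line)
```

Separating routing from execution keeps the per-step context small: the agent only reasons over a handful of candidate tools per step instead of all 527 tools in the LiveMCPTool pool.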