LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
August 3, 2025
Authors: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI
Abstract
With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
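
The abstract describes the MCP Copilot Agent only at a high level (routing tools for dynamic planning, then executing them for API interaction). As an illustrative aid, the sketch below shows one way such a multi-step route-then-execute loop could be structured; it is not the authors' implementation, and the names (`Tool`, `copilot_loop`, `plan_step`) are hypothetical placeholders.

```python
# Hypothetical sketch of a multi-step tool-routing agent loop in the spirit of
# the MCP Copilot Agent described in the abstract (not the paper's code).
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """A single MCP tool exposed by some server."""
    name: str
    description: str
    run: Callable[[dict], Any]  # wraps the underlying MCP tool invocation


def copilot_loop(
    task: str,
    tools: dict[str, Tool],
    plan_step: Callable[[str, list, dict[str, Tool]], dict],
    max_steps: int = 10,
) -> str:
    """Iteratively route and execute tools until the planner emits an answer.

    `plan_step` stands in for an LLM call that, given the task, the interaction
    history, and the available tools, returns either
    {"tool": <name>, "args": {...}} or {"answer": <text>}.
    """
    history: list[dict] = []
    for _ in range(max_steps):
        decision = plan_step(task, history, tools)  # routing: pick a tool or stop
        if "answer" in decision:
            return decision["answer"]
        tool = tools[decision["tool"]]
        result = tool.run(decision["args"])          # execution: MCP API interaction
        history.append({"call": decision, "result": result})
    return "No answer within the step budget."
```

Under this framing, scaling to the 70 servers and 527 tools of LiveMCPTool is primarily a routing problem: the planner must select the right tool from a large catalog at each step, which is exactly the capability the benchmark is designed to stress.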