
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

August 3, 2025
作者: Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
cs.AI

Abstract

With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
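To make the agent design concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of an MCP Copilot-style loop: connect to a single MCP server, expose its tools to a planner LLM, execute the tool the planner selects, and feed the observation back until the planner returns a final answer. The MCP client calls follow the official MCP Python SDK (ClientSession, stdio_client); call_llm, run_task, and the example server command are hypothetical placeholders, and the real system routes across many servers rather than one.

# Illustrative sketch of an MCP tool-routing agent loop (assumptions noted above).
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def call_llm(prompt: str) -> dict:
    """Hypothetical planner call: returns {"tool": ..., "arguments": {...}} or {"final": ...}."""
    raise NotImplementedError("plug in your model provider here")


async def run_task(task: str, server_cmd: str, max_steps: int = 5) -> str:
    params = StdioServerParameters(command=server_cmd)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Routing step: describe the server's tools to the planner LLM.
            tools = (await session.list_tools()).tools
            tool_specs = [
                {"name": t.name, "description": t.description, "schema": t.inputSchema}
                for t in tools
            ]

            history = [f"Task: {task}", f"Tools: {json.dumps(tool_specs)}"]
            for _ in range(max_steps):
                decision = await call_llm("\n".join(history))
                if "final" in decision:  # planner decides the task is complete
                    return decision["final"]

                # Execution step: call the selected MCP tool with the planner's arguments.
                result = await session.call_tool(
                    decision["tool"], arguments=decision["arguments"]
                )
                history.append(f"Observation: {result.content}")

    return "max steps reached without a final answer"


if __name__ == "__main__":
    # "weather-mcp-server" is a placeholder executable name, not a tool from LiveMCPTool.
    asyncio.run(run_task("Find tomorrow's weather in Beijing", server_cmd="weather-mcp-server"))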