MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
August 28, 2025
Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
cs.AI
Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models
(LLMs) on realistic, multi-step tasks that demand tool use, cross-tool
coordination, precise parameter control, and task-solving planning and
reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs
to 28 representative live MCP servers spanning 250 tools across domains such
as finance, travel, scientific computing, and academic search. Unlike prior
API-based benchmarks, each MCP server provides a set of complementary tools
designed to work together, enabling the construction of authentic, multi-step
tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability
to retrieve relevant tools from fuzzy instructions without explicit tool names,
plan multi-hop execution trajectories for complex objectives, ground responses
in intermediate tool outputs, and orchestrate cross-domain workflows -
capabilities not adequately evaluated by existing benchmarks that rely on
explicit tool specifications, shallow few-step workflows, and isolated domain
operations. We propose a multi-faceted evaluation framework covering tool-level
schema understanding and usage, trajectory-level planning, and task completion.
Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code
and data: https://github.com/Accenture/mcp-bench.
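One capability the abstract highlights is retrieving relevant tools from fuzzy instructions that never name a tool explicitly. The sketch below is purely illustrative and not taken from MCP-Bench: the tool names, descriptions, and the simple lexical-overlap score are all invented stand-ins for whatever retrieval mechanism a real agent uses.

```python
# Hypothetical sketch: ranking candidate MCP tools against a fuzzy
# instruction by lexical overlap with each tool's description.
# All tool names and descriptions are invented for illustration.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())

def rank_tools(instruction: str, tools: dict[str, str]) -> list[tuple[str, int]]:
    """Rank tools by how many instruction tokens appear in each tool's
    description (a toy stand-in for real semantic retrieval)."""
    query = tokenize(instruction)
    scored = [(name, len(query & tokenize(desc))) for name, desc in tools.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented example tools, loosely mirroring domains named in the abstract.
TOOLS = {
    "get_stock_quote": "fetch the latest stock price quote for a ticker in finance markets",
    "search_flights": "search available flights between two cities for travel planning",
    "arxiv_search": "search academic papers on arxiv by keyword for academic research",
}

if __name__ == "__main__":
    # A fuzzy instruction that never mentions a tool name.
    ranking = rank_tools("find recent academic papers about agent benchmarks", TOOLS)
    print(ranking[0][0])  # best-matching tool: arxiv_search
```

In MCP-Bench the agent faces hundreds of tools across 28 servers, so the benchmark stresses exactly this kind of retrieval plus the subsequent multi-hop planning; a real system would use the tool schemas exposed by each MCP server rather than a bag-of-words match.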