
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

August 28, 2025
Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
cs.AI

Abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
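As background to the abstract above: the Model Context Protocol that MCP-Bench builds on is a JSON-RPC 2.0 based protocol in which a client discovers a server's tools (`tools/list`) and invokes them by name with structured arguments (`tools/call`). The sketch below only constructs such messages; it does not connect to a live server, and the tool name and arguments are hypothetical, not taken from any MCP-Bench server.

```python
import json

def make_rpc_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request message, the wire format MCP uses."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Ask a server which tools it exposes (MCP method name per the spec).
list_req = make_rpc_request("tools/list")

# Invoke one tool by name with arguments; the tool name and argument
# schema here are illustrative placeholders, not a real MCP-Bench tool.
call_req = make_rpc_request(
    "tools/call",
    {"name": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}},
    req_id=2,
)
print(list_req)
```

The benchmark's task structure follows from this two-step shape: an agent must first pick the right tools from a `tools/list`-style catalog given only a fuzzy instruction, then chain `tools/call` invocations whose arguments depend on earlier outputs.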