
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

August 28, 2025
Authors: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
cs.AI

Abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
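As background to the abstract above: the Model Context Protocol that MCP-Bench builds on is a JSON-RPC 2.0 based protocol in which a client discovers a server's tools (`tools/list`) and invokes them by name with structured arguments (`tools/call`). The sketch below only constructs such messages; it does not connect to a live server, and the tool name and arguments are hypothetical, not taken from any MCP-Bench server.

```python
import json

def make_rpc_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request message, the wire format MCP uses."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Ask a server which tools it exposes (MCP method name per the spec).
list_req = make_rpc_request("tools/list")

# Invoke one tool by name with arguments; the tool name and argument
# schema here are illustrative placeholders, not a real MCP-Bench tool.
call_req = make_rpc_request(
    "tools/call",
    {"name": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}},
    req_id=2,
)
print(list_req)
```

The benchmark's task structure follows from this two-step shape: an agent must first pick the right tools from a `tools/list`-style catalog given only a fuzzy instruction, then chain `tools/call` invocations whose arguments depend on earlier outputs.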