
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

August 28, 2025
作者: Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
cs.AI

Abstract

We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning for task solving. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
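To make the "retrieve relevant tools from fuzzy instructions" capability concrete, here is a minimal sketch of how an agent might rank a tool catalog against an instruction that never names a tool. This is an illustrative toy, not MCP-Bench's actual retrieval or scoring pipeline: the tool names, descriptions, and the keyword-overlap heuristic below are all invented assumptions.

```python
# Hypothetical sketch: ranking tools from a catalog against a fuzzy
# instruction by word overlap with each tool's name and description.
# The catalog entries here are invented examples, not MCP-Bench tools.

TOOLS = [
    {"name": "flight_search", "description": "search flights between two cities on a date"},
    {"name": "hotel_lookup", "description": "find hotels in a city within a price range"},
    {"name": "stock_quote", "description": "get the latest stock price for a ticker symbol"},
    {"name": "paper_search", "description": "search academic papers by topic keywords"},
]

def retrieve_tools(instruction: str, tools=TOOLS, top_k: int = 2):
    """Rank tools by word overlap with the instruction (no explicit tool names)."""
    query = set(instruction.lower().split())
    scored = []
    for tool in tools:
        text = (tool["name"].replace("_", " ") + " " + tool["description"]).lower()
        overlap = len(query & set(text.split()))
        scored.append((overlap, tool["name"]))
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# The instruction never says "hotel", yet the hotel tool ranks first:
print(retrieve_tools("I need to find a cheap place to stay in a city"))
```

A real agent would replace the keyword heuristic with embedding similarity or let the model read the full JSON schemas, but the shape of the problem is the same: the instruction describes a goal, and the agent must map it onto tools it has never been told by name.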