

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

September 28, 2025
作者: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
cs.AI

Abstract

MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
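The "minimal agent framework that operates in a tool-calling loop" can be sketched roughly as below. This is an illustrative sketch, not the MCPMark implementation: `StubModel`, the message format, and the tool registry are all hypothetical stand-ins for a real LLM API and real MCP tools.

```python
def run_agent(model, tools, task_prompt, max_turns=32):
    """Query the model, execute any tool calls it requests, feed results
    back, and repeat until the model answers directly or the turn budget
    runs out. (Illustrative; not the actual MCPMark framework.)"""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = model.chat(messages, tools)
        messages.append(reply)
        calls = reply.get("tool_calls", [])
        if not calls:                      # no tool requested: final answer
            return reply["content"]
        for call in calls:                 # execute each requested tool
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})
    return None                            # turn budget exhausted


class StubModel:
    """Toy model: requests one tool call, then answers with its result."""
    def chat(self, messages, tools):
        if messages[-1]["role"] == "tool":
            return {"role": "assistant", "tool_calls": [],
                    "content": f"done: {messages[-1]['content']}"}
        return {"role": "assistant", "content": None,
                "tool_calls": [{"name": "read_page", "args": {"page_id": 7}}]}


tools = {"read_page": lambda page_id: f"contents of page {page_id}"}
print(run_agent(StubModel(), tools, "Summarize page 7"))
# → done: contents of page 7
```

After each task run, a programmatic verification script (as the abstract describes) would check the final environment state rather than the model's text output.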
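The two reported metrics can be computed as follows, assuming each task is run several times independently: pass@1 is the average per-run success rate, while pass^k is the fraction of tasks where all k runs succeed (a stricter measure of reliability). The function names and input format here are illustrative, not taken from the paper.

```python
from statistics import mean

def pass_at_1(results):
    """results: one list of booleans per task (one entry per run).
    pass@1 = mean per-run success rate, averaged over tasks."""
    return mean(mean(runs) for runs in results)

def pass_pow_k(results, k=4):
    """pass^k = fraction of tasks where all k runs succeed."""
    return mean(all(runs[:k]) for runs in results)

# Toy example: three tasks, four runs each.
results = [
    [True, True, True, True],    # always solved
    [True, False, True, True],   # flaky
    [False, False, False, False] # never solved
]
print(round(pass_at_1(results), 4))   # → 0.5833
print(round(pass_pow_k(results), 4))  # → 0.3333
```

The gap between the two numbers (e.g. 52.56% pass@1 vs. 33.86% pass^4 for gpt-5-medium) reflects how often a model solves a task only intermittently.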
PDF · October 1, 2025