MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
September 28, 2025
Authors: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
cs.AI
Abstract
MCP standardizes how LLMs interact with external systems, forming the
foundation for general agents. However, existing MCP benchmarks remain narrow
in scope: they focus on read-heavy tasks or tasks with limited interaction
depth, and fail to capture the complexity and realism of real-world workflows.
To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP
use in a more realistic and comprehensive manner. It consists of 127
high-quality tasks collaboratively created by domain experts and AI agents.
Each task begins with a curated initial state and includes a programmatic
script for automatic verification. These tasks demand richer and more diverse
interactions with the environment, involving a broad range of create, read,
update, and delete (CRUD) operations. We conduct a comprehensive evaluation of
cutting-edge LLMs using a minimal agent framework that operates in a
tool-calling loop. Empirical results show that the best-performing model,
gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other
widely regarded strong models, including claude-sonnet-4 and o3, fall below
30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution
turns and 17.4 tool calls per task, significantly surpassing those in
previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
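The "minimal agent framework that operates in a tool-calling loop" can be pictured roughly as follows. This is a hypothetical sketch, not MCPMark's actual code: the model client interface, tool-call message shape, and turn limit are all illustrative assumptions.

```python
# Hypothetical sketch of a minimal tool-calling agent loop.
# `model` and the message/tool-call dict shapes are assumptions,
# not the benchmark's real harness.

def run_agent(model, tools, task_prompt, max_turns=30):
    """Send messages, execute any requested tool calls, repeat."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = model(messages, tools)           # one execution turn
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                            # no tool calls: model is done
            return messages
        for call in calls:                       # execute each requested tool
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})
    return messages
```

Under this reading, the abstract's "16.2 execution turns" corresponds to iterations of the outer loop, and "17.4 tool calls" to invocations inside the inner loop (a single turn may issue several calls).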
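The gap between pass@1 and pass^4 reflects two different questions: pass@1 averages single-attempt success, while pass^k (under the common reading, which the abstract does not spell out, so this is an assumption) asks whether all k independent attempts on a task succeed, i.e. it measures reliability rather than best-case ability.

```python
# Illustrative computation of pass@1 vs. pass^k from per-task trial
# outcomes (True = task verified as solved). The "all k trials succeed"
# definition of pass^k is an assumption about the metric.
from statistics import mean

def pass_at_1(trials_per_task):
    # average single-trial success rate across tasks
    return mean(mean(t) for t in trials_per_task)

def pass_to_the_k(trials_per_task, k):
    # fraction of tasks whose first k trials all succeed
    return mean(all(t[:k]) for t in trials_per_task)

results = [
    [True, True, True, True],     # consistently solved task
    [True, False, True, False],   # flaky task
    [False, False, False, False], # unsolved task
]
print(pass_at_1(results))         # 0.5
print(pass_to_the_k(results, 4))  # ~0.333: only 1 of 3 tasks passes all 4
```

Note how the flaky task counts toward pass@1 but not pass^4; this is why a model can score 52.56% pass@1 yet only 33.86% pass^4.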