

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

September 28, 2025
作者: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
cs.AI

Abstract

MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
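The "minimal agent framework that operates in a tool-calling loop" can be sketched roughly as below. This is an illustrative sketch, not the MCPMark implementation: `StubModel`, the message format, and the tool registry are all hypothetical stand-ins for a real LLM API and real MCP tools.

```python
def run_agent(model, tools, task_prompt, max_turns=32):
    """Query the model, execute any tool calls it requests, feed results
    back, and repeat until the model answers directly or the turn budget
    runs out. (Illustrative; not the actual MCPMark framework.)"""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = model.chat(messages, tools)
        messages.append(reply)
        calls = reply.get("tool_calls", [])
        if not calls:                      # no tool requested: final answer
            return reply["content"]
        for call in calls:                 # execute each requested tool
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})
    return None                            # turn budget exhausted


class StubModel:
    """Toy model: requests one tool call, then answers with its result."""
    def chat(self, messages, tools):
        if messages[-1]["role"] == "tool":
            return {"role": "assistant", "tool_calls": [],
                    "content": f"done: {messages[-1]['content']}"}
        return {"role": "assistant", "content": None,
                "tool_calls": [{"name": "read_page", "args": {"page_id": 7}}]}


tools = {"read_page": lambda page_id: f"contents of page {page_id}"}
print(run_agent(StubModel(), tools, "Summarize page 7"))
# → done: contents of page 7
```

After each task run, a programmatic verification script (as the abstract describes) would check the final environment state rather than the model's text output.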
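The two reported metrics can be computed as follows, assuming each task is run several times independently: pass@1 is the average per-run success rate, while pass^k is the fraction of tasks where all k runs succeed (a stricter measure of reliability). The function names and input format here are illustrative, not taken from the paper.

```python
from statistics import mean

def pass_at_1(results):
    """results: one list of booleans per task (one entry per run).
    pass@1 = mean per-run success rate, averaged over tasks."""
    return mean(mean(runs) for runs in results)

def pass_pow_k(results, k=4):
    """pass^k = fraction of tasks where all k runs succeed."""
    return mean(all(runs[:k]) for runs in results)

# Toy example: three tasks, four runs each.
results = [
    [True, True, True, True],    # always solved
    [True, False, True, True],   # flaky
    [False, False, False, False] # never solved
]
print(round(pass_at_1(results), 4))   # → 0.5833
print(round(pass_pow_k(results), 4))  # → 0.3333
```

The gap between the two numbers (e.g. 52.56% pass@1 vs. 33.86% pass^4 for gpt-5-medium) reflects how often a model solves a task only intermittently.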
PDF · October 1, 2025