MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
September 28, 2025
Authors: Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
cs.AI
Abstract
MCP standardizes how LLMs interact with external systems, forming the
foundation for general agents. However, existing MCP benchmarks remain narrow
in scope: they focus on read-heavy tasks or tasks with limited interaction
depth, and fail to capture the complexity and realism of real-world workflows.
To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP
use in a more realistic and comprehensive manner. It consists of 127
high-quality tasks collaboratively created by domain experts and AI agents.
Each task begins with a curated initial state and includes a programmatic
script for automatic verification. These tasks demand richer and more diverse
interactions with the environment, involving a broad range of create, read,
update, and delete (CRUD) operations. We conduct a comprehensive evaluation of
cutting-edge LLMs using a minimal agent framework that operates in a
tool-calling loop. Empirical results show that the best-performing model,
gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other
widely regarded strong models, including claude-sonnet-4 and o3, fall below
30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution
turns and 17.4 tool calls per task, significantly surpassing those in
previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
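The "minimal agent framework that operates in a tool-calling loop" can be pictured roughly as follows. This is a hypothetical sketch, not MCPMark's actual code: the model client interface, tool-call message shape, and turn limit are all illustrative assumptions.

```python
# Hypothetical sketch of a minimal tool-calling agent loop.
# `model` and the message/tool-call dict shapes are assumptions,
# not the benchmark's real harness.

def run_agent(model, tools, task_prompt, max_turns=30):
    """Send messages, execute any requested tool calls, repeat."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = model(messages, tools)           # one execution turn
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:                            # no tool calls: model is done
            return messages
        for call in calls:                       # execute each requested tool
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": str(result)})
    return messages
```

Under this reading, the abstract's "16.2 execution turns" corresponds to iterations of the outer loop, and "17.4 tool calls" to invocations inside the inner loop (a single turn may issue several calls).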
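The gap between pass@1 and pass^4 reflects two different questions: pass@1 averages single-attempt success, while pass^k (under the common reading, which the abstract does not spell out, so this is an assumption) asks whether all k independent attempts on a task succeed, i.e. it measures reliability rather than best-case ability.

```python
# Illustrative computation of pass@1 vs. pass^k from per-task trial
# outcomes (True = task verified as solved). The "all k trials succeed"
# definition of pass^k is an assumption about the metric.
from statistics import mean

def pass_at_1(trials_per_task):
    # average single-trial success rate across tasks
    return mean(mean(t) for t in trials_per_task)

def pass_to_the_k(trials_per_task, k):
    # fraction of tasks whose first k trials all succeed
    return mean(all(t[:k]) for t in trials_per_task)

results = [
    [True, True, True, True],     # consistently solved task
    [True, False, True, False],   # flaky task
    [False, False, False, False], # unsolved task
]
print(pass_at_1(results))         # 0.5
print(pass_to_the_k(results, 4))  # ~0.333: only 1 of 3 tasks passes all 4
```

Note how the flaky task counts toward pass@1 but not pass^4; this is why a model can score 52.56% pass@1 yet only 33.86% pass^4.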