MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Abstract

The Model Context Protocol (MCP) standardizes how large language models (LLMs) interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and so fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, substantially more than in previous MCP benchmarks, which highlights the stress-testing nature of MCPMark.
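The abstract reports two metrics, pass@1 and pass^4, without defining them. The sketch below shows one common reading, stated here as an assumption rather than as the authors' exact procedure: pass@1 is the per-run success rate averaged over tasks, and pass^k is the fraction of tasks solved in all k independent runs. The function names, the `results` layout, and the task identifiers are hypothetical.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average single-run success rate across tasks (assumed definition)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose k runs all succeed (assumed definition)."""
    for task, runs in results.items():
        if len(runs) != k:
            raise ValueError(f"task {task!r} has {len(runs)} runs, expected {k}")
    solved = [all(runs) for runs in results.values()]
    return sum(solved) / len(solved)


# Hypothetical outcomes: one verifier verdict per run, four runs per task.
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
}

print(f"pass@1 = {pass_at_1(results):.4f}")
print(f"pass^4 = {pass_hat_k(results, 4):.4f}")
```

Under this reading, pass^4 can never exceed pass@1, because a task that fails even one of its four runs counts fully against pass^4 but only partially against pass@1. This matches the direction of the gap between the two reported figures.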
