MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Abstract

The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, which distorts perceptions of agents' true operational value and makes it difficult to reliably differentiate their proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench, a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
