MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
September 10, 2025
Authors: Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
cs.AI
Abstract
The Model Context Protocol (MCP) is rapidly emerging as a pivotal open
standard, designed to enhance agent-tool integration and interoperability, and
is positioned to unlock a new era of powerful, interconnected, and genuinely
utilitarian agentic AI. However, despite MCP's growing adoption, existing
benchmarks often fail to capture real-world agent performance within this new
paradigm, leading to a distorted perception of agents' true operational value and
an inability to reliably differentiate proficiencies. To bridge this critical
evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark
specifically engineered to rigorously assess language agent capabilities in
MCP-mediated tool interactions. Core contributions of MCP-AgentBench include:
the establishment of a robust MCP testbed comprising 33 operational servers
with 188 distinct tools; the development of a benchmark featuring 600
systematically designed queries distributed across 6 distinct categories of
varying interaction complexity; and the introduction of MCP-Eval, a novel
outcome-oriented evaluation methodology prioritizing real-world task success.
Through extensive empirical evaluation of leading language agents, we provide
foundational insights. MCP-AgentBench aims to equip the research community with
a standardized and reliable framework to build, validate, and advance agents
capable of fully leveraging MCP's transformative benefits, thereby accelerating
progress toward truly capable and interoperable AI systems.