MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Abstract

The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely useful agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of agents' true operational value and an inability to reliably differentiate their proficiencies. To bridge this evaluation gap, we introduce MCP-AgentBench, a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology that prioritizes real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
