MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

July 17, 2025
Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong
cs.AI

Abstract

The rapid rise of intelligent agents based on Large Language Models (LLMs) underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source framework based on the Model Context Protocol (MCP) that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, integrates seamlessly with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
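To make the idea of automated trajectory evaluation concrete, here is a minimal sketch of the kind of metric such a pipeline might compute: scoring an agent's tool-call trajectory against a reference trajectory. All names (`tool_call_score`, the tool names, the argument schema) are illustrative assumptions, not MCPEval's actual API.

```python
# Hypothetical sketch: compare an agent's sequence of MCP-style tool calls
# against a reference trajectory and report the fraction reproduced exactly.
# Names and data shapes are illustrative, not taken from MCPEval itself.

def tool_call_score(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of expected tool calls the agent reproduced, position by position.

    Each call is a dict like {"tool": "search_flights", "args": {...}}.
    A call counts as correct only if both the tool name and the arguments match.
    """
    if not expected:
        return 1.0 if not actual else 0.0
    hits = sum(
        1
        for exp, act in zip(expected, actual)
        if exp["tool"] == act["tool"] and exp["args"] == act["args"]
    )
    return hits / len(expected)


reference = [
    {"tool": "search_flights", "args": {"from": "SFO", "to": "JFK"}},
    {"tool": "book_flight", "args": {"flight_id": "UA100"}},
]
agent_run = [
    {"tool": "search_flights", "args": {"from": "SFO", "to": "JFK"}},
    {"tool": "book_flight", "args": {"flight_id": "UA900"}},  # wrong flight id
]
print(tool_call_score(reference, agent_run))  # 0.5
```

A real MCP-based evaluator would obtain these trajectories by driving the agent against live MCP servers rather than hard-coded lists, but the scoring step reduces to comparisons of this shape.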