MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

July 17, 2025
Authors: Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong
cs.AI

Abstract

The rapid rise of intelligent agents based on Large Language Models (LLMs) underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source, Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval (https://github.com/SalesforceAIResearch/MCPEval) to promote reproducible and standardized LLM agent evaluation.
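
To make the pipeline concrete, below is a minimal sketch, in plain Python, of the kind of task-driven tool-call scoring such an MCP-based evaluator performs: generated tasks carry ground-truth tool calls, an agent is run against a set of tools, and its trajectory is scored. This is an illustration only, not MCPEval's actual API; `Task`, `Trajectory`, `toy_agent`, and the tool names are hypothetical stand-ins for the framework's generated tasks, LLM agents, and native MCP server tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    """A generated evaluation task with a ground-truth tool call."""
    prompt: str
    expected_tool: str

@dataclass
class Trajectory:
    """Record of what the agent actually did on one task."""
    task: Task
    tool_calls: List[str] = field(default_factory=list)

def evaluate(agent: Callable[[str, List[str]], str],
             tasks: List[Task],
             tools: Dict[str, Callable[[str], str]]) -> float:
    """Run the agent on each task and score tool-selection accuracy."""
    correct = 0
    for task in tasks:
        traj = Trajectory(task=task)
        chosen = agent(task.prompt, list(tools))  # agent picks a tool by name
        traj.tool_calls.append(chosen)
        tools[chosen](task.prompt)                # execute the chosen tool
        correct += int(chosen == task.expected_tool)
    return correct / len(tasks)

# Hypothetical stand-ins: two "MCP server" tools and a rule-based agent
# in place of a real LLM client.
tools = {
    "get_weather": lambda q: f"weather report for: {q}",
    "search_flights": lambda q: f"flight results for: {q}",
}

def toy_agent(prompt: str, tool_names: List[str]) -> str:
    return "get_weather" if "weather" in prompt.lower() else "search_flights"

tasks = [
    Task("What's the weather in Paris?", "get_weather"),
    Task("Find a flight to Tokyo.", "search_flights"),
]

print(f"tool-selection accuracy: {evaluate(toy_agent, tasks, tools):.2f}")
```

In the framework itself, per the abstract, the rule-based agent and stub tools would be replaced by real LLM clients and live MCP servers, and scoring extends beyond exact tool-name matching to the standardized, domain-specific metrics the paper reports.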