MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

Abstract

The rapid rise of intelligent agents based on Large Language Models (LLMs) underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, which limits practical assessment. We introduce MCPEval, an open-source framework based on the Model Context Protocol (MCP) that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, integrates seamlessly with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.

