LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

Abstract

With the rapid development of the Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, which hinders effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments and achieves 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely used models perform poorly in LiveMCPBench's complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
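To make the two headline numbers concrete, the following is a minimal sketch of how a task success rate and a judge-versus-human agreement rate could be computed. It is not the authors' evaluation code. The function names `success_rate` and `agreement_rate` are hypothetical, the agreement measure is assumed to be a simple per-item match rate, and the split of 75 successes out of 95 tasks is an inference from the reported 78.95%, not a figure stated in the abstract.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Return the percentage of items where the LLM judge and the human agree.

    Assumption: agreement is a plain per-item match rate; the source does not
    specify the exact measure.
    """
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return 100.0 * matches / len(judge)


if __name__ == "__main__":
    # Hypothetical outcome vector: 75 successes out of 95 tasks (inferred, see above).
    outcomes = [True] * 75 + [False] * 20
    print(f"success rate: {success_rate(outcomes):.2f}%")  # prints "success rate: 78.95%"

    # Made-up judge and human labels, for illustration only.
    judge_labels = [True, False, True, True]
    human_labels = [True, True, True, False]
    print(f"agreement: {agreement_rate(judge_labels, human_labels):.1f}%")  # prints "agreement: 50.0%"
```

With only 95 tasks, each task moves the success rate by roughly 1.05 percentage points, which is worth keeping in mind when comparing models whose scores are close.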