MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) to external data sources and tools, and it is rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture the challenges of real applications, such as long-horizon reasoning and large, unfamiliar tool spaces. To address this gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic, hard tasks through interaction with real-world MCP servers. Our benchmark covers 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators: format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even state-of-the-art models such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, because the number of input tokens grows rapidly with the number of interaction steps. It also introduces an unknown-tools challenge, because LLM agents are often unfamiliar with the precise usage of the MCP servers. Notably, enterprise-level agents such as Cursor do not outperform standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers and fostering innovation in the rapidly evolving MCP ecosystem.
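
To make the three kinds of execution-based evaluators more concrete, the following is a minimal Python sketch of how they might differ. It is not the MCP-Universe framework's API: the function names, the JSON answer format, and the 1% tolerance are all assumptions made for illustration only.

```python
import json
from typing import Callable, Set


def format_evaluator(response: str, required_keys: Set[str]) -> bool:
    """Format check (hypothetical): the answer must be a JSON object
    that contains every required key."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def static_evaluator(response: str, key: str, expected: str) -> bool:
    """Static check (hypothetical): compare one field against a fixed,
    time-invariant ground truth, ignoring case and outer whitespace."""
    value = str(json.loads(response).get(key, ""))
    return value.strip().lower() == expected.strip().lower()


def dynamic_evaluator(
    response: str,
    key: str,
    fetch_truth: Callable[[], float],
    rel_tol: float = 0.01,  # assumed tolerance, not taken from the paper
) -> bool:
    """Dynamic check (hypothetical): fetch the ground truth at evaluation
    time and accept the answer if it is within a relative tolerance."""
    truth = fetch_truth()
    value = float(json.loads(response)[key])
    return abs(value - truth) <= rel_tol * abs(truth)
```

In this sketch, a task would count as solved only if every evaluator attached to it returns True. The `fetch_truth` callable stands in for whatever real-time source a time-sensitive task depends on, such as a current stock price.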
