MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
August 20, 2025
Authors: Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
cs.AI
Abstract
The Model Context Protocol has emerged as a transformative standard for
connecting large language models to external data sources and tools, rapidly
gaining adoption across major AI providers and development platforms. However,
existing benchmarks are overly simplistic and fail to capture real application
challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To
address this critical gap, we introduce MCP-Universe, the first comprehensive
benchmark specifically designed to evaluate LLMs on realistic and challenging tasks
through interaction with real-world MCP servers. Our benchmark encompasses 6
core domains spanning 11 different MCP servers: Location Navigation, Repository
Management, Financial Analysis, 3D Design, Browser Automation, and Web
Searching. To ensure rigorous evaluation, we implement execution-based
evaluators, including format evaluators for agent format compliance, static
evaluators for time-invariant content matching, and dynamic evaluators that
automatically retrieve real-time ground truth for temporally sensitive tasks.
Through extensive evaluation of leading LLMs, we find that even SOTA models
such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) exhibit
significant performance limitations. In addition, our benchmark poses a
significant long-context challenge for LLM agents, as the number of input
tokens increases rapidly with the number of interaction steps. Moreover, it
introduces an unknown-tools challenge, as LLM agents often lack familiarity
with the precise usage of the MCP servers. Notably, enterprise-level agents
such as Cursor do not outperform the standard ReAct framework.
Beyond evaluation, we open-source our extensible evaluation framework with UI
support, enabling researchers and practitioners to seamlessly integrate new
agents and MCP servers while fostering innovation in the rapidly evolving MCP
ecosystem.
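
The abstract describes three kinds of execution-based evaluators: format evaluators, static evaluators, and dynamic evaluators. The sketch below illustrates that three-way split in Python; every name here (the `AgentResult` container, the `Evaluator` classes, and the string-matching logic) is a hypothetical illustration, not the actual MCP-Universe API.

```python
# Illustrative sketch of the three evaluator types named in the abstract.
# All class and function names are hypothetical; the real MCP-Universe
# framework may structure its evaluators differently.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Final answer and raw interaction trajectory produced by an LLM agent."""
    answer: str
    trajectory: list


class Evaluator(ABC):
    @abstractmethod
    def score(self, result: AgentResult) -> bool:
        ...


class FormatEvaluator(Evaluator):
    """Checks that the agent's answer complies with the required output format."""
    def __init__(self, required_prefix: str = "ANSWER:"):
        self.required_prefix = required_prefix

    def score(self, result: AgentResult) -> bool:
        return result.answer.strip().startswith(self.required_prefix)


class StaticEvaluator(Evaluator):
    """Matches time-invariant content against a fixed ground truth."""
    def __init__(self, ground_truth: str):
        self.ground_truth = ground_truth

    def score(self, result: AgentResult) -> bool:
        return self.ground_truth.lower() in result.answer.lower()


class DynamicEvaluator(Evaluator):
    """Retrieves real-time ground truth at evaluation time for
    temporally sensitive tasks (e.g., a current stock price)."""
    def __init__(self, fetch_ground_truth):
        self.fetch_ground_truth = fetch_ground_truth  # callable returning str

    def score(self, result: AgentResult) -> bool:
        live_truth = self.fetch_ground_truth()
        return live_truth.lower() in result.answer.lower()
```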
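
The abstract also notes that input tokens grow rapidly with the number of interaction steps and that enterprise agents such as Cursor did not beat a standard ReAct loop. The minimal ReAct-style loop below shows where that growth comes from: every tool call and its observation are appended to the running context. `call_llm` and `call_mcp_tool` are placeholders for an LLM chat API and an MCP client call; they are assumptions for illustration, not functions from the benchmark framework.

```python
# Minimal ReAct-style agent loop, illustrating how the running context (and
# hence the input token count) grows with every interaction step.
# `call_llm` and `call_mcp_tool` are placeholders, not part of MCP-Universe.
import json


def call_llm(messages: list) -> dict:
    """Placeholder for an LLM chat-completion call.
    Returns either {'tool': name, 'arguments': {...}} or {'answer': text}."""
    raise NotImplementedError


def call_mcp_tool(name: str, arguments: dict) -> str:
    """Placeholder for invoking a tool exposed by an MCP server."""
    raise NotImplementedError


def react_agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(messages)          # Reason: choose a tool or answer
        if "answer" in decision:
            return decision["answer"]
        observation = call_mcp_tool(decision["tool"], decision["arguments"])
        # Act + Observe: both the tool call and its (possibly long) result are
        # appended to the context, so input tokens grow at every step.
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "No answer within the step budget."
```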