LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools, including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
