LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
