LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
