When Tools Stop Working: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times more slowly than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
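
To make the 2×2 perturbation taxonomy concrete, the following is a minimal illustrative sketch, not code from the ToolMaze repository. The names `Visibility`, `Persistence`, `PerturbedTool`, and `transient_failures`, the toy "increment" tool, and the choice to model an explicit failure as a raised exception and an implicit failure as a silently wrong return value are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    EXPLICIT = "explicit"   # assumed: the tool signals the failure (raises an error)
    IMPLICIT = "implicit"   # assumed: the tool returns a plausible but corrupted output


class Persistence(Enum):
    TRANSIENT = "transient"  # the failure clears after a few calls
    PERMANENT = "permanent"  # the failure never clears


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that perturbs a toy 'increment' tool."""
    visibility: Visibility
    persistence: Persistence
    transient_failures: int = 1  # hypothetical: number of failing calls before recovery
    calls: int = 0

    def _failing(self) -> bool:
        if self.persistence is Persistence.PERMANENT:
            return True
        return self.calls <= self.transient_failures

    def __call__(self, x: int) -> int:
        self.calls += 1
        correct = x + 1  # the unperturbed tool behaviour
        if not self._failing():
            return correct
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        return correct + 1000  # silently wrong, no error raised


# The four cells of the 2x2 taxonomy.
TAXONOMY = list(itertools.product(Visibility, Persistence))
```

In this toy setting, a simple retry is enough for the explicit/transient cell, whereas the implicit cells return without any error signal, so an agent would have to check the output itself before deciding to retry or take another path.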