When Tools Fail: Benchmarking Dynamic Replanning and Exception Recovery in LLM Agents

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.