When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
June 4, 2026
Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
cs.AI
Abstract
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
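The 2×2 perturbation taxonomy named in the abstract can be pictured as the cross product of two binary axes. The following is only an illustrative sketch, not ToolMaze's actual code: the class names, the helper functions, and the one-line reading of each category in the comments are assumptions drawn from the abstract's wording, and the real definitions live in the linked repository.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Visibility(Enum):
    # Assumed reading: the failure surfaces as an error the agent can observe.
    EXPLICIT = "explicit"
    # Assumed reading: the tool returns a plausible-looking but corrupted output.
    IMPLICIT = "implicit"


class Persistence(Enum):
    # Assumed reading: the fault may clear, so a later retry can succeed.
    TRANSIENT = "transient"
    # Assumed reading: the tool stays broken, so the agent needs another path.
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Perturbation:
    """One cell of the hypothetical 2x2 grid."""
    visibility: Visibility
    persistence: Persistence


def all_perturbations() -> list[Perturbation]:
    """Enumerate the four cells of the taxonomy."""
    return [Perturbation(v, p) for v, p in product(Visibility, Persistence)]


def retry_may_help(p: Perturbation) -> bool:
    """Under the assumptions above, retrying only makes sense for transient faults."""
    return p.persistence is Persistence.TRANSIENT


if __name__ == "__main__":
    for cell in all_perturbations():
        print(cell.visibility.value, cell.persistence.value, retry_may_help(cell))
```

Modeling the two axes as separate enums keeps them independent, which mirrors the abstract's point that the implicit cases are hard for a different reason than the permanent ones: the first is a detection problem, the second a replanning problem.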