When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

June 4, 2026
Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
cs.AI

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale 3.66 times more slowly than basic task execution, highlighting dynamic replanning as a distinct bottleneck that model scaling and prompting alone do not address. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
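
To make the 2×2 perturbation taxonomy concrete, the following is a minimal sketch of how a tool could be wrapped to fail along the two axes named in the abstract (explicit/implicit, transient/permanent). It is not taken from the ToolMaze code base: the names `Visibility`, `Duration`, `PerturbedTool`, `transient_failures`, and the toy `double` tool are all hypothetical, and the "off by one" corruption is only a stand-in for an implicit semantic failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # the failure surfaces as a visible error
    IMPLICIT = "implicit"  # the tool silently returns a corrupted value


class Duration(Enum):
    TRANSIENT = "transient"  # the tool recovers after a few calls
    PERMANENT = "permanent"  # the tool fails on every call


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one of the four failure types."""

    tool: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # assumed: failing calls before recovery
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        failing = (
            self.duration is Duration.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not failing:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            # Explicit failure: the agent can see that something went wrong.
            raise RuntimeError("tool unavailable")
        # Implicit failure: a plausible but wrong result, with no error signal.
        return self.tool(x) + 1


def double(x: int) -> int:
    """Toy tool used only for illustration."""
    return 2 * x
```

Under these assumptions, an explicit transient failure can be handled by retrying, an explicit permanent failure requires finding another path through the task graph, and an implicit failure must first be detected, since the wrapper raises nothing. This mirrors the abstract's observation that implicit failures are the hardest case, because an agent that trusts the returned value never starts recovery at all.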