When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in Large Language Model Agents

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systematic over-trust in corrupted outputs, the Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale about 3.66 times more slowly than basic task execution, highlighting dynamic replanning as a distinct bottleneck that neither model scaling nor prompting alone resolves. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
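
As an illustration only, the 2×2 perturbation taxonomy (explicit/implicit, transient/permanent) can be sketched as a wrapper around a tool function. This is not the ToolMaze implementation: the names `Visibility`, `Persistence`, `PerturbedTool`, `corrupt`, and `transient_failures` are hypothetical, and the failure behavior (raising `RuntimeError` for explicit failures, returning a silently altered value for implicit ones) is an assumption made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Any, Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # the tool signals the failure (here: raises an error)
    IMPLICIT = "implicit"  # the tool returns a plausible but corrupted output


class Persistence(Enum):
    TRANSIENT = "transient"  # the failure clears after a few calls
    PERMANENT = "permanent"  # the failure never clears


# The four cells of the 2x2 taxonomy.
TAXONOMY = list(product(Visibility, Persistence))


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one perturbation type into a tool."""
    fn: Callable[..., Any]
    visibility: Visibility
    persistence: Persistence
    # Assumed corruption: by default the output is replaced with None.
    corrupt: Callable[[Any], Any] = lambda value: None
    # Assumed: number of initial calls that fail when the fault is transient.
    transient_failures: int = 1
    calls: int = 0

    def __call__(self, *args: Any) -> Any:
        self.calls += 1
        failing = (
            self.persistence is Persistence.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not failing:
            return self.fn(*args)
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        # Implicit failure: no error is raised, the output is silently wrong.
        return self.corrupt(self.fn(*args))
```

Under these assumptions, an explicit transient fault can be survived by simply retrying, whereas an implicit permanent fault returns wrong values with no error signal, so an agent has to validate the output and route around the tool; this matches the abstract's observation that implicit failures are the hardest case.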