

Where LLM Agents Fail and How They Can Learn From Failures

September 29, 2025
Authors: Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You
cs.AI

Abstract

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework for comprehensively understanding agent errors in a modular and systematic way, and therefore cannot reliably detect them. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug.
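To make the described workflow concrete, below is a minimal, hypothetical Python sketch of the loop the abstract outlines: a modular error taxonomy, root-cause isolation over a failed trajectory, and corrective feedback fed into the next attempt. All names here (ErrorModule, StepError, find_root_cause, debug_and_retry, run_agent) are illustrative assumptions, not the paper's actual API or released code.

```python
# Hypothetical sketch of a taxonomy-guided debug-and-retry loop, inspired by
# the abstract's description of AgentErrorTaxonomy and AgentDebug. It is not
# the authors' implementation; names and heuristics are assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class ErrorModule(Enum):
    """Modules an error can originate from, mirroring the taxonomy's levels."""
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"


@dataclass
class StepError:
    """An annotated error at a single step of an agent trajectory."""
    step: int
    module: ErrorModule
    description: str


def find_root_cause(errors: List[StepError]) -> Optional[StepError]:
    """Heuristic stand-in: treat the earliest annotated error as the root
    cause, since later errors are assumed to cascade from it."""
    return min(errors, key=lambda e: e.step) if errors else None


def debug_and_retry(
    run_agent: Callable[[Optional[str]], Tuple[bool, List[StepError]]],
    max_rounds: int = 3,
) -> bool:
    """Run the agent, isolate the root-cause failure, and feed corrective
    feedback into the next attempt, up to max_rounds times."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        success, errors = run_agent(feedback)
        if success:
            return True
        root = find_root_cause(errors)
        if root is None:
            return False  # failed but no annotated error to act on
        feedback = (
            f"At step {root.step}, the {root.module.value} module failed: "
            f"{root.description}. Revise your next attempt to avoid this."
        )
    return False
```

In this sketch, `run_agent` stands in for any rollout function that executes a task (optionally conditioned on feedback) and returns whether it succeeded along with per-step error annotations; the retry loop mirrors the iterative recovery behavior the abstract reports.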