Where LLM Agents Fail and How They can Learn From Failures

Abstract

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, in which a single root-cause error propagates through subsequent decisions and leads to task failure. Current systems lack a framework for understanding agent errors in a modular and systematic way, and therefore fail to detect them. We address this gap with three contributions. First, we introduce AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy than the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvement in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
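To make the idea of a modular taxonomy and a root-cause step concrete, the following is a minimal illustrative sketch. It is not the AgentDebug implementation. The names `Module`, `StepAnnotation`, and `earliest_error` are hypothetical, and treating the earliest flagged step as the root-cause candidate is a simplifying assumption. Only the five module names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Module(Enum):
    # The five module families named in the abstract.
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"


@dataclass(frozen=True)
class StepAnnotation:
    # Hypothetical record: one annotated step of a failed trajectory.
    step: int
    module: Module
    is_error: bool


def earliest_error(annotations: Sequence[StepAnnotation]) -> Optional[StepAnnotation]:
    """Return the flagged annotation with the smallest step index, or None.

    Assumption: in a cascading failure, the earliest flagged step is a
    reasonable root-cause candidate. The real framework may decide differently.
    """
    flagged = [a for a in annotations if a.is_error]
    return min(flagged, key=lambda a: a.step) if flagged else None


# Usage: a made-up trajectory where a memory error at step 2 precedes
# a downstream action error at step 3.
trajectory = [
    StepAnnotation(step=1, module=Module.PLANNING, is_error=False),
    StepAnnotation(step=2, module=Module.MEMORY, is_error=True),
    StepAnnotation(step=3, module=Module.ACTION, is_error=True),
]
root = earliest_error(trajectory)
```

Under this assumption, corrective feedback would target the step held in `root` rather than the later errors that follow from it.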
