Where LLM Agents Fail and How They can Learn From Failures

Abstract

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, in which a single root-cause error propagates through subsequent decisions and leads to task failure. Current systems lack a framework for understanding agent errors in a modular and systemic way, and as a result fail to detect them. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy than the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvement in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
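
To make the notions of a modular taxonomy and root-cause isolation concrete, the following is a minimal Python sketch, not the AgentDebug implementation. The five module names come from the abstract; the `StepAnnotation` record, the `root_cause` helper, and the rule that the earliest annotated error counts as the root cause are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Module(Enum):
    """The five module categories named in the abstract."""
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"


@dataclass(frozen=True)
class StepAnnotation:
    """Hypothetical per-step label: which module erred at this step, if any."""
    step: int
    error_module: Optional[Module] = None


def root_cause(trajectory: Sequence[StepAnnotation]) -> Optional[StepAnnotation]:
    """Return the earliest annotated error in the trajectory.

    Assumption for illustration only: later errors are treated as
    downstream effects of the first one (a cascading failure).
    """
    errors = [s for s in trajectory if s.error_module is not None]
    return min(errors, key=lambda s: s.step) if errors else None


# A made-up failed trajectory: a planning error at step 2 is followed
# by an action error at step 3.
trajectory = [
    StepAnnotation(step=1),
    StepAnnotation(step=2, error_module=Module.PLANNING),
    StepAnnotation(step=3, error_module=Module.ACTION),
]
first_error = root_cause(trajectory)
```

Under these assumptions, `first_error` points at step 2 and the planning module, so corrective feedback would target that step rather than the later action error.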
