Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Abstract

Failure attribution in LLM multi-agent systems (identifying the agent and the step responsible for a task failure) provides crucial clues for system debugging, but it remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective strengths and weaknesses. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, and some methods perform below random guessing. Even state-of-the-art reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution.
