Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Abstract

Failure attribution in LLM multi-agent systems, that is, identifying the agent and the step responsible for a task failure, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking each failure to a specific agent and a decisive error step. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective strengths and weaknesses. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, and some methods perform below random. Even state-of-the-art reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
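The two figures reported above can be read as an agent-level and a step-level accuracy. The following is a minimal sketch of how such scores might be computed, assuming each prediction and each annotation reduces to an (agent, step) pair and that a step counts as correct only on an exact index match. The `Attribution` type, the `attribution_accuracy` helper, and the agent names are hypothetical illustrations, not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: who is blamed for the failure and at which step."""
    agent: str  # name of the failure-responsible agent
    step: int   # index of the decisive error step in the failure log


def attribution_accuracy(predictions, labels):
    """Return (agent_accuracy, step_accuracy) over paired predictions and labels.

    Assumption: a step prediction is correct only if it matches the annotated
    step index exactly; the paper may define the metric differently.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled failure log is required")

    total = len(labels)
    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p.step == g.step for p, g in zip(predictions, labels))
    return agent_hits / total, step_hits / total


# Toy data with made-up agent names, for illustration only.
labels = [
    Attribution("planner", 3),
    Attribution("coder", 7),
    Attribution("planner", 1),
    Attribution("searcher", 4),
]
predictions = [
    Attribution("planner", 3),
    Attribution("coder", 5),
    Attribution("coder", 1),
    Attribution("searcher", 9),
]

agent_acc, step_acc = attribution_accuracy(predictions, labels)
print(f"agent-level accuracy: {agent_acc:.2f}")
print(f"step-level accuracy: {step_acc:.2f}")
```

Scoring the two levels separately mirrors the gap reported in the abstract: a method can often name the right agent while still missing the exact step where the failure became unavoidable.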
