Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Abstract

Failure attribution in LLM multi-agent systems (identifying the agent and the step responsible for a task failure) provides crucial clues for system debugging, but it remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, which comprises extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking each failure to a specific agent and a decisive error step. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, and some methods perform below random chance. Even state-of-the-art reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
