Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Abstract

Failure attribution in LLM multi-agent systems, that is, identifying the agent and the step responsible for a task failure, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations that link each failure to a specific agent and a decisive error step. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, and some methods perform below random. Even state-of-the-art reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
