Why Do Multi-Agent LLM Systems Fail?
March 17, 2025
Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
cs.AI
Abstract
Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM
agents collaborate to accomplish tasks, their performance gains across popular
benchmarks remain minimal compared to single-agent frameworks. This gap
highlights the need to analyze the challenges hindering MAS effectiveness.
In this paper, we present the first comprehensive study of MAS challenges. We
analyze five popular MAS frameworks across over 150 tasks, involving six expert
human annotators. We identify 14 unique failure modes and propose a
comprehensive taxonomy applicable to various MAS frameworks. This taxonomy
emerges iteratively from agreements among three expert annotators per study,
achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes are
organized into three categories: (i) specification and system design failures,
(ii) inter-agent misalignment, and (iii) task verification and termination. To
support scalable evaluation, we integrate this taxonomy, the Multi-Agent System
Failure Taxonomy (MASFT), with LLM-as-a-Judge. We also explore whether the
identified failures can be easily prevented, proposing two interventions:
improved specification of agent roles and enhanced orchestration strategies.
Our findings reveal that the identified failures require more complex
solutions, highlighting a clear roadmap for future research. We open-source our
dataset and LLM annotator.
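
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of two-rater Cohen's Kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name, the label strings, and the sample annotations are hypothetical and are not taken from the paper's dataset. The abstract does not say how agreement among three annotators was aggregated, so this shows only the standard pairwise definition.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Two-rater Cohen's Kappa for equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of ten traces into the three top-level categories.
rater_1 = ["spec", "spec", "misalign", "misalign", "verify",
           "verify", "spec", "misalign", "verify", "spec"]
rater_2 = ["spec", "spec", "misalign", "misalign", "verify",
           "verify", "spec", "misalign", "verify", "verify"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the raters disagree on one of ten items, so p_o = 0.9 and p_e = 0.33. Kappa discounts the raw agreement by the chance term, which is why it comes out lower than 0.9.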