Why Do Multi-Agent LLM Systems Fail?

March 17, 2025
Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

Despite growing enthusiasm for Multi-Agent Systems (MAS), in which multiple LLM agents collaborate to accomplish tasks, their performance gains on popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across more than 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy, the Multi-Agent System Failure Taxonomy (MASFT), applicable to various MAS frameworks. This taxonomy emerges iteratively from agreements among three expert annotators per study, achieving a Cohen's Kappa score of 0.88. These fine-grained failure modes are organized into three categories: (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with LLM-as-a-Judge. We also explore whether the identified failures can be easily prevented, proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that the identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.
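For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Cohen's Kappa is computed for a pair of annotators. It is not the authors' code: the function name `cohens_kappa`, the label strings, and the two annotation lists are hypothetical, invented purely for illustration, and the sketch says nothing about how the paper aggregated agreement across its three annotators.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labelled at random according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations over the three failure categories (made-up data).
annotator_1 = ["spec", "spec", "misalign", "verify",
               "misalign", "verify", "spec", "misalign"]
annotator_2 = ["spec", "spec", "misalign", "verify",
               "misalign", "spec", "spec", "misalign"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"{kappa:.3f}")  # 0.805
```

In this toy example the annotators agree on 7 of 8 items (observed agreement 0.875), while chance agreement is 23/64, giving a Kappa of 33/41. Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than 0.875 here.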
