DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
December 7, 2025
Authors: Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI
Abstract
Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework that augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
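The intervene-then-verify loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Hypothesis` structure, the message-patching intervention, and the `rerun` callback are all assumed interfaces standing in for DoVer's actual hypothesis generator and agent-replay machinery.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    """A failure hypothesis paired with a targeted intervention (assumed shape)."""
    step: int              # index of the suspect message in the interaction trace
    patched_message: str   # proposed replacement message (the intervention)

def debug_by_intervention(
    trace: List[str],
    hypotheses: List[Hypothesis],
    rerun: Callable[[List[str]], bool],  # replays the system from a patched trace; True = task succeeds
) -> List[Hypothesis]:
    """Return every hypothesis whose intervention flips the failed run to success.

    Unlike log-only attribution, each hypothesis is actively verified by
    applying its intervention and re-running; several distinct hypotheses
    may be validated independently, as the abstract observes.
    """
    validated = []
    for h in hypotheses:
        patched = trace.copy()
        patched[h.step] = h.patched_message  # targeted intervention: edit one message
        if rerun(patched):                   # active verification, not an untested guess
            validated.append(h)
    return validated

# Toy usage with a stubbed rerun: the task "succeeds" iff no message is wrong.
trace = ["plan: fetch page", "result: wrong url", "final: fail"]
hypotheses = [
    Hypothesis(step=1, patched_message="result: correct url"),
    Hypothesis(step=0, patched_message="plan: do nothing"),
]
rerun = lambda t: all("wrong" not in m for m in t)
validated = debug_by_intervention(trace, hypotheses, rerun)
```

Here only the first hypothesis survives verification: patching step 1 removes the faulty message, while patching step 0 leaves the failure in place, so that hypothesis is refuted rather than merely suspected.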