

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

December 7, 2025
Authors: Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
cs.AI

Abstract

Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
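The core loop the abstract describes, generating failure hypotheses, applying a targeted intervention (such as editing a message in the trace), replaying the task, and marking each hypothesis as validated or refuted by the outcome, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `Hypothesis` class, the `dover_loop` function, and the callback signatures are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Hypothesis:
    """A hypothesized failure cause, paired with a targeted intervention
    that edits the recorded interaction trace (e.g., rewrites a message)."""
    agent: str
    step: int
    description: str
    intervention: Callable[[List[str]], List[str]]

def dover_loop(
    trace: List[str],
    hypotheses: List[Hypothesis],
    rerun: Callable[[List[str]], bool],
) -> List[Tuple[Hypothesis, str]]:
    """Intervention-driven debugging sketch: for each hypothesis, apply its
    intervention to a copy of the trace, replay the task, and record whether
    the intervention flips the failed trial into a success."""
    results = []
    for h in hypotheses:
        edited = h.intervention(list(trace))  # copy, then edit the trace
        success = rerun(edited)               # replay task from edited trace
        results.append((h, "validated" if success else "refuted"))
    return results

# Toy usage: an arithmetic slip at step 1 is hypothesized and repaired.
trace = ["planner: add 2 and 2", "solver: result is 5"]
fix = Hypothesis(
    agent="solver", step=1, description="arithmetic slip",
    intervention=lambda t: t[:1] + ["solver: result is 4"],
)
rerun = lambda t: "result is 4" in t[-1]  # stand-in for a real task replay
print(dover_loop(trace, [fix], rerun))
```

Because several distinct interventions may independently repair the same failed task, a loop like this naturally reports multiple validated hypotheses, which is the outcome-oriented view of debugging the paper argues for over single-step attribution.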