DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Abstract

Large language model (LLM)-based multi-agent systems are difficult to debug because failures often arise from long, branching interaction traces. The prevailing practice is to use LLMs for log-based failure localization, attributing errors to a specific agent and step. This paradigm has two key limitations: (i) log-only debugging lacks validation and produces untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, because we find that multiple distinct interventions can each independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework that augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). To address the second, rather than evaluating attribution accuracy, we measure whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magentic-One agent framework, on datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer is also effective on a different dataset (GSMPlus) and a different agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
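The idea of verifying a failure hypothesis by intervening and re-running can be illustrated with a toy example. The sketch below is not DoVer's implementation: the names Step, Hypothesis, intervene, verify, toy_replay, and toy_success are all hypothetical, the two-agent "system" is a deterministic stand-in for a real multi-agent framework, and the two-way validated/refuted labeling is a simplifying assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    agent: str
    message: str


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical failure hypothesis: editing this step's message repairs the task."""
    step_index: int
    new_message: str


def intervene(trace: List[Step], hyp: Hypothesis) -> List[Step]:
    """Return a copy of the trace up to the suspected step, with that message edited."""
    prefix = list(trace[: hyp.step_index + 1])
    prefix[hyp.step_index] = replace(prefix[hyp.step_index], message=hyp.new_message)
    return prefix


def verify(
    trace: List[Step],
    hyp: Hypothesis,
    replay: Callable[[List[Step]], List[Step]],
    is_success: Callable[[List[Step]], bool],
) -> str:
    """Apply the intervention, re-run the remaining steps, and judge by outcome."""
    new_trace = replay(intervene(trace, hyp))
    return "validated" if is_success(new_trace) else "refuted"


def toy_replay(prefix: List[Step]) -> List[Step]:
    """Toy system (assumption): a solver applies the planner's instruction to 3 and 4."""
    steps = list(prefix)
    if len(steps) == 1:  # only the planner step is fixed, so re-run the solver
        plan = steps[0].message
        result = 3 * 4 if plan == "multiply" else 3 + 4
        steps.append(Step("solver", str(result)))
    return steps


def toy_success(trace: List[Step]) -> bool:
    """The toy task expects the final answer 12."""
    return trace[-1].message == "12"


# A failed trial: the planner chose the wrong operation.
failed = [Step("planner", "add"), Step("solver", "7")]

fix_plan = verify(failed, Hypothesis(0, "multiply"), toy_replay, toy_success)
fix_answer = verify(failed, Hypothesis(1, "12"), toy_replay, toy_success)
bad_guess = verify(failed, Hypothesis(0, "subtract"), toy_replay, toy_success)
```

In this toy setting, editing the planner's message and editing the solver's message both repair the trial, which mirrors the observation that attribution to a single step can be ill-posed, while the third hypothesis is refuted because the re-run still fails.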
