Agent-as-a-Judge: Evaluate Agents with Agents

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, including a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems by providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
