Agent-as-a-Judge: Evaluate Agents with Agents
October 14, 2024
Authors: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI
Abstract
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes, ignoring the
step-by-step nature of agentic systems, or require excessive manual labour. To
address this, we introduce the Agent-as-a-Judge framework, wherein agentic
systems are used to evaluate agentic systems. This is an organic extension of
the LLM-as-a-Judge framework, incorporating agentic features that enable
intermediate feedback for the entire task-solving process. We apply the
Agent-as-a-Judge to the task of code generation. To overcome issues with
existing benchmarks and provide a proof-of-concept testbed for
Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated
AI development tasks. It includes rich manual annotations, such as a total of 365
hierarchical user requirements. We benchmark three popular agentic
systems using Agent-as-a-Judge and find it dramatically outperforms
LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether,
we believe that Agent-as-a-Judge marks a concrete step forward for modern
agentic systems -- by providing rich and reliable reward signals necessary for
dynamic and scalable self-improvement.
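To make the idea concrete, below is a minimal, hypothetical Python sketch of an Agent-as-a-Judge-style loop. It is not the authors' implementation: the `Requirement`, `gather_evidence`, and `judge_task` names are illustrative, and `ask_model` stands in for whatever LLM or agent backend supplies a yes/no verdict. The sketch only illustrates the mechanism described in the abstract: judging each hierarchical requirement against evidence from the developer agent's workspace, so the evaluation yields intermediate per-requirement feedback rather than a single final verdict.

```python
# Hypothetical sketch of an Agent-as-a-Judge-style evaluation loop.
# Assumes a DevAI-style task: hierarchical requirements (listed with
# prerequisites before dependents) and a workspace produced by the
# developer agent being judged.

from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Requirement:
    rid: str                                      # e.g. "R1.2"
    text: str                                     # natural-language requirement
    prerequisites: List[str] = field(default_factory=list)


def gather_evidence(workspace: Path, max_chars: int = 4000) -> str:
    """Collect a truncated view of the agent's workspace (file names + snippets)."""
    pieces = []
    for f in sorted(workspace.rglob("*")):
        if f.is_file():
            try:
                pieces.append(f"# {f.relative_to(workspace)}\n{f.read_text()[:500]}")
            except (UnicodeDecodeError, OSError):
                pieces.append(f"# {f.relative_to(workspace)} (unreadable)")
    return "\n\n".join(pieces)[:max_chars]


def judge_task(
    requirements: List[Requirement],
    workspace: Path,
    ask_model: Callable[[str], bool],
) -> Dict[str, bool]:
    """Judge each requirement in order, producing intermediate per-requirement verdicts."""
    verdicts: Dict[str, bool] = {}
    evidence = gather_evidence(workspace)
    for req in requirements:
        # A requirement can only be satisfied if its prerequisites already are.
        if not all(verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False
            continue
        prompt = (
            f"Requirement {req.rid}: {req.text}\n\n"
            f"Evidence from the agent's workspace:\n{evidence}\n\n"
            "Is the requirement satisfied? Answer yes or no."
        )
        verdicts[req.rid] = ask_model(prompt)
    return verdicts
```

A trivial stub such as `judge_task(reqs, Path("workspace"), ask_model=lambda p: "accuracy" in p)` runs the loop end to end; in practice the callable would wrap a judging agent that can also browse files and run code on its own, which is what distinguishes Agent-as-a-Judge from a plain LLM-as-a-Judge prompt.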