

Agent-as-a-Judge: Evaluate Agents with Agents

October 14, 2024
Authors: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, such as a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems by providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
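
To make "hierarchical user requirements" and "as reliable as our human evaluation baseline" more concrete, here is a minimal Python sketch of one way such an evaluation could be scored. Everything in it is an assumption for illustration: the `Requirement` class, the rule that a requirement only counts as met when its prerequisites are met, the `agreement_rate` measure, and the three-requirement task are hypothetical and are not taken from DevAI or from the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement; `depends_on` lists prerequisite requirement ids."""
    rid: str
    text: str
    depends_on: list = field(default_factory=list)


def effective_verdicts(reqs, verdicts):
    """Count a requirement as met only if it and all its prerequisites are met.

    Assumes `reqs` is ordered so prerequisites appear before their dependents.
    """
    met = {}
    for r in reqs:
        met[r.rid] = verdicts[r.rid] and all(met[d] for d in r.depends_on)
    return met


def agreement_rate(a, b):
    """Fraction of requirements on which two sets of verdicts agree."""
    return sum(a[k] == b[k] for k in a) / len(a)


# Hypothetical task with three chained requirements (not from DevAI).
reqs = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train a classifier", depends_on=["R0"]),
    Requirement("R2", "Save accuracy to a results file", depends_on=["R1"]),
]

# Hypothetical per-requirement verdicts from an automated judge and a human.
judge = {"R0": True, "R1": False, "R2": True}
human = {"R0": True, "R1": False, "R2": False}

judge_met = effective_verdicts(reqs, judge)
human_met = effective_verdicts(reqs, human)

print("raw agreement:", agreement_rate(judge, human))
print("agreement with dependencies:", agreement_rate(judge_met, human_met))
```

Scoring per requirement rather than per task is what allows intermediate feedback: a judge can report which step of a chain failed instead of a single pass/fail for the final output.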
