Agent-as-a-Judge: Evaluate Agents with Agents
October 14, 2024
Authors: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI
Abstract
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes, ignoring the
step-by-step nature of agentic systems, or require excessive manual labour. To
address this, we introduce the Agent-as-a-Judge framework, wherein agentic
systems are used to evaluate agentic systems. This is an organic extension of
the LLM-as-a-Judge framework, incorporating agentic features that enable
intermediate feedback for the entire task-solving process. We apply the
Agent-as-a-Judge to the task of code generation. To overcome issues with
existing benchmarks and provide a proof-of-concept testbed for
Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated
AI development tasks. It includes rich manual annotations, such as a total of 365
hierarchical user requirements. We benchmark three popular agentic
systems using Agent-as-a-Judge and find it dramatically outperforms
LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether,
we believe that Agent-as-a-Judge marks a concrete step forward for modern
agentic systems -- by providing rich and reliable reward signals necessary for
dynamic and scalable self-improvement.
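To make the idea concrete, below is a minimal, hypothetical Python sketch of an Agent-as-a-Judge-style loop. It is not the authors' implementation: the `Requirement`, `gather_evidence`, and `judge_task` names are illustrative, and `ask_model` stands in for whatever LLM or agent backend supplies a yes/no verdict. The sketch only illustrates the mechanism described in the abstract: judging each hierarchical requirement against evidence from the developer agent's workspace, so the evaluation yields intermediate per-requirement feedback rather than a single final verdict.

```python
# Hypothetical sketch of an Agent-as-a-Judge-style evaluation loop.
# Assumes a DevAI-style task: hierarchical requirements (listed with
# prerequisites before dependents) and a workspace produced by the
# developer agent being judged.

from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Requirement:
    rid: str                                      # e.g. "R1.2"
    text: str                                     # natural-language requirement
    prerequisites: List[str] = field(default_factory=list)


def gather_evidence(workspace: Path, max_chars: int = 4000) -> str:
    """Collect a truncated view of the agent's workspace (file names + snippets)."""
    pieces = []
    for f in sorted(workspace.rglob("*")):
        if f.is_file():
            try:
                pieces.append(f"# {f.relative_to(workspace)}\n{f.read_text()[:500]}")
            except (UnicodeDecodeError, OSError):
                pieces.append(f"# {f.relative_to(workspace)} (unreadable)")
    return "\n\n".join(pieces)[:max_chars]


def judge_task(
    requirements: List[Requirement],
    workspace: Path,
    ask_model: Callable[[str], bool],
) -> Dict[str, bool]:
    """Judge each requirement in order, producing intermediate per-requirement verdicts."""
    verdicts: Dict[str, bool] = {}
    evidence = gather_evidence(workspace)
    for req in requirements:
        # A requirement can only be satisfied if its prerequisites already are.
        if not all(verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False
            continue
        prompt = (
            f"Requirement {req.rid}: {req.text}\n\n"
            f"Evidence from the agent's workspace:\n{evidence}\n\n"
            "Is the requirement satisfied? Answer yes or no."
        )
        verdicts[req.rid] = ask_model(prompt)
    return verdicts
```

A trivial stub such as `judge_task(reqs, Path("workspace"), ask_model=lambda p: "accuracy" in p)` runs the loop end to end; in practice the callable would wrap a judging agent that can also browse files and run code on its own, which is what distinguishes Agent-as-a-Judge from a plain LLM-as-a-Judge prompt.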