Agent-as-a-Judge: Evaluate Agents with Agents

October 14, 2024
作者: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI

Abstract

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, such as a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems, providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
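
The framework described above lends itself to a small illustration. The Python sketch below is not the authors' implementation: `Requirement`, `judge_requirement`, `agent_as_a_judge`, and the `ask` callback are hypothetical names, and the paper's judge is additionally equipped with agentic tools (reading files, running code, retrieving the developer agent's trajectory) rather than issuing a single prompt. It only shows the shape of the idea: hierarchical requirements like DevAI's are checked one by one, yielding per-requirement verdicts that can serve as intermediate reward signals instead of a single final score.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Requirement:
    """One node in a DevAI-style hierarchy of user requirements (hypothetical schema)."""
    rid: str
    description: str
    prerequisites: List[str] = field(default_factory=list)  # rids that must hold first

def judge_requirement(req: Requirement, evidence: str, ask: Callable[[str], str]) -> bool:
    """Ask the judge backend (any LLM or agent passed in via `ask`) about one requirement."""
    prompt = (
        f"Requirement {req.rid}: {req.description}\n"
        f"Evidence gathered from the developer agent's workspace and trajectory:\n{evidence}\n"
        "Reply with exactly one word: satisfied or unsatisfied."
    )
    return ask(prompt).strip().lower().startswith("satisfied")

def agent_as_a_judge(requirements: List[Requirement], evidence: str,
                     ask: Callable[[str], str]) -> Dict[str, bool]:
    """Walk the requirement list (assumed topologically ordered), skip children whose
    prerequisites already failed, and return per-requirement verdicts usable as
    intermediate reward signals rather than a single end-to-end score."""
    verdicts: Dict[str, bool] = {}
    for req in requirements:
        if any(not verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False  # a prerequisite failed, so this one cannot hold
            continue
        verdicts[req.rid] = judge_requirement(req, evidence, ask)
    return verdicts

# Usage with a trivial stand-in judge, just so the sketch runs end to end;
# replace `fake_ask` with a call to a real model or judge agent.
if __name__ == "__main__":
    reqs = [
        Requirement("R1", "Load the dataset from data/train.csv"),
        Requirement("R2", "Train a classifier on the loaded data", prerequisites=["R1"]),
    ]
    fake_ask = lambda prompt: "satisfied"  # stand-in: always answers satisfied
    print(agent_as_a_judge(reqs, evidence="Agent wrote load.py and train.py", ask=fake_ask))
```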
