Agent-as-a-Judge: Evaluate Agents with Agents
October 14, 2024
作者: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI
Abstract
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes -- ignoring the
step-by-step nature of agentic systems -- or require excessive manual labour. To
address this, we introduce the Agent-as-a-Judge framework, wherein agentic
systems are used to evaluate agentic systems. This is an organic extension of
the LLM-as-a-Judge framework, incorporating agentic features that enable
intermediate feedback for the entire task-solving process. We apply the
Agent-as-a-Judge to the task of code generation. To overcome issues with
existing benchmarks and provide a proof-of-concept testbed for
Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated
AI development tasks. It includes rich manual annotations, like a total of 365
hierarchical user requirements. We benchmark three of the popular agentic
systems using Agent-as-a-Judge and find it dramatically outperforms
LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether,
we believe that Agent-as-a-Judge marks a concrete step forward for modern
agentic systems -- by providing rich and reliable reward signals necessary for
dynamic and scalable self-improvement.
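To make the judging idea concrete, below is a minimal, hypothetical sketch of the core loop: an LLM-backed judge checks each hierarchical user requirement against the evidence produced by a developer agent and returns per-requirement verdicts that can serve as a reward signal. The names (`Requirement`, `judge_requirements`, the prompt format, the prerequisite handling) are illustrative assumptions, not the paper's implementation or DevAI's exact annotation schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    rid: str                   # identifier, e.g. "R0" (illustrative)
    text: str                  # natural-language requirement
    prerequisites: List[str]   # rids that must hold first (hierarchy)


def judge_requirements(
    requirements: List[Requirement],      # assumed ordered so prerequisites come first
    evidence: str,                        # developer agent's code, logs, and trajectory, flattened to text
    llm_call: Callable[[str], str],       # any chat/completion backend that returns a string
) -> Dict[str, bool]:
    """Ask a judge model whether each requirement is satisfied by the evidence.

    Returns {rid: bool}. A requirement is only checked once its prerequisites are
    satisfied, mirroring the hierarchical structure of the requirements.
    """
    verdicts: Dict[str, bool] = {}
    for req in requirements:
        if not all(verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False      # unmet prerequisite counts as unsatisfied
            continue
        prompt = (
            "You are judging an AI developer agent.\n"
            f"Requirement: {req.text}\n"
            f"Evidence (code, logs, outputs):\n{evidence}\n"
            "Answer strictly with <SATISFIED> or <UNSATISFIED>."
        )
        verdicts[req.rid] = "<SATISFIED>" in llm_call(prompt)
    return verdicts
```

Note that the judge described in the paper is itself agentic (it gathers and inspects intermediate artifacts rather than receiving them pre-packaged); this sketch flattens that process into a single `evidence` string for brevity.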