Agent-as-a-Judge: Evaluate Agents with Agents
October 14, 2024
作者: Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
cs.AI
Abstract
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes -- ignoring the
step-by-step nature of agentic systems -- or require excessive manual labour. To
address this, we introduce the Agent-as-a-Judge framework, wherein agentic
systems are used to evaluate agentic systems. This is an organic extension of
the LLM-as-a-Judge framework, incorporating agentic features that enable
intermediate feedback for the entire task-solving process. We apply the
Agent-as-a-Judge to the task of code generation. To overcome issues with
existing benchmarks and provide a proof-of-concept testbed for
Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated
AI development tasks. It includes rich manual annotations, like a total of 365
hierarchical user requirements. We benchmark three of the popular agentic
systems using Agent-as-a-Judge and find it dramatically outperforms
LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether,
we believe that Agent-as-a-Judge marks a concrete step forward for modern
agentic systems -- by providing rich and reliable reward signals necessary for
dynamic and scalable self-improvement.
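To make the judging idea concrete, below is a minimal, hypothetical sketch of the core loop: an LLM-backed judge checks each hierarchical user requirement against the evidence produced by a developer agent and returns per-requirement verdicts that can serve as a reward signal. The names (`Requirement`, `judge_requirements`, the prompt format, the prerequisite handling) are illustrative assumptions, not the paper's implementation or DevAI's exact annotation schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    rid: str                   # identifier, e.g. "R0" (illustrative)
    text: str                  # natural-language requirement
    prerequisites: List[str]   # rids that must hold first (hierarchy)


def judge_requirements(
    requirements: List[Requirement],      # assumed ordered so prerequisites come first
    evidence: str,                        # developer agent's code, logs, and trajectory, flattened to text
    llm_call: Callable[[str], str],       # any chat/completion backend that returns a string
) -> Dict[str, bool]:
    """Ask a judge model whether each requirement is satisfied by the evidence.

    Returns {rid: bool}. A requirement is only checked once its prerequisites are
    satisfied, mirroring the hierarchical structure of the requirements.
    """
    verdicts: Dict[str, bool] = {}
    for req in requirements:
        if not all(verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False      # unmet prerequisite counts as unsatisfied
            continue
        prompt = (
            "You are judging an AI developer agent.\n"
            f"Requirement: {req.text}\n"
            f"Evidence (code, logs, outputs):\n{evidence}\n"
            "Answer strictly with <SATISFIED> or <UNSATISFIED>."
        )
        verdicts[req.rid] = "<SATISFIED>" in llm_call(prompt)
    return verdicts
```

Note that the judge described in the paper is itself agentic (it gathers and inspects intermediate artifacts rather than receiving them pre-packaged); this sketch flattens that process into a single `evidence` string for brevity.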