Agent-as-a-Judge

January 8, 2026
Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
cs.AI

Abstract

LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.