Agent-as-a-Judge
January 8, 2026
Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
cs.AI
Abstract
LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
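To make the paradigm shift concrete, below is a minimal, self-contained sketch of the evaluation loop an agentic judge might run: it plans verification steps, executes tool-based checks against the evaluand, persists observations in memory, and aggregates the results into a verdict. This is an illustration of the general idea only; every name in it (AgentJudge, Verdict, the toy checks) is hypothetical and not an API from the paper or the systems it surveys.

```python
# Minimal sketch of an agent-as-a-judge loop. All names here are
# hypothetical illustrations, not taken from the surveyed systems.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Verdict:
    score: float         # fraction of verification steps that passed
    evidence: list[str]  # tool observations supporting the score

@dataclass
class AgentJudge:
    # Tool-augmented verification: checks the judge runs against the
    # evaluand instead of relying on a single reasoning pass.
    tools: dict[str, Callable[[str], tuple[bool, str]]]
    memory: list[str] = field(default_factory=list)  # persists across calls

    def plan(self, evaluand: str) -> list[str]:
        # Planning: a real judge would have an LLM decompose the rubric
        # into checks; this sketch simply schedules every available tool.
        return list(self.tools)

    def judge(self, evaluand: str) -> Verdict:
        passed, evidence = 0, []
        for step in self.plan(evaluand):
            ok, observation = self.tools[step](evaluand)
            self.memory.append(observation)  # persistent memory of findings
            evidence.append(observation)
            passed += ok
        return Verdict(score=passed / max(len(self.tools), 1),
                       evidence=evidence)

def check_definition(code: str) -> tuple[bool, str]:
    ok = "def " in code
    return ok, f"definition check passed: {ok}"

def check_docstring(code: str) -> tuple[bool, str]:
    ok = '"""' in code
    return ok, f"docstring check passed: {ok}"

# Usage: verify a code-like evaluand with two toy checks.
judge = AgentJudge(tools={"defines": check_definition,
                          "documents": check_docstring})
print(judge.judge('def add(a, b):\n    """Add two numbers."""\n    return a + b'))
```

A multi-agent variant of this sketch would run several such judges with different rubrics or tool sets and aggregate their verdicts, which is one way the collaboration dimension described in the abstract could be realized.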