
Survey on Evaluation of LLM-based Agents

March 20, 2025
Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
cs.AI

Abstract

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
