
Survey on Evaluation of LLM-based Agents

March 20, 2025
Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
cs.AI

Abstract

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals emerging trends in the field, identifies current limitations, and proposes directions for future research.
