Survey on Evaluation of LLM-based Agents

Abstract

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
