AgentBench: Evaluating LLMs as Agents

Abstract

Large Language Models (LLMs) are becoming increasingly capable and autonomous, and are now aimed at practical real-world missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLMs as agents in a multi-turn, open-ended generation setting. Our extensive tests on 25 LLMs (both API-based and open-source models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as one component of an ongoing project with wider coverage and deeper consideration of systematic LLM evaluation. The datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
