AgentBench: Evaluating LLMs as Agents

August 7, 2023
Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive tests over 25 LLMs (including API-based and open-sourced models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and their open-sourced competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
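To make the "multi-turn open-ended generation setting" concrete, below is a minimal, illustrative sketch of an agent-environment evaluation loop: the model observes the environment, emits a free-form action, and the environment responds, repeating until the task ends or a turn limit is reached. This is not AgentBench's actual API; the names (`ToyEnvironment`, `run_episode`, the scripted agent, and the success check) are hypothetical placeholders for this example only.

```python
# Hypothetical sketch of a multi-turn LLM-as-Agent evaluation loop.
# Not the AgentBench package's real interface; for illustration only.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Episode:
    history: List[Tuple[str, str]] = field(default_factory=list)  # (role, text) turns
    reward: float = 0.0
    done: bool = False


class ToyEnvironment:
    """Stand-in for one interactive environment (e.g., an OS-style task)."""

    def reset(self) -> str:
        return "Task: move the file 'a.txt' into the folder 'docs/'."

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Hypothetical success check for illustration only.
        if "mv a.txt docs/" in action:
            return "File moved.", 1.0, True
        return "Nothing happened. Try again.", 0.0, False


def run_episode(agent: Callable[[List[Tuple[str, str]]], str],
                env: ToyEnvironment,
                max_turns: int = 10) -> Episode:
    """Roll out one episode: observe -> act -> observe, up to max_turns."""
    ep = Episode()
    observation = env.reset()
    for _ in range(max_turns):
        ep.history.append(("env", observation))
        action = agent(ep.history)          # an LLM call would go here
        ep.history.append(("agent", action))
        observation, reward, done = env.step(action)
        ep.reward += reward
        if done:
            ep.done = True
            break
    return ep


if __name__ == "__main__":
    # A trivial scripted "agent" so the sketch runs without any model API.
    scripted_agent = lambda history: "mv a.txt docs/"
    result = run_episode(scripted_agent, ToyEnvironment())
    print(f"success={result.done}, reward={result.reward}, turns={len(result.history) // 2}")
```

In a benchmark of this kind, the scripted agent would be replaced by an LLM prompted with the accumulated history, and per-environment scores (success rate, reward) would be aggregated across the 8 environments to compare models.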