AgentBench: Evaluating LLMs as Agents

August 7, 2023
Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive tests over 25 LLMs (including API-based and open-sourced models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and their open-sourced competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
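To make the "multi-turn open-ended generation setting" concrete, below is a minimal, illustrative sketch of an agent-environment evaluation loop: the model observes the environment, emits a free-form action, and the environment responds, repeating until the task ends or a turn limit is reached. This is not AgentBench's actual API; the names (`ToyEnvironment`, `run_episode`, the scripted agent, and the success check) are hypothetical placeholders for this example only.

```python
# Hypothetical sketch of a multi-turn LLM-as-Agent evaluation loop.
# Not the AgentBench package's real interface; for illustration only.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Episode:
    history: List[Tuple[str, str]] = field(default_factory=list)  # (role, text) turns
    reward: float = 0.0
    done: bool = False


class ToyEnvironment:
    """Stand-in for one interactive environment (e.g., an OS-style task)."""

    def reset(self) -> str:
        return "Task: move the file 'a.txt' into the folder 'docs/'."

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Hypothetical success check for illustration only.
        if "mv a.txt docs/" in action:
            return "File moved.", 1.0, True
        return "Nothing happened. Try again.", 0.0, False


def run_episode(agent: Callable[[List[Tuple[str, str]]], str],
                env: ToyEnvironment,
                max_turns: int = 10) -> Episode:
    """Roll out one episode: observe -> act -> observe, up to max_turns."""
    ep = Episode()
    observation = env.reset()
    for _ in range(max_turns):
        ep.history.append(("env", observation))
        action = agent(ep.history)          # an LLM call would go here
        ep.history.append(("agent", action))
        observation, reward, done = env.step(action)
        ep.reward += reward
        if done:
            ep.done = True
            break
    return ep


if __name__ == "__main__":
    # A trivial scripted "agent" so the sketch runs without any model API.
    scripted_agent = lambda history: "mv a.txt docs/"
    result = run_episode(scripted_agent, ToyEnvironment())
    print(f"success={result.done}, reward={result.reward}, turns={len(result.history) // 2}")
```

In a benchmark of this kind, the scripted agent would be replaced by an LLM prompted with the accumulated history, and per-environment scores (success rate, reward) would be aggregated across the 8 environments to compare models.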