
AgentBench: Evaluating LLMs as Agents

August 7, 2023
Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments for assessing LLM-as-Agent reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive test of 25 LLMs (including API-based and open-source models) shows that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project aiming at wider coverage and deeper consideration toward systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
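To make the "multi-turn open-ended generation setting" concrete, the sketch below shows a generic agent-environment evaluation loop of the kind the abstract describes: the model observes a task state, emits a free-form action, and the environment responds over several turns until the episode ends. All names here (`Environment`, `llm_call`, `run_episode`) are illustrative assumptions, not the actual AgentBench API; see https://github.com/THUDM/AgentBench for the released evaluation package.

```python
# Minimal sketch of a multi-turn LLM-as-Agent evaluation loop.
# NOTE: Environment, llm_call, and run_episode are hypothetical names for
# illustration only; they are not part of the AgentBench package.

from dataclasses import dataclass, field


@dataclass
class Environment:
    """A toy interactive task: the agent must output 'submit' to finish."""
    max_turns: int = 5
    history: list = field(default_factory=list)

    def observe(self) -> str:
        return "Type 'submit' to complete the task."

    def step(self, action: str) -> tuple[str, bool, float]:
        self.history.append(action)
        solved = action.strip().lower() == "submit"
        done = solved or len(self.history) >= self.max_turns
        reward = 1.0 if solved else 0.0
        return self.observe(), done, reward


def llm_call(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (commercial API or open-source model)."""
    return "submit"  # a real agent would generate this from the dialogue history


def run_episode(env: Environment) -> float:
    """Run one multi-turn episode and return the accumulated score."""
    messages = [{"role": "user", "content": env.observe()}]
    total = 0.0
    for _ in range(env.max_turns):
        action = llm_call(messages)                       # agent decides the next action
        messages.append({"role": "assistant", "content": action})
        obs, done, reward = env.step(action)              # environment reacts to the action
        messages.append({"role": "user", "content": obs})
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    print("episode score:", run_episode(Environment()))
```

In the benchmark proper, each of the 8 environments defines its own observation format, action space, and scoring rule, and the loop above would be repeated over many tasks per environment to produce the per-model scores reported in the paper.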