
AgentBench: Evaluating LLMs as Agents

August 7, 2023
Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments for assessing LLM-as-Agent reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive test of 25 LLMs (including API-based and open-source models) shows that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project aiming at wider coverage and deeper consideration toward systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
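To make the "multi-turn open-ended generation setting" concrete, the sketch below shows a generic agent-environment evaluation loop of the kind the abstract describes: the model observes a task state, emits a free-form action, and the environment responds over several turns until the episode ends. All names here (`Environment`, `llm_call`, `run_episode`) are illustrative assumptions, not the actual AgentBench API; see https://github.com/THUDM/AgentBench for the released evaluation package.

```python
# Minimal sketch of a multi-turn LLM-as-Agent evaluation loop.
# NOTE: Environment, llm_call, and run_episode are hypothetical names for
# illustration only; they are not part of the AgentBench package.

from dataclasses import dataclass, field


@dataclass
class Environment:
    """A toy interactive task: the agent must output 'submit' to finish."""
    max_turns: int = 5
    history: list = field(default_factory=list)

    def observe(self) -> str:
        return "Type 'submit' to complete the task."

    def step(self, action: str) -> tuple[str, bool, float]:
        self.history.append(action)
        solved = action.strip().lower() == "submit"
        done = solved or len(self.history) >= self.max_turns
        reward = 1.0 if solved else 0.0
        return self.observe(), done, reward


def llm_call(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call (commercial API or open-source model)."""
    return "submit"  # a real agent would generate this from the dialogue history


def run_episode(env: Environment) -> float:
    """Run one multi-turn episode and return the accumulated score."""
    messages = [{"role": "user", "content": env.observe()}]
    total = 0.0
    for _ in range(env.max_turns):
        action = llm_call(messages)                       # agent decides the next action
        messages.append({"role": "assistant", "content": action})
        obs, done, reward = env.step(action)              # environment reacts to the action
        messages.append({"role": "user", "content": obs})
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    print("episode score:", run_episode(Environment()))
```

In the benchmark proper, each of the 8 environments defines its own observation format, action space, and scoring rule, and the loop above would be repeated over many tasks per environment to produce the per-model scores reported in the paper.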