AgentBench: Evaluating LLMs as Agents

Abstract

Large Language Models (LLMs) are becoming increasingly capable and autonomous, and are now aimed at practical real-world missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLM-as-Agent in a multi-turn, open-ended generation setting. Our extensive tests over 25 LLMs (including API-based and open-source models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration of systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench
