AgentBench: Evaluating LLMs as Agents

Abstract

Large Language Models (LLMs) are becoming increasingly capable and autonomous, targeting pragmatic real-world missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments for assessing the reasoning and decision-making abilities of LLMs acting as agents in a multi-turn, open-ended generation setting. Our extensive tests on 25 LLMs (including API-based and open-source models) show that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their open-source competitors. AgentBench also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
