AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
