Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Abstract

Large language models (LLMs) have recently attracted much attention for building autonomous agents. However, the performance of current LLM-based web agents on long-horizon tasks is far from optimal, and they often make errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such irreversible mistakes because we are aware of the potential outcomes of our actions (e.g., losing money), an awareness also known as a "world model". Motivated by this, our study begins with preliminary analyses that confirm the absence of world models in current LLMs (e.g., GPT-4o and Claude-3.5-Sonnet). We then present a world-model-augmented (WMA) web agent, which simulates the outcomes of its actions to make better decisions. Training LLMs as world models that predict the next observation poses challenges such as elements repeated across observations and long HTML inputs. To overcome them, we propose a transition-focused observation abstraction, in which the prediction objectives are free-form natural language descriptions that highlight only the important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training, and demonstrate that our agents are more cost- and time-efficient than recent tree-search-based agents.
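The idea of simulating the outcome of each candidate action before committing to one can be illustrated with a minimal sketch. It is not the authors' implementation: `select_action`, `toy_world_model`, and `toy_value_fn` are hypothetical names, and the two toy functions are hand-written stand-ins for what would be a trained world model and a learned or prompted scorer. The world model here returns a free-form description of the predicted state change rather than a full next observation, in the spirit of the transition-focused abstraction.

```python
from typing import Callable, Sequence


def select_action(
    observation: str,
    candidate_actions: Sequence[str],
    world_model: Callable[[str, str], str],
    value_fn: Callable[[str], float],
) -> str:
    """Return the candidate whose simulated outcome scores highest."""
    if not candidate_actions:
        raise ValueError("need at least one candidate action")
    scored = []
    for action in candidate_actions:
        # Ask the world model what would change if we took this action.
        predicted_change = world_model(observation, action)
        # Score the predicted outcome instead of executing the action.
        scored.append((value_fn(predicted_change), action))
    # max() keeps the first candidate on ties.
    _, best_action = max(scored, key=lambda pair: pair[0])
    return best_action


# Hypothetical stand-in for a trained world model (assumption, not from the paper).
def toy_world_model(observation: str, action: str) -> str:
    if action == "click [Buy now]" and "ticket already purchased" in observation:
        return "A second non-refundable ticket is charged to the card."
    if action == "click [View order]":
        return "The order confirmation page is shown."
    return "No visible change."


# Hypothetical stand-in for an outcome scorer (assumption, not from the paper).
def toy_value_fn(predicted_change: str) -> float:
    if "non-refundable" in predicted_change:
        return 0.0  # irreversible, costly outcome
    if "confirmation" in predicted_change:
        return 1.0  # outcome that advances the task
    return 0.5  # neutral outcome


chosen = select_action(
    "Checkout page; ticket already purchased",
    ["click [Buy now]", "click [View order]"],
    toy_world_model,
    toy_value_fn,
)
print(chosen)  # click [View order]
```

Because the candidates are scored on predicted outcomes, the irreversible repeat purchase is rejected without ever being executed in the environment, which is the failure mode the abstract uses as its motivating example.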
