How to Train Your LLM Web Agent: A Statistical Diagnosis

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute cost of post-training LLM-based web agents. To address these challenges, we present the first statistically grounded study of compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline: a Llama 3.1 8B student is first trained to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), and is then trained further with on-policy reinforcement learning (RL). We find this process highly sensitive to hyperparameter choices, which makes exhaustive sweeps impractical. To spare others expensive trial and error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Furthermore, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and it is the only strategy that can close the gap with closed-source models.
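The abstract does not spell out the bootstrap procedure, so the following is only a minimal sketch of one common way to use bootstrapping over a pool of sampled configurations: resample the completed runs with replacement many times, and record how often each hyperparameter value yields the best score in the resample. The function name `bootstrap_best_value`, the learning-rate values, and the scores are all hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import random
from collections import Counter


def bootstrap_best_value(runs, n_boot=1000, seed=0):
    """Estimate how often each hyperparameter value 'wins' under resampling.

    runs: list of (hyperparameter_value, score) pairs, one per finished run.
    Returns a dict mapping each winning value to its fraction of wins.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # Draw a bootstrap sample: same size as the pool, with replacement.
        sample = [rng.choice(runs) for _ in runs]
        # Best score observed for each hyperparameter value in this sample.
        best_by_value = {}
        for value, score in sample:
            best_by_value[value] = max(score, best_by_value.get(value, float("-inf")))
        # The value whose best run scores highest wins this resample.
        winner = max(best_by_value, key=best_by_value.get)
        wins[winner] += 1
    return {value: count / n_boot for value, count in wins.items()}


# Made-up runs: (learning rate, task success rate). Illustrative only.
runs = [(1e-5, 0.9), (1e-5, 0.8), (1e-4, 0.3), (1e-4, 0.2)]
win_rates = bootstrap_best_value(runs)
print(win_rates)
```

A value that wins in most resamples is a more trustworthy choice than one that merely topped a single sweep, which is the kind of robustness a bootstrap analysis is meant to expose when results are noisy and full grid searches are too expensive.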
