
How to Train Your LLM Web Agent: A Statistical Diagnosis

July 5, 2025
作者: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
cs.AI

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
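The abstract's key methodological move is sampling a large number of hyperparameter configurations and using bootstrapping to estimate which settings are reliably effective, rather than running an exhaustive sweep. The following is a minimal illustrative sketch of that idea: resampling a small set of (hyperparameter, score) runs with replacement and counting how often each setting comes out on top. All names, values, and the single `lr` hyperparameter are invented for illustration and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical data: each sampled configuration pairs a hyperparameter
# setting with an observed success rate. Values are invented for the sketch.
configs = [
    {"lr": 1e-5, "score": 0.42},
    {"lr": 1e-5, "score": 0.47},
    {"lr": 3e-5, "score": 0.55},
    {"lr": 3e-5, "score": 0.51},
    {"lr": 1e-4, "score": 0.38},
    {"lr": 1e-4, "score": 0.44},
]

def bootstrap_best_lr(configs, n_boot=2000, seed=0):
    """Estimate how often each learning rate has the best mean score
    when the observed runs are resampled with replacement."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # Draw a bootstrap resample of the runs, with replacement.
        sample = [rng.choice(configs) for _ in configs]
        means = {}
        for c in sample:
            means.setdefault(c["lr"], []).append(c["score"])
        # Record which setting wins on this resample.
        best = max(means, key=lambda lr: sum(means[lr]) / len(means[lr]))
        wins[best] += 1
    return {lr: count / n_boot for lr, count in wins.items()}

probs = bootstrap_best_lr(configs)
```

The returned dictionary approximates, for each setting, the probability that it is the best choice given the noise in the sampled runs; the paper's actual analysis over 1,370 configurations is presumably richer, but the resampling principle is the same.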