How to Train Your LLM Web Agent: A Statistical Diagnosis

July 5, 2025
Authors: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
cs.AI

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
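The abstract describes sampling 1,370 training configurations and bootstrapping over them to estimate which hyperparameters are robustly effective. Below is a minimal sketch of that style of analysis, assuming each run is stored as a (config, score) pair; the data layout, function name, and toy config labels are illustrative assumptions, not the authors' released code.

```python
# Bootstrapped hyperparameter analysis sketch (assumed data layout, not the paper's code).
import random
from collections import Counter

def bootstrap_best_config(runs, n_boot=1000, seed=0):
    """Estimate how often each hyperparameter config would be selected as best
    if the sweep were re-run, by resampling the observed runs with replacement.

    runs: list of (config_id, score) tuples, one per sampled training run.
    Returns a Counter mapping config_id -> number of bootstrap wins.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # One bootstrap replicate: resample runs with replacement.
        sample = [runs[rng.randrange(len(runs))] for _ in range(len(runs))]
        # Average score per config within this replicate.
        totals, counts = Counter(), Counter()
        for cfg, score in sample:
            totals[cfg] += score
            counts[cfg] += 1
        best = max(totals, key=lambda c: totals[c] / counts[c])
        wins[best] += 1
    return wins

# Toy usage with hypothetical config labels and benchmark scores in [0, 1].
runs = [("sft+rl_lr3e-5", 0.62), ("sft+rl_lr3e-5", 0.58),
        ("sft_lr1e-4", 0.49), ("sft_lr1e-4", 0.55),
        ("rl_lr3e-5", 0.41), ("rl_lr3e-5", 0.44)]
print(bootstrap_best_config(runs, n_boot=200))
```

Configurations that win a large share of bootstrap replicates can be treated as robustly effective choices rather than artifacts of a single lucky run, which is the kind of conclusion the paper draws without requiring an exhaustive sweep.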