How to Train Your LLM Web Agent: A Statistical Diagnosis

July 5, 2025
Authors: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
cs.AI

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
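The abstract describes sampling 1,370 training configurations and bootstrapping over them to estimate which hyperparameters are robustly effective. Below is a minimal sketch of that style of analysis, assuming each run is stored as a (config, score) pair; the data layout, function name, and toy config labels are illustrative assumptions, not the authors' released code.

```python
# Bootstrapped hyperparameter analysis sketch (assumed data layout, not the paper's code).
import random
from collections import Counter

def bootstrap_best_config(runs, n_boot=1000, seed=0):
    """Estimate how often each hyperparameter config would be selected as best
    if the sweep were re-run, by resampling the observed runs with replacement.

    runs: list of (config_id, score) tuples, one per sampled training run.
    Returns a Counter mapping config_id -> number of bootstrap wins.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # One bootstrap replicate: resample runs with replacement.
        sample = [runs[rng.randrange(len(runs))] for _ in range(len(runs))]
        # Average score per config within this replicate.
        totals, counts = Counter(), Counter()
        for cfg, score in sample:
            totals[cfg] += score
            counts[cfg] += 1
        best = max(totals, key=lambda c: totals[c] / counts[c])
        wins[best] += 1
    return wins

# Toy usage with hypothetical config labels and benchmark scores in [0, 1].
runs = [("sft+rl_lr3e-5", 0.62), ("sft+rl_lr3e-5", 0.58),
        ("sft_lr1e-4", 0.49), ("sft_lr1e-4", 0.55),
        ("rl_lr3e-5", 0.41), ("rl_lr3e-5", 0.44)]
print(bootstrap_best_config(runs, n_boot=200))
```

Configurations that win a large share of bootstrap replicates can be treated as robustly effective choices rather than artifacts of a single lucky run, which is the kind of conclusion the paper draws without requiring an exhaustive sweep.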