Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
May 21, 2025
Authors: Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
cs.AI
Abstract
Web navigation is a unique domain that can automate many repetitive real-life tasks, and it is challenging because it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be used during both training and test time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have used MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, we propose Web-Shepherd, the first process reward model (PRM) that can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. We also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that Web-Shepherd achieves about 30 points higher accuracy on WebRewardBench than GPT-4o. Furthermore, when tested on WebArena-lite with GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times lower cost than using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
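The abstract does not spell out the inference-time procedure, but the reported setup (a policy model proposing actions while Web-Shepherd scores each step against a checklist) suggests a best-of-n style verifier. The sketch below illustrates one plausible way such a checklist-conditioned, step-level PRM could rerank candidate actions at test time; the Policy and ProcessRewardModel interfaces, the field names, and the select_best_action helper are hypothetical and are not the released Web-Shepherd API.

```python
# Hedged sketch of test-time verification with a step-level PRM.
# All interfaces and names here are illustrative assumptions, not the
# actual Web-Shepherd or WebArena-lite code.
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Step:
    observation: str   # page state when the action was taken (e.g., accessibility tree)
    action: str        # action string, e.g., "click [42]" or "type [7] 'laptop'"


@dataclass
class TaskState:
    instruction: str                     # the user's web task
    checklist: List[str]                 # subgoals the PRM scores progress against
    observation: str = ""                # current page observation
    history: List[Step] = field(default_factory=list)


class Policy(Protocol):
    def propose(self, state: TaskState) -> str: ...


class ProcessRewardModel(Protocol):
    def score(self, state: TaskState, candidate_action: str) -> float: ...


def select_best_action(policy: Policy, prm: ProcessRewardModel,
                       state: TaskState, n: int = 8) -> str:
    """Best-of-n verification: sample n candidate actions from the policy
    and keep the one the step-level PRM scores highest."""
    candidates = [policy.propose(state) for _ in range(n)]
    return max(candidates, key=lambda a: prm.score(state, a))
```

Under these assumptions, the extra cost of reranking comes from cheap PRM scoring calls rather than from prompting a large MLLM judge, which is consistent with the cost argument the abstract makes for a specialized reward model.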