Multimodal Web Navigation with Instruction-Finetuned Foundation Models
May 19, 2023
Authors: Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur
cs.AI
Abstract
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's capabilities in grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, closely approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, a dataset 38 times larger than in prior work, and make them available to promote future research in this direction.
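
To make the recipe concrete, the sketch below shows one way such a multimodal agent can be wired together: a vision transformer turns the screenshot into patch embeddings, the instruction-finetuned language model embeds the HTML and the task instruction, and the concatenated sequence is decoded into an action string. This is a minimal, hypothetical illustration rather than the paper's released implementation; the specific checkpoints (flan-t5-base, vit-base-patch16-224-in21k), the prompt format, and the direct concatenation without a learned projection are assumptions, and the offline joint finetuning on demonstrations is omitted.

```python
# Minimal sketch of a WebGUM-style multimodal web agent (illustrative, not the authors' code).
import torch
from PIL import Image
from transformers import AutoTokenizer, T5ForConditionalGeneration, ViTImageProcessor, ViTModel

# Assumed checkpoints: an instruction-finetuned LM and a ViT whose hidden sizes happen to
# match (768); a real system would typically add a learned projection between them.
lm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Hypothetical inputs: a webpage screenshot plus the HTML and the task instruction.
screenshot = Image.new("RGB", (224, 224))  # placeholder image
prompt = ("Instruction: click the 'Submit' button\n"
          "HTML: <button id=5>Submit</button>\n"
          "Action:")

with torch.no_grad():
    # Visual tokens from the ViT encoder (one embedding per image patch, plus [CLS]).
    pixel_values = image_processor(images=screenshot, return_tensors="pt").pixel_values
    visual_tokens = vit(pixel_values=pixel_values).last_hidden_state   # (1, 197, 768)

    # Text tokens embedded with the LM's own embedding table, then concatenated with the
    # visual tokens so the encoder attends over both modalities jointly.
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = lm.get_input_embeddings()(text_ids)                  # (1, T, 768)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    # Decode a web navigation action as free-form text (e.g. "click id=5").
    action_ids = lm.generate(inputs_embeds=inputs_embeds,
                             attention_mask=attention_mask,
                             max_new_tokens=16)

print(tokenizer.decode(action_ids[0], skip_special_tokens=True))
```

In the training setup described in the abstract, both the vision transformer and the language model would be finetuned jointly on the demonstration corpus so that the decoder learns to map screenshot-plus-HTML observations to click and type actions; the snippet above only illustrates the inference-time wiring.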