Multimodal Web Navigation with Instruction-Finetuned Foundation Models
May 19, 2023
Authors: Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur
cs.AI
Abstract
The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and by domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's capabilities in grounded visual perception, HTML comprehension, and multi-step reasoning, outperforming prior work by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, closely approaching the online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, a dataset 38 times larger than in prior work, and make them available to promote future research in this direction.
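
To make the recipe concrete, the sketch below shows one way such a multimodal agent can be wired together: a vision transformer turns the screenshot into patch embeddings, the instruction-finetuned language model embeds the HTML and the task instruction, and the concatenated sequence is decoded into an action string. This is a minimal, hypothetical illustration rather than the paper's released implementation; the specific checkpoints (flan-t5-base, vit-base-patch16-224-in21k), the prompt format, and the direct concatenation without a learned projection are assumptions, and the offline joint finetuning on demonstrations is omitted.

```python
# Minimal sketch of a WebGUM-style multimodal web agent (illustrative, not the authors' code).
import torch
from PIL import Image
from transformers import AutoTokenizer, T5ForConditionalGeneration, ViTImageProcessor, ViTModel

# Assumed checkpoints: an instruction-finetuned LM and a ViT whose hidden sizes happen to
# match (768); a real system would typically add a learned projection between them.
lm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Hypothetical inputs: a webpage screenshot plus the HTML and the task instruction.
screenshot = Image.new("RGB", (224, 224))  # placeholder image
prompt = ("Instruction: click the 'Submit' button\n"
          "HTML: <button id=5>Submit</button>\n"
          "Action:")

with torch.no_grad():
    # Visual tokens from the ViT encoder (one embedding per image patch, plus [CLS]).
    pixel_values = image_processor(images=screenshot, return_tensors="pt").pixel_values
    visual_tokens = vit(pixel_values=pixel_values).last_hidden_state   # (1, 197, 768)

    # Text tokens embedded with the LM's own embedding table, then concatenated with the
    # visual tokens so the encoder attends over both modalities jointly.
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = lm.get_input_embeddings()(text_ids)                  # (1, T, 768)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    # Decode a web navigation action as free-form text (e.g. "click id=5").
    action_ids = lm.generate(inputs_embeds=inputs_embeds,
                             attention_mask=attention_mask,
                             max_new_tokens=16)

print(tokenizer.decode(action_ids[0], skip_special_tokens=True))
```

In the training setup described in the abstract, both the vision transformer and the language model would be finetuned jointly on the demonstration corpus so that the decoder learns to map screenshot-plus-HTML observations to click and type actions; the snippet above only illustrates the inference-time wiring.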