Multimodal Web Navigation with Instruction-Finetuned Foundation Models

May 19, 2023
Authors: Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur
cs.AI

Abstract

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded visual perception, HTML comprehension and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB benchmark, we improve over the previous best offline methods by more than 31.9%, being close to reaching online-finetuned SoTA. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.
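The abstract describes the recipe only at a high level: a vision transformer encodes the page screenshot, an instruction-finetuned language model consumes the HTML alongside the visual features, and the joint model is finetuned on demonstrations to emit actions such as click and type. The sketch below illustrates one way such an agent could be wired together from off-the-shelf components (a ViT image encoder and a Flan-T5-style encoder-decoder); the model names, projection layer, prompt format, and action strings are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a screenshot + HTML web agent in the spirit described above.
# A ViT encodes the screenshot, its patch embeddings are projected into the LM's
# embedding space and prepended to the HTML token embeddings, and an
# instruction-finetuned encoder-decoder LM decodes an action string.
# Assumptions: model choices, wiring, and action format are illustrative only.
import torch
import torch.nn as nn
from transformers import (AutoImageProcessor, AutoTokenizer,
                          T5ForConditionalGeneration, ViTModel)


class MultimodalWebAgent(nn.Module):
    def __init__(self, vit_name="google/vit-base-patch16-224-in21k",
                 lm_name="google/flan-t5-base"):
        super().__init__()
        self.image_processor = AutoImageProcessor.from_pretrained(vit_name)
        self.vit = ViTModel.from_pretrained(vit_name)
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = T5ForConditionalGeneration.from_pretrained(lm_name)
        # Project ViT patch embeddings into the LM embedding space.
        self.proj = nn.Linear(self.vit.config.hidden_size, self.lm.config.d_model)

    def forward(self, screenshot, instruction, html, target_action=None):
        # Visual tokens: ViT patch embeddings mapped to LM embedding size.
        pixels = self.image_processor(images=screenshot, return_tensors="pt").pixel_values
        visual_tokens = self.proj(self.vit(pixel_values=pixels).last_hidden_state)

        # Text tokens: instruction plus (truncated) HTML, embedded by the LM.
        prompt = f"instruction: {instruction} html: {html}"
        text = self.tokenizer(prompt, return_tensors="pt",
                              truncation=True, max_length=512)
        text_embeds = self.lm.get_input_embeddings()(text.input_ids)

        # Joint encoder input: visual tokens prepended to text tokens.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        attention_mask = torch.cat(
            [torch.ones(visual_tokens.shape[:2], dtype=torch.long),
             text.attention_mask], dim=1)

        if target_action is not None:
            # Behavioral cloning on demonstrations: maximize the likelihood
            # of the demonstrated action string.
            labels = self.tokenizer(target_action, return_tensors="pt").input_ids
            return self.lm(inputs_embeds=inputs_embeds,
                           attention_mask=attention_mask, labels=labels).loss

        # Inference: decode a free-form action string, e.g. a click or type command.
        out = self.lm.generate(inputs_embeds=inputs_embeds,
                               attention_mask=attention_mask, max_new_tokens=32)
        return self.tokenizer.decode(out[0], skip_special_tokens=True)
```

Under this reading, "jointly finetuning on a large corpus of demonstrations" amounts to minimizing the loss above over (screenshot, instruction, HTML, action) tuples, with the ViT and LM parameters updated together.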