GPT-4V(ision) is a Generalist Web Agent, if Grounded

Abstract


Recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has rapidly expanded the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V has great potential for web agents: it can successfully complete 50% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out not to be effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML text and visuals. Yet there is still a substantial gap with oracle grounding, leaving ample room for further improvement.
