GPT-4V(ision) is a Generalist Web Agent, if Grounded

Abstract

The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has rapidly expanded the capability boundaries of multimodal models beyond traditional tasks such as image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V has great potential for web agents: it can successfully complete 50% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 and smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies such as set-of-mark prompting turn out not to be effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML text and the visuals. Yet there is still a substantial gap with oracle grounding, leaving ample room for further improvement.
