OmniParser for Pure Vision Based GUI Agent
August 1, 2024
Authors: Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
cs.AI
Abstract
The recent success of large vision language models shows great potential in
driving agent systems that operate on user interfaces. However, we argue that
the power of multimodal models like GPT-4V as a general agent on multiple
operating systems across different applications is largely underestimated due
to the lack of a robust screen parsing technique capable of: 1) reliably
identifying interactable icons within the user interface, and 2) understanding
the semantics of various elements in a screenshot and accurately associating
the intended action with the corresponding region on the screen. To fill these
gaps, we introduce OmniParser, a comprehensive method for parsing user
interface screenshots into structured elements, which significantly enhances
the ability of GPT-4V to generate actions that can be accurately grounded in
the corresponding regions of the interface. We first curated an interactable
icon detection dataset using popular webpages and an icon description dataset.
These datasets were utilized to fine-tune specialized models: a detection model
to parse interactable regions on the screen and a caption model to extract the
functional semantics of the detected elements. OmniParser significantly
improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and
AITW benchmarks, OmniParser with screenshot-only input outperforms the GPT-4V
baselines that require additional information beyond the screenshot.
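
The two-stage parsing described in the abstract (a detection model for interactable regions, then a caption model for their functional semantics) can be illustrated with a minimal sketch. This is not OmniParser's actual code or API: `UIElement`, `parse_screen`, `format_for_prompt`, and `box_center` are hypothetical names, the two models are stand-in callables, and the numbered text listing is only one assumed way to hand structured elements to a model such as GPT-4V.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class UIElement:
    """One structured element parsed from a screenshot (hypothetical schema)."""
    element_id: int
    box: Box
    description: str


def parse_screen(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[UIElement]:
    """Run detection, then caption each detected region.

    `detect` stands in for the fine-tuned detection model and `caption`
    for the fine-tuned caption model; both are placeholders here.
    """
    boxes = detect(screenshot)
    return [
        UIElement(element_id=i, box=box, description=caption(screenshot, box))
        for i, box in enumerate(boxes)
    ]


def format_for_prompt(elements: List[UIElement]) -> str:
    """Render the elements as a numbered listing for a text prompt."""
    return "\n".join(
        f"[{e.element_id}] {e.description} at {e.box}" for e in elements
    )


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center of a box, e.g. as a click target for a chosen element."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


# Stub models so the sketch runs without any trained weights.
def fake_detect(_screenshot: object) -> List[Box]:
    return [(10, 20, 50, 60), (100, 20, 140, 60)]


def fake_caption(_screenshot: object, box: Box) -> str:
    return "settings icon" if box[0] < 100 else "search button"


elements = parse_screen(None, fake_detect, fake_caption)
prompt_listing = format_for_prompt(elements)
```

In this sketch the downstream model would pick an element by its ID, and the agent would map that ID back to a screen location with `box_center`, which is one way an action can be grounded in the corresponding region.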