OmniParser for Pure Vision Based GUI Agent
August 1, 2024
Authors: Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
cs.AI
Abstract
The recent success of large vision language models shows great potential for
driving agent systems that operate on user interfaces. However, we argue that
the power of multimodal models like GPT-4V as general agents on multiple
operating systems across different applications is largely underestimated due
to the lack of a robust screen parsing technique capable of: 1) reliably
identifying interactable icons within the user interface, and 2) understanding
the semantics of various elements in a screenshot and accurately associating the
intended action with the corresponding region on the screen. To fill these
gaps, we introduce OmniParser, a comprehensive method for parsing user
interface screenshots into structured elements, which significantly enhances
the ability of GPT-4V to generate actions that can be accurately grounded in
the corresponding regions of the interface. We first curated an interactable
icon detection dataset using popular webpages and an icon description dataset.
These datasets were utilized to fine-tune specialized models: a detection model
to parse interactable regions on the screen and a caption model to extract the
functional semantics of the detected elements. OmniParser
significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the
Mind2Web and AITW benchmarks, OmniParser with screenshot-only input
outperforms the GPT-4V baselines that require additional information beyond the
screenshot.
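
The abstract describes parsing a screenshot into structured elements and then grounding a chosen action in a screen region. The following is a minimal sketch of that idea, not the authors' implementation: the `UIElement` structure, the function names, and the hand-written sample data are all hypothetical, and the detection and caption models are replaced by fixed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """One parsed screen element (hypothetical structure, not taken from the paper)."""
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    description: str                # functional caption, e.g. from a caption model
    interactable: bool              # flag, e.g. from a detection model


def elements_to_prompt(elements: list[UIElement]) -> str:
    """List interactable elements with numeric IDs that a multimodal model can pick from."""
    lines = [
        f"[{idx}] {el.description}"
        for idx, el in enumerate(elements)
        if el.interactable
    ]
    return "\n".join(lines)


def ground_action(elements: list[UIElement], chosen_id: int) -> tuple[int, int]:
    """Map a chosen element ID back to a click point at the centre of its box."""
    left, top, right, bottom = elements[chosen_id].box
    return ((left + right) // 2, (top + bottom) // 2)


# Hand-written stand-ins for detector and captioner output (assumed data).
parsed = [
    UIElement((10, 10, 50, 30), "Open settings menu", True),
    UIElement((60, 10, 300, 30), "Page title text", False),
    UIElement((310, 10, 350, 30), "Search the page", True),
]
prompt = elements_to_prompt(parsed)
click_point = ground_action(parsed, 2)
```

In this sketch the model only has to return an element ID, and the pixel coordinates are recovered from the parsed structure rather than predicted directly.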