Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
October 7, 2024
Authors: Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
cs.AI
Abstract
Multimodal large language models (MLLMs) are transforming the capabilities of
graphical user interface (GUI) agents, facilitating their transition from
controlled simulations to complex, real-world applications across various
platforms. However, the effectiveness of these agents hinges on the robustness
of their grounding capability. Current GUI agents predominantly utilize
text-based representations such as HTML or accessibility trees, which, despite
their utility, often introduce noise, incompleteness, and increased
computational overhead. In this paper, we advocate a human-like embodiment for
GUI agents that perceive the environment entirely visually and directly take
pixel-level operations on the GUI. The key is visual grounding models that can
accurately map diverse referring expressions of GUI elements to their
coordinates on the GUI across different platforms. We show that a simple
recipe, which includes web-based synthetic data and slight adaptation of the
LLaVA architecture, is surprisingly effective for training such visual
grounding models. We collect the largest dataset for GUI visual grounding so
far, containing 10M GUI elements and their referring expressions over 1.3M
screenshots, and use it to train UGround, a strong universal visual grounding
model for GUI agents. Empirical results on six benchmarks spanning three
categories (grounding, offline agent, and online agent) show that 1) UGround
substantially outperforms existing visual grounding models for GUI agents, by
up to 20% absolute, and 2) agents with UGround outperform state-of-the-art
agents, despite the fact that existing agents use additional text-based input
while ours only uses visual perception. These results provide strong support
for the feasibility and promises of GUI agents that navigate the digital world
as humans do.
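The abstract describes an agent that perceives the GUI purely through screenshots: a planner expresses its intent as a natural-language referring expression for the target element, a visual grounding model such as UGround resolves that expression to pixel coordinates, and the agent acts at those coordinates. The sketch below illustrates this perceive-plan-ground-act loop under stated assumptions; the function names (plan_next_action, ground_expression, click_at) and the PlannedAction structure are hypothetical placeholders for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of a vision-only GUI agent step: plan an action as a
# referring expression, ground it to pixel coordinates, then act on the GUI.
# All names below are hypothetical placeholders, not the paper's real API.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class PlannedAction:
    action_type: str           # e.g. "click" or "type"
    referring_expression: str  # natural-language description of the target element
    text: str = ""             # text to enter, for "type" actions


def plan_next_action(screenshot_png: bytes, task: str) -> PlannedAction:
    """Placeholder for an MLLM planner that inspects the screenshot and task,
    then describes the target element in words rather than via HTML nodes or
    accessibility-tree IDs."""
    raise NotImplementedError("call your planner MLLM here")


def ground_expression(screenshot_png: bytes, expression: str) -> Tuple[int, int]:
    """Placeholder for a GUI visual grounding model (e.g. UGround) that maps a
    referring expression such as 'the Sign in button in the top-right corner'
    to (x, y) pixel coordinates on the screenshot."""
    raise NotImplementedError("call your grounding model here")


def click_at(x: int, y: int) -> None:
    """Placeholder for the pixel-level actuator, e.g. an OS- or browser-level
    mouse click at the given screen coordinates."""
    raise NotImplementedError("dispatch the click to the GUI here")


def step(screenshot_png: bytes, task: str) -> None:
    """One perceive -> plan -> ground -> act iteration, using only the
    screenshot as input (no HTML or accessibility tree)."""
    action = plan_next_action(screenshot_png, task)
    if action.action_type == "click":
        x, y = ground_expression(screenshot_png, action.referring_expression)
        click_at(x, y)
```

Keeping planning and grounding as separate stages mirrors the division of labor suggested by the abstract: the planner reasons in natural language, while the grounding model handles the platform-specific mapping from language to screen coordinates.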