Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Abstract

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly rely on text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and a slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, even though existing agents use additional text-based input while ours use only visual perception. These results provide strong support for the feasibility and promise of GUI agents that navigate the digital world as humans do.
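To make the grounding task concrete, here is a minimal sketch of how a model that maps a referring expression to GUI coordinates might be scored. It is not the authors' code: the names `GroundingExample`, `point_in_bbox`, `grounding_accuracy`, and `absolute_gain` are hypothetical, and the scoring rule (a prediction counts as correct when the predicted point falls inside the target element's bounding box) is an assumption that the abstract does not state. The last helper only illustrates what a gain in "absolute" terms means, that is, a difference in percentage points.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class GroundingExample:
    """Hypothetical evaluation item: screenshot, referring expression, target box."""
    screenshot_path: str
    referring_expression: str
    bbox: BBox


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """Return True if the predicted point lies inside the element's box."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(
    examples: Sequence[GroundingExample],
    predict: Callable[[str, str], Point],
) -> float:
    """Fraction of examples whose predicted point hits the target box.

    `predict` stands in for any grounding model: it takes a screenshot path
    and a referring expression and returns an (x, y) pixel coordinate.
    """
    if not examples:
        return 0.0
    hits = sum(
        point_in_bbox(predict(ex.screenshot_path, ex.referring_expression), ex.bbox)
        for ex in examples
    )
    return hits / len(examples)


def absolute_gain(acc_new: float, acc_old: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return (acc_new - acc_old) * 100
```

Because `predict` is just a callable, the same scoring loop could wrap any grounding model, which keeps the evaluation independent of how the coordinates are produced.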
