GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Abstract

One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. Building on this, we further design a grounding verifier that evaluates the proposed candidates and selects the most plausible action region for execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for the 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
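The coordinate-free idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the action head can be approximated by scaled dot-product attention between one <ACTOR> vector and a grid of patch vectors, and that candidate regions are simply the highest-weight patches. The function names, the toy feature values, and the 28-pixel patch size are all hypothetical, and the grounding verifier is left out.

```python
import math

def attention_over_patches(actor_vec, patch_vecs):
    """Softmax of scaled dot products between the <ACTOR> vector and each patch vector."""
    scale = math.sqrt(len(actor_vec))
    scores = [sum(a * p for a, p in zip(actor_vec, patch)) / scale
              for patch in patch_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_regions(weights, grid_w, patch_px, k=2):
    """Return the k highest-weight patches as (center_x, center_y, weight) in pixels."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    regions = []
    for i in ranked:
        row, col = divmod(i, grid_w)  # patches are stored row by row
        regions.append(((col + 0.5) * patch_px, (row + 0.5) * patch_px, weights[i]))
    return regions

# Toy 2x2 patch grid with 2-dimensional features (made-up values, assumption).
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.0]]
actor = [4.0, 0.0]

weights = attention_over_patches(actor, patches)
candidates = top_regions(weights, grid_w=2, patch_px=28, k=2)
```

Here `candidates` holds several regions produced by one pass over the patches, and no coordinate string is ever generated as text. In the method described above, a separate verifier would then choose among such candidates.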
