GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
June 3, 2025
Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao
cs.AI
Abstract
One of the principal challenges in building VLM-powered GUI agents is visual
grounding, i.e., localizing the appropriate screen region for action execution
based on both the visual content and the textual plans. Most existing work
formulates this as a text-based coordinate generation task. However, these
approaches suffer from several limitations: weak spatial-semantic alignment,
inability to handle ambiguous supervision targets, and a mismatch between the
dense nature of screen coordinates and the coarse, patch-level granularity of
visual features extracted by models like Vision Transformers. In this paper, we
propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its
core, GUI-Actor introduces an attention-based action head that learns to align
a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the
model to propose one or more action regions in a single forward pass. Building
on this, we further design a grounding verifier that evaluates the proposed
candidates and selects the most plausible action region for execution.
Extensive experiments show that GUI-Actor outperforms prior state-of-the-art
methods on multiple GUI action grounding benchmarks, with improved
generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B
even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7
with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by
incorporating the verifier, we find that fine-tuning only the newly introduced
action head (~100M parameters for the 7B model) while keeping the VLM backbone
frozen is sufficient to achieve performance comparable to previous
state-of-the-art models, highlighting that GUI-Actor can endow the underlying
VLM with effective grounding capabilities without compromising its
general-purpose strengths.
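The attention-based action head described above can be sketched as a single attention step: the hidden state of the dedicated <ACTOR> token acts as a query over all visual patch tokens, and the resulting softmax weights directly score candidate action regions, avoiding text-based coordinate generation. The following is a minimal, illustrative sketch, not the paper's actual implementation; the projection matrices, the 28 px patch size (borrowed from Qwen2-VL's ViT patching), and all function names are assumptions.

```python
import numpy as np

def action_head(actor_hidden, patch_hiddens, w_q, w_k, top_k=3):
    """Score every visual patch against the <ACTOR> token (illustrative sketch).

    actor_hidden:  (d,)   hidden state of the dedicated <ACTOR> token
    patch_hiddens: (N, d) hidden states of the N visual patch tokens
    w_q, w_k:      (d, d) projection matrices (learned in the real model)
    """
    d = actor_hidden.shape[0]
    q = actor_hidden @ w_q                      # (d,)  query from <ACTOR>
    k = patch_hiddens @ w_k                     # (N, d) keys from patches
    scores = k @ q / np.sqrt(d)                 # (N,) dot-product attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over all patches
    # Highest-weight patches serve as candidate action regions,
    # all proposed in one forward pass.
    candidates = np.argsort(weights)[::-1][:top_k]
    return weights, candidates

def patch_center(idx, grid_w, patch_px=28):
    """Map a patch index back to pixel coordinates of its center.
    A 28 px patch grid is an assumed example, not a stated detail."""
    row, col = divmod(idx, grid_w)
    return (col + 0.5) * patch_px, (row + 0.5) * patch_px
```

A separate grounding verifier would then re-score these top-k candidates (e.g., by rendering a marker at each `patch_center` and asking a scoring model which placement best matches the textual plan) and pick the most plausible one to execute.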