GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Abstract

One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plan. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models such as Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. To complement this, we further design a grounding verifier that evaluates the proposed candidates and selects the most plausible action region for execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (roughly 100M parameters for the 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
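The coordinate-free idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention between a single <ACTOR> vector and a flat list of patch vectors, and it stands in for the grounding verifier with an arbitrary scoring callable. All function names, the `top_k` parameter, and the toy vectors are hypothetical.

```python
import math

def actor_attention(actor_vec, patch_vecs):
    """Softmax over scaled dot products between the actor vector and each patch vector.

    Assumption: plain dot-product attention, used here only for illustration.
    """
    scale = math.sqrt(len(actor_vec))
    scores = [
        sum(a * p for a, p in zip(actor_vec, patch)) / scale
        for patch in patch_vecs
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def propose_points(weights, grid_w, grid_h, top_k=2):
    """Return normalized (x, y) centers of the top_k highest-weighted patches.

    Patches are assumed to be stored row by row on a grid_w x grid_h grid.
    """
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    points = []
    for idx in ranked[:top_k]:
        row, col = divmod(idx, grid_w)
        points.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return points

def select_with_verifier(points, verifier_score):
    """Pick the candidate with the highest score from a (hypothetical) verifier."""
    return max(points, key=verifier_score)

# Toy 2x2 patch grid with 2-dimensional features (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 0.0]]
actor = [1.0, 0.0]

weights = actor_attention(actor, patches)
candidates = propose_points(weights, grid_w=2, grid_h=2, top_k=2)
# Placeholder verifier: prefers candidates near the top of the screen.
chosen = select_with_verifier(candidates, lambda pt: -pt[1])
print(candidates, chosen)
```

In this sketch the model never emits coordinates as text; the action location falls out of which patches receive the most attention, and a separate scoring step picks among several candidates.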
