FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
January 7, 2026
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
cs.AI
Abstract
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
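The PosPad step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes patches are laid out as a flat 1D sequence in raster order, that a per-patch relevance score is already available (the fused instruction-conditioned and UI-graph score is not reproduced here), and that selection is a simple top-fraction cut. The function names `keep_mask_from_scores` and `pospad` are hypothetical.

```python
from typing import List, Tuple


def keep_mask_from_scores(scores: List[float], keep_ratio: float) -> List[bool]:
    """Mark the highest-scoring fraction of patches as kept.

    Assumption: a plain top-k cut; the paper's actual selection rule may differ.
    """
    n = len(scores)
    k = max(1, round(n * keep_ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    mask = [False] * n
    for i in top:
        mask[i] = True
    return mask


def pospad(keep_mask: List[bool]) -> List[Tuple[int, bool]]:
    """Compress each contiguous run of dropped tokens into one marker.

    Returns (position_index, is_marker) pairs. Kept tokens retain their
    original index; each dropped run is replaced by a single marker that
    carries the index of the run's last token, so position indices stay
    increasing and kept tokens are never renumbered.
    """
    out: List[Tuple[int, bool]] = []
    n = len(keep_mask)
    i = 0
    while i < n:
        if keep_mask[i]:
            out.append((i, False))
            i += 1
        else:
            j = i
            # Extend j to the last index of this dropped run.
            while j + 1 < n and not keep_mask[j + 1]:
                j += 1
            out.append((j, True))
            i = j + 1
    return out


# Toy example: 8 patches, keep the top 37.5% (3 patches).
scores = [0.9, 0.1, 0.2, 0.1, 0.8, 0.7, 0.0, 0.3]
mask = keep_mask_from_scores(scores, keep_ratio=0.375)
sequence = pospad(mask)
print(sequence)
# [(0, False), (3, True), (4, False), (5, False), (7, True)]
```

In this toy case the sequence shrinks from 8 entries to 5, while patches 0, 4, and 5 keep their original position indices. By the same arithmetic, retaining 30% of roughly 4700 visual tokens leaves about 1410 kept tokens plus one marker per dropped run.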