Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Abstract

GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet it remains difficult for current vision-language models (VLMs). The core bottleneck is reliable patch-to-pixel mapping, which breaks down when models must extrapolate to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures become frequent at new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions much as one reads gridlines on a map, and adjust a coordinate rather than generate it from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that the width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
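The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the token format or the encoding layout, so the function names, the gridline spacing, and the two-axis simplification below are all assumptions chosen for illustration. The first function shows the "reference a gridline, then adjust" idea as plain arithmetic. The remaining functions compare a contiguous assignment of rotary frequency bands to the height and width axes with a round-robin (interleaved) one, assuming that is the kind of asymmetry being addressed.

```python
def split_coordinate(pixel: int, spacing: int) -> tuple[int, int]:
    """Express a pixel coordinate as (nearest gridline, signed offset).

    Hypothetical helper: `spacing` is an assumed distance between
    ruler marks, not a value taken from the paper.
    """
    mark = ((pixel + spacing // 2) // spacing) * spacing
    return mark, pixel - mark


def contiguous_axes(n_bands: int, axes: tuple[str, ...] = ("h", "w")) -> list[str]:
    """Assign frequency bands to axes in contiguous blocks (assumed baseline)."""
    per_axis = n_bands // len(axes)
    return [axes[min(i // per_axis, len(axes) - 1)] for i in range(n_bands)]


def interleaved_axes(n_bands: int, axes: tuple[str, ...] = ("h", "w")) -> list[str]:
    """Assign frequency bands to axes round-robin (assumed interleaved layout)."""
    return [axes[i % len(axes)] for i in range(n_bands)]


def mean_band_index(assignment: list[str], axis: str) -> float:
    """Average band index owned by one axis; a rough proxy for which
    part of the frequency spectrum that axis receives."""
    indices = [i for i, a in enumerate(assignment) if a == axis]
    return sum(indices) / len(indices)


if __name__ == "__main__":
    # Gridline-plus-offset view of an x coordinate on a wide display.
    print(split_coordinate(1337, 100))

    for name, layout in (("contiguous", contiguous_axes(8)),
                         ("interleaved", interleaved_axes(8))):
        gap = mean_band_index(layout, "w") - mean_band_index(layout, "h")
        print(name, layout, "mean-index gap between w and h:", gap)
```

Under these assumptions, the contiguous layout gives height only the low-index bands and width only the high-index ones, while the interleaved layout spreads both axes across the whole range, which is one way width and height could end up "represented equally".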
