Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Abstract

GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current vision-language models (VLMs). The core bottleneck is reliable patch-to-pixel mapping, which breaks down when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate at new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions much like gridlines on a map and adjust coordinates rather than generate them from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that the width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
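
The abstract does not spell out how I-MRoPE assigns rotary frequencies, so the following is only an illustrative sketch under stated assumptions: that a standard multi-axis rotary scheme gives each axis (temporal, height, width) a contiguous block of frequency indices, and that "interleaved" means assigning indices to axes round-robin. The function names and the 12-frequency toy size are hypothetical and are not taken from the paper. In rotary embeddings a lower frequency index corresponds to a higher rotation frequency, so the mean index per axis gives a rough picture of how evenly height and width are treated.

```python
from collections import defaultdict

# Axis labels are an assumption for illustration: temporal, height, width.
AXES = ("t", "h", "w")


def chunked_allocation(sections):
    """Assumed baseline: each axis gets a contiguous block of frequency indices."""
    alloc = []
    for axis, count in zip(AXES, sections):
        alloc.extend([axis] * count)
    return alloc


def interleaved_allocation(num_freqs):
    """Assumed interleaving: indices are handed out round-robin (t, h, w, t, h, w, ...)."""
    return [AXES[i % len(AXES)] for i in range(num_freqs)]


def mean_index_per_axis(alloc):
    """Average frequency index assigned to each axis (lower index = higher frequency)."""
    indices = defaultdict(list)
    for i, axis in enumerate(alloc):
        indices[axis].append(i)
    return {axis: sum(v) / len(v) for axis, v in indices.items()}


if __name__ == "__main__":
    chunked = mean_index_per_axis(chunked_allocation((4, 4, 4)))
    interleaved = mean_index_per_axis(interleaved_allocation(12))
    # Gap between the width and height axes under each allocation.
    print("chunked gap:", chunked["w"] - chunked["h"])
    print("interleaved gap:", interleaved["w"] - interleaved["h"])
```

In this toy setting the chunked layout leaves width with only the lowest-frequency slots, while the round-robin layout gives height and width nearly the same spread of frequencies, which is one plausible reading of "represented equally".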
