Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Abstract

GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet it remains difficult for current vision-language models (VLMs). The core bottleneck is reliable patch-to-pixel mapping, which breaks down when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures become more frequent at new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions much as one uses gridlines on a map and adjust coordinates rather than generate them from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that the width and height dimensions are represented equally, addressing the asymmetry of standard positional encoding schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
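The abstract gives no implementation details, so the following is only a minimal sketch of how the two ideas could be read, not the authors' method. It assumes that ruler markers sit at a fixed patch stride along one axis and that a target coordinate can be expressed as the nearest marker plus a small offset, and it assumes that "interleaved" means assigning rotary frequency pairs to height and width in round-robin order instead of in contiguous blocks. All function names, the patch size, and the stride are hypothetical.

```python
from bisect import bisect_right


def ruler_positions(width_px, patch_px, stride_patches):
    """Pixel x-coordinates of hypothetical ruler markers, one every
    `stride_patches` patches (assumed layout, not taken from the paper)."""
    step = patch_px * stride_patches
    return list(range(0, width_px, step))


def to_ruler_offset(x_px, rulers):
    """Express a pixel coordinate as (marker at or below x, residual offset),
    i.e. 'adjust from a gridline' rather than produce x from scratch."""
    i = bisect_right(rulers, x_px) - 1
    return rulers[i], x_px - rulers[i]


def chunked_axes(n_pairs, axes=("h", "w")):
    """Contiguous allocation: each axis owns one block of frequency pairs,
    so the axes end up in different frequency ranges (simplified stand-in
    for a standard multi-axis rotary scheme)."""
    per_axis = n_pairs // len(axes)
    layout = [axis for axis in axes for _ in range(per_axis)]
    # Any leftover pairs go to the last axis.
    layout += [axes[-1]] * (n_pairs - len(layout))
    return layout


def interleaved_axes(n_pairs, axes=("h", "w")):
    """Round-robin allocation: every axis is spread across the whole
    frequency range (assumed meaning of 'interleaved')."""
    return [axes[i % len(axes)] for i in range(n_pairs)]


if __name__ == "__main__":
    rulers = ruler_positions(width_px=1920, patch_px=14, stride_patches=8)
    print(to_ruler_offset(1000, rulers))
    print(chunked_axes(8))
    print(interleaved_axes(8))
```

Under these assumptions, the offset from the nearest marker stays bounded by the marker spacing regardless of screen width, which is one plausible reason explicit markers would help at resolutions not seen during training.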
