GUI-G^2: Gaussian Reward Modeling for GUI Grounding

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, producing sparse signals that ignore the continuous nature of spatial interaction. Motivated by human clicking behavior, which naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification into dense continuous optimization, in which Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G^2 substantially outperforms the state-of-the-art method UI-TARS-72B, with the largest improvement, 24.7%, on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
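To make the two reward mechanisms concrete, the following is a minimal sketch of one possible reading of the description above, not the authors' implementation. The abstract gives no formulas, so everything specific here is an assumption: the function names, the axis-aligned Gaussian form, the scale factor `alpha` that ties the standard deviation to the element's width and height (standing in for the adaptive variance mechanism), and the choice to measure coverage as the probability mass of the predicted Gaussian that falls inside the target rectangle.

```python
import math


def gaussian_point_reward(px, py, target_box, alpha=0.5):
    """Reward that decays exponentially with distance from the element centroid.

    target_box is (x1, y1, x2, y2). alpha is a hypothetical scale factor:
    sigma is set to alpha * width (or height), so larger elements get a
    wider, more forgiving reward distribution (assumed adaptive variance).
    """
    x1, y1, x2, y2 = target_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max(alpha * (x2 - x1), 1e-6)  # guard against zero-size elements
    sy = max(alpha * (y2 - y1), 1e-6)
    z = ((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2
    return math.exp(-0.5 * z)


def _normal_cdf(x, mu, sigma):
    """CDF of a 1-D normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def coverage_reward(pred_box, target_box, alpha=0.5):
    """Mass of the predicted axis-aligned Gaussian inside the target box.

    The predicted Gaussian is centered on pred_box with sigma proportional
    to its size (assumption). Because the Gaussian is axis-aligned, the 2-D
    mass factorizes into a product of two 1-D interval probabilities.
    """
    px1, py1, px2, py2 = pred_box
    tx1, ty1, tx2, ty2 = target_box
    mx, my = (px1 + px2) / 2.0, (py1 + py2) / 2.0
    sx = max(alpha * (px2 - px1), 1e-6)
    sy = max(alpha * (py2 - py1), 1e-6)
    mass_x = _normal_cdf(tx2, mx, sx) - _normal_cdf(tx1, mx, sx)
    mass_y = _normal_cdf(ty2, my, sy) - _normal_cdf(ty1, my, sy)
    return mass_x * mass_y


if __name__ == "__main__":
    button = (100.0, 200.0, 180.0, 230.0)  # hypothetical target element
    print(gaussian_point_reward(140.0, 215.0, button))  # click at the centroid
    print(gaussian_point_reward(170.0, 220.0, button))  # inside, off-center
    print(gaussian_point_reward(300.0, 400.0, button))  # far miss
    print(coverage_reward((105.0, 202.0, 185.0, 232.0), button))
```

Unlike a binary hit-or-miss reward, both functions vary smoothly with the prediction, so a near miss still receives a nonzero signal that increases as the prediction moves toward the element center.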

