GUI-G^2: Gaussian Reward Modeling for GUI Grounding
July 21, 2025
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Graphical User Interface (GUI) grounding maps natural language instructions
to precise interface locations for autonomous interaction. Current
reinforcement learning approaches use binary rewards that treat elements as
hit-or-miss targets, creating sparse signals that ignore the continuous nature
of spatial interactions. Motivated by human clicking behavior that naturally
forms Gaussian distributions centered on target elements, we introduce GUI
Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that
models GUI elements as continuous Gaussian distributions across the interface
plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point
rewards model precise localization through exponentially decaying distributions
centered on element centroids, while coverage rewards assess spatial alignment
by measuring the overlap between predicted Gaussian distributions and target
regions. To handle diverse element scales, we develop an adaptive variance
mechanism that calibrates reward distributions based on element dimensions.
This framework transforms GUI grounding from sparse binary classification to
dense continuous optimization, where Gaussian distributions generate rich
gradient signals that guide models toward optimal interaction positions.
Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro
benchmarks demonstrate that GUI-G^2 substantially outperforms the
state-of-the-art method UI-TARS-72B, with the largest improvement of
24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides
superior robustness to interface variations and enhanced generalization to
unseen layouts, establishing a new paradigm for spatial reasoning in GUI
interaction tasks.
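
To make the reward design concrete, below is a minimal sketch of how the two mechanisms described in the abstract could be implemented, assuming a point prediction scored against an axis-aligned target box: a Gaussian point reward that decays exponentially with distance from the element centroid, a coverage reward computed as the probability mass of a predicted Gaussian falling inside the target region, and an adaptive variance tied to element width and height. The function names, the sigma_scale constant, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation or hyperparameters.

# Illustrative sketch of a Gaussian grounding reward in the spirit of GUI-G^2.
# The exact formulas, constants, and term weighting in the paper may differ;
# sigma_scale and the 0.5/0.5 weighting below are assumptions.
import math


def adaptive_sigma(width: float, height: float, sigma_scale: float = 0.25):
    """Scale the Gaussian spread with the element size (adaptive variance)."""
    return sigma_scale * width, sigma_scale * height


def gaussian_point_reward(pred_x: float, pred_y: float, box):
    """Exponentially decaying reward centered on the element centroid."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = adaptive_sigma(x2 - x1, y2 - y1)
    return math.exp(-(((pred_x - cx) ** 2) / (2 * sx ** 2)
                      + ((pred_y - cy) ** 2) / (2 * sy ** 2)))


def gaussian_coverage_reward(pred_x: float, pred_y: float, box):
    """Probability mass of a Gaussian centered on the prediction that falls
    inside the target box, used as a proxy for the overlap between the
    predicted distribution and the target region (independent x/y axes)."""
    x1, y1, x2, y2 = box
    sx, sy = adaptive_sigma(x2 - x1, y2 - y1)

    def mass_1d(lo, hi, mu, sigma):
        # Gaussian CDF difference over [lo, hi] via the error function.
        cdf = lambda v: 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
        return cdf(hi) - cdf(lo)

    return mass_1d(x1, x2, pred_x, sx) * mass_1d(y1, y2, pred_y, sy)


def gui_g2_reward(pred_x: float, pred_y: float, box, w_point=0.5, w_cov=0.5):
    """Combined dense reward; equal weights are an assumption for illustration."""
    return (w_point * gaussian_point_reward(pred_x, pred_y, box)
            + w_cov * gaussian_coverage_reward(pred_x, pred_y, box))


# Example: a click near the center of a 100x40 button earns a high, smoothly
# varying reward, while a click far from the element earns a reward near zero.
print(round(gui_g2_reward(148.0, 52.0, (100.0, 30.0, 200.0, 70.0)), 3))
print(round(gui_g2_reward(400.0, 300.0, (100.0, 30.0, 200.0, 70.0)), 3))

Unlike a binary hit-or-miss reward, every prediction here receives a graded score, so reinforcement learning gets a useful gradient signal even when the predicted point lies outside the target element.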