GUI-G^2: Gaussian Reward Modeling for GUI Grounding
July 21, 2025
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Graphical User Interface (GUI) grounding maps natural language instructions
to precise interface locations for autonomous interaction. Current
reinforcement learning approaches use binary rewards that treat elements as
hit-or-miss targets, creating sparse signals that ignore the continuous nature
of spatial interactions. Motivated by human clicking behavior that naturally
forms Gaussian distributions centered on target elements, we introduce GUI
Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that
models GUI elements as continuous Gaussian distributions across the interface
plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point
rewards model precise localization through exponentially decaying distributions
centered on element centroids, while coverage rewards assess spatial alignment
by measuring the overlap between predicted Gaussian distributions and target
regions. To handle diverse element scales, we develop an adaptive variance
mechanism that calibrates reward distributions based on element dimensions.
This framework transforms GUI grounding from sparse binary classification to
dense continuous optimization, where Gaussian distributions generate rich
gradient signals that guide models toward optimal interaction positions.
Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro
benchmarks demonstrate that GUI-G^2 substantially outperforms the
state-of-the-art method UI-TARS-72B, with the largest improvement of
24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides
superior robustness to interface variations and enhanced generalization to
unseen layouts, establishing a new paradigm for spatial reasoning in GUI
interaction tasks.
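
To make the reward design concrete, below is a minimal sketch of how the two mechanisms described in the abstract could be implemented, assuming a point prediction scored against an axis-aligned target box: a Gaussian point reward that decays exponentially with distance from the element centroid, a coverage reward computed as the probability mass of a predicted Gaussian falling inside the target region, and an adaptive variance tied to element width and height. The function names, the sigma_scale constant, and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation or hyperparameters.

# Illustrative sketch of a Gaussian grounding reward in the spirit of GUI-G^2.
# The exact formulas, constants, and term weighting in the paper may differ;
# sigma_scale and the 0.5/0.5 weighting below are assumptions.
import math


def adaptive_sigma(width: float, height: float, sigma_scale: float = 0.25):
    """Scale the Gaussian spread with the element size (adaptive variance)."""
    return sigma_scale * width, sigma_scale * height


def gaussian_point_reward(pred_x: float, pred_y: float, box):
    """Exponentially decaying reward centered on the element centroid."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = adaptive_sigma(x2 - x1, y2 - y1)
    return math.exp(-(((pred_x - cx) ** 2) / (2 * sx ** 2)
                      + ((pred_y - cy) ** 2) / (2 * sy ** 2)))


def gaussian_coverage_reward(pred_x: float, pred_y: float, box):
    """Probability mass of a Gaussian centered on the prediction that falls
    inside the target box, used as a proxy for the overlap between the
    predicted distribution and the target region (independent x/y axes)."""
    x1, y1, x2, y2 = box
    sx, sy = adaptive_sigma(x2 - x1, y2 - y1)

    def mass_1d(lo, hi, mu, sigma):
        # Gaussian CDF difference over [lo, hi] via the error function.
        cdf = lambda v: 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
        return cdf(hi) - cdf(lo)

    return mass_1d(x1, x2, pred_x, sx) * mass_1d(y1, y2, pred_y, sy)


def gui_g2_reward(pred_x: float, pred_y: float, box, w_point=0.5, w_cov=0.5):
    """Combined dense reward; equal weights are an assumption for illustration."""
    return (w_point * gaussian_point_reward(pred_x, pred_y, box)
            + w_cov * gaussian_coverage_reward(pred_x, pred_y, box))


# Example: a click near the center of a 100x40 button earns a high, smoothly
# varying reward, while a click far from the element earns a reward near zero.
print(round(gui_g2_reward(148.0, 52.0, (100.0, 30.0, 200.0, 70.0)), 3))
print(round(gui_g2_reward(400.0, 300.0, (100.0, 30.0, 200.0, 70.0)), 3))

Unlike a binary hit-or-miss reward, every prediction here receives a graded score, so reinforcement learning gets a useful gradient signal even when the predicted point lies outside the target element.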