GUI-G^2: Gaussian Reward Modeling for GUI Grounding
July 21, 2025
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI
Abstract
Graphical User Interface (GUI) grounding maps natural language instructions
to precise interface locations for autonomous interaction. Current
reinforcement learning approaches use binary rewards that treat elements as
hit-or-miss targets, creating sparse signals that ignore the continuous nature
of spatial interactions. Motivated by human clicking behavior that naturally
forms Gaussian distributions centered on target elements, we introduce GUI
Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that
models GUI elements as continuous Gaussian distributions across the interface
plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point
rewards model precise localization through exponentially decaying distributions
centered on element centroids, while coverage rewards assess spatial alignment
by measuring the overlap between predicted Gaussian distributions and target
regions. To handle diverse element scales, we develop an adaptive variance
mechanism that calibrates reward distributions based on element dimensions.
This framework transforms GUI grounding from sparse binary classification to
dense continuous optimization, where Gaussian distributions generate rich
gradient signals that guide models toward optimal interaction positions.
Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro
benchmarks demonstrate that GUI-G^2 substantially outperforms the
state-of-the-art method UI-TARS-72B, with the most significant improvement of
24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides
superior robustness to interface variations and enhanced generalization to
unseen layouts, establishing a new paradigm for spatial reasoning in GUI
interaction tasks.
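
To make the reward shaping concrete, below is a minimal Python sketch based only on the description in the abstract: a Gaussian point reward that decays exponentially from the target element's centroid, a coverage term computed here as the Gaussian probability mass falling inside the target box, and a variance calibrated to the element's width and height. The function names, the scale factor ALPHA, and the equal weighting of the two terms are illustrative assumptions rather than the paper's exact formulation.

```python
import math

# Illustrative sketch of the dense reward described in the abstract.
# ALPHA and the 50/50 combination of the two terms are assumptions,
# not values taken from the paper.

ALPHA = 0.25  # assumed ratio between element size and Gaussian std


def adaptive_sigma(width, height, alpha=ALPHA):
    """Calibrate the Gaussian spread to the target element's dimensions."""
    eps = 1e-6
    return max(alpha * width, eps), max(alpha * height, eps)


def gaussian_point_reward(pred_xy, target_box):
    """Exponentially decaying reward centered on the element's centroid."""
    x0, y0, x1, y1 = target_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx, sy = adaptive_sigma(x1 - x0, y1 - y0)
    px, py = pred_xy
    return math.exp(-0.5 * (((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2))


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage_reward(pred_xy, target_box):
    """Probability mass of a Gaussian placed at the predicted point that
    falls inside the target region (one way to score spatial overlap)."""
    x0, y0, x1, y1 = target_box
    sx, sy = adaptive_sigma(x1 - x0, y1 - y0)
    px, py = pred_xy
    mass_x = _normal_cdf((x1 - px) / sx) - _normal_cdf((x0 - px) / sx)
    mass_y = _normal_cdf((y1 - py) / sy) - _normal_cdf((y0 - py) / sy)
    return mass_x * mass_y


def gui_g2_reward(pred_xy, target_box, w_point=0.5, w_cov=0.5):
    """Dense reward combining point accuracy and coverage (weights assumed)."""
    return w_point * gaussian_point_reward(pred_xy, target_box) + \
           w_cov * coverage_reward(pred_xy, target_box)


if __name__ == "__main__":
    button = (100, 200, 180, 230)              # target element (x0, y0, x1, y1)
    print(gui_g2_reward((140, 215), button))   # click at the centroid: near-maximal reward
    print(gui_g2_reward((150, 220), button))   # slight offset: graded reward, not a binary miss
    print(gui_g2_reward((300, 400), button))   # far miss: reward decays toward 0
```

Unlike a binary hit-or-miss reward, every prediction in this sketch receives a graded signal that grows as the click moves toward the element's center and shrinks smoothly as it moves away, which is the dense optimization landscape the abstract attributes to GUI-G^2.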