
GUI-G^2: Gaussian Reward Modeling for GUI Grounding

July 21, 2025
Authors: Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
cs.AI

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G^2), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G^2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G^2 substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
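
The contrast between a binary hit-or-miss reward and the Gaussian rewards described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the `(x1, y1, x2, y2)` box format, the scaling factor `alpha` that ties the standard deviation to element width and height, and the use of the Bhattacharyya coefficient between two axis-aligned Gaussians as one possible overlap measure. It should not be read as the authors' implementation.

```python
import math

# Boxes are (x1, y1, x2, y2) in pixels. This format is an assumption.


def binary_reward(x, y, box):
    """Sparse baseline: 1.0 if the click lands inside the box, else 0.0."""
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def adaptive_sigma(box, alpha=0.5):
    """Hypothetical adaptive variance: sigma scales with element size.

    alpha is an illustrative hyperparameter, not a value from the source.
    """
    x1, y1, x2, y2 = box
    # Guard against degenerate (zero-size) boxes.
    return max(alpha * (x2 - x1), 1e-6), max(alpha * (y2 - y1), 1e-6)


def gaussian_point_reward(x, y, box, alpha=0.5):
    """Dense reward that decays exponentially with distance from the centroid."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = adaptive_sigma(box, alpha)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


def _bhattacharyya_1d(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient between two 1-D Gaussians, in (0, 1]."""
    var_sum = s1 * s1 + s2 * s2
    scale = math.sqrt(2.0 * s1 * s2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def gaussian_coverage_reward(pred_box, target_box, alpha=0.5):
    """Overlap between Gaussians fitted to a predicted box and the target box.

    Assumes the model outputs a box and that both Gaussians are axis-aligned,
    so the 2-D coefficient factorizes into a product of per-axis terms.
    """
    px = (pred_box[0] + pred_box[2]) / 2.0
    py = (pred_box[1] + pred_box[3]) / 2.0
    tx = (target_box[0] + target_box[2]) / 2.0
    ty = (target_box[1] + target_box[3]) / 2.0
    psx, psy = adaptive_sigma(pred_box, alpha)
    tsx, tsy = adaptive_sigma(target_box, alpha)
    return _bhattacharyya_1d(px, psx, tx, tsx) * _bhattacharyya_1d(py, psy, ty, tsy)
```

Under these assumptions, a click at (150, 50) on a target box of (0, 0, 100, 100) receives a binary reward of 0, while the Gaussian point reward is exp(-2), roughly 0.135, so a near miss still carries a signal that grows as the prediction approaches the element center.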