SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration
February 2, 2026
Authors: Qingni Wang, Yue Fan, Xin Eric Wang
cs.AI
Abstract
Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. However, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and shows potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
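The two stages named in the abstract, a dispersion-based uncertainty score and a calibrated decision threshold, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the score is a plain root-mean-square distance of sampled click points from their centroid (a stand-in for the paper's distribution-aware measure), and the threshold is chosen by empirical FDR on a calibration set, which does not by itself carry the statistical guarantee the paper claims. The function names `spatial_dispersion` and `calibrate_threshold` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def spatial_dispersion(samples: Sequence[Point]) -> float:
    """Root-mean-square distance of sampled click points from their centroid.

    Assumed stand-in for an uncertainty score: widely scattered samples
    yield a larger value than tightly clustered ones.
    """
    n = len(samples)
    cx = sum(x for x, _ in samples) / n
    cy = sum(y for _, y in samples) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in samples) / n)


def calibrate_threshold(
    uncertainties: Sequence[float],
    correct: Sequence[bool],
    alpha: float,
) -> Optional[float]:
    """Largest threshold t whose accepted set (u <= t) has empirical FDR <= alpha.

    Empirical FDR here means: incorrect accepted predictions divided by all
    accepted predictions on the calibration set. Returns None when no
    threshold satisfies the target. This is an empirical selection only,
    with no finite-sample guarantee.
    """
    pairs: List[Tuple[float, bool]] = sorted(zip(uncertainties, correct))
    best: Optional[float] = None
    errors = 0
    i = 0
    while i < len(pairs):
        # Consume all calibration points tied at the same uncertainty value,
        # since a threshold accepts or rejects them together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if not pairs[j][1]:
                errors += 1
            j += 1
        if errors / j <= alpha:
            best = pairs[i][0]
        i = j
    return best


if __name__ == "__main__":
    # Hypothetical calibration data: one uncertainty score and one
    # correctness label per calibration example.
    scores = [0.1, 0.2, 0.3, 0.4, 0.5]
    labels = [True, True, False, True, False]
    t = calibrate_threshold(scores, labels, alpha=0.25)
    print("threshold:", t)
```

At test time, under the same assumptions, a prediction would be accepted when its dispersion score is at most the calibrated threshold and deferred otherwise, for example to a stronger model or a human.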