SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Abstract

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. However, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround uses a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Through the calibration process, SafeGround then derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and offers the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
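
The two ideas in the abstract, scoring uncertainty from the spatial spread of sampled predictions and calibrating an acceptance threshold against a target error rate, can be illustrated with a minimal Python sketch. It is not the paper's method: the function names, the choice of mean distance to the centroid as the dispersion score, and the purely empirical FDR check are all assumptions made for illustration. In particular, an empirical check like this one does not by itself provide the finite-sample statistical guarantee the abstract refers to.

```python
import math


def spatial_dispersion(points):
    """Hypothetical uncertainty score: mean Euclidean distance of
    sampled (x, y) click points from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / n


def calibrate_threshold(uncertainties, is_correct, alpha):
    """Return the largest threshold t such that, among calibration items
    with uncertainty <= t, the fraction of incorrect predictions (the
    empirical FDR) is at most alpha. Returns None if no threshold works.

    Assumption: a plain empirical estimate, with no confidence correction.
    """
    pairs = sorted(zip(uncertainties, is_correct))
    best = None
    errors = 0
    for count, (u, ok) in enumerate(pairs, start=1):
        if not ok:
            errors += 1
        # Evaluate only after the last item of a group of tied scores,
        # so every item with uncertainty <= u has been counted.
        if count < len(pairs) and pairs[count][0] == u:
            continue
        if errors / count <= alpha:
            best = u
    return best


def accept(points, threshold):
    """Accept a prediction only if its dispersion is within the threshold."""
    return threshold is not None and spatial_dispersion(points) <= threshold
```

At test time, a prediction whose samples cluster tightly would be accepted, and one whose samples scatter across the screen would be rejected or handed to a fallback. What happens to rejected predictions is a design choice this sketch leaves open.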
