HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
October 31, 2025
Authors: Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan
cs.AI
Abstract
Autonomous Graphical User Interface (GUI) agents rely on accurate GUI
grounding, which maps language instructions to on-screen coordinates, to
execute user commands. However, current models, whether trained via supervised
fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of
their capability boundaries, leading to overconfidence and unreliable
predictions. We first systematically evaluate probabilistic and verbalized
confidence in general and GUI-specific models, revealing a misalignment between
confidence and actual accuracy, which is particularly critical in dynamic GUI
automation tasks, where a single error can cause task failure. To address this,
we propose HyperClick, a novel framework that enhances reliable GUI grounding
through uncertainty calibration. HyperClick introduces a dual reward mechanism,
combining a binary reward for correct actions with truncated Gaussian-based
spatial confidence modeling, calibrated using the Brier score. This approach
jointly optimizes grounding accuracy and confidence reliability, fostering
introspective self-criticism. Extensive experiments on seven challenging
benchmarks show that HyperClick achieves state-of-the-art performance while
providing well-calibrated confidence. By enabling explicit confidence
calibration and introspective self-criticism, HyperClick reduces overconfidence
and supports more reliable GUI automation.
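
The dual reward described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the `sigma_scale` and `weight` parameters, the choice to center the Gaussian on the target box and truncate it at the box boundary, and the use of one minus the squared error as a Brier-style reward. The paper's actual formulation may differ.

```python
import math


def binary_reward(pred_xy, box):
    """Return 1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


def spatial_confidence_target(pred_xy, box, sigma_scale=0.5):
    """Assumed truncated-Gaussian target: 1.0 at the box center, decaying
    toward the edges, and exactly 0.0 outside the box (the truncation)."""
    x, y = pred_xy
    x1, y1, x2, y2 = box
    if not (x1 <= x <= x2 and y1 <= y <= y2):
        return 0.0
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Standard deviations proportional to the box half-extents (an assumption).
    sx = max((x2 - x1) / 2.0 * sigma_scale, 1e-6)
    sy = max((y2 - y1) / 2.0 * sigma_scale, 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


def brier_reward(confidence, target):
    """Brier-style reward: 1 minus the squared gap between the stated
    confidence and the target, so a smaller gap earns a higher reward."""
    return 1.0 - (confidence - target) ** 2


def dual_reward(pred_xy, confidence, box, weight=1.0):
    """Sum of the correctness reward and the weighted calibration reward."""
    target = spatial_confidence_target(pred_xy, box)
    return binary_reward(pred_xy, box) + weight * brier_reward(confidence, target)


if __name__ == "__main__":
    box = (0.0, 0.0, 100.0, 50.0)  # hypothetical target element (x1, y1, x2, y2)
    print(dual_reward((50.0, 25.0), 1.0, box))   # confident hit at the center
    print(dual_reward((200.0, 25.0), 0.9, box))  # overconfident miss
    print(dual_reward((200.0, 25.0), 0.0, box))  # miss reported with low confidence
```

Under these assumptions, a miss reported with low confidence earns more than an overconfident miss, which is the behavior the calibration term is meant to encourage.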