UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
April 15, 2026
Authors: Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but they apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain about a given case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and the scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
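To make the two ideas in the abstract concrete, here is a minimal sketch of how a confidence-aware gate and a total-variance crop radius could be computed from a set of stochastically sampled box predictions. This is not the authors' implementation; the function names (`should_zoom`, `crop_radius`), the pairwise-IoU consensus measure, the uniform-distribution treatment of box extent, and all thresholds and scale constants are illustrative assumptions.

```python
import math
from itertools import combinations

def _iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def should_zoom(boxes, token_conf, iou_thresh=0.7, conf_thresh=0.9):
    """Confidence-aware gate (sketch): trigger zoom-in only when spatial
    consensus among stochastic candidates or token-level generation
    confidence falls below a threshold. Thresholds are illustrative."""
    pairs = list(combinations(boxes, 2))
    consensus = sum(_iou(a, b) for a, b in pairs) / max(1, len(pairs))
    return consensus < iou_thresh or token_conf < conf_thresh

def crop_radius(boxes, base_scale=3.0, min_radius=64.0):
    """Per-instance crop radius (sketch) via the law of total variance:
    total variance = E[intra-sample variance] + Var(inter-sample means)."""
    n = len(boxes)
    # Inter-sample positional spread: variance of box centers.
    cxs = [(x1 + x2) / 2 for x1, _, x2, _ in boxes]
    cys = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]
    mx, my = sum(cxs) / n, sum(cys) / n
    inter = sum((cx - mx) ** 2 + (cy - my) ** 2
                for cx, cy in zip(cxs, cys)) / n
    # Intra-sample box extent: each box treated as a uniform distribution
    # over its area, so per-axis variance is side**2 / 12.
    intra = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) / 12
                for x1, y1, x2, y2 in boxes) / n
    return max(min_radius, base_scale * math.sqrt(inter + intra))
```

With tightly clustered candidates and high token confidence the gate stays closed and no re-inference is spent; scattered candidates widen the crop so the true target is still likely to fall inside the zoomed region.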