UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
April 15, 2026
Authors: Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but they apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and the scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
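The crop-sizing idea in the abstract can be sketched as a short calculation. This is a minimal illustration of decomposing prediction variance via the law of total variance, not the paper's exact formulation: the function name `adaptive_crop_radius`, the uniform-distribution proxy for intra-sample box extent, and the `scale` constant are all assumptions made for the example.

```python
import numpy as np

def adaptive_crop_radius(boxes: np.ndarray, scale: float = 2.0) -> float:
    """Sketch of uncertainty-driven crop sizing via the law of total variance.

    boxes: (N, 4) array of stochastic candidate boxes (cx, cy, w, h),
    e.g. from N sampled decodes of the grounding model.
    """
    centers = boxes[:, :2]           # (N, 2) predicted centers per sample
    half_extents = boxes[:, 2:] / 2  # (N, 2) half width/height per sample

    # Inter-sample positional spread: variance of centers across samples,
    # i.e. Var[E[p | sample]].
    inter_var = centers.var(axis=0)

    # Intra-sample box extent: treat each box as a uniform region over
    # [-h, h], whose variance is h^2 / 3 (an illustrative proxy),
    # i.e. E[Var[p | sample]].
    intra_var = (half_extents ** 2 / 3).mean(axis=0)

    # Law of total variance: total = inter-sample + intra-sample, per axis.
    total_var = inter_var + intra_var

    # Crop radius grows with total positional uncertainty.
    return scale * float(np.sqrt(total_var.max()))
```

With tightly clustered, small candidate boxes the radius stays small (a tight zoom); widely scattered centers or large boxes inflate `total_var` and hence the crop, matching the abstract's per-instance adaptive behavior.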