UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Abstract

GUI grounding, which localizes interface elements in screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping a region and re-running inference at higher resolution, but they apply cropping uniformly to every instance with a fixed crop size, ignoring whether the model is actually uncertain in each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and the scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to trigger zoom-in selectively, only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, with gains of up to +13.4%, +10.3%, and +4.2% respectively, and no additional training required.
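The gate and the crop sizing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform-within-box assumption, the product used to fuse consensus with token confidence, the threshold, and the multiplier `k` are all hypothetical choices made for illustration. The only part taken from the abstract is the idea of splitting variance into an inter-sample term (spread of box centers) and an intra-sample term (box extent) and adding them via the law of total variance.

```python
import numpy as np


def _centers(boxes):
    """Centers of (x1, y1, x2, y2) boxes, shape (n, 2)."""
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)


def total_variance(boxes):
    """Per-axis variance of a point obtained by picking one candidate box at
    random and then a point uniformly inside it (an assumed model)."""
    boxes = np.asarray(boxes, dtype=float)
    sizes = np.stack([boxes[:, 2] - boxes[:, 0],
                      boxes[:, 3] - boxes[:, 1]], axis=1)
    # Inter-sample positional spread: variance of the box centers.
    between = _centers(boxes).var(axis=0)
    # Intra-sample box extent: a uniform distribution of width w has
    # variance w^2 / 12, averaged over the candidates.
    within = (sizes ** 2 / 12.0).mean(axis=0)
    # Law of total variance: Var = Var(E[X | box]) + E[Var(X | box)].
    return between + within


def crop_radius(boxes, k=3.0):
    """Per-axis crop radius as k standard deviations (k is hypothetical)."""
    return k * np.sqrt(total_variance(boxes))


def should_zoom(boxes, token_conf, threshold=0.5):
    """Hypothetical gate: zoom in when fused confidence is low.

    Spatial consensus is taken as the fraction of candidate boxes that
    contain the median center; it is fused with token confidence by a
    simple product (an assumption, not the paper's rule).
    """
    boxes = np.asarray(boxes, dtype=float)
    cx, cy = np.median(_centers(boxes), axis=0)
    inside = ((boxes[:, 0] <= cx) & (cx <= boxes[:, 2]) &
              (boxes[:, 1] <= cy) & (cy <= boxes[:, 3]))
    score = inside.mean() * token_conf
    return bool(score < threshold)
```

Under these assumptions, tightly clustered candidates with high token confidence skip the zoom step, while scattered candidates trigger it and receive a larger crop radius, since the center spread adds to the box-extent term.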
