FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Abstract

Vision-Language Models (VLMs) have shown remarkable performance on User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, a screenshot is tokenized into thousands of visual tokens (e.g., about 4700 at 2K resolution), which incurs significant computational overhead and dilutes attention. In contrast, humans typically focus on regions of interest when interacting with a UI. In this work, we pioneer the task of efficient UI grounding. Guided by a practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects the patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. To select distinct, instruction-relevant visual tokens, we construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer severe accuracy degradation on UI grounding tasks because they break positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index, thereby preserving positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even when retaining only 30% of the visual tokens, FocusUI-7B's performance drops by only 3.2%, while inference is up to 1.44x faster and peak GPU memory is 17% lower.
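The PosPad step described in the abstract can be illustrated with a minimal one-dimensional sketch. The abstract states only that each contiguous run of dropped visual tokens collapses into a single special marker placed at the run's last index. Everything else here is an assumption made for illustration: the function name `pos_pad`, the marker string `PAD_MARKER`, the boolean `keep` mask, and the choice to return explicit position indices. How FocusUI feeds these positions into the model's positional encoding is not specified in the abstract and is not modeled.

```python
from typing import List, Sequence, Tuple

# Hypothetical marker name; the abstract calls it only a "special marker".
PAD_MARKER = "<pos_pad>"


def pos_pad(tokens: Sequence[str], keep: Sequence[bool]) -> Tuple[List[str], List[int]]:
    """Collapse each contiguous run of dropped tokens into one marker.

    Kept tokens retain their original index. Each run of dropped tokens is
    replaced by a single PAD_MARKER that takes the index of the run's last
    token, so the surviving position indices stay monotonically increasing.
    """
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep must have the same length")

    out_tokens: List[str] = []
    out_positions: List[int] = []
    last = len(tokens) - 1

    for i, (token, kept) in enumerate(zip(tokens, keep)):
        if kept:
            out_tokens.append(token)
            out_positions.append(i)
        elif i == last or keep[i + 1]:
            # i is the last index of a dropped run: emit one marker here.
            out_tokens.append(PAD_MARKER)
            out_positions.append(i)
        # Otherwise the token is inside a dropped run and is skipped.

    return out_tokens, out_positions


if __name__ == "__main__":
    tokens = list("abcdefgh")
    keep = [True, False, False, True, True, False, False, False]
    print(pos_pad(tokens, keep))
```

In this toy input, the runs at indices 1 to 2 and 5 to 7 each shrink to one marker at positions 2 and 7, so eight tokens become five while the kept tokens `a`, `d`, and `e` stay at their original positions 0, 3, and 4.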
