FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Abstract

Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4,700 at 2K resolution), which incurs significant computational overhead and dilutes attention. In contrast, humans typically focus on regions of interest when interacting with a UI. In this work, we pioneer the task of efficient UI grounding. Guided by a practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects the patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions, so that distinct and instruction-relevant visual tokens are selected. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer severe accuracy degradation on UI grounding tasks because they break positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B's accuracy drops by only 3.2%, while inference is up to 1.44x faster and peak GPU memory is 17% lower.
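The PosPad step described in the abstract can be illustrated with a minimal sketch. It assumes a flattened, one-dimensional token sequence and a precomputed keep/drop mask; the function name `pospad`, the marker string `<pos_pad>`, and the returned position list are hypothetical choices made for illustration and are not taken from the authors' implementation, which may handle two-dimensional patch positions differently.

```python
from typing import List, Sequence, Tuple

POS_PAD = "<pos_pad>"  # hypothetical marker name, chosen for this sketch


def pospad(
    tokens: Sequence[str],
    keep: Sequence[bool],
    marker: str = POS_PAD,
) -> Tuple[List[str], List[int]]:
    """Collapse each contiguous run of dropped tokens into one marker.

    Returns the shortened token list and, for every output element, the
    original index it occupies. A marker takes the index of the last
    token in the run it replaces, so kept tokens keep their original
    positions.
    """
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep must have the same length")

    out_tokens: List[str] = []
    out_positions: List[int] = []
    for i, (tok, kept) in enumerate(zip(tokens, keep)):
        if kept:
            out_tokens.append(tok)
            out_positions.append(i)
            continue
        # A dropped run ends at the last token or just before a kept token.
        run_ends_here = (i + 1 == len(tokens)) or keep[i + 1]
        if run_ends_here:
            out_tokens.append(marker)
            out_positions.append(i)
    return out_tokens, out_positions


if __name__ == "__main__":
    patches = [f"p{i}" for i in range(8)]
    mask = [True, False, False, False, True, True, False, False]
    print(pospad(patches, mask))
```

In this sketch, eight patches with two dropped runs shrink to five elements, and the kept patches p0, p4, and p5 still sit at their original indices 0, 4, and 5. This is the sense in which positional continuity is preserved, in contrast to simply deleting dropped tokens and renumbering the rest.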
