GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Abstract

One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plan. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, an inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models such as Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. To complement this, we further design a grounding verifier that evaluates the proposed candidates and selects the most plausible action region for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for the 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
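The idea of attending from a single <ACTOR> token to visual patch tokens, then letting a verifier pick among the proposed regions, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the scaled dot-product scoring, the top-k selection, and every name here (`actor_attention`, `top_k_regions`, `select_with_verifier`, `verifier_score`) are assumptions made purely for illustration.

```python
import numpy as np

def actor_attention(actor_vec, patch_feats):
    """Softmax attention of one <ACTOR> query over N patch features.

    Assumption: plain scaled dot-product scores; the real action head
    may use learned projections or a different scoring function.
    """
    d = actor_vec.shape[-1]
    scores = patch_feats @ actor_vec / np.sqrt(d)
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def top_k_regions(weights, grid_hw, k=3):
    """Return indices and normalized (x, y) centers of the k patches
    with the highest attention weight on an h-by-w patch grid."""
    h, w = grid_hw
    idx = np.argsort(weights)[::-1][:k]     # descending by weight
    rows, cols = np.divmod(idx, w)          # row-major patch layout
    centers = np.stack([(cols + 0.5) / w, (rows + 0.5) / h], axis=1)
    return idx, centers

def select_with_verifier(candidates, verifier_score):
    """Pick the candidate with the highest verifier score.

    `verifier_score` is a hypothetical callable standing in for the
    grounding verifier described in the abstract.
    """
    scores = [verifier_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example: 6 patches on a 2x3 grid with one-hot features.
patch_feats = np.eye(6)
actor_vec = np.zeros(6)
actor_vec[4] = 4.0                          # strongest match: patch 4
actor_vec[1] = 2.0                          # second match: patch 1

weights = actor_attention(actor_vec, patch_feats)
idx, centers = top_k_regions(weights, grid_hw=(2, 3), k=2)
best = select_with_verifier(centers, lambda c: -abs(c[1] - 0.25))
```

In this toy setup the head proposes patch centers directly from attention weights, so no coordinate text is ever generated; the stand-in verifier then chooses among the proposals using its own criterion.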
