GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Abstract

Graphical User Interface (GUI) element grounding, the task of precisely locating elements on a screenshot from a natural language instruction, is fundamental for agents that interact with GUIs. Running this capability directly on resource-constrained devices such as mobile phones is increasingly important for GUI agents that require low latency. However, current visual grounding methods typically rely on large vision-language models (VLMs) with more than 2.5B parameters, which makes on-device execution impractical because of memory and compute constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves strong visual grounding accuracy, on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales on GUI grounding tasks. In addition, because small VLMs have limited capacity, we develop a Progressive Data Refinement pipeline that uses task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M-sample raw dataset. Training GoClick on this core set brings notable gains in grounding accuracy. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also improves GUI agent performance when integrated into a device-cloud collaboration framework, where it helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.
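
The abstract names two operations in the data refinement pipeline, task type filtering and data ratio adjustment, without giving their details. The following is only a minimal sketch of what those two steps could look like, not the authors' implementation: the function name refine_core_set, the "task" field, the task labels, and the ratio values are all hypothetical and chosen purely for illustration.

```python
import random
from collections import defaultdict


def refine_core_set(samples, keep_tasks, ratios, target_size, seed=0):
    """Illustrative two-step refinement (hypothetical, not the paper's code).

    samples:     list of dicts, each with a "task" key (assumed schema)
    keep_tasks:  set of task types to retain (task type filtering)
    ratios:      dict mapping task type -> desired share of the core set
    target_size: desired number of samples in the core set
    """
    rng = random.Random(seed)

    # Step 1: task type filtering. Drop samples whose task is not retained.
    by_task = defaultdict(list)
    for sample in samples:
        if sample["task"] in keep_tasks:
            by_task[sample["task"]].append(sample)

    # Step 2: data ratio adjustment. Subsample each task to its quota,
    # capped by how many samples of that task are actually available.
    core = []
    for task, share in ratios.items():
        pool = by_task.get(task, [])
        quota = min(len(pool), round(share * target_size))
        core.extend(rng.sample(pool, quota))

    rng.shuffle(core)
    return core


if __name__ == "__main__":
    raw = (
        [{"task": "grounding", "id": i} for i in range(100)]
        + [{"task": "captioning", "id": i} for i in range(100)]
        + [{"task": "qa", "id": i} for i in range(50)]
    )
    core = refine_core_set(
        raw,
        keep_tasks={"grounding", "captioning"},
        ratios={"grounding": 0.75, "captioning": 0.25},
        target_size=40,
    )
    print(len(core))
```

In this sketch the quota for a task is capped by the number of available samples, so the returned set can be smaller than target_size when a task is underrepresented in the raw data.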
