Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Abstract

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges on the robustness of their grounding capability. Current GUI agents predominantly rely on text-based representations such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and perform pixel-level operations directly on the GUI. The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms. We show that a simple recipe, which includes web-based synthetic data and a slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents. Empirical results on six benchmarks spanning three categories (grounding, offline agent, and online agent) show that 1) UGround substantially outperforms existing visual grounding models for GUI agents, by up to 20% absolute, and 2) agents with UGround outperform state-of-the-art agents, even though existing agents use additional text-based input while ours use only visual perception. These results provide strong support for the feasibility and promise of GUI agents that navigate the digital world as humans do.
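As a rough illustration of what mapping a referring expression to a coordinate means in practice, the following Python sketch treats a grounding model as a function from a screenshot and an expression to an (x, y) pixel position, and scores it by whether the predicted point falls inside the target element's bounding box. Everything here is an assumption made for illustration: the `Grounder` signature, the `Box` type, the click-inside-box scoring rule, and the `toy_grounder` lookup table are hypothetical and are not taken from the paper or from UGround's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Point = Tuple[int, int]

# Hypothetical interface: (screenshot bytes, referring expression) -> (x, y) pixel.
Grounder = Callable[[bytes, str], Point]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a GUI element, in pixels (inclusive edges)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    grounder: Grounder, examples: Iterable[Tuple[bytes, str, Box]]
) -> float:
    """Fraction of examples whose predicted point lands inside the target box.

    Assumed scoring rule for this sketch, not a metric quoted from the paper.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1 for screenshot, expression, box in examples
        if box.contains(grounder(screenshot, expression))
    )
    return hits / len(examples)


def toy_grounder(screenshot: bytes, expression: str) -> Point:
    """Stand-in for a trained model: a fixed lookup table, ignoring the image."""
    lookup = {"the search button": (105, 42), "the login link": (300, 10)}
    return lookup.get(expression, (0, 0))


examples = [
    (b"", "the search button", Box(90, 30, 120, 54)),  # prediction falls inside
    (b"", "the login link", Box(400, 0, 460, 20)),     # prediction falls outside
]
accuracy = grounding_accuracy(toy_grounder, examples)
print(accuracy)
```

In a vision-only agent of the kind the abstract describes, a planner would produce the referring expression, a grounder of this shape would supply the coordinate, and the click would be issued at that pixel with no HTML or accessibility tree involved.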
