ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Abstract

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. Most agents today are language-based and rely on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees); they are limited in perceiving UI visuals as humans do, which highlights the need for GUI visual agents. In this work, we develop ShowUI, a vision-language-action model for the digital world, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational cost by formulating a screenshot as a UI-connected graph and adaptively identifying redundant relationships among its nodes, which then serve as the criterion for token selection in the self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies the diverse needs of GUI tasks, enabling effective management of visual-action history in navigation and pairing multi-turn query-action sequences with each screenshot to improve training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant imbalances across data types. With these components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and yields a 1.4x speedup. Navigation experiments across the web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
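To make the idea behind UI-guided token selection more concrete, the following is a minimal sketch and not the authors' implementation. It assumes, for illustration only, that each screenshot patch is summarized by a mean RGB color, that neighbouring patches are linked when their colors match within a tolerance `tol`, and that a fixed fraction `keep_ratio` of tokens is randomly kept per connected component. The function names `build_components` and `select_tokens` are hypothetical.

```python
import math
import random


def build_components(patches, tol=0):
    """Label each patch with the id of its connected component.

    patches: 2D grid (list of rows) of (r, g, b) tuples, one per patch.
    Assumption: two adjacent patches are "redundant" (linked) when every
    color channel differs by at most `tol`.
    """
    rows, cols = len(patches), len(patches[0])
    parent = list(range(rows * cols))  # union-find forest over patch indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest index becomes root

    def similar(p, q):
        return all(abs(x - y) <= tol for x, y in zip(p, q))

    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols and similar(patches[r][c], patches[r][c + 1]):
                union(i, i + 1)      # link to right neighbour
            if r + 1 < rows and similar(patches[r][c], patches[r + 1][c]):
                union(i, i + cols)   # link to bottom neighbour
    return [find(i) for i in range(rows * cols)]


def select_tokens(labels, keep_ratio=0.5, seed=0):
    """Randomly keep a fraction of patch indices within each component.

    Assumption: at least one token per component is always kept, so small
    UI elements (icons, buttons) are never dropped entirely.
    """
    rng = random.Random(seed)
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    kept = []
    for members in groups.values():
        k = max(1, math.ceil(len(members) * keep_ratio))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


# Toy screenshot: a white background with a two-patch blue button.
W, B = (255, 255, 255), (0, 0, 255)
grid = [
    [W, W, W, W],
    [W, B, B, W],
    [W, W, W, W],
]
labels = build_components(grid)
kept = select_tokens(labels, keep_ratio=0.5)
print(f"{len(set(labels))} components, kept {len(kept)} of {len(labels)} tokens")
```

In this toy setting the large uniform background forms one component and is thinned out, while the small button keeps at least one token. Sampling within components rather than merging them keeps the positions of the retained tokens unchanged, which matters for grounding tasks that predict screen coordinates.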
