OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Abstract

Current benchmarks for graphical user interface (GUI) agents rely predominantly on static screenshots. Real-world smartphone interaction, however, routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. At every action step, OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronized audio, and video clips. The dataset contains 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are still at a nascent stage, we select foundational omni-modal models that can natively process interleaved inputs and use them as agent proxies for our initial baselines. Our empirical evaluation shows that current models are competent on visually static tasks, but their action prediction performance degrades significantly in environments that require synchronized temporal and auditory signals. Ablation studies further isolate specific operational bottlenecks, notably cross-modal interference when models process task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io