OmniGUI: A Benchmark for GUI Agents in Omni-Modal Smartphone Environments

Abstract

Current benchmarks for graphical user interface (GUI) agents rely predominantly on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset comprises 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, each systematically annotated with an objective multimodal dependency level. Because dedicated omni-modal GUI agent frameworks are still at a nascent stage, we select foundational omni-modal models that natively process interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation shows that although current models perform competently on visually static tasks, their action prediction performance degrades significantly in environments that require synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.
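
As an illustration of what step-level evaluation with dependency annotations could look like, the following is a minimal sketch and not the released evaluation pipeline. The record layout, all field names, the dependency labels ("visual", "audio"), the action strings, and the exact-match criterion are assumptions made for this example; the abstract states only that each action step carries a static image, synchronous audio, and a video clip, and that dependency levels are annotated.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    """One action step of an episode. All field names are hypothetical."""
    episode_id: str
    app: str
    image_path: str              # static screenshot at the moment of action
    audio_path: Optional[str]    # synchronous audio clip, if present
    video_path: Optional[str]    # video clip leading up to the action, if present
    dependency: str              # hypothetical label such as "visual" or "audio"
    gold_action: str             # expert-demonstrated action, serialized as text


def step_accuracy_by_dependency(
    steps: List[Step], predictions: List[str]
) -> Dict[str, float]:
    """Return, per dependency label, the fraction of exactly matched actions."""
    if len(steps) != len(predictions):
        raise ValueError("steps and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for step, predicted in zip(steps, predictions):
        total[step.dependency] += 1
        if predicted == step.gold_action:
            correct[step.dependency] += 1
    return {label: correct[label] / total[label] for label in total}


if __name__ == "__main__":
    # Toy data only; these are not records from the dataset.
    toy_steps = [
        Step("ep0", "music", "s0.png", "s0.wav", None, "audio", "tap(play)"),
        Step("ep0", "music", "s1.png", None, None, "visual", "tap(next)"),
        Step("ep1", "video", "s2.png", "s2.wav", "s2.mp4", "audio", "tap(pause)"),
    ]
    toy_predictions = ["tap(play)", "tap(next)", "tap(stop)"]
    print(step_accuracy_by_dependency(toy_steps, toy_predictions))
```

Grouping accuracy by dependency label is one way to make the reported contrast between visually static steps and steps that depend on audio or video directly measurable; a real scorer would likely compare structured actions (type, target, arguments) rather than raw strings.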