OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Abstract

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.