OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Abstract

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are still nascent, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models perform competently on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.