OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Abstract

Current benchmarks for graphical user interface (GUI) agents rely predominantly on static screenshots. Real-world smartphone interaction, however, routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled to the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. At every action step, OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronized audio, and video clips. The dataset contains 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are still nascent, we select foundational omni-modal models that can natively process interleaved inputs and use them as agent proxies for our initial baselines. Our empirical evaluation shows that current models are competent on visually static tasks, but their action prediction performance degrades significantly in environments that require synchronized temporal and auditory signals. Ablation studies further isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.