OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
January 28, 2026
Authors: Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang
cs.AI
Abstract
Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model depends on two factors: (1) high-quality data and (2) effective training methods. To address both, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we combine rigorously curated open-source datasets with a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down, taxonomy-guided generation to produce high-fidelity synthetic data. For training, to make full use of this data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) 96.3% accuracy on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching a 74.24% step success rate on ChiM-Nav and a 55.9% average success rate on Ubu-Nav.
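As a rough illustration of the second training stage, the sketch below shows the group-relative advantage normalization that GRPO is built around: rewards for a group of rollouts sampled from the same task prompt are standardized against the group's mean and standard deviation, avoiding the need for a learned value critic. This is a minimal sketch of the generic GRPO formulation; the concrete reward design (here a hypothetical mix of grounding and format rewards) and hyperparameters used for OmegaUse are not specified in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages as in GRPO.

    Each rollout's reward is normalized against the mean and standard
    deviation of its sampling group, so no separate critic is required.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for G = 4 rollouts sampled for the same GUI task,
# e.g. a binary grounding reward plus a partial credit for correct action format.
rollout_rewards = [1.0, 0.0, 1.0, 0.5]
print(group_relative_advantages(rollout_rewards))
# Rollouts above the group mean get positive advantages and are reinforced;
# those below it get negative advantages and are suppressed.
```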