OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
January 28, 2026
Authors: Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang
cs.AI
Abstract
Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
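The abstract names Group Relative Policy Optimization (GRPO) as the second training stage but does not describe its reward design or hyperparameters. As a rough illustration only, the following minimal sketch shows the group-relative advantage step that GRPO is generally built around: several responses are sampled for the same prompt, each is scored, and each score is normalized against its own group's mean and standard deviation. The function name, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions made for illustration; none of them is taken from the OmegaUse report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses sampled from the SAME
             prompt (hypothetical values; the report's reward functions
             are not given in the abstract).
    eps:     small constant (assumed) that avoids division by zero when
             every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled actions for one GUI step:
# two judged correct (reward 1.0) and two judged incorrect (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separately trained value model is needed to produce a baseline. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.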