OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Abstract

Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, and they are expected to transform human-computer interaction and improve human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to make better use of these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems. It consists of ChiM-Nav, which targets Chinese Android mobile environments, and Ubu-Nav, which focuses on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching a 74.24% step success rate on ChiM-Nav and a 55.9% average success rate on Ubu-Nav.
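As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several responses sampled for the same instruction against that group's own mean and standard deviation. It is a minimal sketch under stated assumptions, not OmegaUse's training code: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all hypothetical, and the reward design actually used for grounding and planning is not described here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own sampling group.

    Assumption: the advantage is (reward - group mean) / (group std + eps),
    using the population standard deviation of the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one GUI instruction,
# e.g. 1.0 when a predicted click lands inside the target element, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separately learned value model is needed to provide the baseline.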
