Mobile-Agent-v3: Foundational Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this model, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state of the art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, which enables our Self-Evolving GUI Trajectory Production framework. The framework generates high-quality interaction data via automated query generation and correctness validation, and uses GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, which achieves 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
