Mobile-Agent-v3: Fundamental Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks spanning desktop and mobile environments and covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this model, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that raises performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state of the art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows that supports our Self-Evolving GUI Trajectory Production framework. The framework generates high-quality interaction data through automated query generation and correctness validation, and uses GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can serve as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, which achieves 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
