Mobile-Agent-v3: Fundamental Agents for GUI Automation

August 21, 2025
Authors: Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
cs.AI

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
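
The abstract names Trajectory-aware Relative Policy Optimization (TRPO) but does not spell out its update rule. As a rough illustration only, the sketch below shows one plausible reading of "trajectory-aware relative": each rollout of the same task receives a single scalar reward, rewards are normalized against the other rollouts in the group, and the resulting advantage is shared by every step of that trajectory. The function name, its parameters, and the normalization itself are assumptions made for this example, not the authors' implementation.

```python
from statistics import mean, pstdev


def trajectory_relative_advantages(rewards, steps_per_trajectory, eps=1e-6):
    """Hypothetical helper: group-relative advantages at the trajectory level.

    rewards: one scalar reward per rollout of the same task.
    steps_per_trajectory: number of actions taken in each rollout.
    Returns one list per rollout, holding the same advantage for every step.
    """
    if len(rewards) != len(steps_per_trajectory):
        raise ValueError("need one step count per reward")
    if not rewards:
        return []

    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group

    # Normalize each trajectory reward against the group (assumed scheme).
    per_trajectory = [(r - mu) / (sigma + eps) for r in rewards]

    # Broadcast the trajectory-level advantage to every step of that rollout.
    return [[a] * n for a, n in zip(per_trajectory, steps_per_trajectory)]


if __name__ == "__main__":
    # Four rollouts of one task: two succeed (reward 1.0), two fail (0.0).
    advantages = trajectory_relative_advantages([1.0, 0.0, 0.0, 1.0], [3, 5, 2, 4])
    for adv in advantages:
        print(len(adv), round(adv[0], 3))
```

Sharing one advantage across all steps avoids having to assign credit to individual actions in long GUI episodes where only the final outcome is verifiable. Whether the paper's method does exactly this is not stated in the abstract.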