Mobile-Agent-v3: Foundamental Agents for GUI Automation

August 21, 2025
Authors: Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
cs.AI

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state of the art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, which enables our Self-Evolving GUI Trajectory Production framework. This framework generates high-quality interaction data via automated query generation and correctness validation, and uses GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
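
To make the reported improvement concrete, the short sketch below computes the absolute gain of the Mobile-Agent-v3 framework over GUI-Owl-7B alone, using only the four scores quoted in the abstract. The dictionary layout and the helper name `framework_gain` are illustrative assumptions, not part of the released code. The 34.9 OSWorld result for online RL is left out because the abstract does not say which configuration it belongs to.

```python
# Scores quoted in the abstract: benchmark -> configuration -> score.
reported = {
    "AndroidWorld": {"GUI-Owl-7B": 66.4, "Mobile-Agent-v3": 73.3},
    "OSWorld": {"GUI-Owl-7B": 29.4, "Mobile-Agent-v3": 37.7},
}


def framework_gain(scores):
    """Absolute gain of Mobile-Agent-v3 over GUI-Owl-7B per benchmark.

    Hypothetical helper for illustration; results are rounded to one
    decimal place to match the precision of the reported scores.
    """
    return {
        benchmark: round(s["Mobile-Agent-v3"] - s["GUI-Owl-7B"], 1)
        for benchmark, s in scores.items()
    }


gains = framework_gain(reported)
for benchmark, gain in gains.items():
    print(f"{benchmark}: +{gain} points")
```

Under these numbers the framework adds 6.9 points on AndroidWorld and 8.3 points on OSWorld over the standalone model.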