Mobile-Agent-v3: Fundamental Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
