Mobile-Agent-v3: Foundational Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks spanning desktop and mobile environments. The benchmarks cover grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this model, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that raises these scores to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state of the art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, which enables our Self-Evolving GUI Trajectory Production framework. The framework generates high-quality interaction data through automated query generation and correctness validation, and uses GUI-Owl itself to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, which achieves 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
