Mobile-Agent-v3: Fundamental Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
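The self-improving data loop described in the abstract (generate queries, roll out trajectories, validate correctness, then use the validated data to improve the model) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the names `Trajectory`, `production_round`, and `run_loop`, and the callables `make_rollout`, `is_correct`, and `train`, are hypothetical placeholders and not the authors' actual framework or API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Trajectory:
    """A hypothetical record of one agent rollout for a single query."""
    query: str
    actions: List[str]


def production_round(
    queries: List[str],
    rollout: Callable[[str], Trajectory],
    is_correct: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Roll out every query and keep only trajectories that pass validation."""
    kept = []
    for query in queries:
        trajectory = rollout(query)
        if is_correct(trajectory):
            kept.append(trajectory)
    return kept


def run_loop(
    queries: List[str],
    make_rollout: Callable[[Any], Callable[[str], Trajectory]],
    is_correct: Callable[[Trajectory], bool],
    train: Callable[[Any, List[Trajectory]], Any],
    model: Any,
    rounds: int,
) -> Tuple[Any, List[Trajectory]]:
    """Alternate data production and training for a fixed number of rounds.

    Each round uses the current model to produce trajectories, filters them,
    and retrains on everything collected so far (an assumed, simplified policy).
    """
    dataset: List[Trajectory] = []
    for _ in range(rounds):
        dataset.extend(production_round(queries, make_rollout(model), is_correct))
        model = train(model, dataset)
    return model, dataset
```

In this sketch the model is opaque: `make_rollout(model)` is assumed to return a function that executes a query in some environment, and `train` is assumed to return an updated model. The only point illustrated is the feedback structure, where a better model yields more validated trajectories, which in turn feed the next training step.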
