Mobile-Agent-v3: Foundational Agents for GUI Automation

Abstract

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this model, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that raises performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state of the art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale environment infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, which enables our Self-Evolving GUI Trajectory Production framework. The framework generates high-quality interaction data through automated query generation and correctness validation, and uses GUI-Owl itself to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse foundational agent capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can also serve as a modular component in multi-agent systems. (3) Scalable environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, which achieves 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.

