Mano Report

Abstract

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
