Mano Report

Abstract

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multimodal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
