Mano Report
September 22, 2025
Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
cs.AI
Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decision-making capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
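The three-stage pipeline and verification module described above can be sketched schematically. This is a minimal toy illustration of the control flow only, not the paper's implementation; all class and function names (`Policy`, `ToyEnv`, `online_rl`, etc.) are hypothetical, and the real system operates on a VLM policy over GUI observations rather than toy strings.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Toy stand-in for the VLM-based GUI agent policy."""
    stages: list = field(default_factory=list)

class ToyEnv:
    """Toy stand-in for the simulated GUI environment."""
    def __init__(self, horizon=2):
        self.horizon = horizon
    def reset(self):
        self.t = 0
        return "start"
    def step(self, action):
        self.t += 1
        return f"state_{self.t}", self.t >= self.horizon  # (next state, done)

def supervised_fine_tune(policy, demonstrations):
    # Stage 1: imitate expert trajectories generated in the simulator.
    policy.stages.append("sft")
    return policy

def offline_rl(policy, logged_trajectories):
    # Stage 2: learn from logged (state, action, reward) data
    # without interacting with the environment.
    policy.stages.append("offline_rl")
    return policy

def online_rl(policy, env, verifier, episodes=3):
    # Stage 3: interact with the live environment; the verifier checks
    # each proposed action and triggers error recovery on failure.
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = "click"  # placeholder for a policy-chosen GUI action
            if not verifier(state, action):
                action = "recover"  # fall back to an error-recovery action
            state, done = env.step(action)
    policy.stages.append("online_rl")
    return policy

policy = Policy()
policy = supervised_fine_tune(policy, demonstrations=["demo"])
policy = offline_rl(policy, logged_trajectories=["traj"])
policy = online_rl(policy, ToyEnv(), verifier=lambda s, a: a == "click")
print(policy.stages)
```

The staged ordering matters: each stage starts from the checkpoint of the previous one, which is why the sketch threads a single `policy` object through all three calls.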