
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

December 26, 2025
Authors: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
cs.AI

Abstract

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and performing competitively with Gemini-3-Pro-based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and from increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.