MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
December 26, 2025
Authors: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
cs.AI
Abstract
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and from increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.