ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Abstract

GUI agents drive applications through their visual interfaces rather than programmatic APIs. By interacting with arbitrary software via taps, swipes, and keystrokes, they reach a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework that addresses these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving a 95.8% reproduction rate against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms, with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves a 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.