ZeroGUI: Automating Online GUI Learning at Zero Human Cost
May 29, 2025
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
cs.AI
Abstract
The rapid advancement of large Vision-Language Models (VLMs) has propelled
the development of pure-vision-based GUI Agents, capable of perceiving and
operating Graphical User Interfaces (GUIs) to autonomously fulfill user
instructions. However, existing approaches usually adopt an offline learning
framework, which faces two core limitations: (1) heavy reliance on high-quality
manual annotations for element grounding and action supervision, and (2)
limited adaptability to dynamic and interactive environments. To address these
limitations, we propose ZeroGUI, a scalable, online learning framework for
automating GUI Agent training at zero human cost. Specifically, ZeroGUI
integrates (i) VLM-based automatic task generation to produce diverse training
goals from the current environment state, (ii) VLM-based automatic reward
estimation to assess task success without hand-crafted evaluation functions,
and (iii) two-stage online reinforcement learning to continuously interact with
and learn from GUI environments. Experiments on two advanced GUI Agents
(UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance
across OSWorld and AndroidLab environments. The code is available at
https://github.com/OpenGVLab/ZeroGUI.
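The three components in the abstract form a closed online loop: the VLM proposes tasks from the current screen, the agent rolls them out, the VLM judges success as the reward, and the policy is updated. A minimal toy sketch of that loop is below; all class and method names (`ToyEnv`, `ToyVLM`, `propose_tasks`, `judge_success`, etc.) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for the pieces described in the abstract.
# Real implementations would call a VLM and a GUI environment (e.g. OSWorld).

@dataclass
class Trajectory:
    task: str
    actions: List[str]
    reward: float = 0.0

class ToyEnv:
    def screenshot(self):
        # In practice: a real screenshot of the GUI environment.
        return "desktop_state"

class ToyVLM:
    def propose_tasks(self, state):
        # (i) Automatic task generation from the current environment state.
        return [f"open app from {state}", f"close window on {state}"]

    def judge_success(self, task, traj):
        # (ii) Automatic reward estimation: a binary success judgment,
        # replacing hand-crafted evaluation functions.
        return 1.0 if traj.actions else 0.0

class ToyAgent:
    def __init__(self):
        self.num_updates = 0

    def rollout(self, env, task):
        # In practice: the VLM agent perceives screenshots and emits GUI actions.
        return Trajectory(task=task, actions=["click", "type"])

    def update(self, trajectories):
        # (iii) Online RL update from the collected, VLM-rewarded rollouts.
        self.num_updates += 1

def online_training_loop(agent, env, vlm, rounds=2, rollouts_per_task=2):
    """Closed loop: generate tasks -> roll out -> estimate reward -> update."""
    all_trajs = []
    for _ in range(rounds):
        tasks = vlm.propose_tasks(env.screenshot())
        batch = []
        for task in tasks:
            for _ in range(rollouts_per_task):
                traj = agent.rollout(env, task)
                traj.reward = vlm.judge_success(task, traj)
                batch.append(traj)
        agent.update(batch)
        all_trajs.extend(batch)
    return all_trajs
```

The point of the sketch is the data flow, not the components: no human annotates tasks or rewards anywhere in the loop, which is what "zero human cost" refers to.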