ZeroGUI: Automating Online GUI Learning at Zero Human Cost
May 29, 2025
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
cs.AI
Abstract
The rapid advancement of large Vision-Language Models (VLMs) has propelled
the development of pure-vision-based GUI Agents, capable of perceiving and
operating Graphical User Interfaces (GUIs) to autonomously fulfill user
instructions. However, existing approaches usually adopt an offline learning
framework, which faces two core limitations: (1) heavy reliance on high-quality
manual annotations for element grounding and action supervision, and (2)
limited adaptability to dynamic and interactive environments. To address these
limitations, we propose ZeroGUI, a scalable, online learning framework for
automating GUI Agent training at zero human cost. Specifically, ZeroGUI
integrates (i) VLM-based automatic task generation to produce diverse training
goals from the current environment state, (ii) VLM-based automatic reward
estimation to assess task success without hand-crafted evaluation functions,
and (iii) two-stage online reinforcement learning to continuously interact with
and learn from GUI environments. Experiments on two advanced GUI Agents
(UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance
across OSWorld and AndroidLab environments. The code is available at
https://github.com/OpenGVLab/ZeroGUI.
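The three components in the abstract form a closed online loop: the VLM proposes tasks from the current screen, the agent rolls them out, the VLM judges success as the reward, and the policy is updated. A minimal toy sketch of that loop is below; all class and method names (`ToyEnv`, `ToyVLM`, `propose_tasks`, `judge_success`, etc.) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for the pieces described in the abstract.
# Real implementations would call a VLM and a GUI environment (e.g. OSWorld).

@dataclass
class Trajectory:
    task: str
    actions: List[str]
    reward: float = 0.0

class ToyEnv:
    def screenshot(self):
        # In practice: a real screenshot of the GUI environment.
        return "desktop_state"

class ToyVLM:
    def propose_tasks(self, state):
        # (i) Automatic task generation from the current environment state.
        return [f"open app from {state}", f"close window on {state}"]

    def judge_success(self, task, traj):
        # (ii) Automatic reward estimation: a binary success judgment,
        # replacing hand-crafted evaluation functions.
        return 1.0 if traj.actions else 0.0

class ToyAgent:
    def __init__(self):
        self.num_updates = 0

    def rollout(self, env, task):
        # In practice: the VLM agent perceives screenshots and emits GUI actions.
        return Trajectory(task=task, actions=["click", "type"])

    def update(self, trajectories):
        # (iii) Online RL update from the collected, VLM-rewarded rollouts.
        self.num_updates += 1

def online_training_loop(agent, env, vlm, rounds=2, rollouts_per_task=2):
    """Closed loop: generate tasks -> roll out -> estimate reward -> update."""
    all_trajs = []
    for _ in range(rounds):
        tasks = vlm.propose_tasks(env.screenshot())
        batch = []
        for task in tasks:
            for _ in range(rollouts_per_task):
                traj = agent.rollout(env, task)
                traj.reward = vlm.judge_success(task, traj)
                batch.append(traj)
        agent.update(batch)
        all_trajs.extend(batch)
    return all_trajs
```

The point of the sketch is the data flow, not the components: no human annotates tasks or rewards anywhere in the loop, which is what "zero human cost" refers to.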