ZeroGUI: Automating Online GUI Learning at Zero Human Cost
May 29, 2025
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
cs.AI
Abstract
The rapid advancement of large Vision-Language Models (VLMs) has propelled
the development of pure-vision-based GUI Agents, capable of perceiving and
operating Graphical User Interfaces (GUIs) to autonomously fulfill user
instructions. However, existing approaches usually adopt an offline learning
framework, which faces two core limitations: (1) heavy reliance on high-quality
manual annotations for element grounding and action supervision, and (2)
limited adaptability to dynamic and interactive environments. To address these
limitations, we propose ZeroGUI, a scalable online learning framework for
automating GUI Agent training at zero human cost. Specifically, ZeroGUI
integrates (i) VLM-based automatic task generation to produce diverse training
goals from the current environment state, (ii) VLM-based automatic reward
estimation to assess task success without hand-crafted evaluation functions,
and (iii) two-stage online reinforcement learning to continuously interact with
and learn from GUI environments. Experiments on two advanced GUI Agents
(UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance
across OSWorld and AndroidLab environments. The code is available at
https://github.com/OpenGVLab/ZeroGUI.
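
To make the three components concrete, below is a minimal, self-contained Python sketch of the online loop the abstract describes: (i) VLM-based task generation from the current environment state, (ii) VLM-based reward estimation in place of hand-crafted evaluators, and (iii) an online RL update on reward-labeled rollouts. All names (generate_tasks, estimate_reward, rl_update) and the stubbed logic are illustrative assumptions, not the actual OpenGVLab/ZeroGUI API; in the real system, stages (i) and (ii) are VLM calls on screenshots, and stage (iii) is a two-stage reinforcement-learning procedure whose details are not given in the abstract.

```python
# Minimal sketch of a ZeroGUI-style online training loop (illustrative only;
# the real OpenGVLab/ZeroGUI implementation differs).
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

def generate_tasks(screenshot: str, n: int = 4) -> list[str]:
    """Stage (i): a VLM proposes diverse tasks from the current GUI state.
    Stubbed here with canned tasks instead of a real VLM call."""
    pool = ["open the settings panel", "rename the active file",
            "toggle dark mode", "save the document as PDF"]
    return random.sample(pool, k=n)

def estimate_reward(task: str, traj: Trajectory) -> float:
    """Stage (ii): a VLM judges task success from the rollout, replacing
    hand-crafted evaluation functions. Stubbed as a random verdict."""
    return float(random.random() > 0.5)

def rollout(policy, task: str, max_steps: int = 8) -> Trajectory:
    """Run the GUI agent in the environment; actions are mocked strings."""
    traj = Trajectory(task=task)
    for _ in range(max_steps):
        traj.actions.append(policy(task))
    return traj

def rl_update(policy, trajs: list[Trajectory]) -> None:
    """Placeholder for the policy update on reward-labeled trajectories."""
    pass

def train(policy, iterations: int = 2) -> None:
    """Stage (iii): online RL that keeps interacting with the environment.
    The paper's two-stage schedule is collapsed into one generic loop here."""
    for it in range(iterations):
        screenshot = f"screen_{it}.png"  # stand-in for the current GUI state
        for task in generate_tasks(screenshot):
            traj = rollout(policy, task)
            traj.reward = estimate_reward(task, traj)  # VLM-derived reward
            rl_update(policy, [traj])

if __name__ == "__main__":
    train(policy=lambda task: "CLICK(x, y)")
```

The point the sketch highlights is that no human annotation enters the loop: both the training goals and the success signals come from VLM calls (stubbed above), which is what lets the agent keep learning online as the GUI environment changes.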