ZeroGUI: Automating Online GUI Learning at Zero Human Cost
May 29, 2025
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
cs.AI
Abstract
The rapid advancement of large Vision-Language Models (VLMs) has propelled
the development of pure-vision-based GUI Agents, capable of perceiving and
operating Graphical User Interfaces (GUIs) to autonomously fulfill user
instructions. However, existing approaches usually adopt an offline learning
framework, which faces two core limitations: (1) heavy reliance on high-quality
manual annotations for element grounding and action supervision, and (2)
limited adaptability to dynamic and interactive environments. To address these
limitations, we propose ZeroGUI, a scalable online learning framework for
automating GUI Agent training at zero human cost. Specifically, ZeroGUI
integrates (i) VLM-based automatic task generation to produce diverse training
goals from the current environment state, (ii) VLM-based automatic reward
estimation to assess task success without hand-crafted evaluation functions,
and (iii) two-stage online reinforcement learning to continuously interact with
and learn from GUI environments. Experiments on two advanced GUI Agents
(UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance
across OSWorld and AndroidLab environments. The code is available at
https://github.com/OpenGVLab/ZeroGUI.
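
To make the three components concrete, below is a minimal, self-contained Python sketch of the online loop the abstract describes: (i) VLM-based task generation from the current environment state, (ii) VLM-based reward estimation in place of hand-crafted evaluators, and (iii) an online RL update on reward-labeled rollouts. All names (generate_tasks, estimate_reward, rl_update) and the stubbed logic are illustrative assumptions, not the actual OpenGVLab/ZeroGUI API; in the real system, stages (i) and (ii) are VLM calls on screenshots, and stage (iii) is a two-stage reinforcement-learning procedure whose details are not given in the abstract.

```python
# Minimal sketch of a ZeroGUI-style online training loop (illustrative only;
# the real OpenGVLab/ZeroGUI implementation differs).
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

def generate_tasks(screenshot: str, n: int = 4) -> list[str]:
    """Stage (i): a VLM proposes diverse tasks from the current GUI state.
    Stubbed here with canned tasks instead of a real VLM call."""
    pool = ["open the settings panel", "rename the active file",
            "toggle dark mode", "save the document as PDF"]
    return random.sample(pool, k=n)

def estimate_reward(task: str, traj: Trajectory) -> float:
    """Stage (ii): a VLM judges task success from the rollout, replacing
    hand-crafted evaluation functions. Stubbed as a random verdict."""
    return float(random.random() > 0.5)

def rollout(policy, task: str, max_steps: int = 8) -> Trajectory:
    """Run the GUI agent in the environment; actions are mocked strings."""
    traj = Trajectory(task=task)
    for _ in range(max_steps):
        traj.actions.append(policy(task))
    return traj

def rl_update(policy, trajs: list[Trajectory]) -> None:
    """Placeholder for the policy update on reward-labeled trajectories."""
    pass

def train(policy, iterations: int = 2) -> None:
    """Stage (iii): online RL that keeps interacting with the environment.
    The paper's two-stage schedule is collapsed into one generic loop here."""
    for it in range(iterations):
        screenshot = f"screen_{it}.png"  # stand-in for the current GUI state
        for task in generate_tasks(screenshot):
            traj = rollout(policy, task)
            traj.reward = estimate_reward(task, traj)  # VLM-derived reward
            rl_update(policy, [traj])

if __name__ == "__main__":
    train(policy=lambda task: "CLICK(x, y)")
```

The point the sketch highlights is that no human annotation enters the loop: both the training goals and the success signals come from VLM calls (stubbed above), which is what lets the agent keep learning online as the GUI environment changes.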