Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

September 28, 2025
Authors: Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
cs.AI

Abstract

Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement the sparse successes of online sampling; (2) dynamically adjusting the number of rollouts and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; and (4) stabilizing learning via truncated importance sampling to address the policy mismatch between rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community for agentic RL training.
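To make the decoupling idea concrete, the sketch below is a minimal, illustrative asyncio pipeline, not the paper's implementation: the four modules named in the abstract (environment cluster, rollout service, data manager, trainer) run as independent tasks that exchange data only through queues, so a slow GUI environment never blocks the trainer and vice versa. All concrete details here (the toy action string, the batch size of 2, the queue wiring) are invented for illustration.

```python
import asyncio

async def environment_cluster(task_q, obs_q):
    # Many GUI environments would live here; each emits observations for pending tasks.
    while True:
        task = await task_q.get()
        await obs_q.put({"task": task, "obs": f"screenshot<{task}>"})

async def rollout_service(obs_q, traj_q):
    # Runs the current (possibly slightly stale) policy to act in the environment.
    while True:
        step = await obs_q.get()
        await traj_q.put({"task": step["task"], "action": "click(120, 48)"})

async def data_manager(traj_q, batch_q, batch_size=2):
    # Curates rollout data (filtering, difficulty-aware sampling) into training batches.
    batch = []
    while True:
        batch.append(await traj_q.get())
        if len(batch) == batch_size:
            await batch_q.put(batch)
            batch = []

async def trainer(batch_q):
    # Consumes curated batches; in the real system, updated weights would be
    # synchronized back to rollout workers asynchronously, per worker.
    while True:
        batch = await batch_q.get()
        print(f"policy update on {len(batch)} trajectories")

async def main():
    task_q, obs_q, traj_q, batch_q = (asyncio.Queue() for _ in range(4))
    for t in ["open_settings", "rename_file", "send_email", "resize_window"]:
        task_q.put_nowait(t)
    workers = [
        asyncio.create_task(environment_cluster(task_q, obs_q)),
        asyncio.create_task(rollout_service(obs_q, traj_q)),
        asyncio.create_task(data_manager(traj_q, batch_q)),
        asyncio.create_task(trainer(batch_q)),
    ]
    await asyncio.sleep(0.1)  # let the toy pipeline run briefly
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

if __name__ == "__main__":
    asyncio.run(main())
```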
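The abstract also mentions truncated importance sampling to handle the mismatch between the rollout policy that generated the trajectories and the policy being updated. Below is a minimal sketch of how such a correction is commonly applied to a policy-gradient loss; DART's exact formulation (truncation threshold, surrogate objective) is not given in the abstract, so the function name, the REINFORCE-style surrogate, and the default `clip_c=2.0` are assumptions.

```python
import torch

def truncated_is_policy_loss(logp_new, logp_old, advantages, clip_c=2.0):
    """Policy-gradient loss with truncated importance sampling.

    logp_new:   log-probs of the taken actions under the current policy
    logp_old:   log-probs of the same actions under the rollout policy
    advantages: per-step advantage estimates
    clip_c:     truncation threshold for the importance ratio (assumed value)
    """
    ratio = torch.exp(logp_new - logp_old)      # importance weight pi_new / pi_old
    truncated = torch.clamp(ratio, max=clip_c)  # cap large weights to bound gradients
    return -(truncated * advantages).mean()     # REINFORCE-style surrogate loss
```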