Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

September 28, 2025
Authors: Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
cs.AI

Abstract

Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement the sparse successes of online sampling; (2) dynamically adjusting the number of rollouts and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; and (4) stabilizing learning via truncated importance sampling to address the policy mismatch between rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community for agentic RL training.
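To make the decoupling idea concrete, the sketch below is a minimal, illustrative asyncio pipeline, not the paper's implementation: the four modules named in the abstract (environment cluster, rollout service, data manager, trainer) run as independent tasks that exchange data only through queues, so a slow GUI environment never blocks the trainer and vice versa. All concrete details here (the toy action string, the batch size of 2, the queue wiring) are invented for illustration.

```python
import asyncio

async def environment_cluster(task_q, obs_q):
    # Many GUI environments would live here; each emits observations for pending tasks.
    while True:
        task = await task_q.get()
        await obs_q.put({"task": task, "obs": f"screenshot<{task}>"})

async def rollout_service(obs_q, traj_q):
    # Runs the current (possibly slightly stale) policy to act in the environment.
    while True:
        step = await obs_q.get()
        await traj_q.put({"task": step["task"], "action": "click(120, 48)"})

async def data_manager(traj_q, batch_q, batch_size=2):
    # Curates rollout data (filtering, difficulty-aware sampling) into training batches.
    batch = []
    while True:
        batch.append(await traj_q.get())
        if len(batch) == batch_size:
            await batch_q.put(batch)
            batch = []

async def trainer(batch_q):
    # Consumes curated batches; in the real system, updated weights would be
    # synchronized back to rollout workers asynchronously, per worker.
    while True:
        batch = await batch_q.get()
        print(f"policy update on {len(batch)} trajectories")

async def main():
    task_q, obs_q, traj_q, batch_q = (asyncio.Queue() for _ in range(4))
    for t in ["open_settings", "rename_file", "send_email", "resize_window"]:
        task_q.put_nowait(t)
    workers = [
        asyncio.create_task(environment_cluster(task_q, obs_q)),
        asyncio.create_task(rollout_service(obs_q, traj_q)),
        asyncio.create_task(data_manager(traj_q, batch_q)),
        asyncio.create_task(trainer(batch_q)),
    ]
    await asyncio.sleep(0.1)  # let the toy pipeline run briefly
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

if __name__ == "__main__":
    asyncio.run(main())
```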
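The abstract also mentions truncated importance sampling to handle the mismatch between the rollout policy that generated the trajectories and the policy being updated. Below is a minimal sketch of how such a correction is commonly applied to a policy-gradient loss; DART's exact formulation (truncation threshold, surrogate objective) is not given in the abstract, so the function name, the REINFORCE-style surrogate, and the default `clip_c=2.0` are assumptions.

```python
import torch

def truncated_is_policy_loss(logp_new, logp_old, advantages, clip_c=2.0):
    """Policy-gradient loss with truncated importance sampling.

    logp_new:   log-probs of the taken actions under the current policy
    logp_old:   log-probs of the same actions under the rollout policy
    advantages: per-step advantage estimates
    clip_c:     truncation threshold for the importance ratio (assumed value)
    """
    ratio = torch.exp(logp_new - logp_old)      # importance weight pi_new / pi_old
    truncated = torch.clamp(ratio, max=clip_c)  # cap large weights to bound gradients
    return -(truncated * advantages).mean()     # REINFORCE-style surrogate loss
```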