
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

September 28, 2025
Authors: Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
cs.AI

Abstract

Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but applying reinforcement learning (RL) to them faces significant challenges: (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× rollout GPU utilization, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement the sparse successes from online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling to mitigate the policy mismatch between rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% above the open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
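The abstract mentions truncated importance sampling to correct for the mismatch between the (stale) rollout policy and the policy being updated. Below is a minimal, hypothetical sketch of that general idea in PyTorch: a per-sample policy-gradient loss whose importance ratio is capped at a threshold. The function name, the `clip_c` threshold, and the exact form of the objective are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def truncated_is_pg_loss(logp_new: torch.Tensor,
                         logp_rollout: torch.Tensor,
                         advantages: torch.Tensor,
                         clip_c: float = 1.0) -> torch.Tensor:
    """Policy-gradient loss with truncated importance sampling (sketch).

    logp_new:     log-probabilities of the sampled actions under the current policy
    logp_rollout: log-probabilities under the (possibly stale) rollout policy
    advantages:   advantage estimates for the sampled actions
    clip_c:       cap on the importance ratio, limiting variance from off-policy rollouts
    """
    ratio = torch.exp(logp_new - logp_rollout)      # pi_theta(a|s) / pi_rollout(a|s)
    truncated_ratio = torch.clamp(ratio, max=clip_c)  # truncate large ratios
    return -(truncated_ratio * advantages).mean()
```

In an asynchronous setup like the one described, rollouts are generated by slightly older model weights, so the ratio can drift far from 1; capping it keeps gradient variance bounded at the cost of some bias.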