
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

September 28, 2025
Authors: Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
cs.AI

Abstract

Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but applying reinforcement learning (RL) to them faces significant challenges: (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× rollout GPU utilization, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement the sparse successes from online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling to mitigate the policy mismatch between rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% above the open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
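The abstract mentions truncated importance sampling to correct for the mismatch between the (stale) rollout policy and the policy being updated. Below is a minimal, hypothetical sketch of that general idea in PyTorch: a per-sample policy-gradient loss whose importance ratio is capped at a threshold. The function name, the `clip_c` threshold, and the exact form of the objective are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def truncated_is_pg_loss(logp_new: torch.Tensor,
                         logp_rollout: torch.Tensor,
                         advantages: torch.Tensor,
                         clip_c: float = 1.0) -> torch.Tensor:
    """Policy-gradient loss with truncated importance sampling (sketch).

    logp_new:     log-probabilities of the sampled actions under the current policy
    logp_rollout: log-probabilities under the (possibly stale) rollout policy
    advantages:   advantage estimates for the sampled actions
    clip_c:       cap on the importance ratio, limiting variance from off-policy rollouts
    """
    ratio = torch.exp(logp_new - logp_rollout)      # pi_theta(a|s) / pi_rollout(a|s)
    truncated_ratio = torch.clamp(ratio, max=clip_c)  # truncate large ratios
    return -(truncated_ratio * advantages).mean()
```

In an asynchronous setup like the one described, rollouts are generated by slightly older model weights, so the ratio can drift far from 1; capping it keeps gradient variance bounded at the cost of some bias.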