Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

Abstract

Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but applying reinforcement learning (RL) to them faces two significant challenges: (1) slow multi-turn interactions with GUI environments, which make policy rollout inefficient, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving system efficiency: 1.6× GPU utilization for rollout, 1.9× training throughput, and 5.5× environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse successes in online sampling; (2) dynamically adjusting the number of rollouts and the trajectory length based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning with truncated importance sampling to correct the mismatch between the policy used for rollout and the policy being updated. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
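To make curation items (3) and (4) concrete, the following is a minimal sketch of how high-entropy step selection and truncated importance sampling could be combined in a per-trajectory policy-gradient loss. It is not the DART implementation: the function names, the `keep_fraction` and `cap` parameters and their default values, and the dictionary layout of a step are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_high_entropy_steps(step_entropies, keep_fraction=0.8):
    """Return indices of the highest-entropy steps, in their original order.

    keep_fraction is a hypothetical knob; the abstract gives no value.
    """
    if not step_entropies:
        return []
    k = max(1, math.ceil(keep_fraction * len(step_entropies)))
    ranked = sorted(
        range(len(step_entropies)),
        key=lambda i: step_entropies[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def truncated_is_weight(logp_trainer, logp_rollout, cap=1.0):
    """Importance ratio pi_trainer / pi_rollout, truncated from above at cap.

    cap is a hypothetical threshold chosen for this sketch.
    """
    return min(math.exp(logp_trainer - logp_rollout), cap)


def curated_pg_loss(steps, keep_fraction=0.8, cap=1.0):
    """Average policy-gradient loss over the selected high-entropy steps.

    Each step is a dict with keys 'entropy', 'logp_trainer', 'logp_rollout',
    and 'advantage' (an assumed layout). In an autodiff framework the weight
    would typically be treated as a constant with respect to the gradient.
    """
    kept = select_high_entropy_steps([s["entropy"] for s in steps], keep_fraction)
    if not kept:
        return 0.0
    total = 0.0
    for i in kept:
        s = steps[i]
        w = truncated_is_weight(s["logp_trainer"], s["logp_rollout"], cap)
        total += -w * s["advantage"] * s["logp_trainer"]
    return total / len(kept)
```

Truncating the ratio from above bounds the influence of steps where the rollout policy assigned much lower probability than the trainer does, which is one common way to limit variance when rollout and update policies drift apart under asynchronous training.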
