CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. Existing resources satisfy only part of this: hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An Orchestrator agent drives the two through iterative rounds grounded in execution. Generated tuples then pass a final filter that combines LLM majority voting with agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, respectively, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
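
To make the co-generation loop concrete, the following is a minimal Python sketch of how a Generator, a Discriminator, and an Orchestrator could interact. It is an illustration under stated assumptions, not the paper's implementation: the names `TaskTuple`, `execution_check`, `orchestrate`, and `final_filter` are hypothetical, environment states are reduced to plain dictionaries, and the acceptance rules are assumptions (a reward function should reject the initial state and accept the golden state; the final filter requires a judge majority and at least one successful rollout).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]            # toy stand-in for an environment state
Reward = Callable[[State], bool]  # deterministic checker: True means "task done"


@dataclass
class TaskTuple:
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: Reward


def execution_check(t: TaskTuple) -> bool:
    """Assumed consistency rule: when executed, the reward must reject the
    initial state and accept the golden state."""
    return (not t.reward_fn(t.initial_state)) and t.reward_fn(t.golden_state)


def orchestrate(
    instruction: str,
    generate: Callable[[str, str], Tuple[State, State]],
    discriminate: Callable[[str, str], Reward],
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Drive the two agents for up to max_rounds, passing execution feedback."""
    feedback = ""
    for _ in range(max_rounds):
        initial, golden = generate(instruction, feedback)
        reward_fn = discriminate(instruction, feedback)
        candidate = TaskTuple(instruction, initial, golden, reward_fn)
        if execution_check(candidate):
            return candidate
        feedback = "reward and states disagreed on execution"
    return None  # no consistent tuple found within the round budget


def final_filter(judge_votes: List[bool], rollout_successes: List[bool]) -> bool:
    """Assumed filter: strict majority of judge votes AND at least one
    agent rollout that the reward function marks as successful."""
    majority = sum(judge_votes) * 2 > len(judge_votes)
    return majority and any(rollout_successes)


# Toy agents (hypothetical): the generator's first attempt forgets to apply
# the change, and the retry after feedback fixes it.
def toy_generate(instruction: str, feedback: str) -> Tuple[State, State]:
    initial = {"title": "draft"}
    golden = {"title": "final"} if feedback else {"title": "draft"}
    return initial, golden


def toy_discriminate(instruction: str, feedback: str) -> Reward:
    return lambda state: state.get("title") == "final"


result = orchestrate("Rename the document to 'final'", toy_generate, toy_discriminate)
```

In this toy run the first round fails the execution check because the golden state equals the initial state, and the second round succeeds once feedback is passed back. Keeping the reward author separate from the state author means a shared misunderstanding of the task is less likely to go unnoticed, which is the intuition behind the adversarial pairing described above.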