CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. Existing sources satisfy only part of this: hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An Orchestrator agent drives the two through iterative rounds of execution. Generated tuples then pass a final filter that combines LLM majority voting with agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
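
The Generator/Discriminator loop and the final filter described above can be pictured with a toy sketch. This is not the released pipeline: the names `orchestrate`, `execution_check`, and `final_filter`, the dictionary-based states, the acceptance rule (the reward must fail on the initial state and pass on the golden state), and the rollout criterion (at least one successful rollout) are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

State = dict                       # toy stand-in for an environment state
Reward = Callable[[State], bool]   # deterministic, executable reward check


@dataclass
class TaskTuple:
    instruction: str
    initial: State
    golden: State
    reward: Reward


def execution_check(initial: State, golden: State, reward: Reward) -> Optional[str]:
    """Assumed acceptance rule: return None if the reward separates the
    two states, otherwise a feedback message for the next round."""
    if reward(initial):
        return "reward already passes on the initial state"
    if not reward(golden):
        return "reward rejects the golden state"
    return None


def orchestrate(instruction, generator, discriminator, max_rounds=3):
    """Hypothetical orchestrator: rerun both agents with execution
    feedback until the tuple is consistent or the round budget is spent."""
    feedback = None
    for _ in range(max_rounds):
        initial, golden = generator(instruction, feedback)
        reward = discriminator(instruction, feedback)
        feedback = execution_check(initial, golden, reward)
        if feedback is None:
            return TaskTuple(instruction, initial, golden, reward)
    return None  # discard tasks that never converge


def final_filter(votes: Sequence[bool], rollout_successes: Sequence[bool]) -> bool:
    """Assumed filter: strict majority of LLM votes, plus at least one
    successful agent rollout (the real criterion is not given)."""
    return sum(votes) * 2 > len(votes) and any(rollout_successes)


# Toy agents: the generator's first golden state is wrong and is only
# corrected once it receives feedback from the execution check.
def toy_generator(instruction, feedback):
    initial = {"cart": []}
    golden = {"cart": ["pen"]} if feedback else {"cart": []}
    return initial, golden


def toy_discriminator(instruction, feedback):
    return lambda state: "pen" in state["cart"]


result = orchestrate("Add a pen to the cart", toy_generator, toy_discriminator)
```

In this toy run the first round fails because the reward rejects the generator's golden state, and the second round succeeds after feedback. The point of the sketch is only the separation of roles: the reward is written from the task specification rather than from the generated states, so a disagreement surfaces as an execution failure instead of passing silently.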