CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instructions, executable environments, and verifiable rewards. Existing sources fall short on one side or the other: hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An Orchestrator agent drives the two through iterative rounds during execution. Generated tuples then pass a final filter that combines LLM majority voting with agent rollouts, ensuring quality beyond what the per-task adversarial loop alone provides. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by orders of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, respectively, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
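
As a rough illustration of the co-generation loop and final filter described above, the following Python sketch may help. It is not the released pipeline: every name here (`TaskTuple`, `co_generate`, `passes_final_filter`) is hypothetical, environment states are modeled as plain dictionaries, and the acceptance criteria (the reward must reject the initial state and accept the golden state; a strict majority of judge votes plus at least one successful rollout) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: an environment state is modeled as a plain dict.
State = dict
RewardFn = Callable[[State], bool]


@dataclass
class TaskTuple:
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: RewardFn


def reward_is_consistent(t: TaskTuple) -> bool:
    """Assumed check: reward rejects the initial state, accepts the golden one."""
    return (not t.reward_fn(t.initial_state)) and t.reward_fn(t.golden_state)


def co_generate(
    instruction: str,
    generate_states: Callable[[str, Optional[str]], tuple],
    write_reward: Callable[[str, Optional[str]], RewardFn],
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Hypothetical orchestrator: iterate generator and discriminator
    until the reward agrees with the states, or give up."""
    feedback = None
    for _ in range(max_rounds):
        # Generator builds the states; discriminator sees only the task spec.
        initial, golden = generate_states(instruction, feedback)
        reward_fn = write_reward(instruction, feedback)
        candidate = TaskTuple(instruction, initial, golden, reward_fn)
        if reward_is_consistent(candidate):
            return candidate
        feedback = "reward disagrees with the initial/golden states"
    return None


def passes_final_filter(votes: list, rollout_rewards: list) -> bool:
    """Assumed filter: strict majority of LLM votes and one solved rollout."""
    majority = sum(votes) * 2 > len(votes)
    solvable = any(rollout_rewards)
    return majority and solvable
```

In this sketch, `generate_states` and `write_reward` stand in for LLM-backed agents; in practice they would be replaced by calls into whatever agent framework is used, and the consistency check would run against a live environment rather than a dictionary.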