ProCUA-SFT Technical Report

Abstract

Training computer-use agents (CUAs), models that interact with graphical desktops through screenshots and keyboard/mouse actions, requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), causes negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet drops its OSWorld success rate from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content (912 spreadsheets from SpreadsheetBench, approximately 10K permissively licensed presentations from Zenodo10K, and multi-application OSWorld configs) and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, which eliminates capability gaps between planner and actor. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields a 45.0% success rate on OSWorld, an 18.7 percentage-point improvement over the base model and more than 35 percentage points above the AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
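To make the step-prefix expansion concrete, the following is a minimal sketch of how one trajectory could be turned into one training sample per step, each sample carrying only the steps that precede it. The `Step` and `Sample` types, the `expand_step_prefixes` function, and the `max_history` cap are all hypothetical names introduced for illustration; the actual context layout used by ProCUA-SFT is not specified in this abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    screenshot: str  # identifier of the screenshot observed before acting (assumed field)
    action: str      # action emitted at this step (assumed field)


@dataclass
class Sample:
    goal: str            # task instruction
    history: List[Step]  # earlier steps visible in the context
    screenshot: str      # current observation
    target: str          # action the model should predict


def expand_step_prefixes(
    goal: str, steps: List[Step], max_history: Optional[int] = None
) -> List[Sample]:
    """Emit one sample per step; sample i sees steps 0..i-1 as history."""
    samples = []
    for i, step in enumerate(steps):
        history = steps[:i]
        if max_history is not None:
            # Guard against history[-0:], which would return the whole list.
            history = history[-max_history:] if max_history > 0 else []
        samples.append(Sample(goal, list(history), step.screenshot, step.action))
    return samples


# Hypothetical three-step trajectory.
steps = [
    Step("s0.png", "click(10, 20)"),
    Step("s1.png", "type('hello')"),
    Step("s2.png", "done()"),
]
samples = expand_step_prefixes("Rename the sheet", steps)
```

Under this assumption a trajectory with N steps yields N samples, and the history passed at training time is built the same way an agent would accumulate it at inference time.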