ProCUA-SFT Technical Report

June 15, 2026
Authors: Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong
cs.AI

Abstract

Training computer-use agents (CUAs), models that interact with graphical desktops through screenshots and keyboard/mouse actions, requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continued training of UI-TARS 7B on AgentNet causes its OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content (912 spreadsheets from SpreadsheetBench, approximately 10K permissively licensed presentations from Zenodo10K, and multi-application OSWorld configs) and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields a 45.0% success rate on OSWorld, an 18.7 percentage-point improvement over the base model and more than 35 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
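
The step-prefix expansion mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the `Step` and `PrefixSample` types, their field names, and the `expand_trajectory` function are hypothetical, and the sketch assumes one training sample per step with the full preceding history kept as context, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One trajectory step (hypothetical layout)."""
    screenshot: str  # assumed: path or ID of the screenshot observed at this step
    action: str      # assumed: serialized keyboard/mouse action taken at this step


@dataclass
class PrefixSample:
    """One step-level SFT sample (hypothetical layout)."""
    goal: str
    history: List[Step]  # steps that precede the current one
    screenshot: str      # current observation
    target_action: str   # action the model is trained to predict


def expand_trajectory(goal: str, steps: List[Step]) -> List[PrefixSample]:
    """Emit one sample per step, each conditioned on the steps before it."""
    return [
        PrefixSample(goal, steps[:i], step.screenshot, step.action)
        for i, step in enumerate(steps)
    ]


# Illustrative three-step trajectory; the goal and actions are made up.
trajectory = [
    Step("shot_0.png", "click(120, 48)"),
    Step("shot_1.png", "type('Q3 totals')"),
    Step("shot_2.png", "hotkey('ctrl', 's')"),
]
samples = expand_trajectory("Rename the sheet and save", trajectory)
```

Under this assumption a trajectory of n steps produces n samples, whose histories have lengths 0 through n-1. The reported totals (3.1M samples from 93K trajectories) work out to roughly 33 samples per trajectory on average, which is compatible with one sample per step, although the actual pipeline may filter, truncate, or window the history differently.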