ProCUA-SFT Technical Report
June 15, 2026
Authors: Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong
cs.AI
Abstract
Training computer-use agents (CUAs), models that interact with graphical desktops through screenshots and keyboard/mouse actions, requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continued training of UI-TARS 7B on AgentNet causes its OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content (912 spreadsheets from SpreadsheetBench, approximately 10K permissively licensed presentations from Zenodo10K, and multi-application OSWorld configs) and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields a 45.0% success rate on OSWorld, an 18.7 percentage-point improvement over the base model and roughly 35 to 37 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
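The step-prefix expansion mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the sample schema, so everything here is an assumption made for illustration: the names `Step`, `Sample`, and `expand_to_step_prefix_samples`, the field layout, and the optional `max_history` cap are hypothetical and are not the authors' implementation. The sketch only shows the general idea of turning one trajectory of N steps into N training samples, each pairing the goal and the prefix of earlier steps with the next action to predict.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One step of a trajectory (hypothetical schema)."""
    observation: str  # e.g. an identifier or path for the screenshot
    action: str       # e.g. a serialized keyboard/mouse action


@dataclass
class Sample:
    """One step-level SFT sample (hypothetical schema)."""
    goal: str
    history: List[Step]   # steps completed before the current one
    observation: str      # screenshot the model sees at this step
    target_action: str    # action the model is trained to predict


def expand_to_step_prefix_samples(
    goal: str,
    steps: List[Step],
    max_history: Optional[int] = None,
) -> List[Sample]:
    """Expand a trajectory into one sample per step.

    Sample i contains the goal, the prefix steps[0:i] as history, the
    observation at step i, and the action at step i as the target.
    max_history is an assumed knob that keeps only the most recent
    steps in the history, mimicking a bounded inference-time context.
    """
    samples = []
    for i, step in enumerate(steps):
        history = steps[:i]
        if max_history is not None:
            # Guard against history[-0:], which would return the full list.
            history = history[-max_history:] if max_history > 0 else []
        samples.append(
            Sample(
                goal=goal,
                history=list(history),
                observation=step.observation,
                target_action=step.action,
            )
        )
    return samples
```

Under this sketch a trajectory of three steps produces three samples with history lengths 0, 1, and 2. As a rough consistency check on the reported scale, and assuming every step yields exactly one sample, 3.1M samples from 93K trajectories corresponds to about 33 steps per trajectory on average.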