ProCUA-SFT Technical Report

Abstract

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), causes negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet lowers its OSWorld success rate from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility with a binary precondition check before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, which eliminates capability gaps between planner and actor. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and at least 35 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
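
To make the step-prefix expansion concrete, the following is a minimal Python sketch of how one trajectory could be unrolled into one SFT sample per step. It is an illustration under stated assumptions, not the pipeline's actual code: the names Step, Sample, expand_trajectory, and the max_history parameter are all hypothetical, and the real sample format (prompt template, screenshot encoding, history length) is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One step of a recorded trajectory (hypothetical schema)."""
    screenshot: str  # path or ID of the screenshot observed before acting
    action: str      # serialized action, e.g. "click(120, 340)"


@dataclass
class Sample:
    """One step-level SFT sample (hypothetical schema)."""
    goal: str
    history: List[Step]  # earlier steps, supplied as context
    observation: str     # screenshot at the current step
    target: str          # action at the current step (the supervised label)


def expand_trajectory(
    goal: str, steps: List[Step], max_history: Optional[int] = None
) -> List[Sample]:
    """Unroll a trajectory into one sample per step.

    The sample for step t contains the goal, the prefix steps[0:t] as
    history, the screenshot at step t, and the action at step t as the
    label. max_history is an assumed knob that keeps only the most
    recent steps, mirroring a bounded context window at inference time.
    """
    samples = []
    for t, step in enumerate(steps):
        history = steps[:t]
        if max_history is not None:
            # Guard against max_history == 0, since history[-0:] would
            # return the whole list rather than an empty one.
            history = history[-max_history:] if max_history > 0 else []
        samples.append(Sample(goal, list(history), step.screenshot, step.action))
    return samples
```

Under this one-sample-per-step reading, the reported totals (3.1M samples from 93K trajectories) would correspond to roughly 33 steps per trajectory on average, assuming no samples are filtered out after expansion.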