VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
October 22, 2025
作者: Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
cs.AI
Abstract
Training computer-use agents requires massive amounts of GUI interaction
data, but manually annotating action trajectories at scale is prohibitively
expensive. We present VideoAgentTrek, a scalable pipeline that automatically
mines training data from publicly available screen-recorded videos at web
scale, eliminating the need for manual annotation. Our approach addresses a key
challenge: raw videos contain implicit demonstrations but lack explicit action
labels. To solve this, we develop Video2Action, an inverse dynamics module
(IDM) with two components: (1) a video grounding model that detects and
localizes GUI actions with precise temporal boundaries and context, and (2) an
action-content recognizer that extracts structured parameters like click
coordinates and typed text with high fidelity. Applied to 39,000 YouTube
tutorial videos, our pipeline generates 1.52 million interaction steps
automatically. We leverage this data through continued pretraining followed by
supervised fine-tuning. On OSWorld-Verified, our approach improves task success
rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On
AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results
demonstrate that passive internet videos can be transformed into high-quality
supervision for computer-use agents, providing a scalable alternative to
expensive manual annotation.
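The two-stage inverse dynamics module described above can be sketched as a small pipeline: a grounding stage that localizes actions with temporal boundaries, followed by a recognition stage that fills in structured parameters. This is a minimal illustrative sketch, not the paper's implementation; all names (`ActionSegment`, `ground_actions`, `recognize_content`) and the cursor-jump heuristic are hypothetical stand-ins for the learned models.

```python
from dataclasses import dataclass, field

@dataclass
class ActionSegment:
    """A temporally localized GUI action produced by the grounding stage."""
    start_s: float
    end_s: float
    action_type: str          # e.g. "click", "type", "scroll"

@dataclass
class GuiAction:
    """A fully parameterized interaction step, usable as a training label."""
    segment: ActionSegment
    params: dict = field(default_factory=dict)

def ground_actions(frames):
    """Stage 1 (hypothetical stub): detect actions and their temporal
    boundaries. A real system would run a video grounding model; here we
    flag a 'click' whenever the synthetic cursor position jumps."""
    segments = []
    for i in range(1, len(frames)):
        (x0, y0), (x1, y1) = frames[i - 1]["cursor"], frames[i]["cursor"]
        if abs(x1 - x0) + abs(y1 - y0) > 50:   # arbitrary jump threshold
            t = frames[i]["t"]
            segments.append(ActionSegment(t - 0.1, t + 0.1, "click"))
    return segments

def recognize_content(frames, seg):
    """Stage 2 (hypothetical stub): extract structured parameters for one
    segment. A real recognizer would read pixels and on-screen text; here
    we return the cursor position at the frame nearest the segment midpoint."""
    mid = (seg.start_s + seg.end_s) / 2
    frame = min(frames, key=lambda f: abs(f["t"] - mid))
    x, y = frame["cursor"]
    return {"x": x, "y": y}

def video_to_actions(frames):
    """Full IDM pass: grounding followed by content recognition."""
    return [GuiAction(seg, recognize_content(frames, seg))
            for seg in ground_actions(frames)]
```

Run at scale over mined videos, each `GuiAction` list becomes one pseudo-labeled trajectory for continued pretraining.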