

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

October 22, 2025
作者: Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
cs.AI

Abstract

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
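The two-stage Video2Action pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the frame-difference grounding heuristic, the `ActionSpan`/`GUIAction` types, and the stub recognizer are all hypothetical stand-ins for the paper's learned video grounding model and action-content recognizer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ActionSpan:
    """A temporally grounded action candidate (stage 1 output)."""
    start_s: float
    end_s: float
    label: str  # coarse action type, e.g. "click"

@dataclass
class GUIAction:
    """A structured action with parameters (stage 2 output)."""
    kind: str
    coords: Optional[Tuple[int, int]] = None
    text: Optional[str] = None

def ground_actions(frame_diffs: List[float], fps: float,
                   thresh: float = 0.5) -> List[ActionSpan]:
    """Toy stand-in for the video grounding model: treat runs of frames
    whose pixel-change score exceeds a threshold as action intervals.
    (The real model is learned and also attaches screen context.)"""
    spans, i = [], 0
    while i < len(frame_diffs):
        if frame_diffs[i] > thresh:
            j = i
            while j + 1 < len(frame_diffs) and frame_diffs[j + 1] > thresh:
                j += 1
            spans.append(ActionSpan(i / fps, (j + 1) / fps, "click"))
            i = j + 1
        else:
            i += 1
    return spans

def recognize_content(span: ActionSpan) -> GUIAction:
    """Toy stand-in for the action-content recognizer: the real component
    would recover click coordinates or typed text from the frames."""
    return GUIAction(kind=span.label, coords=(100, 200))

# Per-frame change scores for a short hypothetical clip at 1 fps.
diffs = [0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.1]
spans = ground_actions(diffs, fps=1.0)
actions = [recognize_content(s) for s in spans]
print(len(actions))  # → 2
```

Running the sketch on the toy change scores yields two interaction steps, mirroring how the full pipeline turns each detected interval in a screen recording into one structured training example.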