Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Pre-training General-Purpose GUI Agents

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotation and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5% to 20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.