Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalist GUI Agent Pretraining

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotation and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.
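
As an illustration of the general coarse-to-fine idea only, the sketch below runs a cheap metadata check first and applies a more expensive scorer only to the survivors. The abstract does not describe the actual filtering criteria, so every name here (`VideoMeta`, `TUTORIAL_KEYWORDS`, the duration window, `score_fn`, the threshold) is a hypothetical stand-in and not the Video2GUI implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical keyword list; the abstract does not specify the coarse criteria.
TUTORIAL_KEYWORDS = ("tutorial", "how to", "walkthrough", "step by step")


@dataclass
class VideoMeta:
    """Hypothetical metadata record for one video."""
    video_id: str
    title: str
    duration_s: int


def coarse_filter(videos: Iterable[VideoMeta]) -> List[VideoMeta]:
    """Cheap pass over metadata only: keyword match plus an assumed duration window."""
    kept = []
    for v in videos:
        title = v.title.lower()
        has_keyword = any(k in title for k in TUTORIAL_KEYWORDS)
        if has_keyword and 30 <= v.duration_s <= 3600:
            kept.append(v)
    return kept


def fine_filter(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Expensive pass: score_fn stands in for a content-level quality model."""
    return [v for v in videos if score_fn(v) >= threshold]


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap filter first so the costly scorer sees fewer candidates."""
    return fine_filter(coarse_filter(videos), score_fn, threshold)
```

The point of the ordering is cost: the scorer is only called on videos that pass the metadata check, which matters when the input is on the order of hundreds of millions of entries.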