Video2GUI: Large-Scale Interaction Trajectory Synthesis for General-Purpose GUI Agent Pre-training

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotation and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.