Video2GUI: Large-Scale Interaction Trajectory Synthesis for Generalized GUI Agent Pre-training

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains limited by the scarcity of large-scale training data covering diverse real-world applications. Existing datasets rely heavily on costly manual annotation and are typically confined to narrow domains. To address this, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.
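As an illustration of what a coarse-to-fine filter over video metadata could look like, the following is a minimal sketch and not the Video2GUI implementation. The abstract does not specify the stages, so everything here is an assumption made for illustration: the `VideoMeta` record, the keyword list, the `toy_fine_score` function, and the threshold. The idea shown is only that a cheap check runs on every entry, and a more expensive check runs on the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class VideoMeta:
    """Hypothetical metadata record; field names are assumptions."""
    video_id: str
    title: str
    description: str = ""


# Hypothetical keyword list for the cheap first stage.
COARSE_KEYWORDS = ("tutorial", "how to", "walkthrough", "step by step")


def coarse_pass(meta: VideoMeta) -> bool:
    """Cheap stage: keep entries whose text mentions any tutorial-like keyword."""
    text = f"{meta.title} {meta.description}".lower()
    return any(keyword in text for keyword in COARSE_KEYWORDS)


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    fine_score: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap filter first, then score only the survivors."""
    kept = []
    for meta in videos:
        if not coarse_pass(meta):
            continue  # rejected cheaply; the expensive scorer is never called
        if fine_score(meta) >= threshold:
            kept.append(meta)
    return kept


def toy_fine_score(meta: VideoMeta) -> float:
    """Stand-in for an expensive, model-based quality check (assumption)."""
    gui_terms = ("excel", "photoshop", "settings", "browser", "menu")
    text = f"{meta.title} {meta.description}".lower()
    hits = sum(term in text for term in gui_terms)
    return min(1.0, hits / 2)


if __name__ == "__main__":
    sample = [
        VideoMeta("a", "How to make a pivot table in Excel", "Open the Insert menu"),
        VideoMeta("b", "Funny cat compilation"),
        VideoMeta("c", "Guitar tutorial for beginners"),
    ]
    for meta in coarse_to_fine(sample, toy_fine_score):
        print(meta.video_id, meta.title)
```

In this sketch the ordering is the point: at the scale of hundreds of millions of metadata entries, most entries would be discarded by the cheap stage, so the costly scorer only needs to run on a small fraction of the input.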