
Sekai: A Video Dataset towards World Exploration

June 18, 2025
Authors: Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang
cs.AI

Abstract

Video generation techniques have made remarkable progress and promise to become the foundation of interactive world exploration. However, existing video generation datasets are not well suited for training world exploration models, as they suffer from several limitations: limited locations, short durations, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality, first-person-view, worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking and drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We further use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the areas of video generation and world exploration and inspire valuable applications.
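For illustration, the sketch below shows what a per-clip annotation record could look like. The field names and the trajectory format are hypothetical assumptions, not the dataset's actual schema; only the annotation categories named in the abstract (location, scene, weather, crowd density, caption, camera trajectory) come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical per-clip annotation record; the field names and the
# pose representation are illustrative assumptions, not Sekai's
# published schema. The annotation categories themselves (location,
# scene, weather, crowd density, caption, camera trajectory) are the
# ones listed in the abstract.
@dataclass
class ClipAnnotation:
    video_id: str
    country: str                 # e.g. "Japan"
    city: str                    # e.g. "Tokyo"
    scene: str                   # e.g. "urban street"
    weather: str                 # e.g. "sunny"
    crowd_density: str           # e.g. "medium"
    caption: str                 # free-text description of the clip
    # Per-frame camera pose as (tx, ty, tz, qx, qy, qz, qw); the real
    # trajectory format in Sekai may differ.
    camera_trajectory: List[Tuple[float, ...]] = field(default_factory=list)

# Example usage with made-up values.
clip = ClipAnnotation(
    video_id="example_0001",
    country="Japan",
    city="Tokyo",
    scene="urban street",
    weather="sunny",
    crowd_density="medium",
    caption="A first-person walk through a busy shopping street at dusk.",
    camera_trajectory=[(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)],
)
```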