Sekai: A Video Dataset towards World Exploration
June 18, 2025
作者: Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang
cs.AI
Abstract
Video generation techniques have made remarkable progress and promise to be the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world-exploration training, as they suffer from several limitations: limited locations, short durations, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking and drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the areas of video generation and world exploration and motivate valuable applications.
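For concreteness, the sketch below shows what a single per-video annotation record might look like, given the annotation fields the abstract lists (location, scene, weather, crowd density, captions, and camera trajectories). All field names, value formats, and the trajectory representation here are illustrative assumptions, not Sekai's published schema.

```python
# A minimal, hypothetical sketch of one Sekai annotation record.
# Field names and value formats are assumptions based on the abstract;
# the real dataset's schema may differ.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SekaiClip:
    video_id: str                 # identifier of the source video clip
    location: str                 # e.g. "Kyoto, Japan" -- one of 750 cities
    scene: str                    # e.g. "street", "park", "market"
    weather: str                  # e.g. "sunny", "rainy", "overcast"
    crowd_density: str            # e.g. "empty", "sparse", "dense"
    caption: str                  # natural-language description of the clip
    # Per-frame camera pose as (tx, ty, tz, qx, qy, qz, qw); the actual
    # trajectory format used by Sekai is not specified in the abstract.
    camera_trajectory: List[Tuple[float, ...]] = field(default_factory=list)

# Example record (all values invented for illustration only).
clip = SekaiClip(
    video_id="sekai_000001",
    location="Kyoto, Japan",
    scene="street",
    weather="sunny",
    crowd_density="sparse",
    caption="A first-person walk along a narrow street lined with shops.",
    camera_trajectory=[(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)],
)
```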