VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
August 5, 2024
Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li
cs.AI
Abstract
The quality of video-text pairs fundamentally determines the upper bound of
text-to-video models. Currently, the datasets used for training these models
suffer from significant shortcomings, including low temporal consistency,
poor-quality captions, substandard video quality, and imbalanced data
distribution. The prevailing video curation process, which depends on image
models for tagging and manual rule-based curation, leads to a high
computational load and leaves behind unclean data. As a result, there is a lack
of appropriate training datasets for text-to-video models. To address this
problem, we present VidGen-1M, a superior training dataset for text-to-video
models. Produced through a coarse-to-fine curation strategy, this dataset
guarantees high-quality videos and detailed captions with excellent temporal
consistency. When used to train a video generation model, this dataset led to
experimental results that surpass those obtained with other models.
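To make the two-stage idea concrete, below is a minimal, illustrative sketch of a coarse-to-fine curation pipeline in Python. This is not the authors' released code: all function names, thresholds, and the stub quality score are hypothetical, chosen only to mirror the pattern the abstract describes, where cheap rule-based filters (duration, resolution, scene cuts) discard most clips before any expensive model-based scoring runs.

    """Illustrative sketch of coarse-to-fine video curation.

    NOT the VidGen-1M implementation; names and thresholds are
    placeholders that mirror the two-stage idea in the abstract.
    """
    from dataclasses import dataclass

    @dataclass
    class VideoClip:
        path: str
        duration_s: float   # clip length in seconds
        height: int         # vertical resolution in pixels
        scene_cuts: int     # number of detected hard cuts

    def coarse_filter(clip: VideoClip) -> bool:
        """Stage 1: cheap rule-based checks that discard obviously
        unusable clips before any model runs (thresholds invented)."""
        if clip.duration_s < 2.0 or clip.duration_s > 60.0:
            return False    # too short or too long for training
        if clip.height < 720:
            return False    # substandard video quality
        if clip.scene_cuts > 0:
            return False    # hard cuts break temporal consistency
        return True

    def fine_score(clip: VideoClip) -> float:
        """Stage 2: placeholder for an expensive model-based score
        (e.g., aesthetic or text-video alignment). A real pipeline
        would run inference on sampled frames here."""
        return 1.0          # stub value for illustration only

    def curate(clips: list[VideoClip], min_score: float = 0.5) -> list[VideoClip]:
        """Run the coarse stage on everything, the fine stage only
        on the survivors, so heavy inference sees far fewer clips."""
        survivors = [c for c in clips if coarse_filter(c)]
        return [c for c in survivors if fine_score(c) >= min_score]

    if __name__ == "__main__":
        clips = [
            VideoClip("a.mp4", duration_s=12.0, height=1080, scene_cuts=0),
            VideoClip("b.mp4", duration_s=1.0, height=1080, scene_cuts=0),  # too short
            VideoClip("c.mp4", duration_s=30.0, height=480, scene_cuts=2),  # low-res, has cuts
        ]
        print([c.path for c in curate(clips)])  # -> ['a.mp4']

The design point is cost ordering: rule-based checks are nearly free per clip, so running them first shrinks the set on which heavy model inference must be performed, which is the computational-load problem the abstract attributes to prevailing curation pipelines.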