VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Abstract

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, like other text-to-video diffusion models, relies heavily on prompts, and there has been no publicly available dataset for studying text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. Additionally, the dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, along with some related data. We first describe the curation of this large-scale dataset, which is a time-consuming and costly process. We then show how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on an analysis of these prompts, we identify the need for a new prompt dataset specifically designed for text-to-video generation and gain insights into the preferences of real users when creating videos. Our large-scale and diverse dataset also inspires many exciting new research areas. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available on GitHub and Hugging Face under the CC-BY-NC 4.0 License.
