TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Abstract

Video generation models are revolutionizing content creation, and image-to-video models are drawing increasing attention for their enhanced controllability, visual consistency, and practical applications. Despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. We also provide the corresponding videos generated by five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. The dataset enables advances in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and to evaluate the multi-dimensional performance of their trained models; to enhance model safety, they can focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V, together with its differences from existing datasets, underscores the importance of a specialized image-to-video prompt dataset. The project is publicly available at https://tip-i2v.github.io.
