VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Abstract

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advances in video generation and its potential applications. However, Sora and other text-to-video diffusion models rely heavily on prompts, and no publicly available dataset features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. The dataset also includes 6.69 million videos generated by four state-of-the-art diffusion models, along with some related data. We first describe the curation of this large-scale dataset, which is a time-consuming and costly process. We then show how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on an analysis of these prompts, we identify the need for a new prompt dataset designed specifically for text-to-video generation and gain insights into the preferences of real users when creating videos. Our large-scale and diverse dataset also opens up many exciting new research areas. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available on GitHub and Hugging Face under the CC-BY-NC 4.0 License.
