VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Abstract

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advances in video generation and its potential applications. However, Sora and other text-to-video diffusion models rely heavily on prompts, and there is no publicly available dataset for studying text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. The dataset also includes 6.69 million videos generated by four state-of-the-art diffusion models, along with related data. We first describe the curation of this large-scale dataset, which is a time-consuming and costly process. We then show how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on an analysis of these prompts, we identify the need for a new prompt dataset designed specifically for text-to-video generation and gain insights into the preferences of real users when creating videos. Our large-scale and diverse dataset also opens up many new research areas. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available on GitHub and Hugging Face under the CC-BY-NC 4.0 License.
