VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models

Abstract

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advances in video generation and its potential applications. However, Sora and other text-to-video diffusion models rely heavily on prompts, and there is no publicly available dataset for studying text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 million unique text-to-video prompts from real users. The dataset also includes 6.69 million videos generated by four state-of-the-art diffusion models, along with some related data. We first describe the curation of this large-scale dataset, which is a time-consuming and costly process. We then show how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Based on an analysis of these prompts, we identify the need for a new prompt dataset designed specifically for text-to-video generation and gain insight into the preferences of real users when creating videos. Our large-scale and diverse dataset also inspires many new research directions. For instance, to develop better, more efficient, and safer text-to-video diffusion models, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models. We make the collected dataset VidProM publicly available on GitHub and Hugging Face under the CC BY-NC 4.0 license.
