TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Abstract

Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, although these models are popular and rely on user-provided text and image prompts, there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and its differences from existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at https://tip-i2v.github.io.
