VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Abstract

The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. The datasets currently used to train these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and on manual rule-based curation, incurs a high computational load and leaves unclean data behind. As a result, appropriate training datasets for text-to-video models are lacking. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train a video generation model, this dataset has led to experimental results that surpass those obtained with other models.
