VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
August 5, 2024
作者: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li
cs.AI
Abstract
The quality of video-text pairs fundamentally determines the upper bound of
text-to-video models. Currently, the datasets used for training these models
suffer from significant shortcomings, including low temporal consistency,
poor-quality captions, substandard video quality, and imbalanced data
distribution. The prevailing video curation process, which depends on image
models for tagging and manual rule-based curation, leads to a high
computational load and leaves behind unclean data. As a result, there is a lack
of appropriate training datasets for text-to-video models. To address this
problem, we present VidGen-1M, a superior training dataset for text-to-video
models. Produced through a coarse-to-fine curation strategy, this dataset
guarantees high-quality videos and detailed captions with excellent temporal
consistency. When used to train a video generation model, this dataset yields
experimental results that surpass those obtained with other models.
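The abstract names a coarse-to-fine curation strategy but gives no implementation details. The following is a minimal, hypothetical Python sketch of what such a pipeline could look like: inexpensive rule-based filters run first (the coarse stage), so that model-based scoring of temporal consistency and caption quality (the fine stage) only sees the surviving fraction of clips. All function names, thresholds, and stage definitions here are illustrative assumptions, not the paper's actual method.

    # Hypothetical coarse-to-fine curation sketch; stages and thresholds
    # are assumptions for illustration, not VidGen-1M's actual pipeline.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Clip:
        path: str
        duration_s: float
        height: int
        caption: str

    def coarse_filter(clip: Clip) -> bool:
        """Coarse stage: cheap rule-based checks that discard the bulk of
        low-quality clips before any model inference runs."""
        return clip.duration_s >= 2.0 and clip.height >= 720 and bool(clip.caption)

    def temporal_consistency_score(clip: Clip) -> float:
        """Fine-stage placeholder: a real pipeline would run a video model
        (e.g., frame-embedding similarity across time). Stubbed here."""
        return 0.9  # assumed score for illustration

    def caption_quality_score(clip: Clip) -> float:
        """Fine-stage placeholder: a captioning model would rescore or
        rewrite captions; caption length is used as a crude proxy."""
        return min(len(clip.caption) / 100.0, 1.0)

    def curate(clips: List[Clip],
               t_consistency: float = 0.8,
               t_caption: float = 0.5) -> List[Clip]:
        # Coarse pass: rules only, no model inference.
        survivors = [c for c in clips if coarse_filter(c)]
        # Fine pass: model-based scoring on the survivors only.
        return [c for c in survivors
                if temporal_consistency_score(c) >= t_consistency
                and caption_quality_score(c) >= t_caption]

    if __name__ == "__main__":
        clips = [
            Clip("a.mp4", 5.0, 1080, "A dog runs across a sunlit beach toward the waves."),
            Clip("b.mp4", 0.5, 360, ""),  # fails the coarse stage
        ]
        print([c.path for c in curate(clips)])

The ordering is the point of the design: because the coarse rules are nearly free to evaluate, the expensive model-based scorers run on only a small remainder of the data, which addresses the high computational load the abstract attributes to prevailing curation processes.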