OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

July 2, 2024
Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai
cs.AI

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges. 1) There is no precise, open-source, high-quality dataset. Previously popular video datasets, e.g. WebVid-10M and Panda-70M, are either low in quality or too large for most research institutions. It is therefore challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) Textual information is not fully utilized. Recent T2V methods have focused on vision transformers and use a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.

