OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

July 2, 2024
Authors: Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, Ying Tai
cs.AI

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) The lack of a precise, open-source, high-quality dataset. Previously popular video datasets, e.g., WebVid-10M and Panda-70M, are either low in quality or too large for most research institutions. Collecting precise, high-quality text-video pairs for T2V generation is therefore challenging but crucial. 2) The failure to fully utilize textual information. Recent T2V methods have focused on vision transformers and use a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from text prompts. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
