xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
August 22, 2024
Authors: Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
cs.AI
Abstract
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of
producing realistic scenes from textual descriptions. Building on recent
advancements, such as OpenAI's Sora, we explore the latent diffusion model
(LDM) architecture and introduce a video variational autoencoder (VidVAE).
VidVAE compresses video data both spatially and temporally, significantly
reducing the length of visual tokens and the computational demands associated
with generating long-sequence videos. To further address the computational
costs, we propose a divide-and-merge strategy that maintains temporal
consistency across video segments. Our Diffusion Transformer (DiT) model
incorporates spatial and temporal self-attention layers, enabling robust
generalization across different timeframes and aspect ratios. We built a data
processing pipeline from scratch and collected over 13M
high-quality video-text pairs. The pipeline includes multiple steps such as
clipping, text detection, motion estimation, aesthetics scoring, and dense
captioning based on our in-house video-LLM model. Training the VidVAE and DiT
models required approximately 40 and 642 H100 GPU-days, respectively. Our model
supports end-to-end generation of 720p videos longer than 14 seconds and
demonstrates competitive performance against state-of-the-art T2V models.
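
To make the compression claim concrete, the sketch below shows a toy VidVAE-style encoder in PyTorch. The 4x temporal / 8x spatial downsampling factors, layer widths, and the name `TinyVidVAEEncoder` are illustrative assumptions, not the paper's reported architecture; the point is only how 3D convolutions shrink the token grid the diffusion model must attend over.

```python
# Hypothetical sketch of VidVAE-style spatiotemporal compression.
# The 4x-temporal / 8x-spatial factors and layer sizes are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TinyVidVAEEncoder(nn.Module):
    """Compresses (B, C, T, H, W) video into a much shorter latent grid."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            # stride 2 on all axes: halve time, height, and width
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            # stride (1, 2, 2): halve only the spatial dims
            nn.Conv3d(128, 128, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, 2 * latent_ch, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, video):
        mu, logvar = self.net(video).chunk(2, dim=1)  # VAE posterior params
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

video = torch.randn(1, 3, 32, 256, 256)   # 32 frames at 256x256
latent = TinyVidVAEEncoder()(video)
print(video.shape, "->", latent.shape)    # time /4, space /8: (1, 4, 8, 32, 32)
# The DiT now attends over 8*32*32 = 8,192 latent positions instead of
# ~2M pixel locations, which is what makes long-sequence generation tractable.
```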
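The factorized spatial-plus-temporal self-attention mentioned above can likewise be sketched as below. The block layout, dimensions, and the name `SpatioTemporalDiTBlock` are assumptions for illustration (the paper does not publish this code), but the reshaping pattern is the standard way to alternate attention within each frame and across frames.

```python
# Hypothetical sketch of a DiT block with factorized spatial and temporal
# self-attention over video latents shaped (B, T, H*W, D). The paper's exact
# block design (norms, conditioning, MLP ratio) may differ.
import torch
import torch.nn as nn

class SpatioTemporalDiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, T, S, D), S = H*W
        b, t, s, d = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        xs = x.reshape(b * t, s, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, s, t, d).transpose(1, 2)
        return x + self.mlp(self.norm_m(x))

tokens = torch.randn(2, 8, 32 * 32, 512)       # 8 latent frames of 32x32
print(SpatioTemporalDiTBlock()(tokens).shape)  # torch.Size([2, 8, 1024, 512])
```

Factorizing attention this way keeps each attention call quadratic only in S or T rather than in S*T, which is one common route to the generalization across durations and aspect ratios that the abstract describes.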