xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
August 22, 2024
Authors: Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
cs.AI
Abstract
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of
producing realistic scenes from textual descriptions. Building on recent
advancements, such as OpenAI's Sora, we explore the latent diffusion model
(LDM) architecture and introduce a video variational autoencoder (VidVAE).
VidVAE compresses video data both spatially and temporally, significantly
reducing the length of visual tokens and the computational demands associated
with generating long-sequence videos. To further address the computational
costs, we propose a divide-and-merge strategy that maintains temporal
consistency across video segments. Our Diffusion Transformer (DiT) model
incorporates spatial and temporal self-attention layers, enabling robust
generalization across different timeframes and aspect ratios. We built a data
processing pipeline from scratch and collected over 13M
high-quality video-text pairs. The pipeline includes multiple steps such as
clipping, text detection, motion estimation, aesthetics scoring, and dense
captioning based on our in-house video-LLM model. Training the VidVAE and DiT
models required approximately 40 and 642 H100 GPU-days, respectively. Our model
supports end-to-end generation of 720p videos longer than 14 seconds and
demonstrates competitive performance against state-of-the-art T2V models.
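
To make the compression claim concrete, the sketch below shows a toy VidVAE-style encoder in PyTorch. The 4x temporal / 8x spatial downsampling factors, layer widths, and the name `TinyVidVAEEncoder` are illustrative assumptions, not the paper's reported architecture; the point is only how 3D convolutions shrink the token grid the diffusion model must attend over.

```python
# Hypothetical sketch of VidVAE-style spatiotemporal compression.
# The 4x-temporal / 8x-spatial factors and layer sizes are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class TinyVidVAEEncoder(nn.Module):
    """Compresses (B, C, T, H, W) video into a much shorter latent grid."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            # stride 2 on all axes: halve time, height, and width
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            # stride (1, 2, 2): halve only the spatial dims
            nn.Conv3d(128, 128, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, 2 * latent_ch, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, video):
        mu, logvar = self.net(video).chunk(2, dim=1)  # VAE posterior params
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

video = torch.randn(1, 3, 32, 256, 256)   # 32 frames at 256x256
latent = TinyVidVAEEncoder()(video)
print(video.shape, "->", latent.shape)    # time /4, space /8: (1, 4, 8, 32, 32)
# The DiT now attends over 8*32*32 = 8,192 latent positions instead of
# ~2M pixel locations, which is what makes long-sequence generation tractable.
```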
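The factorized spatial-plus-temporal self-attention mentioned above can likewise be sketched as below. The block layout, dimensions, and the name `SpatioTemporalDiTBlock` are assumptions for illustration (the paper does not publish this code), but the reshaping pattern is the standard way to alternate attention within each frame and across frames.

```python
# Hypothetical sketch of a DiT block with factorized spatial and temporal
# self-attention over video latents shaped (B, T, H*W, D). The paper's exact
# block design (norms, conditioning, MLP ratio) may differ.
import torch
import torch.nn as nn

class SpatioTemporalDiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, T, S, D), S = H*W
        b, t, s, d = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        xs = x.reshape(b * t, s, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, s, t, d).transpose(1, 2)
        return x + self.mlp(self.norm_m(x))

tokens = torch.randn(2, 8, 32 * 32, 512)       # 8 latent frames of 32x32
print(SpatioTemporalDiTBlock()(tokens).shape)  # torch.Size([2, 8, 1024, 512])
```

Factorizing attention this way keeps each attention call quadratic only in S or T rather than in S*T, which is one common route to the generalization across durations and aspect ratios that the abstract describes.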