
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

December 7, 2023
作者: Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang
cs.AI

Abstract

Although diffusion models have shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods.
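The structure-level decoupling described above can be illustrated with a toy sketch. Note this is not the authors' implementation: the function names, the linear "denoising" updates, and all parameters are invented for illustration. A real system would use a trained video diffusion denoiser; here each stage is a simple iterative update that mimics the two-step flow (text → spatial prior → temporally coherent frames).

```python
import numpy as np

def spatial_reasoning(text_emb, steps=10, seed=0):
    """Stage 1 (hypothetical): produce a spatially coherent prior frame
    from a text embedding by iteratively denoising toward the condition."""
    rng = np.random.default_rng(seed)
    frame = rng.standard_normal(text_emb.shape)  # start from pure noise
    for _ in range(steps):
        # toy denoising update: pull the sample toward the text condition
        frame = frame + 0.2 * (text_emb - frame)
    return frame

def temporal_reasoning(prior_frame, num_frames=8, steps=10, seed=1):
    """Stage 2 (hypothetical): generate temporally coherent frames
    conditioned on the spatial prior from stage 1."""
    rng = np.random.default_rng(seed)
    video = rng.standard_normal((num_frames,) + prior_frame.shape)
    for _ in range(steps):
        # pull each frame toward the shared spatial prior (appearance)
        # and toward its temporal neighbours (motion smoothness)
        neighbour_mean = 0.5 * (np.roll(video, 1, axis=0)
                                + np.roll(video, -1, axis=0))
        video = (video
                 + 0.2 * (prior_frame - video)
                 + 0.1 * (neighbour_mean - video))
    return video

text_emb = np.ones((4, 4))           # stand-in for a text encoder output
prior = spatial_reasoning(text_emb)  # step 1: spatially coherent prior
video = temporal_reasoning(prior)    # step 2: motion from the prior
print(video.shape)                   # (num_frames, H, W)
```

The key design point mirrored here is that a single pipeline (in HiGen, a unified denoiser) handles both steps, but spatial appearance is settled first so the temporal step only has to model motion around a fixed prior.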