Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
February 22, 2024
Authors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov
cs.AI
Abstract
Contemporary models for generating images show remarkable quality and
versatility. Swayed by these advantages, the research community repurposes them
to generate videos. Since video content is highly redundant, we argue that
naively bringing advances of image models to the video generation domain
reduces motion fidelity, visual quality and impairs scalability. In this work,
we build Snap Video, a video-first model that systematically addresses these
challenges. To do that, we first extend the EDM framework to take into account
spatially and temporally redundant pixels and naturally support video
generation. Second, we show that a U-Net - a workhorse behind image generation
- scales poorly when generating videos, requiring significant computational
overhead. Hence, we propose a new transformer-based architecture that trains
3.31 times faster than U-Nets (and is ~4.5× faster at inference). This allows us
to efficiently train a text-to-video model with billions of parameters for the
first time, reach state-of-the-art results on a number of benchmarks, and
generate videos with substantially higher quality, temporal consistency, and
motion complexity. The user studies showed that our model was favored by a
large margin over the most recent methods. See our website at
https://snap-research.github.io/snapvideo/.
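
The abstract does not spell out how the EDM framework is adapted for video. For orientation only, below is a minimal PyTorch sketch of standard EDM preconditioning (Karras et al., 2022), with a hypothetical scale factor `s` standing in for an adjustment one might make for spatially and temporally redundant pixels. The function name `edm_denoise`, the factor `s`, and the placeholder `raw_network` are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of EDM-style preconditioning applied to a video tensor.
# NOTE: the factor `s` is a hypothetical stand-in for a redundancy-aware
# rescaling; it is NOT the formula used in Snap Video.
import torch

def edm_denoise(x_noisy, sigma, raw_network, sigma_data=0.5, s=1.0):
    """Wrap a raw network F into an EDM denoiser D(x; sigma).

    x_noisy:     noisy video tensor of shape (B, C, T, H, W)
    sigma:       per-sample noise levels, shape (B,)
    raw_network: callable F(x, c_noise) returning a tensor shaped like x
    s:           hypothetical scale on the data std to model spatiotemporal
                 redundancy (assumption for illustration only)
    """
    sigma = sigma.view(-1, 1, 1, 1, 1)
    sd = sigma_data * s  # effective data std under the assumed scaling

    # Standard EDM preconditioning coefficients (Karras et al., 2022).
    c_skip = sd**2 / (sigma**2 + sd**2)
    c_out = sigma * sd / torch.sqrt(sigma**2 + sd**2)
    c_in = 1.0 / torch.sqrt(sigma**2 + sd**2)
    c_noise = 0.25 * torch.log(sigma).flatten()

    return c_skip * x_noisy + c_out * raw_network(c_in * x_noisy, c_noise)

# Illustrative usage with a dummy (identity) network:
B, C, T, H, W = 2, 3, 8, 32, 32
x0 = torch.randn(B, C, T, H, W)
sigma = torch.exp(torch.randn(B) * 1.2 - 1.2)  # EDM's log-normal noise levels
noisy = x0 + sigma.view(-1, 1, 1, 1, 1) * torch.randn_like(x0)
denoised = edm_denoise(noisy, sigma, raw_network=lambda x, t: x)
```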