Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, which limits scalability and complicates downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that operates on patches rather than on whole inputs. This makes training very efficient and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state of the art in class-conditional video generation on UCF-101 at 256×256 resolution, with an FVD score of 66.32 and an Inception Score of 87.68, surpassing recent methods by more than 100%. We then show that it can be rapidly fine-tuned from a base 36×64 low-resolution generator for high-resolution 64×288×512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture trained entirely end-to-end at such high resolutions. Project webpage: https://snap-research.github.io/hpdm.
