Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, which limits scalability and complicates downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes training very efficient and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 at 256 × 256 resolution, surpassing recent methods by more than 100%. We then show that it can be rapidly fine-tuned from a base 36 × 64 low-resolution generator for high-resolution 64 × 288 × 512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture trained entirely end-to-end at such high resolutions. Project webpage: https://snap-research.github.io/hpdm.
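To make the idea of a low-scale to high-scale patch hierarchy concrete, the following is a minimal NumPy sketch of how nested patches could be cut from one 288 × 512 video. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_patch_hierarchy`, the choice of four levels with a 2× zoom between them, and the use of strided subsampling in place of a proper anti-aliased resize are all hypothetical. It covers only patch extraction; deep context fusion and adaptive computation are not modeled.

```python
import numpy as np


def sample_patch_hierarchy(video, patch_hw=(36, 64), num_levels=4, rng=None):
    """Cut a coarse-to-fine stack of nested patches from a video.

    Hypothetical helper for illustration only.
    video: array of shape (T, H, W, C), where H and W equal the patch size
           multiplied by 2 ** (num_levels - 1).
    Returns (patches, boxes). Every patch has spatial size patch_hw, and
    boxes[i] = (top, left, height, width) is the region that patch i covers.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, height, width, _ = video.shape
    ph, pw = patch_hw
    top, left, h, w = 0, 0, height, width
    patches, boxes = [], []
    for level in range(num_levels):
        if level > 0:
            # Halve the region and place it at a random offset that keeps it
            # inside the region used by the previous (coarser) level.
            h, w = h // 2, w // 2
            top += int(rng.integers(0, h + 1))
            left += int(rng.integers(0, w + 1))
        # Assumption: strided subsampling stands in for a real resize.
        stride_h, stride_w = h // ph, w // pw
        region = video[:, top:top + h, left:left + w]
        patches.append(region[:, ::stride_h, ::stride_w][:, :ph, :pw])
        boxes.append((top, left, h, w))
    return patches, boxes


if __name__ == "__main__":
    # 2 frames are used here to keep the example light.
    video = np.zeros((2, 288, 512, 3), dtype=np.float32)
    patches, boxes = sample_patch_hierarchy(video, rng=np.random.default_rng(0))
    for patch, box in zip(patches, boxes):
        print(patch.shape, box)
```

In this sketch every level yields a 36 × 64 patch, so a network would always see inputs of the base generator's size. The coarsest patch covers the whole frame and the finest one is a native-resolution crop, eight times smaller per side than the 288 × 512 frame.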
