Matryoshka Diffusion Models
October 23, 2023
Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly
cs.AI
Abstract
Diffusion models are the de facto approach for generating high-quality images
and videos, but learning high-dimensional models remains a formidable task due
to computational and optimization challenges. Existing methods often resort to
training cascaded models in pixel space or using a downsampled latent space of
a separately trained auto-encoder. In this paper, we introduce Matryoshka
Diffusion Models (MDM), an end-to-end framework for high-resolution image and
video synthesis. We propose a diffusion process that denoises inputs at
multiple resolutions jointly and uses a NestedUNet architecture where features
and parameters for small-scale inputs are nested within those of large scales.
In addition, MDM enables a progressive training schedule from lower to higher
resolutions, which leads to significant improvements in optimization for
high-resolution generation. We demonstrate the effectiveness of our approach on
various benchmarks, including class-conditioned image generation,
high-resolution text-to-image, and text-to-video applications. Remarkably, we
can train a single pixel-space model at resolutions of up to 1024x1024 pixels,
demonstrating strong zero-shot generalization using the CC12M dataset, which
contains only 12 million images.
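The nesting idea in the abstract can be illustrated with a toy sketch: a denoiser for a small resolution runs inside the denoiser for a larger one, and predictions are produced at every scale jointly. This is a minimal NumPy illustration of the structure only, not the paper's NestedUNet; the names `ToyNestedDenoiser`, `downsample`, and `upsample`, and the fixed mixing weights, are all hypothetical.

```python
import numpy as np

def downsample(x, factor=2):
    # Average-pool a square image (H, W) by `factor`.
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor=2):
    # Nearest-neighbour upsampling back to the larger grid.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

class ToyNestedDenoiser:
    """Toy stand-in for a nested multi-resolution denoiser: the
    small-scale denoiser is nested inside the large-scale one, and
    its output is fused back into the large-scale prediction."""

    def __init__(self, inner=None):
        self.inner = inner  # denoiser for the next-lower resolution, if any

    def __call__(self, x):
        if self.inner is None:
            return [0.9 * x]  # hypothetical "denoising" at the coarsest scale
        # Denoise the nested low-resolution view first (all lower scales),
        # then fuse its top prediction into this scale's output.
        low_preds = self.inner(downsample(x))
        fused = 0.9 * x + 0.1 * upsample(low_preds[-1])
        return low_preds + [fused]

# Two-level model: 64x64 outer resolution with a 32x32 inner denoiser.
model = ToyNestedDenoiser(inner=ToyNestedDenoiser())
noisy = np.random.default_rng(0).standard_normal((64, 64))
preds = model(noisy)
print([p.shape for p in preds])  # predictions at both scales: [(32, 32), (64, 64)]
```

The key point mirrored from the abstract is that one forward pass yields denoised outputs at every resolution at once, which is what lets the real model be trained progressively from low to high resolution.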