Matryoshka Diffusion Models

Abstract

Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using the downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that jointly denoises inputs at multiple resolutions and uses a NestedUNet architecture in which the features and parameters for small-scale inputs are nested within those for large-scale inputs. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
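To make the idea of noising inputs at multiple resolutions jointly more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It builds a small pyramid of downsampled copies of one image and adds independent Gaussian noise to each copy at a shared time step. The helper names (downsample, noisy_pyramid), the use of average pooling, and the cosine noise schedule are all assumptions made for illustration; the abstract does not specify them, and the NestedUNet denoiser itself is not shown.

```python
import math
import random


def downsample(image, factor):
    """Average-pool a square grayscale image (a list of rows) by an integer factor.

    Average pooling is an assumption; the abstract does not name the operator.
    """
    size = len(image) // factor
    return [
        [
            sum(
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ) / factor ** 2
            for c in range(size)
        ]
        for r in range(size)
    ]


def noisy_pyramid(image, factors, t, rng):
    """Return one noised copy of `image` per downsampling factor.

    All copies share the time step t in [0, 1]; each gets its own noise draw.
    The cosine schedule below is an assumed, commonly used choice.
    """
    alpha = math.cos(0.5 * math.pi * t)  # signal scale (assumed schedule)
    sigma = math.sin(0.5 * math.pi * t)  # noise scale (assumed schedule)
    pyramid = []
    for factor in factors:
        clean = downsample(image, factor)
        noised = [
            [alpha * value + sigma * rng.gauss(0.0, 1.0) for value in row]
            for row in clean
        ]
        pyramid.append(noised)
    return pyramid


rng = random.Random(0)
# Hypothetical 8x8 grayscale image with values in [-1, 1].
image = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(8)]
pyramid = noisy_pyramid(image, factors=(1, 2, 4), t=0.5, rng=rng)
sizes = [len(level) for level in pyramid]  # one entry per resolution
```

In a setup like this, a single network would receive the whole pyramid and predict a denoised output for every resolution at once, which is the joint behavior the abstract describes at a high level.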