Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Abstract

Diffusion models are the main driver of progress in image and video synthesis, but they suffer from slow inference. Distillation methods, such as the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from multi-step to single-step inference, albeit at the cost of expensive and difficult optimization due to ADD's reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach that overcomes the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution, multi-aspect-ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.
