InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
September 12, 2023
Authors: Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu
cs.AI
Abstract
Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve the sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its reflow procedure, which straightens the trajectories of probability flows, refines the coupling between noise and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k and surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models InstaFlow. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 seconds, the best in the ≤ 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 seconds). Notably, training InstaFlow costs only 199 A100 GPU days. Project page: https://github.com/gnobitab/InstaFlow.
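
The reflow procedure described in the abstract admits a compact sketch. Below is a minimal, hypothetical PyTorch illustration, not the paper's actual implementation: it assumes a text-conditioned velocity network with the made-up interface `v_model(x, t, text_emb)`, and shows the straight-path velocity-matching loss of Rectified Flow, Euler simulation of the flow ODE to generate new (noise, image) pairs for reflow, and the one-step sampler that a straightened flow enables.

```python
import torch

def rectified_flow_loss(v_model, x0, x1, text_emb):
    """Velocity-matching loss on the straight path between noise x0 and image x1."""
    t = torch.rand(x0.shape[0], device=x0.device)   # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)                        # broadcast over (B, C, H, W)
    xt = t_ * x1 + (1.0 - t_) * x0                  # linear interpolation X_t
    target = x1 - x0                                # constant velocity of the straight line
    pred = v_model(xt, t, text_emb)                 # assumed velocity-net interface
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def simulate_flow(v_model, x0, text_emb, steps=100):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (image)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * v_model(x, t, text_emb)
    return x

@torch.no_grad()
def reflow_pairs(v_model, text_emb, shape, device="cuda"):
    """Reflow: draw fresh noise and pair it with the model's own output.

    Retraining rectified_flow_loss on these (x0, x1) pairs straightens the
    probability-flow trajectories and improves the noise-image coupling.
    """
    x0 = torch.randn(shape, device=device)
    x1 = simulate_flow(v_model, x0, text_emb)
    return x0, x1

@torch.no_grad()
def one_step_sample(v_model, x0, text_emb):
    """After reflow (and distillation), a single Euler step approximates the flow."""
    t = torch.zeros(x0.shape[0], device=x0.device)
    return x0 + v_model(x0, t, text_emb)
```

The key intuition is that after reflow the ODE trajectories are nearly straight, so a single Euler step from t = 0 closely matches the full multi-step integration; distillation into a student model then closes the remaining gap.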