InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

September 12, 2023
Authors: Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu
cs.AI

Abstract

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its reflow procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models InstaFlow. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 seconds, the best in the ≤ 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 second). Notably, training InstaFlow costs only 199 A100 GPU days. Project page: https://github.com/gnobitab/InstaFlow.
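To make the reflow procedure described in the abstract concrete, below is a minimal PyTorch sketch of the rectified-flow training objective (regressing a velocity field onto straight noise-to-image paths) and the reflow step that regenerates noise-image pairs with the current model. The names (`velocity_model`, `sample_ode`, `reflow_pairs`) and the unconditional, 4-D image-tensor setup are illustrative assumptions, not the paper's actual text-conditioned implementation.

```python
# Illustrative sketch only: names and the unconditional setup are assumptions,
# not the authors' implementation.
import torch

def rectified_flow_loss(velocity_model, x0, x1):
    # x0: noise batch, x1: image batch, both shaped (B, C, H, W).
    # Sample a random time t and form the straight-line interpolation
    # X_t = t * x1 + (1 - t) * x0; its constant velocity is x1 - x0.
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    x_t = t * x1 + (1 - t) * x0
    target = x1 - x0
    pred = velocity_model(x_t, t.flatten())
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_ode(velocity_model, x0, num_steps=100):
    # Euler integration of the probability-flow ODE dx/dt = v(x, t).
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + velocity_model(x, t) * dt
    return x

@torch.no_grad()
def reflow_pairs(velocity_model, x0):
    # Reflow: pair each noise x0 with the image the *current* model maps it to.
    # Retraining on these (x0, x1) pairs straightens the trajectories, so a
    # single Euler step x0 + velocity_model(x0, 0) becomes a viable one-step
    # generator that a student can then be distilled to match.
    x1 = sample_ode(velocity_model, x0)
    return x0, x1
```

After one or more reflow rounds the trajectories are nearly straight, which is what allows the distilled one-step student to approach multi-step sampling quality at a fraction of the inference cost.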