InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Abstract

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve sampling speed and reduce computational cost through distillation have not produced a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which has so far been applied only to small datasets. The core of Rectified Flow is its reflow procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates distillation with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find that reflow plays a critical role in improving the assignment between noises and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality. It achieves an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By using an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models InstaFlow. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 seconds, the best in the ≤ 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 seconds). Notably, training InstaFlow costs only 199 A100 GPU days. Project page: https://github.com/gnobitab/InstaFlow
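
As a rough illustration of why straightened trajectories matter for one-step sampling, the sketch below integrates two toy 1-D velocity fields with forward Euler. It is not the paper's pipeline: `euler_sample`, `straight_velocity`, and `curved_velocity` are hypothetical names chosen for this example, and forward Euler is assumed as the sampler. When every trajectory is a straight line traversed at constant speed, a single Euler step lands on the same endpoint as many steps; for a curved flow, one step is noticeably off.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

def straight_velocity(x, t):
    # Toy flow (an assumption for illustration): each point x0 moves along the
    # straight line x_t = x0 + t * (x0 + 1), so its velocity is constant in time.
    return (x + 1.0) / (1.0 + t)

def curved_velocity(x, t):
    # Toy flow with curved trajectories x_t = x0 * exp(t); the velocity changes
    # along each path, so coarse Euler steps accumulate error.
    return x

x0 = np.array([-1.0, 0.0, 2.0])

straight_1 = euler_sample(straight_velocity, x0, num_steps=1)
straight_100 = euler_sample(straight_velocity, x0, num_steps=100)
curved_1 = euler_sample(curved_velocity, x0, num_steps=1)
curved_100 = euler_sample(curved_velocity, x0, num_steps=100)

print("straight flow, max |1-step - 100-step|:", np.max(np.abs(straight_1 - straight_100)))
print("curved flow,   max |1-step - 100-step|:", np.max(np.abs(curved_1 - curved_100)))
```

In this toy setting the straight flow gives matching one-step and 100-step results up to floating-point error, while the curved flow does not. This is the intuition behind straightening trajectories before distilling to a one-step model.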
