Align Your Flow: Scaling Continuous-Time Flow Map Distillation

Abstract

Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades as the number of sampling steps increases, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step, and they remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance, which uses a low-quality model for guidance during distillation, improves performance, and that adversarial finetuning provides an additional boost with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.
