One-step Diffusion Models with f-Divergence Distribution Matching

Abstract

Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching, which is known to be mode-seeking. In this paper, we generalize the distribution matching approach using a novel f-divergence minimization framework, termed f-distill, that covers different divergences with different trade-offs in mode coverage and training variance. We derive the gradient of the f-divergence between the teacher and student distributions and show that it is expressed as the product of their score difference and a weighting function determined by their density ratio. When a less mode-seeking divergence is used, this weighting function naturally emphasizes samples with higher density in the teacher distribution. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative f-divergences, such as the forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using the Jensen-Shannon divergence, f-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
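To make the gradient statement in the abstract concrete, here is a minimal Python sketch under our own assumptions, not the authors' code. We write D_f(p || q) = E_{x~q}[f(p(x)/q(x))] with p the teacher and q the student, and r = p/q. Differentiating this through both the sample x = G_theta(z) and the density q_theta gives grad_theta D_f = E_z[grad_x(f(r) - r f'(r)) dG/dtheta], and grad_x(f(r) - r f'(r)) = -f''(r) r^2 (grad_x log p - grad_x log q). So the weighting function is assumed here to be h(r) = f''(r) r^2. With f(r) = -log r this gives h = 1, which is plain reverse-KL score distillation. Forward KL (h = r) and Jensen-Shannon (h = r/(1+r), generator taken up to a constant factor) both increase with r, so they put relatively more weight where the teacher density is high. The names F, H, h_numeric and toy_gradient, and the 1D Gaussian toy problem, are hypothetical illustrations.

```python
import math
import random

# Generator functions f for D_f(p || q) = E_{x~q}[f(p(x) / q(x))],
# with p = teacher density and q = student density (convention assumed here).
F = {
    "reverse_kl": lambda r: -math.log(r),        # D_f = KL(q || p)
    "forward_kl": lambda r: r * math.log(r),     # D_f = KL(p || q)
    "jensen_shannon": lambda r: r * math.log(r) - (1 + r) * math.log((1 + r) / 2),
}

# Closed-form weighting h(r) = f''(r) * r**2 for each generator above.
H = {
    "reverse_kl": lambda r: 1.0,
    "forward_kl": lambda r: r,
    "jensen_shannon": lambda r: r / (1 + r),
}


def h_numeric(f, r, eps=1e-4):
    """Approximate f''(r) * r**2 with a central second difference."""
    f2 = (f(r + eps) - 2 * f(r) + f(r - eps)) / eps**2
    return f2 * r**2


def toy_gradient(name, mu, n=100_000, seed=0):
    """Monte Carlo estimate of d D_f(p || q_mu) / d mu on a hypothetical 1D toy.

    Toy assumptions: teacher p = N(0, 1); student generator x = mu + z with
    z ~ N(0, 1), so q_mu = N(mu, 1) and dx/dmu = 1. The estimate uses
    grad = -E_z[h(r(x)) * (score_p(x) - score_q(x)) * dx/dmu].
    """
    rng = random.Random(seed)
    h = H[name]
    total = 0.0
    for _ in range(n):
        x = mu + rng.gauss(0.0, 1.0)
        log_r = -0.5 * x**2 + 0.5 * (x - mu) ** 2   # log p(x) - log q(x)
        score_diff = -x + (x - mu)                  # d/dx log p - d/dx log q
        total += -h(math.exp(log_r)) * score_diff
    return total / n


if __name__ == "__main__":
    for name in F:
        weights = [round(h_numeric(F[name], r), 4) for r in (0.25, 1.0, 4.0)]
        print(name, weights, round(toy_gradient(name, mu=0.5), 4))
```

Running the script prints, for each divergence, h(r) at r = 0.25, 1 and 4 and the toy gradient at mu = 0.5. In this toy both KL directions equal mu^2/2, so both gradient estimates should be close to 0.5. The Jensen-Shannon estimate is smaller because h(r) < 1 everywhere. The closed-form Gaussians sidestep the hard part of a real distillation setup, where the student score and the density ratio would have to be estimated from samples.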
