Improved Distribution Matching Distillation for Fast Image Synthesis

Abstract

Recent approaches have shown promise in distilling diffusion models into efficient one-step generators. Among them, Distribution Matching Distillation (DMD) produces one-step generators that match their teacher in distribution, without enforcing a one-to-one correspondence with the teacher's sampling trajectories. However, to ensure stable training, DMD requires an additional regression loss computed on a large set of noise-image pairs that the teacher generates with many steps of a deterministic sampler. This is costly for large-scale text-to-image synthesis and limits the student's quality by tying it too closely to the teacher's original sampling paths. We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. We show that the resulting instability arises because the fake critic does not accurately estimate the distribution of generated samples, and we propose a two time-scale update rule as a remedy. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images. This lets us train the student model on real data, which mitigates the imperfect real score estimation from the teacher model and enhances quality. Lastly, we modify the training procedure to enable multi-step sampling. We identify the training-inference input mismatch problem in this setting and address it by simulating inference-time generator samples during training. Taken together, our improvements set new benchmarks in one-step image generation, with FID scores of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the original teacher despite a 500× reduction in inference cost. Further, we show that our approach can generate megapixel images by distilling SDXL, demonstrating exceptional visual quality among few-step methods.
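The two time-scale update rule mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two time scales are realized, so the code below assumes one plausible reading: the fake critic is updated several times for every generator update, so that it tracks the generator's output distribution more closely. The function name `train_two_time_scale`, the step callbacks, and the default ratio of 5 are hypothetical and chosen only for illustration; they are not taken from the abstract.

```python
from typing import Callable, List


def train_two_time_scale(
    generator_step: Callable[[], None],
    critic_step: Callable[[], None],
    num_generator_updates: int,
    critic_updates_per_generator_update: int = 5,  # assumed ratio, illustrative only
) -> List[str]:
    """Run a toy training schedule in which the critic updates more often
    than the generator. Returns the ordered log of updates performed."""
    log: List[str] = []
    for _ in range(num_generator_updates):
        # Fast time scale: let the fake critic catch up with the
        # current generator distribution before the generator moves.
        for _ in range(critic_updates_per_generator_update):
            critic_step()
            log.append("critic")
        # Slow time scale: one generator update per group of critic updates.
        generator_step()
        log.append("generator")
    return log


if __name__ == "__main__":
    # Placeholder steps; a real setup would run optimizer updates here.
    schedule = train_two_time_scale(lambda: None, lambda: None, num_generator_updates=2)
    print(schedule)
```

In a real training loop, `critic_step` and `generator_step` would each compute a loss and apply an optimizer update; they are left as placeholders here so that only the update schedule is shown.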
