Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Abstract

Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice in CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, applying a small dropout rate to the conditioning. However, we observe that jointly learning the unconditional noise with limited bandwidth during training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions are a major cause of degraded conditional generation quality. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with the noise predicted by the base model can significantly improve conditional generation. Furthermore, we show that the unconditional noise can also be taken from a diffusion model other than the one from which the fine-tuned model was trained. We experimentally verify our claims with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
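The replacement described in the abstract can be illustrated with a minimal sketch. It assumes the common CFG convention eps = eps_uncond + s * (eps_cond - eps_uncond); the abstract does not state which weighting convention the authors use. The names `cfg_combine`, `guided_noise`, `toy_finetuned`, and `toy_base` are hypothetical, and the toy functions are stand-ins for real noise-prediction networks, not any library's API.

```python
import numpy as np


def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Combine noise predictions using the assumed CFG convention."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def guided_noise(cond_model, uncond_model, x_t, t, cond, guidance_scale):
    """Guided noise with the unconditional branch taken from `uncond_model`.

    Passing the fine-tuned model for both arguments gives standard CFG;
    passing a base model as `uncond_model` gives the replacement variant.
    """
    eps_cond = cond_model(x_t, t, cond)      # conditional prediction
    eps_uncond = uncond_model(x_t, t, None)  # unconditional prediction (cond dropped)
    return cfg_combine(eps_cond, eps_uncond, guidance_scale)


# Hypothetical toy networks: each maps (x_t, t, cond) to a noise estimate.
def toy_finetuned(x_t, t, cond):
    shift = 0.0 if cond is None else float(cond)
    return 0.5 * x_t + shift


def toy_base(x_t, t, cond):
    # The base model is unconditional and ignores `cond`.
    return 0.4 * x_t


x_t = np.ones((2, 3))
standard = guided_noise(toy_finetuned, toy_finetuned, x_t, t=10, cond=1.0, guidance_scale=3.0)
replaced = guided_noise(toy_finetuned, toy_base, x_t, t=10, cond=1.0, guidance_scale=3.0)
```

Only the source of the unconditional term changes between the two calls; the sampler and the conditional branch are untouched. In practice the base and fine-tuned networks may expect different inputs, so an adapter around the base model may be needed, which this sketch omits.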
