

Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

March 26, 2025
作者: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
cs.AI

Abstract

Classifier-Free Guidance (CFG) is a fundamental technique for training conditional diffusion models. The common practice in CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for the conditioning. However, we observe that jointly learning unconditional noise with limited bandwidth during training yields poor priors for the unconditional case. More importantly, these poor unconditional noise predictions are a significant cause of degraded conditional generation quality. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained from can be used for this unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
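The core idea can be sketched in a few lines. A minimal illustration, assuming the standard CFG combination rule and treating the two denoisers as abstract callables (the function and parameter names below are illustrative, not taken from the paper's code):

```python
# Standard CFG extrapolates from the unconditional prediction toward the
# conditional one. The paper's observation: the fine-tuned model's unconditional
# branch is often a poor prior, so substitute the base model's instead.

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG combination of unconditional and conditional noise."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def cfg_with_base_prior(finetuned_model, base_model, x_t, t, cond,
                        guidance_scale=7.5):
    """CFG where the unconditional branch comes from the base model's prior
    rather than the fine-tuned model's (degraded) unconditional prediction."""
    eps_cond = finetuned_model(x_t, t, cond)  # fine-tuned conditional prediction
    eps_uncond = base_model(x_t, t)           # base model's unconditional prior
    return cfg_noise(eps_uncond, eps_cond, guidance_scale)
```

In a real sampler, `finetuned_model` and `base_model` would be the fine-tuned conditional network and the base diffusion model it was fine-tuned from (or, per the paper, another diffusion model), evaluated at the same noisy latent `x_t` and timestep `t`.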

