Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
March 26, 2025
Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
cs.AI
Abstract
Classifier-Free Guidance (CFG) is a fundamental technique for training conditional diffusion models. The common practice in CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, applying a small dropout rate to the conditioning input. However, we observe that jointly learning unconditional noise with limited bandwidth during training yields poor priors for the unconditional case. More importantly, these poor unconditional noise predictions severely degrade the quality of conditional generation. Motivated by the fact that most CFG-based conditional models are trained by fine-tuning a base model that has better unconditional generation, we first show that simply replacing the unconditional noise in CFG with the noise predicted by the base model can significantly improve conditional generation. Furthermore, we show that the diffusion model used for unconditional noise replacement need not be the base model from which the fine-tuned model was derived. We experimentally verify our claim on a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
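
For reference, a common formulation of the CFG noise estimate at timestep $t$ is the following, where $w$ is the guidance scale, $c$ the conditioning input, $\varnothing$ the null condition, and $\epsilon_\theta$ the fine-tuned model's noise prediction (the notation here is ours; the abstract does not fix symbols):

$$\tilde{\epsilon}_t = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$$

The modification described above swaps both occurrences of the unconditional term for the base model's prediction $\epsilon_\psi(x_t, \varnothing)$, leaving the conditional branch untouched:

$$\tilde{\epsilon}_t = \epsilon_\psi(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\psi(x_t, \varnothing)\bigr)$$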
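As a concrete illustration, below is a minimal PyTorch-style sketch of one guided noise estimate with the unconditional branch routed to the base model. It assumes both models expose the same noise-prediction interface `eps = model(x_t, t, cond)`; the function name, signature, and default guidance scale are illustrative, not taken from the paper.

```python
import torch

@torch.no_grad()
def guided_noise(finetuned_model, base_model, x_t, t, cond, null_cond, w=7.5):
    """Classifier-free guidance with the unconditional prediction
    replaced by the base model's, as the abstract proposes.

    Both models are assumed to share the interface
        eps = model(x_t, t, conditioning)
    where eps has the same shape as x_t.
    """
    eps_cond = finetuned_model(x_t, t, cond)    # conditional branch: fine-tuned model
    eps_uncond = base_model(x_t, t, null_cond)  # unconditional prior: base model
    # Standard CFG combination; only the source of eps_uncond changes.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Note that at w = 1 this reduces to the fine-tuned model's conditional prediction alone, so the swap only takes effect when guidance is active (w > 1).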