Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
March 26, 2025
Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
cs.AI
Abstract
Classifier-Free Guidance (CFG) is a fundamental technique for training conditional diffusion models. The common practice in CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, applying a small dropout rate to the conditioning input. However, we observe that jointly learning unconditional noise with limited bandwidth during training yields poor priors for the unconditional case. More importantly, these poor unconditional noise predictions severely degrade the quality of conditional generation. Motivated by the fact that most CFG-based conditional models are trained by fine-tuning a base model that has better unconditional generation, we first show that simply replacing the unconditional noise in CFG with the noise predicted by the base model can significantly improve conditional generation. Furthermore, we show that the diffusion model used for unconditional noise replacement need not be the base model from which the fine-tuned model was derived. We experimentally verify our claim on a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
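
For reference, a common formulation of the CFG noise estimate at timestep $t$ is the following, where $w$ is the guidance scale, $c$ the conditioning input, $\varnothing$ the null condition, and $\epsilon_\theta$ the fine-tuned model's noise prediction (the notation here is ours; the abstract does not fix symbols):

$$\tilde{\epsilon}_t = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$$

The modification described above swaps both occurrences of the unconditional term for the base model's prediction $\epsilon_\psi(x_t, \varnothing)$, leaving the conditional branch untouched:

$$\tilde{\epsilon}_t = \epsilon_\psi(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\psi(x_t, \varnothing)\bigr)$$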
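As a concrete illustration, below is a minimal PyTorch-style sketch of one guided noise estimate with the unconditional branch routed to the base model. It assumes both models expose the same noise-prediction interface `eps = model(x_t, t, cond)`; the function name, signature, and default guidance scale are illustrative, not taken from the paper.

```python
import torch

@torch.no_grad()
def guided_noise(finetuned_model, base_model, x_t, t, cond, null_cond, w=7.5):
    """Classifier-free guidance with the unconditional prediction
    replaced by the base model's, as the abstract proposes.

    Both models are assumed to share the interface
        eps = model(x_t, t, conditioning)
    where eps has the same shape as x_t.
    """
    eps_cond = finetuned_model(x_t, t, cond)    # conditional branch: fine-tuned model
    eps_uncond = base_model(x_t, t, null_cond)  # unconditional prior: base model
    # Standard CFG combination; only the source of eps_uncond changes.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Note that at w = 1 this reduces to the fine-tuned model's conditional prediction alone, so the swap only takes effect when guidance is active (w > 1).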