VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL
May 21, 2025
作者: Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan
cs.AI
Abstract
Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting the expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency, and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.
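To make the dense-supervision idea concrete, the sketch below illustrates one denoising-step loss that combines a learned value estimate of an intermediate state with a KL-style penalty toward a frozen pretrained model. It is a minimal sketch, not the paper's implementation: the network architectures (ValueNet, ToyDenoiser), the one-step Euler-style state update, the squared-difference KL surrogate, and the kl_weight constant are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a VARD-style update:
# a learned value network scores intermediate states so every denoising step
# receives a dense reward signal, while a KL-style penalty keeps the
# fine-tuned denoiser close to the frozen pretrained one.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Predicts the expected final reward from an intermediate state x_t."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x_t, t):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # normalized timestep feature
        return self.net(torch.cat([x_t, t_feat], dim=-1)).squeeze(-1)

class ToyDenoiser(nn.Module):
    """Stand-in noise-prediction network (purely illustrative)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t, t):
        return self.net(x_t)

def dense_value_kl_step(policy, pretrained, value_net, x_t, t, kl_weight=0.01):
    """One fine-tuning loss: maximize the predicted reward of the next state
    while penalizing divergence from the pretrained model's prediction."""
    eps_policy = policy(x_t, t)           # fine-tuned noise prediction
    with torch.no_grad():
        eps_ref = pretrained(x_t, t)      # frozen pretrained prediction

    # Toy Euler-style update; gradients flow into the policy through x_next.
    x_next = x_t - 0.1 * eps_policy

    # Dense supervision: score the intermediate state, not only the final sample.
    # (In practice the value network would be trained separately to regress
    # final rewards from intermediate states.)
    value = value_net(x_next, t - 1).mean()

    # For Gaussian denoising policies with shared variance, the per-step KL
    # reduces (up to scaling) to a squared difference of mean predictions.
    kl = ((eps_policy - eps_ref) ** 2).mean()

    return -value + kl_weight * kl

# Example usage on random data (shapes and values are arbitrary).
policy, pretrained, value_net = ToyDenoiser(), ToyDenoiser(), ValueNet()
x_t = torch.randn(8, 16)
t = torch.full((8,), 500, dtype=torch.long)
loss = dense_value_kl_step(policy, pretrained, value_net, x_t, t)
loss.backward()  # gradients reach the policy through the dense value signal
```

Because the value network supplies a differentiable score at every intermediate state, the policy receives per-step guidance even when the underlying task reward itself is non-differentiable or only available at the end of the trajectory.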