VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL
May 21, 2025
Authors: Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan
cs.AI
Abstract
Diffusion models have emerged as powerful generative tools across various
domains, yet tailoring pre-trained models to exhibit specific desirable
properties remains challenging. While reinforcement learning (RL) offers a
promising solution, current methods struggle to simultaneously achieve stable,
efficient fine-tuning and support non-differentiable rewards. Furthermore,
their reliance on sparse rewards provides inadequate supervision during
intermediate steps, often resulting in suboptimal generation quality. To
address these limitations, dense and differentiable signals are required
throughout the diffusion process. Hence, we propose VAlue-based Reinforced
Diffusion (VARD): a novel approach that first learns a value function
predicting the expectation of rewards from intermediate states, and subsequently uses
this value function with KL regularization to provide dense supervision
throughout the generation process. Our method maintains proximity to the
pretrained model while enabling effective and stable training via
backpropagation. Experimental results demonstrate that our approach facilitates
better trajectory guidance, improves training efficiency and extends the
applicability of RL to diffusion models optimized for complex,
non-differentiable reward functions.
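The two-stage recipe described in the abstract (first learn a value function over intermediate denoising states, then fine-tune the diffusion model by backpropagating that value together with a KL penalty toward the pretrained model) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the network architecture, the Gaussian per-step KL with a shared fixed sigma, and names such as `ValueNet` and `beta_kl` are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the paper's code) of the two stages named in
# the abstract: (1) regress a value network V_phi(x_t, t) onto the final,
# possibly non-differentiable reward r(x_0); (2) use it as a dense,
# differentiable signal at every denoising step, with a KL penalty keeping the
# fine-tuned model close to the pretrained one.
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Predicts the expected final reward from an intermediate state x_t."""

    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by concatenating it to the flattened state.
        return self.net(torch.cat([x_t, t[:, None].float()], dim=-1)).squeeze(-1)


def value_regression_loss(value_net, x_t, t, final_reward):
    # Stage 1: fit V_phi(x_t, t) to the reward observed at the end of the
    # trajectory; the reward itself never needs to be differentiable.
    return ((value_net(x_t, t) - final_reward) ** 2).mean()


def dense_finetune_loss(value_net, mean_theta, mean_ref, sigma, x_t, t, beta_kl=0.1):
    # Stage 2: at each denoising step, maximize the predicted value of the
    # sampled next state (dense supervision via backpropagation through the
    # reparameterized sample) and penalize the KL between the fine-tuned and
    # pretrained per-step Gaussians. With a shared fixed sigma, that KL reduces
    # to a scaled squared difference of the predicted means.
    x_prev = mean_theta + sigma * torch.randn_like(mean_theta)
    value_term = -value_net(x_prev, t - 1).mean()
    kl_term = ((mean_theta - mean_ref) ** 2).sum(-1).mean() / (2 * sigma ** 2)
    return value_term + beta_kl * kl_term


# Hypothetical usage on flattened states of dimension D with T denoising steps:
# vnet = ValueNet(D)
# loss_v = value_regression_loss(vnet, x_t, t, final_reward)
```

In this reading, the dense signal comes from querying the value network at every step rather than only at the final sample, while the KL term is what keeps the fine-tuned model near the pretrained one; the value network itself would typically be frozen during the second stage.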