Training Diffusion Models with Reinforcement Learning
May 22, 2023
Authors: Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
cs.AI
Abstract
Diffusion models are a class of flexible generative models trained with an
approximation to the log-likelihood objective. However, most use cases of
diffusion models are not concerned with likelihoods, but instead with
downstream objectives such as human-perceived image quality or drug
effectiveness. In this paper, we investigate reinforcement learning methods for
directly optimizing diffusion models for such objectives. We describe how
posing denoising as a multi-step decision-making problem enables a class of
policy gradient algorithms, which we refer to as denoising diffusion policy
optimization (DDPO), that are more effective than alternative reward-weighted
likelihood approaches. Empirically, DDPO is able to adapt text-to-image
diffusion models to objectives that are difficult to express via prompting,
such as image compressibility, and those derived from human feedback, such as
aesthetic quality. Finally, we show that DDPO can improve prompt-image
alignment using feedback from a vision-language model without the need for
additional data collection or human annotation.
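To make the "denoising as multi-step decision-making" idea concrete, below is a minimal, self-contained sketch of a REINFORCE-style policy-gradient update of the kind the abstract describes: each denoising step is treated as a stochastic action, the log-probabilities of the sampled transitions are accumulated, and their sum is weighted by a terminal reward on the final sample. This is not the authors' implementation; the toy 1-D denoiser, the stand-in reward, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a DDPO-style (score-function / REINFORCE) update on a
# toy 1-D "diffusion" sampler. Everything here is a simplified assumption, not
# the paper's actual model, reward, or training configuration.
import torch
import torch.nn as nn

T = 10        # number of denoising steps (assumed)
SIGMA = 0.1   # fixed per-step sampling noise scale (assumed)

class ToyDenoiser(nn.Module):
    """Predicts the mean of p_theta(x_{t-1} | x_t, t) for 1-D samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x_t, t):
        t_feat = torch.full_like(x_t, t / T)          # normalized timestep feature
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def reward(x0):
    """Stand-in reward: prefer final samples near +1 (placeholder for e.g.
    compressibility or an aesthetic score)."""
    return -(x0.squeeze(-1) - 1.0) ** 2

model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x_t = torch.randn(64, 1)                          # start from pure noise
    log_probs = torch.zeros(64)
    for t in reversed(range(1, T + 1)):
        dist = torch.distributions.Normal(model(x_t, t), SIGMA)
        x_prev = dist.sample()                        # one denoising "action"
        log_probs = log_probs + dist.log_prob(x_prev).squeeze(-1)
        x_t = x_prev
    r = reward(x_t)                                   # reward on the final sample only
    advantage = r - r.mean()                          # simple batch-mean baseline
    loss = -(advantage.detach() * log_probs).mean()   # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design point mirrored here is that the reward is applied to the whole denoising trajectory through the summed per-step log-probabilities, rather than reweighting a likelihood objective on final samples.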