Training Diffusion Models with Reinforcement Learning
May 22, 2023
Authors: Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
cs.AI
Abstract
Diffusion models are a class of flexible generative models trained with an
approximation to the log-likelihood objective. However, most use cases of
diffusion models are not concerned with likelihoods, but instead with
downstream objectives such as human-perceived image quality or drug
effectiveness. In this paper, we investigate reinforcement learning methods for
directly optimizing diffusion models for such objectives. We describe how
posing denoising as a multi-step decision-making problem enables a class of
policy gradient algorithms, which we refer to as denoising diffusion policy
optimization (DDPO), that are more effective than alternative reward-weighted
likelihood approaches. Empirically, DDPO is able to adapt text-to-image
diffusion models to objectives that are difficult to express via prompting,
such as image compressibility, and those derived from human feedback, such as
aesthetic quality. Finally, we show that DDPO can improve prompt-image
alignment using feedback from a vision-language model without the need for
additional data collection or human annotation.
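To make the "denoising as multi-step decision-making" idea concrete, below is a minimal, self-contained sketch of a REINFORCE-style policy-gradient update of the kind the abstract describes: each denoising step is treated as a stochastic action, the log-probabilities of the sampled transitions are accumulated, and their sum is weighted by a terminal reward on the final sample. This is not the authors' implementation; the toy 1-D denoiser, the stand-in reward, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a DDPO-style (score-function / REINFORCE) update on a
# toy 1-D "diffusion" sampler. Everything here is a simplified assumption, not
# the paper's actual model, reward, or training configuration.
import torch
import torch.nn as nn

T = 10        # number of denoising steps (assumed)
SIGMA = 0.1   # fixed per-step sampling noise scale (assumed)

class ToyDenoiser(nn.Module):
    """Predicts the mean of p_theta(x_{t-1} | x_t, t) for 1-D samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x_t, t):
        t_feat = torch.full_like(x_t, t / T)          # normalized timestep feature
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def reward(x0):
    """Stand-in reward: prefer final samples near +1 (placeholder for e.g.
    compressibility or an aesthetic score)."""
    return -(x0.squeeze(-1) - 1.0) ** 2

model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x_t = torch.randn(64, 1)                          # start from pure noise
    log_probs = torch.zeros(64)
    for t in reversed(range(1, T + 1)):
        dist = torch.distributions.Normal(model(x_t, t), SIGMA)
        x_prev = dist.sample()                        # one denoising "action"
        log_probs = log_probs + dist.log_prob(x_prev).squeeze(-1)
        x_t = x_prev
    r = reward(x_t)                                   # reward on the final sample only
    advantage = r - r.mean()                          # simple batch-mean baseline
    loss = -(advantage.detach() * log_probs).mean()   # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design point mirrored here is that the reward is applied to the whole denoising trajectory through the summed per-step log-probabilities, rather than reweighting a likelihood objective on final samples.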