

Aligning Text-to-Image Diffusion Models with Reward Backpropagation

October 5, 2023
Authors: Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki
cs.AI

Abstract

Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Because of their unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, which is notorious for the high variance of its gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While a naive implementation of such backpropagation would require prohibitive memory resources to store the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility, and controllability of the number of objects present, as well as their combinations. We show that AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and visualization results are available at https://align-prop.github.io/.
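
The abstract outlines the core mechanics of AlignProp: keep the full denoising chain inside the autograd graph, score the result with a differentiable reward, and backpropagate that reward into a small set of adapter weights, using gradient checkpointing so the activations of every denoising step need not be stored. The sketch below is a minimal, self-contained illustration of that idea in PyTorch; the tiny MLP denoiser, the `denoise_step` sampler, the `reward_fn`, and all hyperparameters are placeholder assumptions, not the authors' implementation.

```python
# Minimal sketch of reward backpropagation through a denoising loop.
# A toy MLP stands in for a frozen text-to-image U-Net; only a small
# "adapter" layer is trained, and gradient checkpointing bounds memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyDenoiser(nn.Module):
    """Stand-in for the frozen denoiser; only the adapter is trainable."""
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                      nn.Linear(128, dim))
        for p in self.backbone.parameters():
            p.requires_grad_(False)            # frozen base weights
        self.adapter = nn.Linear(dim, dim)     # trainable adapter (LoRA-like role)
        nn.init.zeros_(self.adapter.weight)
        nn.init.zeros_(self.adapter.bias)

    def forward(self, x, t):
        t_embed = t.expand(x.shape[0], 1)
        eps = self.backbone(torch.cat([x, t_embed], dim=-1))
        return eps + self.adapter(eps)         # adapter modulates the prediction

def denoise_step(model, x, t, step_size=0.1):
    """One denoising step; wrapped in checkpoint() inside the sampling loop."""
    return x - step_size * model(x, t)

def sample_with_grad(model, steps=10, dim=64, batch=4):
    """Run the sampler while keeping a graph from reward back to the adapter."""
    x = torch.randn(batch, dim)
    for i in reversed(range(steps)):
        t = torch.full((1,), float(i) / steps)
        # Gradient checkpointing: recompute this step's activations during
        # the backward pass instead of storing them for every step.
        x = checkpoint(denoise_step, model, x, t, use_reentrant=False)
    return x

def reward_fn(x):
    """Placeholder differentiable reward (an aesthetic scorer in practice)."""
    return -(x ** 2).mean()

model = TinyDenoiser()
opt = torch.optim.AdamW(model.adapter.parameters(), lr=1e-3)

for it in range(100):
    sample = sample_with_grad(model)
    loss = -reward_fn(sample)                  # maximize the reward
    opt.zero_grad()
    loss.backward()                            # end-to-end reward backpropagation
    opt.step()
```

In the actual method described by the paper, the denoiser would be a pretrained text-to-image diffusion model with low-rank adapter (LoRA) modules, and the reward a differentiable scorer such as an aesthetic or image-text alignment model.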