Aligning Text-to-Image Diffusion Models with Reward Backpropagation
October 5, 2023
Authors: Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki
cs.AI
Abstract
Text-to-image diffusion models have recently emerged at the forefront of
image generation, powered by very large-scale unsupervised or weakly supervised
text-to-image training datasets. Due to their unsupervised training,
controlling their behavior in downstream tasks, such as maximizing
human-perceived image quality, image-text alignment, or ethical image
generation, is difficult. Recent works finetune diffusion models to downstream
reward functions using vanilla reinforcement learning, an approach notorious for
the high variance of its gradient estimators. In this paper, we propose AlignProp, a
method that aligns diffusion models to downstream reward functions using
end-to-end backpropagation of the reward gradient through the denoising
process. While a naive implementation of such backpropagation would require
prohibitive memory resources for storing the partial derivatives of modern
text-to-image models, AlignProp finetunes low-rank adapter weight modules and
uses gradient checkpointing to render its memory usage viable. We test
AlignProp in finetuning diffusion models to various objectives, such as
image-text semantic alignment, aesthetics, compressibility, and controllability
of the number of objects present, as well as their combinations. We show
AlignProp achieves higher rewards in fewer training steps than alternatives,
while being conceptually simpler, making it a straightforward choice for
optimizing diffusion models for differentiable reward functions of interest.
Code and visualization results are available at https://align-prop.github.io/.
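
To make the approach concrete, here is a minimal sketch of reward backpropagation through the denoising chain, assuming Stable Diffusion v1.5 loaded through Hugging Face diffusers (with its peft LoRA integration). The reward function, LoRA rank, learning rate, and step count below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of AlignProp-style reward backpropagation. Assumes recent
# `diffusers` with peft integration; `reward_fn` is a hypothetical stand-in
# for any differentiable reward model (aesthetics, image-text alignment, ...).
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from peft import LoraConfig

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Freeze everything, then inject trainable low-rank adapters into the UNet
# attention projections; only the LoRA weights receive gradient updates.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.unet.requires_grad_(False)
lora_config = LoraConfig(r=4, lora_alpha=4,
                         target_modules=["to_q", "to_k", "to_v", "to_out.0"])
pipe.unet.add_adapter(lora_config)

# Gradient checkpointing trades recomputation for activation memory, which is
# what makes backpropagating through every denoising step feasible.
pipe.unet.enable_gradient_checkpointing()

params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-4)


def reward_fn(images):
    # Placeholder differentiable reward: mean brightness. Swap in a real
    # reward model for any objective of interest.
    return images.mean()


def alignprop_step(prompt, num_steps=10):
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True,
                            return_tensors="pt").input_ids.to(device)
    text_emb = pipe.text_encoder(tokens)[0]

    pipe.scheduler.set_timesteps(num_steps, device=device)
    latents = torch.randn(1, 4, 64, 64, device=device)
    latents = latents * pipe.scheduler.init_noise_sigma
    # Put the initial latents in the autograd graph so checkpointed blocks
    # propagate gradients from the very first denoising step.
    latents = latents.requires_grad_(True)

    # Crucially, the denoising loop runs *with* gradients enabled, so the
    # reward gradient flows end to end back into the LoRA weights.
    # (Classifier-free guidance is omitted for brevity.)
    for t in pipe.scheduler.timesteps:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    loss = -reward_fn(images)  # gradient ascent on the reward
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Freezing the base weights and training only the low-rank adapters keeps the trainable parameter count and optimizer state small, while gradient checkpointing supplies the memory savings that full backpropagation through the denoising chain would otherwise make prohibitive.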