Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
January 11, 2024
Authors: Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang
cs.AI
Abstract
Recent works demonstrate that using reinforcement learning (RL) with quality
rewards can enhance the quality of generated images in text-to-image (T2I)
generation. However, a simple aggregation of multiple rewards may cause
over-optimization in certain metrics and degradation in others, and it is
challenging to manually find the optimal weights. An effective strategy to
jointly optimize multiple rewards in RL for T2I generation is highly desirable.
This paper introduces Parrot, a novel multi-reward RL framework for T2I
generation. Through batch-wise Pareto-optimal selection, Parrot
automatically identifies the optimal trade-off among different rewards during
RL optimization of T2I generation. Additionally, Parrot employs a joint
optimization approach for the T2I model and the prompt expansion network,
facilitating the generation of quality-aware text prompts, thus further
enhancing the final image quality. To counteract the potential catastrophic
forgetting of the original user prompt due to prompt expansion, we introduce
original-prompt-centered guidance at inference time, ensuring that the
generated image remains faithful to the user input. Extensive experiments and a
user study demonstrate that Parrot outperforms several baseline methods across
various quality criteria, including aesthetics, human preference, image
sentiment, and text-image alignment.
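
To make the idea of batch-wise Pareto-optimal selection concrete, the following is a minimal sketch of picking the non-dominated samples in a batch from their per-reward scores. It is an illustration under stated assumptions, not the authors' implementation: the function name `non_dominated_mask` is hypothetical, every reward is assumed to be "higher is better", and the abstract does not specify how the selected samples enter the RL update.

```python
import numpy as np


def non_dominated_mask(rewards: np.ndarray) -> np.ndarray:
    """Mark the non-dominated (Pareto-optimal) rows of a reward matrix.

    rewards: array of shape (batch_size, num_rewards), higher is better.
    Sample j dominates sample i if j is >= i on every reward and
    strictly > i on at least one reward.
    (Hypothetical helper for illustration only.)
    """
    # ge[i, j]: sample j is at least as good as sample i on all rewards.
    ge = (rewards[None, :, :] >= rewards[:, None, :]).all(axis=-1)
    # gt[i, j]: sample j is strictly better than sample i on some reward.
    gt = (rewards[None, :, :] > rewards[:, None, :]).any(axis=-1)
    # A sample is dominated if any other sample satisfies both conditions.
    dominated = (ge & gt).any(axis=1)
    return ~dominated


# Toy batch of 4 images scored on 2 rewards (made-up numbers).
scores = np.array([
    [0.9, 0.2],
    [0.5, 0.5],
    [0.4, 0.4],  # dominated by [0.5, 0.5]
    [0.2, 0.9],
])
mask = non_dominated_mask(scores)
```

In this toy batch the third sample is worse than the second on both rewards, so it is the only one excluded; the other three each represent a different trade-off and none is strictly better than another. Selecting on dominance rather than on a weighted sum avoids having to choose the reward weights by hand.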