The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
November 25, 2025
Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization, which constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To mitigate these biases, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
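To make the adversarial-reward idea in the abstract concrete, below is a minimal PyTorch sketch, assuming a toy reward model trained with reference images as positives and generator samples as negatives, plus a stand-in feature extractor in place of a frozen DINO backbone for the dense visual reward. All names here (`RewardModel`, `reward_model_loss`, `dino_style_reward`) are hypothetical; the paper's actual Adv-GRPO objective, architectures, and GRPO generator update are not reproduced.

```python
# Conceptual sketch only; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Small conv net scoring how 'reference-like' an image is."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one unnormalized score per image


def reward_model_loss(rm: RewardModel, ref: torch.Tensor, gen: torch.Tensor) -> torch.Tensor:
    """Discriminator-style loss: reference images are positives, generated images negatives."""
    logits = torch.cat([rm(ref), rm(gen)])
    labels = torch.cat([torch.ones(len(ref)), torch.zeros(len(gen))])
    return F.binary_cross_entropy_with_logits(logits, labels)


def dino_style_reward(feats: nn.Module, gen: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Dense visual reward: cosine similarity between generated and mean reference features.

    `feats` stands in for a frozen vision foundation model such as DINO.
    """
    with torch.no_grad():
        f_gen = F.normalize(feats(gen), dim=-1)
        f_ref = F.normalize(feats(ref), dim=-1)
    return (f_gen * f_ref.mean(dim=0, keepdim=True)).sum(dim=-1)


if __name__ == "__main__":
    rm = RewardModel()
    opt = torch.optim.AdamW(rm.parameters(), lr=1e-4)
    # Stand-in tensors; in practice `gen` would come from the generator being trained.
    ref = torch.rand(4, 3, 64, 64)
    gen = torch.rand(4, 3, 64, 64)

    # One adversarial reward-model update (the alternating GRPO generator update is omitted).
    loss = reward_model_loss(rm, ref, gen)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Combine the learned score with the dense feature reward for the RL signal.
    feats = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy stand-in for DINO
    total_reward = rm(gen).sigmoid() + dino_style_reward(feats, gen, ref)
    print(total_reward.shape)  # torch.Size([4])
```

In a full setup, the toy `feats` module would be replaced by a frozen vision foundation model and `gen` by samples from the image generator, with the reward model and generator updated in alternation as the abstract describes.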