

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

November 25, 2025
Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
cs.AI

Abstract

A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
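As a rough illustration of the two ideas described in the abstract (an adversarially updated reward model trained with reference images as positives and generator samples as negatives, and a dense visual reward from a frozen vision foundation model such as DINO), the minimal PyTorch sketch below shows one possible shape of each component. It is not the authors' released Adv-GRPO code: the names RewardHead, adversarial_reward_step, and dino_visual_reward, the small MLP head, and the cosine-similarity reward are illustrative assumptions, and the DINO checkpoint is loaded through torch.hub.

```python
# Hedged sketch of the reward ideas in the abstract, not the released Adv-GRPO code.
# Assumes images are (B, 3, 224, 224) tensors, resized and ImageNet-normalized.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen vision foundation model (DINO ViT-S/16) used to embed images.
# torch.hub downloads the checkpoint on first use.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()
for p in dino.parameters():
    p.requires_grad_(False)


class RewardHead(nn.Module):
    """Illustrative classifier on top of DINO features: reference images are
    treated as positives and generator samples as negatives."""

    def __init__(self, feat_dim: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats).squeeze(-1)  # real-vs-generated logit


reward_head = RewardHead()
opt_r = torch.optim.AdamW(reward_head.parameters(), lr=1e-4)


@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """DINO [CLS] embeddings, shape (B, 384)."""
    return dino(images)


def adversarial_reward_step(ref_images: torch.Tensor, gen_images: torch.Tensor) -> float:
    """One reward-model update: push reference images toward label 1 and the
    current generator's samples toward label 0, so the reward keeps adapting
    as the generator improves (one way to mitigate reward hacking)."""
    feats = torch.cat([embed(ref_images), embed(gen_images)], dim=0)
    labels = torch.cat([torch.ones(ref_images.size(0)), torch.zeros(gen_images.size(0))])
    loss = F.binary_cross_entropy_with_logits(reward_head(feats), labels)
    opt_r.zero_grad()
    loss.backward()
    opt_r.step()
    return loss.item()


def dino_visual_reward(gen_images: torch.Tensor, ref_images: torch.Tensor) -> torch.Tensor:
    """Dense visual reward from the image itself: cosine similarity between
    DINO features of generated and reference images, rather than a single
    scalar from a fixed preference model."""
    return F.cosine_similarity(embed(gen_images), embed(ref_images), dim=-1)
```

In a full GRPO-style loop, generator samples would be scored with the learned reward head plus a visual term like dino_visual_reward (and any task-specific metrics), and the reward head would be refreshed periodically with adversarial_reward_step so the score keeps tracking the current generator distribution.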