

Leveraging Verifier-Based Reinforcement Learning in Image Editing

April 30, 2026
Authors: Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang
cs.AI

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
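
The decompose-check-aggregate loop that the abstract describes for Edit-RRM can be made concrete with a minimal sketch. Everything here is illustrative: the names (`PrincipleCheck`, `aggregate_reward`) and the weighted-average rule are assumptions, since the abstract states only that per-principle checks are aggregated into an interpretable, fine-grained reward.

```python
from dataclasses import dataclass

@dataclass
class PrincipleCheck:
    """One verifiable requirement extracted from the edit instruction."""
    principle: str    # e.g. "the hat has been removed"
    satisfied: bool   # verdict from the RRM's per-principle CoT check
    weight: float = 1.0

def aggregate_reward(checks: list[PrincipleCheck]) -> float:
    """Fold per-principle verdicts into one scalar reward in [0, 1].

    Weighted-average aggregation is an assumption; the abstract only
    says the checks are aggregated into a fine-grained reward.
    """
    total = sum(c.weight for c in checks)
    if total == 0:
        return 0.0
    return sum(c.weight * float(c.satisfied) for c in checks) / total

# Hypothetical instruction: "make the cat black and remove the hat",
# decomposed by the RRM into two independent principles.
checks = [
    PrincipleCheck("the cat's fur is now black", satisfied=True),
    PrincipleCheck("the hat has been removed", satisfied=False),
]
print(aggregate_reward(checks))  # 0.5 -> partial credit, per-check interpretable
```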
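
The abstract does not detail how GCPO converts pairwise human preferences into a training signal for a pointwise scorer. One plausible reading, sketched below with a hypothetical function name: reinforce sampled CoT scoring trajectories whose pointwise scores reproduce the human ranking of a preference pair.

```python
def gcpo_trajectory_reward(score_preferred: float,
                           score_rejected: float) -> float:
    """Hypothetical reading of GCPO's per-trajectory reward.

    The pointwise RRM scores both images of a human-labeled preference
    pair; a sampled scoring trajectory is reinforced only when its
    scores agree with the human ranking.
    """
    return 1.0 if score_preferred > score_rejected else 0.0

# Hypothetical usage: one trajectory scored the preferred edit 0.8
# and the rejected edit 0.4, so it matches the human label.
print(gcpo_trajectory_reward(0.8, 0.4))  # 1.0
```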
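
Finally, training the editing model with GRPO works even though the RRM is non-differentiable, because GRPO consumes only scalar rewards for a group of candidates sampled per prompt. Below is a minimal sketch of the standard group-relative advantage computation; the group size and reward values are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in GRPO: normalize each candidate's
    reward by the mean and std of its sampling group. Only scalar reward
    values enter the policy update, so the scorer may be a black box."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 4 candidate edits for one (image, instruction)
# pair, each scored by the Edit-RRM:
rewards = torch.tensor([0.25, 0.75, 1.0, 0.5])
print(grpo_advantages(rewards))  # above-average edits get positive advantage
```

This decoupling is what lets a powerful but non-differentiable verifier drive policy-gradient training of the editing model.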