Leveraging Verifier-Based Reinforcement Learning in Image Editing
April 30, 2026
Authors: Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang
cs.AI
Abstract
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust, general reward model that covers all editing tasks. Existing editing reward models typically produce a single overall score without fine-grained checks, ignoring the distinct requirements of different instructions and yielding biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM decomposes each instruction into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold start" to generate CoT reward trajectories. We then introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. With the RRM in place, we train editing models via GRPO using this non-differentiable yet powerful reward model. Extensive experiments demonstrate that Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, with a clear scaling trend: performance improves consistently from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models such as FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
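
To make the decompose-verify-aggregate flow concrete, here is a minimal sketch of a pointwise reasoning reward in the spirit described above. The names (`decompose`, `judge`, `PrincipleCheck`) and the simple mean aggregation are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PrincipleCheck:
    principle: str   # e.g. "the cat's fur is recolored to black"
    reasoning: str   # chain-of-thought verdict produced by the verifier
    score: float     # per-principle compliance in [0, 1]

def edit_rrm_reward(
    instruction: str,
    decompose: Callable[[str], List[str]],                  # instruction -> principles
    judge: Callable[[str, bytes, bytes], PrincipleCheck],   # principle + images -> check
    source_image: bytes,
    edited_image: bytes,
) -> float:
    """Pointwise reasoning reward: verify each principle, then aggregate."""
    principles = decompose(instruction)
    checks = [judge(p, source_image, edited_image) for p in principles]
    # Mean aggregation is an assumption; the paper may weight principles differently.
    return sum(c.score for c in checks) / max(len(checks), 1)
```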
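
The abstract states only that GCPO uses human pairwise preferences to reinforce the pointwise RRM; the exact objective is not given here. One plausible reading, sketched under the assumption that a group of G CoT scoring trajectories is sampled per preference pair and each trajectory earns a binary reward for ranking the preferred image above the rejected one, normalized group-relatively:

```python
import torch

def gcpo_advantages(
    scores_chosen: torch.Tensor,    # [G] pointwise RRM scores for the preferred image
    scores_rejected: torch.Tensor,  # [G] scores for the rejected image, same G trajectories
    margin: float = 0.0,            # optional margin separating the pair (assumption)
) -> torch.Tensor:
    """Group-contrastive advantages: a sampled scoring trajectory is rewarded
    when its pointwise scores agree with the human pairwise preference."""
    rewards = (scores_chosen - scores_rejected > margin).float()  # [G] binary ranking reward
    # Group-normalized advantage, in the style of GRPO.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```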
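
Finally, a simplified sketch of why a non-differentiable reward suffices for GRPO training of the editor: the frozen RRM only needs to emit scalar scores for a group of sampled edits, and advantages are computed group-relative, so no gradients flow through the judge. `policy.sample` and `policy.log_prob` are placeholder interfaces, and the PPO-style clipped-ratio objective of full GRPO is reduced here to a plain REINFORCE surrogate for brevity:

```python
import torch

def grpo_step(policy, rrm_reward, instruction, source_image, group_size=8):
    """One simplified GRPO update against a frozen, non-differentiable Edit-RRM."""
    # Sample a group of candidate edits from the current editing policy.
    samples = [policy.sample(instruction, source_image) for _ in range(group_size)]
    # Score each edit with the RRM; only scalar rewards are needed.
    rewards = torch.tensor(
        [rrm_reward(instruction, source_image, s.image) for s in samples]
    )
    # Group-relative advantage normalization.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    logps = torch.stack([policy.log_prob(s) for s in samples])
    loss = -(adv.detach() * logps).mean()  # REINFORCE surrogate; clipping omitted
    loss.backward()
```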