EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
September 28, 2025
Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
cs.AI
Abstract
Instruction-guided image editing has achieved remarkable progress, yet
current models still face challenges with complex instructions and often
require multiple samples to produce a desired result. Reinforcement Learning
(RL) offers a promising solution, but its adoption in image editing has been
severely hindered by the lack of a high-fidelity, efficient reward signal. In
this work, we present a comprehensive methodology to overcome this barrier,
centered on the development of a state-of-the-art, specialized reward model. We
first introduce EditReward-Bench, a comprehensive benchmark to systematically
evaluate reward models on editing quality. Building on this benchmark, we
develop EditScore, a series of reward models (7B-72B) for evaluating the
quality of instruction-guided image editing. Through meticulous data curation
and filtering, EditScore effectively matches the performance of leading
proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy
tailored for the generative nature of EditScore, our largest variant even
surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity
reward model is the key to unlocking online RL for image editing. Our
experiments show that, while even the largest open-source VLMs fail to provide
an effective learning signal, EditScore enables efficient and robust policy
optimization. Applying our framework to a strong base model, OmniGen2, results
in a final model that shows a substantial and consistent performance uplift.
Overall, this work provides the first systematic path from benchmarking to
reward modeling to RL training in image editing, showing that a high-fidelity,
domain-specialized reward model is the key to unlocking the full potential of
RL in this domain.
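As a rough illustration of the self-ensemble idea mentioned in the abstract, the sketch below averages several stochastic scores sampled from a generative reward model to obtain a lower-variance reward. The `model.generate` interface, the prompt wording, the score parsing, and the function names (`score_edit`, `ensemble_score`) are all assumptions made for illustration; they are not the paper's actual implementation.

```python
# Minimal sketch of self-ensembling a generative reward model's scores.
# The reward-model interface below is hypothetical, not EditScore's real API.
from statistics import mean


def score_edit(model, instruction, source_image, edited_image, temperature=0.7):
    """Sample one numeric quality score from a generative VLM reward model."""
    prompt = (
        "Rate how faithfully the edited image follows the instruction "
        f"'{instruction}' on a scale from 0 to 10. Reply with a number only."
    )
    # Assumed interface: the model takes a text prompt plus images and
    # returns a text completion containing the score.
    reply = model.generate(
        prompt, images=[source_image, edited_image], temperature=temperature
    )
    return float(reply.strip())


def ensemble_score(model, instruction, source_image, edited_image, k=8):
    """Self-ensemble: average k stochastic samples to reduce score variance."""
    scores = [
        score_edit(model, instruction, source_image, edited_image)
        for _ in range(k)
    ]
    return mean(scores)
```

In an online RL loop, such an averaged score could serve as the per-sample reward used to rank or weight candidate edits from the policy model; the exact policy-optimization recipe used in the paper is not specified here.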