EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
September 28, 2025
Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
cs.AI
Abstract
Instruction-guided image editing has achieved remarkable progress, yet
current models still face challenges with complex instructions and often
require multiple samples to produce a desired result. Reinforcement Learning
(RL) offers a promising solution, but its adoption in image editing has been
severely hindered by the lack of a high-fidelity, efficient reward signal. In
this work, we present a comprehensive methodology to overcome this barrier,
centered on the development of a state-of-the-art, specialized reward model. We
first introduce EditReward-Bench, a comprehensive benchmark to systematically
evaluate reward models on editing quality. Building on this benchmark, we
develop EditScore, a series of reward models (7B-72B) for evaluating the
quality of instruction-guided image editing. Through meticulous data curation
and filtering, EditScore effectively matches the performance of leading
proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy
tailored for the generative nature of EditScore, our largest variant even
surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity
reward model is the key to unlocking online RL for image editing. Our
experiments show that, while even the largest open-source VLMs fail to provide
an effective learning signal, EditScore enables efficient and robust policy
optimization. Applying our framework to a strong base model, OmniGen2, results
in a final model that shows a substantial and consistent performance uplift.
Overall, this work provides the first systematic path from benchmarking to
reward modeling to RL training in image editing, showing that a high-fidelity,
domain-specialized reward model is the key to unlocking the full potential of
RL in this domain.
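As a rough illustration of the self-ensemble idea mentioned in the abstract, the sketch below averages several stochastic scores sampled from a generative reward model to obtain a lower-variance reward. The `model.generate` interface, the prompt wording, the score parsing, and the function names (`score_edit`, `ensemble_score`) are all assumptions made for illustration; they are not the paper's actual implementation.

```python
# Minimal sketch of self-ensembling a generative reward model's scores.
# The reward-model interface below is hypothetical, not EditScore's real API.
from statistics import mean


def score_edit(model, instruction, source_image, edited_image, temperature=0.7):
    """Sample one numeric quality score from a generative VLM reward model."""
    prompt = (
        "Rate how faithfully the edited image follows the instruction "
        f"'{instruction}' on a scale from 0 to 10. Reply with a number only."
    )
    # Assumed interface: the model takes a text prompt plus images and
    # returns a text completion containing the score.
    reply = model.generate(
        prompt, images=[source_image, edited_image], temperature=temperature
    )
    return float(reply.strip())


def ensemble_score(model, instruction, source_image, edited_image, k=8):
    """Self-ensemble: average k stochastic samples to reduce score variance."""
    scores = [
        score_edit(model, instruction, source_image, edited_image)
        for _ in range(k)
    ]
    return mean(scores)
```

In an online RL loop, such an averaged score could serve as the per-sample reward used to rank or weight candidate edits from the policy model; the exact policy-optimization recipe used in the paper is not specified here.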