

EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

September 28, 2025
Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
cs.AI

Abstract

Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
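As a rough illustration of the self-ensemble idea mentioned in the abstract, the sketch below averages several stochastic scores from a generative reward model to obtain a more stable reward. The interface is assumed rather than taken from the paper: `score_once` is a hypothetical stand-in for one sampled judgment from an EditScore-like model, and aggregation by a simple mean is only one plausible choice.

```python
# Minimal sketch of a self-ensemble reward query (hypothetical interface).
# `score_once` represents a single stochastic pass of a generative reward
# model (e.g., sampling its textual judgment with temperature > 0); the
# actual EditScore API may differ.

from statistics import mean
from typing import Callable


def self_ensemble_score(
    score_once: Callable[[str, bytes, bytes], float],  # one stochastic reward sample
    instruction: str,
    source_image: bytes,
    edited_image: bytes,
    num_samples: int = 8,
) -> float:
    """Query the generative reward model several times and average the scores.

    Because a generative judge produces its score by sampling, repeated
    queries yield slightly different values; averaging reduces this variance
    and gives a steadier reward signal.
    """
    samples = [
        score_once(instruction, source_image, edited_image)
        for _ in range(num_samples)
    ]
    return mean(samples)
```

In an online RL setup of the kind the abstract describes, such an averaged score could serve as the per-sample reward used for policy optimization.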