
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

September 28, 2025
Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
cs.AI

Abstract

Instruction-guided image editing has achieved remarkable progress, yet current models still struggle with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark for systematically evaluating reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored to the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, yields a final model with a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
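The abstract mentions a self-ensemble strategy tailored to EditScore's generative nature and the use of its scores as a reward signal for online policy optimization. Below is a minimal sketch of how such pieces commonly fit together: repeated stochastic scoring is averaged to stabilize the reward, and rewards within a group of candidate edits are standardized into advantages (as in GRPO-style RL). The `score_once` callable and all names here are illustrative assumptions, not the released EditScore API, and the paper's actual procedure may differ.

```python
import statistics
from typing import Callable

def self_ensemble_score(
    score_once: Callable[[str, bytes, bytes], float],  # hypothetical reward-model query
    instruction: str,
    source_image: bytes,
    edited_image: bytes,
    num_samples: int = 8,
) -> float:
    """Average several stochastic score samples from a generative reward model.

    A generative VLM reward model produces its score by sampling text, so
    repeated queries return slightly different numbers; averaging them
    reduces variance and gives a steadier reward signal.
    """
    samples = [
        score_once(instruction, source_image, edited_image)
        for _ in range(num_samples)
    ]
    return statistics.mean(samples)

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within a group of candidate edits of one instruction.

    In GRPO-style online RL, several edits are sampled per instruction, each
    is scored by the reward model, and a sample's advantage is its reward
    normalized by the group's mean and standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / max(std, eps) for r in rewards]
```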