
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

September 28, 2025
Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
cs.AI

Abstract

Instruction-guided image editing has achieved remarkable progress, yet current models still struggle with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark for systematically evaluating reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored to the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, yields a final model with a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
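The abstract mentions a self-ensemble strategy tailored to EditScore's generative nature and the use of its scores as a reward signal for online policy optimization. Below is a minimal sketch of how such pieces commonly fit together: repeated stochastic scoring is averaged to stabilize the reward, and rewards within a group of candidate edits are standardized into advantages (as in GRPO-style RL). The `score_once` callable and all names here are illustrative assumptions, not the released EditScore API, and the paper's actual procedure may differ.

```python
import statistics
from typing import Callable

def self_ensemble_score(
    score_once: Callable[[str, bytes, bytes], float],  # hypothetical reward-model query
    instruction: str,
    source_image: bytes,
    edited_image: bytes,
    num_samples: int = 8,
) -> float:
    """Average several stochastic score samples from a generative reward model.

    A generative VLM reward model produces its score by sampling text, so
    repeated queries return slightly different numbers; averaging them
    reduces variance and gives a steadier reward signal.
    """
    samples = [
        score_once(instruction, source_image, edited_image)
        for _ in range(num_samples)
    ]
    return statistics.mean(samples)

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within a group of candidate edits of one instruction.

    In GRPO-style online RL, several edits are sampled per instruction, each
    is scored by the reward model, and a sample's advantage is its reward
    normalized by the group's mean and standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / max(std, eps) for r in rewards]
```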