EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
September 30, 2025
Authors: Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen
cs.AI
Abstract
Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models, such as GPT-Image-1, Seedream, and Google-Nano-Banana, have shown highly promising results. However, open-source models still lag behind, and the main bottleneck is the lack of a reliable reward model for scaling up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained on our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences on instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, and ImagenHub, as well as on our new EditReward-Bench, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset of the noisy ShareGPT-4o-Image dataset; training Step1X-Edit on this subset yields a significant improvement over training on the full set, demonstrating that EditReward can serve as a reward model for scaling up high-quality image editing training data. Its strong human alignment also suggests potential for advanced applications such as reinforcement learning-based post-training and test-time scaling of image editing models. EditReward and its training dataset will be released to help the community build more high-quality image editing training datasets.
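
The abstract states that EditReward is trained on more than 200K human preference pairs but does not name the training objective. As a point of reference, the sketch below shows the standard Bradley-Terry pairwise loss commonly used to train reward models from preference pairs; it is an illustrative assumption, not the paper's confirmed method.

```python
# Hedged sketch: the Bradley-Terry pairwise loss typically used for reward
# models trained on (chosen, rejected) preference pairs. This is an
# illustrative assumption; the abstract does not specify EditReward's objective.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    human-preferred and dispreferred edits of the same (image, instruction) pair.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy reward values for a batch of four preference pairs.
loss = bradley_terry_loss(torch.tensor([2.0, 1.2, 0.3, 0.9]),
                          torch.tensor([1.5, 1.4, -0.2, 0.1]))
print(loss.item())
```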
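
The two applications the abstract highlights, filtering a noisy synthetic dataset and selecting among candidate edits at test time, both reduce to ranking by a scalar reward. Below is a minimal, self-contained sketch of that workflow; the `EditExample` container and the `score` stub are hypothetical stand-ins, since the model and its API have not yet been released.

```python
# Hedged sketch of reward-based data curation and best-of-N selection.
# `EditExample` and `score` are hypothetical placeholders for EditReward's
# real interface, which the abstract does not specify.
from dataclasses import dataclass
from typing import List

@dataclass
class EditExample:
    instruction: str   # natural-language edit instruction
    source_path: str   # path to the input image
    edited_path: str   # path to the candidate edited image

def score(example: EditExample) -> float:
    """Stand-in for EditReward's scalar score (higher = better edit).
    Replace with a call to the released checkpoint once available."""
    return 0.0  # dummy value so the sketch runs end to end

def filter_top_fraction(data: List[EditExample], keep: float = 0.5) -> List[EditExample]:
    """Data curation: keep the highest-scoring fraction of a noisy synthetic
    dataset (e.g., ShareGPT-4o-Image) before fine-tuning an editor such as
    Step1X-Edit."""
    ranked = sorted(data, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

def best_of_n(candidates: List[EditExample]) -> EditExample:
    """Test-time scaling: generate N candidate edits for one instruction and
    return the one the reward model prefers."""
    return max(candidates, key=score)
```

The same scalar ranking would also underlie the reinforcement-learning-based post-training the abstract mentions, with the reward supervising policy updates instead of data selection.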