EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
September 30, 2025
Authors: Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen
cs.AI
Abstract
Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models, such as GPT-Image-1, Seedream, and Google-Nano-Banana, have shown highly promising results, but open-source models still lag behind. The main bottleneck is the lack of a reliable reward model for scaling up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained on our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, and ImagenHub, as well as on our newly introduced benchmark, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset; Step1X-Edit trained on this subset shows significant improvement over training on the full set. This demonstrates EditReward's ability to serve as a reward model for scaling up high-quality training data for image editing. Its strong alignment with human judgment also suggests potential for advanced applications such as reinforcement learning-based post-training and test-time scaling of image editing models. EditReward and its training dataset will be released to help the community build more high-quality image editing training datasets.
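The abstract describes scoring edits in the noisy ShareGPT-4o-Image dataset with EditReward and training Step1X-Edit only on the top-scoring subset. The sketch below illustrates that filtering step under stated assumptions: the names `EditExample`, `score_edit`, `filter_by_reward`, and the `keep_fraction` threshold are hypothetical, and the placeholder scorer stands in for the actual EditReward checkpoint, whose API has not been released.

```python
# Minimal sketch of reward-based data filtering, assuming a learned scorer
# such as EditReward is available. All names here are illustrative, not the
# paper's actual API.

from dataclasses import dataclass
from typing import List


@dataclass
class EditExample:
    instruction: str    # natural-language edit instruction
    source_image: str   # path to the input image
    edited_image: str   # path to the candidate edited image


def score_edit(example: EditExample) -> float:
    """Placeholder for a learned reward model such as EditReward.

    A real implementation would encode (instruction, source, edited output)
    with the reward model and return a scalar preference score; this stub
    only marks where that call would go.
    """
    return 0.0  # assumption: higher score = better instruction following and quality


def filter_by_reward(examples: List[EditExample],
                     keep_fraction: float = 0.3) -> List[EditExample]:
    """Keep the top-scoring fraction of examples and discard noisy edits."""
    scored = sorted(examples, key=score_edit, reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]
```

The retained subset would then be used as supervised training data for an editing model such as Step1X-Edit; the appropriate `keep_fraction` is a tuning choice, not a value reported in the abstract.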