OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that improves a model's generative capabilities across multiple tasks and evaluation criteria using only one reward model. We employ a single vision-language model (VLM) as the generative reward model. Because it can distinguish the winner from the loser for a given task and a given evaluation criterion, it can be applied effectively to multi-task generation models, particularly when the data are varied and the task objectives are diverse. We apply OneReward to mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, all of which use a binary mask to mark the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in their underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
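
The shared conditioning paradigm mentioned above, a binary mask marking the edit area, can be illustrated with a small sketch. This is not the authors' implementation: the helper names `fill_mask`, `extend_mask`, and `composite` are hypothetical, and plain nested Python lists stand in for single-channel image arrays. It only shows, under those assumptions, how an interior fill and a canvas extension can both reduce to "an image plus a 0/1 mask", with 1 meaning the region to be regenerated.

```python
# Illustrative sketch only; all function names here are hypothetical.

def fill_mask(height, width, box):
    """Mask for image fill: 1 inside box = (top, left, bottom, right), 0 elsewhere."""
    top, left, bottom, right = box
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)]
            for r in range(height)]


def extend_mask(height, width, pad):
    """Mask for image extend: the canvas grows by `pad` pixels on each side.

    The original image area is 0 (kept); the new border is 1 (to be generated).
    """
    return [[0 if pad <= r < pad + height and pad <= c < pad + width else 1
             for c in range(width + 2 * pad)]
            for r in range(height + 2 * pad)]


def composite(original, generated, mask):
    """Take generated pixels where mask == 1 and original pixels where mask == 0."""
    return [[g if m else o for o, g, m in zip(o_row, g_row, m_row)]
            for o_row, g_row, m_row in zip(original, generated, mask)]


original = [[10] * 4 for _ in range(4)]    # stand-in for the input image
generated = [[200] * 4 for _ in range(4)]  # stand-in for a model output
mask = fill_mask(4, 4, (1, 1, 3, 3))
result = composite(original, generated, mask)
```

In this sketch only the 2x2 block selected by the mask takes values from `generated`; every other pixel is copied from `original`. The same `composite` step would apply unchanged to a mask produced by `extend_mask`, which is the sense in which the sub-tasks differ in data but not in how they are conditioned.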
