OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that improves a model's generative capabilities across multiple tasks and evaluation criteria using only one reward model. OneReward employs a single vision-language model (VLM) as the generative reward model. Because this model can distinguish the winner from the loser for a given task and a given evaluation criterion, it can be applied effectively to multi-task generation models, particularly when the data are varied and the task objectives are diverse. We apply OneReward to mask-guided image generation, which can be further divided into sub-tasks such as image fill, image extend, object removal, and text rendering, each of which uses a binary mask to define the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in their underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results show that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
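
The idea of one judge serving every task and criterion can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that a single VLM distinguishes winner from loser for a given task and criterion. The `Comparison` record, the query wording, the criterion name `"structure"`, the 0.5 threshold, and the stub `toy_judge` are all assumptions made for illustration; a real system would pass image data to a VLM rather than string identifiers.

```python
from dataclasses import dataclass
from typing import Callable

# Sub-task names taken from the abstract.
TASKS = ("image fill", "image extend", "object removal", "text rendering")


@dataclass(frozen=True)
class Comparison:
    """One pairwise comparison (hypothetical record layout)."""
    task: str
    criterion: str  # criterion names are assumed; the abstract lists none
    image_a: str    # placeholder identifiers standing in for image data
    image_b: str


def build_query(c: Comparison) -> str:
    """Format a pairwise question that carries the task and the criterion,
    so one shared judge can serve every (task, criterion) combination."""
    if c.task not in TASKS:
        raise ValueError(f"unknown task: {c.task}")
    return (
        f"Task: {c.task}. Criterion: {c.criterion}. "
        f"Is image A ({c.image_a}) better than image B ({c.image_b})? "
        "Answer yes or no."
    )


def pick_winner(c: Comparison, judge: Callable[[str], float]) -> str:
    """Return the preferred image. `judge` maps a query to an assumed
    probability of 'yes'; image A wins when that probability is >= 0.5."""
    p_yes = judge(build_query(c))
    return c.image_a if p_yes >= 0.5 else c.image_b


def toy_judge(query: str) -> float:
    """Stub standing in for a VLM: favors A only for object removal."""
    return 0.9 if "object removal" in query else 0.2


removal = Comparison("object removal", "structure", "a.png", "b.png")
fill = Comparison("image fill", "structure", "a.png", "b.png")
print(pick_winner(removal, toy_judge))
print(pick_winner(fill, toy_judge))
```

The same `judge` callable handles both comparisons; only the task and criterion fields in the query differ. That is the property the abstract attributes to using one reward model in place of task-specific ones.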