OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances a model's generative capabilities across multiple tasks and evaluation criteria using only one reward model. OneReward employs a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and the loser for a given task and a given evaluation criterion. This makes it effective for multi-task generation models, particularly when the data are varied and the task objectives are diverse. We apply OneReward to mask-guided image generation, which can be further divided into sub-tasks such as image fill, image extend, object removal, and text rendering, each using a binary mask to mark the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in their underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
