OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that improves a model's generative capabilities across multiple tasks and evaluation criteria using only one reward model. A single vision-language model (VLM) serves as the generative reward model: given a task and an evaluation criterion, it distinguishes the winner from the loser. This lets the framework be applied effectively to multi-task generation models, particularly where data are varied and task objectives are diverse. We apply OneReward to mask-guided image generation, which divides into several sub-tasks such as image fill, image extend, object removal, and text rendering, each using a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in their underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
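The idea of one reward model judging a pair of outputs for a given task and criterion can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names Comparison, build_query, pick_winner, and stub_judge are hypothetical, the criterion string is made up, and the stub judge is a toy rule standing in for a real VLM.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Comparison:
    """One pairwise comparison (hypothetical structure, for illustration)."""
    task: str         # e.g. "image fill" or "object removal"
    criterion: str    # evaluation dimension; the names used here are made up
    candidate_a: str  # stand-in for an image, e.g. a file path
    candidate_b: str


def build_query(c: Comparison) -> str:
    """Fold the task and criterion into one text query for a single judge."""
    return (
        f"Task: {c.task}. Criterion: {c.criterion}. "
        "Which candidate is better, A or B?"
    )


# A judge maps (query, image_a, image_b) to the answer "A" or "B".
Judge = Callable[[str, str, str], str]


def pick_winner(judge: Judge, c: Comparison) -> Tuple[str, str]:
    """Return (winner, loser) according to the judge's answer."""
    answer = judge(build_query(c), c.candidate_a, c.candidate_b)
    if answer == "A":
        return c.candidate_a, c.candidate_b
    if answer == "B":
        return c.candidate_b, c.candidate_a
    raise ValueError(f"unexpected judge answer: {answer!r}")


def stub_judge(query: str, image_a: str, image_b: str) -> str:
    """Toy stand-in for a VLM: prefers the longer string. NOT a real model."""
    return "A" if len(image_a) >= len(image_b) else "B"
```

Because the task and criterion travel inside the query, the same judge can be reused across all sub-tasks; for example, pick_winner(stub_judge, Comparison("object removal", "aesthetics", "aa.png", "b.png")) returns the pair ordered as (winner, loser).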
