Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

Abstract


Aligning generated images with complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). As reward-enhanced diffusion distillation emerges as a promising approach for boosting the controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation, while the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate this hypothesis, we introduce R0, a novel conditional generation approach based on regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 treats image generation as an optimization problem in data space that searches for valid images with high compositional rewards. With innovative designs of the generator parameterization and proper regularization techniques, we use R0 to train state-of-the-art few-step text-to-image generative models at scale. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at https://github.com/Luo-Yihong/R0.
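The abstract frames generation as regularized reward maximization in data space. The toy sketch below illustrates only that general idea, under loudly stated assumptions: a "sample" is a 2-D vector, each reward is a hypothetical quadratic r_i(x) = -||x - t_i||^2 with weight w_i, and the regularizer is an L2 pull of strength lam toward an anchor point that stands in for a "valid image". None of these names, rewards, or choices come from the R0 paper or its repository, which trains few-step text-to-image generators with real reward models.

```python
"""Toy sketch of regularized reward maximization in data space.

Hypothetical setup (NOT the R0 implementation): rewards are negative squared
distances to target points, and the regularizer penalizes distance from an
anchor point standing in for a valid image.
"""
from typing import List, Sequence


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def objective(x, targets, weights, anchor, lam) -> float:
    """J(x) = sum_i w_i * r_i(x) - lam * ||x - anchor||^2, with r_i(x) = -||x - t_i||^2."""
    reward = sum(w * -sq_dist(x, t) for w, t in zip(weights, targets))
    return reward - lam * sq_dist(x, anchor)


def gradient(x, targets, weights, anchor, lam) -> List[float]:
    """Analytic gradient of J with respect to x."""
    grad = [0.0] * len(x)
    for w, t in zip(weights, targets):
        for k in range(len(x)):
            grad[k] += -2.0 * w * (x[k] - t[k])
    for k in range(len(x)):
        grad[k] += -2.0 * lam * (x[k] - anchor[k])
    return grad


def maximize(x0, targets, weights, anchor, lam, lr=0.05, steps=500) -> List[float]:
    """Plain gradient ascent on J starting from x0."""
    x = list(x0)
    for _ in range(steps):
        g = gradient(x, targets, weights, anchor, lam)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    targets = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical reward optima
    weights = [1.0, 1.0]
    anchor = [0.0, 0.0]                 # stand-in for a "valid image"
    # With lam = 1.0 the result is approximately (1/3, 1/3).
    print(maximize([2.0, -2.0], targets, weights, anchor, lam=1.0))
```

Because every term here is quadratic, the optimum has the closed form x* = (Σ w_i t_i + lam · anchor) / (Σ w_i + lam). This makes the trade-off explicit: raising lam pulls the result toward the anchor, while lam = 0 lets the rewards alone determine the output.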
