TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Abstract

Few-step generative models have enabled powerful image and video generation at significantly lower cost, but a general-purpose reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, which excludes the majority of important real-world reward signals, namely non-differentiable rewards such as binary human preferences and object counts. To incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel RL paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods for obtaining per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves the ability of few-step models to adapt to generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful RL paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
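The decoupling described in the abstract can be illustrated with a one-dimensional toy. The sketch below is not the TDM-R1 algorithm and does not involve TDM or any diffusion model. It only shows, under simplifying assumptions, why a binary reward gives no gradient to a deterministic generator and how a separately fitted smooth surrogate can supply one. All names (`binary_like`, `fit_surrogate`, `train_generator`, `run_demo`), the threshold of 3.0, and the logistic form of the surrogate are hypothetical choices made for illustration.

```python
import math


def binary_like(x: float, threshold: float = 3.0) -> float:
    """Black-box reward (assumed for illustration): 1.0 if liked, else 0.0."""
    return 1.0 if x >= threshold else 0.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def finite_difference(f, x: float, eps: float = 1e-3) -> float:
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def fit_surrogate(samples, labels, lr: float = 0.1, steps: int = 2000):
    """Stage 1: fit sigmoid(w * x + b) to the binary labels.

    Plain gradient ascent on the mean log-likelihood (logistic regression).
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = y - sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


def train_generator(w: float, b: float, theta: float = 0.0,
                    lr: float = 0.1, steps: int = 500) -> float:
    """Stage 2: the toy 'generator' outputs x = theta deterministically.

    theta follows the gradient of log sigmoid(w * theta + b), which is
    w * (1 - sigmoid(w * theta + b)). The black-box reward is never
    differentiated.
    """
    for _ in range(steps):
        theta += lr * w * (1.0 - sigmoid(w * theta + b))
    return theta


def run_demo():
    # Fixed grid on [-2, 8] so the run is deterministic.
    samples = [i * 0.25 for i in range(-8, 33)]
    labels = [binary_like(x) for x in samples]  # only reward *values* are used
    w, b = fit_surrogate(samples, labels)
    theta = train_generator(w, b)
    return w, b, theta


if __name__ == "__main__":
    print("black-box reward slope at 0:", finite_difference(binary_like, 0.0))
    w, b, theta = run_demo()
    print("surrogate parameters:", w, b)
    print("final generator output:", theta, "reward:", binary_like(theta))
```

In this toy, the finite-difference slope of the step reward is zero away from the threshold, so direct back-propagation has nothing to follow, while the fitted surrogate has a nonzero slope everywhere. The per-step reward signals along a multi-step trajectory that the abstract mentions are outside the scope of this sketch.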
