Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Abstract

Recent studies have shown that diffusion models can be aligned directly with human preferences using differentiable rewards. These methods face two main challenges: (1) reward scoring relies on multistep denoising with gradient computation, which is computationally expensive and restricts optimization to only a few diffusion steps; (2) achieving a desired aesthetic quality, such as photorealism or precise lighting effects, often requires continuous offline adaptation of the reward model. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior so that the original image can be recovered from any timestep by interpolation. It leverages the fact that diffusion states are interpolations between noise and target images, and it effectively avoids over-optimization at late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This formulation enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
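
The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes a linear schedule x_t = (1 - sigma) * x_0 + sigma * noise, which the abstract does not specify, and it replaces the reward model with a toy cosine similarity between embeddings. The function names `interpolate`, `recover_image`, and `relative_reward` are hypothetical. The first part shows that when the injected noise is known in advance, the clean image follows from any intermediate state by simple algebra. The second part shows a reward that depends on the text, scored as the difference between a positive and a negative prompt.

```python
import numpy as np


def interpolate(x0, noise, sigma):
    """Noisy state under the ASSUMED linear schedule (not stated in the abstract)."""
    return (1.0 - sigma) * x0 + sigma * noise


def recover_image(x_t, noise, sigma):
    """Invert the interpolation when the noise is known; requires sigma < 1."""
    return (x_t - sigma * noise) / (1.0 - sigma)


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Toy text-conditioned reward: positive-prompt score minus negative-prompt score.

    A real reward model would replace the cosine similarity used here.
    """
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))      # stand-in for a clean image or latent
noise = rng.standard_normal((4, 4))   # predefined noise, kept for the recovery step

for sigma in (0.1, 0.5, 0.9):
    x_t = interpolate(x0, noise, sigma)
    x0_hat = recover_image(x_t, noise, sigma)
    print(sigma, np.allclose(x0_hat, x0))

# Hypothetical 2-D embeddings: the image matches the positive prompt exactly
# and is orthogonal to the negative one.
image_emb = np.array([1.0, 0.0])
print(relative_reward(image_emb, np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```

The recovery is exact here only because the same noise tensor is reused. In training, the recovered image would depend on the model's prediction, and the abstract does not describe how the two are combined.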