Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

Abstract

Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) to aligning text-to-image diffusion models with human preferences. Most existing DPO methods assume that all diffusion steps share a consistent preference order with the final generated images. We argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step's contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampling ensures that the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and in enhancing aesthetics, while also being more than 20x more efficient to train. Code and model: https://rockeycoss.github.io/spo.github.io/
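The step-wise resampler described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`stepwise_resample`, `toy_denoise_step`, `toy_preference_score`) are hypothetical, latents are stood in for by plain floats, and choosing the highest- and lowest-scored candidates as the win-lose pair is an assumption, since the abstract only says that a "suitable" pair is found.

```python
import random


def stepwise_resample(x_start, num_steps, pool_size, denoise_step,
                      preference_score, rng):
    """Collect one (t, x_t, win, lose) tuple per denoising step.

    Hypothetical sketch: `denoise_step` and `preference_score` stand in for
    the diffusion model and the step-aware preference model.
    """
    pairs = []
    x_t = x_start
    for t in range(num_steps, 0, -1):
        # Sample a pool of candidates that all start from the same x_t.
        pool = [denoise_step(x_t, t, rng) for _ in range(pool_size)]
        # Assumption: take the best- and worst-scored candidates as the pair.
        ranked = sorted(pool, key=lambda x: preference_score(x, t))
        lose, win = ranked[0], ranked[-1]
        pairs.append((t, x_t, win, lose))
        # Initialize the next step from a random pool member, so the next
        # win-lose pair shares one starting image regardless of this ranking.
        x_t = rng.choice(pool)
    return pairs


def toy_denoise_step(x, t, rng):
    # Toy stand-in for one stochastic denoising step on a scalar "latent".
    return 0.9 * x + rng.gauss(0.0, 0.1)


def toy_preference_score(x, t):
    # Toy stand-in for a step-aware preference model: prefer values near 0.
    return -abs(x)


rng = random.Random(0)
pairs = stepwise_resample(1.0, num_steps=5, pool_size=4,
                          denoise_step=toy_denoise_step,
                          preference_score=toy_preference_score, rng=rng)
```

Each tuple in `pairs` holds the timestep, the shared starting point, and the preferred and dispreferred candidates for that step. In this sketch the next step starts from a randomly chosen pool member rather than the winner, which mirrors the abstract's point that the comparison at each step should not depend on the outcome of the previous one.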
