Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

Abstract

Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) with human preferences to aligning text-to-image diffusion models with them. Most existing DPO methods assume that all diffusion steps share a consistent preference order with the final generated images. We argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step's contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampler process ensures that the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and in enhancing aesthetics, while also training more than 20x faster. Code and model: https://rockeycoss.github.io/spo.github.io/
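The step-wise resampler described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `stepwise_resample`, `toy_denoise_step`, and `toy_preference_score` are hypothetical, latents are plain lists of floats, and the "suitable" win-lose pair is assumed here to be the highest- and lowest-scoring pool members, which the abstract does not specify.

```python
import random
from typing import Callable, List, Tuple

State = List[float]  # stand-in for a noisy latent (assumption for illustration)


def stepwise_resample(
    x_T: State,
    num_steps: int,
    pool_size: int,
    denoise_step: Callable[[State, int, random.Random], State],
    preference_score: Callable[[State, int], float],
    rng: random.Random,
) -> List[Tuple[int, State, State, State]]:
    """Collect one (t, parent, win, lose) tuple per denoising step."""
    pairs = []
    x_t = x_T
    for t in range(num_steps, 0, -1):
        # Sample a pool of candidates that all start from the same parent x_t.
        pool = [denoise_step(x_t, t, rng) for _ in range(pool_size)]
        # Score each candidate with the (hypothetical) step-aware preference model.
        scores = [preference_score(c, t) for c in pool]
        # Assumption: take the best- and worst-scoring candidates as the pair.
        win = pool[max(range(pool_size), key=scores.__getitem__)]
        lose = pool[min(range(pool_size), key=scores.__getitem__)]
        pairs.append((t, x_t, win, lose))
        # Resample: a random pool member (not necessarily the winner)
        # initializes the next step, so the next pair shares one parent.
        x_t = rng.choice(pool)
    return pairs


def toy_denoise_step(x: State, t: int, rng: random.Random) -> State:
    """Toy stochastic step: shrink toward zero and add Gaussian noise."""
    return [0.9 * v + rng.gauss(0.0, 0.1) for v in x]


def toy_preference_score(x: State, t: int) -> float:
    """Toy preference: candidates closer to the origin score higher."""
    return -sum(v * v for v in x)


pairs = stepwise_resample(
    [1.0, -1.0], num_steps=5, pool_size=4,
    denoise_step=toy_denoise_step,
    preference_score=toy_preference_score,
    rng=random.Random(0),
)
```

Because both members of each pair are denoised from the same parent, any score gap between them reflects that single step only. In an actual training setup, the collected pairs would feed a preference loss at the corresponding timestep; that part is omitted here.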
