Diffusion Model Alignment Using Direct Preference Optimization

Abstract

Large language models (LLMs) are fine-tuned on human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to align them more closely with users' preferences. By contrast, human preference learning has not been widely explored for text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model on carefully curated high-quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method that aligns diffusion models to human preferences by optimizing directly on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF that directly optimizes a policy to best satisfy human preferences under a classification objective. We reformulate DPO to account for a diffusion-model notion of likelihood, using the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 with Diffusion-DPO. In human evaluation, our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 pipeline that adds a refinement model, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and performs comparably to training on human preferences, opening the door to scaling diffusion model alignment methods.
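The abstract does not spell out the objective itself, so the following is only a rough sketch of what a DPO-style classification loss over diffusion denoising errors might look like for a single preference pair. It assumes that the evidence lower bound lets each image's likelihood be approximated by a noise-prediction error, and that the trained model is compared against a frozen reference model. The function names, the argument layout, and the single scale `beta` (which here absorbs any timestep weighting) are hypothetical choices for illustration, not taken from the text.

```python
import math


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def diffusion_dpo_loss(noise_w, noise_l, pred_w, pred_l,
                       ref_pred_w, ref_pred_l, beta=1.0):
    """Hypothetical DPO-style loss for one (preferred, rejected) image pair.

    noise_w, noise_l: noise added to the preferred and rejected images.
    pred_*:           noise predicted by the model being fine-tuned.
    ref_pred_*:       noise predicted by the frozen reference model.
    beta:             assumed scale; stands in for any timestep weighting.
    """
    # How much worse (positive) or better (negative) the fine-tuned model
    # denoises each image compared with the reference model.
    diff_w = squared_error(noise_w, pred_w) - squared_error(noise_w, ref_pred_w)
    diff_l = squared_error(noise_l, pred_l) - squared_error(noise_l, ref_pred_l)

    # The logit is positive when the model improves more on the preferred
    # image than on the rejected one.
    logit = -beta * (diff_w - diff_l)

    # Binary classification loss: -log(sigmoid(logit)) == softplus(-logit).
    return softplus(-logit)
```

In this sketch, a model identical to the reference yields a loss of log 2, and the loss falls as the model denoises the preferred image better than the reference does while not improving equally on the rejected one.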
