DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Abstract

We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). DRAGON then gathers online, on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples are available at https://ml-dragon.github.io/web.
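To make two of the ideas in the abstract concrete, the sketch below shows a Fréchet distance between a set of generation embeddings and a set of exemplar embeddings, and a split of scored generations into a positive and a negative set. It is a minimal illustration, not the authors' implementation: the function names `frechet_distance` and `split_by_reward`, the Gaussian fit with an eigenvalue-based trace term, and the median threshold are all assumptions introduced here. The abstract does not say how DRAGON forms the two sets or how it uses their contrast during training.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (num_examples, embedding_dim).
    Hypothetical helper; the encoder that produces the embeddings is not shown.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(((mu_x - mu_y) ** 2).sum())
    return mean_term + float(np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


def split_by_reward(scores):
    """Split generation indices into positive and negative sets.

    Assumption: generations scoring above the batch median are treated as
    positive demonstrations, the rest as negatives.
    """
    scores = np.asarray(scores, dtype=float)
    threshold = np.median(scores)
    positive = np.flatnonzero(scores > threshold)
    negative = np.flatnonzero(scores <= threshold)
    return positive, negative


# Usage with synthetic embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
exemplars = rng.normal(size=(200, 4))
generations = exemplars + 0.5  # same covariance, shifted mean
distance = frechet_distance(generations, exemplars)
positive_idx, negative_idx = split_by_reward([0.1, 0.9, 0.5, 0.7])
```

In the usage above the two sets differ only by a constant shift, so the distance reduces to the squared shift summed over embedding dimensions. A lower distance to the exemplar set would correspond to a higher reward.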
