

DRAGON: Distributional Rewards Optimize Diffusion Generative Models

April 21, 2025
Authors: Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan
cs.AI

Abstract

We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon.github.io/web.
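The abstract describes building a reward from an encoder plus a set of reference examples (an exemplar distribution), then scoring online generations and splitting them into positive and negative demonstration sets whose contrast drives the reward upward. The sketch below illustrates that idea under stated assumptions: embeddings are assumed to be pre-computed by some encoder (e.g., a CLAP-like model), the distribution-level reward is a negative FAD-style Fréchet distance, and the leave-one-out scoring with a median split is an illustrative choice rather than the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(mu_g, cov_g, mu_r, cov_r):
    """Fréchet distance between two Gaussians fit to embedding sets
    (the statistic underlying FAD-style metrics)."""
    covmean = sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary parts
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))


def exemplar_reward(gen_embs, ref_embs):
    """Distribution-to-distribution reward: negative Fréchet distance
    between generated embeddings and the exemplar (reference) set."""
    mu_g, cov_g = gen_embs.mean(axis=0), np.cov(gen_embs, rowvar=False)
    mu_r, cov_r = ref_embs.mean(axis=0), np.cov(ref_embs, rowvar=False)
    return -frechet_distance(mu_g, cov_g, mu_r, cov_r)


def split_demonstrations(gen_embs, ref_embs):
    """Score each generation by its leave-one-out contribution to the
    batch-level reward, then split into positive / negative index sets.
    This attribution rule is an assumption for illustration only."""
    base = exemplar_reward(gen_embs, ref_embs)
    contribs = np.array([
        base - exemplar_reward(np.delete(gen_embs, i, axis=0), ref_embs)
        for i in range(len(gen_embs))
    ])
    median = np.median(contribs)
    positive = np.where(contribs >= median)[0]
    negative = np.where(contribs < median)[0]
    return positive, negative
```

In this sketch, generations whose presence raises the batch-level reward land in the positive set and the rest in the negative set; the actual DRAGON contrast-based update and its handling of instance-wise versus full-dataset rewards are detailed in the paper, not here.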
